Snapshots.md (5834B)
1 # Snapshots in Shared Games 2 3 Snapshots are a move-log compaction and replay acceleration mechanism. Instead 4 of reconstructing a game by replaying every move from an empty grid, a client can 5 start from a stored grid snapshot and replay only the moves not included in that 6 snapshot. 7 8 The current scalar `upToLamport` model is not safe for shared games. It assumes 9 there is one global move sequence: 10 11 ```text 12 Snapshot(upToLamport: 42) 13 ``` 14 15 Replay then skips every move whose Lamport is `<= 42`. That is only correct if 16 all devices assign Lamports from a single globally ordered stream. In a shared 17 or offline-capable game, two devices can independently create moves with the 18 same or overlapping Lamport values. A snapshot from device A at Lamport 42 may 19 not include device B's move at Lamport 41. If replay treats B's move as covered, 20 that move can disappear from the reconstructed grid. 21 22 ## Proposed Model 23 24 Use per-replica move clocks and snapshot coverage vectors. 25 26 Each move should identify the replica that produced it and that replica's local 27 sequence number: 28 29 ```text 30 Move(replicaID: "device-a", sequence: 12) 31 Move(replicaID: "device-b", sequence: 7) 32 ``` 33 34 A snapshot should store the grid plus the highest included sequence for each 35 replica: 36 37 ```text 38 Snapshot( 39 grid: ..., 40 coverage: [ 41 "device-a": 12, 42 "device-b": 7 43 ] 44 ) 45 ``` 46 47 Replay can then skip a move only when the selected snapshot explicitly covers 48 that move: 49 50 ```swift 51 let coveredSequence = snapshot.coverage[move.replicaID] ?? 0 52 let isCovered = move.sequence <= coveredSequence 53 ``` 54 55 This prevents one device's snapshot from accidentally hiding another device's 56 move. A move is skipped only if the snapshot proves that the move was folded 57 into the snapshot grid. 58 59 ## Replay and Conflict Ordering 60 61 Snapshot coverage answers one question: which moves are included in this 62 snapshot? 63 64 Conflict ordering answers a separate question: when multiple moves affect the 65 same square, which move wins? 66 67 Those should not be represented by the same scalar value. Replay should: 68 69 1. Choose a snapshot. 70 2. Start from the snapshot grid. 71 3. Apply only moves not covered by that snapshot. 72 4. Use a deterministic conflict ordering for same-cell writes. 73 74 The conflict ordering can be last-writer-wins with a stable tie-breaker, for 75 example `(createdAt, replicaID, sequence)` or another explicit ordering chosen 76 for product semantics. The important part is that coverage remains per-replica. 77 78 The shipped tie-break (`GridStateMerger.shouldReplace`) is 79 `(updatedAt, authorID, deviceID)`: per-cell last-writer-wins on the writer's 80 wall-clock `updatedAt`, then lex-min `authorID`, then lex-min `deviceID`. 81 82 ### Accepted trade-off: cross-device wall-clock skew 83 84 The merge, the completion seal (`merge(notAfter: completedAt)`), the pause-push 85 diff window, and presence all compare raw device wall clocks across users — 86 there is no logical clock and no server-time normalisation on apply. A device 87 whose clock runs fast therefore wins every cell conflict regardless of real 88 typing order, and its legitimate pre-win letters can be dropped by the 89 completion seal when their future-dated `updatedAt` reads as past the latch. 90 91 This is accepted as-is rather than mitigated, because the realistic incidence is 92 low. iOS defaults to "Set Automatically" (NTP / cellular network time), which 93 holds devices within sub-second of true time; quartz drift on a merely-offline 94 device is seconds per week. Sub-second skew almost never inverts a merge outcome, 95 since that requires two writers to the *same* cell within a real-time gap smaller 96 than the skew, and the losing letter in such a tie is usually plausible anyway — 97 it is a coin-flip on a shared cell, not silent destruction of unique work. The 98 one regime where skew actually bites is a user who has turned off automatic time 99 or set the clock manually (minutes to days off); that is self-inflicted 100 misconfiguration, and since `CKShare` participants are invited friends, a 101 malicious far-future clock is griefing by someone you chose to play with, not a 102 cheating-for-advantage vector — there is nothing zero-sum to win in a cooperative 103 solve. 104 105 If field reports ever surface a "my answer disappeared" trace to a wrong clock, 106 the cheap mitigation is to clamp each inbound cell's `updatedAt` to 107 `record.modificationDate + ε` on apply, bounding every writer's skew by 108 CloudKit's server clock (which all devices share). It is deliberately not built 109 speculatively: it adds a permanent "inbound timestamps are rewritten on apply" 110 behaviour to reason about, for an edge no user is known to hit. 111 112 ## Pruning 113 114 Pruning old moves is the risky part. With multiple replicas, snapshots can be 115 incomparable: 116 117 ```text 118 Snapshot A covers device-a: 20, device-b: 5 119 Snapshot B covers device-a: 10, device-b: 30 120 ``` 121 122 Neither snapshot fully replaces the other. If moves are deleted merely because 123 they are covered by some snapshot, a later replay from a different snapshot may 124 miss required moves. 125 126 The conservative implementation path is: 127 128 1. Add per-replica coverage to snapshots. 129 2. Use snapshots only as replay accelerators. 130 3. Do not prune moves initially. 131 4. Add pruning later only for moves covered by a retained canonical snapshot. 132 133 A newer snapshot can safely replace a canonical snapshot when its coverage 134 dominates the previous one: for every known replica, the newer snapshot covers at 135 least as high a sequence number. Until that rule exists and is tested, retaining 136 the move log is safer than compacting it incorrectly. 137 138 ## Practical Recommendation 139 140 For now, disable snapshot pruning for shared games. It is also reasonable to 141 disable shared-game snapshot creation entirely until the snapshot format records 142 per-replica coverage. The cost is a larger move log and slower replay for very 143 long games; the benefit is avoiding data loss or disappearing squares.