crossmate

A collaborative crossword app for iOS
Log | Files | Refs | LICENSE

Snapshots.md (5834B)


      1 # Snapshots in Shared Games
      2 
      3 Snapshots are a move-log compaction and replay acceleration mechanism. Instead
      4 of reconstructing a game by replaying every move from an empty grid, a client can
      5 start from a stored grid snapshot and replay only the moves not included in that
      6 snapshot.
      7 
      8 The current scalar `upToLamport` model is not safe for shared games. It assumes
      9 there is one global move sequence:
     10 
     11 ```text
     12 Snapshot(upToLamport: 42)
     13 ```
     14 
     15 Replay then skips every move whose Lamport is `<= 42`. That is only correct if
     16 all devices assign Lamports from a single globally ordered stream. In a shared
     17 or offline-capable game, two devices can independently create moves with the
     18 same or overlapping Lamport values. A snapshot from device A at Lamport 42 may
     19 not include device B's move at Lamport 41. If replay treats B's move as covered,
     20 that move can disappear from the reconstructed grid.
     21 
     22 ## Proposed Model
     23 
     24 Use per-replica move clocks and snapshot coverage vectors.
     25 
     26 Each move should identify the replica that produced it and that replica's local
     27 sequence number:
     28 
     29 ```text
     30 Move(replicaID: "device-a", sequence: 12)
     31 Move(replicaID: "device-b", sequence: 7)
     32 ```
     33 
     34 A snapshot should store the grid plus the highest included sequence for each
     35 replica:
     36 
     37 ```text
     38 Snapshot(
     39   grid: ...,
     40   coverage: [
     41     "device-a": 12,
     42     "device-b": 7
     43   ]
     44 )
     45 ```
     46 
     47 Replay can then skip a move only when the selected snapshot explicitly covers
     48 that move:
     49 
     50 ```swift
     51 let coveredSequence = snapshot.coverage[move.replicaID] ?? 0
     52 let isCovered = move.sequence <= coveredSequence
     53 ```
     54 
     55 This prevents one device's snapshot from accidentally hiding another device's
     56 move. A move is skipped only if the snapshot proves that the move was folded
     57 into the snapshot grid.
     58 
     59 ## Replay and Conflict Ordering
     60 
     61 Snapshot coverage answers one question: which moves are included in this
     62 snapshot?
     63 
     64 Conflict ordering answers a separate question: when multiple moves affect the
     65 same square, which move wins?
     66 
     67 Those should not be represented by the same scalar value. Replay should:
     68 
     69 1. Choose a snapshot.
     70 2. Start from the snapshot grid.
     71 3. Apply only moves not covered by that snapshot.
     72 4. Use a deterministic conflict ordering for same-cell writes.
     73 
     74 The conflict ordering can be last-writer-wins with a stable tie-breaker, for
     75 example `(createdAt, replicaID, sequence)` or another explicit ordering chosen
     76 for product semantics. The important part is that coverage remains per-replica.
     77 
     78 The shipped tie-break (`GridStateMerger.shouldReplace`) is
     79 `(updatedAt, authorID, deviceID)`: per-cell last-writer-wins on the writer's
     80 wall-clock `updatedAt`, then lex-min `authorID`, then lex-min `deviceID`.
     81 
     82 ### Accepted trade-off: cross-device wall-clock skew
     83 
     84 The merge, the completion seal (`merge(notAfter: completedAt)`), the pause-push
     85 diff window, and presence all compare raw device wall clocks across users —
     86 there is no logical clock and no server-time normalisation on apply. A device
     87 whose clock runs fast therefore wins every cell conflict regardless of real
     88 typing order, and its legitimate pre-win letters can be dropped by the
     89 completion seal when their future-dated `updatedAt` reads as past the latch.
     90 
     91 This is accepted as-is rather than mitigated, because the realistic incidence is
     92 low. iOS defaults to "Set Automatically" (NTP / cellular network time), which
     93 holds devices within sub-second of true time; quartz drift on a merely-offline
     94 device is seconds per week. Sub-second skew almost never inverts a merge outcome,
     95 since that requires two writers to the *same* cell within a real-time gap smaller
     96 than the skew, and the losing letter in such a tie is usually plausible anyway —
     97 it is a coin-flip on a shared cell, not silent destruction of unique work. The
     98 one regime where skew actually bites is a user who has turned off automatic time
     99 or set the clock manually (minutes to days off); that is self-inflicted
    100 misconfiguration, and since `CKShare` participants are invited friends, a
    101 malicious far-future clock is griefing by someone you chose to play with, not a
    102 cheating-for-advantage vector — there is nothing zero-sum to win in a cooperative
    103 solve.
    104 
    105 If field reports ever surface a "my answer disappeared" trace to a wrong clock,
    106 the cheap mitigation is to clamp each inbound cell's `updatedAt` to
    107 `record.modificationDate + ε` on apply, bounding every writer's skew by
    108 CloudKit's server clock (which all devices share). It is deliberately not built
    109 speculatively: it adds a permanent "inbound timestamps are rewritten on apply"
    110 behaviour to reason about, for an edge no user is known to hit.
    111 
    112 ## Pruning
    113 
    114 Pruning old moves is the risky part. With multiple replicas, snapshots can be
    115 incomparable:
    116 
    117 ```text
    118 Snapshot A covers device-a: 20, device-b: 5
    119 Snapshot B covers device-a: 10, device-b: 30
    120 ```
    121 
    122 Neither snapshot fully replaces the other. If moves are deleted merely because
    123 they are covered by some snapshot, a later replay from a different snapshot may
    124 miss required moves.
    125 
    126 The conservative implementation path is:
    127 
    128 1. Add per-replica coverage to snapshots.
    129 2. Use snapshots only as replay accelerators.
    130 3. Do not prune moves initially.
    131 4. Add pruning later only for moves covered by a retained canonical snapshot.
    132 
    133 A newer snapshot can safely replace a canonical snapshot when its coverage
    134 dominates the previous one: for every known replica, the newer snapshot covers at
    135 least as high a sequence number. Until that rule exists and is tested, retaining
    136 the move log is safer than compacting it incorrectly.
    137 
    138 ## Practical Recommendation
    139 
    140 For now, disable snapshot pruning for shared games. It is also reasonable to
    141 disable shared-game snapshot creation entirely until the snapshot format records
    142 per-replica coverage. The cost is a larger move log and slower replay for very
    143 long games; the benefit is avoiding data loss or disappearing squares.