crossmate

A collaborative crossword app for iOS
Log | Files | Refs | LICENSE

SyncRedesign.md (18082B)


      1 # Sync Redesign Plan
      2 
      3 ## Context
      4 
      5 On 2026-04-21, completing a puzzle on device surfaced two distinct failures:
      6 
      7 1. **Data loss**: after filling every cell and seeing the win screen, the puzzle list's preview was missing a square. Re-opening the puzzle confirmed a previously-filled cell had reverted to blank.
      8 2. **CloudKit throttling**: diagnostic logs showed the container hitting a device-level throttle (`x-cloudkit-error=THROTTLED`, `CKRetryAfter=22`, `ThrottleLabel=Device throttle: container=iCloud.net.inqk.crossmate.v2, type=210/211`) with five distinct `OperationID`s inside a ~2-second window.
      9 
     10 These are separate problems with a shared amplifier.
     11 
     12 ## Status (as of 2026-04-26)
     13 
     14 **Phases 1, 2, 3a, 3b+3c, and 4 are complete.** Only the conditional Phase 5 (WebRTC) remains, gated on measured live-latency being inadequate.
     15 
     16 ### Done
     17 
     18 - **Phase 1** — _was_: fetch-side LWW guard on `RecordSerializer.applyCellRecord`. Now deleted: Cell records no longer exist in the move-log model.
     19 - **Phase 2** — _was_: `PushDebouncer`, single-flight push, `CKRetryAfter` handling. Now deleted: replaced by `CKSyncEngine`'s built-in batching and retry.
     20 - **Phase 3a** — data model and replay plumbing: `MoveEntity`, `SnapshotEntity`, `GameEntity.lamportHighWater`, `MoveLog.swift`, `RecordSerializer` Move/Snapshot builders. Tests in `MoveLogTests.swift`.
     21 - **Phase 3b+3c** — full move-log + CKSyncEngine adoption (all 9 sub-tasks done):
     22   1. `MoveBuffer.swift` — coalescing actor, lamport assignment, `MoveEntity` persistence, debounce. Tests in `MoveBufferTests.swift`.
     23   2. `SyncEngine.swift` — rewritten around `CKSyncEngine`. Delegate handles incoming `Move`/`Snapshot` records, replays `CellEntity` cache, fires `onRemoteMoves`.
     24   3. `GameMutator.swift` — now emits `Move` to `MoveBuffer`; no direct `CellEntity` writes; `RemoteCellChange` / `Origin` / `applyRemoteCell` deleted.
     25   4. `GameStore.swift` — restores from move-log replay; `createGame` no longer seeds `CellEntity`; `repairSyncedGamesIfNeeded` / `applyRemoteChanges` deleted.
     26   5. `AppServices.swift` — constructs `MoveBuffer`, wires into `GameStore`, wires `onRemoteMoves` → `store.refreshCurrentGame()`.
     27   6. `CrossmateApp.swift` — no changes needed; `syncOnBackground()` already calls `moveBuffer.flush()`.
     28   7. `OutboxRecorder.swift`, `PushDebouncer.swift`, `PendingChangePayload.swift`, `PendingChange+Helpers.swift` deleted; `PendingChangeEntity` and obsolete token fields removed from data model.
     29   8. `PushDebouncerTests.swift`, `PendingChangeTests.swift` deleted; `ApplyCellRecordLWWTests` / `PendingChangePayloadTests` suites removed from `RecordSerializerTests.swift`; `GameMutatorTests.swift` pruned to in-memory mutation tests only.
     30   9. `MoveBufferTests.swift` — 7 tests all passing.
     31 
     32 - **Phase 4** — CKShare via per-game zones and dual sync engines, plus shared player identity:
     33   1. Per-game zones (`game-<UUID>`) with one `CKShare` rooted on the `Game` record; `Move`/`Snapshot`/`Player` records sit in the same zone via `parent` reference, so share inclusion is automatic.
     34   2. Dual sync engines: one driving `privateCloudDatabase` (owned games), one driving `sharedCloudDatabase` (joined games), parameterised through the same delegate path.
     35   3. `ShareController` for invitation creation; share-accept handler in `AppServices` accepts via `CKAcceptSharesOperation` and wires the new zone into the shared engine.
     36   4. `AuthorIdentity` stamps every `Move` with the local iCloud user record name, so multi-author replay attributes correctly.
     37   5. Share revocation: `markAccessRevoked(gameID:)` flips the local game read-only when the participant is removed; resolves the open question from earlier drafts.
     38   6. **Player identity** (deviated from the original `namesByAuthor`-on-Game JSON-blob plan in `.claude/plans/shared-player-identity.md`): names ride on per-(game, author) `Player` records (`player-<gameID>-<authorID>`) through the same engine. Each author owns their own record name → no write-write conflicts, no JSON merge primitive, no "must start from latest server record" hazard. `NameBroadcaster` fans out renames to every shared/joined game (debounced 250 ms); `PlayerRoster` resolves entries via the local `PlayerEntity` rows → `CKShare.Participant` name components → email → `"Player"` fallback chain. Colours are device-local in `GamePlayerColorStore`, with collision reassignment when the local user picks a remote colour.
     39   7. Snapshot pruning correctness fix landed during Phase 4: covered moves are now deleted only after `CKSyncEngine` confirms the snapshot save (`needsPruning` flag on `SnapshotEntity`), preventing data loss when an upload fails permanently.
     40 
     41 ### Design decisions already locked in conversation
     42 
     43 - **Lamport timestamps** — simple per-game monotonic counter stored in `GameEntity.lamportHighWater`. Bumped on local emit and on receiving a move with a higher lamport. No cross-participant total order; undo-across-participants stays out of scope.
     44 - **Snapshot cadence** — on game completion + opportunistic when the move log exceeds 200 moves. Starting point; tunable later.
     45 - **No migration** — user deletes existing games before launching the new version, so the new code can assume the move-log schema from first run.
     46 - **`CellEntity` stays** — as a replay-driven cache, not source of truth. Views that read `CellEntity` keep working.
     47 
     48 ### Gotchas for a fresh session
     49 
     50 - The Xcode project is generated by XcodeGen from `project.yml`. Run `xcodegen generate` after adding new `.swift` files or the build target won't see them.
     51 - The test target is named `Crossmate Unit Tests` (with spaces). Use `-only-testing:'Crossmate Unit Tests/<SuiteName>'`.
     52 - SourceKit diagnostics in-session regularly claim "Cannot find type X in scope" for Core Data generated classes and cross-file types. These are noise — trust `xcodebuild build` for ground truth.
     53 - Build target: `'platform=iOS Simulator,name=iPhone 17 Pro'`.
     54 - The user commits changes themselves — never run `git commit` unsolicited.
     55 
     56 ### Root cause: data loss
     57 
     58 `RecordSerializer.applyCellRecord` (`Crossmate/Sync/RecordSerializer.swift:129`) unconditionally overwrites the local `CellEntity`'s `letter`, `markKind`, `checkedWrong`, `updatedAt`, `letterAuthorID` with the server's values. There is no LWW check against local `updatedAt` and no check for an outstanding `PendingChangeEntity`. Only `handlePushError`'s `.serverRecordChanged` branch does LWW; the plain fetch path does not.
     59 
     60 Reproduction:
     61 1. User types final letter; local `letter="A"`, pending change queued.
     62 2. Push is throttled; pending change remains in the outbox.
     63 3. Interleaved fetch (log: `fetch: changedZoneIDs count=1`) pulls the server's stale copy of that cell (`letter=""`).
     64 4. `applyCellRecord` overwrites local `letter="A"` with `""`. UI renders blank.
     65 5. Pending change would eventually re-push `"A"`, but during the window the cell genuinely reads blank.
     66 
     67 ### Root cause: throttling
     68 
     69 The sync engine writes one `CKRecord` per cell and fires pushes aggressively:
     70 - No client-side debounce: every keystroke enqueues a `PendingChangeEntity`.
     71 - No serialisation: multiple `pushChanges()` invocations can produce overlapping `CKModifyRecordsOperation`s (hence the five distinct `OperationID`s in one second).
     72 - No `CKRetryAfter` handling: on 429, `handlePushError`'s default branch silently leaves the pending change in the outbox, and the next trigger immediately hits CloudKit again, tightening the device cooldown.
     73 
     74 ### Why data model is central
     75 
     76 Throttling is ultimately about request rate and payload volume, not record shape per se. But the per-cell-record shape makes debouncing and coalescing hard (each cell has its own identity and history), and it doesn't compose with any real-time side-channel — each keystroke is always a record write. A better-chosen data shape removes the hardest constraint before we start tuning knobs.
     77 
     78 ## Decision
     79 
     80 **Adopt a CloudKit + move-log + CKShare architecture, with WebRTC as deferred optional future work.**
     81 
     82 ### Data model: append-only move log
     83 
     84 The source of truth for a game's cell state becomes an append-only collection of `Move` records, each describing a single cell mutation:
     85 
     86 - `recordName`: `move-<gameUUID>-<lamportOrUUID>`
     87 - `parent`: reference to the parent `Game` record (so CKShare inclusion is automatic)
     88 - Fields: `row`, `col`, `letter`, `markKind`, `checkedWrong`, `authorID`, `createdAt`, and a monotonic ordering key (Lamport timestamp or similar).
     89 
     90 The grid displayed in the UI is the deterministic replay of the move log. `CellEntity` becomes a local cache populated by replaying moves into Core Data; it is not the source of truth.
     91 
     92 Periodic **snapshots** keep the move log bounded: every N moves (or on game completion), write a `Snapshot` record capturing the full grid state and delete moves older than the snapshot. New joiners load the latest snapshot plus the tail of moves since it.
     93 
     94 ### Why move log over per-cell-LWW or whole-game blob
     95 
     96 | Model | Throttling | Multiplayer correctness | Composes with real-time | Conflict code |
     97 |---|---|---|---|---|
     98 | Per-cell records (current) | Bad: per-keystroke write, hard to coalesce | Requires LWW on fetch (not present today) | Awkward: same keystroke would need to flow over two transports | Must implement on every client |
     99 | Whole-game blob with LWW | Good: coalesces trivially | **Broken**: Bob's blob without Alice's write wipes her cells | Awkward | N/A (wrong) |
    100 | Whole-game blob with per-cell LWW merge | Good | Works if merge is disciplined | Awkward | Non-trivial merge code on every client |
    101 | **Move log + snapshots** | **Good**: moves are small, append-only, debouncable | **Works**: each author writes only their own records, no `serverRecordChanged` races on shared cells | **Clean**: the same move unit ships over CloudKit or any side-channel | Deterministic replay; no merge code |
    102 
    103 The move log's key property for CKShare is **no write-write conflicts between participants** — Alice and Bob only ever create new records, each with a unique `recordName`. That removes the hardest correctness problem from the table.
    104 
    105 ### How coalescing works in the move log
    106 
    107 "Coalescing" in this design is really a debounced in-memory buffer that dedupes by cell before anything becomes a record. A keystroke does **not** become a `Move` record at the moment it happens — it sits in an in-memory buffer keyed by `(gameID, row, col)`. A new keystroke in the same cell replaces the pending entry's value. At flush time, each buffer entry becomes one `Move` record with a fresh Lamport timestamp, and all records from one flush go into a **single** `CKModifyRecordsOperation`.
    108 
    109 So if the user types `A` → `B` → `C` in the same cell inside the debounce window, only one `Move` record is written, with `letter = "C"`. The intermediate states never exist as records. Flush triggers are: the debounce timer (1–2s), cell-change (flush the old cell before the new one starts accumulating, so the log's causal order matches the user's editing order), app background, and game completion.
    110 
    111 This is safe specifically because moves are append-only and each author only writes their own records — local coalescing never has to reason about server state or another participant's writes, which is what breaks the equivalent coalescing in the per-cell-record model.
    112 
    113 ### Sync mechanics: CKShare for both async and sync
    114 
    115 - Each `Game` record is the root of a `CKShare`. Participants accept the share URL via `UICloudSharingController` and the OS's share-accept handler.
    116 - `Move` and `Snapshot` records set `parent = Game`, so they are part of the share automatically.
    117 - The engine drives both `privateCloudDatabase` (owner) and `sharedCloudDatabase` (participant) through the same code paths, parameterised by database handle.
    118 - `CKDatabaseSubscription` on both databases delivers silent pushes on change; the app wakes, fetches, replays moves into the local cache.
    119 - Async solving: partner's moves arrive on next fetch; 1–5s latency after push is fine.
    120 - Sync solving: same mechanism; feels like near-real-time at 1–3s in good conditions.
    121 
    122 ### Throttle-safety rules (apply regardless of model)
    123 
    124 These are properties the engine must have, independent of the data model choice:
    125 
    126 1. **Debounce writes.** Buffer move events in memory; flush at most every 1–2 seconds (or on cell change / app background).
    127 2. **Single in-flight push.** Serialise `pushChanges()` via an actor or explicit lock so overlapping triggers coalesce rather than fan out.
    128 3. **Respect `CKRetryAfter`.** On `CKError.serviceUnavailable` / `.requestRateLimited` / `.zoneBusy`, sleep for the returned delay before the next push attempt. Do not break out of drain on transient errors.
    129 4. **Adopt `CKSyncEngine`.** It handles batching, retry-after, and send-state tracking natively. Re-implementing these primitives has been the source of the current bugs. The minimum deployment target for this redesign is iOS 18 (or iOS 26), so `CKSyncEngine` — including its shared-database support — is available unconditionally.
    130 
    131 ### Deferred: WebRTC data channel
    132 
    133 If 1–3s live latency proves inadequate in practice, add a WebRTC data channel between live participants:
    134 
    135 - Signaling: small records written to the shared CloudKit zone (Alice writes SDP offer; Bob fetches via push, writes answer; ICE candidates exchanged the same way). Signaling latency is irrelevant since it's one-time per session.
    136 - STUN: public free server (`stun:stun.l.google.com:19302` or similar).
    137 - TURN: free tier (Cloudflare Calls, Xirsys) or paid (~$0.40/GB, Twilio) for the ~15–20% of sessions that fail direct P2P. Graceful fallback: if P2P fails, use the existing CloudKit move-log path.
    138 - Implementation: `WKWebView`-hosted WebRTC using WebKit's built-in `RTCPeerConnection` / `RTCDataChannel`. Avoids the ~20 MB `libwebrtc` bundle cost. Swift ↔ JS bridge via `WKScriptMessageHandler` + `evaluateJavaScript`. The app already ships `WKWebView` for NYT auth, so the framework cost is zero.
    139 
    140 This is **out of scope for the initial redesign.** CloudKit + move log + CKShare should be shipped and evaluated first. Add WebRTC only if measured live latency is genuinely unacceptable.
    141 
    142 ### Rejected alternatives
    143 
    144 - **GameKit** (`GKMatch` / `GKTurnBasedMatch`). Requires Game Center authentication, which adds user friction, fails for users who have dismissed the sign-in prompt, and introduces a second identity layer separate from iCloud. The transport advantages do not outweigh the user-facing costs.
    145 - **Custom backend** (Firebase, Supabase, Convex, Cloudflare Durable Objects, Ably). Violates the no-backend constraint. Kept in mind as a future option if pure-CloudKit proves inadequate at scale.
    146 - **SharePlay as the primary transport.** Requires an active FaceTime/Messages session, which is not guaranteed. Still a candidate as a future augmentation for call-based live sessions, but cannot replace the async path.
    147 
    148 ## Scope of the redesign
    149 
    150 ### In scope (initial implementation)
    151 
    152 - New `Move` and `Snapshot` record types, with serialisation and replay logic.
    153 - `CellEntity` becomes a derived cache populated by replay; not the source of truth.
    154 - No migration. The user is the sole participant and will manually delete existing games before launching the new version. The new codebase is free to assume the move-log schema from the first run.
    155 - `SyncEngine` rewritten around `CKSyncEngine`, driving both the private and shared databases.
    156 - Dual-database support: `privateCloudDatabase` for owned games, `sharedCloudDatabase` for joined games.
    157 - CKShare creation, invitation flow via `UICloudSharingController`, share-accept handler in the app delegate.
    158 - Per-record `authorID` from `CKUserIdentity` (iCloud identity), so moves attribute correctly in multiplayer.
    159 - Fix the fetch-side overwrite bug: `applyCellRecord` (or its replacement in the move-log model) must not clobber state without LWW.
    160 
    161 ### Out of scope (explicitly deferred)
    162 
    163 - WebRTC data channel and signaling.
    164 - SharePlay integration.
    165 - MultipeerConnectivity for co-located play.
    166 - Presence indicators (who is currently in the puzzle) and cursor tracking.
    167 - Operation-based undo across participants.
    168 - Conflict UX for the pathological "both type in the same cell in the same second" case — accept Lamport-ordered LWW silently for now.
    169 
    170 ## Open questions
    171 
    172 _All originally listed questions have been resolved:_
    173 - _Share revocation: resolved by `markAccessRevoked(gameID:)` — local replay-derived state stays readable, no new moves accepted._
    174 - _Snapshot cadence and move ordering: resolved (see "Design decisions already locked")._
    175 
    176 ## Phasing
    177 
    178 Proposed, subject to refinement:
    179 
    180 - **Phase 1: Fix the immediate data-loss bug in place.** Teach `applyCellRecord` to check for a pending outbox change or LWW `updatedAt` before overwriting. Keep the per-cell record model. Ships as a patch release.
    181 - **Phase 2: Throttle-safety.** Implement debounce, serialise pushes, respect `CKRetryAfter` on top of the existing hand-rolled `CKModifyRecordsOperation` engine. Still per-cell records. Ships as the next release; validates that pure-CloudKit with good hygiene is acceptable. The hand-rolled throttle code is understood to be short-lived — Phase 3 replaces it wholesale with `CKSyncEngine`.
    182 - **Phase 3: Move-log data model + `CKSyncEngine` adoption.** _Landed._ New `Move`/`Snapshot` records, `CellEntity` as cache, replay engine. Transport rewritten around `CKSyncEngine` so batching, retry-after, and send-state tracking come from the framework rather than hand-rolled primitives.
    183   - **3a.** _Landed._ Data model & replay plumbing.
    184   - **3b + 3c.** _Landed._ Live wiring + Phase 2 primitive deletion.
    185 - **Phase 4: CKShare.** _Landed._ Per-game zones, dual-engine sync (private + shared), invitation flow via `ShareController`, share-accept handler, per-author `Player` records for identity, share-revocation handling.
    186 - **Phase 5 (deferred, conditional):** WebRTC if measured live latency is inadequate. Not justified yet — needs real-world signal.