crossmate

A collaborative crossword app for iOS
Log | Files | Refs | LICENSE

commit aa10881dc18e29131a9aef0b79c0302d9f4b0c89
parent 96cacd24bd851bd39908c7d8aebabca08528df86
Author: Michael Camilleri <[email protected]>
Date:   Tue, 16 Jun 2026 12:23:59 +0900

Add development notes

Diffstat:
M.gitignore | 1-
ANotes/AppServices.md | 157+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
ANotes/Engagements.md | 89+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
ANotes/FriendColours.md | 90+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
ANotes/GameView.md | 194+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
ANotes/Journal.md | 189+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
ANotes/MacVersion.md | 126+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
ANotes/PresenceTracking.md | 92+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
ANotes/PushWorker.md | 214+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
ANotes/ReadHorizons.md | 165+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
ANotes/ServerSecurity.md | 237+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
ANotes/Snapshots.md | 143+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
ANotes/SyncRedesign.md | 186+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
13 files changed, 1882 insertions(+), 1 deletion(-)

diff --git a/.gitignore b/.gitignore @@ -10,6 +10,5 @@ CLAUDE.md cookies.txt Generated/ examples/ -notes/ private_keys/ symlinks/ diff --git a/Notes/AppServices.md b/Notes/AppServices.md @@ -0,0 +1,157 @@ +# AppServices Architecture Note + +I would not try to "remove the god object" by scattering construction +everywhere. `AppServices` is doing something valuable: it gives the app one +obvious ownership point for long-lived services, keeps SwiftUI environment setup +simple, and makes startup ordering explicit. Those are real advantages. + +The thing I would change is not the centralisation itself, but the number of +responsibilities inside that central class. + +Right now `AppServices` is both: + +1. The composition root: constructs `PersistenceController`, `GameStore`, + `SyncEngine`, `NYTAuthService`, etc. +2. The lifecycle coordinator: handles startup, foreground/background sync, + remote notifications, share acceptance, reset, and import URLs. +3. The wiring layer: connects `GameStore` callbacks to `SyncEngine`, connects + `SyncEngine` callbacks back to `GameStore`, installs tracing, and refreshes + identity. +4. A small service locator used by views. + +That is why it feels dense. The problem is less "many properties" and more +"many behavior loops in one file." + +## Recommendation + +Keep `AppServices` as the composition root. + +I would still have something like: + +```swift +@MainActor +final class AppServices { + let persistence: PersistenceController + let store: GameStore + let sync: SyncCoordinator + let sharing: SharingCoordinator + let imports: ImportCoordinator + let nytAuth: NYTAuthService + let nytFetcher: NYTPuzzleFetcher + let driveMonitor: DriveMonitor +} +``` + +So the app still has one root object. SwiftUI can still receive `services`. +Tests and previews still have a single place to swap dependencies later. +Startup is still discoverable. + +## Extract Orchestration By Domain + +The first extraction I would make is a `SyncCoordinator`. + +It would own the current sync lifecycle behavior: + +- `start(preferences:)` +- `syncOnForeground()` +- `syncOnBackground()` +- `handleRemoteNotification()` +- identity refresh +- `SyncMonitor` tracing +- `SyncEngine` callback wiring +- `MoveBuffer` construction/wiring, if you want to group move flushing with sync + +That would remove the densest closure wiring from `AppServices` while preserving +the existing behavior. + +Then `AppServices.start(...)` becomes mostly: + +```swift +func start(appDelegate: AppDelegate, preferences: PlayerPreferences) async { + guard !started else { return } + started = true + + nytAuth.loadStoredSession() + driveMonitor.start() + + appDelegate.onRemoteNotification = { await self.sync.handleRemoteNotification() } + appDelegate.onAcceptShare = { await self.sharing.acceptShare(metadata: $0) } + + await sync.start(preferences: preferences) +} +``` + +That is a good shape for a composition root: it says what starts, not how every +subsystem talks internally. + +## Extract CloudKit Share And Reset Operations + +`acceptShare(metadata:)`, `deleteAllPrivateZones()`, and `leaveAllSharedZones()` +are a coherent CloudKit-sharing/admin concern. I would put those in something +like `SharingCoordinator` or `CloudKitMaintenanceService`. + +That class would depend on: + +- `CKContainer` +- `SyncEngine` +- `SyncMonitor` +- maybe `GameStore` for reset + +This keeps the dangerous operations, especially reset and zone deletion, out of +the general app service bag. + +## Extract URL Import + +`handleOpenURL(_:)` is also separable. It coordinates: + +- file security scope +- XD file loading +- Drive import +- duplicate game lookup +- game creation + +That could become: + +```swift +@MainActor +final class GameImportService { + func importGame(from url: URL) -> UUID? +} +``` + +This is a nice extraction because it has a clear input/output and little +lifecycle complexity. + +## Avoid Over-Abstracting The Simple Services + +I would not introduce protocols everywhere yet. For example, I would not +immediately create `GameStoreProtocol`, `SyncEngineProtocol`, `DriveMonitoring`, +`PuzzleFetching`, etc. unless tests or previews need them. That would make the +code feel more "architected" without reducing the actual complexity. + +I would also avoid splitting into tiny factories like `PersistenceFactory`, +`SyncFactory`, or `AuthFactory`. The construction code is not the main problem. +The callback graph is. + +## Target Shape + +A reasonable end state would be: + +- `AppServices`: owns and exposes app-wide services; starts coordinators. +- `SyncCoordinator`: owns sync lifecycle, move buffering, sync callbacks, + identity, and monitoring. +- `SharingCoordinator`: accepts shares and maybe handles CloudKit zone + cleanup/reset support. +- `GameImportService`: handles `.xd` URLs and Drive import. +- `PlayerNamePublisher`: stays as-is. +- `GameStore`: still stores callback hooks, though over time those could become + a more explicit event sink. + +The important design rule: `AppServices` can know about everything, but +everything should not require new business logic inside `AppServices`. + +So I would treat the current class as acceptable but past the point where new +responsibilities should keep landing there. The next time you touch sync +lifecycle, share acceptance, reset, or file import, extract that domain then. I +would not do a big architecture-only rewrite unless you already have bugs around +startup ordering or sync callbacks. diff --git a/Notes/Engagements.md b/Notes/Engagements.md @@ -0,0 +1,89 @@ +# Engagements Design Note + +## Context + +CloudKit push delivers updates in roughly 1-3 seconds. That is fine for durable +game state, but it is too laggy for two live-play interactions: + +1. Collaborator cursor/selection movement. +2. In-progress letter taps before the authoritative `Move` records land. + +An **engagement** is Crossmate's opportunistic low-latency channel for those +live interactions. It is distinct from a play session and is never the source of +truth. CloudKit remains authoritative. + +## Architecture + +- **Transport:** native iOS WebSockets using `URLSessionWebSocketTask`. +- **Relay:** Cloudflare Worker + one Durable Object per room. +- **Bootstrap:** `.hail` pings in the shared game zone carry room credentials. +- **Payload:** existing `EngagementMessage` JSON envelopes: debug text, realtime + cell edits, and selection updates. +- **Fallback:** any socket failure silently returns to CloudKit-only behavior. + +## Hail Payload + +`.hail` now carries a v2 room bootstrap payload: + +```json +{ + "ver": 2, + "roomID": "<uuid>", + "secret": "<base64url 256-bit secret>", + "createdAt": "<date>", + "expiresAt": "<date>" +} +``` + +Only CloudKit share participants can read the hail. The secret is a room secret, +not a durable account credential. + +## Room Auth + +The client connects to: + +```text +/rooms/:roomID/socket +``` + +with: + +- `authorID` +- `deviceID` +- `timestamp` +- `nonce` +- `secret` +- `signature = HMAC-SHA256(secret, roomID|authorID|deviceID|timestamp|nonce)` + +The Durable Object verifies the timestamp, rejects nonce replay, verifies the +HMAC, and stores only a hash of the room secret. This first implementation sends +the room secret during the WSS upgrade so the object can verify the HMAC; if that +becomes uncomfortable, move auth into a short-lived token minting step. + +## Lifecycle + +1. Existing peer-presence logic notices another player in the puzzle. +2. Deterministic initiator creates `roomID + secret`, writes a directed `.hail`, + and connects. +3. Recipient receives the hail, validates addressee/expiry, connects, and + deletes the hail. +4. `EngagementCoordinator` marks the game live only after socket open. +5. Socket close clears transient engagement selection state and may retry via + the normal presence path. + +Coordinator states are intentionally small: + +- `idle` +- `connecting` +- `live` + +## Worker Duties + +- Authenticate socket upgrades. +- Bind one Durable Object to each room ID. +- Broadcast incoming messages to every other socket in the room. +- Keep nonce replay state with a short TTL. +- Expire rooms and close sockets after the room TTL. + +The Worker does not know puzzle contents, CloudKit tokens, or durable game +state. diff --git a/Notes/FriendColours.md b/Notes/FriendColours.md @@ -0,0 +1,90 @@ +# Plan: Manually set a colour for a friend + +## Status + +Not implemented. This document records the research for a future pass — +estimated at roughly a day of work, with no architectural blockers. + +## How friend colours work today + +A friend's colour is **derived, never stored**. The whole scheme rests on +`PlayerColor.stableColor(forAuthorID:reserved:)` (`Models/PlayerColor.swift`), +a pure function: FNV-1a hash of the `authorID` into the nine-colour palette, +then a linear probe forward to the first colour that is more than +`similarHueThreshold` (90°) of hue away from every reserved colour, degrading +to exact-match avoidance and finally to the raw hashed colour when the +palette is too crowded. + +Two consumers: + +1. **Avatars** — `FriendAvatar.avatar(for:)` (`Views/FriendAvatarView.swift`) + colours the friend-list/picker/share-sheet avatar with + `stableColor(forAuthorID:reserved: [])`. The symbol is hashed off a + distinct seed string (`"symbol-\(authorID)"`) so symbol and colour vary + independently. +2. **In-game roster** — `PlayerRoster.applyRoster` + (`Models/PlayerRoster.swift`) walks the participants in sorted-authorID + order, threading a running `taken` set seeded with the local user's + preferred colour (`PlayerPreferences.color`, synced across the user's + devices via iCloud KVS). The result is a pure function of + (participant set, preferred colour), both identical across the user's + devices — so every device derives the same colour for a friend **with no + persisted or synced mapping**, and a friend keeps their colour across + games unless a lower-sorted collaborator collides with them. + +That "no persisted colour state" invariant is the one real cost of the +feature: once overrides exist, roster output depends on local data plus sync +timing, so two devices can briefly disagree after a pin until the sync lands. + +## The pieces + +### Easy + +- **Storage.** Add an optional `colorID: String` attribute to `FriendEntity` + (`CrossmateModel.xcdatamodeld`). An added optional attribute is handled by + automatic lightweight migration. +- **Picker UI.** A colour submenu in the friend row's ellipsis menu in + `FriendsView`. `PlayerColor.palette` is already picker-ordered (sorted by + hue) and `PuzzleView` (~line 947) has an existing colour-picker menu to + crib from, including the checkmark-on-current convention. +- **Avatar override.** `FriendAvatar.avatar(for:)` takes an optional + override colour; the call sites that should honour it (`FriendsView`, + `FriendPickerView`, `GameShareItem`) all have the `FriendEntity` in hand. + +### Substantive + +- **Roster pinning.** `applyRoster` fetches overrides from Core Data and + treats them as *pinned*: seed `taken` with the local colour plus all pins, + assign pinned friends their colour directly, and let the existing probe + walk the unpinned friends around them. Design decisions to settle: + - An explicit pin should win even when it collides with the local user's + own colour or another friend's pin — the user did it on purpose. + - Two friends pinned to the same colour is the user's prerogative; don't + fight it. + - Pinning one friend deterministically bumps some auto-assigned friends to + new colours (their probe now starts from a different `taken` set). This + is inherent and fine, but worth knowing. +- **Cross-device sync.** Without sync, iPhone and iPad disagree about a + friend's colour. The precedent is the block flag: it syncs across the + user's own devices as a `Decision` record in the account zone + (`SyncEngine.enqueueDecision(kind: "block", key: authorID)`), projected + back onto `FriendEntity` by the `kind` switch in + `RecordSerializer.applyDecisionRecord`. A `"color"` kind is maybe 30 lines + following that template: + - `enqueueDecision` already accepts a `payload: String?` — carry the + `colorID` there. + - The Decision record name is kind+key, so a re-pin for the same friend + overwrites the prior decision — last write wins, which is the behaviour + we want. + - Clearing a pin (back to automatic) can ride + `enqueueDecisionDeletion(kind:key:)`, which also already exists. + +## Suggested order + +1. Model attribute + `applyDecisionRecord` `"color"` case (so the projection + exists before anything writes it). +2. Roster pinning in `applyRoster`, with `PlayerRosterTests` coverage for + the pin-wins and bump-the-others cases. +3. Avatar override plumbing. +4. Picker UI in `FriendsView`, writing the entity and enqueuing the + Decision. diff --git a/Notes/GameView.md b/Notes/GameView.md @@ -0,0 +1,194 @@ +# Per-keystroke hitches on populated databases + +## Symptom + +Solving a puzzle on iPhone 14 Pro Max produces visible hitches as letters +appear in the grid. The same build on a freshly-set-up iPhone 13 Mini (new +Apple Account, empty store) is smooth. Disabling iCloud sync in Settings does +not help. Disabling ProMotion (removing `CADisableMinimumFrameDurationOnPhone` +from `Info.plist`) does not help. Running "Reset Database" from the +diagnostics screen restores responsiveness immediately. + +The shared property of the three things that *did* affect responsiveness +(brand-new account, reset database) is **how much data the local Core Data +store contains**, not network or display behaviour. + +## Diagnosis + +Per-keystroke trace: + +1. The user types a letter. `GameMutator.setLetter` updates the in-memory + `Game` and calls `emitMove`, which schedules + `MoveBuffer.enqueue(...)` on the `MoveBuffer` actor. +2. If the new edit targets a different cell than the previous one, + `enqueue` triggers `performFlush()` for Lamport ordering. +3. `performFlush` calls `persistAndAssignLamports`, which on a background + context inserts a `MoveEntity`, updates the `GameEntity` + (`lamportHighWater`, `updatedAt`), updates a `CellEntity`, and saves. +4. `viewContext.automaticallyMergesChangesFromParent = true` + (`PersistenceController.swift:39`) merges the background save into the + main context. +5. `GameListView` declares + `@FetchRequest<GameEntity>(sortDescriptors: [], animation: .default)` + (`GameListView.swift:11-15`). The merge invalidates the fetch request, + which marks `GameListView` dirty. +6. SwiftUI recomputes `GameListView.body`. The body reads two computed + properties: + + ```swift + private var inProgress: [GameSummary] { + games.compactMap(GameSummary.init(entity:)) + .filter { $0.completedAt == nil } + .sorted { ... } + } + private var completed: [GameSummary] { + games.compactMap(GameSummary.init(entity:)) + .filter { $0.completedAt != nil } + .sorted { ... } + } + ``` + + Each calls `games.compactMap(GameSummary.init(entity:))` independently. +7. `GameSummary.init(entity:)` (`GameStore.swift:34-77`) calls + `XD.parse(source)`, constructs a full `Puzzle`, and walks every + `CellEntity` to build the thumbnail array. + +So per keystroke, on the main thread: + +``` +GameCount × 2 × (XD.parse + Puzzle init + thumbnail build) +``` + +With one game in the store the cost is invisible. With many games it is +proportional to library size and lands on every cell-change keystroke. + +## Why this matches every observation + +- **Empty DB after reset → fast.** The `@FetchRequest` returns nothing, so + the body has no per-game work to do. +- **iCloud disabled didn't help.** The work is local: a Core Data save in + the view context is enough to invalidate the fetch request. Network + state is irrelevant. +- **ProMotion disabled didn't help.** The cost is real and on the main + thread. Capping the refresh rate only changes whether the dropped frame + is visible, not whether the work happens. +- **iPhone 13 Mini (new account) was fast.** Same code, but the library was + empty, so the FetchRequest had nothing to iterate. +- **Got worse over time.** Every game added to the library multiplies the + per-keystroke cost. + +## Why `GameListView` is involved while `PuzzleView` is on screen + +`NavigationStack` does **not** unmount the root view when a destination is +pushed; it only covers it visually. The root remains in the view hierarchy, +its property wrappers stay live, and SwiftUI continues to invalidate and +re-evaluate its body whenever a tracked dependency changes. `@FetchRequest` +is a Core Data observer; it fires on context changes regardless of whether +the view that owns it is currently visible. SwiftUI does not skip body +evaluation for "covered" views, because visibility is a render-tree concept +while body evaluation is a data-graph concept — it has to compute the new +body to know whether anything visible changed (toolbar, title, navigation +state, transitions, etc.). + +`animation: .default` on the fetch request also wraps each invalidation in +an animation transaction, which can amplify the perceived hitch. + +## Secondary issue (not the root cause of the hitch) + +`MoveEntity` rows are only pruned after CloudKit confirms the snapshot save. +`AppServices.swift:71` wraps the entire `afterFlush` callback in +`guard isEnabled` — meaning that with iCloud sync disabled, snapshots are +never created and moves are never pruned. The move table grows unbounded. +This is not what is producing the per-keystroke hitch (the dominant cost is +the FetchRequest + `XD.parse`), but it is a real correctness/space concern +worth fixing separately so users running with sync off don't accumulate +forever. + +## Verification + +Drop a one-line probe at the top of `GameListView.body`: + +```swift +let _ = print("GameListView body") +``` + +Run a build against the populated database. If the print fires on every +cell-change keystroke, the diagnosis is correct. + +## Fixes + +### Recommended: gate `@FetchRequest` on visibility + +Stop doing the work entirely while the puzzle is on screen. The +`@FetchRequest` only observes Core Data because the view that owns it is +live. Removing the owning view from the hierarchy tears down the underlying +`NSFetchedResultsController`. `NavigationStack` fires `.onDisappear` on the +root when a destination is pushed, which is the signal we need. + +```swift +struct GameListView: View { + let store: GameStore + let shareController: ShareController + @Binding var navigationPath: NavigationPath + + @State private var isActive = true + + var body: some View { + Group { + if isActive { + GameListContent( + store: store, + shareController: shareController, + navigationPath: $navigationPath + ) + } else { + Color.clear + } + } + .onAppear { isActive = true } + .onDisappear { isActive = false } + } +} + +private struct GameListContent: View { + // current GameListView body, including: + @FetchRequest(sortDescriptors: [], animation: .default) + private var games: FetchedResults<GameEntity> + // ... +} +``` + +Push `PuzzleView` → `.onDisappear` → `isActive = false` → +`GameListContent` is removed → `@FetchRequest` is torn down → keystrokes no +longer have a SwiftUI subscriber to invalidate. Pop back → `.onAppear` → +`isActive = true` → fresh fetch, list reflects whatever changed. + +Caveats: + +1. `.onDisappear` also fires when a sheet covers the list (`NewGameSheet`, + `SettingsView`). That's fine — list is not visible then either — but + quick sheet dismissals will flap `isActive`. If that becomes a problem, + gate on debounce or on `navigationPath.count > 0` instead. +2. The first frame after popping back has to fetch and build summaries. + For a small-to-moderate library this should be imperceptible; if it + isn't, render a cached lightweight snapshot in the inactive branch. +3. This addresses the "while typing" case only. Opening the library on a + populated database still triggers the same `XD.parse`-per-game work + during scrolling and resizing. If that becomes a problem, the longer- + term fix is to cache derived summary fields (title, publisher, date, + gridWidth, gridHeight, blockMask) directly on `GameEntity` so the list + path never parses XD. + +### Alternative: cache derived puzzle data on `GameEntity` + +Pre-compute and store everything `GameSummary` needs (title, publisher, +date, width, height, block mask) on `GameEntity` at creation time. `XD.parse` +disappears from the hot path entirely. This is a bigger change (schema + +migration + initialisation) but fixes both the "while typing" case and the +"opening a populated library" case. + +### Cheap partial: deduplicate the iteration + +`inProgress` and `completed` both iterate `games.compactMap(...)`. Compute +the summaries once per body and partition. Halves the constant but stays +O(games) per keystroke; not a real fix. diff --git a/Notes/Journal.md b/Notes/Journal.md @@ -0,0 +1,189 @@ +# Move Journal — Design + +A local, append-only log of every grid move, plus undo/redo built on top of it. +Phase 1 is entirely local (Core Data only, no CloudKit). Phase 2a uploads each +device's journal at game completion so the whole game can be replayed; it is +**built**. Phase 2b (the replay viewer that merges all devices' journals) and +the undo completion-lockout are still deferred — see "Remaining work" below. + +## Goals + +1. **Undo / redo during play.** A player can walk back their own moves, with no + fixed depth limit. Undo applies a *forward* mutation through the normal + `GameMutator` path, so it syncs to collaborators like any other edit — it is + not a time machine. +2. **Replay after completion (Phase 2).** Merging every device's uploaded + journal by timestamp reconstructs the full game, including corrections. + +**Every grid-changing move is recorded** (so replay is faithful), but only a +subset is **undoable**: letter `input` (adds/single-cell deletes) and `clear` +(the bulk clear gesture). `check` and `reveal` are recorded for replay but never +offered as undo steps — they're help actions, not edits you rewind. + +## Why a new structure (not the Moves record) + +`MovesValue.cells` (`Sync/Moves.swift`) is a *compacted current-state* map — +`[GridPosition: TimestampedCell]`, only the latest touch per cell survives. It +carries no history, so neither undo nor replay can be reconstructed from it. +Snapshotting the whole Moves value per move is O(moves × cells) — tens of MB for +a single game — and is also collaboratively wrong (restoring a snapshot rewrites +every cell, clobbering peers). Instead we log **one entry per changed cell**, in +the Moves cell shape, and recover the "before" value via a back-pointer. + +## Data model — `JournalEntity` + +Local Core Data entity, cascade-deleted with its `GameEntity`. Append-only; +rows are never mutated or deleted in normal operation. + +| Field | Type | Purpose | +|---|---|---| +| `gameID` | UUID | game scope | +| `seq` | Int64 | monotonic per game; stable local ordering + undo walk | +| `timestamp` | Date | wall-clock; Phase 2 replay ordering + cross-device merge | +| `row`, `col` | Int16 | cell | +| `letter` | String | **after**-state letter | +| `markCode` | Int16 | after-state `CellMark` as one lossless code (see `CellMarkCodec.code`) | +| `cellAuthorID` | String? | preserved letter author (mirrors `emitMove`) | +| `actingAuthorID` | String? | iCloud user who made the move | +| `kind` | Int16 | 0 input / 1 check / 2 reveal / 3 clear / 4 undo / 5 redo (`input` + `clear` are undoable) | +| `targetSeq` | Int64? | for undo/redo, the entry it acts on | +| `batchID` | UUID? | groups one gesture (bulk check/reveal/clear, or one undo/redo op) into a single undo step | +| `prevSeqAtCell` | Int64? | seq of the previous entry at this `(row,col)`; `nil` = cell was empty before | + +There is **no stored before-state**. To undo entry `E`, follow `E.prevSeqAtCell` +to one prior entry and apply *its* after-state (or empty if `nil`). O(1) keyed +lookup, no scan; after-states stay the single source of truth and the pointer +can never go stale because the log is immutable. + +Adding the entity is an additive, automatic lightweight migration. Unlike the +synced Moves wire format (which flattens the mark into `markKind` + two check +bools), the journal stores the whole `CellMark` as one `markCode` — the +two-bool form is a wart we chose not to propagate into new local storage. The +shared `CellMarkCodec` now owns both encodings. + +## `MovesJournal` + +`@MainActor` class, a shared singleton beside `MovesUpdater`. Holds the current +game's entries in memory (loaded lazily from Core Data on first use, so undo +survives relaunch), persists each new entry on a background context (the +in-memory list is authoritative for the session, so async persistence is fine). + +- `record(...)` — assigns `seq` (= max+1), looks up & updates a + `[GridPosition: Int64]` last-seq map to set `prevSeqAtCell`, appends in memory, + persists. +- `planUndo(gameID:)` / `planRedo(gameID:)` — derive the next step and return + per-cell restore instructions; pure, no mutation of state. +- `canUndo` / `canRedo`. + +### Undo/redo derivation + +One pass over the in-memory entries (seq order), grouping consecutive entries +that share a `batchID`/kind into *operations*, then a stack machine: + +``` +for op in operations: + userEdit -> redoStack = []; liveStack.append(op) // new edit cuts redo branch + undo -> if let s = liveStack.popLast() { redoStack.append(s) } + redo -> if let s = redoStack.popLast() { liveStack.append(s) } +``` + +`liveStack.last` is the next thing to undo; `redoStack.last` the next to redo. +Undone entries are never deleted — replay keeps the true history — the +derivation just stops offering them once a new edit cuts the branch. + +### The superseded guard + +Each restore instruction carries the entry's after-state as `expectedCurrent`, +which `GameMutator` compares (by **letter only**) to the live `game` cell. If +the letter differs, a collaborator (or a later edit) changed the cell since, so +we **skip that cell** — this stops undo from ever clobbering someone else's +current letter. Only the letter is compared so that a peer's check/reveal (a +mark-only change) doesn't block undoing one's own letter. In a batch, superseded +cells are skipped individually; the step still counts as undone. + +## `GameMutator` / `Game` hooks + +- `emitMove` gains `journalKind` + `batchID` + `targetSeq` and, alongside the + existing sync enqueue, calls `movesJournal.record(...)` with the after-state. + A move is recorded only when the cell's state actually changed. +- `setLetter`/`clearLetter` → `kind = .input`, `batchID = nil` (one step each). + `checkCells`/`revealCells`/`clearCells` → `kind = .check`/`.reveal`/`.clear`, + one shared `batchID`. Only `.input` and `.clear` are undoable; `.check`/ + `.reveal` are recorded for replay but inert to the stack machine. +- `undo()` / `redo()`: ask the journal for a plan, apply surviving instructions + via a new low-level `Game.applyCellState(...)` (sets entry + mark + author + directly; can override the reveal lock), emitting each as `kind = .undo`/ + `.redo` under one op `batchID`. +- `canUndo` / `canRedo` drive the UI control's enabled state. + +## Open questions / notes + +- **Undo control UI.** Buttons in `PuzzleView` are deferred to follow-up. +- **`recordedEntries(gameID:)`** exposes the log read-only; it's a seam Phase 2b + replay can read locally (the cross-device merge reads uploaded assets). + +## Phase 2a — journal upload (built) + +At game completion (win or resign, in `GameStore`), each device encodes its +whole local `JournalEntity` log via `JournalCodec` (JSON) and pushes it as a +`CKAsset` on a `Journal` record — one per `(game, authorID, deviceID)`, named +`journal-<gameID>-<authorID>-<deviceID>`, written into the game's existing zone. +It mirrors the `Moves` record path and is **write-once with no system-fields +archive** (like `Ping`/`Decision`): `buildRecord` reconstructs the asset from the +durable `JournalEntity` rows, so a pending upload survives an app kill, and +`MovesJournal.flush()` is awaited before enqueue to beat the journal's async +persistence. Inbound `Journal` records are deliberately ignored (`case +"Journal": break`) — 2b fetches them on demand instead of applying peers' logs. + +`JournalCodec` carries the mark as the single lossless `markCode` (matching +`JournalEntity`), not the `Moves` two-bool flattening, and decodes optionals +leniently for forward-compat. Key files: `Persistence/Journal.swift`, +`Sync/RecordSerializer.swift`, `Sync/RecordBuilder.swift`, `Sync/SyncEngine.swift` +(`enqueueJournalUpload`), `Persistence/GameStore.swift`, `Services/AppServices.swift`. + +CloudKit schema: the `Journal` record type (`authorID` String, `deviceID` +String, `updatedAt` Date/Time, `entries` Asset) was added manually and deployed +to Development and Production on 2026-05-30. No new zone. + +## Phase 2b — cross-device replay data layer (built) + +The read side of replay. `SyncEngine.fetchReplay(forGameID:)` (in +`Sync/CloudQuery.swift`) runs two zone-scoped `CKQuery`s against the finished +game's zone — one for `Journal` records (decoding each `entries` asset via +`JournalCodec`) and one for `Moves` records (keys only) — and returns a +`JournalReplayFetch { journals, expectedDevices }`. It's a plain query, so it +never disturbs the sync engine's change token; inbound `Journal` records stay +ignored in the delegate path. + +`Persistence/JournalReplay.swift` holds the pure, unit-tested core: +- `ReplayTimeline(merging:)` flattens every device's log and orders it by + `timestamp` (ties break on `actingAuthorID` then `seq`, so fetch order can't + change the result). `state(through:)` folds the first *n* steps last-write- + per-cell into `[GridPosition: JournalCellState]`. Forward replay needs only + each entry's after-state, so device-local `seq`/`prevSeqAtCell` are unused; + undo/check/reveal rows replay naturally as timestamped restores. +- `ReplayAssembler.assemble(fetch:localKey:localEntries:)` overlays this + device's *live* log over any uploaded copy of itself, then enforces **strict + completeness**: `.ready(timeline)` only when a journal is present for every + expected device, else `.waiting(missing:)`. `AppServices.loadReplay(gameID:)` + composes `fetchReplay` + `GameStore.localReplaySource` + the assembler and + maps an unreachable zone to `.unavailable`. + +**Strict completeness needs late devices to upload.** Completion learned purely +via sync used to set `completedAt` without uploading the local journal, so a +device away at finish time would block replay forever. `applyGameRecord` now +signals a not-completed → completed transition (`onCompletedTransition`), and +both inbound apply paths enqueue this device's journal upload — so any +contributing device that ever syncs the finished game converges. Residual: a +device that wrote cells and is then permanently gone keeps replay `.waiting` +indefinitely (accepted trade-off; a "replay anyway" affordance could relax it). + +## Remaining work + +- **Phase 3 — scrubber UI.** A scrubber inside the finish banner (above the + victory image + scoreboard in `SuccessPanel`), head starting hard-right at the + finished grid and dragging left to rewind. Drives the main `GridView` from + `ReplayTimeline.state(through:)`; disabled with a "waiting to sync history" + overlay while `loadReplay` returns `.waiting`. +- **Completion lockout.** Gate undo/redo off once the game's `completedAt` is + set, so a finished game can't be rewound. diff --git a/Notes/MacVersion.md b/Notes/MacVersion.md @@ -0,0 +1,126 @@ +# Bringing Crossmate to the Mac + +Notes from an exploration in June 2026. We did not ship anything — we evaluated +two routes, prototyped the Catalyst one far enough to launch, then decided it +was not worth pursuing right now. This records what we learned so a future +attempt does not start from zero. + +## The two routes + +**Mac Catalyst** — almost free to get building, a poor long-term fit. +**Native macOS (SwiftUI)** — the right shape, but a multi-week effort. + +Catalyst is genuinely ~2 lines of `project.yml` plus a couple of fixups. Native +is a real project: a Mac UI layer, sharing rewritten off `UICloudSharingController`, +the notification extension re-targeted, NYT re-plumbed. The honest framing is +that Crossmate is a collaborative, touch-first crossword app, and the Mac +audience may not justify either the Catalyst compromises or the native cost. +That is why we stopped. + +## What we confirmed about the codebase (the encouraging part) + +The engine is already cleanly separated and almost entirely UIKit-free, which is +the precondition for *either* route: + +- `Crossmate/Models` (21 files), `Crossmate/Sync` (21), `Crossmate/Persistence` + (5): **zero `import UIKit`**. The few `import SwiftUI` (PlayerColor, + PlayerPreferences, SyncEngine) are fine — SwiftUI is cross-platform. +- Core puzzle subviews `CellView.swift`, `GridView.swift`, `ClueList.swift` are + **pure SwiftUI** — reusable on macOS as-is. +- UIKit coupling is concentrated in `Crossmate/Views` and 4 services. The + background-task coupling (`UIApplication.beginBackgroundTask` / + `UIBackgroundTaskIdentifier`) is already abstracted behind closures in + `PuzzleSession` (the `beginBackgroundAssertion`/`endBackgroundAssertion` + pair), with the UIApplication implementations in `SessionCoordinator` and + `AppServices.ensureInBackground`. On macOS the replacement is + `ProcessInfo.performExpiringActivity`. +- Engine constructors are light: `SyncEngine(container:persistence:)` and + `GameStore(persistence:movesUpdater:authorIDProvider:…closures)`. A minimal + Mac composition root is feasible **without** reusing the monolithic + `AppServices` (which eagerly builds push/NYT/engagement and `import UIKit`s). + +`../Listless` is the working reference for the native route — same +CloudKit/CKShare stack, shipped natively on macOS. + +## Catalyst route — what it actually took + +For the record, getting Catalyst to *build* was: + +1. `SUPPORTS_MACCATALYST: YES` in `project.yml` settings. +2. Remove `LSSupportsOpeningDocumentsInPlace: false` from the iOS Info.plist — + macOS rejects an explicit `NO`; absent defaults to the same behaviour. +3. Guard three iOS-26-only SwiftUI APIs with `#available(iOS 26.0, *)` + + fallbacks: `glassEffect` (in `SuccessPanel` ×2 and `PuzzleView` pencil + button) and `ToolbarSpacer` (in `GameListView`). These guards are worth + keeping regardless of route — they make the version-gating explicit. + +### The deployment-target trap (important, non-obvious) + +Catalyst derives its **minimum macOS from the iOS deployment target** via a fixed +Apple mapping (iOS 26 → macOS 26 Tahoe; iOS 18 → macOS 15). Crossmate targets +iOS 26, so a stock Catalyst build demands macOS 26 and Xcode refuses "My Mac" as +a run destination on anything older. + +- To run on macOS 15 you must lower the **iOS-equivalent availability floor**, + not the link-time minimum. The lever is + `IPHONEOS_DEPLOYMENT_TARGET[sdk=macosx*] = 18.0` (Catalyst compiles under the + `macosx` SDK), which keeps the real iOS app at 26 while the Catalyst variant + targets iOS 18 / macOS 15. +- **Do not** instead set `MACOSX_DEPLOYMENT_TARGET = 15.0`. That silences the + compiler (it still uses the iOS-26 floor for availability) so iOS-26 API + compiles, but the binary then *crashes at runtime* on macOS 15 when it hits a + symbol that does not exist there. We verified this: it "builds" with zero + errors and would have shipped a crash. +- Verify the result with `vtool -show-build <binary>`: a Catalyst binary reports + `platform MACCATALYST` and `minos` in **iOS** numbers (18.0 == macOS 15). + +With the floor lowered, "My Mac (Mac Catalyst)" became an eligible destination +and the only remaining blocker was code signing (no Mac Catalyst provisioning +profiles — a portal/account step, not code). + +### The runtime wall we hit + +Catalyst launched and ran. Opening the **puzzle picker** (`NewGameSheet`) then +crashed: + +``` +No Observable object of type NYTAuthService found. +``` + +`NewGameSheet` reads `@Environment(NYTAuthService.self)`, injected once at the +root in `CrossmateApp`. On iOS, sheets inherit that environment; **on Catalyst +the `@Observable` object did not propagate across the sheet boundary.** Several +sheets read root-injected observables (`SettingsView` reads `nytAuth` too), so +this is systemic, not one view. The fix would be to re-inject the needed +observables on each presented sheet from the presenting side (you cannot read +them inside the sheet — that is the broken environment). We did not finish it; +this quirk is a good example of why Catalyst is "runs but fights you." + +## If we revisit: recommended shape (native) + +We had settled on, and still recommend, a **scaffolding-slice** first increment: + +- SwiftUI App lifecycle (not a full AppKit shell like ListlessMac — far less + code; add AppKit only where needed, e.g. sharing). +- **XcodeGen source globs, no file moves**: a new `Crossmate macOS` target whose + `sources` curate the clean engine folders + a whitelist of portable views + + a new `CrossmateMac/` UI layer. Prefer **excluding** iOS-only files over + `#if`-guarding them — keeps the diff small and the engine untouched. +- A **minimal Mac composition root** wiring `PersistenceController → + MovesUpdater → GameStore (+ SyncEngine)` with no-op change closures, rather + than reusing `AppServices`. Factor `AppServices` into a shareable base only + later. +- Set `deploymentTarget.macOS` to match whatever the dev host runs (we used + 15.0 against a macOS 15.7 host) so the native target builds **and runs/tests + locally** — the native route sidesteps the Catalyst deployment-target trap + entirely. +- First slice = launches + game list + **read-only** puzzle (reuse `GridView`/ + `CellView`/`ClueList`). Defer, each as its own increment: CKShare sharing UI + (`UICloudSharingController` → `ShareLink`/AppKit), push + `NotificationService` + extension, NYT import/auth, and text entry (no on-screen keyboard on Mac — + input is already hardware-keyboard-driven via `HardwareKeyboardInputView` + + `UIKeyCommand`, with `InputMonitor` hiding the custom keyboard when a hardware + keyboard is present, which on a Mac is always). + +The full draft plan lived only in the planning tool and was not saved to the +repo; this note is the durable summary. diff --git a/Notes/PresenceTracking.md b/Notes/PresenceTracking.md @@ -0,0 +1,92 @@ +# Plan: Split the presence lease from the read watermark + +## Status + +Implemented. This document records the design and the rationale. + +## The problem + +A single field — the CloudKit `Player.readAt` (locally +`GameEntity.lastReadOtherMoveAt`, peer copy `PlayerEntity.readAt`) — was read +with two incompatible meanings: + +1. **Presence lease.** While a puzzle is on screen the app writes + `readAt = now + 10 min` (a forward-dated horizon). `PeerPresence.isPresent`, + the engagement-room teardown, `soonestPresentPeer` and `hasPresentPeer` all + treat a future `readAt` as "this player is here right now." +2. **Read watermark.** The session-end ("stopped solving — filled N") push and + the unread badge treat `readAt` as "this account has *seen* moves through + T." + +The forward-dating makes both work *while you're looking*. It breaks the +moment you stop: the lease stays in the future for up to 10 minutes after you +background, so + +- a **peer** computing what you've missed sees your future `readAt`, concludes + you've already read their later moves, and **suppresses the summary push**; + and +- your **own unread badge** compares "latest other move" against the future + `readAt` and stays clear, **swallowing moves that arrived after you left.** + +Observed in a diagnostics log: Bob solved (+17 fills) at 00:26 while Alice was +backgrounded with a lease valid to 00:34. The moves synced in and surfaced only +as the in-app session-summary banner on reopen — no push, no badge. The lease +frequently does *not* collapse cleanly on background (`self … readAt=00:34:20` +long after leaving), so "collapse the lease harder" is not a fix. The watermark +has to be its own value. + +## The fix + +Two separate fields: + +| Concept | Local (`GameEntity`) | Peer (`PlayerEntity`) | Wire (`Player`) | +|---|---|---|---| +| Presence lease (forward-dated) | `lastReadOtherMoveAt` | `readAt` | `readAt` | +| Read watermark (never future) | `readThroughAt` | `readThrough` | `readThrough` | + +- **`readAt` keeps its exact current meaning and behaviour** — the forward-dated + presence lease. Presence is the fragile, real-time subsystem and is what + behaves correctly today, so it is left untouched. +- **`readThrough` is new**: the latest other-author move time the account has + *actually observed*. Monotonic, never leased into the future. + +### Who writes the watermark +- **On open** (`GameStore.markOtherMovesRead`): advance to `latestOtherMoveAt`. +- **While viewing** (`GameStore.noteIncomingMovesUpdate` lockstep): advance to + the latest inbound move, gated on `NotificationState.isSuppressed` (eyes on + the grid). +- **On leave/background** (`publishReadCursor(.currentTime)`): advance to `now` + — present right up to here — *in lockstep with* collapsing the lease. + +`publishReadCursor(.activeLease)` keeps writing only the lease. The two no +longer fight over one field. + +### Who reads the watermark +- **Session-end push**: `pushPlan` sources `PushRecipient.readThrough` from the + peer's `PlayerEntity.readThrough`; `SessionPushPlanner` windows the tally on + `max(readThrough, notifiedThrough)`. Presence reads are untouched. +- **Unread badge**: `GameSummary.computeHasUnread` compares `latestOtherMoveAt` + against `readThroughAt`. + +### Wire + multi-device +- `RecordBuilder`/`RecordSerializer` send both fields; `CloudQuery` desiredKeys + request `readThrough`; `RecordApplier` adopts the peer copy, and for our own + record coming back from a sibling device advances the shared + `GameEntity.readThroughAt` monotonically (so the badge clears on every device + once any device reads). The watermark rides the Player record sync; the fast + `accountSeen` push still carries only the lease. + +## Compatibility + +The app is unreleased (two testers), so **no backward compatibility was kept**: +old records simply have a `nil` `readThrough` (tallies their whole backlog, +self-healing on the next read). Core Data migration is lightweight-additive +(two optional `Date` attributes); `PersistenceController.recreateStore` is the +fallback if a store can't be opened. + +## Follow-up (v4) + +`readAt` is now purely a presence horizon, so its name is misleading. A `TODO(v4)` +note sits at `GameStore.setReadCursor` and `RecordSerializer.playerRecord`: +rename the `readAt` CloudKit field (and `lastReadOtherMoveAt`) to something like +`presenceLeaseUntil` when the schema next breaks. diff --git a/Notes/PushWorker.md b/Notes/PushWorker.md @@ -0,0 +1,214 @@ +# Push Worker + Sync Simplification + +## Context + +After the engagement Worker landed (collaborative live-solving over WebSockets, 2026-05), it became natural to use the same Cloudflare runtime for push notifications. Pings were unreliable for time-critical user-facing events, and the receiver-side heuristics that had grown around them (`SessionMonitor.presentBegins`, the 3-minute quiescence-window scheduler) were brittle. This pass moves the user-facing event notifications onto a second Cloudflare Worker that signs APNs JWTs and pushes alerts directly, while folding the per-cell state for check / reveal / resign into the Moves stream so it propagates collaboratively. + +## What landed (2026-05-27) + +## App Attest migration (2026-06) + +The push worker no longer relies on a build-baked `PUSH_BEARER` for normal +clients. Updated apps enroll a per-install App Attest key through +`GET /attest/challenge` + `POST /attest/register`, then sign every +`/register`, `/register DELETE`, and `/publish` request with that key. + +The worker stores only the attested public key and assertion counter keyed by +`deviceID` + `keyID`. This preserves the no-Crossmate-account model: CloudKit +still owns collaboration identity, and the worker only proves that a request +came from an attested Crossmate install. Existing games keep working because +their CloudKit `pushAddress` values are unchanged; updated clients simply +re-register the same addresses with App Attest request auth. + +Deploy requirements: + +- `APP_TEAM_ID`, `APP_BUNDLE_ID`, and `APP_ATTEST_ENVIRONMENT` are configured + in `wrangler.push.toml`. +- `APP_ATTEST_ROOT_CERT_PEM` must be set as a Worker secret or dashboard + variable. +- `ALLOW_LEGACY_PUSH_BEARER = "1"` may be set temporarily for rollback, but + normal deployment should leave it unset. + +### Worker + +- New `Workers/push-worker.js` + `Workers/wrangler.push.toml`. +- Single Durable Object class `PushRegistry`, sharded by `idFromName("registry")` (one global instance — Crossmate's scale doesn't need per-author sharding). +- Three push endpoints, all App-Attest-auth gated: + - `POST /register` — `{deviceID, token, environment, addresses}` upsert. Idempotent. + - `DELETE /register` — `{deviceID, addresses}` — unregisters dropped address bindings. + - `POST /publish` — `{kind, addressees, gameID, fromAuthorID, title, alertBody}` — fans out to every addressee's registered devices. +- APNs JWT (ES256) signed in-Worker via Web Crypto. JWT cached in DO instance memory and refreshed every ~40 min (APNs' rate-limit floor is ~20 min; the 1-hour ceiling is the upper bound). +- Per-token `environment` field picks sandbox vs production endpoint; iOS reports `"sandbox"` for Debug builds and `"production"` for TestFlight/App Store. +- Deployed at the push Worker's `*.workers.dev` URL with `preview_urls = false`. + +### Secrets + +Loaded via `wrangler secret put --config wrangler.push.toml`: + +- `APNS_KEY` — the `.p8` (piped via `< AuthKey_*.p8` stdin redirect; not pasted, because the interactive prompt eats newlines). +- `APNS_KEY_ID`, `APNS_TEAM_ID` — both 10-char. +- `APP_ATTEST_ROOT_CERT_PEM` — public Apple App Attest root certificate used + by the worker to verify the attestation chain. + +APNs key is topic-restricted to the app's bundle ID, both environments enabled. Apple caps at 2 APNs keys per Team, so a single key serving both endpoints leaves headroom. + +### iOS plumbing + +- `Crossmate/Services/PushClient.swift` — `@MainActor` class. Reads + `CrossmatePushBaseURL` from `Bundle.main` and signs worker requests via + `PushRequestAuthenticator`. +- Idempotent: `updateAPNsToken(_:)` on every `didRegisterForRemoteNotificationsWithDeviceToken` callback, `updateAuthorID(_:)` on every `identity.refresh(...)` site. Worker side dedupes unchanged triples. +- Account switch: explicit `DELETE /register` for the previous `(authorID, deviceID)` before re-registering. +- Failures go to `syncMonitor.note(...)` for diagnostics; never user-surfaced. + +### Push kinds (sender-formatted strings, no `loc-key` — Crossmate is English-only) + +| Kind | Trigger | Body | +|------|---------|------| +| `win` | `sendCompletionPings(resigned: false)` after the local player completes a puzzle | "Alice solved the puzzle 'X'" | +| `resign` | `sendCompletionPings(resigned: true)` after the local player gives up | "Alice resigned the puzzle 'X'." | +| `play` | `PuzzleDisplayView.updateActiveNotificationPuzzleID` when scenePhase becomes `.active` | "Alice is solving the puzzle 'X'" | +| `pause` | Same handler when scenePhase leaves `.active`, plus the selection-clear block at puzzle close | "Alice added 12 letters and cleared 3 letters in the puzzle 'X'" | + +Pause uses `LocalSessionTracker` (`Crossmate/Services/LocalSessionTracker.swift`) — a small `@MainActor` counter wired into `store.onLocalCellEdit` that resets on every `begin(gameID:)` and drains on `consume(gameID:)`. The "skip if both counts are zero" guard inside `publishSessionEndPush` handles dedup between scenePhase-background and puzzle-close naturally. + +### Pings that stay + +- `.friend` — friendship handshake. +- `.invite` — re-invite to a game (uses `addressee`; surfaces in the Invited section). +- `.hail` — engagement room bootstrap. +- `.join` — the *invite-accept* one (one-shot, when a collaborator first accepts a share). Distinct from the new session-start APN, which is the `play` kind on the push side specifically to avoid this naming collision. + +`PingScope` and `PingScopePayload` are gone entirely (only `.check`/`.reveal` used them). + +### Check/reveal/resign in Moves + +- New `Crossmate/Models/CheckResult.swift` — `enum CheckResult: Sendable, Equatable, Codable { case right, case wrong }`. Used as `CheckResult?`; `nil` is the canonical "unchecked" value (no `.none` case). +- `CellMark.pen` / `.pencil` now carry `checked: CheckResult?` instead of `checkedWrong: Bool`. +- `Game.checkCells` now stamps `.right` on correct entries (previously erased them to `.none`). Wrong entries continue to stamp `.wrong` — pen-vs-pencil style is preserved across the check. +- Wire format and Core Data keep two parallel `checkedRight` / `checkedWrong` booleans (translation lives in `GameMutator.encodeMark` / `GameStore.decodeMark`). `MovesCodec.Payload.Entry` gets a custom `init(from:)` that defaults `checkedRight` to false so records written by older clients still decode cleanly. +- Resign reveals propagate via the existing Moves reveal mechanism — peer grids fill in cooperatively. No new `Game.resignedBy` field; a resigned game and a reveal-all-puzzle game look identical at cold start (accepted trade-off — the moment-of-resign signal lives in the APN). + +### SessionMonitor cleanup + +Slimmed to what the on-open catch-up banner actually needs: + +- Kept: `ingest(_ deltas:)` (bucket accumulation), `consumeOnOpen(gameID:)`, `cancel(gameID:authorID:)`, `bodyText(playerName:puzzleTitle:added:cleared:)`. +- Removed: `presentBegins`, `scheduleEnd`, the 3-minute quiescence timer, `Bucket.scheduledFor`, `SessionNotificationScheduling` protocol + `RecordingNotificationScheduler` test recorder, `bodyText(for: Session)`, the end-/begin-identifier helpers, the `notificationCenter` and `notificationAuthorization` constructor params. + +The on-open banner still says "Alice added 12 letters; Bob added 5 letters" — that path is independent of the new APNs and remains useful as a catch-up when the user opens an unread puzzle. + +## Design decisions + +- **Sender-formatted strings, not `loc-key`/`loc-args`.** Crossmate has no `Localizable.strings` / `.lproj` yet. Setting up the localization scaffolding just to feed receiver-side formatting that would produce the same English text is more work than it's worth for a single-locale app. Switch to `loc-key` when a second language ships. +- **One Durable Object, not per-author sharding.** Single DO instance can comfortably hold thousands of token registrations; `/publish` resolves targets in one `state.storage.list({prefix: ...})` per addressee. +- **JWT caching is mandatory.** APNs rejects rapidly-rotated provider tokens with `TooManyProviderTokenUpdates`. Cache in DO instance memory (sticky region) — no storage RTTs. +- **`CellMark` as enum, parallel-Bool storage.** The conceptual model is the optional `CheckResult` enum; the on-disk encoding is two booleans. Translation at the boundary keeps the API clean without requiring a Core Data migration model. +- **No `Game.resignedBy` distinguisher.** Cold-state conflation of resign vs reveal-all-puzzle is acceptable; the moment-of-resign signal lives in the APN body. +- **No `lastAction` field on Moves; no Actions record.** Replay is a future feature that will design its own log format with the consuming UX in mind. Designing the granularity now without the consumer would be a guess. +- **Push registration is fire-and-fetch-error.** Worker is idempotent, so the next launch's APNs callback retries naturally. Failures log to the diagnostics monitor but don't surface to the user. +- **`preview_urls = false`.** Each preview URL carries the same secret bindings; gratuitous attack surface for no benefit. + +## Badge and sibling-device read horizons (2026-06) + +Badge state is now split between durable-ish local truth and best-effort push +transport: + +- The Notification Service Extension and the app share an App Group + `BadgeState` ledger. It is a per-game horizon map, not a set: `unreadAt` + advances when a badge-worthy APN is delivered, `seenAt` advances when the + user opens/clears the game, and a game counts only when `unreadAt > seenAt`. + This lets stale APNs lose to a newer open without needing CloudKit to arrive + first. +- The app icon badge is computed as `BadgeState.unreadGameIDs()` unioned with + Core Data's unread-other-moves query. Core Data unread games are also seeded + back into the ledger as `unreadAt` horizons, each stamped with that game's + newest unseen other-author move time (`latestOtherMoveAt`). The NSE can't + reach Core Data, so without this seed a push landing on a suspended app would + re-stamp the badge from the ledger alone and drop any game whose unread state + arrived purely via CloudKit sync. The seed is safe precisely because the + ledger is a horizon map and not a set: re-seeding a game the user has since + opened is a no-op, because its newer `seenAt` still wins. (The old set-based + ledger couldn't express that, which is why the write-back was dropped when + the horizon map first landed — and why it is safe to reinstate now.) +- Ledger entries are maintained by lifecycle event, not reconciled by absence + (a ledger-only game is ambiguous: it could be push-ahead-of-sync, which must + count, or a deleted game, which must not). So `BadgeState.forget(gameID:)` is + called on the two hard-removal hooks — local `onGameDeleted` and sync-driven + `onGameRemoved` — and `BadgeState.reset()` on the diagnostics data wipe. This + matters because a deleted game has nothing left to open, so a stale + `unreadAt > seenAt` entry could never be cleared by a seen horizon and would + badge forever. +- The ledger key is `badge.ledger.v2`. The bump from `v1` is a deliberate + one-shot discard: the `v1` migration off the pre-horizon set stamped + `unreadAt = now` for every legacy entry, fabricating fresh-unread phantoms + that no read state could defeat (a future-dated `unreadAt` beats any past + `seenAt`). `loadLedger` now drops the legacy stores without fabricating, and + rebuilds from Core Data via the seed above. +- Each account has an account-scoped push address, published as the existing + `Decision` record `decision-account-pushAddress` with `kind = account` and + the address in the payload. This avoids adding CloudKit schema fields while + giving all devices on the same iCloud account a shared silent-push route. +- When a device joins a shared game, it sends `accountJoined` to the account + address. Sibling devices ignore self-sends, fetch/sync, mint their own + per-game push address if needed, and register with the Worker so they can + receive that game's future pushes. +- When a device clears/sees a game, it sends `accountSeen` with + `senderDeviceID` and `readAt`. Sibling devices ignore self-sends, apply the + supplied `readAt` to their local read cursor and badge ledger, withdraw any + matching delivered notifications, and refresh the app badge. Inbound + `accountSeen`, CloudKit `Player.readAt` updates and ping deletions are + non-echoing so the account push path remains one-hop rather than a feedback + loop. +- If the game is actively visible, the seen push uses the same future + read-lease horizon as `Player.readAt`. That prevents an account-seen message + from racing after an active-lease update and accidentally collapsing + presence on another device. + +The push server is still not durable truth for badges. It only transports +same-account hints quickly. If a silent push is dropped, the next app launch or +CloudKit fetch recomputes from local Core Data plus whatever the NSE delivered +on that device. + +## Open items / gotchas + +- **Schema deployment is not needed.** Moves records ship the cells as an opaque binary blob; the new `checkedRight` boolean rides inside that blob. No CloudKit dashboard field to add (unlike, say, `Player.readAt`). The `CellEntity.checkedRight` Core Data attribute auto-migrates locally on first launch (Boolean default `NO`). +- **No visual indicator UI yet.** The cell state is now propagating; rendering it on the grid is separate UX work. Plan was tri-state: subtle dot for `.right`, distinct marker for `.wrong`, cleared when the cell's letter changes. +- **The bearer in the binary is the only abuse gate.** That matches your stated comfort level (users aren't adversarial). If the binary is ever reverse-engineered the bearer leaks; rotate via `wrangler secret put` if that happens. +- **Background-time publish.** Backgrounding while in a puzzle fires `publishSessionEndPush` from a `Task` that runs as the app suspends. iOS gives ~5–10 seconds; in practice a small POST to a Cloudflare Worker completes well within that window. If we see drops in the field, switch to a `BGTaskScheduler` background task or a background `URLSession`. +- **`onPlayerEvent` removed from `PlayerSession`** — the 6 fire sites that used to enqueue check/reveal pings are gone. Anyone re-introducing peer-visible per-action notifications would need to wire a new path; the existing one no longer exists. +- **Mixed-version peers.** `MovesCodec.Payload.Entry.init(from:)` defaults `checkedRight` to false, so a TestFlight build that's behind on the schema can still decode records from the Simulator. Older builds drop the new flag when re-encoding (they just don't see it), so a check-correct stamped on the new build that round-trips through an older device's re-write would lose its `.right` state. Acceptable transient: the next check on the new device re-stamps. + +## File-level inventory + +Added: + +- `Workers/push-worker.js`, `Workers/wrangler.push.toml` +- `Crossmate/Services/PushClient.swift` +- `Crossmate/Services/LocalSessionTracker.swift` +- `Crossmate/Models/CheckResult.swift` + +Modified: + +- `project.yml`, `Crossmate/Info.plist`, `Crossmate/Crossmate.entitlements` — push base URL and App Attest entitlement/configuration +- `Crossmate/Services/PushRequestAuthenticator.swift` — per-install App Attest enrollment and request signing +- `Crossmate/Models/CellMark.swift` — enum-based `checked: CheckResult?` +- `Crossmate/Models/Game.swift` — `setLetter` / `checkCells` updated for new mark API +- `Crossmate/Models/CrossmateModel.xcdatamodeld` — `CellEntity.checkedRight` attribute +- `Crossmate/Models/PlayerSession.swift` — `onPlayerEvent` + 6 fire sites removed +- `Crossmate/Sync/Moves.swift` — `TimestampedCell` / `GridCell` / `RealtimeCellEdit` / `Payload.Entry` gain `checkedRight` +- `Crossmate/Sync/Presence.swift` — `PingKind` trimmed, `PingScope` / `PingScopePayload` removed, `Ping.scope` removed +- `Crossmate/Sync/SyncEngine.swift` — `enqueuePing` loses `scope`; `PingPayload` loses `scope` +- `Crossmate/Sync/RecordSerializer.swift` — `pingRecord(scope:)` removed +- `Crossmate/Sync/RecordBuilder.swift` — drop `payload.scope` passthrough +- `Crossmate/Sync/SessionMonitor.swift` — slimmed to bucket accumulation + `consumeOnOpen` +- `Crossmate/Sync/MovesUpdater.swift` — `enqueue(checkedRight:)`, `Pending.checkedRight` +- `Crossmate/Sync/GridStateMerger.swift` — propagate `checkedRight` +- `Crossmate/Persistence/GameMutator.swift` — `encodeMark` returns the triple; cell-edit path passes `checkedRight` +- `Crossmate/Persistence/GameStore.swift` — `decodeMark` takes the triple; `applyCellCache` / `applyRealtimeCellEdit` updated +- `Crossmate/CrossmateApp.swift` — `AppDelegate.onAPNsToken` callback; `updateActiveNotificationPuzzleID` fires the play/pause pushes; selection-clear block fires the pause push; `session.onPlayerEvent` callback removed +- `Crossmate/Services/AppServices.swift` — `PushClient` init wired; `publishCompletionPush` (win + resign), `publishSessionStartPush`, `publishSessionEndPush`, `recipientsAndTitle`; `sendPlayerEventPings` removed; `bodyText(for: Ping)` trimmed; `presentBegins` call site removed + +Tests touched: `RecordSerializerMovesTests`, `PendingEditFlagTests`, `GridStateMergerTests`, `MovesUpdaterTests`, `GameStoreUnreadMovesTests`, `Sync/MovesInboundTests`, `Sync/AuthorDeltaTests`, `Sync/EngagementCoordinatorTests`, `Sync/PendingChangeReapTests`, `Sync/SessionMonitorTests`, `GameMutatorTests`, `XDAcceptTests`, `PuzzleNotificationTextTests`, `RecordSerializerTests`. + +All unit tests pass. No CloudKit Dashboard changes required. diff --git a/Notes/ReadHorizons.md b/Notes/ReadHorizons.md @@ -0,0 +1,165 @@ +# Problem: cross-device read-horizon / presence collapse + +This document captures an unresolved design problem in the read-cursor / +unread-badge sync path. It is written to be picked up cold in a new session. +**No fix is committed for it yet that works** — see "Current state" below. + +## TL;DR + +The account's read horizon (`GameEntity.lastReadOtherMoveAt`) and its +"a device is actively present" lease are squeezed into a **single +last-writer-wins scalar** carried on **one Player record per (game, account)**. +That structure cannot represent "device A has left but device C is still +present," so any attempt to stop a leaving device from collapsing another +device's active lease is unsolvable with the data we currently store. Two +successive fixes attempted it and both were wrong. We need to decide between +(a) accepting the bounded, self-healing limitation, or (b) making presence +per-device, which is a schema + sync change. + +## Background: how the read horizon works today + +- `GameEntity.lastReadOtherMoveAt` is the per-account "how far we've read other + authors' moves" horizon. The unread badge / session nudge fires when a peer + (different iCloud account) has a `MovesEntity.updatedAt` later than this + horizon. See `GameStore.unreadOtherMovesGameCount()` and `GameSummary`. +- A **future-dated** horizon is an *active-session lease*: "a device of this + account is in the puzzle right now, suppress badges/nudges until this time." + `publishReadCursor(.activeLease)` writes `now + readLeaseDuration` (10 min) + and refreshes only when less than `readLeaseRefreshFloor` (5 min) remains + (`AppServices.swift`). Exit/background writes `now` (`.currentTime`), + intentionally closing the lease. +- The horizon is shipped to other devices on the Player record's `readAt` + field. **Outbound `readAt` is built from `entity.game?.lastReadOtherMoveAt`** + (`RecordBuilder.swift:135`) — i.e. every device publishes the single shared + account scalar, not its own lease. +- Inbound: `RecordApplier.applyPlayerRecord` adopts `readAt` (and the + `sessionSnapshot` catch-up baseline) **above** the selection `updatedAt` LWW + guard but **under** the per-record etag freshness guard, then calls + `onReadCursor` → `GameStore.noteIncomingReadCursor` for our own account's + records only (`authorID == localAuthorID`). This hoist was commit `3835f0f` + ("Converge the catch-up baseline regardless of cursor recency"), which fixed a + real duplicate-catch-up-banner bug. + +## The decisive fact (why the recent fixes fail) + +**Player records are one per `(game, author)`, with no device component:** + +``` +RecordSerializer.swift:62 + static func recordName(forPlayerInGame gameID: UUID, authorID: String) -> String { + "player-\(gameID.uuidString)-\(authorID)" // <-- no deviceID + } +``` + +All of an account's devices share the *same* `authorID`, so they all write the +*same* Player record. Consequences: + +1. CloudKit resolves concurrent device writes by **last-writer-wins**. The + record's `readAt` is simply whoever wrote most recently. +2. Locally there is exactly **one** `PlayerEntity` row per `(game, authorID)`. + The per-record etag guard (`applyPlayerRecord`, the + `incomingIsAtLeastAsFresh` check) rejects older CloudKit server snapshots + than the local etag, so the `readAt` that reaches `noteIncomingReadCursor` + is from the accepted/current server version of that one Player record. That + does **not** mean its `updatedAt` value is semantically freshest; the + catch-up-baseline bug existed precisely because the accepted record could + carry a stale `updatedAt`. +3. There is **no stored representation of an individual device's lease.** "A + left, C is still here" is unrepresentable; the scalar holds only the last + write. + +## The original concern and why it is inherent, not a bug we introduced + +Red-team round 1 worried that a leaving device's record could move +`lastReadOtherMoveAt` backward and collapse another device's active future +lease, briefly reopening unread badges / pause nudges. With the one-record LWW +model this is just **last-writer-wins on a shared scalar**: if device A closes +(writes `now`) after device C leased `now+10m`, the record now says `now`. Other +devices adopt `now`. It is bounded and **self-healing**: the still-active device +re-leases within ≤5 min (`readLeaseRefreshFloor`), and a foreground device marks +inbound moves read on arrival anyway. It cannot be "fixed" without representing +C's lease separately from A's close. + +## What was attempted (both wrong) + +1. **`0ccbe90` — `protectActiveLease` floor (committed).** Made + `noteIncomingReadCursor` refuse any incoming `readAt` below an unexpired + future horizon. **Broke clean closes:** a real close from another device is + indistinguishable from a stale echo (both are "a value below the current + future horizon"), so legitimate closes were ignored and a stale future lease + was held — suppressing badges/nudges for up to the 10-min lease. + +2. **`e2d863a` — recompute from sibling rows (committed).** Replaced the + floor with `horizon = max(incoming, min(current, now), max sibling + PlayerEntity.readAt)`. **Two fatal flaws, both correct per red-team:** + - **P1:** "max over sibling `PlayerEntity` rows" assumes per-device rows that + do not exist (one row per author). In production it degenerates to the + single LWW row and distinguishes nothing. The accompanying test fabricated + `player-A`/`player-B` rows with the same author — impossible. + - **P1:** `min(current, Date())` floors at wall-clock *processing* time, not + the actual departure time. A late-processed close (device left 10:02, + received 10:05, old lease 10:10) advances the horizon to 10:05 and marks + peer moves made 10:02–10:05 as read even though no device saw them. + +## Current state + +- Committed: `3835f0f` (good, keep), `0ccbe90` (broken floor), and `e2d863a` + (broken recompute). +- `PLAN.md` is untracked documentation. The working tree may otherwise be clean + depending on whether this file has been added. +- The unit suite currently passes, but only because the test was written to the + fictional model. Passing is not meaningful here. + +## Options to decide between + +**Option A — Revert to the plain-adopt baseline (recommended as the floor).** +Restore `noteIncomingReadCursor` to simply +`_ = setReadCursor(gameID: gameID, readAt: readAt)` (the pre-`0ccbe90` behavior), +i.e. revert both `0ccbe90` and `e2d863a` back to the `3835f0f` state. This is +the LWW-consistent choice: trust the account's latest resolved +`readAt`, allow backward movement. It has **no false-negative** (a close to +10:02 yields horizon 10:02; later moves badge correctly — Finding 2 gone) and +**no fictional data dependency** (Finding 1 gone). Its only cost is the inherent +round-1 collapse, which is bounded (≤5 min) and self-healing. The catch-up +baseline fix from `3835f0f` is preserved. + +**Option B — Make presence per-device (the real fix, larger).** Represent each +device's lease separately so the account horizon can be a true max over *live* +device leases. Sketch: +- Give the Player record a `deviceID` (like Moves/Journal/Ping already do: + `moves-<game>-<author>-<device>`), so each device owns its own row, OR add a + separate lightweight per-device presence/lease record. +- Compute the suppression horizon as the max **unexpired** lease across the + account's own device rows; let truly-read progress be a separate monotonic + value that never moves backward. +- This is what the round-2/round-3 attempts *wanted* to do but couldn't, because + the rows didn't exist. It is a schema + sync-path change with migration + considerations and should be scoped deliberately. Note CloudKit production + cannot be inspected/edited via the dashboard, so verify via on-device + diagnostics. + +**Option C — Status quo (`0ccbe90`/`e2d863a` as-is).** Not recommended: the +committed follow-up still relies on per-device Player rows that do not exist +and can mark unseen moves read via the wall-clock processing-time floor. This +is worse than the occasional spurious nudge Option A allows. + +## Recommendation + +Take **Option A now** to get back to a correct, simple baseline, then decide +separately whether the round-1 collapse is worth **Option B**. My read: the +collapse is minor and self-healing, so Option A may be sufficient on its own and +Option B is only worth it if real-world reports show spurious badges/nudges +while a sibling is demonstrably active. + +## Key references + +- `Crossmate/Sync/RecordSerializer.swift:62` — Player record name (the root fact) +- `Crossmate/Sync/RecordBuilder.swift:135` — outbound `readAt` = shared scalar +- `Crossmate/Sync/RecordApplier.swift` (`applyPlayerRecord`) — etag guard, the + `3835f0f` hoist, `onReadCursor` +- `Crossmate/Persistence/GameStore.swift` — `noteIncomingReadCursor`, + `setReadCursor`, `noteIncomingMovesUpdate`, `markOtherMovesRead` +- `Crossmate/Services/AppServices.swift` — `publishReadCursor`, + `readLeaseDuration` (10 min), `readLeaseRefreshFloor` (5 min), the + `accountSeen` push path (~2208), the incoming-cursor drain (~486) +- Commits: `3835f0f` (keep), `0ccbe90` and `e2d863a` (revert) diff --git a/Notes/ServerSecurity.md b/Notes/ServerSecurity.md @@ -0,0 +1,237 @@ +# Crossmate Audit + +Date: 2026-06-12 (updated 2026-06-16) + +Scope: Crossmate iOS/iPadOS app, shared app/extension code, tests, scripts, and +Cloudflare Workers used by Crossmate. Crossmake was intentionally excluded. + +## Summary + +The highest-risk issue was the push worker's authentication model. The app +previously shipped a shared push bearer in its bundle configuration, and the +worker treated that bearer as the only authorization gate for registering +devices and publishing pushes. Anyone who extracted that value from a real app +build could use the worker as Crossmate. That issue has been addressed by +replacing shared bearer authentication with per-install App Attest enrollment and +signed worker requests. A temporary legacy-bearer fallback behind an +environment flag was subsequently removed and the legacy push-bearer secret +deleted from the worker (see Finding 1), so App Attest is +now the only authentication path. + +The rest of the audit found lower-risk operational and robustness issues around +push environment handling, local secret storage, engagement room authentication, +and destructive local-store recovery behavior. All of these have since been +addressed. The most substantive was the engagement room authentication design +(Finding 4): the secret-in-URL symptom sat on top of a trust-on-first-use +room-auth scheme whose HMAC signature added no security. Rooms are now +registered with the worker ahead of connecting (secret in a POST body, never a +URL), and the connect signature is verified against the worker-held secret. + +## Findings + +### 1. Push worker used a bearer secret embedded in the app + +Status: Fixed. + +Impact: High. + +The app read `CrossmatePushBearer` from its bundle configuration and sent it to +the push worker. The Cloudflare worker accepted that shared value as sufficient +authorization for `/register`, `/publish`, and unregister operations. Because the +same bearer was present in every distributed build, extracting one app binary +would expose the credential for all clients. + +The implemented replacement keeps Crossmate account-free while removing the +global shared secret. Updated clients now enroll a per-install App Attest key +with the worker, store the key ID locally, and sign each protected push request. +The worker verifies Apple App Attest attestation, stores only the public key and +assertion counter, and accepts push operations only when signed by the registered +key. Existing collaborative games remain compatible for updated clients because +CloudKit game state and push address derivation did not change; clients simply +re-register the same push addresses under the new authentication scheme. + +The worker initially retained a legacy bearer fallback behind an +environment flag as a rollback window. Resolved +2026-06-12: the fallback branch was removed from `Workers/push-worker.js`, the +updated worker was redeployed, and the legacy push-bearer secret was +deleted from the worker. App Attest is now the only authentication path. + +Deployment notes: + +- `APP_ATTEST_ROOT_CERT_PEM` has been uploaded to Cloudflare from + Apple's published App Attest root certificate. +- `crossmate-push` has been deployed with App Attest environment and app identity + variables. +- The regenerated `Crossmate iOS Distribution` provisioning profile includes + `com.apple.developer.devicecheck.appattest-environment`. +- App Attest enrollment and signed worker requests have been validated on a real + client after the initial implementation session. + +### 2. Push environment handling can diverge between APNs entitlement and client registration + +Status: Fixed. + +Impact: Low (debug builds only). + +`Crossmate.entitlements` hardcoded `aps-environment: production`, while +`PushClient` selected `sandbox` under `#if DEBUG` and `production` otherwise. +TestFlight and App Store builds are release-signed, so both sides agreed on +production there; the mismatch only affected local device debug builds, where the +token's APNs environment and the environment registered with the worker could +disagree and pushes silently fail. That failure mode was a diagnosability +annoyance during development rather than a user-facing risk. + +Resolved 2026-06-12: a single `APS_ENVIRONMENT` build setting in `project.yml` +(`development` for Debug, `production` for Release) is now substituted into both +the `aps-environment` entitlement and a `CrossmateAPSEnvironment` Info.plist +key. `PushClient` reads the Info.plist key instead of inferring from +`#if DEBUG`, and disables push (failable init, logged to diagnostics) if the +value is missing or unrecognised. Because the entitlement and the worker +registration now derive from the same variable, they cannot diverge. This also +corrects the Debug entitlement itself: development-signed builds previously +claimed `production` while their provisioning profile said `development`. + +### 3. Account push secret storage and generation path is weakly guarded + +Status: Fixed. + +Impact: Low. + +The account push secret is stored in local defaults, and random generation +failure was not treated as a hard failure: `generatePushSecret()` in +`AccountPushCoordinator` discarded the `SecRandomCopyBytes` result, so on a +(vanishingly unlikely) failure the minted secret would be the base64 of 32 zero +bytes. + +Moving the local copy to the Keychain would change little, because the secret is +deliberately replicated across the account's devices as a CloudKit `Decision` +record — the private CloudKit database is already its dominant home. The +severity is therefore Low rather than Medium. + +The boundary this finding originally documented — the secret is the HMAC key +from which per-game push addresses are derived, those addresses acted as bearer +capabilities, and the worker's `handlePublish` did not check that a caller +participates in the game it pushed to, so any App Attest-enrolled client that +learned a derived address could publish to that account — has since been closed +(participation gating, below). + +Resolved 2026-06-12 (secret hardening): `generatePushSecret()` now checks the +`SecRandomCopyBytes` status with a `precondition` (crashing is preferable to +durably converging a predictable HMAC key across the account's devices) and +reuses the existing `Data.base64URLEncodedString()` helper. + +Resolved 2026-06-16 (participation gating): publishing to a game now requires +proving participation, mirroring the engagement room-secret scheme (Finding 4). +Each shared game carries a per-game push credential (`GamePushCredentials` — an +unguessable `credID` plus a 256-bit secret) in the Game record's `notification` +field, synced only to CKShare participants. Devices register their per-game push +addresses with the worker keyed under that `credID`, and a publish is signed +with the game secret (HMAC over the App Attest request's already-validated body +hash, timestamp, and nonce). The worker verifies the signature against the +secret it holds for the credID and resolves targets only among addresses +registered under it — so a caller can reach a game's participants only if it +holds that game's secret, i.e. it is a participant. A signature alone would not +have sufficed (the worker can't tell which addresses belong to which game), so +the device address registration is credID-bound too; that binding is the part +that actually closes the hole. Account-scoped sibling pushes +(accountJoined/accountSeen) carry no credID and stay on the App-Attest-only +path; their addresses derive from the account secret in the private database and +were never participant-spoofable. Deployment: the Game record gains a +`notification` String field (Development + Production) and the updated push +worker must be deployed; older clients lose game pushes (the durable CloudKit +path is unaffected), which is acceptable pre-release. + +### 4. Engagement room authentication is trust-on-first-use and its signature adds no security + +Status: Fixed. + +Impact: Medium. + +The secret-in-URL exposure originally flagged here was the symptom; the room +authentication design underneath was the actual weakness. + +The client sent both the room secret and an HMAC-SHA256 signature over +`roomID|authorID|deviceID|timestamp|nonce` as query parameters. The worker +verified that signature using the secret supplied in the same request +(`room-worker.js`), so anyone could mint a valid signature for any secret +they chose — the HMAC added nothing. The real gate was a trust-on-first-use +`secretHash` stored in the Durable Object: the first client to connect to a +room defined its secret, and later connections had to match it. The secret +itself was the bearer credential, and it travelled in the query string, where +it could appear in logs, diagnostics, proxies, and tooling surfaces. + +Resolved 2026-06-12 (code): the worker gained a `POST /rooms/{id}/register` +endpoint that stores the room secret from the request body — first write wins, +re-registration with the same secret is an idempotent success, and a different +secret is refused with 409. The websocket connect no longer accepts a `secret` +parameter at all; the worker verifies the HMAC signature against the +registered secret and refuses unregistered rooms. Timestamp-freshness and +nonce-single-use checks already existed on the worker and are unchanged — with +a server-held key they now carry real authentication. Clients register +idempotently before every connect, which preserves the previous de-facto +behavior that an idle-expired Durable Object room is resurrected by whichever +creds-holder returns first, and removes any ordering race between the minting +peer and peers that receive the creds via CloudKit. A 409 (room ID registered +under a different secret, e.g. squatted after idle expiry) is surfaced as +`EngagementReconcileOutcome.roomRejected`, and `EngagementLifecycle` clears the +Game record's `engagement` field so a fresh room is minted and LWW converges +the peers onto it. + +Deployed 2026-06-12: the updated worker refuses TOFU connects, so clients +running builds that predate this change lose the live engagement channel +(graceful — the durable CloudKit Moves path is unaffected). It was deployed +alongside the updated clients. + +### 5. Core Data store recovery can erase local state + +Status: Fixed. + +Impact: Medium. + +The local persistence recovery path could wipe the Core Data store when store +load or migration fails. The in-code rationale +(`PersistenceController.recreateStore`) was that the store is a rebuildable +cache of CloudKit, so a wipe is recoverable via refetch. That rationale does not +hold in two cases: + +- iCloud sync is user-toggleable (`isICloudSyncEnabled`). For a user with sync + disabled, the local store is the *only* copy of their games, and the wipe is + total, permanent data loss. +- Even with sync on, offline edits not yet uploaded through the sync engine are + lost. + +Resolved 2026-06-12: `recreateStore` now moves the failing store files +(including the `-wal`/`-shm` sidecars) aside under a `.broken-<timestamp>` +suffix before destroying and rebuilding, so the data stays inspectable and +recoverable instead of being erased. The recovery event — including the original +load error and the preserved file names — is surfaced through the diagnostics +`EventLog`. `PersistenceController` gained an injectable store URL for tests, +and targeted tests now cover both the failure path (broken store preserved +byte-for-byte, empty store rebuilt) and the healthy path (reopening an existing +store triggers no recovery). Preserved backups are not auto-pruned; the path +requires a failed migration, so accumulation is not a practical concern. + +## Verification Performed + +After the App Attest push-auth changes, these checks passed: + +- `node --check Workers/push-worker.js` +- `xcodegen generate` +- `bash Scripts/build.sh` +- `bash Scripts/test-unit.sh` + +Cloudflare was also updated operationally: + +- `APP_ATTEST_ROOT_CERT_PEM` secret uploaded. +- `crossmate-push` deployed to its workers.dev URL. + +## Release Validation + +The App Attest challenge/registration path and signed worker requests have been +validated on a real client. The remaining release checks are the broader +end-to-end collaboration and push-delivery behaviors: + +- Fresh install registers for notifications. +- Two updated clients can continue an existing collaborative game. +- Background push delivery still works. +- Worker logs remain clean during normal registration and publish traffic. diff --git a/Notes/Snapshots.md b/Notes/Snapshots.md @@ -0,0 +1,143 @@ +# Snapshots in Shared Games + +Snapshots are a move-log compaction and replay acceleration mechanism. Instead +of reconstructing a game by replaying every move from an empty grid, a client can +start from a stored grid snapshot and replay only the moves not included in that +snapshot. + +The current scalar `upToLamport` model is not safe for shared games. It assumes +there is one global move sequence: + +```text +Snapshot(upToLamport: 42) +``` + +Replay then skips every move whose Lamport is `<= 42`. That is only correct if +all devices assign Lamports from a single globally ordered stream. In a shared +or offline-capable game, two devices can independently create moves with the +same or overlapping Lamport values. A snapshot from device A at Lamport 42 may +not include device B's move at Lamport 41. If replay treats B's move as covered, +that move can disappear from the reconstructed grid. + +## Proposed Model + +Use per-replica move clocks and snapshot coverage vectors. + +Each move should identify the replica that produced it and that replica's local +sequence number: + +```text +Move(replicaID: "device-a", sequence: 12) +Move(replicaID: "device-b", sequence: 7) +``` + +A snapshot should store the grid plus the highest included sequence for each +replica: + +```text +Snapshot( + grid: ..., + coverage: [ + "device-a": 12, + "device-b": 7 + ] +) +``` + +Replay can then skip a move only when the selected snapshot explicitly covers +that move: + +```swift +let coveredSequence = snapshot.coverage[move.replicaID] ?? 0 +let isCovered = move.sequence <= coveredSequence +``` + +This prevents one device's snapshot from accidentally hiding another device's +move. A move is skipped only if the snapshot proves that the move was folded +into the snapshot grid. + +## Replay and Conflict Ordering + +Snapshot coverage answers one question: which moves are included in this +snapshot? + +Conflict ordering answers a separate question: when multiple moves affect the +same square, which move wins? + +Those should not be represented by the same scalar value. Replay should: + +1. Choose a snapshot. +2. Start from the snapshot grid. +3. Apply only moves not covered by that snapshot. +4. Use a deterministic conflict ordering for same-cell writes. + +The conflict ordering can be last-writer-wins with a stable tie-breaker, for +example `(createdAt, replicaID, sequence)` or another explicit ordering chosen +for product semantics. The important part is that coverage remains per-replica. + +The shipped tie-break (`GridStateMerger.shouldReplace`) is +`(updatedAt, authorID, deviceID)`: per-cell last-writer-wins on the writer's +wall-clock `updatedAt`, then lex-min `authorID`, then lex-min `deviceID`. + +### Accepted trade-off: cross-device wall-clock skew + +The merge, the completion seal (`merge(notAfter: completedAt)`), the pause-push +diff window, and presence all compare raw device wall clocks across users — +there is no logical clock and no server-time normalisation on apply. A device +whose clock runs fast therefore wins every cell conflict regardless of real +typing order, and its legitimate pre-win letters can be dropped by the +completion seal when their future-dated `updatedAt` reads as past the latch. + +This is accepted as-is rather than mitigated, because the realistic incidence is +low. iOS defaults to "Set Automatically" (NTP / cellular network time), which +holds devices within sub-second of true time; quartz drift on a merely-offline +device is seconds per week. Sub-second skew almost never inverts a merge outcome, +since that requires two writers to the *same* cell within a real-time gap smaller +than the skew, and the losing letter in such a tie is usually plausible anyway — +it is a coin-flip on a shared cell, not silent destruction of unique work. The +one regime where skew actually bites is a user who has turned off automatic time +or set the clock manually (minutes to days off); that is self-inflicted +misconfiguration, and since `CKShare` participants are invited friends, a +malicious far-future clock is griefing by someone you chose to play with, not a +cheating-for-advantage vector — there is nothing zero-sum to win in a cooperative +solve. + +If field reports ever surface a "my answer disappeared" trace to a wrong clock, +the cheap mitigation is to clamp each inbound cell's `updatedAt` to +`record.modificationDate + ε` on apply, bounding every writer's skew by +CloudKit's server clock (which all devices share). It is deliberately not built +speculatively: it adds a permanent "inbound timestamps are rewritten on apply" +behaviour to reason about, for an edge no user is known to hit. + +## Pruning + +Pruning old moves is the risky part. With multiple replicas, snapshots can be +incomparable: + +```text +Snapshot A covers device-a: 20, device-b: 5 +Snapshot B covers device-a: 10, device-b: 30 +``` + +Neither snapshot fully replaces the other. If moves are deleted merely because +they are covered by some snapshot, a later replay from a different snapshot may +miss required moves. + +The conservative implementation path is: + +1. Add per-replica coverage to snapshots. +2. Use snapshots only as replay accelerators. +3. Do not prune moves initially. +4. Add pruning later only for moves covered by a retained canonical snapshot. + +A newer snapshot can safely replace a canonical snapshot when its coverage +dominates the previous one: for every known replica, the newer snapshot covers at +least as high a sequence number. Until that rule exists and is tested, retaining +the move log is safer than compacting it incorrectly. + +## Practical Recommendation + +For now, disable snapshot pruning for shared games. It is also reasonable to +disable shared-game snapshot creation entirely until the snapshot format records +per-replica coverage. The cost is a larger move log and slower replay for very +long games; the benefit is avoiding data loss or disappearing squares. diff --git a/Notes/SyncRedesign.md b/Notes/SyncRedesign.md @@ -0,0 +1,186 @@ +# Sync Redesign Plan + +## Context + +On 2026-04-21, completing a puzzle on device surfaced two distinct failures: + +1. **Data loss**: after filling every cell and seeing the win screen, the puzzle list's preview was missing a square. Re-opening the puzzle confirmed a previously-filled cell had reverted to blank. +2. **CloudKit throttling**: diagnostic logs showed the container hitting a device-level throttle (`x-cloudkit-error=THROTTLED`, `CKRetryAfter=22`, `ThrottleLabel=Device throttle: container=iCloud.net.inqk.crossmate.v2, type=210/211`) with five distinct `OperationID`s inside a ~2-second window. + +These are separate problems with a shared amplifier. + +## Status (as of 2026-04-26) + +**Phases 1, 2, 3a, 3b+3c, and 4 are complete.** Only the conditional Phase 5 (WebRTC) remains, gated on measured live-latency being inadequate. + +### Done + +- **Phase 1** — _was_: fetch-side LWW guard on `RecordSerializer.applyCellRecord`. Now deleted: Cell records no longer exist in the move-log model. +- **Phase 2** — _was_: `PushDebouncer`, single-flight push, `CKRetryAfter` handling. Now deleted: replaced by `CKSyncEngine`'s built-in batching and retry. +- **Phase 3a** — data model and replay plumbing: `MoveEntity`, `SnapshotEntity`, `GameEntity.lamportHighWater`, `MoveLog.swift`, `RecordSerializer` Move/Snapshot builders. Tests in `MoveLogTests.swift`. +- **Phase 3b+3c** — full move-log + CKSyncEngine adoption (all 9 sub-tasks done): + 1. `MoveBuffer.swift` — coalescing actor, lamport assignment, `MoveEntity` persistence, debounce. Tests in `MoveBufferTests.swift`. + 2. `SyncEngine.swift` — rewritten around `CKSyncEngine`. Delegate handles incoming `Move`/`Snapshot` records, replays `CellEntity` cache, fires `onRemoteMoves`. + 3. `GameMutator.swift` — now emits `Move` to `MoveBuffer`; no direct `CellEntity` writes; `RemoteCellChange` / `Origin` / `applyRemoteCell` deleted. + 4. `GameStore.swift` — restores from move-log replay; `createGame` no longer seeds `CellEntity`; `repairSyncedGamesIfNeeded` / `applyRemoteChanges` deleted. + 5. `AppServices.swift` — constructs `MoveBuffer`, wires into `GameStore`, wires `onRemoteMoves` → `store.refreshCurrentGame()`. + 6. `CrossmateApp.swift` — no changes needed; `syncOnBackground()` already calls `moveBuffer.flush()`. + 7. `OutboxRecorder.swift`, `PushDebouncer.swift`, `PendingChangePayload.swift`, `PendingChange+Helpers.swift` deleted; `PendingChangeEntity` and obsolete token fields removed from data model. + 8. `PushDebouncerTests.swift`, `PendingChangeTests.swift` deleted; `ApplyCellRecordLWWTests` / `PendingChangePayloadTests` suites removed from `RecordSerializerTests.swift`; `GameMutatorTests.swift` pruned to in-memory mutation tests only. + 9. `MoveBufferTests.swift` — 7 tests all passing. + +- **Phase 4** — CKShare via per-game zones and dual sync engines, plus shared player identity: + 1. Per-game zones (`game-<UUID>`) with one `CKShare` rooted on the `Game` record; `Move`/`Snapshot`/`Player` records sit in the same zone via `parent` reference, so share inclusion is automatic. + 2. Dual sync engines: one driving `privateCloudDatabase` (owned games), one driving `sharedCloudDatabase` (joined games), parameterised through the same delegate path. + 3. `ShareController` for invitation creation; share-accept handler in `AppServices` accepts via `CKAcceptSharesOperation` and wires the new zone into the shared engine. + 4. `AuthorIdentity` stamps every `Move` with the local iCloud user record name, so multi-author replay attributes correctly. + 5. Share revocation: `markAccessRevoked(gameID:)` flips the local game read-only when the participant is removed; resolves the open question from earlier drafts. + 6. **Player identity** (deviated from the original `namesByAuthor`-on-Game JSON-blob plan in `.claude/plans/shared-player-identity.md`): names ride on per-(game, author) `Player` records (`player-<gameID>-<authorID>`) through the same engine. Each author owns their own record name → no write-write conflicts, no JSON merge primitive, no "must start from latest server record" hazard. `NameBroadcaster` fans out renames to every shared/joined game (debounced 250 ms); `PlayerRoster` resolves entries via the local `PlayerEntity` rows → `CKShare.Participant` name components → email → `"Player"` fallback chain. Colours are device-local in `GamePlayerColorStore`, with collision reassignment when the local user picks a remote colour. + 7. Snapshot pruning correctness fix landed during Phase 4: covered moves are now deleted only after `CKSyncEngine` confirms the snapshot save (`needsPruning` flag on `SnapshotEntity`), preventing data loss when an upload fails permanently. + +### Design decisions already locked in conversation + +- **Lamport timestamps** — simple per-game monotonic counter stored in `GameEntity.lamportHighWater`. Bumped on local emit and on receiving a move with a higher lamport. No cross-participant total order; undo-across-participants stays out of scope. +- **Snapshot cadence** — on game completion + opportunistic when the move log exceeds 200 moves. Starting point; tunable later. +- **No migration** — user deletes existing games before launching the new version, so the new code can assume the move-log schema from first run. +- **`CellEntity` stays** — as a replay-driven cache, not source of truth. Views that read `CellEntity` keep working. + +### Gotchas for a fresh session + +- The Xcode project is generated by XcodeGen from `project.yml`. Run `xcodegen generate` after adding new `.swift` files or the build target won't see them. +- The test target is named `Crossmate Unit Tests` (with spaces). Use `-only-testing:'Crossmate Unit Tests/<SuiteName>'`. +- SourceKit diagnostics in-session regularly claim "Cannot find type X in scope" for Core Data generated classes and cross-file types. These are noise — trust `xcodebuild build` for ground truth. +- Build target: `'platform=iOS Simulator,name=iPhone 17 Pro'`. +- The user commits changes themselves — never run `git commit` unsolicited. + +### Root cause: data loss + +`RecordSerializer.applyCellRecord` (`Crossmate/Sync/RecordSerializer.swift:129`) unconditionally overwrites the local `CellEntity`'s `letter`, `markKind`, `checkedWrong`, `updatedAt`, `letterAuthorID` with the server's values. There is no LWW check against local `updatedAt` and no check for an outstanding `PendingChangeEntity`. Only `handlePushError`'s `.serverRecordChanged` branch does LWW; the plain fetch path does not. + +Reproduction: +1. User types final letter; local `letter="A"`, pending change queued. +2. Push is throttled; pending change remains in the outbox. +3. Interleaved fetch (log: `fetch: changedZoneIDs count=1`) pulls the server's stale copy of that cell (`letter=""`). +4. `applyCellRecord` overwrites local `letter="A"` with `""`. UI renders blank. +5. Pending change would eventually re-push `"A"`, but during the window the cell genuinely reads blank. + +### Root cause: throttling + +The sync engine writes one `CKRecord` per cell and fires pushes aggressively: +- No client-side debounce: every keystroke enqueues a `PendingChangeEntity`. +- No serialisation: multiple `pushChanges()` invocations can produce overlapping `CKModifyRecordsOperation`s (hence the five distinct `OperationID`s in one second). +- No `CKRetryAfter` handling: on 429, `handlePushError`'s default branch silently leaves the pending change in the outbox, and the next trigger immediately hits CloudKit again, tightening the device cooldown. + +### Why data model is central + +Throttling is ultimately about request rate and payload volume, not record shape per se. But the per-cell-record shape makes debouncing and coalescing hard (each cell has its own identity and history), and it doesn't compose with any real-time side-channel — each keystroke is always a record write. A better-chosen data shape removes the hardest constraint before we start tuning knobs. + +## Decision + +**Adopt a CloudKit + move-log + CKShare architecture, with WebRTC as deferred optional future work.** + +### Data model: append-only move log + +The source of truth for a game's cell state becomes an append-only collection of `Move` records, each describing a single cell mutation: + +- `recordName`: `move-<gameUUID>-<lamportOrUUID>` +- `parent`: reference to the parent `Game` record (so CKShare inclusion is automatic) +- Fields: `row`, `col`, `letter`, `markKind`, `checkedWrong`, `authorID`, `createdAt`, and a monotonic ordering key (Lamport timestamp or similar). + +The grid displayed in the UI is the deterministic replay of the move log. `CellEntity` becomes a local cache populated by replaying moves into Core Data; it is not the source of truth. + +Periodic **snapshots** keep the move log bounded: every N moves (or on game completion), write a `Snapshot` record capturing the full grid state and delete moves older than the snapshot. New joiners load the latest snapshot plus the tail of moves since it. + +### Why move log over per-cell-LWW or whole-game blob + +| Model | Throttling | Multiplayer correctness | Composes with real-time | Conflict code | +|---|---|---|---|---| +| Per-cell records (current) | Bad: per-keystroke write, hard to coalesce | Requires LWW on fetch (not present today) | Awkward: same keystroke would need to flow over two transports | Must implement on every client | +| Whole-game blob with LWW | Good: coalesces trivially | **Broken**: Bob's blob without Alice's write wipes her cells | Awkward | N/A (wrong) | +| Whole-game blob with per-cell LWW merge | Good | Works if merge is disciplined | Awkward | Non-trivial merge code on every client | +| **Move log + snapshots** | **Good**: moves are small, append-only, debouncable | **Works**: each author writes only their own records, no `serverRecordChanged` races on shared cells | **Clean**: the same move unit ships over CloudKit or any side-channel | Deterministic replay; no merge code | + +The move log's key property for CKShare is **no write-write conflicts between participants** — Alice and Bob only ever create new records, each with a unique `recordName`. That removes the hardest correctness problem from the table. + +### How coalescing works in the move log + +"Coalescing" in this design is really a debounced in-memory buffer that dedupes by cell before anything becomes a record. A keystroke does **not** become a `Move` record at the moment it happens — it sits in an in-memory buffer keyed by `(gameID, row, col)`. A new keystroke in the same cell replaces the pending entry's value. At flush time, each buffer entry becomes one `Move` record with a fresh Lamport timestamp, and all records from one flush go into a **single** `CKModifyRecordsOperation`. + +So if the user types `A` → `B` → `C` in the same cell inside the debounce window, only one `Move` record is written, with `letter = "C"`. The intermediate states never exist as records. Flush triggers are: the debounce timer (1–2s), cell-change (flush the old cell before the new one starts accumulating, so the log's causal order matches the user's editing order), app background, and game completion. + +This is safe specifically because moves are append-only and each author only writes their own records — local coalescing never has to reason about server state or another participant's writes, which is what breaks the equivalent coalescing in the per-cell-record model. + +### Sync mechanics: CKShare for both async and sync + +- Each `Game` record is the root of a `CKShare`. Participants accept the share URL via `UICloudSharingController` and the OS's share-accept handler. +- `Move` and `Snapshot` records set `parent = Game`, so they are part of the share automatically. +- The engine drives both `privateCloudDatabase` (owner) and `sharedCloudDatabase` (participant) through the same code paths, parameterised by database handle. +- `CKDatabaseSubscription` on both databases delivers silent pushes on change; the app wakes, fetches, replays moves into the local cache. +- Async solving: partner's moves arrive on next fetch; 1–5s latency after push is fine. +- Sync solving: same mechanism; feels like near-real-time at 1–3s in good conditions. + +### Throttle-safety rules (apply regardless of model) + +These are properties the engine must have, independent of the data model choice: + +1. **Debounce writes.** Buffer move events in memory; flush at most every 1–2 seconds (or on cell change / app background). +2. **Single in-flight push.** Serialise `pushChanges()` via an actor or explicit lock so overlapping triggers coalesce rather than fan out. +3. **Respect `CKRetryAfter`.** On `CKError.serviceUnavailable` / `.requestRateLimited` / `.zoneBusy`, sleep for the returned delay before the next push attempt. Do not break out of drain on transient errors. +4. **Adopt `CKSyncEngine`.** It handles batching, retry-after, and send-state tracking natively. Re-implementing these primitives has been the source of the current bugs. The minimum deployment target for this redesign is iOS 18 (or iOS 26), so `CKSyncEngine` — including its shared-database support — is available unconditionally. + +### Deferred: WebRTC data channel + +If 1–3s live latency proves inadequate in practice, add a WebRTC data channel between live participants: + +- Signaling: small records written to the shared CloudKit zone (Alice writes SDP offer; Bob fetches via push, writes answer; ICE candidates exchanged the same way). Signaling latency is irrelevant since it's one-time per session. +- STUN: public free server (`stun:stun.l.google.com:19302` or similar). +- TURN: free tier (Cloudflare Calls, Xirsys) or paid (~$0.40/GB, Twilio) for the ~15–20% of sessions that fail direct P2P. Graceful fallback: if P2P fails, use the existing CloudKit move-log path. +- Implementation: `WKWebView`-hosted WebRTC using WebKit's built-in `RTCPeerConnection` / `RTCDataChannel`. Avoids the ~20 MB `libwebrtc` bundle cost. Swift ↔ JS bridge via `WKScriptMessageHandler` + `evaluateJavaScript`. The app already ships `WKWebView` for NYT auth, so the framework cost is zero. + +This is **out of scope for the initial redesign.** CloudKit + move log + CKShare should be shipped and evaluated first. Add WebRTC only if measured live latency is genuinely unacceptable. + +### Rejected alternatives + +- **GameKit** (`GKMatch` / `GKTurnBasedMatch`). Requires Game Center authentication, which adds user friction, fails for users who have dismissed the sign-in prompt, and introduces a second identity layer separate from iCloud. The transport advantages do not outweigh the user-facing costs. +- **Custom backend** (Firebase, Supabase, Convex, Cloudflare Durable Objects, Ably). Violates the no-backend constraint. Kept in mind as a future option if pure-CloudKit proves inadequate at scale. +- **SharePlay as the primary transport.** Requires an active FaceTime/Messages session, which is not guaranteed. Still a candidate as a future augmentation for call-based live sessions, but cannot replace the async path. + +## Scope of the redesign + +### In scope (initial implementation) + +- New `Move` and `Snapshot` record types, with serialisation and replay logic. +- `CellEntity` becomes a derived cache populated by replay; not the source of truth. +- No migration. The user is the sole participant and will manually delete existing games before launching the new version. The new codebase is free to assume the move-log schema from the first run. +- `SyncEngine` rewritten around `CKSyncEngine`, driving both the private and shared databases. +- Dual-database support: `privateCloudDatabase` for owned games, `sharedCloudDatabase` for joined games. +- CKShare creation, invitation flow via `UICloudSharingController`, share-accept handler in the app delegate. +- Per-record `authorID` from `CKUserIdentity` (iCloud identity), so moves attribute correctly in multiplayer. +- Fix the fetch-side overwrite bug: `applyCellRecord` (or its replacement in the move-log model) must not clobber state without LWW. + +### Out of scope (explicitly deferred) + +- WebRTC data channel and signaling. +- SharePlay integration. +- MultipeerConnectivity for co-located play. +- Presence indicators (who is currently in the puzzle) and cursor tracking. +- Operation-based undo across participants. +- Conflict UX for the pathological "both type in the same cell in the same second" case — accept Lamport-ordered LWW silently for now. + +## Open questions + +_All originally listed questions have been resolved:_ +- _Share revocation: resolved by `markAccessRevoked(gameID:)` — local replay-derived state stays readable, no new moves accepted._ +- _Snapshot cadence and move ordering: resolved (see "Design decisions already locked")._ + +## Phasing + +Proposed, subject to refinement: + +- **Phase 1: Fix the immediate data-loss bug in place.** Teach `applyCellRecord` to check for a pending outbox change or LWW `updatedAt` before overwriting. Keep the per-cell record model. Ships as a patch release. +- **Phase 2: Throttle-safety.** Implement debounce, serialise pushes, respect `CKRetryAfter` on top of the existing hand-rolled `CKModifyRecordsOperation` engine. Still per-cell records. Ships as the next release; validates that pure-CloudKit with good hygiene is acceptable. The hand-rolled throttle code is understood to be short-lived — Phase 3 replaces it wholesale with `CKSyncEngine`. +- **Phase 3: Move-log data model + `CKSyncEngine` adoption.** _Landed._ New `Move`/`Snapshot` records, `CellEntity` as cache, replay engine. Transport rewritten around `CKSyncEngine` so batching, retry-after, and send-state tracking come from the framework rather than hand-rolled primitives. + - **3a.** _Landed._ Data model & replay plumbing. + - **3b + 3c.** _Landed._ Live wiring + Phase 2 primitive deletion. +- **Phase 4: CKShare.** _Landed._ Per-game zones, dual-engine sync (private + shared), invitation flow via `ShareController`, share-accept handler, per-author `Player` records for identity, share-revocation handling. +- **Phase 5 (deferred, conditional):** WebRTC if measured live latency is inadequate. Not justified yet — needs real-world signal.