ReadHorizons.md (9188B)
1 # Problem: cross-device read-horizon / presence collapse 2 3 This document captures an unresolved design problem in the read-cursor / 4 unread-badge sync path. It is written to be picked up cold in a new session. 5 **No fix is committed for it yet that works** — see "Current state" below. 6 7 ## TL;DR 8 9 The account's read horizon (`GameEntity.lastReadOtherMoveAt`) and its 10 "a device is actively present" lease are squeezed into a **single 11 last-writer-wins scalar** carried on **one Player record per (game, account)**. 12 That structure cannot represent "device A has left but device C is still 13 present," so any attempt to stop a leaving device from collapsing another 14 device's active lease is unsolvable with the data we currently store. Two 15 successive fixes attempted it and both were wrong. We need to decide between 16 (a) accepting the bounded, self-healing limitation, or (b) making presence 17 per-device, which is a schema + sync change. 18 19 ## Background: how the read horizon works today 20 21 - `GameEntity.lastReadOtherMoveAt` is the per-account "how far we've read other 22 authors' moves" horizon. The unread badge / session nudge fires when a peer 23 (different iCloud account) has a `MovesEntity.updatedAt` later than this 24 horizon. See `GameStore.unreadOtherMovesGameCount()` and `GameSummary`. 25 - A **future-dated** horizon is an *active-session lease*: "a device of this 26 account is in the puzzle right now, suppress badges/nudges until this time." 27 `publishReadCursor(.activeLease)` writes `now + readLeaseDuration` (10 min) 28 and refreshes only when less than `readLeaseRefreshFloor` (5 min) remains 29 (`AppServices.swift`). Exit/background writes `now` (`.currentTime`), 30 intentionally closing the lease. 31 - The horizon is shipped to other devices on the Player record's `readAt` 32 field. **Outbound `readAt` is built from `entity.game?.lastReadOtherMoveAt`** 33 (`RecordBuilder.swift:135`) — i.e. every device publishes the single shared 34 account scalar, not its own lease. 35 - Inbound: `RecordApplier.applyPlayerRecord` adopts `readAt` (and the 36 `sessionSnapshot` catch-up baseline) **above** the selection `updatedAt` LWW 37 guard but **under** the per-record etag freshness guard, then calls 38 `onReadCursor` → `GameStore.noteIncomingReadCursor` for our own account's 39 records only (`authorID == localAuthorID`). This hoist was commit `3835f0f` 40 ("Converge the catch-up baseline regardless of cursor recency"), which fixed a 41 real duplicate-catch-up-banner bug. 42 43 ## The decisive fact (why the recent fixes fail) 44 45 **Player records are one per `(game, author)`, with no device component:** 46 47 ``` 48 RecordSerializer.swift:62 49 static func recordName(forPlayerInGame gameID: UUID, authorID: String) -> String { 50 "player-\(gameID.uuidString)-\(authorID)" // <-- no deviceID 51 } 52 ``` 53 54 All of an account's devices share the *same* `authorID`, so they all write the 55 *same* Player record. Consequences: 56 57 1. CloudKit resolves concurrent device writes by **last-writer-wins**. The 58 record's `readAt` is simply whoever wrote most recently. 59 2. Locally there is exactly **one** `PlayerEntity` row per `(game, authorID)`. 60 The per-record etag guard (`applyPlayerRecord`, the 61 `incomingIsAtLeastAsFresh` check) rejects older CloudKit server snapshots 62 than the local etag, so the `readAt` that reaches `noteIncomingReadCursor` 63 is from the accepted/current server version of that one Player record. That 64 does **not** mean its `updatedAt` value is semantically freshest; the 65 catch-up-baseline bug existed precisely because the accepted record could 66 carry a stale `updatedAt`. 67 3. There is **no stored representation of an individual device's lease.** "A 68 left, C is still here" is unrepresentable; the scalar holds only the last 69 write. 70 71 ## The original concern and why it is inherent, not a bug we introduced 72 73 Red-team round 1 worried that a leaving device's record could move 74 `lastReadOtherMoveAt` backward and collapse another device's active future 75 lease, briefly reopening unread badges / pause nudges. With the one-record LWW 76 model this is just **last-writer-wins on a shared scalar**: if device A closes 77 (writes `now`) after device C leased `now+10m`, the record now says `now`. Other 78 devices adopt `now`. It is bounded and **self-healing**: the still-active device 79 re-leases within ≤5 min (`readLeaseRefreshFloor`), and a foreground device marks 80 inbound moves read on arrival anyway. It cannot be "fixed" without representing 81 C's lease separately from A's close. 82 83 ## What was attempted (both wrong) 84 85 1. **`0ccbe90` — `protectActiveLease` floor (committed).** Made 86 `noteIncomingReadCursor` refuse any incoming `readAt` below an unexpired 87 future horizon. **Broke clean closes:** a real close from another device is 88 indistinguishable from a stale echo (both are "a value below the current 89 future horizon"), so legitimate closes were ignored and a stale future lease 90 was held — suppressing badges/nudges for up to the 10-min lease. 91 92 2. **`e2d863a` — recompute from sibling rows (committed).** Replaced the 93 floor with `horizon = max(incoming, min(current, now), max sibling 94 PlayerEntity.readAt)`. **Two fatal flaws, both correct per red-team:** 95 - **P1:** "max over sibling `PlayerEntity` rows" assumes per-device rows that 96 do not exist (one row per author). In production it degenerates to the 97 single LWW row and distinguishes nothing. The accompanying test fabricated 98 `player-A`/`player-B` rows with the same author — impossible. 99 - **P1:** `min(current, Date())` floors at wall-clock *processing* time, not 100 the actual departure time. A late-processed close (device left 10:02, 101 received 10:05, old lease 10:10) advances the horizon to 10:05 and marks 102 peer moves made 10:02–10:05 as read even though no device saw them. 103 104 ## Current state 105 106 - Committed: `3835f0f` (good, keep), `0ccbe90` (broken floor), and `e2d863a` 107 (broken recompute). 108 - `PLAN.md` is untracked documentation. The working tree may otherwise be clean 109 depending on whether this file has been added. 110 - The unit suite currently passes, but only because the test was written to the 111 fictional model. Passing is not meaningful here. 112 113 ## Options to decide between 114 115 **Option A — Revert to the plain-adopt baseline (recommended as the floor).** 116 Restore `noteIncomingReadCursor` to simply 117 `_ = setReadCursor(gameID: gameID, readAt: readAt)` (the pre-`0ccbe90` behavior), 118 i.e. revert both `0ccbe90` and `e2d863a` back to the `3835f0f` state. This is 119 the LWW-consistent choice: trust the account's latest resolved 120 `readAt`, allow backward movement. It has **no false-negative** (a close to 121 10:02 yields horizon 10:02; later moves badge correctly — Finding 2 gone) and 122 **no fictional data dependency** (Finding 1 gone). Its only cost is the inherent 123 round-1 collapse, which is bounded (≤5 min) and self-healing. The catch-up 124 baseline fix from `3835f0f` is preserved. 125 126 **Option B — Make presence per-device (the real fix, larger).** Represent each 127 device's lease separately so the account horizon can be a true max over *live* 128 device leases. Sketch: 129 - Give the Player record a `deviceID` (like Moves/Journal/Ping already do: 130 `moves-<game>-<author>-<device>`), so each device owns its own row, OR add a 131 separate lightweight per-device presence/lease record. 132 - Compute the suppression horizon as the max **unexpired** lease across the 133 account's own device rows; let truly-read progress be a separate monotonic 134 value that never moves backward. 135 - This is what the round-2/round-3 attempts *wanted* to do but couldn't, because 136 the rows didn't exist. It is a schema + sync-path change with migration 137 considerations and should be scoped deliberately. Note CloudKit production 138 cannot be inspected/edited via the dashboard, so verify via on-device 139 diagnostics. 140 141 **Option C — Status quo (`0ccbe90`/`e2d863a` as-is).** Not recommended: the 142 committed follow-up still relies on per-device Player rows that do not exist 143 and can mark unseen moves read via the wall-clock processing-time floor. This 144 is worse than the occasional spurious nudge Option A allows. 145 146 ## Recommendation 147 148 Take **Option A now** to get back to a correct, simple baseline, then decide 149 separately whether the round-1 collapse is worth **Option B**. My read: the 150 collapse is minor and self-healing, so Option A may be sufficient on its own and 151 Option B is only worth it if real-world reports show spurious badges/nudges 152 while a sibling is demonstrably active. 153 154 ## Key references 155 156 - `Crossmate/Sync/RecordSerializer.swift:62` — Player record name (the root fact) 157 - `Crossmate/Sync/RecordBuilder.swift:135` — outbound `readAt` = shared scalar 158 - `Crossmate/Sync/RecordApplier.swift` (`applyPlayerRecord`) — etag guard, the 159 `3835f0f` hoist, `onReadCursor` 160 - `Crossmate/Persistence/GameStore.swift` — `noteIncomingReadCursor`, 161 `setReadCursor`, `noteIncomingMovesUpdate`, `markOtherMovesRead` 162 - `Crossmate/Services/AppServices.swift` — `publishReadCursor`, 163 `readLeaseDuration` (10 min), `readLeaseRefreshFloor` (5 min), the 164 `accountSeen` push path (~2208), the incoming-cursor drain (~486) 165 - Commits: `3835f0f` (keep), `0ccbe90` and `e2d863a` (revert)