crossmate

A collaborative crossword app for iOS
Log | Files | Refs | LICENSE

ReadHorizons.md (9188B)


      1 # Problem: cross-device read-horizon / presence collapse
      2 
      3 This document captures an unresolved design problem in the read-cursor /
      4 unread-badge sync path. It is written to be picked up cold in a new session.
      5 **No fix is committed for it yet that works** — see "Current state" below.
      6 
      7 ## TL;DR
      8 
      9 The account's read horizon (`GameEntity.lastReadOtherMoveAt`) and its
     10 "a device is actively present" lease are squeezed into a **single
     11 last-writer-wins scalar** carried on **one Player record per (game, account)**.
     12 That structure cannot represent "device A has left but device C is still
     13 present," so any attempt to stop a leaving device from collapsing another
     14 device's active lease is unsolvable with the data we currently store. Two
     15 successive fixes attempted it and both were wrong. We need to decide between
     16 (a) accepting the bounded, self-healing limitation, or (b) making presence
     17 per-device, which is a schema + sync change.
     18 
     19 ## Background: how the read horizon works today
     20 
     21 - `GameEntity.lastReadOtherMoveAt` is the per-account "how far we've read other
     22   authors' moves" horizon. The unread badge / session nudge fires when a peer
     23   (different iCloud account) has a `MovesEntity.updatedAt` later than this
     24   horizon. See `GameStore.unreadOtherMovesGameCount()` and `GameSummary`.
     25 - A **future-dated** horizon is an *active-session lease*: "a device of this
     26   account is in the puzzle right now, suppress badges/nudges until this time."
     27   `publishReadCursor(.activeLease)` writes `now + readLeaseDuration` (10 min)
     28   and refreshes only when less than `readLeaseRefreshFloor` (5 min) remains
     29   (`AppServices.swift`). Exit/background writes `now` (`.currentTime`),
     30   intentionally closing the lease.
     31 - The horizon is shipped to other devices on the Player record's `readAt`
     32   field. **Outbound `readAt` is built from `entity.game?.lastReadOtherMoveAt`**
     33   (`RecordBuilder.swift:135`) — i.e. every device publishes the single shared
     34   account scalar, not its own lease.
     35 - Inbound: `RecordApplier.applyPlayerRecord` adopts `readAt` (and the
     36   `sessionSnapshot` catch-up baseline) **above** the selection `updatedAt` LWW
     37   guard but **under** the per-record etag freshness guard, then calls
     38   `onReadCursor` → `GameStore.noteIncomingReadCursor` for our own account's
     39   records only (`authorID == localAuthorID`). This hoist was commit `3835f0f`
     40   ("Converge the catch-up baseline regardless of cursor recency"), which fixed a
     41   real duplicate-catch-up-banner bug.
     42 
     43 ## The decisive fact (why the recent fixes fail)
     44 
     45 **Player records are one per `(game, author)`, with no device component:**
     46 
     47 ```
     48 RecordSerializer.swift:62
     49   static func recordName(forPlayerInGame gameID: UUID, authorID: String) -> String {
     50       "player-\(gameID.uuidString)-\(authorID)"   // <-- no deviceID
     51   }
     52 ```
     53 
     54 All of an account's devices share the *same* `authorID`, so they all write the
     55 *same* Player record. Consequences:
     56 
     57 1. CloudKit resolves concurrent device writes by **last-writer-wins**. The
     58    record's `readAt` is simply whoever wrote most recently.
     59 2. Locally there is exactly **one** `PlayerEntity` row per `(game, authorID)`.
     60    The per-record etag guard (`applyPlayerRecord`, the
     61    `incomingIsAtLeastAsFresh` check) rejects older CloudKit server snapshots
     62    than the local etag, so the `readAt` that reaches `noteIncomingReadCursor`
     63    is from the accepted/current server version of that one Player record. That
     64    does **not** mean its `updatedAt` value is semantically freshest; the
     65    catch-up-baseline bug existed precisely because the accepted record could
     66    carry a stale `updatedAt`.
     67 3. There is **no stored representation of an individual device's lease.** "A
     68    left, C is still here" is unrepresentable; the scalar holds only the last
     69    write.
     70 
     71 ## The original concern and why it is inherent, not a bug we introduced
     72 
     73 Red-team round 1 worried that a leaving device's record could move
     74 `lastReadOtherMoveAt` backward and collapse another device's active future
     75 lease, briefly reopening unread badges / pause nudges. With the one-record LWW
     76 model this is just **last-writer-wins on a shared scalar**: if device A closes
     77 (writes `now`) after device C leased `now+10m`, the record now says `now`. Other
     78 devices adopt `now`. It is bounded and **self-healing**: the still-active device
     79 re-leases within ≤5 min (`readLeaseRefreshFloor`), and a foreground device marks
     80 inbound moves read on arrival anyway. It cannot be "fixed" without representing
     81 C's lease separately from A's close.
     82 
     83 ## What was attempted (both wrong)
     84 
     85 1. **`0ccbe90` — `protectActiveLease` floor (committed).** Made
     86    `noteIncomingReadCursor` refuse any incoming `readAt` below an unexpired
     87    future horizon. **Broke clean closes:** a real close from another device is
     88    indistinguishable from a stale echo (both are "a value below the current
     89    future horizon"), so legitimate closes were ignored and a stale future lease
     90    was held — suppressing badges/nudges for up to the 10-min lease.
     91 
     92 2. **`e2d863a` — recompute from sibling rows (committed).** Replaced the
     93    floor with `horizon = max(incoming, min(current, now), max sibling
     94    PlayerEntity.readAt)`. **Two fatal flaws, both correct per red-team:**
     95    - **P1:** "max over sibling `PlayerEntity` rows" assumes per-device rows that
     96      do not exist (one row per author). In production it degenerates to the
     97      single LWW row and distinguishes nothing. The accompanying test fabricated
     98      `player-A`/`player-B` rows with the same author — impossible.
     99    - **P1:** `min(current, Date())` floors at wall-clock *processing* time, not
    100      the actual departure time. A late-processed close (device left 10:02,
    101      received 10:05, old lease 10:10) advances the horizon to 10:05 and marks
    102      peer moves made 10:02–10:05 as read even though no device saw them.
    103 
    104 ## Current state
    105 
    106 - Committed: `3835f0f` (good, keep), `0ccbe90` (broken floor), and `e2d863a`
    107   (broken recompute).
    108 - `PLAN.md` is untracked documentation. The working tree may otherwise be clean
    109   depending on whether this file has been added.
    110 - The unit suite currently passes, but only because the test was written to the
    111   fictional model. Passing is not meaningful here.
    112 
    113 ## Options to decide between
    114 
    115 **Option A — Revert to the plain-adopt baseline (recommended as the floor).**
    116 Restore `noteIncomingReadCursor` to simply
    117 `_ = setReadCursor(gameID: gameID, readAt: readAt)` (the pre-`0ccbe90` behavior),
    118 i.e. revert both `0ccbe90` and `e2d863a` back to the `3835f0f` state. This is
    119 the LWW-consistent choice: trust the account's latest resolved
    120 `readAt`, allow backward movement. It has **no false-negative** (a close to
    121 10:02 yields horizon 10:02; later moves badge correctly — Finding 2 gone) and
    122 **no fictional data dependency** (Finding 1 gone). Its only cost is the inherent
    123 round-1 collapse, which is bounded (≤5 min) and self-healing. The catch-up
    124 baseline fix from `3835f0f` is preserved.
    125 
    126 **Option B — Make presence per-device (the real fix, larger).** Represent each
    127 device's lease separately so the account horizon can be a true max over *live*
    128 device leases. Sketch:
    129 - Give the Player record a `deviceID` (like Moves/Journal/Ping already do:
    130   `moves-<game>-<author>-<device>`), so each device owns its own row, OR add a
    131   separate lightweight per-device presence/lease record.
    132 - Compute the suppression horizon as the max **unexpired** lease across the
    133   account's own device rows; let truly-read progress be a separate monotonic
    134   value that never moves backward.
    135 - This is what the round-2/round-3 attempts *wanted* to do but couldn't, because
    136   the rows didn't exist. It is a schema + sync-path change with migration
    137   considerations and should be scoped deliberately. Note CloudKit production
    138   cannot be inspected/edited via the dashboard, so verify via on-device
    139   diagnostics.
    140 
    141 **Option C — Status quo (`0ccbe90`/`e2d863a` as-is).** Not recommended: the
    142 committed follow-up still relies on per-device Player rows that do not exist
    143 and can mark unseen moves read via the wall-clock processing-time floor. This
    144 is worse than the occasional spurious nudge Option A allows.
    145 
    146 ## Recommendation
    147 
    148 Take **Option A now** to get back to a correct, simple baseline, then decide
    149 separately whether the round-1 collapse is worth **Option B**. My read: the
    150 collapse is minor and self-healing, so Option A may be sufficient on its own and
    151 Option B is only worth it if real-world reports show spurious badges/nudges
    152 while a sibling is demonstrably active.
    153 
    154 ## Key references
    155 
    156 - `Crossmate/Sync/RecordSerializer.swift:62` — Player record name (the root fact)
    157 - `Crossmate/Sync/RecordBuilder.swift:135` — outbound `readAt` = shared scalar
    158 - `Crossmate/Sync/RecordApplier.swift` (`applyPlayerRecord`) — etag guard, the
    159   `3835f0f` hoist, `onReadCursor`
    160 - `Crossmate/Persistence/GameStore.swift` — `noteIncomingReadCursor`,
    161   `setReadCursor`, `noteIncomingMovesUpdate`, `markOtherMovesRead`
    162 - `Crossmate/Services/AppServices.swift` — `publishReadCursor`,
    163   `readLeaseDuration` (10 min), `readLeaseRefreshFloor` (5 min), the
    164   `accountSeen` push path (~2208), the incoming-cursor drain (~486)
    165 - Commits: `3835f0f` (keep), `0ccbe90` and `e2d863a` (revert)