# Tabvana — Codebase Reference

## What It Is
Chrome extension (MV3) for intelligent tab management. Tracks all browser tabs; supports tagging, semantic similarity search via ML embeddings, and background async operations.

## Architecture at a Glance

```
Service Worker (background/index.ts)
  ├── chromeTabsListeners.ts  → dispatches to Redux + dbSync
  ├── dbSync.ts               → writes to IndexedDB (Dexie)
  └── queueProcessor.ts       → processes async operation queue

Main UI Window (main/mainWithStore.tsx)
  └── Redux store (store/)
      ├── windowsSlice.ts     → Chrome windows/tabs state
      └── uiSlice.ts          → selection, search, similarity state

Offscreen Document (offscreen/offscreen.ts)
  └── embedding.ts            → HuggingFace Xenova/all-MiniLM-L6-v2 (WASM)
```

## Key Files

| File | Purpose |
|------|---------|
| `src/db/db.ts` | Dexie schema (latest: v17), `DBPage`/`Operation`/`Progress` types, `renormalizeAll` |
| `src/db/pages.ts` | CRUD for pages — `ensurePageTracked`, `addTag`, `removeTag`, `fetchPages`, `forEachPageBatched` |
| `src/db/dbSync.ts` | Chrome tab event → DB writes; `syncInitialState`, `handleTabActivated`, `handleTabUpdated` |
| `src/db/queue.ts` | `enqueueOperation` — deduplicates via `[type+payloadHash]` index |
| `src/db/progress.ts` | `reportProgress`/`clearProgress` with 100ms throttle; React hooks `useActiveProgress`, `useBackgroundProgress` |
| `src/background/queueProcessor.ts` | Processes operation queue: renormalize, cleanup, stash, generateEmbedding, fetchMissingTitle, cleanupGraphRefs |
| `src/chrome/normalizeUrl.ts` | URL normalization (strips UTM, hash, auth; forces HTTPS; special-cases Google Docs, Amazon) |
| `src/services/embeddingProxy.ts` | Singleton managing offscreen doc lifecycle; retries on "Receiving end does not exist" |
| `src/store/windowsSlice.ts` | Tab/window Redux state; `setSimilarUrlsAsync` (posts similarity jobs to `src/workers/similarityWorker.ts` and dispatches the results); `searchSimilarAsync` |
| `src/store/uiSlice.ts` | Selection, search tokens, expanded tags, tab index |
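
The normalization listed for `src/chrome/normalizeUrl.ts` can be approximated with the standard `URL` API. The sketch below is illustrative only: `normalizeUrlSketch` is a hypothetical name, it assumes every `utm_*` query parameter counts as tracking noise, and it leaves out the Google Docs and Amazon special cases.

```typescript
// Hypothetical sketch; not the contents of src/chrome/normalizeUrl.ts.
function normalizeUrlSketch(raw: string): string {
  const url = new URL(raw); // throws TypeError on an unparseable URL

  // Force HTTPS for plain-HTTP URLs.
  if (url.protocol === "http:") {
    url.protocol = "https:";
  }

  // Strip embedded credentials and the fragment.
  url.username = "";
  url.password = "";
  url.hash = "";

  // Collect the tracking keys first so the params are not mutated mid-iteration.
  const trackingKeys: string[] = [];
  url.searchParams.forEach((_value, key) => {
    if (key.toLowerCase().startsWith("utm_")) {
      trackingKeys.push(key);
    }
  });
  for (const key of trackingKeys) {
    url.searchParams.delete(key);
  }

  return url.toString();
}

console.log(normalizeUrlSketch("http://user:pw@example.com/a?utm_source=x&b=1#frag"));
// https://example.com/a?b=1
```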

## Data Model

**`DBPage`** (key: `url` = normalized URL)
- `accessTimes: number[]` — thinned to 1000 entries (second/minute/hour/day precision by age)
- `lastActiveDesc / hotnessDesc` — negated values for Dexie index sorting
- `humanTags: string[]` — multi-entry indexed; system tags: `#stash`, `#flag`, `#archive`
- `embedding?: number[]` — 384-dim vector (all-MiniLM-L6-v2); cleared on tag change
- `embeddingErrored?: boolean` — skip re-generation if permanently failed
- `openerUrls / openedUrls / timeAdjacentUrls` — URL graph for similarity
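
As a rough TypeScript sketch of the shape described above: `DBPageSketch`, `toDescKey`, and `makePage` are hypothetical names rather than the definitions in `src/db/db.ts`, the URL-graph arrays are assumed to be `string[]`, and timestamps are assumed to be epoch milliseconds. The example also shows why the `*Desc` fields are negated: an ascending traversal of the stored key yields the newest pages first.

```typescript
// Hypothetical shape; field types are assumptions based on the list above.
interface DBPageSketch {
  url: string; // normalized URL, used as the primary key
  accessTimes: number[];
  lastActiveDesc: number; // negated last-active timestamp
  hotnessDesc: number; // negated hotness score
  humanTags: string[];
  embedding?: number[]; // 384 numbers when present
  embeddingErrored?: boolean;
  openerUrls: string[];
  openedUrls: string[];
  timeAdjacentUrls: string[];
}

// Negate a value so ascending order on the key means descending order on the value.
function toDescKey(value: number): number {
  return -value;
}

function makePage(url: string, lastActive: number, hotness: number): DBPageSketch {
  return {
    url,
    accessTimes: [lastActive],
    lastActiveDesc: toDescKey(lastActive),
    hotnessDesc: toDescKey(hotness),
    humanTags: [],
    openerUrls: [],
    openedUrls: [],
    timeAdjacentUrls: [],
  };
}

const samplePages: DBPageSketch[] = [
  makePage("https://a.example/", 1000, 0.2),
  makePage("https://b.example/", 3000, 0.9),
  makePage("https://c.example/", 2000, 0.5),
];

// Ascending sort on the negated key, mimicking an ascending index traversal.
const newestFirst = samplePages
  .slice()
  .sort((x, y) => x.lastActiveDesc - y.lastActiveDesc)
  .map((page) => page.url);

console.log(newestFirst); // b, c, a: most recently active first
```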

**`Operation`** — async task queue
- Types: `renormalize | cleanup | stash | generateEmbedding | fetchMissingTitle | cleanupGraphRefs`
- Deduplication: compound index `[type+payloadHash]`
- Status: `pending → processing → completed | failed`
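
The deduplication idea can be sketched in memory, without Dexie. Everything here is an assumption made for illustration: the class and function names are hypothetical, the hash is a key-sorted JSON string rather than whatever `queue.ts` actually computes, the default priority of 50 is invented, and only `pending` operations are treated as duplicates.

```typescript
type OpStatusSketch = "pending" | "processing" | "completed" | "failed";

interface OperationSketch {
  id: number;
  type: string;
  payloadHash: string;
  payload: unknown;
  priority: number;
  status: OpStatusSketch;
}

// Serialize with sorted object keys so equal payloads always produce the same string.
function stableStringify(value: unknown): string {
  if (value === undefined) return "null";
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return `[${value.map(stableStringify).join(",")}]`;
  const record = value as Record<string, unknown>;
  const entries = Object.keys(record)
    .sort()
    .map((key) => `${JSON.stringify(key)}:${stableStringify(record[key])}`);
  return `{${entries.join(",")}}`;
}

class InMemoryQueueSketch {
  private byKey = new Map<string, OperationSketch>();
  private nextId = 1;

  // Returns the existing pending operation when type and payload hash both match.
  enqueue(type: string, payload: unknown, priority: number = 50): OperationSketch {
    const payloadHash = stableStringify(payload);
    const key = `${type}+${payloadHash}`;
    const existing = this.byKey.get(key);
    if (existing !== undefined && existing.status === "pending") {
      return existing;
    }
    const op: OperationSketch = {
      id: this.nextId++,
      type,
      payloadHash,
      payload,
      priority,
      status: "pending",
    };
    this.byKey.set(key, op);
    return op;
  }
}
```

Enqueueing the same `type` and payload twice returns the same operation, so a burst of identical requests leaves a single pending entry.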

## Patterns to Know

- **Chunking**: Bulk Dexie operations are chunked at 500–1000 items to avoid call-stack overflow. Follow this for any new bulk DB ops.
- **Progress tracking**: Use `reportProgress(id, source, type, current, total)` / `clearProgress(id, completed)` for any long-running operation. Reports are throttled to one per 100 ms (about 10 per second).
- **Embedding invalidation**: Set `page.embedding = undefined; page.embeddingErrored = false;` whenever tags change, then call `enqueueOperation('generateEmbedding', { urls }, 10)`.
- **`shouldIgnore(url)`**: Check this before tracking any URL. Called from dbSync and normalizeUrl.
- **`fromDBPage`**: Always converts `DBPage → Page`; no side-effects. Safe to call in tight loops.
- **`isInitializing` flag** in dbSync.ts: Prevents event handlers from firing during `syncInitialState`.
- **`embeddingBackoff`** in queueProcessor.ts: Module-level state; uses exponential backoff (30 s → 30 min) when HuggingFace WASM reports no backend. Resets on success. `resetEmbeddingBackoff()` is exported for tests.
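
The backoff schedule described in the last item (start at 30 s, double per failure, cap at 30 min, reset on success) might look roughly like this. The function and field names are hypothetical and the real state object in `queueProcessor.ts` may be shaped differently; the clock is passed in as a parameter purely to keep the sketch testable.

```typescript
const INITIAL_DELAY_MS = 30_000; // 30 s
const MAX_DELAY_MS = 30 * 60_000; // 30 min

interface BackoffStateSketch {
  consecutiveFailures: number;
  retryAt: number; // epoch ms before which no attempt should be made
}

// Module-level state, mirroring the pattern described above.
let backoffState: BackoffStateSketch = { consecutiveFailures: 0, retryAt: 0 };

// Record a failure and return the delay chosen for the next retry.
function recordEmbeddingFailure(now: number): number {
  const delay = Math.min(
    INITIAL_DELAY_MS * 2 ** backoffState.consecutiveFailures,
    MAX_DELAY_MS,
  );
  backoffState = {
    consecutiveFailures: backoffState.consecutiveFailures + 1,
    retryAt: now + delay,
  };
  return delay;
}

// Any successful generation clears the backoff entirely.
function recordEmbeddingSuccess(): void {
  backoffState = { consecutiveFailures: 0, retryAt: 0 };
}

function canAttemptEmbedding(now: number): boolean {
  return now >= backoffState.retryAt;
}
```

Under these assumptions the delays run 30 s, 60 s, 2 min, 4 min, 8 min, 16 min, and then stay at 30 min until a success resets the state.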

## Known Bugs / Issues

### Fixed
- ~~**`setTags` uses `in` operator on array**~~ — Fixed: changed to `.includes()` in pages.ts.
- ~~**Dead-code `else` branch in embeddingProxy retry loop**~~ — Fixed: removed unreachable branch in embeddingProxy.ts.
- ~~**`timeAdjacentUrls`/`openerUrls`/`openedUrls` grow unboundedly**~~ — Fixed: capped at 500 entries each in pages.ts (`URL_GRAPH_MAX`).
- ~~**`searchSimilarAsync` navigated to Search tab with empty query**~~ — Fixed: early return when `commonTokens.length === 0` in windowsSlice.ts.
- ~~**`console.log` calls bypassing structured logger**~~ — Fixed: replaced all runtime occurrences with `logger.*` across queueProcessor.ts, windowsSlice.ts, db.ts, pages.ts, embedding.ts, backup.ts, prefsLoader.ts. Migration-era `console.log` calls inside `.upgrade()` callbacks intentionally left alone (run before the `logs` table is guaranteed to exist).
- ~~**`updateFromBrowser` called `ensurePageTracked` inside a wrapping transaction**~~ — Fixed: the outer `db.transaction` is removed. The bulk `modify` (mark all active→closed) now runs standalone, followed by chunked `ensurePageTracked` calls outside any outer transaction. Atomicity across the two phases is intentionally dropped — a crashed sync self-corrects on the next service worker restart.
- ~~**`selectWindowsCount`/`selectTabsCount` subtracted 1 unconditionally**~~ — Fixed: both selectors now accept `tabvanaWindowId` as a parameter and filter explicitly. `selectTabsCount` counts from `windows[id].tabIds` (not the tabs map) to avoid a race where the Tabvana tab appears transiently in the tabs map before the next sync. `TabsCount` in Main.tsx supplies `tabvanaWindowId` via `useLiveQuery(getTabvanaWindowId)`.
- ~~**Search performed a full table scan on every keystroke**~~ — Fixed: `src/services/pageSearchIndex.ts` maintains a MiniSearch in-memory index. Built once at startup via `initializeSearchIndex()` (called from Main.tsx), then kept current via Dexie table hooks (`creating`/`updating`/`deleting`). `fetchPagesForQuery` queries the index synchronously then `bulkGet`s the results from DB (preserving `useLiveQuery` reactivity). Falls back to full scan while initializing. **Behavior change:** now uses prefix matching (`"git"` → `"github"`) rather than substring matching (`"hub"` → `"github"` no longer works).
- ~~**`updateAdjacentAccessLinks` fired on every tab update**~~ — Fixed: now called in two places in `dbSync.ts`, each covering a distinct navigation pattern: `handleTabActivated` (tab focus switch) and `handleTabUpdated` guarded by `changeInfo.url` (intra-tab navigation via link click). `changeInfo.url` is only present on the one event where the URL actually changes, not on intermediate events for status, title, or favicon changes.
- ~~**`fromDBPage` silently enqueues `fetchMissingTitle` as a side effect**~~ — Fixed: removed the side effect from `fromDBPage`. Missing titles are now handled by: `newDBPage` (enqueues per-URL on creation), a bulk `fetchMissingTitle: {}` operation added to the Renormalize button in Options.tsx, and a low-priority startup sweep enqueued in `background/index.ts` after `syncInitialState`.

- ~~**Similarity search blocks the UI thread**~~ — Fixed: moved to `src/workers/similarityWorker.ts` (a Vite module Web Worker). `setSimilarUrlsAsync` now just posts `{ type: 'abort' }` then `{ type: 'compute', selectedUrls, taskId, source }` and returns immediately. The worker owns the full loop, calls `reportProgress`/`clearProgress` via IndexedDB directly, and posts `{ type: 'result', similarUrls, similarTags }` back. The Redux dispatches happen via the worker's `onmessage` handler in `windowsSlice.ts`. Abort uses a monotonically-incrementing `runId` rather than a shared flag, so a stale computation detects supersession atomically. **Important**: Chrome extension Web Workers do NOT have `chrome.*` APIs injected — only the extension's service worker and extension pages do. Any module imported by the worker must not have a module-level `chrome` API call or `webextension-polyfill` import (the polyfill throws at init if `chrome.runtime.id` is absent). `queue.ts` was fixed to remove the static `webextension-polyfill` import and use `chrome.runtime.sendMessage` directly behind a `typeof chrome !== 'undefined'` guard.
- ~~**`ensurePageTracked` never regenerated embedding when title changed**~~ — Fixed: now clears `page.embedding = undefined; page.embeddingErrored = false` inside the transaction when title changes, so the bulk sweep can detect and re-generate it. Previously the embedding was enqueued per URL but `generateEmbeddingForPage` skipped pages that already had an embedding set.
- ~~**`updateFromBrowser` caused DB contention and woke queue processor N times**~~ — Fixed: `ensurePageTracked` now accepts `options.skipEmbeddingEnqueue`; `updateFromBrowser` passes this flag and does one `enqueueOperation('generateEmbedding', {})` after all tabs are synced instead of one per tab. This eliminates N concurrent DB transactions on `db.operations` and N `chrome.runtime.sendMessage` calls that were competing with the sync.
- ~~**`deletePage` leaves dangling graph references**~~ — Fixed: `deletePage` now enqueues a low-priority (80) `cleanupGraphRefs { url }` operation that strips the deleted URL from `openerUrls`/`openedUrls`/`timeAdjacentUrls` of all other pages via a Dexie cursor (async, non-blocking). The Renormalize button also runs `cleanupGraphRefs {}` (bulk sweep): validates all graph-array entries against live page URLs in 500-record batches, then removes orphaned tags from the `tags` table.

- ~~**`backendPermanentlyFailed` static flag prevents recovery**~~ — Fixed: replaced with `embeddingBackoff` state object in `queueProcessor.ts`. Backoff starts at 30 s and doubles up to 30 min per failure. Resets on any successful generation. `resetEmbeddingBackoff()` exported for test isolation.
- ~~**`processQueue` did full JS-side sort after fetching all pending ops**~~ — Fixed: DB schema v17 adds `[status+priority]` compound index; `processQueue` now uses `where('[status+priority]').between(...)` to get records pre-sorted by priority from IndexedDB.
- ~~**`any` types in queueProcessor.ts / pageSearchIndex.ts**~~ — Fixed: `handleRenormalize` payload typed as `{ limit?: number; batchSize?: number }`, `generateEmbeddingForPage` parameter typed as `DBPage`, `const pages` typed as `Page[]`, catch blocks use `unknown` with `instanceof Error` narrowing, `performance.memory` uses `PerformanceWithMemory` type alias. In `pageSearchIndex.ts` the Dexie `updating` hook `this` and `_mods` keep `any` because Dexie's own type declaration requires `any` there; both are annotated with eslint-disable comments.

- ~~**Embedding generation slow + DB contention from large queues**~~ — Fixed: Three changes:
  1. **Batch ONNX inference** — `EmbeddingService.generateEmbeddings(texts[])` (new) sends an array to the WASM model in one call. `offscreen.ts` handles `generateEmbeddings` message. `embeddingProxy.ts` exposes `generateEmbeddings`. `processEmbeddingBatch()` (new helper in queueProcessor.ts) sends `EMBEDDING_BATCH_SIZE=16` texts per ONNX call — vectorised matrix ops give meaningful throughput gains over 16 sequential round-trips.
  2. **Batch DB writes** — results of each chunk are written in one `db.transaction` instead of N individual `db.pages.update()` calls.
  3. **Operation coalescing** — at the start of `handleGenerateEmbedding`, all other pending `generateEmbedding` ops are absorbed into the current batch and marked completed after processing. Eliminates the per-op `status→processing→completed` DB cycle that stacks up when many tag-change events each enqueue their own op.
  Also fixed: `embeddingProxy.init()` catch block now always re-throws (not just for backend errors), so the op is correctly marked `failed` for any init failure. Removed the `CONCURRENCY_LIMIT=4 / Promise.race` complexity since WASM is single-threaded — concurrency only caused message overhead without parallelism.

- ~~**Extension crashes with no informative log messages**~~ — Four root causes found and fixed:
  1. **`persistLog` deferred via `setTimeout`** (`logger.ts`): log writes were queued as macrotasks; when Chrome killed the service worker, queued callbacks were discarded, causing pre-crash entries to never be persisted. Fixed: removed `setTimeout` wrapper — `Dexie.ignoreTransaction` alone is sufficient to escape active transactions.
  2. **`EmbeddingProxy.creating` stuck on rejection** (`embeddingProxy.ts`): if `chrome.offscreen.createDocument()` rejected, `this.creating` remained as a permanently-rejected promise, permanently blocking future embedding attempts in that SW session. Fixed: wrapped `await this.creating` in try-finally to always clear `this.creating`.
  3. **`backupAndThin` silently swallowed exceptions** (`backup.ts`): the outer catch block set status but never logged the error. Fixed: added `logger.error(...)` call before `setBackupStatus`.
  4. **`handleFetchMissingTitle` bulk sweep perpetually re-opened background tabs** (`queueProcessor.ts`): pages that timed out or had CORS failures kept accumulating in the "no title" filter and triggered up to 50 background tab opens on every restart. Added `titleFetchFailed?: boolean` field to `DBPage`; set it on fetch failure/timeout; cleared it in `ensurePageTracked` when a title is successfully written; bulk sweep now excludes `titleFetchFailed` pages.
  - Also added defensive `try-catch` inside the `activate` event's `waitUntil` callback (`background/index.ts`) to prevent `navigationPreload.disable()` failure from crashing SW activation.
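
The `EmbeddingProxy.creating` fix in item 2 above follows a general pattern: cache the in-flight promise so concurrent callers share it, but clear it in `finally` so a rejection cannot block later attempts. A minimal sketch, assuming a hypothetical `OffscreenLifecycleSketch` class that takes the document-creation function as a parameter instead of calling `chrome.offscreen.createDocument()` directly:

```typescript
class OffscreenLifecycleSketch {
  private creating: Promise<void> | null = null;
  private ready = false;
  private createDocument: () => Promise<void>;

  constructor(createDocument: () => Promise<void>) {
    this.createDocument = createDocument;
  }

  async ensureDocument(): Promise<void> {
    if (this.ready) return;

    // Share a single in-flight creation between concurrent callers.
    if (this.creating === null) {
      this.creating = this.createDocument();
    }

    try {
      await this.creating;
      this.ready = true;
    } finally {
      // Always clear the cached promise, even on rejection,
      // so the next call starts a fresh attempt.
      this.creating = null;
    }
  }
}
```

Without the `finally`, a rejected `creating` promise would be awaited again by every later call and fail the same way each time.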

### Open — Needs a Decision

**`bulkPut` on tags in `ensurePageTracked`** (`pages.ts:191`)
A previous developer noted that `bulkPut` caused excessive `useLiveQuery` re-renders and reverted to individual `put()` calls inside the transaction. The current code uses `bulkPut` again. Dexie should coalesce subscriber notifications to once per transaction commit regardless of individual vs. bulk puts, but this has not been verified in the browser. If re-render storms reappear on tab activation, revert to individual puts.


## Build & Dev

```bash
yarn dev          # Vite dev server + HMR (port 5173)
yarn build        # tsc type-check + vite production build → dist/
yarn build:dev    # dev build with sourcemaps
yarn test         # jest (ts-jest, jsdom)
yarn lint         # eslint + prettier + stylelint
```

WASM files for HuggingFace are copied to `dist/wasm/` via `vite-plugin-static-copy`. CSP includes `'wasm-unsafe-eval'`.

## Extension Entry Points

| Context | Entry |
|---------|-------|
| Service Worker | `src/background/index.ts` |
| Main UI | `src/main/mainWithStore.tsx` → `main.html` |
| Offscreen | `src/offscreen/offscreen.ts` → `offscreen.html` |
| Options | `src/options/index.tsx` → `options.html` |
| Welcome | `src/welcome/index.tsx` → `welcome.html` |

Global hotkey: `Ctrl+Shift+A` (Mac: `Cmd+Shift+A`) opens search.

## Testing Infrastructure

**Test runner:** Jest 29 + ts-jest (`npx jest`)

**Key test dependencies:**
- `fake-indexeddb/auto` — auto-patches `global.indexedDB` for Dexie in jsdom
- `jest-chrome` — mocks Chrome extension APIs; set up in `src/setupTests.ts`
- `__mocks__/normalize-url.js` — CJS wrapper for the ESM `normalize-url` package (Jest can't load ESM node_modules without it)
- `__mocks__/minisearch.js` — CJS wrapper that re-exports `minisearch/dist/cjs/index.cjs`

**Test files added:**

| File | Covers |
|------|--------|
| `src/db/__tests__/queue.test.ts` | `enqueueOperation` deduplication, clearQueue |
| `src/db/__tests__/db.test.ts` | Schema (v11–v17 features), `renormalizeOne`, `renormalizeAll` |
| `src/db/__tests__/dbSync.test.ts` | `setTitleOverride`, TTL eviction, `isInitializing` guard, tab event handlers |
| `src/services/__tests__/pageSearchIndex.test.ts` | Init, search, Dexie hooks (creating/updating/deleting), prefix matching |
| `src/background/__tests__/queueProcessor.test.ts` | Queue processing, stuck-op reset, failure paths, embedding backoff/retry |

**Known pre-existing test failures (not caused by the test files listed above):**
- `src/content/Content.spec.tsx`, `src/options/Options.spec.tsx` — fail because `store/windowsSlice.ts` uses `import.meta.url` (Web Worker) which ts-jest can't compile in CommonJS mode.
- `src/welcome/Welcome.spec.tsx` — fails because `Welcome.tsx:6` renders `<h1>Welcome hi</h1>` but the spec asserts the exact text `'Welcome'`. The "hi" is a debug artifact that needs removal. (Backlog: "fix failing tests")

**Gotchas for future test authors:**
- `normalize-url` is a pure-ESM package, and `minisearch` has to be pointed at its CJS build (`minisearch/dist/cjs/index.cjs`); both need `__mocks__/*.js` wrappers.
- Do NOT use `jest.resetModules()` inside `async beforeEach` functions — it breaks fake-indexeddb's internal state. Use the singleton `db` and clear tables in `beforeEach` instead.
- `windowsSlice.ts` uses `import.meta.url` and cannot be imported in Jest; any module that depends on it (e.g. `progress.ts`, `pages.ts`) must be mocked in tests that would transitively import it.
- `urlNormalizationPrefs` starts with an empty `ignoredSubstrings` list; call `updateUrlNormalizationCache(defaultIgnoredSubstrings, defaultNormalizeRegexes)` in `beforeAll` when testing code that calls `shouldIgnore()`.
