---
title: "Ingestion & Retrieval Pipeline"
description: "How XI Lucent moves documents from raw bytes to ranked results: pipeline stages, atomicity, re-ingestion behavior, and content hash short-circuiting."
published: 2026-05-14T12:11:23.890193+00:00
updated: 2026-05-14T12:11:23.890193+00:00
tags: ["concepts", "lucent", "pipeline"]
url: https://xiobjects.com/docs/xio/lucent/concepts/pipeline
source: XI Objects
---

<!-- xion:doctype xion+markdown -->
<!-- xion:metadata
{
  "version": "1.0",
  "content_type": "application/xion\u002Bmarkdown",
  "source_type": "xi-content/doc",
  "generator": "xio-content-publisher/1.0.0",
  "generated": "2026-05-14T12:10:19.3912774\u002B00:00",
  "encoding": "utf-8",
  "render_intent": "markdown",
  "title": "Ingestion \u0026 Retrieval Pipeline",
  "slug": "xio/lucent/concepts/pipeline",
  "copyright": "\u00A9 2026 XI Objects Inc"
}
-->

# Ingestion & Retrieval Pipeline

Lucent exposes a single facade, `IKnowledgeEngine`, over two internal pipelines: one that takes a document and stores its embedded chunks, and one that takes query text and returns ranked results. Both pipelines are fully streaming: chunks flow through bounded channels between stages rather than buffering the whole document in memory.
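A minimal usage sketch of the facade. `AddDocumentAsync` and `QueryAsync` are named on this page; the request and result shapes shown here are assumptions for illustration, not the actual Lucent API surface:

```csharp
// Hypothetical usage sketch; property names on the request/result types
// are assumed, not documented here.
await using var stream = File.OpenRead("manual.pdf");

AddDocumentResult ingest = await engine.AddDocumentAsync(new AddDocumentRequest
{
    DocumentId = "manual-v1",   // stable ID, used by re-ingestion checks below
    Content = stream,
});

QueryResult answer = await engine.QueryAsync(new QueryRequest
{
    Text = "how do I reset the device?",
    TopK = 5,
});
```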

## Ingestion Pipeline

```mermaid
flowchart LR
    A[AddDocumentAsync] --> B[Content Detection]
    B --> C[Decomposition]
    C --> D[Chunking]
    D --> E[Embedding]
    E --> F[Storage]
    F --> G[Commit]

    style A fill:#1a1a2e,stroke:#7a4a9e,color:#e1d5b9
    style B fill:#1a1a2e,stroke:#7a4a9e,color:#e1d5b9
    style C fill:#1a1a2e,stroke:#7a4a9e,color:#e1d5b9
    style D fill:#582c7e,stroke:#7a4a9e,color:#fff
    style E fill:#582c7e,stroke:#7a4a9e,color:#fff
    style F fill:#1a1a2e,stroke:#7a4a9e,color:#e1d5b9
    style G fill:#0a0e1a,stroke:#ff3a00,color:#e1d5b9
```

### Content Detection

`IContentDetector` samples the leading bytes of the incoming stream and returns a detected content type. The default `HeuristicContentDetector` uses the file extension, structural byte-level analysis, and format heuristics; no LLM is involved.

You can override detection entirely by setting `ContentTypeHint` on the request. When a hint is provided, detection is skipped.
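Sketch of a hinted ingest. `ContentTypeHint` is named on this page; the surrounding request shape is assumed:

```csharp
// When ContentTypeHint is set, the detection stage is skipped entirely.
// The request shape around the hint is an assumption for illustration.
var request = new AddDocumentRequest
{
    DocumentId = "notes-001",
    Content = stream,
    ContentTypeHint = "text/markdown",
};
```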

### Decomposition

`IDocumentDecomposer` converts binary and structured formats into text and a set of structural hints. The hints carry provenance that later attaches to every chunk from that segment: page numbers from PDFs, slide indices from PowerPoint, sheet names and row offsets from CSV, DOM element paths from HTML.

Lucent ships decomposers for txt, Markdown, HTML, PDF, Word (docx), PowerPoint (pptx), and CSV. Each decomposer registers itself for a set of content types. Multiple decomposers can be registered simultaneously; the pipeline routes to the right one based on the detected type.
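A sketch of what a custom decomposer might look like. `IDocumentDecomposer` is named on this page, but the member names below (`SupportedContentTypes`, `DecomposeAsync`, `DecomposedDocument`) are assumptions about the interface shape:

```csharp
// Hypothetical decomposer for a format Lucent does not ship support for.
// Only the interface name comes from the docs; everything else is assumed.
public sealed class LatexDecomposer : IDocumentDecomposer
{
    public IReadOnlyList<string> SupportedContentTypes { get; } =
        new[] { "application/x-latex" };

    public Task<DecomposedDocument> DecomposeAsync(Stream content, CancellationToken ct)
    {
        // Strip macros, convert section headings into structural hints
        // so later chunks carry provenance, etc.
        throw new NotImplementedException();
    }
}
```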

### Chunking

`IChunkingStrategy` receives the decomposed text as a `PipeReader` and yields `Chunk` objects via `IAsyncEnumerable`. The chunking strategy is also selected by content type, which means you can register a specialized code chunker for `text/x-csharp` without affecting how Markdown or PDF documents are handled.

The structural hints from the decomposer arrive in `ChunkingContext` and are embedded into each chunk's `ChunkMetadata`. A chunk from page 42 of a PDF carries `PageNumber = 42`. A chunk from slide 7 of a PPTX carries `SlideIndex = 7`.

See [Chunking Strategies](/docs/xio/lucent/concepts/chunking) for a description of each built-in strategy.
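A sketch of a content-type-specific strategy like the C# chunker mentioned above. The interface shape (a `PipeReader` in, `IAsyncEnumerable<Chunk>` out, hints via `ChunkingContext`) comes from this page; the exact member and property names are assumptions:

```csharp
// Hypothetical strategy registered for text/x-csharp. Member names are
// illustrative; only the input/output types are stated in the docs.
public sealed class CSharpCodeChunker : IChunkingStrategy
{
    public async IAsyncEnumerable<Chunk> ChunkAsync(
        PipeReader reader,
        ChunkingContext context,
        [EnumeratorCancellation] CancellationToken ct = default)
    {
        // Simplified: read everything, then split on blank lines.
        // A production strategy would chunk incrementally from the reader.
        using var textReader = new StreamReader(reader.AsStream());
        string text = await textReader.ReadToEndAsync(ct);

        foreach (var block in text.Split("\n\n", StringSplitOptions.RemoveEmptyEntries))
        {
            yield return new Chunk { Content = block };
        }
    }
}
```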

### Embedding

Chunks are batched and passed to `IEmbedder.EmbedBatchAsync`. The default `OnnxEmbedder` runs `nomic-embed-text-v1.5` locally via ONNX Runtime and produces 768-dimensional vectors. Embedding is the most expensive stage; batch size directly affects throughput.

The semantic chunker pre-computes vectors during chunking to measure similarity between adjacent sentences. Those pre-computed vectors are reused at the embedding stage rather than re-embedding the same text twice.
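The batching step can be sketched as follows. `EmbedBatchAsync` is named on this page; the batch size, the forwarding helper, and the exact signature are assumptions:

```csharp
// Sketch of batching chunks before IEmbedder.EmbedBatchAsync.
// Batch size and helper names are illustrative assumptions.
const int BatchSize = 32; // larger batches amortize per-call ONNX overhead

var batch = new List<Chunk>(BatchSize);
await foreach (var chunk in chunks.WithCancellation(ct))
{
    batch.Add(chunk);
    if (batch.Count == BatchSize)
    {
        await EmbedAndForwardAsync(batch, ct); // wraps EmbedBatchAsync
        batch.Clear();
    }
}
if (batch.Count > 0)
    await EmbedAndForwardAsync(batch, ct);     // flush the final partial batch
```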

### Storage

`EmbeddedChunk` objects are upserted to `IVectorStore` and, if configured, indexed in `ITextSearchStore`. The default SQLite-backed stores both write to the same database file, which keeps the two indexes transactionally consistent.

Each chunk is stored with its vector, its raw text content, all metadata fields, and the model ID that produced the vector. The model ID matters: if the embedder changes, stored chunks become stale and Lucent re-embeds them automatically.

### Commit

The document registry row is written last, with `status = 'committed'`. The registry is the pipeline's commit point. If the process crashes after storage but before commit, the chunks exist but no document registry entry does. The next ingest of the same `documentId` sweeps the orphaned chunks and re-ingests cleanly.
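The write ordering described above can be sketched as follows. The store and registry method names are assumptions; only the ordering and the `status = 'committed'` semantics come from this page:

```csharp
// Ordering sketch only; method names are illustrative.
// 1. Chunks land in the stores first...
await vectorStore.UpsertAsync(embeddedChunks, ct);

// 2. ...then the registry row is written with status = 'committed'.
// A crash between these two steps leaves orphaned chunks, which the next
// ingest of the same documentId sweeps before re-ingesting cleanly.
await registry.CommitAsync(documentId, contentHash, embedder.ModelId, ct);
```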

## Re-Ingestion Behavior

Calling `AddDocumentAsync` with a `documentId` that already exists compares two values against the existing registry entry.

**Content hash.** The incoming content is hashed with BLAKE3 via `Xio.Crypto.IXioCryptoService`. If the hash matches the stored value, content hasn't changed.

**Model identity.** The current embedder's `ModelId` is compared to the model that produced the stored vectors.

If both match, the pipeline short-circuits immediately and returns the existing chunk count with zero processing time. If the hash matches but the model changed, all chunks are re-embedded with the new model. If the content hash changed, the full pipeline runs.
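The decision table above reduces to a short branch. Variable and helper names here are illustrative; the logic mirrors the three cases just described:

```csharp
// Re-ingestion decision logic as described above; names are illustrative.
if (existing is not null)
{
    bool sameContent = existing.ContentHash == incomingHash;   // BLAKE3
    bool sameModel   = existing.ModelId == embedder.ModelId;

    if (sameContent && sameModel)
        return ExistingResult(existing);        // short-circuit: zero work
    if (sameContent)
        return await ReEmbedAsync(documentId);  // model changed: re-embed only
}
return await RunFullPipelineAsync(request);     // new doc or content changed
```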

## Retrieval Pipeline

```mermaid
flowchart LR
    A[QueryAsync] --> B[Query Embedding]
    B --> C[Retrieval Strategy]
    C --> D[Scorer]
    D --> E[QueryResult]

    style A fill:#1a1a2e,stroke:#7a4a9e,color:#e1d5b9
    style B fill:#582c7e,stroke:#7a4a9e,color:#fff
    style C fill:#582c7e,stroke:#7a4a9e,color:#fff
    style D fill:#1a1a2e,stroke:#7a4a9e,color:#e1d5b9
    style E fill:#0a0e1a,stroke:#ff3a00,color:#e1d5b9
```

### Query Embedding

The query text is embedded by the same `IEmbedder` used at ingest time. This is why swapping the embedder requires re-ingestion: query vectors and document vectors must come from the same model to be comparable.

### Retrieval Strategy

`IRetrievalStrategy` receives both the query vector and the raw query text, then assembles and ranks candidates. The default `HybridRetrievalStrategy` runs vector similarity search and FTS5 full-text search in parallel and fuses the results. See [Hybrid Search](/docs/xio/lucent/concepts/hybrid-search) for the full details.

Filters in the `QueryRequest` are pushed down to both search paths before ranking.
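A sketch of a filtered query. The page states only that `QueryRequest` carries filters that are pushed down to both paths; the filter shape shown here is an assumption:

```csharp
// Hypothetical filter shape; only the QueryRequest type and the push-down
// behavior are documented on this page.
var result = await engine.QueryAsync(new QueryRequest
{
    Text = "rate limit configuration",
    TopK = 10,
    Filters = new Dictionary<string, string>
    {
        ["DocumentId"] = "api-guide", // applied before ranking, on both paths
    },
});
```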

### Scorer

`IScorer` receives the fused candidate list and can re-rank it. The default `NoOpScorer` returns candidates unchanged. `CrossEncoderScorer` re-scores using an ONNX cross-encoder model (e.g. `bge-reranker-v2-m3`) for higher precision at the cost of additional latency.
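Swapping in the cross-encoder might look like this. Only `NoOpScorer`, `CrossEncoderScorer`, and the model name are given on this page; the registration call and constructor parameter are assumptions:

```csharp
// Hypothetical DI registration; AddLucent and the options/parameter names
// are assumed, not documented here.
services.AddLucent(options =>
{
    options.Scorer = new CrossEncoderScorer(
        modelPath: "models/bge-reranker-v2-m3.onnx"); // higher precision, more latency
});
```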

## Timing Breakdowns

Both `AddDocumentResult` and `QueryResult` carry per-stage durations: `ChunkingDuration`, `EmbeddingDuration`, `StorageDuration` on ingestion; `EmbeddingDuration`, `RetrievalDuration`, `ScoringDuration` on retrieval. These are wall-clock measurements from the pipeline's perspective.
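The duration properties are named on this page; the logging around them is illustrative:

```csharp
// Reading the per-stage timings off a retrieval result.
var result = await engine.QueryAsync(request);

logger.LogInformation(
    "embed={Embed} retrieve={Retrieve} score={Score}",
    result.EmbeddingDuration,
    result.RetrievalDuration,
    result.ScoringDuration);
```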

For deeper telemetry, Lucent emits OpenTelemetry traces and metrics via the `"Xio.Lucent"` activity source and meter. See the observability section in the [REST API reference](/docs/xio/lucent/api/rest) for configuration.
<!-- xion:trust
{
  "v": 1,
  "canon_v": 1,
  "ctx": "xiobjects.com/content",
  "hash_blake3_hex": "059c6a6cc46750c7e956d3480118a7f4e1f8cd1dc6a2804d8eee24d8425c0d76",
  "hash_sha256_hex": null,
  "sig_alg": "ed25519",
  "sig_b64": "VXMau5EVr-d5yU9f3x_U-gDYovxs7gnyXbVSM19zDvE717kGjBzQaFpnVhcQ4htRUklQj4E4Mb-nVB__rmB3Dw",
  "pubkey_b64": "h-awvV8Rn-juph_c2Y7UH5A6e7NaFia3zBiMrJUOMOo",
  "x509_chain_pem": [
    "-----BEGIN CERTIFICATE-----\r\nMIIB9DCCAaagAwIBAgIQBrrNsmRlBvKQdA4idEliJjAFBgMrZXAwLjEsMCoGA1UE\r\nAwwjWEkgT2JqZWN0cyBJbmMgQ29udHJvbCBJbnRlcm1lZGlhdGUwHhcNMjYwNTEz\r\nMjI0NjA1WhcNMjYwNjEyMjI0NjA1WjBLMR4wHAYDVQQDDBV4aW8tY29udGVudC1w\r\ndWJsaXNoZXIxFzAVBgNVBAoMDlhJIE9iamVjdHMgSW5jMRAwDgYDVQQLDAdDb250\r\nZW50MCowBQYDK2VwAyEAh\u002BawvV8Rn\u002Bjuph/c2Y7UH5A6e7NaFia3zBiMrJUOMOqj\r\ngbwwgbkwDAYDVR0TAQH/BAIwADAOBgNVHQ8BAf8EBAMCB4AwEwYDVR0lBAwwCgYI\r\nKwYBBQUHAyQwZQYDVR0jBF4wXIAUOym3mFmw/qs1fgKrujCkxhrTk7KhLqQsMCox\r\nKDAmBgNVBAMMH0luc3RpdHV0ZSBvZiBQcm92ZW5hbmNlIFJvb3QgQ0GCFFJgN/ix\r\nQn72H6h3T5lEr9f8lJQFMB0GA1UdDgQWBBS1LSJi5\u002BeqBq8h974Ht9HTgIcdgTAF\r\nBgMrZXADQQCKjXbPwnk/DZHmLQstUWRzU6GSf\u002BSHTXTTZCtRLbmJKxT17Qlbpexc\r\nsRgdSpxNWpJPe9Fr4vwhRkESMqMIpgQO\r\n-----END CERTIFICATE-----\r\n",
    "-----BEGIN CERTIFICATE-----\r\nMIIByDCCAXqgAwIBAgIUUmA3\u002BLFCfvYfqHdPmUSv1/yUlAUwBQYDK2VwMCoxKDAm\r\nBgNVBAMMH0luc3RpdHV0ZSBvZiBQcm92ZW5hbmNlIFJvb3QgQ0EwHhcNMjUxMTAy\r\nMDMxNzEyWhcNMzAxMTAxMDMxNzEyWjAuMSwwKgYDVQQDDCNYSSBPYmplY3RzIElu\r\nYyBDb250cm9sIEludGVybWVkaWF0ZTAqMAUGAytlcAMhAFSS/pggSRmTcAMko7uc\r\nATH8OHgxVymd5mBFlPXbJkgio4GtMIGqMBIGA1UdEwEB/wQIMAYBAf8CAQAwDgYD\r\nVR0PAQH/BAQDAgEGMB0GA1UdDgQWBBQ7KbeYWbD\u002BqzV\u002BAqu6MKTGGtOTsjBlBgNV\r\nHSMEXjBcgBQAZRTDswSVORu\u002BkUOKX6WvrOvmQKEupCwwKjEoMCYGA1UEAwwfSW5z\r\ndGl0dXRlIG9mIFByb3ZlbmFuY2UgUm9vdCBDQYIUJqoJlpiSFg\u002B7W5IJLMrLttgR\r\nQp4wBQYDK2VwA0EA5FOht7YOsVRPp/FOKMQ\u002B3Mo9JxrvGR3ylKWAWNm6OUV7N3DB\r\nI9cD62wU5I0d0EKDBy0CX9DnoqUyxv5yguraAA==\r\n-----END CERTIFICATE-----\r\n",
    "-----BEGIN CERTIFICATE-----\r\nMIIBaTCCARugAwIBAgIUJqoJlpiSFg\u002B7W5IJLMrLttgRQp4wBQYDK2VwMCoxKDAm\r\nBgNVBAMMH0luc3RpdHV0ZSBvZiBQcm92ZW5hbmNlIFJvb3QgQ0EwHhcNMjUxMTAy\r\nMDMwNTEyWhcNMzUxMDMxMDMwNTEyWjAqMSgwJgYDVQQDDB9JbnN0aXR1dGUgb2Yg\r\nUHJvdmVuYW5jZSBSb290IENBMCowBQYDK2VwAyEAEWNZl\u002Br3IC7\u002BgBh90Yo1kWk1\r\npZCVzVuFdFT7qBBU8W2jUzBRMB0GA1UdDgQWBBQAZRTDswSVORu\u002BkUOKX6WvrOvm\r\nQDAfBgNVHSMEGDAWgBQAZRTDswSVORu\u002BkUOKX6WvrOvmQDAPBgNVHRMBAf8EBTAD\r\nAQH/MAUGAytlcANBAO6QeydOFNrN75qNyftggYudsxMyl4w9qWkSdZ6hlhrRcbSr\r\niG9Si0kbrIJOwYB/LTBU0RM4Rl\u002Bo9PM3Qp0mPwo=\r\n-----END CERTIFICATE-----\r\n"
  ],
  "key_id": "SDyVO7FvlAM-6CvQ62VZYOBO7JADFqLquUunUABRgKg",
  "created_at": "2026-05-14T12:10:19Z"
}
-->