---
title: "Chunking Strategies"
description: "How XI Lucent's built-in chunking strategies work, when to use each, and how to register multiple strategies for different content types."
published: 2026-05-14T12:11:20.45592+00:00
updated: 2026-05-14T12:11:20.45592+00:00
tags: ["chunking", "concepts", "lucent"]
url: https://xiobjects.com/docs/xio/lucent/concepts/chunking
source: XI Objects
---

# Chunking Strategies

Chunking is the process of breaking decomposed text into pieces small enough to embed meaningfully but large enough to carry useful context. Lucent routes each document to the right chunking strategy based on its detected content type. Multiple strategies can be registered simultaneously; the pipeline selects the one whose `ContentTypes` set includes the detected type.

If no registered strategy matches, the pipeline throws. Lucent has no silent fallbacks.
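
The routing rule can be pictured as a lookup over each strategy's content types. The sketch below is illustrative only: `StrategySketch` and `StrategyRouterSketch` are hypothetical stand-ins rather than Lucent types, and both the exception type and the "first match wins" ordering are assumptions. Only the `ContentTypes` name comes from the description above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a registered strategy; not Lucent's real type.
public sealed class StrategySketch
{
    public StrategySketch(string name, params string[] contentTypes)
    {
        Name = name;
        // Content types are compared case-insensitively here (an assumption).
        ContentTypes = new HashSet<string>(contentTypes, StringComparer.OrdinalIgnoreCase);
    }

    public string Name { get; }
    public ISet<string> ContentTypes { get; }
}

public static class StrategyRouterSketch
{
    // Returns the first registered strategy whose ContentTypes set contains the
    // detected type, and throws when none does (no silent fallback).
    public static StrategySketch Select(
        IEnumerable<StrategySketch> registered, string detectedContentType)
    {
        foreach (var strategy in registered)
        {
            if (strategy.ContentTypes.Contains(detectedContentType))
                return strategy;
        }

        throw new InvalidOperationException(
            $"No chunking strategy is registered for content type '{detectedContentType}'.");
    }
}
```

With a `"semantic"` strategy covering `text/plain` and a `"structural"` strategy covering `text/html`, selecting for `text/html` returns the structural one, while `text/csv` throws.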

## SemanticChunker (default)

`SemanticChunker` splits text at points where meaning changes rather than at fixed token counts. It works by embedding each sentence, computing cosine similarity between adjacent sentences, and splitting when similarity drops below the configured threshold (`SemanticSimilarityThreshold`, default `0.5`).

```mermaid
flowchart LR
    A[Sentences] --> B[Embed each sentence]
    B --> C{Cosine similarity\nbelow threshold?}
    C -- Yes --> D[Split here]
    C -- No --> E[Continue group]
    D --> F[Flush group as chunk]
    E --> B

    style A fill:#1a1a2e,stroke:#7a4a9e,color:#e1d5b9
    style B fill:#582c7e,stroke:#7a4a9e,color:#fff
    style C fill:#1a1a2e,stroke:#7a4a9e,color:#e1d5b9
    style D fill:#0a0e1a,stroke:#ff3a00,color:#e1d5b9
    style E fill:#1a1a2e,stroke:#7a4a9e,color:#e1d5b9
    style F fill:#0a0e1a,stroke:#ff3a00,color:#e1d5b9
```

Structural boundaries from the decomposer (headings, page breaks, slide transitions) act as hard splits regardless of similarity. A chunk never spans a structural boundary.

When a group exceeds `DefaultChunkMaxTokens` (default 512) without a similarity drop, the chunker falls back to a size-based split in which consecutive chunks overlap by `DefaultChunkOverlapTokens` (default 64) tokens. This prevents runaway chunks in documents where adjacent sentences are always highly similar.
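
As a rough illustration of the split rule, the sketch below groups sentences from vectors that are assumed to be already computed. `SemanticSplitSketch` and its parameters are hypothetical names, not Lucent's implementation. It closes a group on a similarity drop or just before the token cap would be crossed, and it leaves out structural boundaries and overlap for brevity.

```csharp
using System;
using System.Collections.Generic;

public static class SemanticSplitSketch
{
    // Cosine similarity of two equal-length vectors.
    public static double Cosine(float[] a, float[] b)
    {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) return 0; // treat zero vectors as dissimilar
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }

    // Returns groups of sentence indices. A new group starts when similarity to
    // the previous sentence falls below the threshold, or when adding the
    // sentence would push the current group past maxTokens.
    public static List<List<int>> Group(
        IReadOnlyList<float[]> vectors,
        IReadOnlyList<int> tokenCounts,
        double threshold = 0.5,
        int maxTokens = 512)
    {
        var groups = new List<List<int>>();
        var current = new List<int>();
        int currentTokens = 0;

        for (int i = 0; i < vectors.Count; i++)
        {
            bool topicShift = current.Count > 0
                && Cosine(vectors[i - 1], vectors[i]) < threshold;
            bool tooLarge = current.Count > 0
                && currentTokens + tokenCounts[i] > maxTokens;

            if (topicShift || tooLarge)
            {
                groups.Add(current);
                current = new List<int>();
                currentTokens = 0;
            }

            current.Add(i);
            currentTokens += tokenCounts[i];
        }

        if (current.Count > 0) groups.Add(current);
        return groups;
    }
}
```

Given four sentences whose vectors point in two clearly different directions, the sketch produces two groups. Lowering `maxTokens` below the size of a similar pair forces additional size-based splits.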

Pre-computed sentence vectors are reused at the embedding stage so each sentence is only embedded once per ingestion.

`SemanticChunker` is the right default for prose: articles, documentation, reports, emails. It produces chunks that align with topical shifts in the text.

## StructuralChunker

`StructuralChunker` splits on document structure rather than semantic content: headings, code block boundaries, indentation level changes, and brace depth for source code.

It's the better choice for structured content where layout carries meaning: source code, Markdown with deep heading hierarchies, HTML, and config files. Semantic similarity between lines of code is rarely a useful signal; structural boundaries are.
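
To make the brace-depth idea concrete, here is a deliberately simple sketch that cuts source text wherever a closing brace returns the depth to zero. `BraceDepthSketch` is a hypothetical name, not Lucent's implementation, and the sketch ignores braces inside strings and comments, which a real chunker would need to handle.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class BraceDepthSketch
{
    // Splits source text into top-level blocks by tracking brace depth line by line.
    public static List<string> SplitTopLevelBlocks(string source)
    {
        var chunks = new List<string>();
        var current = new StringBuilder();
        int depth = 0;

        foreach (var line in source.Split('\n'))
        {
            current.Append(line).Append('\n');

            foreach (char c in line)
            {
                if (c == '{') depth++;
                else if (c == '}') depth = Math.Max(0, depth - 1); // tolerate stray braces
            }

            // A closing brace that brings depth back to zero ends a top-level block.
            if (depth == 0 && line.IndexOf('}') >= 0)
            {
                chunks.Add(current.ToString().Trim());
                current.Clear();
            }
        }

        // Keep any trailing text that was not closed by a brace.
        var tail = current.ToString().Trim();
        if (tail.Length > 0) chunks.Add(tail);
        return chunks;
    }
}
```

Each class or top-level block ends up in its own chunk, which keeps a method body together with its declaration instead of cutting it at an arbitrary token count.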

## SentenceChunker

`SentenceChunker` splits on sentence boundaries and groups sentences until the token budget is hit, then starts a new chunk with overlap. It's faster than `SemanticChunker` because it doesn't embed at the sentence level, at the cost of occasionally splitting across topic transitions.

Use it when embedding cost is a concern and per-sentence embedding isn't worth the latency.
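
The grouping step might look roughly like the following. This is a sketch under stated assumptions: `SentenceGroupingSketch` is a hypothetical name, sentences are split with a naive regular expression, tokens are approximated by whitespace-separated words, and overlap is carried as whole trailing sentences. Lucent's own tokenization and overlap handling may differ.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class SentenceGroupingSketch
{
    // Crude stand-in for a tokenizer: counts whitespace-separated words.
    private static int CountTokens(string s) =>
        s.Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries).Length;

    public static List<string> Chunk(string text, int maxTokens, int overlapTokens)
    {
        // Naive sentence split: break on whitespace that follows '.', '!' or '?'.
        var sentences = Regex.Split(text.Trim(), @"(?<=[.!?])\s+")
            .Where(s => s.Length > 0)
            .ToList();

        var chunks = new List<string>();
        var current = new List<string>();
        int currentTokens = 0;

        foreach (var sentence in sentences)
        {
            int tokens = CountTokens(sentence);

            if (current.Count > 0 && currentTokens + tokens > maxTokens)
            {
                chunks.Add(string.Join(" ", current));

                // Carry trailing sentences forward as overlap, up to overlapTokens.
                var carried = new List<string>();
                int carriedTokens = 0;
                for (int i = current.Count - 1; i >= 0; i--)
                {
                    int t = CountTokens(current[i]);
                    if (carriedTokens + t > overlapTokens) break;
                    carried.Insert(0, current[i]);
                    carriedTokens += t;
                }

                current = carried;
                currentTokens = carriedTokens;
            }

            current.Add(sentence);
            currentTokens += tokens;
        }

        if (current.Count > 0) chunks.Add(string.Join(" ", current));
        return chunks;
    }
}
```

No embedding call appears anywhere in this loop, which is where the speed advantage over a similarity-based split comes from.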

## FixedSizeChunker

`FixedSizeChunker` splits purely by token count with configurable overlap. It ignores sentence boundaries and structure entirely.
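
A sliding window over the token sequence captures the idea. The sketch below assumes the text is already tokenized and uses hypothetical names (`FixedSizeSketch`, `chunkSize`, `overlap`); it is not Lucent's implementation.

```csharp
using System;
using System.Collections.Generic;

public static class FixedSizeSketch
{
    // Splits a token list into windows of chunkSize tokens, where each window
    // repeats the last `overlap` tokens of the previous one.
    public static List<string[]> Chunk(IReadOnlyList<string> tokens, int chunkSize, int overlap)
    {
        if (chunkSize <= 0)
            throw new ArgumentOutOfRangeException(nameof(chunkSize));
        if (overlap < 0 || overlap >= chunkSize)
            throw new ArgumentOutOfRangeException(nameof(overlap));

        var chunks = new List<string[]>();
        int step = chunkSize - overlap; // how far the window advances each time

        for (int start = 0; start < tokens.Count; start += step)
        {
            int length = Math.Min(chunkSize, tokens.Count - start);
            var window = new string[length];
            for (int i = 0; i < length; i++) window[i] = tokens[start + i];
            chunks.Add(window);

            if (start + length >= tokens.Count) break; // last window reached the end
        }

        return chunks;
    }
}
```

Requiring `overlap` to be smaller than `chunkSize` keeps the step positive, so the loop always advances.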

This is the strategy of last resort for content types that none of the other strategies suit. You won't typically register it; it exists for the edge cases where nothing else is appropriate.

## Configuring chunk size and overlap

These options are global defaults in `LucentOptions`. Individual strategy implementations may also accept additional configuration through their own constructors.

```csharp
services.AddLucent(opts =>
{
    opts.DefaultChunkMaxTokens = 512;
    opts.DefaultChunkOverlapTokens = 64;
    opts.SemanticSimilarityThreshold = 0.5f;
});
```

A lower `SemanticSimilarityThreshold` produces larger chunks, because the chunker splits only where adjacent sentences differ sharply. A higher value produces finer-grained chunks, because even a mild topic drift triggers a split. The right value depends on your documents and what you're querying for.

## Registering multiple strategies

`AddChunkingStrategy` is additive: calling it multiple times registers all the strategies. The pipeline routes by matching the document's detected content type against each strategy's `ContentTypes` set.

```csharp
services.AddLucent(opts =>
{
    opts.AddChunkingStrategy<SemanticChunker>();   // text/plain, text/markdown, application/pdf, etc.
    opts.AddChunkingStrategy<StructuralChunker>(); // text/x-csharp, text/html, etc.
    opts.AddChunkingStrategy<MyCsvChunker>();      // text/csv
});
```

## What's in a chunk

Every chunk carries a `ChunkMetadata` record with provenance from the decomposer and positional context from the chunker:

| Field | Source | Description |
|-------|--------|-------------|
| `Heading` | Decomposer / chunker | Nearest heading above this chunk |
| `SectionPath` | Decomposer | Heading hierarchy path (e.g. `"Introduction > Background"`) |
| `PageNumber` | PDF decomposer | 1-based page number |
| `SlideIndex` | PPTX decomposer | 1-based slide index |
| `SheetName` | CSV decomposer | Sheet or file name |
| `RowIndex` | CSV decomposer | 1-based row within sheet |
| `ElementPath` | HTML decomposer | DOM element path (e.g. `"article > section > p"`) |
| `Language` | Structural chunker | Detected programming language |
| `ChunkIndex` | Pipeline | 0-based position within document |
| `TotalChunks` | Pipeline | Total chunk count for document |
| `Custom` | Caller | Metadata from `AddDocumentRequest.Metadata` |

Chunk IDs are deterministic: `"{documentId}:{startOffset:D10}"`. Same document, same chunking, same IDs every time.
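
The ID format maps directly onto a C# interpolated string. In the sketch below, `ChunkIdSketch` is a hypothetical helper and the `int` type of `startOffset` is an assumption; only the format string comes from the description above.

```csharp
using System;

public static class ChunkIdSketch
{
    // Formats an ID in the documented shape: the document ID, a colon, then the
    // start offset zero-padded to ten digits by the "D10" format specifier.
    public static string Create(string documentId, int startOffset) =>
        $"{documentId}:{startOffset:D10}";
}
```

For example, `ChunkIdSketch.Create("doc-42", 1234)` yields `"doc-42:0000001234"`. Because the offset is zero-padded, IDs within one document sort in offset order under an ordinal string comparison.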