
Vectorization Strategy

Original Text Chunking Vectorization

This is the most basic and efficient strategy in RAG. It cuts the text with a sliding window: set a fixed chunk size (e.g., 512 characters) and an overlap ratio (e.g., 10%), so that adjacent chunks share some repeated content. A long document is divided into segments whose beginnings and ends overlap with the neighbouring chunks, like puzzle pieces that together cover the entire text. Each chunk therefore carries enough context while staying specific enough, which makes subsequent retrieval and processing easier.

Applicable Scenarios

  1. Structured Document Processing:
    • Clearly structured texts such as technical manuals and legal documents retain complete context after chunking.
  2. Rapid Deployment Scenarios:
    • No complex preprocessing is required, making it suitable for projects with tight timelines (e.g., news summaries).
  3. General Knowledge Base:
    • When a document’s topics are scattered or lack a clear focus (e.g., encyclopedia entries), uniform chunking avoids missing information.

Why Effective?

  • Avoids Information Fragmentation: The overlapping part prevents critical information from being split across chunks (e.g., proper nouns spanning chunks).
  • Balances Efficiency and Accuracy: Small chunks enable fast retrieval, while the overlap preserves continuity of context.
  • Low Barrier to Adoption: Only the chunk size and overlap ratio need tuning to fit most scenarios; no domain knowledge is required.

💡 Tip: an overlap of 10%-20% of the chunk size is recommended.
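
As a concrete illustration of the mechanism above, here is a minimal sketch of sliding-window chunking in Python. The character-based chunk size and the 10% overlap are illustrative defaults, not fixed values.

```python
def chunk_text(text: str, chunk_size: int = 512, overlap_ratio: float = 0.1) -> list[str]:
    """Split text into fixed-size chunks whose edges overlap by overlap_ratio."""
    overlap = int(chunk_size * overlap_ratio)  # e.g. 512 * 0.1 ≈ 51 shared characters
    step = chunk_size - overlap                # how far the window slides each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):    # the last window already reached the end
            break
    return chunks

# A long document becomes a list of overlapping windows:
document = "A long technical document. " * 200
pieces = chunk_text(document, chunk_size=512, overlap_ratio=0.1)
```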

Parent-Child Chunking Vectorization

Several industry studies suggest that RAG chunks of roughly 256 to 512 tokens tend to achieve the best hit rates. We therefore generally keep text chunks small, which makes the relevant content easier to find while preserving retrieval quality. The problem is that the single best-matching sentence may not be enough on its own; sometimes its surrounding context is what makes the answer correct. How can we balance retrieval hits against the need for more context? This is where the Parent-Child Chunking Vectorization strategy comes into play.

Core Principle

The document is hierarchically split:

  1. Parent Chunk (1024-2048 tokens): Retains complete logical paragraphs (e.g., a chapter in a technical document or a functional module in a product description).
  2. Child Chunk (256-512 tokens): Extracts key sentences or data fragments from the parent chunk (e.g., operation steps, parameter definitions).

Only child chunks are vectorized, but after retrieval, the corresponding parent chunk full text is returned as context.
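
A minimal sketch of this hierarchy is shown below. Splitting is character-based for brevity (a real pipeline would count tokens), and the `score` callback stands in for whatever embedding-similarity search your vector store provides; both are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class ChildChunk:
    text: str        # small fragment that actually gets embedded
    parent_id: int   # index of the full parent chunk it came from

def build_parent_child_index(document: str, parent_size: int = 2048, child_size: int = 512):
    """Split a document into parent chunks, then split each parent into child chunks."""
    parents = [document[i:i + parent_size] for i in range(0, len(document), parent_size)]
    children: List[ChildChunk] = []
    for pid, parent in enumerate(parents):
        for j in range(0, len(parent), child_size):
            children.append(ChildChunk(text=parent[j:j + child_size], parent_id=pid))
    return parents, children

def retrieve_parent(query: str, parents: List[str], children: List[ChildChunk],
                    score: Callable[[str, str], float]) -> str:
    """Match the query against the *child* chunks, but return the *parent* full text."""
    best_child = max(children, key=lambda c: score(query, c.text))
    return parents[best_child.parent_id]
```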

Why Effective?

  • High Precision of Child Chunks: Smaller fragments are more likely to match the details in the user’s query (e.g., specific parameters, abbreviations).
  • Parent Chunk Context: Prevents losing key background information due to chunking (e.g., "This configuration only applies to Linux systems" may be in the first paragraph of the parent chunk).
  • Efficiency Optimization: Only the small child chunks are stored and searched; the parent chunk is looked up just once, at query time.

Applicable Scenarios

  1. Technical Document Q&A
    • User asks: "What is the default value for the max_connections parameter?"
    • Child Chunk precisely hits parameter definition → Parent Chunk returns the complete section containing configuration examples and usage restrictions.
  2. Legal Clause Interpretation
    • User asks: "What is the compensation calculation for breach of contract?"
    • Child Chunk matches clause number → Parent Chunk returns related sections such as joint responsibility and exceptions.
  3. Long Text Knowledge Base
    • A data table in a research report (child chunk) is hit → Returns the complete chapter containing data interpretation and experimental methods (parent chunk).

Example

Document Parent Chunk:

Database Configuration Guide (Excerpt)
In MySQL 8.0, max_connections controls the maximum number of concurrent connections, and the default value is 151. Note: To modify this value, you must also adjust thread_cache_size; otherwise, it may lead to memory overflow (see section 4.2). It is recommended to set it to 800-1000 for production environments.

Child Chunking:

  1. "max_connections default value is 151"
  2. "Modification requires simultaneous adjustment of thread_cache_size"
  3. "Production environment recommended range 800-1000"

User Query:

"After changing max_connections to 1000, the server crashes. What might be the cause?"

Process:

  1. Child chunk 1, which contains "max_connections", is a strong match for the query.
  2. The system returns the parent chunk's full text, which states that thread_cache_size must be adjusted as well and points to the relevant section (4.2).

Chunk Summary Vectorization

Sometimes, the content quality of the document is low, and the valuable information in the document chunks is limited, filled with "fluff". Another situation is that key information in the document is abbreviated or referenced with pronouns, such as "MJ is considered the most famous rock singer in history" and "MJ is the best player in basketball history". These two "MJ"s represent different people: Michael Jackson and Michael Jordan. If you directly vectorize this kind of document and store it in the vector space, the subsequent retrieval will not be effective. This is when you can use the Chunk Summary Vectorization strategy.

Core Principle

Before vectorizing, use a large language model (LLM) to "purify" the original chunk:

  1. Remove Redundancy: Eliminate repeated descriptions, advertising text, and other non-core content.
  2. Disambiguate: Replace pronouns (e.g., "it", "the device") with specific objects and expand abbreviations (e.g., "LLM → large language model").
  3. Structure: Extract core facts, causal relationships, and data conclusions, forming concise summary sentences.

Only the summarized text is vectorized and stored, not the original chunk.
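
A minimal sketch of this flow follows, assuming `summarize` and `embed` are callables wrapping your own LLM and embedding model; the prompt wording and record layout are illustrative, not a fixed API.

```python
PURIFY_PROMPT = (
    "Rewrite the following text as a concise factual summary. "
    "Expand abbreviations, replace pronouns with the entities they refer to, "
    "and drop filler or promotional content:\n\n{chunk}"
)

def index_chunk(chunk: str, summarize, embed, store: list) -> None:
    """Vectorize the LLM-purified summary; keep the original chunk only as payload."""
    summary = summarize(PURIFY_PROMPT.format(chunk=chunk))  # LLM removes noise and resolves ambiguity
    store.append({
        "vector": embed(summary),   # retrieval matches against the summary, not the raw text
        "summary": summary,         # disambiguated, high-density version
        "original": chunk,          # can still be handed to the generator as context
    })
```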

Why Effective?

  • Improved Information Density: Avoid ineffective text diluting the semantic weight of key content.
  • Clarified Semantics: Solves issues of "multiple meanings" (e.g., "MJ") and "vague references" (e.g., "the above method"), which cause retrieval bias.
  • User Language Adaptation: The summarized content is closer to how users naturally express their queries.

Applicable Scenarios

  1. Conversational/Informal Documents
    • Meeting minutes, user reviews, social media content.
  2. Professional Domain Documents
    • Abbreviations in technical manuals (e.g., "K8s → Kubernetes"), variable abbreviations in papers (e.g., "CNN → Convolutional Neural Network").
  3. Cross-domain Polysemy
    • "Apple" could refer to a company, a fruit, or a movie; "Java" could refer to a programming language or a coffee-producing region.

Example

Original Chunk:

"MJ's album released in 1984 set a historic record, selling over 70 million copies. His stage performance style is still widely imitated today, but some movements have been banned in the industry due to health risks."

Issues:

  • "MJ" is ambiguous
  • "It" is unclear
  • "The industry" is not specified

LLM Summary:

"Michael Jackson's 1984 album sold 70 million copies, setting a record in music history. His iconic stage moves (such as the 45-degree lean) have been restricted in the music performance industry due to the risk of spinal injury."

Retrieval Effect Comparison:

| User Query | Original Chunk Hit Rate | Summarized Chunk Hit Rate |
| --- | --- | --- |
| "Which dance moves of Michael Jackson were banned?" | Low (the original text never mentions "Michael Jackson") | High (the summary explicitly associates the name with the moves) |
| "Who has the highest-selling album in music history?" | Medium (relies on a vague "historic record" match) | High (directly contains the "music history record" keywords) |

Hypothetical Question Vectorization

The core of this strategy is "question-driven". It uses a large language model to actively analyze the document content and generate potential questions that the chunk might answer (e.g., "What are Michael Jackson's achievements in music?"). These questions (not the original text) are then transformed into vectors and stored. When a user asks a question, the system will prioritize matching the closest "hypothetical question", thus accurately locating the corresponding document segment. This shifts the document from passively waiting for matches to actively predicting user needs, like installing a "question anticipation radar" in the knowledge base. This strategy is particularly useful for solving the classic dilemma: "I know the answer is in the document, but I don’t know how to ask to find it."
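
A minimal sketch of the question-driven index, assuming `generate` and `embed` wrap your LLM and embedding model and `score` is your vector-similarity function; the prompt wording, the choice of three questions, and the record layout are all illustrative assumptions.

```python
QUESTION_PROMPT = (
    "Read the passage below and write 3 questions a user might ask "
    "that this passage answers. Return one question per line:\n\n{chunk}"
)

def index_hypothetical_questions(chunk: str, generate, embed, index: list) -> None:
    """Embed LLM-generated questions instead of the chunk; each question points back to the chunk."""
    for line in generate(QUESTION_PROMPT.format(chunk=chunk)).splitlines():
        question = line.strip()
        if question:
            index.append({"vector": embed(question), "question": question, "chunk": chunk})

def locate_chunk(user_query: str, embed, index: list, score) -> str:
    """Match the user's query against stored question vectors, then return the underlying chunk."""
    query_vector = embed(user_query)
    best = max(index, key=lambda item: score(query_vector, item["vector"]))
    return best["chunk"]
```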

Why Effective?

  • Semantic Translator: Bridges the gap between how users phrase questions and how documents phrase answers.
  • Makes Implicit Information Explicit: Generated questions surface context that the chunk only implies rather than states outright.
  • One Answer, Many Questions: A single document chunk can generate multiple hypothetical questions, forming a question matrix.

Applicable Scenarios

  • When documents contain many technical terms or implicit information (e.g., abbreviations like "LLM" for "large language model"), users may ask in more casual language.
  • Handles mismatches between how users phrase questions and how the document is written (e.g., a user asks "How to alleviate insomnia" while the document says "Methods to improve sleep quality").

Example

If a document fragment mentions "MJ broke multiple sales records with the 1984 album 'Thriller'", the model might generate the hypothetical question: "What record-breaking music works did Michael Jackson have?", thus covering various potential ways users may phrase their question.
