RAG Technology:

The 16 Core Challenges & Limitations

May 2026 Technology Briefing, Lodestone Labs

by M. Naci Akkøk, PhD, CSO/CTO at Lodestone Labs

The Three Best Known Shortcomings

The three commonly known problems of current RAG technologies are:

Not finding the right part of the document (the chunk) to send to the LLM, which in practice adds to the LLM's hallucination
Not seeing the full context of the document because it has been split into pieces (called chunks)
Being very difficult to set up, especially if one mixes existing technologies in an attempt to avoid some of the problems

In this article, we will look at not only these three challenges but all known challenges (all 16 of them!) in five categories.

CATEGORY 1 · CHUNKING & INGESTION

Challenge 1 Semantic Chunking Failures [Critical]

Fixed character or token splits break semantic units mid-sentence, mid-argument, and mid-table — destroying meaning before retrieval even begins.

The most foundational problem in RAG is how documents are divided before indexing. Standard implementations chunk by character count, word count, or token count, often with a small overlap window. This is operationally convenient but semantically destructive. A technical specification may introduce a definition in paragraph three and reference it across the next twelve paragraphs. A fixed-size chunker places the definition in one chunk and its references in five others, severing the definitional relationship entirely. Each retrieved reference chunk then contains a term whose definition is absent, forcing the model to hallucinate the definition or produce underdetermined responses. The same failure occurs with tables, numbered procedures, and argument structures that span multiple paragraphs.

Fixed-size chunking treats documents as streams of tokens rather than structured knowledge objects. Semantic coherence is an afterthought, not a design principle.
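The severing effect can be reproduced with a few lines of code. The following is a minimal sketch, assuming whitespace-separated words stand in for tokens; the function name fixed_size_chunks and the toy document are illustrative and not taken from any particular RAG framework.

```python
def fixed_size_chunks(tokens, size):
    """Split a token list into consecutive chunks of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

# Toy document (an assumption for illustration): a definition, some filler,
# and a later sentence that depends on the definition.
doc = (
    "A widget is a sealed unit rated for 40 bar . "
    + "filler " * 20
    + "Each widget must be inspected yearly ."
).split()

chunks = fixed_size_chunks(doc, size=12)

# Find which chunk holds the definition and which holds the later reference.
definition_chunk = next(i for i, c in enumerate(chunks) if "sealed" in c)
reference_chunk = next(i for i, c in enumerate(chunks) if "inspected" in c)

print("definition in chunk", definition_chunk)
print("reference in chunk", reference_chunk)
print("reference chunk:", " ".join(chunks[reference_chunk]))
```

The chunk that mentions the inspection requirement contains the word "widget" but none of the defining text, which is the situation described above.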

Challenge 2 Arbitrary Boundary Decisions [Critical]

With no understanding of document structure, a definition and its ten downstream references routinely land in different chunks.

Even when overlap windows are used, the chunking algorithm has no awareness of what it is cutting. A paragraph break, section boundary, table row, or list item boundary carries no special weight compared to an arbitrary mid-sentence split. The result is that logically cohesive units — a clause and its antecedent, a finding and its methodology, a requirement and its exception — are regularly separated. This is compounded in highly formatted documents: legal contracts, financial reports, and technical specifications where the logical structure is the primary information-carrying mechanism. Stripping structure before indexing does not preserve information in a degraded form — it systematically destroys entire categories of information: hierarchy, dependency, cross-reference, and scope.

Challenge 3 Overlap Heuristics [Medium]

Sliding window overlap is a band-aid, not a solution — it duplicates content without preserving logical continuity.

The standard response to boundary problems is to overlap adjacent chunks by a fixed number of tokens — typically 10–20% of chunk size. This reduces the probability that a concept straddles an exact boundary, but it does not address structural relationships that span longer distances. An overlap of 100 tokens does not help when the definition is 800 tokens earlier than its reference. It also creates redundant retrieval: the same sentence may appear in two adjacent chunks, both of which are retrieved, consuming context window capacity without adding information.
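The arithmetic behind this can be checked directly. The sketch below is an illustration under stated assumptions: token positions are represented by integers, the chunk size (500), overlap (100), and the positions of the definition and its reference are all hypothetical values chosen to match the 800-token distance mentioned above.

```python
def sliding_chunks(tokens, size, overlap):
    """Split tokens into chunks of `size`, each sharing `overlap` tokens with the previous one."""
    step = size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk size")
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]

# Represent a 2000-token document by its token positions.
tokens = list(range(2000))
chunks = sliding_chunks(tokens, size=500, overlap=100)

def chunks_containing(position):
    """Return the indices of all chunks that include the given token position."""
    return [i for i, c in enumerate(chunks) if position in c]

definition_at = 50    # hypothetical position of the definition
reference_at = 850    # its reference, 800 tokens later

shared = set(chunks_containing(definition_at)) & set(chunks_containing(reference_at))
duplicated_tokens = sum(len(c) for c in chunks) - len(tokens)

print("definition chunks:", chunks_containing(definition_at))
print("reference chunks:", chunks_containing(reference_at))
print("chunks holding both:", shared)
print("tokens stored more than once:", duplicated_tokens)
```

No chunk holds both positions, the reference itself is stored twice, and the index carries 400 duplicated tokens for a 2000-token document.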

CATEGORY 2 · RETRIEVAL QUALITY

Challenge 4 The Semantic Gap [High]

Vector similarity retrieves what sounds like the query — not what answers it. These are reliably different things in professional settings.

RAG relies on the assumption that semantic similarity — as measured by cosine distance in embedding space — reliably proxies for relevance. This assumption holds in consumer settings where queries and documents share vocabulary, topic, and framing. It breaks down systematically in professional contexts. The most dangerous manifestation is retrieval of plausible-but-wrong content: passages that score highly in embedding similarity but are drawn from a different document, temporal context, or analytical frame. The model receives a confident, relevant-looking passage that is factually inapplicable to the question — and it will reason from it as though it were correct.

Vector similarity is a necessary but not sufficient condition for relevance. A system retrieving the closest embedding to the query is retrieving what sounds most like the query — not the answer to it.
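A minimal sketch of similarity-only retrieval makes the point concrete. The three-dimensional vectors and document ids below are hand-made assumptions for illustration, not the output of any real embedding model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, index, k):
    """Return the k index entries whose vectors are closest to the query."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item["vec"]), reverse=True)
    return ranked[:k]

# Hypothetical index: two versions of a policy and an unrelated manual.
index = [
    {"id": "policy-2019", "vec": [0.9, 0.1, 0.0]},
    {"id": "policy-2024", "vec": [0.7, 0.3, 0.1]},
    {"id": "camera-manual", "vec": [0.0, 0.2, 0.9]},
]
query_vec = [1.0, 0.0, 0.0]

best = top_k(query_vec, index, k=1)[0]
print(best["id"])
```

The ranking function has no notion of which passage is applicable; it returns whichever vector is closest, here the older policy.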

Challenge 5 Polysemy and Domain Mismatch [High]

Domain-specific terminology and ambiguous words reliably defeat embedding-based retrieval across professional corpora.

Domain-specific jargon is encoded in embedding space in ways that fail to align with general-language queries. A clinician asking about 'adverse cardiac events' may not retrieve a passage discussing 'myocardial infarction risk' even though both concern the same clinical reality — the surface vocabulary differs enough to widen their vector distance beyond the retrieval threshold. Polysemy — words with multiple meanings — causes retrieval failures in both directions. A query about 'model performance' in an AI context may surface passages about athletic or theatrical performance if the corpus contains such content. A query about 'resolution' in a legal document may retrieve camera specifications from a technical manual indexed in the same store. These failures are nearly invisible without document-level logging.

Challenge 6 Top-k Blindness [Medium]

Retrieving a fixed number of chunks regardless of query complexity means simple questions get noise and complex ones get truncation.

Most RAG systems retrieve a fixed k chunks per query (typically 3 to 10) regardless of whether the question needs one sentence or twenty paragraphs to answer well. For a simple factual query, retrieving 8 chunks floods the context window with irrelevant material that competes with the correct answer. For a complex synthesis query, 8 chunks may cover a fraction of the relevant material, producing an incomplete and misleading answer. Neither failure is visible to the end user; both produce fluent, apparently complete responses.
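Both failure modes can be quantified on toy data. In the sketch below, each chunk is a (score, is_relevant) pair; the scores, the relevance labels, and the value k = 8 are assumptions made for illustration, and retrieval is assumed to rank relevant chunks first, which is the best case for a fixed k.

```python
def fixed_top_k(scored_chunks, k):
    """Return the k highest-scoring (score, is_relevant) pairs."""
    ranked = sorted(scored_chunks, key=lambda c: c[0], reverse=True)
    return ranked[:k]

# Simple factual query: only 1 of 10 candidate chunks is relevant.
simple_query = [(1.0 - 0.05 * i, i == 0) for i in range(10)]
# Complex synthesis query: 20 of 30 candidate chunks are relevant.
complex_query = [(1.0 - 0.02 * i, i < 20) for i in range(30)]

K = 8
simple_hits = fixed_top_k(simple_query, K)
complex_hits = fixed_top_k(complex_query, K)

# Noise: irrelevant chunks sent to the model for the simple query.
noise_in_simple = sum(1 for _, relevant in simple_hits if not relevant)
# Coverage: share of the relevant material retrieved for the complex query.
total_relevant = sum(1 for _, relevant in complex_query if relevant)
coverage_in_complex = sum(1 for _, relevant in complex_hits if relevant) / total_relevant

print("irrelevant chunks for simple query:", noise_in_simple)
print("coverage for complex query:", coverage_in_complex)
```

Even with perfect ranking, the same k yields seven noise chunks for the simple query and only 40% coverage for the complex one.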

Challenge 7 Query–Document Asymmetry [High]

Short queries and long documents are poorly matched in embedding space, systematically biasing retrieval toward passage-length queries.

Embedding models are trained on pairs of semantically similar text. Short queries such as 'what is the cancellation policy?' produce embeddings that are structurally different from the long passages in which the answer is likely to appear. The embedding space does not naturally bridge this asymmetry. HyDE (Hypothetical Document Embeddings), which generates a synthetic answer and embeds it in place of the raw query, partially addresses this, but it adds latency and introduces its own error mode: the synthetic answer may be wrong, retrieving passages that confirm the hallucination rather than correct it.
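The HyDE idea can be sketched with toy components. Everything below is an assumption for illustration: the bag-of-words embed function stands in for a real embedding model, and generate_hypothetical_answer is a hypothetical stub where a real system would call an LLM.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'; a stand-in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

def retrieve(vec, passages):
    """Return the passage whose embedding is closest to `vec`."""
    return max(passages, key=lambda p: cosine(vec, embed(p)))

def generate_hypothetical_answer(query):
    """Hypothetical stub for the LLM call that drafts a synthetic answer."""
    return "customers may cancel within 30 days for a full refund"

passages = [
    "customers may cancel an order within 30 days and receive a full refund",
    "what is the privacy policy and what is the cookie policy",
]
query = "what is the cancellation policy"

# Direct retrieval embeds the short query itself.
direct = retrieve(embed(query), passages)
# HyDE-style retrieval embeds the synthetic answer instead.
hyde = retrieve(embed(generate_hypothetical_answer(query)), passages)

print("direct:", direct)
print("hyde:  ", hyde)
```

In this toy setup the short query matches the passage that merely looks like a question, while the synthetic answer matches the passage that contains the answer. The same mechanism is the source of the error mode: if the stub returned a wrong answer, retrieval would favour passages resembling that wrong answer.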

CATEGORY 3 · CONTEXT & STRUCTURE

Challenge 8 Context Window Blindness [Critical]

Chunks arrive in the model's context with no knowledge of their position, role, or relationship to the rest of the document.

Even when the correct chunk is retrieved, it arrives stripped of its documentary position. The model cannot know whether it is reading an introductory overview, a detailed technical sub-section, a hypothesis that is later revised, or a conclusion. The chunk arrives as an orphan. This is particularly problematic in documents that follow the classical structure of introduction, development, and revision. A document may present a hypothesis in section two, refine it in section four, and partially contradict it in section seven with new evidence. If the section-two chunk is retrieved without positional context, the model reasons from the unrevised hypothesis as though it were the document's final position. This is not a retrieval error: the most semantically relevant chunk was found. It is a context error, and it cannot be corrected by tuning the retrieval step.

Context blindness transforms a precision problem into a comprehension problem. The model is not reasoning from the wrong fragment — it is reasoning about a fragment without knowing what kind of fragment it is.
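To show what "positional context" could look like in practice, the sketch below contrasts a bare chunk with one that carries its section and position. The Chunk fields and the header format are assumptions chosen for illustration; they do not describe any existing system.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    section: str  # heading the chunk was taken from (illustrative field)
    index: int    # 0-based position of the chunk within the document
    total: int    # total number of chunks in the document

def render_bare(chunk):
    """What a typical pipeline sends to the model: the text alone."""
    return chunk.text

def render_with_position(chunk):
    """Prefix the text with a hypothetical header describing where it came from."""
    header = f"[section: {chunk.section} | chunk {chunk.index + 1} of {chunk.total}]"
    return f"{header}\n{chunk.text}"

chunk = Chunk(
    text="We hypothesise that latency grows linearly with corpus size.",
    section="2 Initial Hypothesis",
    index=3,
    total=40,
)

print(render_bare(chunk))
print(render_with_position(chunk))
```

With only the bare text, nothing tells the model that this is an early hypothesis with 36 further chunks after it; the header at least exposes that fact.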

Challenge 9 Structural Unawareness [Critical]

Relationships between headings, sections, figures, tables, footnotes, and appendices are discarded before the document is ever indexed.

Documents are not flat sequences of sentences. They are richly structured artifacts in which headings govern sections, figures are referenced from specific paragraphs, tables summarise surrounding prose, and appendices provide methodological detail that qualifies main-body conclusions. Current RAG systems discard this structural information almost entirely. A retrieved paragraph reading 'the results are shown in Table 3' is meaningless without Table 3. A retrieved figure caption is unintelligible without the figure. A retrieved claim may be qualified or contradicted by material in a section that was not retrieved. In highly formatted documents (legal contracts, scientific papers, technical specifications) the logical structure is the primary information-carrying mechanism. Stripping it before indexing does not preserve information in degraded form; it destroys entire categories of it.
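Dangling references of the 'Table 3' kind are at least detectable. The following is a minimal sketch, assuming references follow the simple pattern "Table N" or "Figure N"; the function name and the idea of passing in the set of labels actually present in the context are illustrative assumptions.

```python
import re

# Matches simple references such as "Table 3" or "Figure 12".
REFERENCE = re.compile(r"\b(Table|Figure)\s+(\d+)\b")

def unresolved_references(retrieved_texts, available_labels):
    """Return referenced tables/figures that are not among the available labels."""
    referenced = {
        f"{kind} {number}"
        for text in retrieved_texts
        for kind, number in REFERENCE.findall(text)
    }
    return sorted(referenced - set(available_labels))

retrieved = [
    "The results are shown in Table 3.",
    "See Figure 1 for the experimental setup.",
]
# Suppose only Figure 1 was actually included in the context.
missing = unresolved_references(retrieved, available_labels={"Figure 1"})
print(missing)
```

A check like this only reports that the structure was lost; it does not restore the table.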

Challenge 10 Coreference Breakdown [High]

'This method', 'it', 'the aforementioned approach' lose their referents entirely when extracted from surrounding context.

Natural language achieves concision through reference: pronouns replace noun phrases, demonstratives replace full descriptions, and ellipsis leaves information implicit when prior discourse has established it. This is not a stylistic choice; it is a fundamental property of how language works, and it is structurally incompatible with chunk-based retrieval. When a chunk beginning with 'This method significantly outperforms prior approaches in three dimensions' is extracted and sent to the model, the model cannot know what method is being discussed. The antecedent, the description of the method, may be in a different chunk, in a different section, or in a document that was not retrieved. The model will infer the referent from context, do so confidently, and do so incorrectly.
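A crude heuristic can flag chunks that open with a referring expression whose antecedent is likely to sit elsewhere. This is a sketch under a strong assumption: the list of openers is hand-picked for illustration, and string matching is not real coreference resolution.

```python
# Illustrative list of openers that usually point back to earlier text.
DANGLING_OPENERS = ("this ", "these ", "it ", "they ", "the aforementioned ")

def starts_with_dangling_reference(chunk_text):
    """Return True if the chunk begins with an expression that needs an antecedent."""
    return chunk_text.lstrip().lower().startswith(DANGLING_OPENERS)

chunks = [
    "This method significantly outperforms prior approaches in three dimensions.",
    "Gradient boosting builds an ensemble of shallow trees.",
]
flags = [starts_with_dangling_reference(c) for c in chunks]
print(flags)
```

The first chunk is flagged because its subject is defined somewhere outside the chunk; the second names its own subject.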

CATEGORY 4 · MULTI-DOCUMENT REASONING

Challenge 11 Multi-Document Synthesis Failures [High]

No mechanism exists to understand provenance, contradiction, or authority across documents, the most common real-world use case.

A large proportion of real-world use cases require reasoning across multiple documents: comparing two contract versions, reconciling conflicting findings across research papers, tracking how a policy evolved across regulatory updates, or assembling a coherent picture from a corpus of related materials. Current RAG systems handle this poorly for structural reasons. Similarity-based retrieval produces top-k passages without awareness of which document they originate from or how those documents relate to each other. Retrieved passages from different documents are concatenated in the context window without any mechanism for the model to understand inter-document relationships: which is newer, which is authoritative, which contradicts which, and which is supporting evidence for which. The model receives the pieces of the puzzle without the frame.

Multi-document reasoning requires awareness of provenance, temporality, authority, and inter-document contradiction — none of which current RAG architectures provide.

Challenge 12 Temporal Unawareness [Medium]

RAG has no sense of which document is newer, which supersedes which, or how knowledge has changed across document versions.

When a knowledge base contains multiple versions of a policy, successive drafts of a contract, or a sequence of research papers on a developing topic, the RAG system has no mechanism to understand temporal ordering. A question about 'the current policy on data retention' may retrieve the most semantically similar passage (which could be from an outdated version) rather than the most recent authoritative statement. The model will answer confidently from the retrieved passage with no awareness that it has been superseded.
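The data-retention example can be sketched as follows. The document names, effective dates, similarity scores, and the 0.85 threshold are all hypothetical; the point is only that choosing the current version requires date metadata that similarity scores do not carry.

```python
from datetime import date

# Hypothetical retrieval results for "the current policy on data retention".
candidates = [
    {"doc": "retention-policy-v1", "effective": date(2021, 3, 1), "score": 0.93},
    {"doc": "retention-policy-v3", "effective": date(2025, 1, 15), "score": 0.88},
]

# Similarity-only selection: the highest score wins, regardless of age.
by_similarity = max(candidates, key=lambda c: c["score"])

def newest_above(results, min_score):
    """Among results scoring at least `min_score`, return the most recent one."""
    eligible = [c for c in results if c["score"] >= min_score]
    if not eligible:
        return None
    return max(eligible, key=lambda c: c["effective"])

by_recency = newest_above(candidates, min_score=0.85)

print("similarity picks:", by_similarity["doc"])
print("recency-aware picks:", by_recency["doc"])
```

The recency-aware variant is only possible because each candidate carries an effective date, which is exactly the information a plain vector index lacks.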

CATEGORY 5 · OUTPUT QUALITY & OPERATIONS

Challenge 13 Hallucination Amplification [Critical]

A retrieved chunk with a gap primes the model to fill that gap with confident fabrication, which is worse than no retrieval at all.

One of RAG's primary motivations is hallucination reduction: supplying the model with retrieved factual content anchors outputs in real documents. This works when retrieval is correct, the chunk is complete, and context is sufficient. When any condition fails, RAG can actively worsen the hallucination problem. The mechanism is direct. When the model receives a chunk that is relevant but incomplete (missing a definition, a qualification, a table, or prior context) it must fill the gap. Unlike a model reasoning from training data alone, which may hedge when uncertain, a model in RAG mode has been given 'evidence'. It is primed to reason from that evidence confidently. Told that 'the maximum safe dose is shown in Figure 3' but not given Figure 3, the model will not decline to answer. It will, with high probability, fabricate a plausible dose grounded in training data and present it as document-sourced.

RAG does not eliminate hallucination — it relocates and reframes it. Partial context retrieval produces more confident and more plausible hallucinations than no context at all, making them substantially harder to detect and correct.

Challenge 14 False Grounding [Critical]

The model cites a real chunk as the source for a claim the chunk does not actually support, a new class of failure unique to RAG.

Closely related to hallucination amplification is false grounding: the model produces an answer, cites a retrieved passage as its source, and the passage does not in fact support the claim. This failure mode is unique to retrieval-augmented systems and is more damaging than standard hallucination for two reasons: the cited source is real, making the claim appear credible to any reviewer who does not read the source carefully; and when the error is eventually discovered, it erodes trust in the source document rather than, or in addition to, trust in the AI system.

Challenge 15 The Evaluation Blind Spot [High]

Standard metrics measure consistency with the retrieved chunk, not correctness of the chunk, making systematic failures invisible.

The challenges above would be more manageable if they were easily detectable. They are not. RAG failures at the context and structure level produce responses that are fluent, confident, and locally plausible. They fail at precisely the level of accuracy that requires detailed knowledge of the source document to assess. Standard evaluation metrics (RAGAS scores, faithfulness metrics, context relevance scores) measure whether the model's output is consistent with the retrieved chunk, not whether the chunk was correct, complete, or properly contextualised. A model that accurately summarises an incomplete chunk receives a high faithfulness score even though the chunk omitted critical information. This creates a systematic blind spot: many organisations are running RAG systems whose true failure rate is substantially higher than their evaluation dashboards indicate.
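The blind spot can be illustrated with a toy faithfulness score. The token-overlap function below is an assumption made for illustration and is far simpler than the metrics named above; the dosage sentences are invented example data.

```python
def token_overlap_faithfulness(answer, chunk):
    """Toy proxy: fraction of answer tokens that also appear in the retrieved chunk."""
    answer_tokens = answer.lower().split()
    chunk_tokens = set(chunk.lower().split())
    supported = sum(1 for token in answer_tokens if token in chunk_tokens)
    return supported / len(answer_tokens)

# The chunk that was retrieved, and a qualifier that sat in a different chunk.
retrieved_chunk = "the maximum dose is 50 mg per day"
omitted_qualifier = "except for patients under 12 where the maximum dose is 10 mg per day"

# The model faithfully restates the retrieved chunk.
answer = "the maximum dose is 50 mg per day"

score = token_overlap_faithfulness(answer, retrieved_chunk)
print("faithfulness:", score)
print("qualifier seen by the metric:", "except" in retrieved_chunk)
```

The answer scores perfectly because the metric only compares it with what was retrieved; the omitted qualifier never enters the calculation.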

Challenge 16 No Retrieval Feedback Loop [Medium]

Failed queries do not improve the index, the chunking strategy, or the retrieval configuration; failure is silent and cumulative.

In a well-designed machine learning system, prediction errors feed back into model improvement. RAG pipelines typically lack this loop entirely. A query that retrieves the wrong chunk, a chunk that is missing critical context, or a response that is confidently wrong: none of these events update the embedding index, adjust the chunk boundaries, retune the retrieval parameters, or flag the document for re-ingestion. The system fails, the user may or may not notice, and the next identical query will fail in exactly the same way. This makes RAG quality fundamentally static without explicit human-in-the-loop review processes.
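As a sketch of what the smallest possible feedback loop might look like, the class below counts reported failures per chunk and flags chunks for re-ingestion once a threshold is reached. The class name, the chunk id format, and the threshold are hypothetical, and the failure reports are assumed to come from a human reviewer.

```python
from collections import Counter

class RetrievalFeedbackLog:
    """Counts reported retrieval failures per chunk id (illustrative design)."""

    def __init__(self, flag_after=3):
        self.flag_after = flag_after  # failures needed before a chunk is flagged
        self.failures = Counter()

    def record_failure(self, chunk_id):
        """Register one reviewer-reported failure for the given chunk."""
        self.failures[chunk_id] += 1

    def flagged_for_reingestion(self):
        """Return chunk ids whose failure count has reached the threshold."""
        return sorted(
            chunk_id
            for chunk_id, count in self.failures.items()
            if count >= self.flag_after
        )

log = RetrievalFeedbackLog(flag_after=2)
log.record_failure("doc7#chunk4")
log.record_failure("doc7#chunk4")
log.record_failure("doc2#chunk1")

print(log.flagged_for_reingestion())
```

Without some record of this kind, the second failure on the same chunk is indistinguishable from the first.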

Consolidated Challenge Summary

The following table provides a reference view of all sixteen challenges with their severity, their impact on LLM output, and their category. Severity is assessed at the architectural level: how fundamentally the challenge undermines the RAG paradigm. LLM Impact reflects the consequence for end-user output quality in production deployment. Both are rated Critical, High, or Medium.

Challenge	Severity	LLM Impact	Category
Semantic chunking failures	Critical	Critical	Chunking & Ingestion
Arbitrary boundary decisions	Critical	High	Chunking & Ingestion
Overlap heuristics	Medium	Medium	Chunking & Ingestion
Semantic gap	High	High	Retrieval Quality
Polysemy & domain mismatch	High	High	Retrieval Quality
Top-k blindness	Medium	Medium	Retrieval Quality
Query–document asymmetry	High	Medium	Retrieval Quality
Context window blindness	Critical	Critical	Context & Structure
Structural unawareness	Critical	High	Context & Structure
Coreference breakdown	High	High	Context & Structure
Multi-document synthesis	High	Critical	Multi-Document
Temporal unawareness	Medium	High	Multi-Document
Hallucination amplification	Critical	Critical	Output Quality
False grounding	Critical	Critical	Output Quality
Evaluation blind spot	High	High	Operational
No retrieval feedback loop	Medium	Medium	Operational

The 16 Core Challenges of Current RAG Technologies