Imagine embarking on a journey through a dense forest, where each step brings a new discovery, yet hints at deeper mysteries ahead.
That’s how it felt when I first delved into Retrieval-Augmented Generation (RAG).
Every layer I uncovered felt like peeling back an onion—sometimes revealing something exciting, other times forcing me to pause and reflect, like wiping away the sting from my eyes.
But I knew the destination would be worth the journey.
In this article, we’ll take you through the intricacies of RAG, using a starter tutorial from LlamaIndex.
Don’t worry if the path seems complex: together, we’ll break down each layer and examine every part of the process until the full picture comes into view.
R - Retrieval: This is the step where the system searches external databases or sources for relevant information. Large language models (LLMs) like GPT are not always up to date with the latest data, so the pipeline retrieves context from external sources to ground responses in accurate, current information.
A - Augmented: The retrieved context enriches, or “augments”, the prompt given to the LLM, allowing it to generate more informed and precise answers.
G - Generation: The model uses the augmented context to produce a natural language response, synthesizing the retrieved information with its internal knowledge.
This process is particularly helpful for providing accurate responses with recent or domain-specific data.
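Before we look at the library, it helps to see the whole loop in one place. The sketch below is illustrative TypeScript pseudocode only; retrieve, buildPrompt, and complete are hypothetical stand-ins, not functions from LlamaIndex or any other library.

// Illustrative pseudocode only: retrieve, buildPrompt, and complete are
// hypothetical stand-ins, not part of any real library.
type Chunk = { text: string; score: number };

declare function retrieve(query: string, opts: { topK: number }): Promise<Chunk[]>;
declare function buildPrompt(question: string, context: string[]): string;
declare function complete(prompt: string): Promise<string>;

async function answerWithRag(question: string): Promise<string> {
  // R - Retrieval: find the chunks most semantically similar to the question.
  const retrieved = await retrieve(question, { topK: 3 });

  // A - Augmented: fold the retrieved text into the prompt.
  const prompt = buildPrompt(question, retrieved.map((c) => c.text));

  // G - Generation: the LLM answers using the augmented prompt.
  return complete(prompt);
}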
RAG is composed of distinct modules that work together to meet a specific need. Here’s the typical pipeline flow:
The sample code implementation below comes from the LlamaIndex starter tutorial: https://ts.llamaindex.ai/getting_started/starter_tutorial/retrieval_augmented_generation
import fs from "node:fs/promises";

import {
  Document,
  MetadataMode,
  NodeWithScore,
  VectorStoreIndex,
} from "llamaindex";

async function main() {
  // Load essay from abramov.txt in Node
  const path = "node_modules/llamaindex/examples/abramov.txt";
  const essay = await fs.readFile(path, "utf-8");

  // Create Document object with essay
  const document = new Document({ text: essay, id_: path });

  // Split text and create embeddings. Store them in a VectorStoreIndex
  const index = await VectorStoreIndex.fromDocuments([document]);

  // Query the index
  const queryEngine = index.asQueryEngine();
  const { response, sourceNodes } = await queryEngine.query({
    query: "What did the author do in college?",
  });

  // Output response with sources
  console.log(response);

  if (sourceNodes) {
    sourceNodes.forEach((source: NodeWithScore, index: number) => {
      console.log(
        `\n${index}: Score: ${source.score} - ${source.node.getContent(MetadataMode.NONE).substring(0, 50)}...\n`,
      );
    });
  }
}

main().catch(console.error);
const document = new Document({ text: essay, id_: path });
Here, we’re wrapping the essay text in a Document object, which is part of LlamaIndex’s structure for managing chunks of data. By assigning an id_ (in this case, the path to the text file), we ensure the document is identifiable and can be processed into embeddings.
export class Document<T extends Metadata = Metadata> extends TextNode<T> {
  constructor(init?: TextNodeParams<T>) {
    super(init);
  }

  get type() {
    return ObjectType.DOCUMENT;
  }
}
// Split text and create embeddings. Store them in a VectorStoreIndex
const index = await VectorStoreIndex.fromDocuments([document]);
- Text Splitting: The document is broken down into smaller chunks, which makes it easier to handle large bodies of text and improves search efficiency during retrieval.
- Creating Embeddings: Each chunk of text is transformed into a vector (a numerical representation of the text) that captures its meaning and semantic relationships. This is essential for enabling the model to “understand” and compare pieces of text based on their meaning rather than just their exact wording.
- Storing in a VectorStoreIndex: The embeddings are stored in a VectorStoreIndex, a specialized data structure that facilitates fast, efficient searching for similar text. This step allows the model to quickly find and retrieve the most relevant chunks of information during a query.
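To make “comparing by meaning” concrete, here is a small, self-contained helper (not part of LlamaIndex) showing how two embedding vectors are typically compared with cosine similarity; at query time, the chunks whose embeddings score highest against the query embedding are the ones retrieved.

// Cosine similarity between two embedding vectors: values near 1 mean the
// texts point in the same semantic direction, values near 0 mean unrelated.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy 3-dimensional "embeddings"; real embedding models produce hundreds of dimensions.
console.log(cosineSimilarity([0.1, 0.9, 0.2], [0.15, 0.85, 0.25])); // high: similar meaning
console.log(cosineSimilarity([0.1, 0.9, 0.2], [0.9, 0.05, 0.1])); // lower: different meaning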
Now that we’ve walked through the theory, let’s dive into the details of how this process is implemented.
static async fromDocuments(
  documents: Document[],
  args: VectorIndexOptions & {
    docStoreStrategy?: DocStoreStrategy;
  } = {},
): Promise<VectorStoreIndex> {
  args.storageContext =
    args.storageContext ?? (await storageContextFromDefaults({}));
  args.vectorStores = args.vectorStores ?? args.storageContext.vectorStores;
  args.docStoreStrategy =
    args.docStoreStrategy ??
    // set doc store strategy defaults to the same as for the IngestionPipeline
    (args.vectorStores
      ? DocStoreStrategy.UPSERTS
      : DocStoreStrategy.DUPLICATES_ONLY);
  args.serviceContext = args.serviceContext;
  const docStore = args.storageContext.docStore;

  if (args.logProgress) {
    console.log("Using node parser on documents...");
  }

  // use doc store strategy to avoid duplicates
  const vectorStores = Object.values(args.vectorStores ?? {});
  const docStoreStrategy = createDocStoreStrategy(
    args.docStoreStrategy,
    docStore,
    vectorStores,
  );
  args.nodes = await runTransformations(
    documents,
    [nodeParserFromSettingsOrContext(args.serviceContext)],
    {},
    { docStoreStrategy },
  );

  if (args.logProgress) {
    console.log("Finished parsing documents.");
  }

  return await this.init(args);
}
Explaining the function above
static async fromDocuments(
  documents: Document[],
  args: VectorIndexOptions & {
    docStoreStrategy?: DocStoreStrategy;
  } = {},
): Promise<VectorStoreIndex> {}
This function is responsible for taking an array of Document objects (in this case, a single essay) and creating the vector store that will later be queried for information. Let’s break it down:
fromDocuments: This static method transforms documents into a vector index. It handles everything from document ingestion and transformation to index initialization.
Parameters:
- documents: Document[]: This is an array of document objects, where each document contains text data and associated metadata.
- args: VectorIndexOptions: This argument includes various options to control how the vector index should be built, such as which strategies to use for storing documents (docStoreStrategy) or which context to use (storageContext).
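For example, the options object can be passed explicitly when building the index. This is just a sketch; only logProgress is shown because it is read inside the implementation above.

// Sketch: building the index with an explicit options object.
// logProgress is checked inside fromDocuments (shown above) and prints
// progress messages while the documents are parsed.
const index = await VectorStoreIndex.fromDocuments([document], {
  logProgress: true,
});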
args.storageContext =
  args.storageContext ?? (await storageContextFromDefaults({}));
args.vectorStores = args.vectorStores ?? args.storageContext.vectorStores;
storageContext: If no storage context is provided in the arguments, the function sets a default one. This context controls where and how the embeddings are stored and retrieved later.
vectorStores: This field holds the actual vector storage, which is where the embeddings (numerical representations of the documents) will live. If it’s not explicitly provided, it’s fetched from the storageContext.
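By default everything lives in memory, so the embeddings are rebuilt on every run. If you want them to persist, you can supply your own storage context. The persistDir option below is my assumption about the storageContextFromDefaults API, so treat this as a sketch rather than definitive usage.

import { storageContextFromDefaults, VectorStoreIndex } from "llamaindex";

// Sketch: supplying our own storage context instead of the in-memory default.
// persistDir is an assumed option name; check the llamaindex docs for your version.
const storageContext = await storageContextFromDefaults({
  persistDir: "./storage",
});

const persistedIndex = await VectorStoreIndex.fromDocuments([document], {
  storageContext,
});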
args.docStoreStrategy =
  args.docStoreStrategy ??
  (args.vectorStores
    ? DocStoreStrategy.UPSERTS
    : DocStoreStrategy.DUPLICATES_ONLY);
docStoreStrategy: This controls how documents are stored:
- UPSERTS ensures that new documents are either inserted or updated in place, preventing duplicate entries when a document is ingested again.
- DUPLICATES_ONLY only checks for exact duplicates, and is the fallback when no vector store is provided.
This flexibility ensures that documents are managed efficiently based on the available storage resources.
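If you would rather make that choice explicit instead of relying on the inferred default, the strategy can be passed in directly. This is a sketch based on the signature above, and it assumes DocStoreStrategy is exported by the package.

import { DocStoreStrategy, VectorStoreIndex } from "llamaindex"; // DocStoreStrategy export assumed

// Sketch: forcing upsert behaviour so re-ingesting the same document
// updates the existing entry instead of creating a duplicate.
const index = await VectorStoreIndex.fromDocuments([document], {
  docStoreStrategy: DocStoreStrategy.UPSERTS,
});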
const docStore = args.storageContext.docStore;
docStore: This is the storage system that manages the original documents, keeping track of them alongside their embeddings in the vector store.
if (args.logProgress) {
  console.log("Using node parser on documents...");
}
Logging Progress: For tracking, this logs when the documents are being parsed, which is helpful when processing large datasets.
// use doc store strategy to avoid duplicates
const vectorStores = Object.values(args.vectorStores ?? {});
const docStoreStrategy = createDocStoreStrategy(
  args.docStoreStrategy,
  docStore,
  vectorStores,
);
Avoiding Duplicates: This part ensures that the strategy for storing documents avoids duplicating them unnecessarily, keeping storage efficient.
args.nodes = await runTransformations(
  documents,
  [nodeParserFromSettingsOrContext(args.serviceContext)],
  {},
  { docStoreStrategy },
);
Running Transformations: This processes the documents, converting them into smaller nodes (chunks) that will be indexed. It uses the nodeParser to extract the relevant information and store it in the appropriate format.
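The node parser is what decides where those chunk boundaries fall. If you want to influence that step, LlamaIndex.TS exposes a global Settings object; the exact knobs below (chunkSize, chunkOverlap) are assumptions on my part, so verify them against the version you are using.

import { Settings } from "llamaindex";

// Sketch: tuning how documents are split before they are embedded.
// chunkSize and chunkOverlap are assumed setting names; confirm in the docs.
Settings.chunkSize = 512; // rough number of tokens per chunk
Settings.chunkOverlap = 20; // small overlap so ideas aren't cut off mid-sentence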
return await this.init(args);
Final Initialization: Once everything is processed and stored, the index is initialized, making it ready to be queried.
// Query the index
const queryEngine = index.asQueryEngine();
const { response, sourceNodes } = await queryEngine.query({
  query: "What did the author do in college?",
});
This part of the code sets up the engine that will allow us to ask questions about the stored documents.
- asQueryEngine: This transforms the VectorStoreIndex into a query engine. In this context, the engine is designed to retrieve the most relevant information from the index when a query is made.
- query: This is where we ask the system a question, like "What did the author do in college?". The query is transformed into an embedding and matched against the stored vectors in the index.
- response: The result of the query, i.e. the natural language answer the LLM generates from the retrieved context.
- sourceNodes: These represent the specific nodes (chunks) of the document that the query matched with. This allows us to trace the source of the information returned.
Let’s walk through the code implementation for the Query Engine.
asQueryEngine(options?: {
  retriever?: BaseRetriever;
  responseSynthesizer?: BaseSynthesizer;
  preFilters?: MetadataFilters;
  nodePostprocessors?: BaseNodePostprocessor[];
  similarityTopK?: number;
}): RetrieverQueryEngine {
  const {
    retriever,
    responseSynthesizer,
    preFilters,
    nodePostprocessors,
    similarityTopK,
  } = options ?? {};
  return new RetrieverQueryEngine(
    retriever ?? this.asRetriever({ similarityTopK }),
    responseSynthesizer,
    preFilters,
    nodePostprocessors,
  );
}
asQueryEngine(options?: {
  retriever?: BaseRetriever;
  responseSynthesizer?: BaseSynthesizer;
  preFilters?: MetadataFilters;
  nodePostprocessors?: BaseNodePostprocessor[];
  similarityTopK?: number;
}): RetrieverQueryEngine {}
This part of the code offers additional options when creating the query engine, such as:
- retriever: The mechanism responsible for finding relevant nodes in the vector index.
- responseSynthesizer: This is used to combine the retrieved information into a coherent response.
- preFilters: These are applied to the data before retrieval, allowing for more targeted searches.
- similarityTopK: This controls how many similar nodes to return, allowing for broader or narrower search results.
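Putting that together, a query engine tuned to pull back more context might look like the sketch below; it is based directly on the signature above, and the value 5 is arbitrary.

// Retrieve the 5 most similar chunks (instead of the default) before
// the response synthesizer combines them into an answer.
const queryEngine = index.asQueryEngine({ similarityTopK: 5 });

const { response } = await queryEngine.query({
  query: "What did the author do in college?",
});
console.log(response);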
In this journey, we’ve traced the path of Retrieval-Augmented Generation (RAG) from its theoretical foundations to its implementation in code.
We’ve explored how RAG leverages external data through retrieval to empower large language models to generate more accurate and contextually rich responses.
By breaking down the process—retrieving data, augmenting it, and generating a response—we’ve peeled back the layers of complexity that make RAG a powerful tool for modern AI applications.
As you’ve seen, building and querying a vector store is central to making RAG work effectively.
Whether you’re handling large-scale data or trying to keep your language model up-to-date, RAG provides a flexible, scalable way to bridge the gap between static model knowledge and real-time, dynamic data retrieval.
As a next step, you can experiment with fine-tuning your language model to specialize it for a particular domain or dataset. Fine-tuning allows the model to generate more accurate responses by training it on data relevant to your use case. You can also optimize the retrieval process by refining how you chunk, store, and query data, ensuring that the system provides the most relevant context for each query.
By fine-tuning and optimizing these components, you can create a more powerful and targeted RAG implementation that aligns with your specific needs.