Exploring Retriever Options in LlamaIndex: A Journey Through JSON Data Queries

My journey into understanding the capabilities and limitations of retrievers in LlamaIndex began with an issue raised on GitHub: Issue #1095. While the primary issue concerning the JSON Reader had been resolved, a follow-up concern raised by Sudhir Veerakamaraj caught my attention.

The concern revolved around a retriever’s inability to return all expected values from a JSON dataset, even when explicitly prompted to do so. This sparked my curiosity, leading me to dive deeper into the problem and discover valuable insights about how retrievers work in LlamaIndex.

The Problem: Missing cName Values

When querying a JSON dataset for all cName values, the expected behavior was to retrieve every cName present in the data. However, only two values, s10 and s9, were returned. Here’s the sample JSON data used for the test:

[
    {
        "experimentid": "204512b1-004d-4bb3-ada2-e495ed2727ce",
        "experimentname": "7.10.24",
        "cName":"s0"
    },
    {
        "experimentname": "7.10.24",
        "experimentid": "204512b1-004d-4bb3-ada2-e495ed2727ce",
        "cName":"s1"
    },
    {
        "experimentname": "7.10.24",
        "experimentid": "204512b1-004d-4bb3-ada2-e495ed2727ce",
        "cName":"s2"
    },
    {
        "experimentname": "7.10.24",
        "experimentid": "204512b1-004d-4bb3-ada2-e495ed2727ce",
        "cName":"s3"
    },
    {
        "experimentname": "7.10.24",
        "experimentid": "204512b1-004d-4bb3-ada2-e495ed2727ce",
        "cName":"s4"
    },
    {
        "experimentname": "7.10.24",
        "experimentid": "204512b1-004d-4bb3-ada2-e495ed2727ce",
        "cName":"s5"
    },
    {
        "experimentname": "7.10.24",
        "experimentid": "204512b1-004d-4bb3-ada2-e495ed2727ce",
        "cName":"s5"
    },
    {
        "experimentname": "7.10.24",
        "experimentid": "204512b1-004d-4bb3-ada2-e495ed2727ce",
        "cName":"s6"
    },
    {
        "experimentname": "7.10.24",
        "experimentid": "204512b1-004d-4bb3-ada2-e495ed2727ce",
        "cName":"s7"
    },
    {
        "experimentname": "7.10.24",
        "experimentid": "204512b1-004d-4bb3-ada2-e495ed2727ce",
        "cName":"s8"
    },
    {
        "experimentname": "7.10.24",
        "experimentid": "204512b1-004d-4bb3-ada2-e495ed2727ce",
        "cName":"s9"
    },
    {
        "experimentname": "7.10.24",
        "experimentid": "204512b1-004d-4bb3-ada2-e495ed2727ce",
        "cName":"s10"
    }
]

The following code was used to query the data:

import {
  JSONReader, // used below to load the JSON file
  OpenAI,
  OpenAIEmbedding,
  Settings,
  VectorStoreIndex,
} from "llamaindex";

async function main() {
  Settings.llm = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
  Settings.embedModel = new OpenAIEmbedding({
    model: "text-embedding-ada-002",
  });
 
  const reader1 = new JSONReader();
  const docs1 = await reader1.loadData("examples/test/data.json");
  const index = await VectorStoreIndex.fromDocuments(docs1);

  const queryEngine = index.asQueryEngine();
  const prompt = `list down all the cName values`;
  const results = await queryEngine.query({
    query: prompt,
  });

  return results;
}

main().then((data) => {
  console.log(data);
});

The expectation was to retrieve all cName values (s0 to s10), but only s10 and s9 were returned. This unexpected behavior prompted a deeper investigation.
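
Before looking at the retriever, it helps to pin down what a complete answer would contain. The sketch below is not part of the original issue. It parses a cut-down copy of the sample data (only the cName field is kept) and lists the values directly, with no index or LLM involved. The names ExperimentRecord, listCNames, and distinct are made up for illustration.

```typescript
// Cut-down copy of the sample data: only the cName field is kept.
// "s5" appears twice in the sample, so there are 12 records in total.
const rawJson = `[
  {"cName":"s0"}, {"cName":"s1"}, {"cName":"s2"}, {"cName":"s3"},
  {"cName":"s4"}, {"cName":"s5"}, {"cName":"s5"}, {"cName":"s6"},
  {"cName":"s7"}, {"cName":"s8"}, {"cName":"s9"}, {"cName":"s10"}
]`;

// Hypothetical type for one entry of the array.
interface ExperimentRecord {
  cName: string;
}

// Hypothetical helper: returns every cName in file order.
function listCNames(json: string): string[] {
  const records = JSON.parse(json) as ExperimentRecord[];
  return records.map((r) => r.cName);
}

// Hypothetical helper: drops repeats while keeping first-seen order.
function distinct(values: string[]): string[] {
  return values.filter((v, i) => values.indexOf(v) === i);
}

const allCNames = listCNames(rawJson);
const uniqueCNames = distinct(allCNames);

console.log(allCNames.length); // 12
console.log(uniqueCNames.join(", ")); // s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, s10
```

A complete answer from the query engine should therefore mention eleven distinct values, s0 through s10.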

Root Cause: Default Behavior of VectorIndexRetriever

The issue stemmed from the default configuration of the VectorIndexRetriever, the retriever used by a query engine built on a VectorStoreIndex. By design, it fetches only the two most similar nodes unless the similarityTopK parameter is set explicitly. The relevant code can be found in the LlamaIndex repository.

In this case, the VectorIndexRetriever returned only s10 and s9 because they were the two entries it scored as most similar to the query. The remaining records never reached the LLM, so no prompt wording could have surfaced them.
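
The effect of a top-two cutoff can be reproduced without LlamaIndex. The following is a toy sketch, not the library's implementation. The ScoredNode type, the retrieveTopK function, and the similarity scores are all invented for illustration, with s10 and s9 given the highest scores to mirror the result above. The only things carried over from the source are the parameter name similarityTopK and its default of two.

```typescript
// Toy model of a top-k retriever. Names and scores are hypothetical;
// this is not LlamaIndex code.
interface ScoredNode {
  text: string;
  score: number; // made-up similarity to the query
}

// The default of 2 mirrors the default described above.
function retrieveTopK(
  nodes: ScoredNode[],
  similarityTopK: number = 2,
): ScoredNode[] {
  // Copy before sorting so the caller's array is left untouched.
  return nodes
    .slice()
    .sort((a, b) => b.score - a.score)
    .slice(0, similarityTopK);
}

// One node per JSON record (12 records, since "s5" appears twice).
const labels = [
  "s0", "s1", "s2", "s3", "s4", "s5",
  "s5", "s6", "s7", "s8", "s9", "s10",
];
const nodes: ScoredNode[] = labels.map((label, i) => ({
  text: `cName: ${label}`,
  // Invented scores that rise with position, so s10 and s9 rank highest.
  score: 0.7 + i * 0.01,
}));

const withDefault = retrieveTopK(nodes);
const withLargerK = retrieveTopK(nodes, nodes.length);

console.log(withDefault.map((n) => n.text).join(", ")); // cName: s10, cName: s9
console.log(withLargerK.length); // 12
```

Raising the cutoff to the number of nodes returns every record, which is the same outcome a retriever that returns all nodes gives without any tuning.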

The Solution: Using the Right Retriever

To retrieve all cName values, the fix was to build a SummaryIndex instead of a VectorStoreIndex, so that queries go through the SummaryIndexRetriever. This retriever returns all the context in the index rather than a top-k subset, making it ideal for this use case. The updated code is as follows:

import {
  JSONReader, // used below to load the JSON file
  OpenAI,
  OpenAIEmbedding,
  Settings,
  SummaryIndex, // replaces VectorStoreIndex
} from "llamaindex";

async function main() {
  Settings.llm = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
  Settings.embedModel = new OpenAIEmbedding({
    model: "text-embedding-ada-002",
  });
 
  const reader1 = new JSONReader();
  const docs1 = await reader1.loadData("examples/test/data.json");
  const index = await SummaryIndex.fromDocuments(docs1);

  const queryEngine = index.asQueryEngine();
  const prompt = `list down all the cName values`;
  const results = await queryEngine.query({
    query: prompt,
  });

  return results;
}

main().then((data) => {
  console.log(data);
});

With the SummaryIndexRetriever, all cName values were successfully returned.

Reflection: Learning About Retriever Options

Thanks to the insight shared by KindOfAScam, this experience gave me a deeper understanding of the retriever options available in LlamaIndex, such as:

VectorIndexRetriever: Fetches the top k most similar nodes.
SummaryIndexRetriever: Returns all relevant context.
KeywordTableRetrievers: Focus on keyword-based queries with options like RAKE and Simple.

Choosing the right retriever for the task is critical to achieving the desired results. While VectorIndexRetriever is suitable for similarity-based queries, SummaryIndexRetriever excels in scenarios where all context must be retrieved.
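
The three strategies in the list above can be contrasted with a toy model. This is a sketch under stated assumptions, not LlamaIndex code. The ToyRetriever type and the three factory functions are hypothetical, similarity is faked by counting shared words instead of comparing embeddings, and the keyword table is a plain object built by splitting text into tokens.

```typescript
// Toy retrievers over plain strings. All names here are hypothetical.
type ToyRetriever = (query: string) => string[];

// Lowercased alphanumeric tokens of a string.
function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((t) => t.length > 0);
}

// Summary-style: ignore the query and return every node.
function summaryRetriever(nodes: string[]): ToyRetriever {
  return () => nodes.slice();
}

// Vector-style stand-in: rank by number of shared tokens, keep the top k.
function topKRetriever(nodes: string[], k: number): ToyRetriever {
  return (query) => {
    const queryTokens = tokenize(query);
    return nodes
      .map((node) => ({
        node,
        score: tokenize(node).filter((t) => queryTokens.indexOf(t) !== -1)
          .length,
      }))
      .sort((a, b) => b.score - a.score)
      .slice(0, k)
      .map((scored) => scored.node);
  };
}

// Keyword-table-style: map each token to the nodes that contain it.
function keywordRetriever(nodes: string[]): ToyRetriever {
  const table: { [keyword: string]: string[] } = {};
  const has = (key: string) =>
    Object.prototype.hasOwnProperty.call(table, key);
  for (const node of nodes) {
    for (const token of tokenize(node)) {
      if (!has(token)) table[token] = [];
      if (table[token].indexOf(node) === -1) table[token].push(node);
    }
  }
  return (query) => {
    const hits: string[] = [];
    for (const token of tokenize(query)) {
      if (!has(token)) continue;
      for (const node of table[token]) {
        if (hits.indexOf(node) === -1) hits.push(node);
      }
    }
    return hits;
  };
}

const sampleNodes = ["cName s0", "cName s1", "cName s2", "cName s3"];
const sampleQuery = "list all the cName values";

const fromSummary = summaryRetriever(sampleNodes)(sampleQuery);
const fromTopK = topKRetriever(sampleNodes, 2)(sampleQuery);
const fromKeyword = keywordRetriever(sampleNodes)(sampleQuery);
const fromKeywordS2 = keywordRetriever(sampleNodes)("s2");

console.log(fromSummary.length, fromTopK.length, fromKeyword.length);
console.log(fromKeywordS2.join(", "));
```

In this toy model, only the summary-style retriever is guaranteed to return every node regardless of the query. The keyword-style one returns all four nodes for the sample query only because every node happens to contain the token cname, and it narrows to a single node when asked for s2.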

Conclusion

This journey highlighted the importance of understanding the tools and their default behaviors. By switching to the SummaryIndexRetriever, it was possible to get all cName values.

References

GitHub Issue #1095
LlamaIndex Documentation on Retrievers