Designing a Data Source Discovery App - Part 10: Retrieval-Augmented Generation (RAG) in kDS-DSD

by DL Keeshin


December 22, 2024

[Figure: kDS Discovery ERD (images/future_kds_discovery_erd_20241010.png), see area of focus]

In my last post, I talked about managing organizational data. In this post, I want to discuss Retrieval-Augmented Generation (RAG). We'll dive into what RAG is, why it matters, and how we’re incorporating it into the design of the kDS data source discovery (DSD) app.

What is RAG?

RAG is an advanced AI technique that blends information retrieval with generative AI models to produce responses that are not only accurate but also highly contextually relevant. Here’s how it works:

  • Retrieval: The system searches a knowledge base, database, or document repository to retrieve relevant information based on the user’s input.
  • Augmentation: The retrieved data is combined with the input query to provide additional context.
  • Generation: A generative AI model, such as a large language model (LLM), uses this enriched context to craft a detailed, informed response.
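
In toy code, the three steps above might look like the following minimal sketch. The in-memory knowledge base, the naive keyword matching, and the stubbed `generate` function are all stand-ins for a real document store and LLM API:

```python
# Minimal sketch of retrieve/augment/generate, assuming an in-memory
# knowledge base, naive keyword matching, and a stubbed model call;
# a real system would use a database and an LLM API.

KNOWLEDGE_BASE = [
    "541511 Custom Computer Programming Services",
    "541512 Computer Systems Design Services",
    "445110 Supermarkets and Other Grocery Stores",
]

def retrieve(query: str) -> list[str]:
    """Retrieval: pull entries that share words with the user's input."""
    words = set(query.lower().split())
    return [doc for doc in KNOWLEDGE_BASE
            if words & set(doc.lower().split())]

def augment(query: str, docs: list[str]) -> str:
    """Augmentation: combine the retrieved data with the input query."""
    return "Context:\n" + "\n".join(docs) + "\n\nQuestion: " + query

def generate(prompt: str) -> str:
    """Generation: stand-in for the LLM call that answers from context."""
    return "Answer based on:\n" + prompt

query = "computer programming services"
answer = generate(augment(query, retrieve(query)))
```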

How the kDS-DSD App Leverages RAG

Retrieval from an LLM

The app prompts an LLM to generate a plausible NAICS (North American Industry Classification System) code based on the user’s description of their industry. While the LLM isn’t searching an external database, it draws on its internal knowledge base, which acts as a form of retrieval.

Augmentation

The app presents the generated NAICS code in the user interface and lets the user confirm or modify it. This feedback loop ensures the final code is accurate and tailored to the user’s needs.

Generation

The LLM generates the NAICS code and can refine its suggestions based on user interactions.
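
A hypothetical sketch of this suggest-and-confirm flow is below; `build_naics_prompt`, `suggest_code`, and the stubbed `call_llm` are illustrative names, not the app's actual API:

```python
# Hypothetical sketch of the suggest/confirm loop; call_llm stands in
# for whatever real LLM API the app uses.

def build_naics_prompt(industry_description: str) -> str:
    return (
        "Suggest the most likely NAICS industry code for this business, "
        "with a one-line rationale:\n" + industry_description
    )

def suggest_code(industry_description: str, call_llm) -> str:
    """Ask the model for a code; the UI then lets the user confirm,
    modify, or request another suggestion."""
    return call_llm(build_naics_prompt(industry_description))

# Stubbed model call for illustration.
fake_llm = lambda prompt: "541511 - Custom Computer Programming Services"
suggestion = suggest_code("We build custom software for clients", fake_llm)
```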

Enhancing RAG with a Database

The app includes a reference table with over 1,000 NAICS codes to ensure accuracy. Users can retrieve multiple suggestions; if none are suitable, they’re prompted to select a code from the table manually. This fallback mechanism combines the best of AI-driven insight and structured data retrieval.
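
A sketch of that fallback lookup, using an in-memory SQLite table as a stand-in for the app's code reference table (table name, column names, and sample rows are assumptions):

```python
import sqlite3

# Fallback lookup sketch: if no LLM suggestion fits, let the user
# search the reference table directly. Names here are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE naics_code (code TEXT PRIMARY KEY, description TEXT)")
conn.executemany(
    "INSERT INTO naics_code VALUES (?, ?)",
    [("541511", "Custom Computer Programming Services"),
     ("541512", "Computer Systems Design Services"),
     ("445110", "Supermarkets and Other Grocery Stores")],
)

def search_codes(keyword: str) -> list[tuple[str, str]]:
    """Let the user browse codes whose description matches a keyword."""
    cur = conn.execute(
        "SELECT code, description FROM naics_code "
        "WHERE description LIKE ? ORDER BY code",
        ("%" + keyword + "%",),
    )
    return cur.fetchall()
```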

When dealing with large volumes of data, vector databases are often used to efficiently manage and retrieve information. These databases store data as high-dimensional vectors, enabling fast and accurate similarity searches. This approach is especially beneficial in RAG systems where embeddings of textual data are calculated and stored, facilitating precise matches during the retrieval step.
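
As a toy illustration of the similarity search, the sketch below compares hand-made three-dimensional vectors with cosine similarity; a real RAG system would compute embeddings with a model and query a vector database:

```python
from math import sqrt

# Toy similarity search with hand-made vectors standing in for
# learned embeddings.
docs = {
    "custom programming": [1.0, 0.2, 0.0],
    "grocery stores":     [0.0, 0.1, 1.0],
}

def cosine(a, b):
    """Cosine similarity of two nonzero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm

def nearest(query_vec):
    """Return the stored document most similar to the query vector."""
    return max(docs, key=lambda name: cosine(query_vec, docs[name]))
```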

Fine-tuning an LLM is sometimes described as achieving a similar end by different means. Strictly speaking it is not RAG, because nothing is retrieved at query time; instead, training the model on specific data embeds that knowledge directly in its weights. The fine-tuned model can then generate responses tailored to a particular domain or task, approximating the retrieval-augmentation process through what it has learned.

kDS-DSD uses fine-tuning when analyzing and summarizing interview answers. The process has the LLM break down answers by topics such as data sources, flows, usage, quality, and sentiment -- a good topic for a future post.

Designing RAG-Ready Database Tables

To support RAG functionalities, the data architecture must integrate retrieved and generated items seamlessly. For instance, the stage.parent table used to store parent organization data includes fields for the industry code and additional metadata:

CREATE TABLE IF NOT EXISTS stage.parent
(
    name_   VARCHAR(96) NOT NULL,
    organization_type  VARCHAR(32),
    stock_symbol  VARCHAR(8),
    product_service  VARCHAR(128),
    annual_revenue  VARCHAR(48),
    employee_total  VARCHAR(48),
    website_  VARCHAR(92),
    location_  VARCHAR(48),
    source_  VARCHAR(96) NOT NULL,
    create_date date NOT NULL,
    created_by  VARCHAR(92) NOT NULL,
    modified_date date,
    modified_by  VARCHAR(92),
    rag_industry_code  VARCHAR(24) NOT NULL,
    rag_industry_description  TEXT,
    rag_description_rationale  TEXT,
    rag_industry_source  VARCHAR(96),
    rag_industry_create_date date,
    rag_industry_created_by  VARCHAR(96),
    CONSTRAINT pk_stage_parent PRIMARY KEY (name_)
);

This design embeds RAG outputs alongside the source data, ensuring traceability and allowing both automated and manual validation of industry codes.
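
For illustration, here is how a row carrying the RAG fields might be written, using SQLite and a trimmed-down version of the table; the column subset and all values are illustrative, not the app's actual data:

```python
import sqlite3
from datetime import date

# Trimmed-down stand-in for stage.parent (SQLite has no schemas, so
# the table is just "parent" here). Values are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE parent (
        name_                      TEXT PRIMARY KEY,
        source_                    TEXT NOT NULL,
        create_date                TEXT NOT NULL,
        created_by                 TEXT NOT NULL,
        rag_industry_code          TEXT NOT NULL,
        rag_industry_description   TEXT,
        rag_description_rationale  TEXT,
        rag_industry_source        TEXT
    )""")

# Store the LLM's suggestion next to the source data, keeping the
# rationale and provenance for later validation.
conn.execute(
    "INSERT INTO parent VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    ("Acme Corp", "interview", date.today().isoformat(), "analyst",
     "541511", "Custom Computer Programming Services",
     "Description mentions bespoke software development",
     "llm-suggestion"),
)
row = conn.execute(
    "SELECT rag_industry_code, rag_industry_source FROM parent"
).fetchone()
```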

Conclusion

RAG represents a powerful fusion of retrieval and generative technologies, enabling smarter, more context-aware systems. By combining the strengths of large language models with structured data retrieval, it bridges the gap between AI creativity and factual accuracy. In the kDS-DSD app, we’ve integrated these principles into NAICS code generation and plan to extend them to more complex tasks like summarizing interview data by topics such as data sources, flows, and sentiment. As we refine this design, we aim to give users smarter tools and better insights.

Thanks for stopping by. This is the last post of 2024. We'll talk again in 2025. Have a great holiday and a wonderful new year! Peace.
