MessyAi: A LlamaIndex Text-to-SQL RAG PoC for SMB Analytics

Luca Zosso | Jun 10, 2024 min read

Small and mid-sized teams do not lack data. They lack a safe, fast path from a plain English question to a correct aggregate or filter over their tables. Spreadsheets scale poorly. A full BI rollout is heavy. In the early wave of LLM tooling, the obvious experiment was Text-to-SQL: let the model read the schema, write SELECT, run it, narrate the result.

This repository, MessyAi (Text2SQL_RAG_POC_SMBs), is a proof of concept I developed in the context of the Welcome Back IE Venture Lab. It implements a classic RAG-style Text-to-SQL stack: pre-processing (clean CSVs, load into SQLite, LLM-authored table summaries), retrieval (pick the most relevant tables and enriched schema text), and generation (GPT-4 proposes SQL, a custom parser extracts the query, SQLite runs it, a second prompt synthesizes the natural language answer). The README documents a graded evaluation (five difficulty levels) and reports strong accuracy on the first three levels, where most user questions were expected to land.

I am writing this in 2024, just after finishing the project. When this PoC was built, the dominant pattern was still “retrieve schema-ish text, then free-form SQL.” The article closes with a concrete modernization path — what a semantic layer would add on top.

Code: https://github.com/LucaZoss/Text2SQL_RAG_POC_SMBs


Prerequisites

  • Python 3.10+ and pip (or your preferred env tool).
  • An OpenAI API key in a .env file (see repo README).
  • Basic SQL and comfort reading LlamaIndex QueryPipeline graphs.
  • Optional: Streamlit path via text2sql_loader_st.py if you want a UI wrapper.

The problem: SMB questions over messy, multi-table CSVs

Operational analytics questions rarely arrive as SELECT ... GROUP BY. They arrive as “B2B last week,” “RFM high value,” or “orders when it rained.” Under the hood you still need correct table choice, column names, and dialect (here SQLite). Raw schema dumps help models, but they are noisy and long. RAG over table metadata was the early fix: summarize each table once, embed those summaries, retrieve a small set of candidates, then condition SQL generation on that slice only.

That is exactly the design this repo encodes.


Historical context: early Text-to-SQL RAG

At the time this style of system went mainstream in demos and coursework, the novelty was composition: not a single prompt with the whole information schema, but a pipeline with retrieval in the middle. Vendors and papers now discuss verified execution, SQL repair loops, fine-tuned models, and semantic layers. This codebase sits in the first generation: trust retrieval plus a capable LLM, validate mostly by “does the query run,” and evaluate with human-labeled question tiers.

Naming that era explicitly is useful for hiring and portfolio readers. It signals you know what you would not ship unchanged today without trashing the learning value of the PoC.


Architecture in three stages

Plain text flow:

CSV files
↓
LLM table summaries (JSON) + SQLite tables in memory
↓
Vector retrieval of relevant SQLTableSchema objects
↓
Text-to-SQL LLM + parse SQLQuery ... strip SQLResult
↓
SQL retriever executes + synthesis LLM answers the user

Pre-processing: LLM-authored table capsules

The TableInfo model forces a machine-friendly table name and a short summary. process_tables either loads cached JSON from tableinfo_dir or calls LLMTextCompletionProgram with GPT-4 until the suggested table_name is unique.

# text2sql_rag/text2sql_loader.py
program = LLMTextCompletionProgram.from_defaults(
    output_cls=TableInfo,
    llm=OpenAI(model="gpt-4", openai_api_key=OPENAI_API_KEY),
    prompt_template_str=self.prompt_str()
)
table_info = program(
    table_str=df_str,
    exclude_table_name_list=str(list(table_names)),
)

That step is the R in RAG for tables: you are not embedding raw rows at scale first, you are embedding curated descriptions tied to schema objects. It was a practical compromise when full-document RAG over every cell was too expensive.


Retrieval and generation: QueryPipeline as the contract

setup_query_pipeline wires LlamaIndex modules: SQLDatabase, SQLTableNodeMapping, ObjectIndex with VectorStoreIndex, retriever with similarity_top_k=5, SQLRetriever, and a hand-built SQL string parser on the model output.

# text2sql_rag/text2sql_loader.py
self.qp = QP(
    modules={
        "input": InputComponent(),
        "table_retriever": self.obj_retriever,
        "table_output_parser": table_parser_component,
        "text2sql_prompt": text2sql_prompt,
        "text2sql_llm": llm,
        "sql_output_parser": sql_parser_component,
        "sql_retriever": self.sql_retriever,
        "response_synthesis_prompt": response_synthesis_prompt,
        "response_synthesis_llm": llm,
    },
    verbose=True,
)

The parser is brittle by design (string markers like SQLQuery: and SQLResult:), which was typical before stronger structured output and AST validators became common talking points.

# text2sql_rag/text2sql_loader.py
def parse_response_to_sql(self, response: ChatResponse) -> str:
    response = response.message.content
    sql_query_start = response.find("SQLQuery:")
    if sql_query_start != -1:
        response = response[sql_query_start:]
        if response.startswith("SQLQuery:"):
            response = response[len("SQLQuery:"):]
    sql_result_start = response.find("SQLResult:")
    if sql_result_start != -1:
        response = response[:sql_result_start]
    return response.strip().strip("```").strip()

Evaluation

The project includes notebooks and scripts such as rag_eval_vf.py for batch runs and visual assets under text2sql_rag/evaluation/. The README states that most user questions fall in the first three difficulty levels and cites about 91% average accuracy there, with charts for cumulative averages and accuracy drop at harder levels.


What I would add today: a semantic layer on top

Early Text-to-SQL RAG put the model too close to physical tables. A semantic layer sits between the business question and SQL generation and defines metrics (revenue, orders, churn), dimensions (date, channel, postcode), allowed joins, and grain. Tools and patterns in that space include dbt metrics, LookML-style models, MetricFlow, Cube, or a thin internal YAML that compiles to SQL templates.

Concretely for this repo, a semantic layer would:

  1. Shrink retrieval: embed and retrieve metrics and dimensions, not only free-text table summaries, so the LLM chooses “official” definitions.
  2. Constrain SQL: generate against a known join graph or emit parameterized templates instead of arbitrary FROM clauses.
  3. Improve trust: attach lineage and versioned definitions so “B2B revenue” means one thing in the board deck and the chat UI.

Conclusion

MessyAi is a clear artifact of early Text-to-SQL RAG: SQLite and SQLAlchemy for a portable demo, GPT-4 for both table summarization and SQL, LlamaIndex QueryPipeline for orchestration, and a graded eval story aimed at SMB-style questions. Lead with the business loop (Venture Lab, SMB analytics, messy joins in CSV land) and close with what you would ship now — semantic layer, stricter SQL validation, production auth and logging.

Repo: https://github.com/LucaZoss/Text2SQL_RAG_POC_SMBs