Module 19 · Section 19.5

Structured Data & Text-to-SQL

Enabling LLMs to query databases, process tabular data, and combine structured and unstructured retrieval for comprehensive answers
★ Big Picture

Most enterprise knowledge lives in databases, not documents. Sales figures, inventory counts, customer records, and financial data are stored in relational databases and spreadsheets. Text-to-SQL enables users to query these structured data sources using natural language, bridging the gap between business questions and database queries. Combined with traditional document-based RAG, this creates hybrid retrieval systems that can answer questions requiring both factual context and precise numerical data.

1. LLMs and Tabular Data

Tabular data presents unique challenges for LLMs. Unlike free-form text, tables have strict row-column structure, typed values, and relational constraints (foreign keys, uniqueness). LLMs can process small tables directly in their context window, but larger datasets require the model to generate executable queries (SQL, pandas) that retrieve the relevant information.

1.1 Direct Table Reasoning

For small tables (fewer than 50 rows), LLMs can reason over tabular data presented directly in the prompt. The key is formatting: presenting tables as Markdown or structured text with clear column headers helps the model understand the schema. However, this approach fails for large datasets because of context window limits and because LLM arithmetic on many rows is unreliable.

import pandas as pd
from openai import OpenAI

client = OpenAI()

def ask_about_table(df: pd.DataFrame, question: str):
    """Ask a question about a small DataFrame directly."""
    # Convert to markdown for clear formatting
    table_md = df.to_markdown(index=False)

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": "You are a data analyst. Answer questions about "
                       "the provided table. Show your reasoning."
        }, {
            "role": "user",
            "content": f"Table:\n{table_md}\n\nQuestion: {question}"
        }]
    )
    return response.choices[0].message.content

2. Text-to-SQL: Architecture

Text-to-SQL systems convert natural language questions into SQL queries that can be executed against a database. The core challenge is mapping the user's informal language to the precise syntax of SQL while correctly identifying which tables, columns, and joins are needed. Modern LLM-based text-to-SQL pipelines achieve over 85% execution accuracy on the Spider benchmark by combining schema linking, few-shot examples, and error correction, though harder benchmarks such as BIRD remain well below that (see the benchmark table in Section 4).

Figure 19.12: Text-to-SQL pipeline: schema linking maps natural language to database elements, the LLM generates SQL, execution returns results, and errors trigger regeneration.

2.1 Schema Linking

Schema linking is the process of identifying which database tables and columns are relevant to the user's question. This is often the hardest step because users rarely use exact column names. A user asking about "revenue" might mean the total_amount column in the orders table, or the annual_revenue column in the companies table.

def get_schema_context(db_connection, relevant_tables=None):
    """Extract schema information for the LLM prompt."""
    schema_parts = []

    # Get all tables or filter to relevant ones
    tables = relevant_tables or get_all_tables(db_connection)

    for table in tables:
        columns = get_columns(db_connection, table)
        col_info = []
        for col in columns:
            col_str = f"  {col['name']} {col['type']}"
            if col.get("primary_key"):
                col_str += " PRIMARY KEY"
            if col.get("foreign_key"):
                col_str += f" REFERENCES {col['foreign_key']}"
            col_info.append(col_str)

        # Include sample values for disambiguation
        samples = get_sample_values(db_connection, table, limit=3)

        schema_parts.append(
            f"CREATE TABLE {table} (\n"
            + ",\n".join(col_info)
            + f"\n);\n-- Sample rows: {samples}"
        )

    return "\n\n".join(schema_parts)
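The helpers above (`get_all_tables`, `get_columns`, `get_sample_values`) are left abstract because their details depend on the database. As one concrete possibility, a minimal SQLite implementation using the standard PRAGMA introspection statements might look like this:

```python
import sqlite3

def get_all_tables(conn):
    """List user tables, excluding SQLite's internal tables."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type='table' AND name NOT LIKE 'sqlite_%'").fetchall()
    return [r[0] for r in rows]

def get_columns(conn, table):
    """Column metadata via PRAGMA table_info / foreign_key_list."""
    # foreign_key_list rows: (id, seq, ref_table, from_col, to_col, ...)
    fks = {row[3]: f"{row[2]}({row[4]})"
           for row in conn.execute(f"PRAGMA foreign_key_list({table})")}
    cols = []
    for cid, name, ctype, notnull, default, pk in conn.execute(
            f"PRAGMA table_info({table})"):
        cols.append({"name": name, "type": ctype,
                     "primary_key": bool(pk),
                     "foreign_key": fks.get(name)})
    return cols

def get_sample_values(conn, table, limit=3):
    """Fetch a few rows to ground the LLM's sense of actual values."""
    return conn.execute(f"SELECT * FROM {table} LIMIT {limit}").fetchall()
```

Note that `table` is interpolated directly into the PRAGMA and SELECT statements; that is safe here only because the table names come from `sqlite_master`, not from user input.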

2.2 SQL Generation with Few-Shot Examples

def text_to_sql(question, schema_context, examples=None):
    """Generate SQL from natural language question."""

    few_shot = ""
    if examples:
        few_shot = "\n\nExamples:\n" + "\n".join([
            f"Q: {ex['question']}\nSQL: {ex['sql']}"
            for ex in examples
        ])

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": f"""You are a SQL expert. Generate a SQL query to
answer the user's question based on this schema:

{schema_context}
{few_shot}

Rules:
- Use only tables and columns from the schema
- Return ONLY the SQL query, no explanation
- Use standard SQL compatible with PostgreSQL
- Add LIMIT 100 to prevent unbounded results
- Handle NULLs appropriately"""
        }, {
            "role": "user",
            "content": question
        }],
        temperature=0.0
    )

    sql = response.choices[0].message.content.strip()
    # Remove markdown code fences if present
    sql = sql.replace("```sql", "").replace("```", "").strip()
    return sql
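The string-replace fence stripping above handles the common case but can mangle a reply that contains backticks elsewhere. A slightly more robust sketch extracts only the first fenced block, falling back to the raw text:

```python
import re

# Match the first ``` or ```sql fenced block (fence written as `{3}
# in the pattern to keep the source readable)
FENCE = re.compile(r"`{3}(?:sql)?\s*(.*?)`{3}", re.DOTALL | re.IGNORECASE)

def extract_sql(text: str) -> str:
    """Return SQL from inside a markdown code fence, or the raw reply."""
    m = FENCE.search(text)
    return (m.group(1) if m else text).strip()
```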
ⓘ Few-Shot Example Selection

The choice of few-shot examples dramatically affects text-to-SQL accuracy. The most effective strategy is to embed the user's question and retrieve the most similar examples from a curated example bank using vector similarity. This "dynamic few-shot" approach ensures that the examples shown to the LLM are maximally relevant to the current question, improving accuracy by 10 to 20% compared to fixed examples on complex queries.
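The retrieval step itself is plain cosine similarity over a pre-embedded example bank. In this sketch, each bank entry carries a `vec` field holding a question embedding produced by whatever embedding model you use; the field names are illustrative, not a fixed schema:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def select_few_shot(question_vec, example_bank, k=3):
    """Return the k bank examples whose embedded questions are
    most similar to the embedded user question."""
    ranked = sorted(example_bank,
                    key=lambda ex: cosine(question_vec, ex["vec"]),
                    reverse=True)
    return [{"question": ex["question"], "sql": ex["sql"]}
            for ex in ranked[:k]]
```

The returned list plugs directly into the `examples` parameter of `text_to_sql` above.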

3. Error Correction and Self-Healing Queries

Even the best LLMs generate incorrect SQL queries. Common errors include referencing nonexistent columns, incorrect join conditions, missing GROUP BY clauses, and type mismatches. A robust text-to-SQL pipeline includes an error correction loop that catches execution failures, sends the error message back to the LLM along with the original query, and asks for a corrected version.

import sqlite3

def execute_with_retry(question, schema_context, conn,
                       max_retries=3):
    """Execute SQL with automatic error correction.

    `conn` is an open DB-API connection, e.g. sqlite3.connect(path).
    """
    error_history = []

    for attempt in range(max_retries):
        # Generate SQL (include errors from prior attempts)
        if error_history:
            error_context = "\n".join([
                f"Attempt {e['attempt']}: {e['sql']}\n"
                f"Error: {e['error']}"
                for e in error_history
            ])
            question_with_context = (
                f"{question}\n\nPrevious failed attempts:\n"
                f"{error_context}\n\nFix the errors."
            )
        else:
            question_with_context = question

        sql = text_to_sql(question_with_context, schema_context)

        try:
            cursor = conn.execute(sql)
            columns = [desc[0] for desc in cursor.description]
            rows = cursor.fetchall()
            return {
                "sql": sql,
                "columns": columns,
                "rows": rows,
                "attempts": attempt + 1
            }
        except Exception as e:
            error_history.append({
                "attempt": attempt + 1,
                "sql": sql,
                "error": str(e)
            })

    return {"error": "Failed after max retries",
            "history": error_history}
⚠ SQL Injection and Safety

Allowing an LLM to generate and execute SQL queries introduces serious security risks. Always enforce these safeguards: (1) use a read-only database connection (no INSERT, UPDATE, DELETE, or DROP), (2) set query timeouts to prevent long-running queries, (3) limit result set sizes, (4) validate generated SQL against an allowlist of permitted operations before execution, and (5) run queries against a replica or snapshot, never the production database directly.
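As a defense-in-depth complement to safeguard (4), a crude allowlist validator might look like the sketch below. It is deliberately conservative and is not a substitute for database-level permissions: a read-only connection remains the real enforcement.

```python
import re

# Write/DDL verbs that should never appear in a generated query
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|truncate|"
    r"grant|attach|pragma)\b", re.IGNORECASE)

def is_safe_select(sql: str) -> bool:
    """Allowlist check: a single SELECT (or WITH) statement with
    no write or DDL verbs anywhere in it."""
    stripped = sql.strip().rstrip(";").strip()
    if ";" in stripped:                      # reject multi-statement input
        return False
    if not re.match(r"(?i)(select|with)\b", stripped):
        return False
    return FORBIDDEN.search(stripped) is None
```

A keyword check like this will reject some legitimate queries (e.g., a column literally named `update`), which is the right trade-off for an allowlist: fail closed, not open.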

4. Text-to-SQL Benchmarks

Standard benchmarks measure text-to-SQL system performance across databases of varying complexity. The two most widely used benchmarks are Spider and BIRD, which test different aspects of the text-to-SQL challenge.

| Benchmark | Databases | Questions | Key Challenge | SOTA Accuracy |
|-----------|-----------|-----------|---------------|---------------|
| Spider | 200 databases | 10,181 | Cross-database generalization | ~87% (execution) |
| BIRD | 95 databases | 12,751 | Real-world complexity, external knowledge | ~72% (execution) |
| WikiSQL | 26,521 tables | 80,654 | Single-table, simpler queries | ~93% (execution) |
| SParC | 200 databases | 4,298 | Multi-turn conversational SQL | ~70% (execution) |

5. CSV and Spreadsheet Processing

Not all structured data lives in databases. CSV files, Excel spreadsheets, and Google Sheets are ubiquitous in business environments. LLMs can process these by either loading them into an in-memory database (SQLite) for SQL queries or generating pandas code for direct manipulation.

import pandas as pd
import sqlite3

def csv_to_queryable(csv_path, table_name="data"):
    """Load a CSV into SQLite for text-to-SQL querying."""
    df = pd.read_csv(csv_path)

    # Create in-memory database
    conn = sqlite3.connect(":memory:")
    df.to_sql(table_name, conn, index=False)

    # Generate schema description
    schema = f"Table: {table_name}\nColumns:\n"
    for col in df.columns:
        dtype = str(df[col].dtype)
        sample = df[col].dropna().head(3).tolist()
        schema += f"  {col} ({dtype}) -- e.g., {sample}\n"

    return conn, schema


# Usage
conn, schema = csv_to_queryable("sales_data.csv", "sales")
result = execute_with_retry(
    "What are the top 5 products by total revenue this quarter?",
    schema, conn
)

6. Hybrid Structured/Unstructured Retrieval

Many real-world questions require combining structured data (from databases) with unstructured context (from documents). For example, "Why did Q3 revenue decline for our enterprise segment?" requires both the numerical revenue data from a database and contextual explanations from quarterly reports, meeting notes, or market analyses.

Figure 19.13: Hybrid retrieval combines precise numerical data from databases with contextual explanations from documents to produce comprehensive answers.
★ Key Insight

The most common pattern for hybrid retrieval is to route the question to both a text-to-SQL pipeline and a document RAG pipeline in parallel, then combine both result sets in the LLM prompt. The structured data provides the "what" (exact numbers, trends, comparisons), while the unstructured data provides the "why" (explanations, context, causal factors). This combination produces answers that are both precise and insightful.
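The combination step can be sketched as a prompt builder that interleaves both result sets. This assumes the result dict shape returned by `execute_with_retry` and a list of document chunks from the RAG pipeline; the wording of the instructions is illustrative:

```python
def build_hybrid_prompt(question, sql_result, doc_chunks):
    """Merge structured query results and retrieved passages into
    a single synthesis prompt for the LLM."""
    header = " | ".join(sql_result["columns"])
    rows = "\n".join(" | ".join(str(v) for v in row)
                     for row in sql_result["rows"])
    docs = "\n---\n".join(doc_chunks)
    return (
        f"Question: {question}\n\n"
        f"Database results:\n{header}\n{rows}\n\n"
        f"Relevant documents:\n{docs}\n\n"
        "Use the database results for exact figures (the 'what') and "
        "the documents for explanation (the 'why'). Quote numbers "
        "exactly as given."
    )
```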

7. Multi-Table Joins and Complex Queries

The hardest text-to-SQL queries involve multiple tables with complex join conditions, nested subqueries, window functions, and conditional aggregations. These queries require the LLM to understand the database's relational structure: which tables connect to which, through what foreign keys, and with what cardinality.

ⓘ Improving Multi-Table Accuracy

Three techniques significantly improve accuracy on complex multi-table queries: (1) Schema descriptions: annotating each column with a human-readable description of what it contains and how it relates to business concepts. (2) Join hints: explicitly listing common join patterns (e.g., "orders.customer_id joins to customers.id"). (3) Chain-of-thought SQL: asking the LLM to first write out its reasoning about which tables and joins are needed before generating the SQL, which reduces errors on queries requiring 3+ table joins.
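The first two techniques amount to enriching the schema text passed to the LLM. A minimal sketch, where the argument shapes (a dict of column descriptions and a list of join-hint strings) are illustrative rather than a fixed API:

```python
def annotate_schema(schema_context, column_docs, join_hints):
    """Append human-readable column descriptions and common join
    patterns to the schema prompt as SQL comments.

    column_docs: e.g. {"orders.total_amount": "order value in USD"}
    join_hints:  e.g. ["orders.customer_id joins to customers.id"]
    """
    doc_lines = [f"-- {col}: {desc}" for col, desc in column_docs.items()]
    hint_lines = [f"-- {hint}" for hint in join_hints]
    return (schema_context
            + "\n\n-- Column descriptions:\n" + "\n".join(doc_lines)
            + "\n\n-- Common join patterns:\n" + "\n".join(hint_lines))
```

The annotated string drops into the `schema_context` parameter of `text_to_sql` unchanged, since the additions are plain SQL comments.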

Section 19.5 Quiz

1. Why is direct table reasoning in the LLM context window unreliable for large datasets?
Two main reasons: (1) Context window limits prevent including large tables (a table with 10,000 rows would consume most of the context budget), and (2) LLMs perform unreliable arithmetic over many rows. They may miscount, miscalculate sums, or skip rows. For any dataset beyond roughly 50 rows, generating executable code (SQL or pandas) that computes the answer precisely is far more reliable.
2. What is schema linking and why is it the hardest step in text-to-SQL?
Schema linking maps natural language terms in the user's question to specific database tables and columns. It is the hardest step because users rarely use exact column names. "Revenue" might mean total_amount, annual_revenue, or sales_total depending on context. The system must disambiguate these mappings using table descriptions, column types, sample values, and the overall question context. Errors in schema linking cascade into incorrect SQL generation.
3. How does the error correction loop improve text-to-SQL reliability?
When a generated SQL query fails execution (syntax error, missing column, type mismatch), the error message is sent back to the LLM along with the failed query and the original question. The LLM can then diagnose the specific error and generate a corrected query. This retry loop typically runs for up to 3 attempts and catches most common errors (wrong column names, missing GROUP BY, incorrect joins), significantly improving end-to-end execution accuracy.
4. What security measures are essential for production text-to-SQL systems?
Five essential safeguards: (1) Read-only database connections (prevent INSERT, UPDATE, DELETE, DROP), (2) query timeouts to prevent long-running queries from overloading the database, (3) result set size limits to prevent memory exhaustion, (4) SQL validation against an allowlist of permitted operations before execution, and (5) running queries against a database replica or snapshot rather than the production database.
5. When should you use hybrid structured/unstructured retrieval instead of text-to-SQL alone?
Use hybrid retrieval when the question requires both precise data and contextual explanation. Questions like "Why did Q3 revenue decline?" need numerical revenue data from a database (the "what") combined with contextual explanations from quarterly reports, meeting notes, or market analyses (the "why"). Text-to-SQL alone can provide the numbers but cannot explain causation or context that exists only in unstructured documents.

Key Takeaways