ROSE: An Intent-Centered Evaluation Metric for NL2SQL

ROSE is an intent-centered evaluation metric for NL2SQL systems. It assesses whether generated SQL reflects the user's intent, rather than only whether its result set happens to match that of a gold query. This addresses the limitations of Execution Accuracy (EX) and gives a more robust model assessment.

Intermediate | 30 min | 3 steps

The play

Understand Execution Accuracy (EX) Limitations
Review your current NL2SQL evaluation pipelines. Identify scenarios where a syntactically different but semantically equivalent SQL query might fail EX (a false negative), or where an EX pass might not truly reflect user intent because the two queries happen to return the same rows on the test database (a false positive). Recognize that EX is sensitive to surface differences in the result set, such as column order, and that a single gold query cannot credit more than one valid interpretation of an ambiguous question.
Grasp ROSE's Core Intent-Centered Concept
Familiarize yourself with the principles of intent-centered evaluation. Consider how you would define and measure 'intent correspondence' between a natural language query, a gold SQL query, and a generated SQL query. This involves moving beyond simple string or result-set comparison to evaluate the underlying semantic goal.
Design a Semantic Evaluation Framework
Integrate ROSE conceptually into your evaluation strategy by designing a framework for semantic comparison. This involves deriving a canonical semantic representation for both gold and generated SQL (e.g., a logical form or abstract syntax tree) and then developing a mechanism to compare these representations for equivalence, focusing on core operations like selection, filtering, aggregation, and joining. Consider combining ROSE with EX for a hybrid evaluation.
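
The framework described in step 3 can be sketched roughly as follows. This is a toy illustration under stated assumptions, not ROSE's actual procedure: `QueryIntent`, `canonicalize`, and `same_intent` are hypothetical names, and the regular expression only handles single-table queries of the shape SELECT ... FROM ... [WHERE ...] [ORDER BY ...]. A real framework would likely rely on a proper SQL parser. Treating the order of projected columns as irrelevant is also an assumption about what counts as the same intent.

```python
import re
from typing import NamedTuple


class QueryIntent(NamedTuple):
    """Hypothetical canonical form of a simple single-table query."""
    columns: frozenset  # projected columns; order deliberately ignored
    table: str
    where: str          # filter text with whitespace collapsed ('' if absent)
    order_by: tuple     # ((column, direction), ...); order matters here


# Assumption: only SELECT ... FROM <table> [WHERE ...] [ORDER BY ...] is supported.
_PATTERN = re.compile(
    r"^\s*SELECT\s+(?P<cols>.+?)\s+FROM\s+(?P<table>\w+)"
    r"(?:\s+WHERE\s+(?P<where>.+?))?"
    r"(?:\s+ORDER\s+BY\s+(?P<order>.+?))?\s*;?\s*$",
    re.IGNORECASE | re.DOTALL,
)


def canonicalize(sql: str) -> QueryIntent:
    match = _PATTERN.match(sql)
    if match is None:
        raise ValueError(f"Unsupported query shape: {sql!r}")
    columns = frozenset(c.strip().lower() for c in match["cols"].split(","))
    # Collapse whitespace only; a real normalizer would parse the predicate.
    where = " ".join((match["where"] or "").split())
    order_by = []
    for term in (match["order"] or "").split(","):
        parts = term.split()
        if not parts:
            continue
        # ASC is the default sort direction in SQL, so make it explicit.
        direction = parts[1].upper() if len(parts) > 1 else "ASC"
        order_by.append((parts[0].lower(), direction))
    return QueryIntent(columns, match["table"].lower(), where, tuple(order_by))


def same_intent(gold_sql: str, generated_sql: str) -> bool:
    """Compare two queries by canonical form instead of by result set."""
    return canonicalize(gold_sql) == canonicalize(generated_sql)


gold = "SELECT name, price FROM products WHERE price > 50 ORDER BY name"
print(same_intent(gold, "SELECT price, name FROM products WHERE price > 50 ORDER BY name"))
print(same_intent(gold, "SELECT name FROM products WHERE price > 50 ORDER BY name"))
```

Under these assumptions, the reordered-column query compares equal to the gold query while the query that drops `price` does not. For a hybrid evaluation, a check like this could be run alongside EX and disagreements between the two flagged for review.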

Starter code

import sqlite3

def get_db_connection():
    conn = sqlite3.connect(':memory:')
    cursor = conn.cursor()
    cursor.execute("CREATE TABLE products (id INTEGER, name TEXT, price REAL)")
    cursor.execute("INSERT INTO products VALUES (1, 'Laptop', 1200.00)")
    cursor.execute("INSERT INTO products VALUES (2, 'Mouse', 25.00)")
    cursor.execute("INSERT INTO products VALUES (3, 'Keyboard', 75.00)")
    conn.commit()
    return conn

def execute_and_fetch_results(conn, sql_query):
    """Run a query and return its rows, or an error string if execution fails."""
    cursor = conn.cursor()
    try:
        cursor.execute(sql_query)
        # Sort rows so the comparison ignores row order.
        # Note: this also discards any ORDER BY in the query.
        return sorted(cursor.fetchall())
    except sqlite3.Error as e:
        return f"Error: {e}"

if __name__ == "__main__":
    conn = get_db_connection()

    gold_sql = "SELECT name, price FROM products WHERE price > 50 ORDER BY name"
    # EX will pass: ASC is the default sort direction, so the result is identical
    generated_sql_1 = "SELECT name, price FROM products WHERE price > 50 ORDER BY name ASC"
    # EX will fail: different columns in result, different intent
    generated_sql_2 = "SELECT name FROM products WHERE price > 50 ORDER BY name"
    # EX will fail: result column order different, but same underlying intent
    generated_sql_3 = "SELECT price, name FROM products WHERE price > 50 ORDER BY name"

    print("--- Demonstrating Execution Accuracy (EX) ---")

    gold_results = execute_and_fetch_results(conn, gold_sql)
    print(f"Gold SQL Results: {gold_results}\n")

    results_1 = execute_and_fetch_results(conn, generated_sql_1)
    print(f"Generated SQL 1 Results: {results_1}")
    print(f"Match Gold (EX): {gold_results == results_1}\n")

    results_2 = execute_and_fetch_results(conn, generated_sql_2)
    print(f"Generated SQL 2 Results: {results_2}")
    print(f"Match Gold (EX): {gold_results == results_2}\n")

    results_3 = execute_and_fetch_results(conn, generated_sql_3)
    print(f"Generated SQL 3 Results: {results_3}")
    print(f"Match Gold (EX): {gold_results == results_3}\n")

    conn.close()

    print("\n--- ROSE (Conceptual Difference) ---")
    print("ROSE would evaluate generated_sql_3 as semantically equivalent to gold_sql,")
    print("even though EX fails because the result columns come back in a different order.")
    print("It focuses on whether the *intent* (name and price of products priced above 50, ordered by name) is met.")
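
The starter code shows EX rejecting a query that meets the intent. The opposite failure from step 1, an EX pass caused by a data-specific coincidence, can be sketched the same way. The helper names `build_products_db` and `ex_match` and the extra "Monitor" row are hypothetical, introduced only for illustration.

```python
import sqlite3


def build_products_db(extra_rows=()):
    """Create the same in-memory products table, optionally with extra rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE products (id INTEGER, name TEXT, price REAL)")
    rows = [(1, "Laptop", 1200.00), (2, "Mouse", 25.00), (3, "Keyboard", 75.00)]
    conn.executemany("INSERT INTO products VALUES (?, ?, ?)", rows + list(extra_rows))
    conn.commit()
    return conn


def ex_match(conn, gold_sql, generated_sql):
    """Plain EX check: do both queries return exactly the same rows?"""
    return conn.execute(gold_sql).fetchall() == conn.execute(generated_sql).fetchall()


gold_sql = "SELECT name, price FROM products WHERE price > 50 ORDER BY name"
# Wrong threshold: a different intent from the gold query.
wrong_sql = "SELECT name, price FROM products WHERE price > 70 ORDER BY name"

original = build_products_db()
# Hypothetical extra row priced between the two thresholds.
perturbed = build_products_db(extra_rows=[(4, "Monitor", 60.00)])

coincidental_pass = ex_match(original, gold_sql, wrong_sql)
still_passes = ex_match(perturbed, gold_sql, wrong_sql)

print(f"EX on original data:  {coincidental_pass}")
print(f"EX on perturbed data: {still_passes}")

original.close()
perturbed.close()
```

On the original three rows no product is priced between 50 and 70, so both queries return the same rows and EX passes despite the wrong threshold. Adding one row in that range separates them, which is the kind of false positive an intent-centered check is meant to catch without depending on the test data.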