General text-to-SQL solutions work well on familiar use cases such as an employee database, but they break down when the natural language query uses domain- or enterprise-specific terminology, or when table and column names do not clearly describe the entities they represent.
This solution provides a structured way to define your database schemas and domain-specific terms so that correct SQL queries can be formed from natural language questions.
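For instance, the domain context could be declared as plain Python data. The class and field names below are illustrative, not the solution's actual API:
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class DomainContext:
    # Maps business phrasing used in questions to canonical column names or conditions
    term_mappings: Dict[str, str] = field(default_factory=dict)
    # Describes each table and its columns so prompts can reference real names
    schema: Dict[str, List[str]] = field(default_factory=dict)

credit_domain = DomainContext(
    term_mappings={
        "credit worthiness": "credit_score",
        "creditworthiness": "credit_score",
        "late payments": "payment_status = 'Late'",
    },
    schema={
        "customer_credit": ["customer_id", "credit_score"],
        "payment_history": ["customer_id", "payment_status", "payment_date"],
    },
)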
- Application Architecture
The application follows a well-structured pipeline architecture that processes queries through several specialized components. Think of it like an assembly line where each station performs a specific task to transform the raw material (natural language) into the final product (SQL query).
- Core Components in Detail
A. QueryTranslator: This is the orchestrator of the entire process. Like a conductor in an orchestra, it coordinates all other components to process the query. Here's how it works:
async def translate_to_sql(self, natural_query: str) -> Tuple[str, dict]:
    # Step 1: Preprocess the query
    processed_query = self.preprocessor.preprocess_query(natural_query, self.domain_context)

    # Step 2: Analyze query intent
    query_intent = self.semantic_analyzer.analyze_query(processed_query, self.domain_context)

    # Step 3: Find similar terms using vector search
    query_embedding = self.semantic_analyzer.embedding_model.encode(processed_query)
    similar_terms = await self.vector_manager.find_similar_terms(query_embedding)

    # Step 4: Generate SQL using LLM
    prompt = self._prepare_llm_prompt(processed_query, query_intent, similar_terms)
    sql = await self.llm_client.generate_completion(prompt)

    # Step 5: Return the SQL plus context about how it was derived
    # (the exact contents of the metadata dict are illustrative)
    return sql, {"intent": query_intent, "similar_terms": similar_terms}
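A minimal usage sketch, assuming the translator has already been constructed with its preprocessor, semantic analyzer, vector manager, and LLM client wired in (the constructor arguments shown here are hypothetical):
import asyncio

# Hypothetical wiring: in practice the constructor also receives the preprocessor,
# semantic analyzer, vector manager, and LLM client instances it orchestrates.
translator = QueryTranslator(domain_context=credit_domain)

sql, metadata = asyncio.run(
    translator.translate_to_sql(
        "What's the average credit worthiness for customers with late payments?"
    )
)
print(sql)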
B. Vector Management System: The vector manager uses vector embeddings to capture the semantic meaning of terms. Imagine it as a smart librarian that can find related terms even when they're expressed differently:
class VectorManager:
    async def find_similar_terms(self, query_embedding: List[float]) -> List[VectorSearchResult]:
        """
        Searches for semantically similar terms in the vector database.
        For example: "credit score" might match "creditworthiness" or "FICO score".
        """
        results = await self.vector_api_client.search_vectors(query_embedding)
        return [r for r in results if r.score >= VECTOR_SIMILARITY_THRESHOLD]
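For reference, the VectorSearchResult and VECTOR_SIMILARITY_THRESHOLD used above might be shaped roughly like this; the field names and threshold value are assumptions, not the project's actual definitions:
from dataclasses import dataclass

@dataclass
class VectorSearchResult:
    term: str        # the matched domain term, e.g. "credit_score"
    score: float     # similarity score in [0, 1]
    metadata: dict   # e.g. {"table": "customer_credit", "column": "credit_score"}

# Matches below this similarity are treated as noise; the value is illustrative.
VECTOR_SIMILARITY_THRESHOLD = 0.75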
C. Semantic Analysis: The semantic analyzer acts like a language expert that extracts the intent behind a query:
class SemanticAnalyzer:
    def analyze_query(self, query: str, context: DomainContext) -> QueryIntent:
        """
        Breaks down a query to understand:
        - What action is being requested (SELECT, AGGREGATE, etc.)
        - Which entities are involved (tables, columns)
        - What conditions should be applied
        - Whether time periods are specified
        """
        tokens = word_tokenize(query.lower())
        action_type = self._determine_action_type(tokens)
        aggregation_type = self._determine_aggregation(tokens)
        main_entities = self._extract_main_entities(query)
        conditions = self._extract_conditions(query, context)

        return QueryIntent(
            action_type=action_type,
            main_entities=main_entities,
            conditions=conditions,
            aggregation_type=aggregation_type,
        )
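The action and aggregation helpers can be as simple as keyword lookups over the tokenized query. A minimal sketch follows, shown as standalone functions rather than the class's private methods; the keyword lists are assumptions:
from typing import List, Optional

# Illustrative mapping from question words to SQL aggregation functions.
AGGREGATION_KEYWORDS = {
    "average": "AVG", "avg": "AVG", "mean": "AVG",
    "total": "SUM", "sum": "SUM",
    "count": "COUNT", "many": "COUNT",
    "highest": "MAX", "maximum": "MAX",
    "lowest": "MIN", "minimum": "MIN",
}

def determine_aggregation(tokens: List[str]) -> Optional[str]:
    # Return the first aggregation keyword found in the tokenized query, if any.
    for token in tokens:
        if token in AGGREGATION_KEYWORDS:
            return AGGREGATION_KEYWORDS[token]
    return None

def determine_action_type(tokens: List[str]) -> str:
    # A query asking for a number is treated as an aggregate; everything else as a plain SELECT.
    return "AGGREGATE" if determine_aggregation(tokens) else "SELECT"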
D. SQL Validator: The SQL validator acts as a quality-control step, ensuring the generated SQL is syntactically correct and follows business rules:
class SQLValidator:
    def validate_sql(self, sql: str) -> Tuple[bool, List[str]]:
        """
        Validates a SQL query against:
        1. Correct syntax
        2. Valid table and column names
        3. Proper join conditions
        4. Business rules (e.g., credit score ranges)
        """
        issues = []

        # Parse the SQL into a statement object
        parsed = sqlparse.parse(sql)[0]

        # Validate individual components
        issues.extend(self._validate_tables(parsed))
        issues.extend(self._validate_columns(parsed))
        issues.extend(self._validate_joins(parsed))
        issues.extend(self._validate_business_rules(parsed))

        return len(issues) == 0, issues
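A business rule check could, for instance, flag credit score comparisons that fall outside the valid range. The simplified sketch below operates on the raw SQL string rather than the parsed statement, and the 300-850 FICO range is an illustrative rule, not one defined by the project:
import re
from typing import List

def validate_business_rules(sql: str) -> List[str]:
    """Hypothetical rule: credit_score comparisons must stay within 300-850."""
    issues = []
    for match in re.finditer(r"credit_score\s*(?:[<>]=?|=)\s*(\d+)", sql):
        value = int(match.group(1))
        if not 300 <= value <= 850:
            issues.append(f"credit_score compared against {value}, which is outside the 300-850 range")
    return issues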
- Integration with External Services
The system is designed to work with multiple external services through clean interfaces:
A. Vector Databases (like Pinecone):
class PineconeVectorAPIClient(VectorAPIClient):
    """
    Handles vector similarity search operations:
    - Storing term embeddings
    - Searching for similar terms
    - Managing vector indices
    """
B. Language Models (like OpenAI GPT-4):
class OpenAILLMClient(LLMAPIClient):
    """
    Manages interactions with the language model:
    - Generating SQL from natural language
    - Handling API rate limits and errors
    - Processing responses
    """
- Example Query Flow
Let's follow a query through the system:
Input: "What's the average credit worthiness for customers with late payments?"
- Preprocessing:
- Standardizes terms (e.g., "credit worthiness" → "credit_score"), as sketched after this list
- Removes unnecessary words
- Applies domain-specific rules
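A minimal sketch of the standardization step, assuming the term mappings live on the domain context (the function and field names are illustrative):
import re

def preprocess_query(natural_query: str, context: DomainContext) -> str:
    """Lowercase the question and swap business phrasing for canonical terms (illustrative)."""
    query = natural_query.lower().strip()
    # Replace longer phrases first so "credit worthiness" wins over "credit"
    for term, replacement in sorted(context.term_mappings.items(), key=lambda kv: -len(kv[0])):
        query = re.sub(rf"\b{re.escape(term)}\b", replacement, query)
    return query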
- Semantic Analysis:
- Identifies action type: AGGREGATE
- Main entities: credit_score, payment_status
- Conditions: payment_status = 'Late'
- Aggregation: AVG
- Vector Search:
- Finds similar terms in the domain
- Matches "credit worthiness" with "credit_score"
- Identifies that "late payments" relates to the payment_history table
- SQL Generation:
SELECT AVG(customer_credit.credit_score) as average_credit_score
FROM customer_credit
JOIN payment_history ON customer_credit.customer_id = payment_history.customer_id
WHERE payment_history.payment_status = 'Late';
- Areas for Enhancement
- Error Handling:
- Add more specific error types for different failure modes
- Implement retry logic for API calls (see the sketch after this list)
- Add circuit breakers for external services
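As a sketch of the retry idea, a small exponential-backoff wrapper around any async call; the delay values and the blanket exception handler are placeholders, and in practice you would retry only transient errors such as timeouts or rate limits:
import asyncio
import random

async def with_retries(coro_factory, max_attempts: int = 3, base_delay: float = 1.0):
    """Retry an async call with exponential backoff and jitter (illustrative)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await coro_factory()
        except Exception:  # in practice, catch only transient errors (timeouts, 429s)
            if attempt == max_attempts:
                raise
            await asyncio.sleep(base_delay * 2 ** (attempt - 1) + random.random())

# Usage: sql = await with_retries(lambda: llm_client.generate_completion(prompt))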
- Performance:
- Cache frequently used vector embeddings (sketched after this list)
- Implement batch processing for multiple queries
- Add connection pooling for database operations
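The embedding cache can start as a simple in-memory memo keyed by the preprocessed query string; a production setup might swap in an LRU policy or an external cache such as Redis:
from typing import Dict, List

class CachingEmbedder:
    """Wraps an embedding model and memoizes results per query string (illustrative)."""

    def __init__(self, embedding_model):
        self._model = embedding_model
        self._cache: Dict[str, List[float]] = {}

    def encode(self, text: str) -> List[float]:
        if text not in self._cache:
            self._cache[text] = list(self._model.encode(text))
        return self._cache[text]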
- Monitoring:
- Add performance metrics tracking (see the sketch after this list)
- Implement query success rate monitoring
- Add detailed logging for debugging
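Even a thin decorator around translate_to_sql yields latency and success-rate data for monitoring; the logger name and log messages below are placeholders:
import functools
import logging
import time

logger = logging.getLogger("nl2sql")

def track_query_metrics(func):
    """Log duration and outcome of each translation call (illustrative)."""
    @functools.wraps(func)
    async def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            result = await func(*args, **kwargs)
            logger.info("translate_to_sql succeeded in %.2fs", time.perf_counter() - start)
            return result
        except Exception:
            logger.exception("translate_to_sql failed after %.2fs", time.perf_counter() - start)
            raise
    return wrapper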