General text-to-SQL solutions work well on familiar use cases such as an employee database, but they break down when the natural language query uses domain- or enterprise-specific terminology, or when table and column names do not clearly describe the entities they represent.
This solution provides a structured way to define your database schemas and domain-specific terms so that correct SQL queries can be formed from natural language questions.
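For instance, the domain context could be declared as plain Python data. The class and field names below are illustrative, not the solution's actual API:
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class DomainContext:
    # Maps business phrasing used in questions to canonical column names or conditions
    term_mappings: Dict[str, str] = field(default_factory=dict)
    # Describes each table and its columns so prompts can reference real names
    schema: Dict[str, List[str]] = field(default_factory=dict)

credit_domain = DomainContext(
    term_mappings={
        "credit worthiness": "credit_score",
        "creditworthiness": "credit_score",
        "late payments": "payment_status = 'Late'",
    },
    schema={
        "customer_credit": ["customer_id", "credit_score"],
        "payment_history": ["customer_id", "payment_status", "payment_date"],
    },
)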
- Application Architecture
The application follows a well-structured pipeline architecture that processes queries through several specialized components. Think of it like an assembly line where each station performs a specific task to transform the raw material (natural language) into the final product (SQL query).
- Core Components in Detail
A. QueryTranslator: This is the orchestrator of the entire process. Like a conductor in an orchestra, it coordinates all other components to process the query. Here's how it works:
async def translate_to_sql(self, natural_query: str) -> Tuple[str, dict]:
    # Step 1: Preprocess the query
    processed_query = self.preprocessor.preprocess_query(natural_query, self.domain_context)

    # Step 2: Analyze query intent
    query_intent = self.semantic_analyzer.analyze_query(processed_query, self.domain_context)

    # Step 3: Find similar terms using vector search
    query_embedding = self.semantic_analyzer.embedding_model.encode(processed_query)
    similar_terms = await self.vector_manager.find_similar_terms(query_embedding)

    # Step 4: Generate SQL using LLM
    prompt = self._prepare_llm_prompt(processed_query, query_intent, similar_terms)
    sql = await self.llm_client.generate_completion(prompt)

    # Step 5: Return the SQL plus context about how it was derived
    # (the exact contents of the metadata dict are illustrative)
    return sql, {"intent": query_intent, "similar_terms": similar_terms}
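A minimal usage sketch, assuming the translator has already been constructed with its preprocessor, semantic analyzer, vector manager, and LLM client wired in (the constructor arguments shown here are hypothetical):
import asyncio

# Hypothetical wiring: in practice the constructor also receives the preprocessor,
# semantic analyzer, vector manager, and LLM client instances it orchestrates.
translator = QueryTranslator(domain_context=credit_domain)

sql, metadata = asyncio.run(
    translator.translate_to_sql(
        "What's the average credit worthiness for customers with late payments?"
    )
)
print(sql)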
B. Vector Management System: The vector manager uses vector embeddings to capture the semantic meaning of terms. Imagine it as a smart librarian that can find related terms even when they're expressed differently:
class VectorManager:
    async def find_similar_terms(self, query_embedding: List[float]) -> List[VectorSearchResult]:
        """
        Searches for semantically similar terms in the vector database.
        For example: "credit score" might match "creditworthiness" or "FICO score".
        """
        results = await self.vector_api_client.search_vectors(query_embedding)
        return [r for r in results if r.score >= VECTOR_SIMILARITY_THRESHOLD]
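For reference, the VectorSearchResult and VECTOR_SIMILARITY_THRESHOLD used above might be shaped roughly like this; the field names and threshold value are assumptions, not the project's actual definitions:
from dataclasses import dataclass

@dataclass
class VectorSearchResult:
    term: str        # the matched domain term, e.g. "credit_score"
    score: float     # similarity score in [0, 1]
    metadata: dict   # e.g. {"table": "customer_credit", "column": "credit_score"}

# Matches below this similarity are treated as noise; the value is illustrative.
VECTOR_SIMILARITY_THRESHOLD = 0.75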
C. Semantic Analysis: The semantic analyzer acts like a language expert that extracts the intent behind a query:
class SemanticAnalyzer:
    def analyze_query(self, query: str, context: DomainContext) -> QueryIntent:
        """
        Breaks down a query to understand:
        - What action is being requested (SELECT, AGGREGATE, etc.)
        - Which entities are involved (tables, columns)
        - What conditions should be applied
        - Whether time periods are specified
        """
        tokens = word_tokenize(query.lower())
        action_type = self._determine_action_type(tokens)
        aggregation_type = self._determine_aggregation(tokens)
        main_entities = self._extract_main_entities(query)
        conditions = self._extract_conditions(query, context)

        return QueryIntent(
            action_type=action_type,
            main_entities=main_entities,
            conditions=conditions,
            aggregation_type=aggregation_type,
        )
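The action and aggregation helpers can be as simple as keyword lookups over the tokenized query. A minimal sketch follows, shown as standalone functions rather than the class's private methods; the keyword lists are assumptions:
from typing import List, Optional

# Illustrative mapping from question words to SQL aggregation functions.
AGGREGATION_KEYWORDS = {
    "average": "AVG", "avg": "AVG", "mean": "AVG",
    "total": "SUM", "sum": "SUM",
    "count": "COUNT", "many": "COUNT",
    "highest": "MAX", "maximum": "MAX",
    "lowest": "MIN", "minimum": "MIN",
}

def determine_aggregation(tokens: List[str]) -> Optional[str]:
    # Return the first aggregation keyword found in the tokenized query, if any.
    for token in tokens:
        if token in AGGREGATION_KEYWORDS:
            return AGGREGATION_KEYWORDS[token]
    return None

def determine_action_type(tokens: List[str]) -> str:
    # A query asking for a number is treated as an aggregate; everything else as a plain SELECT.
    return "AGGREGATE" if determine_aggregation(tokens) else "SELECT"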
D. SQL Validator: The SQL validator acts as a quality-control step, ensuring the generated SQL is syntactically correct and follows business rules:
class SQLValidator:
    def validate_sql(self, sql: str) -> Tuple[bool, List[str]]:
        """
        Validates a SQL query against:
        1. Correct syntax
        2. Valid table and column names
        3. Proper join conditions
        4. Business rules (e.g., credit score ranges)
        """
        issues = []

        # Parse the SQL into a statement object
        parsed = sqlparse.parse(sql)[0]

        # Validate individual components
        issues.extend(self._validate_tables(parsed))
        issues.extend(self._validate_columns(parsed))
        issues.extend(self._validate_joins(parsed))
        issues.extend(self._validate_business_rules(parsed))

        return len(issues) == 0, issues
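A business rule check could, for instance, flag credit score comparisons that fall outside the valid range. The simplified sketch below operates on the raw SQL string rather than the parsed statement, and the 300-850 FICO range is an illustrative rule, not one defined by the project:
import re
from typing import List

def validate_business_rules(sql: str) -> List[str]:
    """Hypothetical rule: credit_score comparisons must stay within 300-850."""
    issues = []
    for match in re.finditer(r"credit_score\s*(?:[<>]=?|=)\s*(\d+)", sql):
        value = int(match.group(1))
        if not 300 <= value <= 850:
            issues.append(f"credit_score compared against {value}, which is outside the 300-850 range")
    return issues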
- Integration with External Services
The system is designed to work with multiple external services through clean interfaces:
A. Vector Databases (like Pinecone):
class PineconeVectorAPIClient(VectorAPIClient):
    """
    Handles vector similarity search operations:
    - Storing term embeddings
    - Searching for similar terms
    - Managing vector indices
    """
B. Language Models (like OpenAI GPT-4):
class OpenAILLMClient(LLMAPIClient):
    """
    Manages interactions with the language model:
    - Generating SQL from natural language
    - Handling API rate limits and errors
    - Processing responses
    """
- Example Query Flow
Let's follow a query through the system:
Input: "What's the average credit worthiness for customers with late payments?"
- Preprocessing:
- Standardizes terms (e.g., "credit worthiness" → "credit_score"), as sketched after this list
- Removes unnecessary words
- Applies domain-specific rules
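A minimal sketch of the standardization step, assuming the term mappings live on the domain context (the function and field names are illustrative):
import re

def preprocess_query(natural_query: str, context: DomainContext) -> str:
    """Lowercase the question and swap business phrasing for canonical terms (illustrative)."""
    query = natural_query.lower().strip()
    # Replace longer phrases first so "credit worthiness" wins over "credit"
    for term, replacement in sorted(context.term_mappings.items(), key=lambda kv: -len(kv[0])):
        query = re.sub(rf"\b{re.escape(term)}\b", replacement, query)
    return query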
- Semantic Analysis:
- Identifies action type: AGGREGATE
- Main entities: credit_score, payment_status
- Conditions: payment_status = 'Late'
- Aggregation: AVG
- Vector Search:
- Finds similar terms in the domain
- Matches "credit worthiness" with "credit_score"
- Identifies that "late payments" relates to the payment_history table
- SQL Generation:
SELECT AVG(customer_credit.credit_score) as average_credit_score
FROM customer_credit
JOIN payment_history ON customer_credit.customer_id = payment_history.customer_id
WHERE payment_history.payment_status = 'Late';
- Areas for Enhancement
- Error Handling:
- Add more specific error types for different failure modes
- Implement retry logic for API calls (see the sketch after this list)
- Add circuit breakers for external services
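As a sketch of the retry idea, a small exponential-backoff wrapper around any async call; the delay values and the blanket exception handler are placeholders, and in practice you would retry only transient errors such as timeouts or rate limits:
import asyncio
import random

async def with_retries(coro_factory, max_attempts: int = 3, base_delay: float = 1.0):
    """Retry an async call with exponential backoff and jitter (illustrative)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await coro_factory()
        except Exception:  # in practice, catch only transient errors (timeouts, 429s)
            if attempt == max_attempts:
                raise
            await asyncio.sleep(base_delay * 2 ** (attempt - 1) + random.random())

# Usage: sql = await with_retries(lambda: llm_client.generate_completion(prompt))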
- Performance:
- Cache frequently used vector embeddings (sketched after this list)
- Implement batch processing for multiple queries
- Add connection pooling for database operations
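The embedding cache can start as a simple in-memory memo keyed by the preprocessed query string; a production setup might swap in an LRU policy or an external cache such as Redis:
from typing import Dict, List

class CachingEmbedder:
    """Wraps an embedding model and memoizes results per query string (illustrative)."""

    def __init__(self, embedding_model):
        self._model = embedding_model
        self._cache: Dict[str, List[float]] = {}

    def encode(self, text: str) -> List[float]:
        if text not in self._cache:
            self._cache[text] = list(self._model.encode(text))
        return self._cache[text]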
- Monitoring:
- Add performance metrics tracking (see the sketch after this list)
- Implement query success rate monitoring
- Add detailed logging for debugging
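Even a thin decorator around translate_to_sql yields latency and success-rate data for monitoring; the logger name and log messages below are placeholders:
import functools
import logging
import time

logger = logging.getLogger("nl2sql")

def track_query_metrics(func):
    """Log duration and outcome of each translation call (illustrative)."""
    @functools.wraps(func)
    async def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            result = await func(*args, **kwargs)
            logger.info("translate_to_sql succeeded in %.2fs", time.perf_counter() - start)
            return result
        except Exception:
            logger.exception("translate_to_sql failed after %.2fs", time.perf_counter() - start)
            raise
    return wrapper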