SQL Query Generation with Fine-Tuned Language Model

Progress: Ongoing
Category: Machine Learning
Platform: Hugging Face

Project Overview

In this machine learning project, I am fine-tuning the defog/sqlcoder-1b model on the WikiSQL dataset to generate precise SQL queries from natural language questions. By leveraging LoRA and 4-bit quantization, I optimized training efficiency, enabling a full fine-tuning run within a 2–3 hour window on GPU hardware. This solution enhances data querying for analysts and developers, delivering accurate SQL outputs that streamline database interactions. My work showcases advanced NLP techniques and efficient model training, paving the way for intelligent query generation in data-driven applications.

Project Objectives:

Fine-tune defog/sqlcoder-1b on the WikiSQL dataset to generate accurate SQL queries from natural language questions.
Keep training efficient with LoRA and 4-bit quantization, targeting a 2–3 hour run on GPU hardware.
Deliver a model that helps data analysts and developers query databases reliably.

Technology Stack

This project uses a powerful suite of machine learning and NLP tools to achieve efficient model fine-tuning:

PyTorch
Hugging Face Transformers
Hugging Face Datasets
PEFT (parameter-efficient fine-tuning with LoRA)
bitsandbytes (4-bit quantization)

Architecture Diagram:

```mermaid
graph TD
    A["WikiSQL Dataset<br>Questions & SQL Queries"] -->|Load| B["Data Preprocessing"]
    B -->|Format Prompt Pairs| C["Tokenization<br>Max Length: 512"]
    C -->|Tokenized Inputs| D["Fine-Tuning<br>sqlcoder-1b"]
    D -->|LoRA, 4-bit Quantization| E["Trained Model"]
    E -->|Generate| F["SQL Query Output"]
    D -->|Evaluate| G["Validation<br>1,000 Samples"]
    G -->|Metrics| H["Model Performance"]
```

The workflow begins with loading the WikiSQL dataset, which is preprocessed into prompt pairs and tokenized. The sqlcoder-1b model is then fine-tuned on the tokenized inputs with LoRA and 4-bit quantization, evaluated on a 1,000-sample validation split, and used to generate SQL queries from natural language questions.

Implementation Details

Data Preprocessing

Loaded the WikiSQL dataset using the datasets library, extracting natural language questions and their corresponding SQL queries. Formatted the data into prompt pairs (e.g., “### Question: ... ### SQL: ...”) to align with the model’s input requirements, ensuring effective training.
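A minimal sketch of this step, assuming the standard Hugging Face wikisql schema (a question string plus an sql dict containing a human_readable SQL string); the exact prompt wrapper shown is illustrative.

```python
from datasets import load_dataset

# Load WikiSQL; field names below assume the standard Hugging Face "wikisql"
# schema, where each example has a "question" string and an "sql" dict that
# includes a "human_readable" SQL string.
dataset = load_dataset("wikisql")

def format_example(example):
    # Pair each natural language question with its target SQL query in the
    # "### Question: ... ### SQL: ..." prompt format described above.
    prompt = (
        f"### Question: {example['question']}\n"
        f"### SQL: {example['sql']['human_readable']}"
    )
    return {"text": prompt}

train_data = dataset["train"].map(format_example)
val_data = dataset["validation"].map(format_example)
```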

Model Setup

Utilized the defog/sqlcoder-1b model, loaded with 4-bit quantization via bitsandbytes to reduce memory usage. Applied LoRA using the peft library, targeting q_proj and v_proj modules with rank 8 and alpha 16, minimizing trainable parameters for efficiency.
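A hedged sketch of this setup using the public transformers, bitsandbytes, and peft APIs; the nf4 quantization type and the LoRA dropout value are illustrative defaults rather than confirmed project settings.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

MODEL_NAME = "defog/sqlcoder-1b"

# Load the base model in 4-bit via bitsandbytes to cut memory usage.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # assumed quantization type
    bnb_4bit_compute_dtype=torch.float16,  # matches the FP16 training setup
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# Attach LoRA adapters to the q_proj and v_proj attention projections with
# rank 8 and alpha 16, keeping the number of trainable parameters small.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,                     # assumed value
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```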

Tokenization

Tokenized prompt pairs using the model’s tokenizer, setting a maximum length of 512 tokens to balance query complexity and computational constraints. Applied truncation and padding to ensure consistent input sizes for training and validation.
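Continuing the sketches above, a plausible tokenization step; the text column name follows the preprocessing example, and copying input_ids into labels is the usual setup for causal language model fine-tuning.

```python
MAX_LENGTH = 512

# Some decoder-only tokenizers ship without a pad token; fall back to EOS.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

def tokenize_fn(example):
    # Truncate and pad every prompt pair to a fixed 512-token window so that
    # training and validation batches have consistent shapes.
    tokens = tokenizer(
        example["text"],
        max_length=MAX_LENGTH,
        truncation=True,
        padding="max_length",
    )
    # For causal LM fine-tuning, the labels mirror the input ids.
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

train_tokenized = train_data.map(tokenize_fn, remove_columns=train_data.column_names)
val_tokenized = val_data.map(tokenize_fn, remove_columns=val_data.column_names)
```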

Training

Still in progress: training the model on 8,000 training samples and 1,000 validation samples for one epoch using the Hugging Face Trainer API. Training is configured with a batch size of 2, a learning rate of 2e-4, and FP16 precision, with checkpoints saved every 100 steps and evaluation run every 50 steps.
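A sketch of this training configuration with the Trainer API, continuing the earlier snippets; the output directory name is hypothetical, and the evaluation_strategy argument is named eval_strategy in newer transformers releases.

```python
from transformers import Trainer, TrainingArguments

# Subsample to the 8,000 training and 1,000 validation examples noted above.
train_subset = train_tokenized.select(range(8_000))
val_subset = val_tokenized.select(range(1_000))

training_args = TrainingArguments(
    output_dir="sqlcoder-1b-wikisql-lora",  # hypothetical output path
    num_train_epochs=1,
    per_device_train_batch_size=2,
    learning_rate=2e-4,
    fp16=True,
    save_steps=100,
    evaluation_strategy="steps",
    eval_steps=50,
    logging_steps=50,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_subset,
    eval_dataset=val_subset,
)
trainer.train()
```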

Key Outcomes

Enhanced Query Accuracy

Fine-tuned sqlcoder-1b to generate precise SQL queries, improving reliability for data querying tasks.
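For illustration, a minimal inference helper that reuses the tokenizer and model from the sketches above, assuming the same prompt format as training; the example question and generation settings are placeholders.

```python
def generate_sql(question: str) -> str:
    # Wrap the question in the training prompt format and let the fine-tuned
    # model complete the SQL portion.
    prompt = f"### Question: {question}\n### SQL:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=128)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print(generate_sql("How many albums were released in 2010?"))  # placeholder question
```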

Training Efficiency

Optimized training with LoRA and 4-bit quantization so that fine-tuning fits within a 2–3 hour window on GPU hardware.

Deployable Model

Still in progress: the fine-tuned model will be prepared for deployment once training completes.

This project demonstrates advanced NLP expertise, delivering a scalable solution that empowers data analysts and developers to query databases effortlessly.

Challenges & Solutions

GPU Memory Constraints

Challenge: Fine-tuning a 1B-parameter model on limited GPU memory risked out-of-memory errors.

Solution: Applied 4-bit quantization and LoRA to reduce memory footprint, enabling efficient training with minimal parameter updates.

Tokenization Limits

Challenge: Ensuring complex SQL queries fit within a 512-token limit without losing critical information.

Solution: Carefully designed prompt formatting and truncation strategies to preserve query structure and context.

Training Speed vs. Performance

Challenge: Balancing training speed with model accuracy within a 2–3 hour window.

Solution: Limited training to 8,000 samples and used FP16 precision, achieving robust performance in a single epoch.
