Large Language Models exhibit strong text-generation capabilities but perform poorly on tabular data analysis, frequently producing incorrect calculations even when the code they generate is syntactically valid. This research addresses that limitation through a systematic approach combining automated data curation, programmatic verification, and parameter-efficient fine-tuning.

The methodology centers on a verified training dataset in which every example undergoes computational validation. Using GPT-4o-mini as a teacher model, the system generates question-reasoning-code triplets, executes each code snippet against ground-truth data, and retains only the solutions that produce correct results. This "validity by association" principle treats reasoning traces that lead to correct results as evidence of valid analytical thinking.

Three open-source models were fine-tuned with the QLoRA technique: DeepSeek-Coder (1.3B), Phi-3-mini (3.8B), and CodeQwen1.5 (7B). A critical finding concerned training-format alignment: the DeepSeek model required architecture-specific formatting rather than a standard conversational template to achieve optimal performance.

Results demonstrate substantial improvements across all models. The DeepSeek model improved most sharply, from 2.19% to 50.94% accuracy on the development set, with 42.34% on the final test set, a roughly 23-fold gain. This performance rivals commercial models such as GPT-4o while using significantly fewer computational resources.

The research establishes that smaller, specialized models can achieve competitive performance against systems containing over 100 times more parameters. The fine-tuned 1.3B model requires only 2.8GB of memory, enabling deployment on consumer hardware while maintaining transparency through explicit reasoning traces. This challenges current scaling paradigms and demonstrates an efficient pathway to model specialization.
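The execute-and-filter step described above can be sketched as follows. This is a minimal illustration of the idea, not the paper's actual implementation: the triplet fields, the convention that generated code reads a `table` dict and assigns its answer to `result`, and the helper names are all assumptions.

```python
# Illustrative sketch of the "validity by association" filter:
# run each teacher-generated code snippet and keep the triplet
# only if its result matches the ground-truth answer.
# Field names and the `table`/`result` convention are hypothetical.

def run_candidate(code: str, table: dict) -> object:
    """Execute generated analysis code in an isolated namespace."""
    namespace = {"table": table}
    try:
        exec(code, namespace)  # a real pipeline would sandbox this
    except Exception:
        return None  # crashing code can never be verified correct
    return namespace.get("result")

def filter_verified(triplets: list, table: dict) -> list:
    """Keep only question-reasoning-code triplets whose code
    reproduces the known ground-truth answer."""
    verified = []
    for t in triplets:
        if run_candidate(t["code"], table) == t["ground_truth"]:
            verified.append(t)
    return verified

# Toy example: one correct and one buggy candidate solution.
table = {"sales": [10, 20, 30]}
triplets = [
    {"question": "Total sales?",
     "reasoning": "Sum the sales column.",
     "code": "result = sum(table['sales'])",
     "ground_truth": 60},
    {"question": "Total sales?",
     "reasoning": "Take the maximum (flawed reasoning).",
     "code": "result = max(table['sales'])",
     "ground_truth": 60},
]
kept = filter_verified(triplets, table)
print(len(kept))  # only the correct solution survives the filter
```

Because verification is fully programmatic, the filter scales to large teacher-generated corpora without human review: any triplet whose code errors out or returns the wrong value is simply discarded.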
The work contributes a reproducible framework for creating domain-specific analytical agents, automated quality assurance through programmatic verification, and evidence that architectural appropriateness and training data quality can outweigh raw parameter count for specialized tasks.
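As a back-of-the-envelope illustration of why parameter-efficient fine-tuning such as QLoRA is cheap enough for this kind of specialization, the arithmetic below shows the trainable-parameter count of a low-rank adapter on a single projection layer. The layer dimensions and rank are hypothetical examples, not the configuration used in this work.

```python
# LoRA-style adapter arithmetic: a rank-r update to a
# d_out x d_in weight matrix trains only r * (d_in + d_out)
# parameters while the original d_in * d_out weights stay frozen.
# The dimensions and rank below are illustrative assumptions.

def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added by a rank-r LoRA adapter."""
    return r * (d_in + d_out)

def full_params(d_in: int, d_out: int) -> int:
    """Parameters in the frozen full weight matrix."""
    return d_in * d_out

# Hypothetical transformer projection layer.
d_in = d_out = 2048
r = 16

adapter = lora_params(d_in, d_out, r)   # 16 * 4096 = 65_536
frozen = full_params(d_in, d_out)       # 2048 * 2048 = 4_194_304
fraction = adapter / frozen

print(f"adapter params: {adapter}")
print(f"trainable fraction: {fraction:.2%}")
```

Per layer, the adapter here trains under 2% of the frozen weights; combined with 4-bit quantization of the base model, this is what makes fine-tuning small models feasible on modest hardware.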