As reported by TowardsAI.net, coding has become a prominent area of development for foundation models. OpenAI introduced models such as Codex, followed later by GPT-4, while Amazon and Salesforce also made significant contributions. Early coding models prioritized scale, in both parameters and data, over data quality. Microsoft Research challenged this approach in their paper "Textbooks Are All You Need," training a compact model solely on high-quality, textbook-like data. The resulting model, phi-1, surpassed competing models on the HumanEval and MBPP benchmarks, reaching pass@1 accuracies of 50.6% and 55.5%, respectively. Its training recipe, which combines synthetic data with carefully filtered web content, underscores the significance of high-quality data.
The paper emphasizes the value of textbook-quality training data for code language models. To overcome the limitations of existing code datasets, the authors assemble a training set that resembles a well-structured "textbook": filtered code-language data, synthetic textbooks generated by GPT-3.5, and a smaller synthetic dataset of Python exercises. Trained on this data, phi-1 achieves exceptional code-generation performance, surpassing existing approaches.
Filtered Code-Language Dataset
Microsoft Research utilized a range of datasets and techniques to develop an efficient code-generation model. They curated a filtered dataset from Python code in The Stack and StackOverflow. Using GPT-4, they annotated a subset of samples for educational value, then trained a random forest classifier on those annotations to predict the quality of every file in the corpus.
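The annotate-then-classify pipeline can be sketched as follows. This is a toy illustration, not the paper's code: the `embed` function below is a crude stand-in for the real code-embedding model, and the handful of labeled samples stand in for the GPT-4 annotations.

```python
from sklearn.ensemble import RandomForestClassifier

def embed(code: str) -> list:
    # Stand-in for a real code embedding; here, crude surface statistics.
    return [len(code), code.count("def "), code.count("#")]

# Toy "GPT-4 annotated" samples: 1 = high educational value, 0 = low.
annotated = [
    ("def add(a, b):\n    # Add two numbers\n    return a + b", 1),
    ("x=1;y=2;z=x+y;print(z)", 0),
    ("def mean(xs):\n    # Average of a list\n    return sum(xs) / len(xs)", 1),
    ("aaa bbb ccc", 0),
]
X = [embed(code) for code, _ in annotated]
y = [label for _, label in annotated]

# The cheap classifier then scales the expensive annotations to the corpus.
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

def keep(code: str) -> bool:
    """Keep a file only if the classifier predicts high educational value."""
    return bool(clf.predict([embed(code)])[0])
```

The design point is cost: GPT-4 labels only a small subset, and the lightweight classifier filters the rest of the corpus.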
Synthetic Textbook Dataset
Microsoft Research employed GPT-4 sparingly for quality annotations on a subset of The Stack and StackOverflow samples. This reduced the need for labor-intensive human annotation.
Using GPT-3.5, they generated a synthetic textbook dataset of fewer than 1 billion tokens. The textbooks interleave natural-language explanation with relevant code snippets, targeting specific topics to foster reasoning skills and understanding of algorithms. Constraints were injected during generation to ensure dataset diversity.
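One way such diversity constraints can work is by randomly sampling topics and constraints into each generation prompt. The topic and audience pools below are hypothetical, purely to illustrate the idea:

```python
import random

# Hypothetical pools; the actual constraints used in the paper differ.
TOPICS = ["binary search", "recursion", "hash tables", "dynamic programming"]
AUDIENCES = ["beginners", "intermediate learners", "interview candidates"]

def textbook_prompt(rng: random.Random) -> str:
    """Build one generation prompt with randomly sampled constraints."""
    topic = rng.choice(TOPICS)
    audience = rng.choice(AUDIENCES)
    return (
        f"Write a textbook section on {topic} for {audience}. "
        "Mix clear natural-language explanation with short Python snippets."
    )

rng = random.Random(0)
prompts = [textbook_prompt(rng) for _ in range(3)]
```

Each prompt would then be sent to GPT-3.5; varying the injected constraints steers the model away from generating near-duplicate textbook sections.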
The paper illustrates the effectiveness of this approach with an example of synthetically generated textbook text.
Smaller Textbook Dataset
Microsoft Research curated the CodeExercises dataset, comprising fewer than 180 million tokens of Python exercises and solutions. Each exercise asks the model to complete a function from its signature and docstring, training it for function-completion tasks. GPT-3.5 generated the dataset, with constraints on function names imposed for diversity. The authors also took precautions to avoid contamination from the HumanEval benchmark during finetuning.
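A CodeExercises-style item pairs a stub, consisting of a signature plus docstring, with a reference solution. The exercise below is invented for illustration and is not from the dataset:

```python
# Hypothetical exercise: the model sees only the stub and must write the body.
EXERCISE_STUB = '''def running_max(values):
    """Return a list where element i is the max of values[:i+1]."""
'''

# Reference solution in the style the dataset pairs with each exercise.
def running_max(values):
    """Return a list where element i is the max of values[:i+1]."""
    out, best = [], float("-inf")
    for v in values:
        best = max(best, v)
        out.append(best)
    return out
```

This format matches the HumanEval evaluation setting, which is why the authors screen the generated exercises against HumanEval to prevent contamination.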
Microsoft Research employed a decoder-only transformer based on FlashAttention for their code language model. The phi-1 model has 1.3 billion parameters, 24 layers, and 32 attention heads, while the smaller phi-1-small model has 350 million parameters, 20 layers, and 16 attention heads. Both use rotary position embeddings and the same tokenizer as codegen-350M-mono. Pretraining used a sequence length of 2048 on the CodeTextbook dataset; finetuning on the CodeExercises dataset used adjusted hyperparameters.
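The reported hyperparameters of the two variants can be summarized in a small config object. The class and field names here are illustrative, not taken from the paper's code:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PhiConfig:
    """Architecture hyperparameters as reported in the paper."""
    n_params: int        # total parameter count
    n_layers: int        # transformer decoder layers
    n_heads: int         # attention heads per layer
    seq_len: int = 2048  # pretraining sequence length

PHI_1 = PhiConfig(n_params=1_300_000_000, n_layers=24, n_heads=32)
PHI_1_SMALL = PhiConfig(n_params=350_000_000, n_layers=20, n_heads=16)
```

Laid out this way, the contrast with contemporaneous code models is stark: phi-1 is roughly an order of magnitude smaller than the 15-billion-parameter StarCoder it is compared against.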
Microsoft Research evaluated phi-1 on various benchmarks and demonstrated its superiority over much larger models such as StarCoder and Replit's code model. Finetuning on CodeExercises had a significant impact on phi-1's performance.
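The pass@1 numbers quoted above come from the standard unbiased pass@k estimator introduced with HumanEval (Chen et al., 2021): generate n samples per problem, count the c that pass the unit tests, and estimate the probability that at least one of k drawn samples passes.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n generated samples, c of them correct.

    pass@k = 1 - C(n - c, k) / C(n, k), i.e. one minus the probability
    that k samples drawn without replacement are all incorrect.
    """
    if n - c < k:
        return 1.0  # fewer than k incorrect samples: some draw must pass
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For k = 1 this reduces to the fraction of correct samples, so a pass@1 of 50.6% on HumanEval means phi-1's single-shot completions pass the tests about half the time.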
Their work highlights the profound influence of high-quality data on language models' proficiency in code generation. By creating "textbook-quality" data, they trained a model that outperforms most open-source models on coding benchmarks such as HumanEval and MBPP, despite being far smaller than its competitors.
High-quality data provides clear, self-contained, and instructive examples of coding concepts and skills. This conducive learning environment lets language models grasp coding principles effectively, resulting in superior code-generation performance.