Andrew Williamson
Training a compute-optimal efficient small language model while GPU-poor
Assuming we have only a consumer-grade GPU with 24 GB of memory (an RTX 3090, for example), how can we train a small language model efficiently, making the most of the compute available? This post walks through training a small language model on a single limited GPU with the goal of producing the best possible model.
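As a rough sanity check on what 24 GB actually buys us, here is a back-of-envelope sketch. The 16-bytes-per-parameter figure assumes standard mixed-precision training with Adam; the 30% activation headroom is my own assumption, and real usage also depends on batch size, sequence length, and whether activation checkpointing is used.

```python
# Rough training-memory estimate for Adam with mixed precision.
# Per parameter: 2 bytes fp16 weights + 2 bytes fp16 grads
#              + 4 bytes fp32 master weights + 8 bytes Adam moments = 16 bytes.
# Activation memory is workload-dependent, so we just reserve a fraction for it.

BYTES_PER_PARAM = 16          # weights + grads + optimizer states (mixed precision)
GPU_MEMORY_GB = 24            # e.g. an RTX 3090
ACTIVATION_HEADROOM = 0.3     # assumed fraction reserved for activations and buffers

usable_bytes = GPU_MEMORY_GB * (1 - ACTIVATION_HEADROOM) * 1024**3
max_params = usable_bytes / BYTES_PER_PARAM
print(f"~{max_params / 1e9:.1f}B parameters trainable without offloading")
# -> roughly 1.1B parameters; anything larger needs LoRA, quantization, or offloading
```

In other words, full training of a model much beyond roughly a billion parameters is already out of reach on this hardware, which is what pushes us toward small models in the first place.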
Background: Language Models
Language models have rapidly accelerated over the past few years in both capability and size. In 2019, OpenAI introduced GPT-2, a 1.5 billion parameter language model regarded as best in class at the time. It was followed in 2020 by GPT-3, a 175 billion parameter model that reigned supreme until its successor, GPT-4, was announced. In between, the Chinchilla scaling laws paper showed how to allocate a fixed compute budget between model size and training data so that these models could be scaled efficiently. The vast growth in language model size has been driven by the ability to scale these models efficiently and by the availability of compute to do so, but what if we aren't a funded research lab with access to a GPU cluster?
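To make "compute-optimal" concrete, here is a minimal sketch of the commonly cited Chinchilla rule of thumb: roughly 20 training tokens per parameter, with total training compute approximated as C ≈ 6·N·D FLOPs. The 30 TFLOPs effective throughput figure for an RTX 3090 is an assumption for illustration, not a measured number.

```python
# Chinchilla rule of thumb: compute-optimal training uses ~20 tokens per parameter,
# and total training compute is approximately C = 6 * N * D FLOPs
# (N = parameters, D = training tokens).

def chinchilla_optimal(compute_flops: float) -> tuple[float, float]:
    """Return (params, tokens) that are roughly compute-optimal for a FLOP budget."""
    # With D = 20 * N and C = 6 * N * D, we get C = 120 * N^2:
    n_params = (compute_flops / 120) ** 0.5
    n_tokens = 20 * n_params
    return n_params, n_tokens

# Example: an RTX 3090 at an assumed ~30 TFLOPs effective throughput, for two weeks
budget = 30e12 * 14 * 24 * 3600   # ~3.6e19 FLOPs
n, d = chinchilla_optimal(budget)
print(f"~{n / 1e6:.0f}M params trained on ~{d / 1e9:.1f}B tokens")
# -> ~550M params on ~11B tokens
```

The takeaway is that a two-week run on a single 3090 lands us squarely in sub-billion-parameter territory if we want to train compute-optimally.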
The open source community has been blessed by contributions from Meta and Mistral, who open-sourced the weights of language models such as Llama 2 and Mistral's mixture-of-experts model Mixtral. However, these models still contain billions of parameters and require a large amount of compute to train. Luckily, a consumer GPU can at least handle inference for these models, making them accessible to everyone. It wouldn't be amiss to say that the availability of these models has been a boon to the community, enabling the rapid development of new models and techniques. Even so, for some use cases these models are too large to work with in their entirety, and thus we must look to smaller models.
Microsoft has been forthcoming recently in its research on and open sourcing of small language models, which provides the main focus of this post: in particular, Phi-2. This is a small language model with only 2.7 billion parameters, a far cry from the tens or hundreds of billions in the previously mentioned models. It has recently been released under an MIT license, with the weights made available on Hugging Face. The model is a great starting point for those who want to experiment with language models but don't have the compute to work with a larger one. It has shown strong reasoning and language understanding capabilities, including SOTA performance among base language models with fewer than 13 billion parameters. Interestingly enough, Microsoft attributes the model's success to data curation rather than the model architecture itself. We can therefore assume that the architecture is not the limiting factor in this model's performance, and use it as a starting point for our own experiments.
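For reference, here is a minimal sketch of loading the released Phi-2 weights from Hugging Face with the transformers library (model id microsoft/phi-2). Loading in fp16 keeps the 2.7B parameters comfortably within 24 GB for inference; older transformers versions may additionally require trust_remote_code=True.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the MIT-licensed Phi-2 weights; fp16 halves memory vs. fp32
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    torch_dtype=torch.float16,
    device_map="auto",          # place the model on the available GPU
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")

inputs = tokenizer("A compute-optimal small model is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

With the model loading and running locally, we have a baseline to build on for the experiments that follow.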