This past year, our team published a white paper on small data. We explored the trend toward democratization that small data enables and the role it plays in loosening the grip that large tech firms have on access to, and application of, usable data. In this blog post, we turn our attention to the latest advances in large language models (LLMs) and natural language processing (NLP). Here we see a similar tension emerging between big tech and less well-resourced organizations in the development and application of these models.

The potential of NLP is immense, as imaginative developers rapidly expand the technology's range of use cases. From machine translation to email filters, chatbots, search engine processing, predictive text, text analytics, and social media and sentiment monitoring, the applications seem endless. But solving the problem of statistically recreating language in all its nuance and complexity does not come cheap. So it's not too surprising that the field has largely been the province of a few giant tech firms with household names like Google, Microsoft, and Meta.


Size Matters: The Race for Dominance

Leading language model developers have been in something of an arms race in recent years. Simply put, large language models "learn" by example. In theory, the more examples they ingest, the better and more complete their "understanding" of the language becomes, and the more adept they are at applying it in novel contexts. This has become especially true since developers embraced the Transformer architecture, introduced in 2017, which opened the door to training models on ever-increasing amounts of unlabeled data.
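To make "learning by example" concrete, here is a minimal sketch of a single self-supervised training step for a causal language model, written with the open-source Hugging Face transformers library and PyTorch. The model name, sample sentence, and learning rate are illustrative choices, not details of any particular production system.

    # A single self-supervised training step: the "label" for each token
    # is simply the next token, so any raw, unlabeled text can serve as data.
    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM

    model_name = "gpt2"  # illustrative small model; frontier models are far larger
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

    batch = tokenizer(
        "Large language models learn by predicting the next word.",
        return_tensors="pt",
    )

    # Passing the input ids as labels asks the model to compute the
    # next-token prediction (cross-entropy) loss over the sequence.
    outputs = model(**batch, labels=batch["input_ids"])
    outputs.loss.backward()  # gradients of the loss w.r.t. every parameter
    optimizer.step()         # nudge the parameters toward better predictions
    optimizer.zero_grad()

Scaling this loop to trillions of tokens and hundreds of billions of parameters is, in essence, what separates a toy example like this from the models discussed below.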


The progression of model sizes shows how far and fast large language models have come: from ELMo's 94 million parameters early in the post-Transformer era to Nvidia's introduction last October of the 530-billion-parameter Megatron-Turing NLG, claimed to be the largest and most powerful monolithic transformer English language model in the world. The race to scale is on and does not appear to be slowing down. In addition, Microsoft and OpenAI's GPT-4 is due out soon and may top 1 trillion parameters, a mark already surpassed by BAAI's Wu Dao 2.0 and its reported 1.75 trillion.


At What Cost?

Not surprisingly, the price of building, training, and running such large models is prohibitive for all but the biggest companies, and the larger these models get, the more exclusive the club of developers becomes. Megatron-Turing NLG cost an estimated $100 million to get off the ground. Even the smaller and purportedly more efficiently produced OPT-175B from Meta cost close to $30 million in training alone.
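A back-of-the-envelope calculation gives a sense of why even running, let alone training, these models is out of reach for most organizations. The sketch below estimates the memory needed just to hold the model weights in half precision (2 bytes per parameter); the 80 GB figure assumes a high-end accelerator of the kind these models are typically served on.

    import math

    BYTES_PER_PARAM = 2   # 16-bit (half precision) weights
    GPU_MEMORY_GB = 80    # assumed memory of one high-end accelerator

    for name, params in [("OPT-175B", 175e9), ("Megatron-Turing NLG", 530e9)]:
        weights_gb = params * BYTES_PER_PARAM / 1e9
        gpus = math.ceil(weights_gb / GPU_MEMORY_GB)
        print(f"{name}: ~{weights_gb:,.0f} GB of weights, "
              f"at least {gpus} such GPUs just to hold them in memory")

And that is only inference: training adds optimizer state, activations, and weeks of time on clusters of such hardware, which is where the tens of millions of dollars go.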

The inevitable result of this concentration of effort is that businesses and research labs looking to develop and apply NLP for targeted purposes will be forced to rely on the leading commercial providers in the space. That dependence gives these providers greater power to set the future direction of AI research, a direction likely to coincide with their financial interests.


Leveling the Playing Field

There are countervailing forces challenging the hegemony of the tech giants in NLP. Among the best known is Hugging Face, a community of developers who share models and datasets with the goal of democratizing the development and application of AI.

In addition to offering a range of pre-trained models, Hugging Face has been the driver of BigScience, a collaborative workshop for the creation and study of large language models that includes more than 1,000 researchers from around the world. This year, BigScience launched BLOOM, touted as the "first multilingual large language model trained in complete transparency." BLOOM boasts a respectable 176 billion parameters and can generate text in 46 natural languages and 13 programming languages. Access is open to anyone who agrees to the model's Responsible AI License.
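As a small-scale sketch of what that openness looks like in practice, the snippet below pulls one of the compact BLOOM checkpoints published on the Hugging Face Hub and generates a short continuation. The checkpoint name (bigscience/bloom-560m), prompt, and sampling settings are illustrative; the full 176-billion-parameter model needs far more hardware than a single workstation provides.

    # Generate text with a small BLOOM checkpoint from the Hugging Face Hub.
    from transformers import pipeline

    generator = pipeline("text-generation", model="bigscience/bloom-560m")

    prompt = "Open access to large language models matters because"
    result = generator(prompt, max_new_tokens=40, do_sample=True, temperature=0.8)
    print(result[0]["generated_text"])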


The Future

Just as in the evolution from big to small data applications, there is a tension in these early days of LLM and NLP development between the natural dominance of large, well-resourced developers and the broader community. Right now, the big players have taken the lead in the race for model dominance. In the short term, countervailing forces like BigScience will likely push the larger commercial players to be more accommodating in how they share their technology. In the longer term, those forces may unlock new levels of open-source LLM development and application.