Open source ChatGPT-style AI has taken another step forward with the release of Dolly, a large language model (LLM) developed by the enterprise software company Databricks.
The new ChatGPT clone is named after Dolly the sheep, the first mammal to be cloned from an adult cell.
Advancing Open Source Large Language Models
The Dolly LLM represents the latest progress in the expanding open source AI movement. The aim of this movement is to provide wider access to technology, preventing it from being monopolized and controlled solely by large corporations.
One concern driving the open source AI movement is the reluctance of businesses to entrust sensitive data to a third party that controls the AI technology.
Built on Open Source Foundations
Dolly was built on an open source model created by the non-profit EleutherAI research institute and fine-tuned with data from Stanford University's Alpaca project. Alpaca, in turn, was fine-tuned from the 7 billion parameter version of Meta's LLaMA model, a family whose largest version has 65 billion parameters.
LLaMA, an abbreviation for Large Language Model Meta AI, is a language model trained on publicly available data.
According to an article by Weights & Biases, LLaMA can outperform many top language models, including OpenAI's GPT-3 and DeepMind's Gopher and Chinchilla, despite its smaller size.
Improving the Dataset
Another source of inspiration was an academic research paper titled "SELF-INSTRUCT: Aligning Language Model with Self Generated Instructions", which proposed a method for automatically generating high-quality question and answer training data that improves on the limited public data available.
The research paper on Self-Instruct explains:
“...we curate a set of expert-written instructions for novel tasks, and show through human evaluation that tuning GPT3 with SELF-INSTRUCT outperforms using existing public instruction datasets by a substantial margin, with only a 5% absolute gap compared to InstructGPT…”
“...By applying our method to vanilla GPT3, we achieve a 33% absolute improvement over the original model on SUPERNATURALINSTRUCTIONS, a performance on par with InstructGPT, which is trained with private user data and human annotations.”
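The bootstrapping idea behind Self-Instruct can be illustrated with a small sketch: start from a handful of seed instructions, ask a model to propose new ones, and keep only candidates that are not near-duplicates of what is already in the pool. The code below is an illustration under stated assumptions, not the paper's implementation. The propose_instructions function is a hypothetical stub standing in for a language model call, and difflib's similarity ratio is a stand-in for the overlap metric used in the paper.

```python
from difflib import SequenceMatcher


def too_similar(candidate, pool, threshold=0.7):
    """Return True if the candidate closely matches any instruction in the pool."""
    return any(
        SequenceMatcher(None, candidate.lower(), existing.lower()).ratio() >= threshold
        for existing in pool
    )


def propose_instructions(pool):
    """Hypothetical stand-in for a language model call; returns canned candidates."""
    return [
        "Write a haiku about autumn.",
        "Write a haiku about the autumn.",  # near-duplicate of the line above
        "List three uses for a paperclip.",
    ]


def bootstrap(seed_instructions, rounds=1):
    """Grow the instruction pool, skipping candidates that repeat existing ones."""
    pool = list(seed_instructions)
    for _ in range(rounds):
        for candidate in propose_instructions(pool):
            if not too_similar(candidate, pool):
                pool.append(candidate)
    return pool


# Assumed seed instructions, written for this example only.
seeds = ["Summarize the following paragraph.", "Translate this sentence into French."]
pool = bootstrap(seeds)
```

In a real pipeline the stub would be replaced by a model prompted with examples sampled from the pool, and each accepted instruction would also need a generated answer before it could serve as training data.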
The significance of Dolly lies in its demonstration that a highly useful large language model can be created using a smaller yet high-quality dataset. Databricks explains:
“Dolly functions by taking an existing open source 6 billion parameter model from EleutherAI and making minimal modifications to enhance instruction-following capabilities, such as brainstorming and text generation, which were not present in the original model. This is achieved through the use of data from Alpaca.
...Our research shows that anyone can take a dated off-the-shelf open source large language model (LLM) and endow it with magical ChatGPT-like instruction-following ability, all in just 30 minutes of training on a single machine, using high-quality training data.
Surprisingly, instruction-following does not seem to require the latest or largest models: our model consists of only 6 billion parameters, compared to GPT-3's 175 billion.”
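To make the idea of instruction-following training data concrete, here is a minimal sketch of how one instruction and response pair might be turned into a single training string, followed by the size arithmetic behind the parameter comparison in the quote. The prompt template follows the style published with Alpaca, but the article does not specify the exact format Databricks used, so treat it as an assumption. The sample record is invented for illustration.

```python
# Alpaca-style prompt template (an assumption for illustration; the exact
# format used to train Dolly is not given in the article).
PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n"
)


def build_training_text(record):
    """Join one instruction/response record into a single training string."""
    return PROMPT_TEMPLATE.format(instruction=record["instruction"]) + record["output"]


# Hypothetical record, written for this example only.
record = {
    "instruction": "Brainstorm three names for a bakery.",
    "output": "1. Rise and Shine\n2. The Flour Pot\n3. Crumb and Crust",
}
text = build_training_text(record)

# Size comparison using the parameter counts from the quote.
dolly_params = 6e9
gpt3_params = 175e9
ratio = dolly_params / gpt3_params  # about 0.034, roughly 3.4% of GPT-3's size

# Memory for the weights alone at 2 bytes per parameter (16-bit floats),
# ignoring optimizer state and activations.
dolly_weights_gb = dolly_params * 2 / 1e9
gpt3_weights_gb = gpt3_params * 2 / 1e9
```

At 16-bit precision the 6 billion parameter model needs about 12 GB for its weights, against about 350 GB for a 175 billion parameter model, which helps explain why training on a single machine is practical for the smaller model.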
Databricks and Open Source AI
Databricks presents Dolly as a step toward democratizing AI. It is part of a growing movement recently joined by the non-profit Mozilla, which founded Mozilla.ai. Mozilla, the publisher of the Firefox browser and other open source software, is actively supporting the open source AI ecosystem.