Open source ChatGPT-style AI took another step forward with the release of the Dolly large language model (LLM), created by the enterprise software company Databricks.
The new ChatGPT clone is called Dolly, after the famous sheep that was the first mammal to be cloned.
Open Source Large Language Models
The Dolly LLM is the latest manifestation of the growing open source AI movement that seeks to offer greater access to the technology so that it’s not monopolized and controlled by large corporations.
One of the concerns driving the open source AI movement is that businesses may be reluctant to hand over sensitive data to a third party that controls the AI technology.
Based on Open Source
Dolly was created from an open source model built by the non-profit EleutherAI research institute, fine-tuned with data from the Stanford University Alpaca model, which was itself created from Meta's LLaMA family of models (released in sizes of up to 65 billion parameters).
LLaMA, which stands for Large Language Model Meta AI, is a language model that is trained on publicly available data.
According to an article by Weights & Biases, LLaMA can outperform many of the top language models (OpenAI's GPT-3, and Gopher and Chinchilla by DeepMind) despite being smaller.
Creating a Better Dataset
Another inspiration came from an academic research paper (SELF-INSTRUCT: Aligning Language Model with Self Generated Instructions) that outlined a way to create high-quality, autogenerated question-and-answer training data that is better than the limited public data.
The Self-Instruct research paper explains:
“…we curate a set of expert-written instructions for novel tasks, and show through human evaluation that tuning GPT3 with SELF-INSTRUCT outperforms using existing public instruction datasets by a large margin, leaving only a 5% absolute gap behind InstructGPT…
…Applying our method to vanilla GPT3, we demonstrate a 33% absolute improvement over the original model on SUPERNATURALINSTRUCTIONS, on par with the performance of InstructGPT… which is trained with private user data and human annotations.”
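The bootstrapping idea behind autogenerated instruction data can be sketched roughly as follows. This is a minimal illustration, not the paper's actual pipeline: the `generate` callable stands in for a real language model call, `fake_generate` and the seed instructions are hypothetical, and the similarity check uses Python's `difflib` with an assumed threshold of 0.7 as a simple stand-in for whatever metric a real system would use.

```python
from difflib import SequenceMatcher


def is_novel(candidate, pool, threshold=0.7):
    """Return True if candidate is not too similar to any instruction in pool."""
    return all(
        SequenceMatcher(None, candidate.lower(), existing.lower()).ratio() < threshold
        for existing in pool
    )


def bootstrap(seed_instructions, generate, rounds=3):
    """Grow an instruction pool from a small seed set.

    `generate` is any callable that takes the current pool and returns
    a list of candidate instructions (in practice, a language model).
    """
    pool = list(seed_instructions)
    for _ in range(rounds):
        for candidate in generate(pool):
            # Keep only candidates that add something new to the pool.
            if is_novel(candidate, pool):
                pool.append(candidate)
    return pool


def fake_generate(pool):
    # Hypothetical stand-in for a language model call; returns fixed candidates.
    return [
        "Write a haiku about autumn.",
        "Write a haiku about autumn!",  # near-duplicate, should be filtered out
        "List three uses for a paperclip.",
    ]


seeds = [
    "Summarize the following paragraph.",
    "Translate this sentence into French.",
]
pool = bootstrap(seeds, fake_generate, rounds=2)
```

The filtering step is the part that matters: without it, a model asked to invent new tasks tends to repeat itself, and the resulting dataset would be large but not diverse.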
Dolly matters because it demonstrates that a useful large language model can be created with a smaller but high-quality dataset.
Databricks observes:
“Dolly works by taking an existing open source 6 billion parameter model from EleutherAI and modifying it ever so slightly to elicit instruction following capabilities such as brainstorming and text generation not present in the original model, using data from Alpaca.
…We show that anyone can take a dated off-the-shelf open source large language model (LLM) and give it magical ChatGPT-like instruction following ability by training it in 30 minutes on one machine, using high-quality training data.
Surprisingly, instruction-following does not seem to require the latest or largest models: our model is only 6 billion parameters, compared to 175 billion for GPT-3.”
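To make the "using data from Alpaca" part more concrete, here is a small sketch of how an instruction record might be turned into a single training string before fine-tuning. It assumes records with `instruction`, `input`, and `output` fields, and the template wording is modeled on the Alpaca-style prompt format; the exact text Databricks used is an assumption here, and `format_example` is a hypothetical helper name.

```python
# Assumed Alpaca-style templates; the exact wording used for Dolly may differ.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)

PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n"
)


def format_example(record):
    """Build one training string from an instruction record.

    Records with a non-empty "input" field use the longer template;
    the expected answer is appended after the "### Response:" marker.
    """
    template = PROMPT_WITH_INPUT if record.get("input") else PROMPT_NO_INPUT
    prompt = template.format(
        instruction=record["instruction"],
        input=record.get("input", ""),
    )
    return prompt + record["output"]


example = {
    "instruction": "Name three primary colors.",
    "input": "",
    "output": "Red, blue, and yellow.",
}
training_text = format_example(example)
```

A base model fine-tuned on many strings of this shape learns to continue the text after the response marker with an answer, which is roughly what gives it the instruction-following behavior described in the quote.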
Databricks Open Source AI
Dolly is said to democratize AI. It's part of a growing movement that was recently joined by the non-profit Mozilla organization with the founding of Mozilla.ai. Mozilla is the publisher of the Firefox browser and other open source software.
Read the full announcement by Databricks:
Hello Dolly: Democratizing the magic of ChatGPT with open models