It’s been a long time coming, but we finally have some promising LLMs to try out which are trained entirely on openly licensed text!

EleutherAI released the Pile four and a half years ago: “an 800GB dataset of diverse text for language modeling”. It’s been used as the basis for many LLMs since then, but much of the data in it came from Common Crawl, a crawl of the public web that mostly ignored the licenses of the data it was collecting.
While we’ve long been largely positive here at Conffab about the impact of LLMs, we’ve also not ignored some significant challenges with the technology, including the highly problematic intellectual property aspects of how these models have been trained.
So it’s encouraging to see significant models now trained on openly licensed data, as Simon Willison details here for The Common Pile v0.1, EleutherAI’s successor to the original Pile, created in collaboration with a large group of other organizations with whom they have been “meticulously curating an 8 TB corpus of openly licensed and public domain text for training large language models”.