Worthwhile read on the Hugging Face blog: on Dec 5th, Pleias released the Pleias 1.0 family of small language models, "the first ever models trained exclusively on open data ... the first fully EU AI Act compliant models."
On their training data: "We are moving away from the standard format of web archives. Instead, we use our new dataset composed of uncopyrighted and permissibly licensed data, Common Corpus. To create this dataset, we had to develop an extensive range of tools to collect, to generate, and to process pretraining [data]."
On the models:
- "multilingual, offering strong support for multiple European languages
- safe, showing the lowest results on the toxicity benchmark
- performant for key tasks, such as knowledge retrieval
- able to run efficiently on consumer-grade hardware locally (CPU-only, without quantisation)"
Read the full press release here: https://huggingface.co/blog/Pclanglais/common-models
Warmly,
------------------------------
Heather Sardis
Associate Director for Technology and Strategic Planning
Massachusetts Institute of Technology
------------------------------