
Chinese Tiny LLM (CT-LLM) is the first fully open-source large language model centered on Chinese.

Pre-training and fine-tuning were carried out mainly on Chinese corpora.

The team also released the supporting MAP-CC Chinese pre-training dataset and the CHC-Bench Chinese evaluation benchmark.

Detailed introduction:

CT-LLM was built from scratch. Unlike traditional approaches, its training data is predominantly Chinese: a corpus of 1,200 billion tokens in total, of which 800 billion are Chinese tokens, 300 billion are English tokens, and 100 billion are code tokens.
This unique composition gives CT-LLM strong abilities in understanding and processing Chinese, which are further enhanced through alignment techniques.
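For reference, that token mix makes the corpus roughly two-thirds Chinese. A quick back-of-the-envelope check (the figures come from the announcement above; the snippet is just illustrative arithmetic):

```python
# Corpus composition reported for CT-LLM pre-training (tokens, in billions).
corpus = {"Chinese": 800, "English": 300, "code": 100}

total = sum(corpus.values())  # 1,200 billion tokens in total
for name, tokens in corpus.items():
    share = tokens / total
    print(f"{name:>7}: {tokens:>5}B tokens ({share:.1%})")

# Expected output:
# Chinese:   800B tokens (66.7%)
# English:   300B tokens (25.0%)
#    code:   100B tokens (8.3%)
```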
CT-LLM demonstrates excellent performance on the Chinese Hard Case Benchmark (CHC-Bench), shining in Chinese tasks, and after supervised fine-tuning (SFT) it also handles English tasks capably.
This research challenges the prevailing practice of training large language models primarily on English data and opens up new directions for LLM training methodology.
The team has disclosed the complete training process of the Chinese large language model, including detailed data processing steps for the Massive Appropriate Pretraining Chinese Corpus (MAP-CC), the curated Chinese Hard Case Benchmark (CHC-Bench), and the 2B-parameter Chinese Tiny LLM (CT-LLM) itself.

Project address: https://chinese-tiny-llm.github.io/
Data and model downloads: https://huggingface.co/collections/m-a-p/chinese-tiny-llm-660d0133dff6856f94ce0fc6
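To try the released artifacts, a minimal sketch using the Hugging Face transformers and datasets libraries might look like the following. The repository IDs (m-a-p/CT-LLM-SFT and m-a-p/MAP-CC) are assumptions based on the collection linked above; check the collection page for the exact names.

```python
# Minimal sketch: load a CT-LLM checkpoint and stream the MAP-CC corpus.
# Repo IDs below are assumptions; verify them on the linked collection page.
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

MODEL_ID = "m-a-p/CT-LLM-SFT"  # assumed checkpoint name in the m-a-p collection

# Some checkpoints may require trust_remote_code=True; add it if loading fails.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

# Generate a short Chinese completion.
inputs = tokenizer("请用一句话介绍大语言模型。", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# Stream a few records of the pre-training corpus without downloading it all.
mapcc = load_dataset("m-a-p/MAP-CC", split="train", streaming=True)  # assumed dataset ID
for i, record in enumerate(mapcc):
    print(record)
    if i >= 2:
        break
```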
