LLM accuracy at rating text quality on a 1–6 scale across multiple languages
· Documents sourced from FineWeb dataset
Methodology
The core objective of this benchmark is to evaluate how effectively Large Language Models can assess text quality, simulating the process of filtering data for LLM pre-training. The dataset curation followed a strict pipeline:
Initial Scoring: Multilingual texts sampled from the FineWeb dataset were evaluated by DeepSeek V3.2, which assigned them a quality and substantiveness rating on a scale from 1 (lowest quality) to 6 (highest quality).
Verification: These initial scores were subsequently verified by an independent judge, Gemini 3 Flash.
Filtering: To ensure the highest ground-truth reliability, only the documents that received the absolute highest approval rating during the Gemini verification phase were included in this benchmark.
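The three curation steps above can be sketched as a small filtering loop. This is an illustrative sketch only: `score_fn` and `verify_fn` are hypothetical stand-ins for the DeepSeek V3.2 scorer and the Gemini 3 Flash verifier, and the assumption that the verifier's approval is expressed on the same 1–6 scale (with 6 as the maximum) is mine.

```python
def curate(documents, score_fn, verify_fn, max_approval=6):
    """Keep only documents whose initial rating received the highest
    approval from the independent verifier (sketch of the pipeline)."""
    curated = []
    for doc in documents:
        score = score_fn(doc)             # initial 1-6 quality rating
        approval = verify_fn(doc, score)  # independent verification
        if approval == max_approval:      # keep only top-approved items
            curated.append({"text": doc, "label": score})
    return curated


# Toy usage with mocked judges (made-up data, not real model calls):
scores = {"good text": 5, "bad text": 2}
approvals = {"good text": 6, "bad text": 3}
kept = curate(["good text", "bad text"],
              scores.get,
              lambda d, s: approvals[d])
# kept == [{"text": "good text", "label": 5}]
```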
Version: 1.0 · German (de) excluded in this version
Exact match = 1.0 pt · Off by ±1 = 0.5 pt · Off by ≥2 = 0.0 pt
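The per-item scoring rule can be written as a small function (a sketch; the function name is my own, the point values come from the legend above):

```python
def item_credit(predicted: int, actual: int) -> float:
    """Credit for a single rating under the benchmark's rule:
    exact match -> 1.0 pt, off by 1 -> 0.5 pt, off by 2 or more -> 0.0 pt."""
    diff = abs(predicted - actual)
    if diff == 0:
        return 1.0
    if diff == 1:
        return 0.5
    return 0.0

# item_credit(4, 4) -> 1.0
# item_credit(3, 4) -> 0.5
# item_credit(1, 4) -> 0.0
```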
Filter by language
Global Model Comparison
Weighted Score vs Exact Accuracy — all languages combined, sorted by Weighted Score
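Assuming the two columns are simple averages over per-item results (my reading of the chart, not a confirmed formula), they could be computed as follows; `aggregate` and the sample pairs are illustrative:

```python
def aggregate(pairs):
    """pairs: list of (predicted, actual) ratings on the 1-6 scale.
    Returns (weighted_score, exact_accuracy), both in [0, 1]."""
    credits = [1.0 if p == a else 0.5 if abs(p - a) == 1 else 0.0
               for p, a in pairs]
    weighted = sum(credits) / len(pairs)                 # Weighted Score
    exact = sum(p == a for p, a in pairs) / len(pairs)   # Exact Accuracy
    return weighted, exact


# Toy example (made-up predictions vs. ground-truth labels):
weighted, exact = aggregate([(4, 4), (3, 4), (1, 4), (5, 5)])
# weighted == 0.625, exact == 0.5
```

Weighted Score is always at least Exact Accuracy, since near-misses earn partial credit.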
Dataset Distribution
Number of unique texts per rating score (1–6) for each language — sourced from original files
Model Error Analysis
Bias, critical misclassifications, and confusion patterns