LLM accuracy at rating text quality on a 1–6 scale across multiple languages
· Documents sourced from FineWeb dataset
Methodology
The core objective of this benchmark is to evaluate how effectively Large Language Models can assess text quality, simulating the process of filtering data for LLM pre-training. The dataset curation followed a strict pipeline:
Initial Scoring: Multilingual texts sampled from the FineWeb dataset were evaluated by DeepSeek V3.2, which assigned them a quality and substantiveness rating on a scale from 1 (lowest quality) to 6 (highest quality).
Verification: These initial scores were subsequently verified by an independent judge, Gemini 3 Flash.
Filtering: To ensure the highest ground-truth reliability, only the documents that received the absolute highest approval rating during the Gemini verification phase were included in this benchmark.
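The three curation steps above can be sketched as a small filtering loop. This is an illustrative sketch only: `score_fn` and `verify_fn` are hypothetical stand-ins for the DeepSeek V3.2 scorer and the Gemini 3 Flash verifier, and the assumption that the verifier's approval is expressed on the same 1–6 scale (with 6 as the maximum) is mine.

```python
def curate(documents, score_fn, verify_fn, max_approval=6):
    """Keep only documents whose initial rating received the highest
    approval from the independent verifier (sketch of the pipeline)."""
    curated = []
    for doc in documents:
        score = score_fn(doc)             # initial 1-6 quality rating
        approval = verify_fn(doc, score)  # independent verification
        if approval == max_approval:      # keep only top-approved items
            curated.append({"text": doc, "label": score})
    return curated


# Toy usage with mocked judges (made-up data, not real model calls):
scores = {"good text": 5, "bad text": 2}
approvals = {"good text": 6, "bad text": 3}
kept = curate(["good text", "bad text"],
              scores.get,
              lambda d, s: approvals[d])
# kept == [{"text": "good text", "label": 5}]
```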
Version: 1.0 · German (de) excluded in this version
Exact match = 1.0 pt · Off by ±1 = 0.5 pt · Off by ≥2 = 0.0 pt
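The per-item scoring rule can be written as a small function (a sketch; the function name is my own, the point values come from the legend above):

```python
def item_credit(predicted: int, actual: int) -> float:
    """Credit for a single rating under the benchmark's rule:
    exact match -> 1.0 pt, off by 1 -> 0.5 pt, off by 2 or more -> 0.0 pt."""
    diff = abs(predicted - actual)
    if diff == 0:
        return 1.0
    if diff == 1:
        return 0.5
    return 0.0

# item_credit(4, 4) -> 1.0
# item_credit(3, 4) -> 0.5
# item_credit(1, 4) -> 0.0
```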
Filter by language
Global Model Comparison
Weighted Score vs Exact Accuracy — all languages combined, sorted by Weighted Score
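Assuming the two columns are simple averages over per-item results (my reading of the chart, not a confirmed formula), they could be computed as follows; `aggregate` and the sample pairs are illustrative:

```python
def aggregate(pairs):
    """pairs: list of (predicted, actual) ratings on the 1-6 scale.
    Returns (weighted_score, exact_accuracy), both in [0, 1]."""
    credits = [1.0 if p == a else 0.5 if abs(p - a) == 1 else 0.0
               for p, a in pairs]
    weighted = sum(credits) / len(pairs)                 # Weighted Score
    exact = sum(p == a for p, a in pairs) / len(pairs)   # Exact Accuracy
    return weighted, exact


# Toy example (made-up predictions vs. ground-truth labels):
weighted, exact = aggregate([(4, 4), (3, 4), (1, 4), (5, 5)])
# weighted == 0.625, exact == 0.5
```

Weighted Score is always at least Exact Accuracy, since near-misses earn partial credit.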
Dataset Distribution
Number of unique texts per rating score (1–6) for each language — sourced from original files
Model Error Analysis
Bias, critical misclassifications, and confusion patterns