Research

Open notes on text + task difficulty modeling

Current Model: Score Engine v1

Published: 2026-03-17 | Authors: CEFR.AI | Method Note

This note documents the current production scoring model as implemented in the score engine API (meta.version = legacy-gse-v1). The goal is methodological transparency: what v1 does well, what it does not do, and what evidence currently supports it.

Model Scope (v1)

v1 estimates text difficulty from text-only inputs. It combines:

  • Flesch Reading Ease on preprocessed text.
  • Vocabulary-based GSE estimation from lemma lookups in the internal vocabulary database.
  • Fixed weighted blending into a single 10-90 GSE estimate, then CEFR band conversion.

What v1 explicitly does not include:

  • Placement-score evidence from real learners.
  • Task-demand modeling (difficulty is estimated from text only, not text plus task).
  • Multi-source calibration corpora beyond the current validation set.

Data and Validation Snapshot

Current validation files in score_engine/algorithm_validation/test_texts contain 59 British Council texts covering A1, A2, B1, B2, and C1. The initial intent was approximately 60 texts; the current checked-in set is 59.

From the latest available validation report in the repo (report_20241226_105848.txt):

  • Exact CEFR accuracy: 32.20%
  • Within one CEFR level: 71.19%
  • Source coverage: single publisher (British Council), no explicit task annotations

These numbers are useful as a baseline, not as final performance claims for broader domains.
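On a 59-text set, the reported percentages correspond to 19 exact matches and 42 within-one matches. A minimal sketch of how such metrics can be computed is below; the ordinal scale and function name are illustrative, not taken from the validation scripts, and subbands such as B1+ are folded into their parent level for simplicity:

```python
# Hedged sketch: computing the two headline metrics from predicted vs.
# gold CEFR labels. CEFR_ORDER and cefr_accuracy are illustrative names,
# not the validation scripts' actual implementation.

CEFR_ORDER = {"A1": 0, "A2": 1, "B1": 2, "B2": 3, "C1": 4}

def cefr_accuracy(predicted, gold):
    """Return (exact-match rate, within-one-level rate) as fractions."""
    exact = sum(p == g for p, g in zip(predicted, gold))
    within_one = sum(
        abs(CEFR_ORDER[p] - CEFR_ORDER[g]) <= 1
        for p, g in zip(predicted, gold)
    )
    n = len(gold)
    return exact / n, within_one / n

# The reported 32.20% / 71.19% on 59 texts match 19 exact and 42
# within-one predictions: 19/59 = 0.3220..., 42/59 = 0.7119...
```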

Pipeline Overview

  1. Preprocess input text with formatting heuristics (block/list/regular handling).
  2. Compute Flesch Reading Ease on preprocessed text.
  3. Lemmatize with spaCy (en_core_web_sm) and lookup lemma GSE values in SQLite.
  4. Compute mean vocabulary GSE.
  5. Convert both channels into bounded 10-90 indices.
  6. Blend channels with fixed weights.
  7. Convert combined GSE to CEFR label.
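Steps 3 and 4 hinge on the lemma-to-GSE lookup. A minimal sketch using an in-memory SQLite table is below; the table name, schema, and sample values are assumptions for illustration (the production schema lives in the internal vocabulary database), and lemmatization is stubbed out since spaCy is a runtime dependency:

```python
import sqlite3

# Hedged sketch of steps 3-4: lemma -> GSE lookup, then mean vocabulary
# GSE. Table name, schema, and sample rows are illustrative assumptions;
# the production engine lemmatizes with spaCy (en_core_web_sm) first.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vocabulary (lemma TEXT PRIMARY KEY, gse REAL)")
conn.executemany(
    "INSERT INTO vocabulary VALUES (?, ?)",
    [("cat", 12.0), ("weather", 30.0), ("negotiate", 62.0)],
)

def avg_word_gse(lemmas, conn):
    """Mean GSE over lemmas found in the DB. Unknown lemmas are skipped,
    which is one way coverage gaps can distort v1's vocabulary signal."""
    values = []
    for lemma in lemmas:
        row = conn.execute(
            "SELECT gse FROM vocabulary WHERE lemma = ?", (lemma,)
        ).fetchone()
        if row is not None:
            values.append(row[0])
    return sum(values) / len(values) if values else None

print(avg_word_gse(["cat", "negotiate", "unknownword"], conn))  # 37.0
```

Note that skipping unknown lemmas (rather than penalizing them) is itself a design choice with consequences for texts heavy in names or rare words.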

Core formulas currently configured:

gse_reading_index = clamp(-3.2 * flesch_score + 280, 10, 90)
word_gse_index    = clamp( 5.0 * avg_word_gse -  90, 10, 90)
overall_gse       = 0.1 * gse_reading_index + 0.9 * word_gse_index
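These formulas translate directly into code. A minimal sketch follows; the coefficients are the configured values above, while the function names are ours:

```python
def clamp(x, lo=10.0, hi=90.0):
    """Bound a raw index to the 10-90 GSE range."""
    return max(lo, min(hi, x))

def overall_gse(flesch_score, avg_word_gse):
    """Blend the two channels with v1's fixed 0.1 / 0.9 weights."""
    gse_reading_index = clamp(-3.2 * flesch_score + 280)
    word_gse_index = clamp(5.0 * avg_word_gse - 90)
    return 0.1 * gse_reading_index + 0.9 * word_gse_index

# Worked example (Flesch 70, mean lemma GSE 30):
#   reading index = clamp(-3.2 * 70 + 280) = 56.0
#   word index    = clamp(5.0 * 30 - 90)   = 60.0
#   overall       = 0.1 * 56 + 0.9 * 60    = 59.6
print(round(overall_gse(70, 30), 2))  # 59.6
```

The 0.9 weight on the vocabulary channel makes it the dominant signal, so the readability channel mostly nudges the final estimate.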

Strengths and Weaknesses by Section

Input preprocessing
  • Strengths: Handles structured text formats and normalizes sentence endings; reduces noise from raw copied content.
  • Weaknesses: Heuristic rules can alter readability behavior in edge formats; no task or context awareness.

Readability channel (Flesch)
  • Strengths: Fast, deterministic, easy to audit; a stable baseline feature.
  • Weaknesses: Designed for native-speaker readability; misses lexical and pedagogical nuance for L2 progression.

Vocabulary channel (spaCy + GSE lookup)
  • Strengths: Uses lemma-based matching and an explicit lexical difficulty signal; dominates the final score in v1 (0.9 weight).
  • Weaknesses: Sensitive to lemma coverage and POS/lemmatization quality; unknown words and named entities can distort the signal.

Score blending
  • Strengths: Transparent fixed coefficients and weights; reproducible runs.
  • Weaknesses: Fixed global weights are not personalized by learner profile, L1, genre, or task type.

CEFR conversion
  • Strengths: Simple mapping from combined GSE to CEFR labels (plus subbands such as B1+/B2+).
  • Weaknesses: Threshold mapping is hard-banded; it can hide uncertainty near boundaries.

Validation setup
  • Strengths: Versioned local validation scripts and reports; fast to rerun.
  • Weaknesses: Single-source corpus (British Council), 59 texts, no task metadata, no placement-linked outcomes.

API packaging
  • Strengths: Clear /v1/analyze/text endpoint with latency metadata and an explicit engine version tag.
  • Weaknesses: No built-in confidence interval or rationale trace in the response; the version tag reads legacy even though this is still the baseline production logic.
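To make the hard-banding concern concrete, here is a sketch of a threshold-based GSE-to-CEFR mapping. The cut-offs below are illustrative assumptions only, not the values configured in cefr_converter.py; they exist to show why near-boundary predictions lose uncertainty:

```python
# Hedged sketch of hard-banded GSE -> CEFR conversion. The cut-offs are
# ILLUSTRATIVE ASSUMPTIONS, not the configured values in
# cefr_converter.py; they only demonstrate the boundary-hiding issue.

ILLUSTRATIVE_BANDS = [  # (upper bound inclusive, label)
    (29, "A1"), (42, "A2"), (50, "B1"), (58, "B1+"),
    (66, "B2"), (75, "B2+"), (90, "C1"),
]

def gse_to_cefr(gse):
    """Return the first band whose upper bound covers the score."""
    for upper, label in ILLUSTRATIVE_BANDS:
        if gse <= upper:
            return label
    return "C1"

# Scores of 49.6 and 50.4 differ by less than one GSE point, yet the
# hard mapping reports B1 vs. B1+ with no uncertainty flag.
```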

What v1 Is Good For

  • Fast first-pass text difficulty screening.
  • Consistent internal comparisons across similar content.
  • Generating a transparent baseline for future model iterations.

Where v1 Can Mislead

  • Task-heavy learning contexts where text-only signals understate real difficulty.
  • Cross-domain or cross-publisher comparisons with style/format drift.
  • High-stakes placement decisions without learner performance evidence.

Open Methodology Commitments

For each engine version, CEFR.AI should publish:

  • Inputs and exclusions (what signals are used vs omitted).
  • Fixed coefficients and thresholds.
  • Validation set composition and known blind spots.
  • A change log of what moved between versions.

This post is the v1 baseline reference for that process.

Implementation Pointers (Code Paths)

  • Analyzer orchestration: score_engine/text_analysis_engine/analyzers/gse_analyzer.py
  • Estimator formulas and blending: score_engine/text_analysis_engine/utils/gse_estimator.py
  • Coefficients/weights: score_engine/text_analysis_engine/utils/gse_config.py
  • CEFR mapping: score_engine/text_analysis_engine/utils/cefr_converter.py
  • Preprocessing: score_engine/text_analysis_engine/processors/text_preprocessor.py
  • Validation scripts: score_engine/algorithm_validation/

Next Version Priorities

  1. Add placement-score evidence to calibration and evaluation.
  2. Add task-demand features so scoring reflects text plus task.
  3. Expand validation beyond one publisher and publish stratified performance by source/genre.
  4. Add uncertainty reporting so near-boundary predictions are treated cautiously.

To reference this note: CEFR.AI (2026-03-17). Current Model: Score Engine v1. CEFR.AI Research.
