This note documents the current production scoring model as implemented in the score engine API (meta.version = legacy-gse-v1). The goal is methodological transparency: what v1 does well, what it does not do, and what evidence currently supports it.
Model Scope (v1)
v1 estimates text difficulty from text-only inputs. It combines:
- Flesch Reading Ease on preprocessed text.
- Vocabulary-based GSE estimation from lemma lookups in the internal vocabulary database.
- Fixed weighted blending into a single 10-90 GSE estimate, then CEFR band conversion.
What v1 explicitly does not include:
- Placement-score evidence from real learners.
- Task-demand modeling (difficulty is estimated from text only, not text plus task).
- Multi-source calibration corpora beyond the current validation set.
Data and Validation Snapshot
Current validation files in score_engine/algorithm_validation/test_texts contain 59 British Council texts spanning A1 through C1. The initial intent was approximately 60 texts; the current checked-in set is 59.
From the latest available validation report in repo (report_20241226_105848.txt):
- Exact CEFR accuracy: 32.20%
- Within one CEFR level: 71.19%
- Source coverage: single publisher (British Council), no explicit task annotations
These numbers are useful as a baseline, not as final performance claims for broader domains.
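The two headline metrics above can be reproduced from (predicted, gold) CEFR label pairs. A minimal sketch follows; the level ordering and function name are illustrative, and the actual validation script may compute these differently.

```python
# Sketch: exact-match and within-one-level CEFR accuracy from
# (predicted, gold) label pairs. Level order is assumed.
LEVELS = ["A1", "A2", "B1", "B2", "C1"]

def cefr_metrics(pairs):
    """pairs: iterable of (predicted, gold) CEFR labels.
    Returns (exact accuracy, within-one-level accuracy)."""
    idx = {lvl: i for i, lvl in enumerate(LEVELS)}
    pairs = list(pairs)
    exact = sum(p == g for p, g in pairs)
    within_one = sum(abs(idx[p] - idx[g]) <= 1 for p, g in pairs)
    n = len(pairs)
    return exact / n, within_one / n

# Example: one exact hit, one off-by-one, one off-by-two
exact, within = cefr_metrics([("A2", "A2"), ("B1", "B2"), ("C1", "B1")])
# exact = 1/3, within = 2/3
```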
Pipeline Overview
- Preprocess input text with formatting heuristics (block/list/regular handling).
- Compute Flesch Reading Ease on preprocessed text.
- Lemmatize with spaCy (en_core_web_sm) and look up lemma GSE values in SQLite.
- Compute mean vocabulary GSE.
- Convert both channels into bounded 10-90 indices.
- Blend channels with fixed weights.
- Convert combined GSE to CEFR label.
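The lemma-lookup step can be sketched as follows. Production uses spaCy (en_core_web_sm) for lemmatization and the internal SQLite vocabulary database; here a toy in-memory table and a plain token list stand in so the sketch is self-contained, and the table schema is an assumption.

```python
# Sketch of the vocabulary channel: lemma -> GSE lookup, then mean.
# The table name, schema, and sample values are illustrative only.
import sqlite3

def build_toy_vocab():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE vocab (lemma TEXT PRIMARY KEY, gse REAL)")
    conn.executemany(
        "INSERT INTO vocab VALUES (?, ?)",
        [("cat", 22.0), ("analyse", 55.0), ("notwithstanding", 80.0)],
    )
    return conn

def mean_vocab_gse(lemmas, conn):
    """Average GSE over lemmas found in the vocabulary DB.
    Unknown lemmas are skipped entirely, which is one way coverage
    gaps and named entities can distort the final signal."""
    values = []
    for lemma in lemmas:
        row = conn.execute(
            "SELECT gse FROM vocab WHERE lemma = ?", (lemma,)
        ).fetchone()
        if row is not None:
            values.append(row[0])
    return sum(values) / len(values) if values else None

conn = build_toy_vocab()
avg = mean_vocab_gse(["cat", "analyse", "unknownword"], conn)
# avg = (22.0 + 55.0) / 2 = 38.5; "unknownword" is silently dropped
```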
Core formulas currently configured:
gse_reading_index = clamp(-3.2 * flesch_score + 280, 10, 90)
word_gse_index = clamp(5.0 * avg_word_gse - 90, 10, 90)
overall_gse = 0.1 * gse_reading_index + 0.9 * word_gse_index
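Written as executable code, the configured formulas look like this. Coefficients are taken verbatim from the formulas above; function and variable names are illustrative, not the engine's actual identifiers.

```python
# The v1 formulas as code: clamp both channels to 10-90, then blend
# with the fixed 0.1 / 0.9 weights.
def clamp(x, lo, hi):
    return max(lo, min(hi, x))

def overall_gse(flesch_score, avg_word_gse):
    gse_reading_index = clamp(-3.2 * flesch_score + 280, 10, 90)
    word_gse_index = clamp(5.0 * avg_word_gse - 90, 10, 90)
    return 0.1 * gse_reading_index + 0.9 * word_gse_index

# Example: Flesch 60 ("plain English"), mean lemma GSE 30
# reading index = clamp(-192 + 280) = 88; word index = clamp(150 - 90) = 60
# overall = 0.1 * 88 + 0.9 * 60 = 62.8
```

Note how the 0.9 weight makes the vocabulary channel dominate: a large swing in Flesch moves the final score far less than a modest swing in mean lemma GSE.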
Strengths and Weaknesses by Section
| Section | Strengths | Weaknesses |
|---|---|---|
| Input preprocessing | Handles structured text formats and normalizes sentence endings; reduces noise from raw copied content. | Heuristic rules can alter readability behavior in edge formats; no task/context awareness. |
| Readability channel (Flesch) | Fast, deterministic, easy to audit, stable baseline feature. | Designed for native-speaker readability; misses lexical and pedagogical nuance for L2 progression. |
| Vocabulary channel (spaCy + GSE lookup) | Uses lemma-based matching and an explicit lexical difficulty signal; dominates final score in v1 (0.9 weight). | Sensitive to lemma coverage and POS/lemmatization quality; unknown words and named entities can distort signal. |
| Score blending | Transparent fixed coefficients/weights; reproducible runs. | Fixed global weights are not personalized by learner profile, L1, genre, or task type. |
| CEFR conversion | Simple mapping from combined GSE to CEFR labels (+ subbands such as B1+/B2+). | Threshold mapping is hard-banded; can hide uncertainty near boundaries. |
| Validation setup | Versioned local validation scripts and reports; fast to rerun. | Single-source corpus (British Council), 59 texts, no task metadata, no placement-linked outcomes. |
| API packaging | Clear /v1/analyze/text endpoint with latency metadata and explicit engine version tag. | No built-in confidence interval or rationale trace in response; version is labeled legacy but still baseline production logic. |
What v1 Is Good For
- Fast first-pass text difficulty screening.
- Consistent internal comparisons across similar content.
- Generating a transparent baseline for future model iterations.
Where v1 Can Mislead
- Task-heavy learning contexts where text-only signals understate real difficulty.
- Cross-domain or cross-publisher comparisons with style/format drift.
- High-stakes placement decisions without learner performance evidence.
Open Methodology Commitments
For each engine version, CEFR.AI should publish:
- Inputs and exclusions (what signals are used vs omitted).
- Fixed coefficients and thresholds.
- Validation set composition and known blind spots.
- A change log of what moved between versions.
This post is the v1 baseline reference for that process.
Implementation Pointers (Code Paths)
- Analyzer orchestration: score_engine/text_analysis_engine/analyzers/gse_analyzer.py
- Estimator formulas and blending: score_engine/text_analysis_engine/utils/gse_estimator.py
- Coefficients/weights: score_engine/text_analysis_engine/utils/gse_config.py
- CEFR mapping: score_engine/text_analysis_engine/utils/cefr_converter.py
- Preprocessing: score_engine/text_analysis_engine/processors/text_preprocessor.py
- Validation scripts: score_engine/algorithm_validation/
Next Version Priorities
- Add placement-score evidence to calibration and evaluation.
- Add task-demand features so scoring reflects text plus task.
- Expand validation beyond one publisher and publish stratified performance by source/genre.
- Add uncertainty reporting so near-boundary predictions are treated cautiously.
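One cheap form of the uncertainty reporting named above is a boundary-margin flag: how far the combined GSE score sits from the nearest band cut-point. This is a sketch of one possible approach, not a committed design; the boundary values and tolerance are illustrative.

```python
# Sketch: flag predictions whose combined GSE score is close to a
# CEFR band boundary. Boundary values are illustrative only.
BOUNDARIES = [22, 30, 36, 43, 51, 59, 67, 76, 85]

def boundary_margin(gse):
    """Smallest distance to any band cut-point; small margin = low confidence."""
    return min(abs(gse - b) for b in BOUNDARIES)

def is_near_boundary(gse, tol=2.0):
    return boundary_margin(gse) < tol

# boundary_margin(60) -> 1 (close to the 59 cut-point), so flagged;
# boundary_margin(63) -> 4, so not flagged at tol=2.0
```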