Research

Open notes on text + task difficulty modeling

Current Model: Score Engine v1

Published: 2026-03-17 | Authors: CEFR.AI | Method Note

This note documents the current production scoring model as implemented in the score engine API (meta.version = legacy-gse-v1). The goal is methodological transparency: what v1 does well, what it does not do, and what evidence currently supports it.

Model Scope (v1)

v1 estimates text difficulty from text-only inputs. It combines:

  • Flesch Reading Ease on preprocessed text.
  • Vocabulary-based GSE estimation from lemma lookups in the internal vocabulary database.
  • Fixed weighted blending into a single 10-90 GSE estimate, then CEFR band conversion.

What v1 explicitly does not include:

  • Placement-score evidence from real learners.
  • Task-demand modeling (difficulty is estimated from text only, not text plus task).
  • Multi-source calibration corpora beyond the current validation set.

Data and Validation Snapshot

Current validation files in score_engine/algorithm_validation/test_texts contain 59 British Council texts covering A1, A2, B1, B2, and C1. The initial intent was approximately 60 texts; the current checked-in set is 59.

From the latest available validation report in the repo (report_20241226_105848.txt):

  • Exact CEFR accuracy: 32.20%
  • Within one CEFR level: 71.19%
  • Source coverage: single publisher (British Council), no explicit task annotations

These numbers are useful as a baseline, not as final performance claims for broader domains.
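On a 59-text set, the reported percentages correspond to 19 exact matches and 42 within-one matches. A minimal sketch of how such metrics can be computed is below; the ordinal scale and function name are illustrative, not taken from the validation scripts, and subbands such as B1+ are folded into their parent level for simplicity:

```python
# Hedged sketch: computing the two headline metrics from predicted vs.
# gold CEFR labels. CEFR_ORDER and cefr_accuracy are illustrative names,
# not the validation scripts' actual implementation.

CEFR_ORDER = {"A1": 0, "A2": 1, "B1": 2, "B2": 3, "C1": 4}

def cefr_accuracy(predicted, gold):
    """Return (exact-match rate, within-one-level rate) as fractions."""
    exact = sum(p == g for p, g in zip(predicted, gold))
    within_one = sum(
        abs(CEFR_ORDER[p] - CEFR_ORDER[g]) <= 1
        for p, g in zip(predicted, gold)
    )
    n = len(gold)
    return exact / n, within_one / n

# The reported 32.20% / 71.19% on 59 texts match 19 exact and 42
# within-one predictions: 19/59 = 0.3220..., 42/59 = 0.7119...
```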

Pipeline Overview

  1. Preprocess input text with formatting heuristics (block/list/regular handling).
  2. Compute Flesch Reading Ease on preprocessed text.
  3. Lemmatize with spaCy (en_core_web_sm) and lookup lemma GSE values in SQLite.
  4. Compute mean vocabulary GSE.
  5. Convert both channels into bounded 10-90 indices.
  6. Blend channels with fixed weights.
  7. Convert combined GSE to CEFR label.
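Steps 3 and 4 hinge on the lemma-to-GSE lookup. A minimal sketch using an in-memory SQLite table is below; the table name, schema, and sample values are assumptions for illustration (the production schema lives in the internal vocabulary database), and lemmatization is stubbed out since spaCy is a runtime dependency:

```python
import sqlite3

# Hedged sketch of steps 3-4: lemma -> GSE lookup, then mean vocabulary
# GSE. Table name, schema, and sample rows are illustrative assumptions;
# the production engine lemmatizes with spaCy (en_core_web_sm) first.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vocabulary (lemma TEXT PRIMARY KEY, gse REAL)")
conn.executemany(
    "INSERT INTO vocabulary VALUES (?, ?)",
    [("cat", 12.0), ("weather", 30.0), ("negotiate", 62.0)],
)

def avg_word_gse(lemmas, conn):
    """Mean GSE over lemmas found in the DB. Unknown lemmas are skipped,
    which is one way coverage gaps can distort v1's vocabulary signal."""
    values = []
    for lemma in lemmas:
        row = conn.execute(
            "SELECT gse FROM vocabulary WHERE lemma = ?", (lemma,)
        ).fetchone()
        if row is not None:
            values.append(row[0])
    return sum(values) / len(values) if values else None

print(avg_word_gse(["cat", "negotiate", "unknownword"], conn))  # 37.0
```

Note that skipping unknown lemmas (rather than penalizing them) is itself a design choice with consequences for texts heavy in names or rare words.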

Core formulas currently configured:

gse_reading_index = clamp(-3.2 * flesch_score + 280, 10, 90)
word_gse_index    = clamp( 5.0 * avg_word_gse -  90, 10, 90)
overall_gse       = 0.1 * gse_reading_index + 0.9 * word_gse_index
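These formulas translate directly into code. A minimal sketch follows; the coefficients are the configured values above, while the function names are ours:

```python
def clamp(x, lo=10.0, hi=90.0):
    """Bound a raw index to the 10-90 GSE range."""
    return max(lo, min(hi, x))

def overall_gse(flesch_score, avg_word_gse):
    """Blend the two channels with v1's fixed 0.1 / 0.9 weights."""
    gse_reading_index = clamp(-3.2 * flesch_score + 280)
    word_gse_index = clamp(5.0 * avg_word_gse - 90)
    return 0.1 * gse_reading_index + 0.9 * word_gse_index

# Worked example (Flesch 70, mean lemma GSE 30):
#   reading index = clamp(-3.2 * 70 + 280) = 56.0
#   word index    = clamp(5.0 * 30 - 90)   = 60.0
#   overall       = 0.1 * 56 + 0.9 * 60    = 59.6
print(round(overall_gse(70, 30), 2))  # 59.6
```

The 0.9 weight on the vocabulary channel makes it the dominant signal, so the readability channel mostly nudges the final estimate.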

Strengths and Weaknesses by Section

Input preprocessing
  • Strengths: Handles structured text formats and normalizes sentence endings; reduces noise from raw copied content.
  • Weaknesses: Heuristic rules can alter readability behavior in edge formats; no task or context awareness.

Readability channel (Flesch)
  • Strengths: Fast, deterministic, easy to audit; a stable baseline feature.
  • Weaknesses: Designed for native-speaker readability; misses lexical and pedagogical nuance for L2 progression.

Vocabulary channel (spaCy + GSE lookup)
  • Strengths: Uses lemma-based matching and an explicit lexical difficulty signal; dominates the final score in v1 (0.9 weight).
  • Weaknesses: Sensitive to lemma coverage and POS/lemmatization quality; unknown words and named entities can distort the signal.

Score blending
  • Strengths: Transparent fixed coefficients and weights; reproducible runs.
  • Weaknesses: Fixed global weights are not personalized by learner profile, L1, genre, or task type.

CEFR conversion
  • Strengths: Simple mapping from combined GSE to CEFR labels (plus subbands such as B1+/B2+).
  • Weaknesses: Threshold mapping is hard-banded; it can hide uncertainty near boundaries.

Validation setup
  • Strengths: Versioned local validation scripts and reports; fast to rerun.
  • Weaknesses: Single-source corpus (British Council), 59 texts, no task metadata, no placement-linked outcomes.

API packaging
  • Strengths: Clear /v1/analyze/text endpoint with latency metadata and an explicit engine version tag.
  • Weaknesses: No built-in confidence interval or rationale trace in the response; the version tag reads legacy even though this is still the baseline production logic.
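To make the hard-banding concern concrete, here is a sketch of a threshold-based GSE-to-CEFR mapping. The cut-offs below are illustrative assumptions only, not the values configured in cefr_converter.py; they exist to show why near-boundary predictions lose uncertainty:

```python
# Hedged sketch of hard-banded GSE -> CEFR conversion. The cut-offs are
# ILLUSTRATIVE ASSUMPTIONS, not the configured values in
# cefr_converter.py; they only demonstrate the boundary-hiding issue.

ILLUSTRATIVE_BANDS = [  # (upper bound inclusive, label)
    (29, "A1"), (42, "A2"), (50, "B1"), (58, "B1+"),
    (66, "B2"), (75, "B2+"), (90, "C1"),
]

def gse_to_cefr(gse):
    """Return the first band whose upper bound covers the score."""
    for upper, label in ILLUSTRATIVE_BANDS:
        if gse <= upper:
            return label
    return "C1"

# Scores of 49.6 and 50.4 differ by less than one GSE point, yet the
# hard mapping reports B1 vs. B1+ with no uncertainty flag.
```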

What v1 Is Good For

  • Fast first-pass text difficulty screening.
  • Consistent internal comparisons across similar content.
  • Generating a transparent baseline for future model iterations.

Where v1 Can Mislead

  • Task-heavy learning contexts where text-only signals understate real difficulty.
  • Cross-domain or cross-publisher comparisons with style/format drift.
  • High-stakes placement decisions without learner performance evidence.

Open Methodology Commitments

For each engine version, CEFR.AI should publish:

  • Inputs and exclusions (what signals are used vs omitted).
  • Fixed coefficients and thresholds.
  • Validation set composition and known blind spots.
  • A change log of what moved between versions.

This post is the v1 baseline reference for that process.

Implementation Pointers (Code Paths)

  • Analyzer orchestration: score_engine/text_analysis_engine/analyzers/gse_analyzer.py
  • Estimator formulas and blending: score_engine/text_analysis_engine/utils/gse_estimator.py
  • Coefficients/weights: score_engine/text_analysis_engine/utils/gse_config.py
  • CEFR mapping: score_engine/text_analysis_engine/utils/cefr_converter.py
  • Preprocessing: score_engine/text_analysis_engine/processors/text_preprocessor.py
  • Validation scripts: score_engine/algorithm_validation/

Next Version Priorities

  1. Add placement-score evidence to calibration and evaluation.
  2. Add task-demand features so scoring reflects text plus task.
  3. Expand validation beyond one publisher and publish stratified performance by source/genre.
  4. Add uncertainty reporting so near-boundary predictions are treated cautiously.

To reference this note: CEFR.AI (2026-03-17). Current Model: Score Engine v1. CEFR.AI Research.
