Character-Level Language Modeling with Parallel Corpora: A Computational Linguistics Project - 05/2024

Project Overview

This project investigated the construction and evaluation of character-level statistical language models (LMs) trained on multilingual data. Using a parallel corpus derived from EuroParl—a multilingual collection of European Parliament proceedings—I developed and tested several English language models and evaluated their ability to generalize to Dutch and Italian. The models were trained on both full English sentences and isolated English word types, enabling a comparison between training on naturally frequency-weighted text and training on a flat inventory of word types. The project provided hands-on experience in data preprocessing, n-gram modeling, perplexity-based evaluation, and cross-linguistic analysis, offering insight into the strengths and limitations of character-level language modeling.

Objectives

The primary aim of the project was to explore how different training inputs and modeling choices affect a language model’s ability to generalize within and across languages. Specifically, I sought to train bigram and tetragram character-level models on both English sentences and word types, assess their performance using perplexity on a range of test sets, and identify model behaviors when applied to related and unrelated languages. To do this, I preprocessed a parallel corpus of English, Dutch, and Italian texts, constructed four language models with shared vocabulary and smoothing, and analyzed the resulting scores to understand which models generalized best and why. The project concluded with a series of analytical questions that explored the influence of training context and linguistic similarity on model outcomes.

Dataset and Preprocessing

The dataset consisted of two JSON-formatted parallel corpora for training and testing, each containing aligned sentences in English, Dutch, and Italian. The preprocessing pipeline, applied uniformly across all three languages, involved lowercasing all characters, replacing digits with a generic 'D' symbol, and removing non-alphabetic characters while preserving accented letters and whitespace. Additional steps were tailored to character-level modeling: whitespace was removed from full-sentence inputs to produce uninterrupted character sequences, and rare characters (those occurring fewer than 20 times in the English training set) were replaced with a "?" placeholder to encourage generalization and handle sparsity.

This normalization strategy ensured that each model operated on a consistent vocabulary, avoiding overfitting to language-specific orthographic noise and allowing a fair comparison across training contexts.
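The sketch below illustrates this pipeline in Python. The function names, the exact filtering logic, and where the 20-occurrence threshold is applied are my reconstruction of the steps described above, not the project's actual code.

```python
import re
from collections import Counter

MIN_CHAR_COUNT = 20  # characters rarer than this in the English training data become "?"

def normalize(text):
    """Lowercase, map digits to 'D', and keep only letters (incl. accented) and whitespace."""
    text = text.lower()
    text = re.sub(r"\d", "D", text)
    return "".join(ch for ch in text if ch.isalpha() or ch.isspace())

def build_vocab(english_sentences):
    """Vocabulary = characters seen at least MIN_CHAR_COUNT times in the English training set."""
    counts = Counter(ch for s in english_sentences for ch in normalize(s))
    return {ch for ch, n in counts.items() if n >= MIN_CHAR_COUNT}

def preprocess_sentence(sentence, vocab):
    """Full-sentence input: strip whitespace for an uninterrupted character stream,
    then map out-of-vocabulary characters to the '?' placeholder."""
    chars = re.sub(r"\s", "", normalize(sentence))
    return "".join(ch if ch in vocab else "?" for ch in chars)
```

Word-type inputs go through the same normalization and "?" replacement; they simply contain no internal whitespace to begin with.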

Model Construction

The language models were implemented as n-gram character-level statistical models using add-k smoothing with k = 0.01 to manage unseen events in evaluation. Four models were created: bigram and tetragram models trained on either English sentences or word types. All models used the same vocabulary derived from the preprocessed English training data and incorporated explicit beginning-of-sequence (BoS) and end-of-sequence (EoS) markers to capture boundary information. This allowed for direct comparisons of model performance across different input types and context lengths. Each model was serialized for reuse and structured to support efficient lookup, probability computation, and perplexity evaluation.
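The following is an illustrative reconstruction of such a model in Python, with n = 2 for the bigram and n = 4 for the tetragram variants; the class name, marker symbols, and internal data structures are assumptions rather than the project's exact implementation.

```python
import pickle
from collections import Counter

class CharNGramLM:
    """Character-level n-gram model with add-k smoothing and BoS/EoS markers
    (an illustrative sketch, not the project's exact code)."""

    BOS, EOS = "<s>", "</s>"

    def __init__(self, n, vocab, k=0.01):
        self.n = n                       # 2 for bigram, 4 for tetragram
        self.k = k                       # add-k smoothing constant
        self.vocab = set(vocab) | {self.EOS, "?"}
        self.ngram_counts = Counter()    # counts of (context + next character)
        self.context_counts = Counter()  # counts of contexts alone

    def train(self, sequences):
        for seq in sequences:
            chars = [self.BOS] * (self.n - 1) + list(seq) + [self.EOS]
            for i in range(self.n - 1, len(chars)):
                context = tuple(chars[i - self.n + 1:i])
                self.ngram_counts[context + (chars[i],)] += 1
                self.context_counts[context] += 1

    def prob(self, context, char):
        """Add-k smoothed P(char | context)."""
        num = self.ngram_counts[tuple(context) + (char,)] + self.k
        den = self.context_counts[tuple(context)] + self.k * len(self.vocab)
        return num / den

    def save(self, path):
        """Serialize the trained model for reuse, as described above."""
        with open(path, "wb") as f:
            pickle.dump(self, f)
```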

Evaluation and Analysis

Perplexity was used as the primary metric for evaluating model performance. Each model was tested on five datasets: English training sentences, English word types, English test sentences, and test sentences in Dutch and Italian. This provided a comprehensive view of both in-language and cross-linguistic generalization. All results were recorded in a structured CSV file, and each model was stored as a serialized .pkl object for reproducibility.
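A minimal sketch of the evaluation step is shown below, reusing the CharNGramLM interface from the previous sketch. The function names and the CSV column layout are assumptions; only the per-character perplexity computation and the one-row-per-(model, test set) structure follow directly from the description above.

```python
import csv
import math

def perplexity(model, sequences):
    """Per-character perplexity of a model over a collection of sequences."""
    log_prob, n_events = 0.0, 0
    for seq in sequences:
        chars = [model.BOS] * (model.n - 1) + list(seq) + [model.EOS]
        for i in range(model.n - 1, len(chars)):
            context = tuple(chars[i - model.n + 1:i])
            log_prob += math.log(model.prob(context, chars[i]))
            n_events += 1
    return math.exp(-log_prob / n_events)

def write_results(models, test_sets, path="perplexities.csv"):
    """Write one perplexity score per (model, test set) pair, mirroring the results CSV.
    `models` and `test_sets` are assumed to be dicts of trained models and
    preprocessed evaluation data keyed by name."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["model", "test_set", "perplexity"])
        for model_name, model in models.items():
            for set_name, data in test_sets.items():
                writer.writerow([model_name, set_name, perplexity(model, data)])
```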

To better understand how the models handled orthographic variation across languages, a secondary evaluation focused on Dutch and Italian word types. Only words with at least five characters and sufficient frequency in the test set were considered. The analysis identified words with the lowest and highest perplexity values, providing insight into how familiar or foreign these words appeared from the model’s perspective. Words with low perplexity typically contained common English character sequences, while high-perplexity words often included accented letters or unusual patterns not encountered in training.
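A rough sketch of this word-level ranking, reusing the perplexity function above, might look as follows; the `min_count` threshold stands in for the "sufficient frequency" criterion, whose exact value in the project is not specified here.

```python
from collections import Counter

def rank_word_types(model, tokens, min_len=5, min_count=2):
    """Rank word types (length >= min_len, frequency >= min_count in the test set)
    by perplexity under the model, from most to least familiar."""
    counts = Counter(tokens)
    eligible = [w for w, c in counts.items() if len(w) >= min_len and c >= min_count]
    ranked = sorted(eligible, key=lambda w: perplexity(model, [w]))
    return ranked[:10], ranked[-10:]  # lowest-perplexity vs. highest-perplexity words
```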

Key Findings

Several patterns emerged from the analysis. Models trained on full sentences consistently outperformed those trained on word types. The sentence-based models benefited from richer co-occurrence patterns and frequency information, allowing them to make more confident predictions. Tetragram models also showed a clear advantage over bigram models, especially on English test data, where the longer context window enabled more precise character prediction and reduced uncertainty.

Interestingly, the training data format—sentence vs. word-type—had a greater influence on cross-linguistic performance than the size of the n-gram context. Sentence-based models trained on English generalized more effectively to Dutch and Italian than word-type-based models, likely due to structural patterns in sentence data that transfer more easily between related languages.

The word-level analysis confirmed these trends. Low-perplexity words across all languages tended to feature familiar English morphemes such as “re-” or “-tion,” while high-perplexity words included orthographic features foreign to English, such as rare accented characters or unexpected consonant combinations. These findings revealed the model’s implicit sensitivity to typological similarities and limitations in handling morphophonological diversity.
