Tutorials

From the Kremlin to Pandas

This session shows you how a historian uses data science to work with large Russian-language source collections. It focuses on the real challenges of highly inflected Slavic languages (especially Russian), demonstrating how you can turn messy historical materials into clean datasets ready for analysis.

The following questions are addressed:

- Why Russian and other Slavic languages are tricky for quantitative text analysis (many word forms, spelling and style variation).
- How to make key preparation choices (normalisation, stemming vs lemmatisation, detecting duplicates and near-duplicates), and why these decisions change your results.
- How to treat official sources critically (provenance, context, and what “the data” really represents).
- How to go from PDF to dataset: extracting text from born-digital and scanned PDFs, cleaning Cyrillic text, adding metadata, and exporting to CSV / DataFrame.
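To make a few of these preparation choices concrete, here is a minimal sketch in Python using only the standard library. It is not the session's own code: the function names (`normalise`, `shingles`, `jaccard`, `flag_near_duplicates`, `to_csv`), the sample documents, the decision to fold "ё" into "е", and the 0.5 similarity threshold are all assumptions made for illustration. It normalises Cyrillic text, flags near-duplicates by comparing overlapping word trigrams (Jaccard similarity), and writes the result as CSV.

```python
import csv
import io
import re
import unicodedata

# Invented sample documents, for illustration only.
DOCS = [
    {"id": "doc1", "text": "Президент выступил на параде Победы в Москве."},
    {"id": "doc2", "text": "Президент  выступил на параде Победы в Москве 9 мая!"},
    {"id": "doc3", "text": "Ещё один документ о памяти и истории."},
]


def normalise(text: str) -> str:
    """Apply NFC, lowercase, fold 'ё' to 'е', and strip punctuation."""
    text = unicodedata.normalize("NFC", text).lower()
    text = text.replace("ё", "е")  # assumption: treat ё/е as one letter
    # Anything that is not a digit, Latin letter, or Cyrillic letter becomes a space.
    text = re.sub(r"[^0-9a-zа-я]+", " ", text)
    return text.strip()


def shingles(text: str, n: int = 3) -> set:
    """Return the set of word n-grams of an already normalised text."""
    words = text.split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Share of shingles that two documents have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def flag_near_duplicates(docs, threshold: float = 0.5):
    """Keep the first copy of each text; mark later look-alikes."""
    kept = []  # (id, shingle set) of documents not flagged as duplicates
    rows = []
    for doc in docs:
        clean = normalise(doc["text"])
        doc_shingles = shingles(clean)
        duplicate_of = ""
        for kept_id, kept_shingles in kept:
            if jaccard(doc_shingles, kept_shingles) >= threshold:
                duplicate_of = kept_id
                break
        if not duplicate_of:
            kept.append((doc["id"], doc_shingles))
        rows.append(
            {"id": doc["id"], "clean_text": clean, "duplicate_of": duplicate_of}
        )
    return rows


def to_csv(rows) -> str:
    """Serialise the rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["id", "clean_text", "duplicate_of"]
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_csv(flag_near_duplicates(DOCS)))
```

Each choice here shapes the result: a different threshold or shingle size changes which documents count as duplicates, and folding "ё" into "е" merges word forms that some analyses may want to keep apart. Lemmatisation and PDF text extraction are left out because they depend on third-party tools. The CSV text produced by `to_csv` can be saved to a file and loaded into a pandas DataFrame with `pandas.read_csv`.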

The session demonstrates that data science in the humanities is not only about coding. It is just as much about making methodological decisions about sources.

Dr. Bartłomiej Gajos is a historian of Russia who combines archival research with computer-based analysis of large datasets (data science). He specialises in the politics of memory and in how the Kremlin uses history for political purposes. He is a Visiting Fellow at Harvard University on a scholarship from the Kościuszko Foundation. He earned his PhD at the Tadeusz Manteuffel Institute of History, Polish Academy of Sciences, with a dissertation on the Bolsheviks’ politics of memory (1917–1920). His academic achievements have been recognised with Poland’s Prime Minister’s Award, a START scholarship from the Foundation for Polish Science, and a Diamond Grant from the Ministry of Science and Higher Education. He currently works at the Juliusz Mieroszewski Centre in Warsaw. Together with Ernest Wyciszkiewicz, he co-hosts Polihistor 2.0, a YouTube programme devoted to Russia that reaches around 250,000 views per month.