<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Ancient Greek | Michele Ciletti - Personal Website</title><link>https://mikcil.github.io/tags/ancient-greek/</link><atom:link href="https://mikcil.github.io/tags/ancient-greek/index.xml" rel="self" type="application/rss+xml"/><description>Ancient Greek</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Mon, 13 Apr 2026 00:00:00 +0000</lastBuildDate><image><url>https://mikcil.github.io/media/icon_hu_ddb90028b3b04059.png</url><title>Ancient Greek</title><link>https://mikcil.github.io/tags/ancient-greek/</link></image><item><title>On why we still can't (really) understand Latin and Ancient Greek word meanings</title><link>https://mikcil.github.io/post/classical-semantics/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://mikcil.github.io/post/classical-semantics/</guid><description>&lt;p>When thinking about Latin and Ancient Greek, two of most spoken, studied, and influential historical languages ever, it would be an easy mistake to believe that their meaning has been thoroughly understood. In reality, this couldn&amp;rsquo;t be further from the truth: as with many &amp;ldquo;dead&amp;rdquo; languages, we are missing extensive, quantitative studies on word senses, and it is due to a structural and multi-faceted problem.&lt;/p>
&lt;p>Semantics is the field of linguistics that studies the meanings of words (cit.). As such, it focuses on the actual concepts we encode and reproduce with our languages, and has multidisciplinary ties ranging from the cognitive sciences to Artificial Intelligence. Why would we care about the word senses of languages, especially dead ones?&lt;/p>
&lt;p>Citing a renowned example (Farina et al., 2026), the Latin word &lt;em>civitas&lt;/em> can mean both &amp;lsquo;city&amp;rsquo;, as a geographical place, and &amp;lsquo;citizenship&amp;rsquo;, as a political community. The first meaning was derived over time from the second one, following a process linguists call &lt;em>semasiological&lt;/em> - a single word acquiring new senses. This certainly didn&amp;rsquo;t happen randomly: such a shift, from a body of citizens to the space they inhabit, is indicative of how the Romans gradually collapsed the linguistic distinction between their political identity and their physical settlements. The problem is that, at the moment, we have no way to trace these shifts systematically, because no large-scale resource exists that records what Latin or Ancient Greek words actually mean in their context of use.&lt;/p>
&lt;p>This short piece is about that gap, and about why it constitutes a structural bottleneck for an entire field of research.&lt;/p>
&lt;h2 id="an-overview-on-latin-and-ancient-greek-annotated-resources">An overview on Latin and Ancient Greek annotated resources&lt;/h2>
&lt;p>Over the past decades, the computational study of Latin and Ancient Greek has made remarkable progress: we possess incredibly accurate morphological analyzers and large &lt;em>syntactic treebanks&lt;/em>, databases that organize sentence structures into formal dependency trees. Even from an infrastructural point of view, the LiLa Knowledge Base, for Latin, connects corpora, dictionaries, and lexical resources through over 80 million data points in a Linked Open Data architecture (Passarotti et al., 2020). Thus, for morphology and syntax, Classical languages are increasingly well served.&lt;/p>
&lt;p>Semantics, unfortunately, is a lot more complicated to work with. In this subfield, one of the most important tasks for creating annotations is Word Sense Disambiguation (WSD; Navigli et al., 2009), which consists of deciding which meaning a word carries in a specific sentence. In English, for instance, the word &amp;ldquo;bank&amp;rdquo; means one thing in &amp;ldquo;she sat on the river bank&amp;rdquo; and another in &amp;ldquo;she deposited money at the bank.&amp;rdquo; For native human speakers this is usually easy to resolve, but not always: words can have very large inventories of literal and metaphorical senses. Think of the English &amp;lsquo;see&amp;rsquo;: apart from the act of perceiving with the eyes (&lt;em>I saw a bird on a tree&lt;/em>), it can also indicate mental perception (&lt;em>I can&amp;rsquo;t see your point&lt;/em>), examination (&lt;em>I need to see your passport&lt;/em>), or even dating someone (&lt;em>I&amp;rsquo;m seeing a colleague&lt;/em>). For a computer to do the same, it needs two things: a sense inventory (a structured list of possible meanings) and annotated data (examples where each word has been manually tagged with its correct meaning by a human expert). For high-resource languages, such as English, both exist in abundance. WordNet, for instance, is an established lexical database that catalogues over 150,000 lemmas organized into &lt;em>synsets&lt;/em>, groups of words sharing the same meaning (Fellbaum, 1998); we also have SemCor, an annotated collection of over 200,000 tokens tagged with their correct WordNet sense, which has served as a training resource and benchmark for WSD research since the 1990s (Miller et al., 1993). Individual WordNets exist for many languages, but they are often imperfect resources maintained by individual research teams across the world.&lt;/p>
&lt;p>The Latin WordNet, originally created by automatic translation from English and Italian, contained numerous spurious entries across its first iterations: the word &lt;em>ager&lt;/em> (&amp;lsquo;field&amp;rsquo;) was at one point assigned the sense of a database field, and &lt;em>capitolium&lt;/em> (&amp;lsquo;the Capitol&amp;rsquo;) was linked to a sense referring to the U.S. federal government (Franzini et al., 2019). The resource has been substantially improved under the LiLa project, but gaps in coverage remain, and its senses are mapped to English rather than defined by the semantics of Latin itself. While this approach makes the resource more interoperable, enabling useful cross-linguistic enquiries, it rests on the fallacious assumption that Latin concepts map neatly onto English ones. In this sense, BabelNet and ConceptNet both represent an alternative approach: they are natively multilingual databases that link multiple lexical resources, such as WordNet itself and Wiktionary, building shared sense inventories across languages. Their coverage of Latin, however, is far more limited than that of the Latin WordNet. For Ancient Greek, a WordNet is under construction but far from usable (Marchesi et al., 2025), while BabelNet and ConceptNet do not cover it as a language: the situation for Ancient Greek is thus clearly more critical than that of Latin.&lt;/p>
&lt;p>As for annotated data, the closest thing Latin has to SemCor is the dataset from SemEval-2020 Task 1, which provides semantic relatedness judgments for just 40 lemmas (only four of them verbs) across a limited number of sentences (Schlechtweg et al., 2020). For Ancient Greek, nothing comparable exists at all. Other resources are limited in their linguistic domain: an example is the PREMOVE dataset, which stores roughly 2,800 sense-annotated tokens across both languages, but exclusively for preverbed motion verbs (Farina, 2026).&lt;/p>
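&lt;p>The disambiguation task described above can be sketched in a few lines. The following is a minimal, Lesk-style toy: it picks the sense whose dictionary gloss shares the most words with the surrounding sentence. The two-sense inventory for &amp;ldquo;bank&amp;rdquo; is invented for illustration and is not a real WordNet fragment.&lt;/p>

```python
# Minimal, Lesk-style word sense disambiguation: choose the sense whose
# gloss shares the most words with the sentence the target word appears in.
# The two senses of "bank" below are a toy inventory, not a real resource.

def disambiguate(context: str, inventory: dict) -> str:
    context_words = set(context.lower().split())

    def overlap(sense_id: str) -> int:
        # Count words shared between the sentence and this sense's gloss.
        gloss_words = set(inventory[sense_id].lower().split())
        return len(context_words.intersection(gloss_words))

    return max(inventory, key=overlap)

BANK_SENSES = {
    "bank.n.01": "sloping land beside a body of water such as a river",
    "bank.n.02": "a financial institution that accepts deposits of money",
}

print(disambiguate("she sat on the river bank near the water", BANK_SENSES))  # bank.n.01
print(disambiguate("she deposited money at the bank downtown", BANK_SENSES))  # bank.n.02
```

&lt;p>Real systems replace the gloss overlap with contextual embeddings or fine-tuned language models, but the two ingredients stay the same: a sense inventory and disambiguated examples.&lt;/p>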
&lt;h2 id="why-should-we-care-about-word-meanings">Why should we care about word meanings?&lt;/h2>
&lt;p>The problem goes beyond linguistics and touches history and literature.&lt;/p>
&lt;p>Latin and Ancient Greek are unique languages in that they are attested over extraordinarily long timespans (cit). Greek literature began in the 8th century BCE with the Homeric poems, and Latin literary production started in the 3rd century BCE. Both languages continued to produce texts well into the Middle Ages and beyond, with Latin serving as the main language of the European scientific community up to the late 18th century, while remaining structurally distinct throughout. Even today, papal encyclicals keep the language alive by introducing new terms for the contemporary age (Iurescia et al., 2025). The texts of both languages preserve, in the evolving distribution of word meanings, evidence of how foundational concepts were formed and reshaped across more than a millennium of intellectual history. Justice, nature, community, and the sacred are all concepts encoded dynamically and uniquely: the Latin &lt;em>virtus&lt;/em>, for instance, originally denoted valor and courage, derived from &lt;em>vir&lt;/em>, &amp;lsquo;man&amp;rsquo;. Through centuries of philosophical and then Christian reinterpretation, it absorbed ethical connotations that eventually influenced the formation of the English word &amp;lsquo;virtue&amp;rsquo;. Something similar happened to the word &lt;em>fides&lt;/em>, shifting from &amp;lsquo;trust&amp;rsquo; to &amp;lsquo;faith&amp;rsquo; after the advent of Christianity. Again, in Ancient Greek the verb &lt;em>horao&lt;/em> means &amp;lsquo;to see&amp;rsquo;, but &lt;em>oida&lt;/em>, a perfect built on the root of its aorist &lt;em>eidon&lt;/em> and literally meaning &amp;lsquo;I have seen&amp;rsquo;, ended up meaning &amp;lsquo;to know&amp;rsquo;, through the inference &amp;lsquo;I have seen, therefore I know&amp;rsquo;. These gradual transformations reveal the mental processes of ancient civilizations and offer insight into the origins of many modern languages; they are recoverable only if we can track sense frequencies across large bodies of text.&lt;/p>
&lt;h3 id="and-why-does-this-need-to-be-done-systematically-and-in-large-numbers">And why does this need to be done systematically and in large numbers?&lt;/h3>
&lt;p>Historical linguistics has been undergoing what Jenset and McGillivray (2017) described as a &amp;ldquo;quantitative turn&amp;rdquo;, trying to move from subjective claims about frequency (&amp;ldquo;this form becomes more common over time&amp;rdquo;) toward statistically grounded analysis of large corpora. This turn has already largely transformed morphology and syntax. But for semantics, where language connects most directly to thought and culture, progress is often blocked by the absence of the already-discussed foundational data layer. Without sense-annotated corpora, we cannot train WSD models, which could in turn be used to annotate large corpora at scale. Without these, quantitative research on meaning remains confined to whatever small, domain-specific datasets individual projects manage to build.&lt;/p>
&lt;h2 id="the-structural-roots-of-the-problem">The structural roots of the problem&lt;/h2>
&lt;p>At this point, one might think the bottleneck is simply that no one has gotten around to annotating data; unfortunately, several structural issues complicate the task.&lt;/p>
&lt;p>Firstly, there is no shared sense inventory. Even for high-resource languages, word meanings are often a source of disagreement between annotators. WordNet, for instance, has often been found to have too many competing meanings that do not actually reflect the experience of human speakers, a problem identified as &lt;em>sense granularity&lt;/em> (Navigli et al., 2009). This issue can only be exacerbated for historical languages with no living native speakers: different dictionaries group senses differently, so the Oxford Latin Dictionary, for instance, may treat as a single sense what the Thesaurus Linguae Latinae splits into three. Every project that annotates meaning currently has to make its own choices about which senses exist, producing resources that cannot easily be combined or compared.&lt;/p>
&lt;p>As a natural consequence, annotation frameworks are fragmented. For instance, SemEval-2020 used graded relatedness judgments between pairs of word usages, while PREMOVE assigned English WordNet synsets. Recent work on geographical nouns in Latin and Greek (Farina et al., 2026) used an LLM to select among candidate synsets. Each approach made sense in its own context, but the results are not straightforwardly compatible. Relevant data ends up scattered across projects annotated on different principles, each of which has, in some important sense, started from scratch.&lt;/p>
&lt;p>Furthermore, the human expertise required for the work is rare. Annotating meaning in historical languages requires deep competence in the ancient language, an understanding of diachronic and genre-based variation, and often familiarity with computational annotation protocols. Classicists are not typically trained in Natural Language Processing; at the same time, computational linguists are rarely knowledgeable about Latin or Greek at the required level. The few who can do both are stretched thin across their own research commitments, and the work itself, slow and often painstaking, does not attract enough new people.&lt;/p>
&lt;p>These problems end up reinforcing each other, as without shared standards, individual efforts remain isolated; without datasets, models cannot learn to help; and without accurate models, annotation stays slow. The result is a field where the quantitative study of meaning, the layer of language that matters most for intellectual and cultural history, is unable to progress.&lt;/p>
&lt;h2 id="why-now">Why now?&lt;/h2>
&lt;p>In the last few years, however, things have started to move in the right direction. For Latin, the completion of the LiLa infrastructure (Passarotti et al., 2020) has provided a very solid backbone that semantic resources may benefit from: any new annotations built on standard formats can be immediately integrated into an existing ecosystem of corpora, dictionaries, and lexical databases, rather than standing alone. Large Language Models have also reached a point where they can meaningfully assist human annotators. In a recent study I co-authored, we evaluated 13 LLMs on the task of interpreting preverb semantics in Latin and Ancient Greek. The best models, when fine-tuned on human-made data, achieved F1 scores as high as 0.84 under few-shot prompting and appropriate parameters (Farina &amp;amp; Ciletti, 2026). Another study obtained similar results on geographical noun senses (Farina et al., 2026). These numbers are definitely not high enough for autonomous annotation, and it has been noticed that models still struggle with figurative uses and rare senses across languages (Navigli, 2026). But they are high enough for a different workflow: pre-annotation by a model, correction by a human expert, iteration.&lt;/p>
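&lt;p>That workflow can be sketched as a simple loop. In the toy below, &lt;em>model_predict&lt;/em> and &lt;em>expert_review&lt;/em> are hypothetical stand-ins for a fine-tuned model and a human annotator; the Latin examples and sense labels are invented for illustration.&lt;/p>

```python
# Sketch of the pre-annotate / correct / iterate workflow: the model proposes
# a sense for each token, a human expert confirms or corrects it, and the
# corrected gold data feeds the next fine-tuning round.

def model_predict(token: str, sentence: str) -> str:
    """Toy stand-in for a fine-tuned WSD model: always guesses the literal sense."""
    return f"{token}.literal"

def expert_review(token, sentence, proposal, corrections):
    """Toy stand-in for a human expert: accept the proposal unless corrected."""
    return corrections.get((token, sentence), proposal)

def annotation_round(batch, corrections):
    gold = []
    for token, sentence in batch:
        proposal = model_predict(token, sentence)      # cheap machine pre-annotation
        label = expert_review(token, sentence, proposal, corrections)  # human check
        gold.append((token, sentence, label))
    return gold  # feeds the next fine-tuning round, improving model_predict

batch = [("video", "video avem in arbore"), ("video", "non video quid velis")]
# The expert fixes the figurative use, which the toy model got wrong.
corrections = {("video", "non video quid velis"): "video.mental"}
print(annotation_round(batch, corrections))
```

&lt;p>The point of the loop is economic: the expert spends time only on corrections, and each round of corrected data should make the next round of pre-annotations cheaper to review.&lt;/p>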
&lt;h2 id="a-hypothesis-worth-testing">A hypothesis worth testing&lt;/h2>
&lt;p>My hypothesis is that this bottleneck can be addressed by building a shared, open semantic annotation framework for Latin and Ancient Greek, designed from the start for interoperability and scalability through human and Machine Learning components, and maintained through community efforts. Rather than starting from scratch, which would end up creating one more standard for researchers to choose from, the best approach would be to start from WordNet. Synsets, while imperfect, have been widely accepted by linguists as a &lt;em>de facto&lt;/em> standard for annotation, but, most of all, they are interoperable: different databases, such as the aforementioned BabelNet and ConceptNet, integrate WordNet data in their structures. At the same time, it would be unwise to simply accept WordNet&amp;rsquo;s weaknesses uncritically in the name of convenience, but I believe the two main ones - sense granularity and overreliance on English - could be relatively easily tackled.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Sense granularity could be addressed by not limiting the annotation to a single gold synset, but applying a tiered approach that accepts multiple, related synsets at different levels. WordNet lends itself well to such an approach by being structured as a massive knowledge graph: a &lt;em>hypernym&lt;/em> of a gold synset, that is, a broader concept that encompasses it, could be accepted as a slightly less specific but still correct annotation. Tiering the candidates remains important, however, since subtle semantic differences are still fundamental to fully grasping the senses of a word: the verb &lt;em>to behold&lt;/em> may be assigned the sense of &lt;em>look at with admiration&lt;/em> (synset 02169125-v) or &lt;em>see with attention&lt;/em> (synset 02134625-v), but it is evident that a preference may emerge based on the subject being observed (a statue? A god?).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Overreliance on English, I argue, is a less pervasive problem. The Latin WordNet project has demonstrated that a Latin-English concept mapping is feasible and works relatively well. Gaps, however, remain: the word &lt;em>consul&lt;/em>, which indicates the highest public office of Republican Rome, is simply assigned the meaning of &lt;em>a worker who holds or is invested with an office&lt;/em>. Thus, precisely the most culturally relevant and unique concepts are approximated by forcing them through an English-based inventory. The fix would be straightforward: a set of specific guidelines would identify cases where a concept is not represented satisfactorily by any existing synset, and a new one would be created.&lt;/p>
&lt;/li>
&lt;/ul>
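&lt;p>The tiered idea in the first point can be made concrete with a small scoring function, assuming we can walk WordNet&amp;rsquo;s hypernym graph. The tiny graph and synset names below are invented placeholders, not the real identifiers cited above.&lt;/p>

```python
# Tiered scoring sketch: an annotation matching the gold synset gets full
# credit, a hypernym (broader concept) of the gold synset gets partial
# credit, and anything else gets none. Graph and IDs are invented.

HYPERNYM = {  # child synset -> parent (broader) synset
    "look_admiringly.v": "look.v",
    "see_attentively.v": "look.v",
    "look.v": "perceive.v",
}

def is_hypernym(candidate: str, gold: str) -> bool:
    """Walk upward from the gold synset and check whether we reach `candidate`."""
    node = HYPERNYM.get(gold)
    while node is not None:
        if node == candidate:
            return True
        node = HYPERNYM.get(node)
    return False

def tiered_score(predicted: str, gold: str) -> float:
    if predicted == gold:
        return 1.0   # exact sense match
    if is_hypernym(predicted, gold):
        return 0.5   # broader, less specific, but still correct
    return 0.0       # unrelated or sibling sense: no credit

print(tiered_score("look_admiringly.v", "look_admiringly.v"))  # 1.0
print(tiered_score("look.v", "look_admiringly.v"))             # 0.5
print(tiered_score("see_attentively.v", "look_admiringly.v"))  # 0.0
```

&lt;p>Note that a sibling sense (&lt;em>see with attention&lt;/em> against a gold &lt;em>look at with admiration&lt;/em>) scores zero here: only ancestors in the graph count as acceptable generalizations, which is what preserves the subtle distinctions discussed above.&lt;/p>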
&lt;p>Concretely, the first step would be to construct a curated sense inventory and gather in-context tokens for a core set of high-frequency polysemous lemmas, starting from WordNet and reconciling entries across existing dictionaries and other lexicons. Together with those, a set of sentences that is randomized, yet balanced across time periods and genres, would also be annotated in full, capturing a broader set of senses and lemmas - a sort of &amp;ldquo;control group&amp;rdquo; that also mirrors the structure of established resources such as SemCor. The second step would be a pilot annotation of these lemmas across a diachronically and genre-balanced corpus, following shared guidelines, producing a gold-standard dataset. Fortunately, semantic annotation guidelines for Classical languages already exist (Farina, 2024); adapting them would be both feasible and crucial: since such an effort would require a community behind it, a clear framework would ensure consistency between annotators. It has also been empirically shown that Large Language Models are more accurate at Word Sense Disambiguation when induced to reason through defined linguistic concepts - just as a human would (cit). The third step would be to test systematically whether fine-tuned models, trained on this gold standard, can bootstrap larger &amp;ldquo;silver-standard&amp;rdquo; annotation of acceptable quality. The fourth and final step would involve testing whether both the gold and silver data can reveal attested and verified semantic patterns quantitatively: as exemplified in the introduction, existing scholarship has traced a shift in the meaning of &lt;em>civitas&lt;/em> across time - does our annotation process reflect and support this phenomenon?&lt;/p>
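&lt;p>That fourth step lends itself to a simple quantitative check. The sketch below, on invented counts, computes the per-century share of the geographical sense of &lt;em>civitas&lt;/em>; on real gold or silver data, a growing share would support the attested semasiological shift discussed in the introduction.&lt;/p>

```python
# Quantitative sanity check for an attested semantic shift: compute, per
# century, the share of `civitas` tokens annotated with the geographical
# sense ('city'). The counts below are invented for illustration; a real
# run would use the gold or silver annotations produced in earlier steps.
from collections import Counter

tokens = [  # (century, annotated sense) pairs; negative centuries are BCE
    (-1, "citizenship"), (-1, "citizenship"), (-1, "city"),
    (2, "citizenship"), (2, "city"), (2, "city"),
    (5, "city"), (5, "city"), (5, "city"), (5, "citizenship"),
]

def city_share_by_century(tokens):
    by_century = {}
    for century, sense in tokens:
        by_century.setdefault(century, Counter())[sense] += 1
    # Relative frequency of the 'city' sense, in chronological order.
    return {c: counts["city"] / sum(counts.values())
            for c, counts in sorted(by_century.items())}

for century, share in city_share_by_century(tokens).items():
    print(century, round(share, 2))  # a rising share supports the attested shift
```

&lt;p>The same routine, run over any lemma with a hypothesized shift, turns a qualitative claim from the literature into a testable curve.&lt;/p>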
&lt;p>The experiment is informative regardless of its outcome, and it is, crucially, capable of failing: if fine-tuned models achieve acceptable accuracy on the expanded annotation, a relatively clear path to scaling would appear. If they fail, particularly on the rare, figurative senses that matter most for studying semantic change, that result is equally valuable: it would tell us exactly where human expertise remains irreplaceable and where future investment should be directed. Similarly, if the annotation procedure works efficiently and produces accurate data that confirms existing theories, the workflow would be ready for scaling. If it contradicts them or is inconclusive, it could prompt both a reflection on the theories themselves and a refinement of the framework. Either way, the shared sense inventory and annotation guidelines would remain as lasting resources for the community.&lt;/p>
&lt;h2 id="why-this-needs-a-dedicated-effort">Why this needs a dedicated effort&lt;/h2>
&lt;p>Securing funding for the slow, iterative construction of shared infrastructure would be complicated for languages without commercial applications. What is needed is something closer to a dedicated, community-driven project that can create connections between classicists and computational linguists and maintain shared, open resources.&lt;/p>
&lt;p>Hopefully, if such a community gets started, it could attract new interest and resources to the field. Several scholars already work independently on this topic (the success of a newly-founded conference on &lt;em>Semantic Annotation in the Ancient World&lt;/em> supports this claim; cit) - giving them an organized space to work together could lead to the creation of new training resources and spin-off case studies that would enrich the suite of standardised assets.&lt;/p>
&lt;p>What I propose, summing up, is a structural effort that goes beyond the creation of any single dataset or model, focusing instead on a community-led work on new frameworks to finally, fully support the systematic semantic annotation of Classical texts. It would be a grave error to ignore the words of two languages that shaped law, philosophy, science, and religion for two thousand years. I believe that how their meanings evolved is one of the best records we have of how human thought changes, finally visible and traceable in the distribution of senses across centuries of text. It is time, then, to work on the missing shared effort on which such valuable research must stand.&lt;/p></description></item></channel></rss>