GDELT Translingual: Translating the Planet

With the debut of GDELT 2.0 we are incredibly excited and proud to announce the public debut of a system we believe will fundamentally reshape how we understand the world around us: GDELT Translingual. Put simply, GDELT Translingual represents what we believe is the largest realtime streaming news machine translation deployment in the world: all global news that GDELT monitors in 65 languages, representing 98.4% of its daily non-English monitoring volume, is translated in realtime into English for processing through the entire GDELT Event and GKG/GCAM pipelines.

With the advent of GDELT Translingual, GDELT is able to take the next steps in its mission to provide a platform for computing on the entire world. Nearly all current global news monitoring platforms focus primarily or exclusively on Western English-language sources, missing the vast majority of world events, while the few platforms that incorporate any form of translation do so sparingly, either through the use of a small team of human translators, through on-demand translation of a small number of articles as they are read by human analysts, or through simplistic keyword tagging of a small number of subjects. In contrast, GDELT Translingual is designed to allow GDELT to monitor the entire planet at full volume, fully translating every news report it monitors in 65 languages in realtime as it sees that article and processing it just as it would a native English report, creating the very first glimpses of a world without language barriers.

We are not aware of any other monitoring system in the world that comes even close in its attempt to translate the entire planet in realtime, and creating a platform this massive has required enormous fundamental development over the past few months. In particular, GDELT Translingual has two primary requirements that could not be met by current translation systems: the ability to dynamically modulate the quality of translations to use less computational time during periods of high intake volume to ensure that it is still able to translate all material within the 15 minute window, and the ability to see not just a single “best” translation, but to access the entire universe of all possible translations of each sentence to find the translation that will yield the highest quality Event and GKG matches (more on both in a moment).

What you see in the output of GDELT Translingual 1.0 is the result of a truly massive infrastructure with an incredible number of moving parts. As with all machine translation systems, you are bound to find numerous errors in its results, especially in languages with many layers of overlapping decisions, such as ambiguous word segmentation scenarios in languages like Chinese. If you have suggestions for additional language resources or toolkits, or are willing to share Moses translation models or other translation resources, we’d love to hear from you! Remember that what you see here represents just the very first version 1.0 incarnation of GDELT Translingual and we’ve already got a list a mile long of additional ideas and features to constantly improve its accuracy and capabilities from here – so stay tuned, we’re just getting started!

We’ve only just begun to explore the potential of GDELT Translingual and the kinds of previously unimaginable applications that become possible when you can “live translate the planet.”

WHY CREATE A TRANSLATION SYSTEM?

Why create a full-fledged translation infrastructure capable of translating the planet in realtime instead of simply translating GDELT’s Event and GKG systems into the other languages of interest? The most obvious reason is GDELT’s scale: there are more than 120,000 entries in the Global Knowledge Graph, the majority of which are multiword phrases, while GCAM has more than 1.2 million entries, including a large number of lengthy phrases. Translating this much material into 65 languages would simply be intractable. Further, due to the inherent ambiguities of language, a large fraction of these words and phrases could not simply be directly translated – they would require substantial contextual information to permit disambiguation, vastly increasing the amount of material to be translated and the linguistic resources required. In addition, GDELT is constantly growing, with new themes, emotions, and features added every few weeks, which would make it nearly impossible to constantly translate every single new addition into 65 languages on an ongoing basis. Instead, GDELT attempts to blend the best of both worlds, combining support for native script entries in both the Global Knowledge Graph and GCAM (making it possible to filter for a very specific phrasing in a given language) and translation to allow a new feature to instantly function across all 65 languages.

Of course, most of you are probably wondering – why build your own translation system when there are so many systems out there already? The short answer is that, as noted above, GDELT has two unusual needs that few other translation systems share. Building an infrastructure that could live-translate the entirety of what it monitors in 65 languages in realtime, cope with the enormous fluctuations in volume over the course of a given day, and directly access the underlying universe of possible translations of each sentence to provide feedback proved difficult to achieve with current systems. Very few systems supported the raw internal access and ability to perform time-constrained translations that GDELT needed. More critically, what you see here represents only the very beginning of GDELT Translingual – over the coming months we plan to massively expand its capabilities, using it as a base platform to break down the barrier of language when monitoring the planet, and owning the full pipeline gives us the unbounded flexibility to reshape how translation is used at scale to understand the world in realtime.

  • Time Constrained Translation. Most machine translation work today is focused on achieving maximal accuracy without regard for the time the translation takes, since the focus is on providing output for human consumption. In contrast, GDELT’s translation system must be able to provide at least basic translation of 100% of monitored material every 15 minutes, coping with sudden massive surges in volume without ever requiring more time than the 15 minute window. This “streaming” translation is very similar to streaming compression, in which the system must dynamically modulate the quality of its output to meet time constraints: during periods with relatively little content, maximal translation accuracy can be achieved, with accuracy linearly degraded as needed to cope with increases in volume in order to ensure that translation always finishes within the 15 minute window. In this way GDELT operates more similarly to an interpreter than a translator. This has not been a focal point of current machine translation research and required a highly iterative processing pipeline that breaks the translation process into quality stages and prioritizes the highest quality material, accepting that lower-priority material may receive a lower-quality translation in order to stay within the available time window (a minimal sketch of such a time-budgeted loop appears after this list).
  • User Adaptation. High quality human translation often involves an iterative process in which knowledge of the end user or application is used to carefully tailor the translation, adjusting it between higher linguistic versus higher content fidelity or to select specific language more amenable to a particular user or task. Machine translation systems, on the other hand, do not ordinarily have knowledge of the user or use case their translation is intended for and thus can only produce a single “best” translation that is a reasonable approximation of the source material for general use. In the case of GDELT, the results of machine translation are fed into further machine processing algorithms, offering the ability to create a unique feedback loop in which the translation system has highly detailed knowledge of the needs of its user in order to precisely tailor its translation. Using the equivalent of a dynamic language model, GDELT essentially iterates over all possible translations of a given sentence, weighting them both by traditional linguistic fidelity scores and by a secondary set of scores that evaluate how well each possible translation aligns with the specific language needed by GDELT’s Event and GKG systems. This is very different from domain adaptation in which a translation or language model is altered for a new topic space or genre of material: here the source material stays the same and the translation output is adjusted based on knowledge of the end user’s needs for the material.
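To make the streaming constraint concrete, here is a minimal sketch of how a deadline-driven, quality-modulated translation loop could be structured: every document is guaranteed a fast low-quality translation, and whatever time remains is spent upgrading documents in priority order. This is purely illustrative; the translation tier functions and the priority scorer are hypothetical stand-ins, not GDELT's actual engines.

```python
# Purely illustrative sketch of deadline-driven quality modulation.
# translate_low/translate_mid/translate_high and score_priority are
# hypothetical stand-ins for GDELT's actual translation tiers and ranking.
import time

UPDATE_WINDOW_SECS = 15 * 60  # GDELT's 15 minute update interval

def translate_batch(documents, translate_low, translate_mid, translate_high,
                    score_priority, window=UPDATE_WINDOW_SECS):
    """Guarantee a basic translation of everything, then spend whatever time
    remains upgrading the highest-priority documents."""
    deadline = time.monotonic() + window
    results = {}

    # Pass 1: every document receives at least a fast, low-quality translation.
    for doc_id, text in documents.items():
        results[doc_id] = translate_low(text)

    # Passes 2 and 3: upgrade documents in priority order until the window closes.
    ranked = sorted(documents, key=lambda d: score_priority(documents[d]), reverse=True)
    for upgrade in (translate_mid, translate_high):
        for doc_id in ranked:
            if time.monotonic() >= deadline:
                return results  # never exceed the 15 minute window
            results[doc_id] = upgrade(documents[doc_id])
    return results
```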

 

SUPPORTED LANGUAGES

The following 65 languages are supported by GDELT Translingual (both in native script form and many common transliterations):

  • Afrikaans
  • Albanian
  • Arabic (MSA and many common dialects)
  • Armenian
  • Azerbaijani
  • Bengali
  • Bosnian
  • Bulgarian
  • Catalan
  • Chinese (Simplified)
  • Chinese (Traditional)
  • Croatian
  • Czech
  • Danish
  • Dutch
  • Estonian
  • Finnish
  • French
  • Galician
  • Georgian
  • German
  • Greek
  • Gujarati
  • Hebrew
  • Hindi
  • Hungarian
  • Icelandic
  • Indonesian
  • Italian
  • Japanese
  • Kannada
  • Kazakh
  • Korean
  • Latvian
  • Lithuanian
  • Macedonian
  • Malay
  • Malayalam
  • Marathi
  • Mongolian
  • Nepali
  • Norwegian (Bokmal)
  • Norwegian (Nynorsk)
  • Persian
  • Polish
  • Portuguese (Brazilian)
  • Portuguese (European)
  • Punjabi
  • Romanian
  • Russian
  • Serbian
  • Sinhalese
  • Slovak
  • Slovenian
  • Somali
  • Spanish
  • Swahili
  • Swedish
  • Tamil
  • Telugu
  • Thai
  • Turkish
  • Ukrainian
  • Urdu
  • Vietnamese

 

THE TRANSLATION PIPELINE

GDELT’s translation infrastructure proceeds in a tightly coupled pipeline, designed to maximize efficiency in order to provide streaming translation capable of sustained servicing of GDELT’s 15 minute update interval (including the ability to absorb massive unexpected surges of foreign language material during breaking news events):

  • Language Detection. The first stage of the GDELT Translingual pipeline is language detection, built around an enhanced version of the CLD2 engine that is paired with covering language models to significantly increase overall speed and accuracy on boundary cases at the expense of additional memory. Language detection is actually performed in two passes – the first pass only detects English-language material and operates at extremely high speed (performed inline during ingestion into the processing system using a covering language model) so that English material can proceed directly to Event/GKG processing, while all other material is queued for transshipping to the translation infrastructure (a minimal two-pass sketch appears after this list).
  • Word Segmentation. Not all languages provide unambiguous word boundaries, requiring sophisticated algorithms, ranging from simple dictionary-based maximal matching to full-fledged CRF models, to segment input text into discrete “words” (a simple maximal-matching sketch appears after this list). Specialized algorithms are therefore used to perform word segmentation for Chinese, Japanese, Vietnamese, and Thai. While not all Korean material features word spacing, all Korean news presently monitored by GDELT does use modern conventions of word spacing and horizontal alignment, so its existing spacing is used as-is. Preprocessing and any necessary conversion of right-to-left languages is also handled here, along with processing of specialized punctuation such as the Katakana Middle Dot (Japanese) and Interpunct/Partition Sign (Chinese) when used to subdivide transliterated foreign words.
  • Morphological Analysis. While sometimes grouped in the literature under the generic heading of “word segmentation,” morphological analysis is actually a distinct process that fundamentally modifies the underlying text to align it more closely with the linguistic structures of English. Here, an additional set of algorithms are applied for selected languages to perform various normalization and morphological analysis tasks ranging from normalizing conjugation and declension in inflected languages to conversion of the conjunctive form of the Arabic Waw. Other language-specific resources are also brought to bear in this stage to normalize and reprocess material to assist in the translation stage.
  • Native GCAM and GKG Processing. GCAM emotional dictionaries are currently available for 15 languages other than English, each requiring access to the original native script material. Since these dictionaries operate on the original document, rather than its English translation, a version of the GCAM engine tailored for non-English processing is run on the document after segmentation and morphological analysis (specific morphological normalization tasks can be enabled/disabled prior to GCAM processing depending on the needs and assumptions of a particular dictionary), but prior to the translation portions of the pipeline beginning. GKG themes that include native script entries are also handled in this stage.
  • Sentence and Clause Segmentation. Language-specific punctuation and segmentation models are used to separate text into component sentences and sub-sentence clauses.
  • Low Pass Translation. Finally, translation commences. The initial low pass translation uses only basic reordering, a basic language model, and a maximal-coverage low-candidacy translation model that offers translations for a large portion of the language (usually higher than 95%, including proper names), but only the most common translations of each word (not a more exhaustive list of rarer usages). This allows the stage to achieve maximal speed and to provide actionable translation of languages for which minimal language resources are available. The output of this stage is strongly coherent but grammatically lower-quality text that captures the overall gist of the article and allows for topical analysis and basic reasoning, as well as the application of bag-of-words engines such as translational tonal analysis.
  • Mid Pass Translation. Run in parallel to Low Pass Translation, Mid Pass Translation applies additional language model resources to increase the quality of the translation, depending on the availability of resources for a given language and the current time pressure on the system. Among other tasks, it applies successively higher-quality language models to correct basic grammatical errors, smooth narrative structure, and better align concept and structural ordering between source and target languages. It also performs selective high-order correction and rewriting for concepts of specific interest to the Event and Global Knowledge Graph pipelines to increase recovery. The number, size and quality of the additional resources applied during this stage may be adaptively adjusted based on time pressure – if there is a sudden large volume of material to process, lower-quality and smaller models may be used, while during a lull period, language models and other selected resources from the High Pass Translation stage may even be applied to all languages. Divergences between the Mid and Low Pass Translations result in replacement of the translated text with the Mid Pass output.
  • Toponymic Translation. A special engine is used alongside the Low and Mid Pass Translation engines to provide special processing of geographic candidates, offering high quality maximal coverage toponymic recovery, including considerable recovery of local language-specific name variants and normalization of inflected languages to perform full geocoding and geographic disambiguation. The output of this stage is also capitalized and passed to the High Pass Translation stage to override its native output using Moses’ “xml-input exclusive” feature. The ability to recognize worldwide mentions of local geographic features across 65 languages likely makes GDELT Translingual the largest multilingual geocoding initiative today.
  • Well Known Entity Translation. Similar to Toponymic Translation, a special engine is used alongside the Low and Mid Pass Translation engines to provide special processing of proper names of people, organizations, events, and other entities. At present this is based primarily on a compilation of entities having their own entries and script-to-English interlingual links in Wikipedia or which were manually added or contained in other selected datasets and will continue to expand over time. This stage is critical because many proper names have literal translations that differ from the actual name. For example, the name of the King of Saudi Arabia as commonly cited in Arabic-language news coverage is composed of a sequence of words that have their own literal meanings, and this engine overrides this and ensures a final normalized translation of “Abdullah bin Abdulaziz”. It also normalizes numerous common variants of a name across languages into a single standardized name form to assist in name disambiguation. As with toponyms, the output of this stage is passed on to the High Pass Translation stage as well.
  • Candidate Translation Ranking. At the conclusion of Low and Mid Pass Translation, each document is scored using a set of language models that determine its relevance to various Event and GKG categories. Each document is assigned two scores, one ranking its likelihood of containing one or more extractable Events (with higher scores based on the level of detail, likelihood of high-quality recovery, and number of candidate extractable Events) and the second ranking its likelihood of containing one or more extractable GKG entries (with higher scores based on the level of detail, likelihood of high-quality recovery, and number of candidate extractable GKG and GCAM matches). These are combined into a final score for each document, with the highest-scored documents having the greatest likelihood of being highly relevant to GDELT Event and GKG processing.
  • High Pass Translation. Finally, high accuracy translation proceeds for a small set of languages: at present GDELT has access to Moses models for Czech, Estonian, French, German, Russian, and Spanish. These are full-fledged SMT systems with extensive translation models sharing a single massive language model and can achieve accuracy approaching human quality on some material. However, the size and scale of these models means that they are extremely costly to use, requiring substantial fractions of a second up to several seconds or even ten seconds or more per clause to translate, depending on linguistic complexity and the size of the search space. A typical 15 minute GDELT update would require upwards of 500 hours of CPU time on a single CPU core to translate in its entirety using full-quality Moses models. GDELT is therefore designed to aggressively optimize its use of these high-quality resources and uses the translation ranking from the previous stage to translate only those documents estimated to have the highest likelihood of yielding the largest number and highest quality of extracted Events and GKG entries. It works through the Candidate Translation Ranking starting with the highest-scored documents, translating each document using Moses until the 15 minutes are up, at which point High Pass Translation halts and the final results up to that point are packaged and returned to the core GDELT infrastructure for processing, while the translation pipeline begins anew for the next 15 minutes of material. In this way additional computational resources can be added to linearly scale the quality of the translation infrastructure, and the Candidate Translation Ranking scoring algorithm can be modified in the future to prioritize material about a specific evolving situation of interest (such as an infectious disease outbreak) or to deemphasize redundant coverage of a major world event to ensure that other events are not drowned out. The output of the Low and Mid Pass Translation engines is of sufficient quality for topical, emotional, toponymic and well known entity recovery, as well as the identification and extraction of a large number of events, but may not achieve the same quality for certain kinds of topical and emotional discourse and for certain more complex events that require deep reading of the text. Depending on time remaining in a 15 minute interval and the score of a given document, documents in a language for which a Moses model is not currently available may still be processed using Moses in a kind of pass-through mode in which a null or semi-null translation model (English-to-English) is used, but the language model is the massive English language model shared by the other Moses models, in order to improve grammatical quality and perform wider-scoped reordering. Certain words with similar usage contexts or equivalent translations in some languages, such as “suppress” and “repress”, are included in the semi-null translation model, using the language model to provide the proper translation based on context even for languages for which a Moses translation model is not available.
  • User Adaptation (Dynamic Language Models). One of the greatest values of the nuanced human translation of the news offered by government agencies like BBCM and FBIS/OSC is that they take into consideration the specific use cases of their end users, adjusting their translations to preserve key elements of the meaning or linguistic cues of the original text and utilizing iterative user feedback to refine the translation as needed. Machine translation, on the other hand, ordinarily has no external domain knowledge of its user or use case and must instead choose an overall “best” translation that is a reasonable balance of linguistic and content fidelity in a generalized use case. In the case of GDELT, however, machine translation is being used as input to other machine algorithms, rather than to a human, offering a unique opportunity to allow the translation system to communicate directly with its consumer to carefully tailor its translations. Unlike traditional “domain adaptation” in which translation and language models are altered to work on new source material, the ability of translator and consumer to work together offers the unique ability to perform User Adaptation, which approximates a dynamic language model in which translation output is dynamically reweighted to best meet the specific needs of a particular translation task. To achieve this, Moses is run in “N Best Distinct” mode in which it outputs the top several candidate translations for each sentence. These are scored in a process similar to that used by Candidate Translation Ranking, to select the candidate translation most aligned with the specific wording recognized by GDELT’s Event and GKG dictionaries. In essence, Moses is asked to output not the single “best” translation of each sentence, but instead to offer a range of options, each of which is assigned a secondary score assessing its overlap with the Event and GKG vocabularies; this score is then combined with the Moses score to rerank the candidates and select the one most aligned with GDELT’s needs. This reranking score can be dynamically adjusted on-the-fly to prioritize specific kinds of translations, essentially creating a dynamic language model (a minimal reranking sketch appears after this list). As an example of this at work, when presented with “Die israelische Luftwaffe hat Medienberichten zufolge Ziele in Syrien angegriffen”, the MosesCore German models offer “The Israeli Air Force has according to media reports, objectives in Syria attacked” as the “best” translation. Indeed, this is the most linguistically faithful to the original German, but would not result in a match from the GDELT Event system. The second and third best translations change the latter half to “aims in Syria attacked” and “attacked objectives in Syria”, respectively. Yet, the fourth best translation according to Moses is perhaps the most understandable: “The Israeli Air Force has been attacking targets in Syria according to media reports.” This version also results in a much higher score for its overlap with the language required of the Event extraction system and thus is selected as the final translation, even though Moses on its own would have selected the first version. This even extends to emotional language, with translations being weighted towards ones containing emotional words from the GCAM dictionaries if they are reasonable candidates. In this way, the User Adaptation process allows GDELT to restore the translator-consumer feedback missing from most machine translation pipelines.
  • High Pass UNK Substitution. In many cases the translation models used in the Low and Mid Pass stages have greater coverage than the Moses translation models due to their more varied and extensive training data. Thus, when Moses translation of a document is complete, it is reprocessed and all words marked as “UNK” (Unknown) are translated using the Low/Mid Pass translation model and replaced if known (a minimal substitution sketch appears after this list). This combines Moses’ superior grammatical output and more extensive disambiguation capacity with the Low and Mid Pass stages’ greater coverage.
  • Recapitalization Models. Not all languages have the concept of capitalization of proper names, while others utilize capitalization rules that differ from English (for example the German phrase “die israelische Luftwaffe” instead of “the Israeli Air Force”). Capitalization language models transcode translation output into English capitalization standards. In addition, when Moses is used to translate a document, the word-level alignment data is requested from Moses and used to reconstruct the mapping between words capitalized in the original text and those in the final output text, adjusted for languages in which capitalization differs from English.
  • Event/GKG Processing. Finally, the translated document and any Native GCAM/GKG scores are packaged and sent back to the core infrastructure for Event and GKG processing like normal.
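As a rough illustration of the two-pass gate described in the Language Detection step above, the sketch below uses the open pycld2 binding to CLD2. It is a simplification: GDELT's production detector additionally pairs CLD2 with covering language models and runs the fast English-only pass inline during ingestion, neither of which is reproduced here.

```python
# Illustrative two-pass routing gate using the pycld2 binding to CLD2
# (install with `pip install pycld2`); a simplified stand-in for GDELT's
# enhanced CLD2 + covering-language-model detector.
import pycld2

def is_english(text):
    """Fast first pass: decide only English vs. not-English."""
    is_reliable, _bytes_found, details = pycld2.detect(text)
    # details is a tuple of (language_name, language_code, percent, score)
    return bool(is_reliable and details and details[0][1] == "en")

def route(text):
    """English goes straight to Event/GKG processing; everything else is
    queued for the translation infrastructure with its detected language."""
    if is_english(text):
        return ("event_gkg", "en")
    _reliable, _bytes_found, details = pycld2.detect(text)
    lang = details[0][1] if details else "un"
    return ("translation_queue", lang)

# Expected to route to the translation queue as German:
print(route("Die israelische Luftwaffe hat Medienberichten zufolge Ziele in Syrien angegriffen."))
```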
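The Word Segmentation step mentions dictionary-based maximal matching as the simplest approach. The toy sketch below shows greedy longest-match segmentation and the kind of ambiguity it can produce; GDELT's pipeline relies on the dedicated segmenters listed under Data Sources rather than this simple method.

```python
# Toy sketch of dictionary-based maximal (greedy longest-match) segmentation.
def maximal_match(text, dictionary, max_word_len=8):
    """Greedily take the longest dictionary word at each position;
    fall back to a single character when nothing matches."""
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words

# Classic ambiguity example: greedy matching yields ["北京大学", "生"]
# rather than the alternative reading ["北京", "大学生"].
print(maximal_match("北京大学生", {"北京", "大学", "北京大学", "学生", "大学生"}))
```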
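The heart of Candidate Translation Ranking and User Adaptation is rescoring: each candidate translation's fidelity score is blended with a score measuring its overlap with the Event, GKG, and GCAM vocabularies. The sketch below illustrates that reranking on the German example discussed above; the vocabulary phrases, Moses scores, and blending weight are all hypothetical stand-ins, not GDELT's actual dictionaries or scoring function.

```python
# Illustrative reranking of Moses n-best output toward GDELT's vocabularies.
# The vocabulary phrases, Moses scores, and blending weight are hypothetical.
def overlap_score(candidate, vocabulary):
    """Fraction of vocabulary phrases appearing in the candidate translation."""
    text = candidate.lower()
    return sum(phrase in text for phrase in vocabulary) / max(len(vocabulary), 1)

def rerank(nbest, vocabulary, alpha=0.2):
    """nbest: list of (translation, moses_score) pairs, best first.
    Returns the candidates reordered by a blend of fidelity and task overlap."""
    rescored = sorted(
        ((alpha * moses_score + (1 - alpha) * overlap_score(text, vocabulary), text)
         for text, moses_score in nbest),
        reverse=True)
    return [text for _score, text in rescored]

# The German example from above, with invented Moses scores for illustration:
nbest = [
    ("The Israeli Air Force has according to media reports, objectives in Syria attacked", -4.1),
    ("The Israeli Air Force has aims in Syria attacked", -4.3),
    ("The Israeli Air Force has attacked objectives in Syria", -4.4),
    ("The Israeli Air Force has been attacking targets in Syria according to media reports.", -4.6),
]
event_vocab = {"air force", "attacking targets", "attacked targets"}
print(rerank(nbest, event_vocab)[0])  # selects the fourth, most usable candidate
```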
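High Pass UNK Substitution amounts to a targeted lookup: any token Moses could not translate is retried against the broader-coverage Low/Mid Pass model. A minimal sketch follows; the "UNK|" marking convention and the lookup table here are hypothetical illustrations, not GDELT's actual format.

```python
# Minimal sketch of UNK substitution. The "UNK|<source word>" marking
# convention and the low_pass_model lookup table are hypothetical.
def substitute_unknowns(tokens, low_pass_model):
    """Replace tokens Moses left untranslated using the wider-coverage
    Low/Mid Pass translation model; leave everything else untouched."""
    out = []
    for tok in tokens:
        if tok.startswith("UNK|"):
            source_word = tok[4:]
            out.append(low_pass_model.get(source_word, source_word))
        else:
            out.append(tok)
    return " ".join(out)

print(substitute_unknowns(["The", "UNK|Außenministerin", "spoke", "today"],
                          {"Außenministerin": "foreign minister"}))
# -> "The foreign minister spoke today"
```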

 

DATA SOURCES

A translation system as massive as GDELT Translingual, spanning 65 languages, would not be possible without drawing from an enormous collection of linguistic resources and tools. In addition to the resources cited below, GDELT Translingual draws on myriad people and resources that provided translations, grammatical rules, advice and recommendations on language-specific nuances, and a wealth of other information and assistance that has made this platform possible.

GNS

Geography is a key focal point of GDELT, and to ensure maximal coverage of even the smallest local locations and name variants, all available multilingual Romanization data was extracted from the United States National Geospatial-Intelligence Agency's GEOnet Names Server (GNS) database. All “S” script entries were crosswalked to their Romanized versions, and all entries in which the Romanized and native script names differed were also extracted. GNS only records the native script name of each toponym in the primary official language of that location, meaning that only Egyptian Arabic name variants for Cairo are included in GNS, for example. In practice, the majority of larger cities throughout the world (that are likely to be mentioned in other languages) have entries on Wikipedia with interlingual links connecting them to their English names (discussed in more detail in the Wikipedia section below), allowing mentions of them to be recognized across languages. GNS in this case is used to ensure that mentions of very small local landmarks and cities that are not frequently mentioned outside of local media in the primary domestic language are recognized (since they likely do not have a Wikipedia entry and/or interlingual links). A small remote hilltop in an uninhabited area of Egypt is likely only mentioned in Egyptian Arabic domestic press (and recorded in GNS), while a major city like Cairo has its own entry in Wikipedia with interlingual links connecting it to its name and name variants in many languages.
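As a rough sketch of the crosswalk described above, the code below pairs native script and Romanized toponym spellings from a tab-delimited GNS export, keeping entries where the two differ. The column names are simplified placeholders, not the actual GNS field names, which should be checked against the current GNS documentation.

```python
# Rough sketch of extracting native-script -> Romanized toponym pairs from a
# tab-delimited GNS export. NATIVE_NAME / ROMANIZED_NAME are hypothetical,
# simplified column names standing in for the real GNS field layout.
import csv

def gns_translation_pairs(path):
    pairs = {}
    with open(path, encoding="utf-8", newline="") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            native = (row.get("NATIVE_NAME") or "").strip()
            romanized = (row.get("ROMANIZED_NAME") or "").strip()
            # Keep entries whose Romanized and native spellings differ,
            # as described above.
            if native and romanized and native != romanized:
                pairs[native] = romanized
    return pairs
```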

Google Translate

We would like to thank the Google Translate for Research program for its long-standing grant provided to the GDELT Project. This was instrumental in the early prototyping of GDELT’s translation pipeline in exploring the ability of the GDELT Event and GKG systems to cope with machine translated content and to verify and compare the output of the GDELT Translingual system during its development.

Language Detection

The Google Chrome Compact Language Detector 2 (CLD2) is used for all language identification tasks, supplemented by a set of covering language models to increase accuracy in boundary cases and to dramatically accelerate base separation of English/Non-English.

  • Sites, Dick. (2013). Compact Language Detection 2 (CLD2). https://code.google.com/p/cld2/

Moses

The Moses statistical machine translation system is used to provide very high quality translations for a subset of languages for which translation models have been graciously shared with GDELT by their respective creators. We would like to thank Philipp Koehn and the team behind Moses, both for making Moses itself available to the open community, and for the use of several of their translation and language models, including Czech, French, German, Spanish, and Russian:

  • Moses: Open Source Toolkit for Statistical Machine Translation, Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, Evan Herbst, ACL 2007

We'd like to thank Mark Fishel and the Institute of Computer Science at the University of Tartu, Estonia for use of their Estonian translation model:

  • Fishel, Mark. (2012). “In-domain Data FTW”. Proceedings of the Fifth International Conference Human Language Technologies — The Baltic Perspective. Pp. 50-57. Available online at http://ebooks.iospress.nl/Download/Pdf/7484

Unicode Common Locale Data Repository (CLDR)

The Unicode CLDR “provides key building blocks for software to support the world's languages, with the largest and most extensive standard repository of locale data available.” CLDR data is incorporated directly and through enrichments integrated into the WordNet distributions compiled by:

  • Francis Bond and Ryan Foster (2013) Linking and extending an open multilingual wordnet. In 51st Annual Meeting of the Association for Computational Linguistics: ACL-2013. Sofia. 1352–1362. http://aclweb.org/anthology/P/P13/P13-1133.pdf

Wikipedia

Wikipedia represents one of the world’s largest interlingual multidisciplinary knowledge resources. Each of its more than 34.8 million entries spanning over 270 languages includes a set of “interlingual links” that connect it to the equivalent entry in other languages, offering a rudimentary form of translation. Entries for major topics like a capital city or head of state can include interlingual links to the corresponding entries in upwards of 100 other languages, while more specialized entries may have only a handful of links to other languages or none at all. While cruder than a true bilingual translation dictionary, these interlingual links nonetheless provide an extremely powerful translation capacity. Most critically, they cover the world’s people, organizations, locations, and major events, connecting their names (which are rarely found in traditional translation guides) across languages. Even more importantly, Wikipedia is constantly updated in near-realtime, meaning that new emerging names are rapidly added to Wikipedia, such that within hours or days of a new figure emerging into the news, there is not only a basic entry for that person on Wikipedia, but also interlingual links translating that person’s name into many other languages. In total, there are currently 29.5 million interlingual links to/from the English Wikipedia’s 4.6 million pages and the other 270 Wikipedia languages. There are 519,169 links to/from Farsi alone and 731,774 to/from Russian, while even smaller languages like Estonian feature 92,081 links to/from English.

However, Wikipedia entries tend to use an entity’s full formal name, which is not always the name used in news coverage. To address this, Wikipedia also includes “redirects,” which are commonly used variants of a name that redirect to the relevant page, allowing searches for “Barack Hussein Obama” and “President Obama” to both redirect to the entry for “Barack Obama.” Redirects are used across all language editions of Wikipedia, meaning that in Arabic, for example, the various shortened names used to refer to major political leaders are all recorded in these redirect fields, connecting the alternate spellings and forms of a name back to the full formal version of that name, whose interlingual links can, in turn, be used to connect each of those language-specific name variants back to the English translation of that name. In total there are 25.7 million redirect links in the top 127 language Wikipedias.

GDELT Translingual combines Wikipedia’s redirects and interlingual links to build a massive translation dictionary containing over 55 million entries and updated in near-realtime. This dictionary has a particular emphasis on proper names of people, organizations, and named events (such as the “Orange Revolution”).
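The sketch below illustrates how redirects and interlingual links can be chained into a name-to-English translation dictionary of the kind described above. The input mappings are assumed to have already been parsed from Wikipedia dumps (not shown), and the toy Russian example data is for illustration only.

```python
# Sketch of chaining Wikipedia redirects through interlingual links to build
# name -> English-title translation entries. The two input mappings are
# assumed to have been extracted from Wikipedia dumps beforehand (not shown).
def build_translation_dictionary(redirects, interlingual_links):
    """redirects: {language: {variant_title: canonical_title}}
    interlingual_links: {language: {canonical_title: english_title}}
    Returns {language: {any_name_form: english_title}}."""
    dictionary = {}
    for lang, links in interlingual_links.items():
        entries = dict(links)  # canonical titles map directly to English
        # Redirect variants inherit the translation of their target page.
        for variant, target in redirects.get(lang, {}).items():
            if target in links:
                entries[variant] = links[target]
        dictionary[lang] = entries
    return dictionary

# Toy data in the spirit of the redirect examples above:
redirects = {"ru": {"Барак Обама": "Обама, Барак"}}
links = {"ru": {"Обама, Барак": "Barack Obama"}}
print(build_translation_dictionary(redirects, links)["ru"]["Барак Обама"])
# -> "Barack Obama"
```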

Wiktionary

Wiktionary (http://en.wikipedia.org/wiki/Wiktionary) is a sister project of Wikipedia, aiming to “create a free content dictionary of all words in all languages”, and today covers 159 languages. Like Wikipedia, Wiktionary features interlingual connectivity, drawing together various translations of a word across languages. Because Wiktionary includes many archaic and rarer alternative translations, some of which conflict with modern usage, several filtering passes are used to remove entries that conflict with the other linguistic data sources described here.

Wiktionary entries are incorporated through two processes. One directly compiles all interlingual links by parsing Wiktionary itself, while the second incorporates the Wiktionary enrichments integrated into the WordNet distributions compiled by:

  • Francis Bond and Ryan Foster (2013) Linking and extending an open multilingual wordnet. In 51st Annual Meeting of the Association for Computational Linguistics: ACL-2013. Sofia. 1352–1362. http://aclweb.org/anthology/P/P13/P13-1133.pdf

WordNet

The following 22 WordNets are used (some encompass multiple languages – see http://compling.hss.ntu.edu.sg/omw/ for more details on each):

  • Francis Bond and Kyonghee Paik (2012) A survey of wordnets and their licenses In Proceedings of the 6th Global WordNet Conference (GWC 2012). Matsue. 64–71. http://web.mysites.ntu.edu.sg/fcbond/open/pubs/2012-gwc-wn-license.pdf
  • Albanet – Ervin Ruci (2008). On the current state of Albanet and related applications, Technical Report, University of Vlora. http://fjalnet.com/technicalreportalbanet.pdf
  • Arabic WordNet (AWN) – Black W., Elkateb S., Rodriguez H., Alkhalifa M., Vossen P., Pease A., Bertran M., Fellbaum C., (2006) The Arabic WordNet Project, Proceedings of LREC 2006
  • BulTreeBank Wordnet (BTB-WN) – Simov, Kiril and Osenova, Petya (2010) Constructing of an Ontology-based Lexicon for Bulgarian, Proceedings of LREC 2010. http://www.lrec-conf.org/proceedings/lrec2010/summaries/848.html
  • Chinese Open WordNet – Shan Wang and Francis Bond (2013) Building the Chinese Open Wordnet (COW): Starting from Core Synsets. In Proceedings of the 11th Workshop on Asian Language Resources, a Workshop of The 6th International Joint Conference on Natural Language Processing (IJCNLP-6). Nagoya, Japan. pp.10–18. https://aclweb.org/anthology/W/W13/W13-4302.pdf
  • Chinese Wordnet (Taiwan) – Huang, C.-R., Hsieh, S.-K., Hong, J.-F., Chen, Y.-Z., Su, I.-L., Chen, Y.-X., and Huang, S.-W. (2010). Chinese wordnet: Design and implementation of a cross-lingual knowledge processing infrastructure. In Journal of Chinese Information Processing. 24:2 pp 14–23.
  • DanNet – Pedersen, B. S., Nimb, S., Asmussen, J., Sørensen, N. H., Trap-Jensen, L. and Lorentzen, H. (2009) DanNet — the challenge of compiling a WordNet for Danish by reusing a monolingual dictionary Language Resources and Evaluation. Volume 43:3 pp. 269-299
  • Greek Wordnet – Open Knowledge Foundation Greece https://github.com/okfngr/wordnet
  • Princeton WordNet – Christiane Fellbaum. (ed.) (1998) WordNet: An Electronic Lexical Database, MIT Press
  • Persian WordNet – Montazery, Mortaza and Heshaam Faili (2010) Automatic Persian WordNet Construction. In Proceedings of the 23rd International Conference on Computational Linguistics, pp. 846–850
  • FinnWordNet – Lindén K., Carlson L. (2010) FinnWordNet — WordNet på finska via översättning. LexicoNordica — Nordic Journal of Lexicography, 17:119–140
  • WOLF (Wordnet Libre du Français) – Benoît Sagot and Darja Fišer (2008) Building a free French wordnet from multilingual resources, E. L. R. A. (ELRA) (ed.), Proceedings of the Sixth International Language Resources and Evaluation (LREC’08), Marrakech, Morocco
  • Hebrew Wordnet – Noam Ordan and Shuly Wintner (2007) Hebrew WordNet: a test case of aligning lexical databases across languages. International Journal of Translation 19(1):39–58, 2007
  • MultiWordNet – Emanuele Pianta, Luisa Bentivogli and Christian Girardi. (2002) MultiWordNet: Developing an Aligned Multilingual Database. In Proceedings of the First International Conference on Global WordNet, Mysore, India, January 21-25, 2002, pp. 293-302.
  • Japanese Wordnet – Hitoshi Isahara, Francis Bond, Kiyotaka Uchimoto, Masao Utiyama and Kyoko Kanzaki (2008) Development of Japanese WordNet. In LREC-2008, Marrakech.
  • Multilingual Central Repository – Aitor Gonzalez-Agirre, Egoitz Laparra and German Rigau (2012) Multilingual Central Repository version 3.0: upgrading a very large lexical knowledge base. In Proceedings of the 6th Global WordNet Conference (GWC 2012) Matsue, Japan.
  • Wordnet Bahasa – Nurril Hirfana Mohamed Noor, Suerya Sapuan and Francis Bond (2011) Creating the open Wordnet Bahasa In Proceedings of the 25th Pacific Asia Conference on Language, Information and Computation (PACLIC 25) pages 258–267. Singapore. http://web.mysites.ntu.edu.sg/fcbond/open/pubs/2011-wn-bahasa.pdf
  • Norwegian Wordnet – Fjeld, Ruth Vatvedt and Nygaard, Lars (2009) NorNet – a monolingual wordnet of modern Norwegian In Proceedings of the NODALIDA 2009 workshop WordNets and other Lexical Semantic Resources — between Lexical Semantics, Lexicography, Terminology and Formal Ontologies. pages 13–16. Estonia http://dspace.utlib.ee/dspace/handle/10062/9837
  • plWordNet – Maciej Piasecki, Stanisław Szpakowicz and Bartosz Broda. (2009) A Wordnet from the Ground Up. Wroclaw: Oficyna Wydawnicza Politechniki Wroclawskiej, Poland. http://www.plwordnet.pwr.wroc.pl/main/content/files/publications/A_Wordnet_from_the_Ground_Up.pdf
  • OpenWN-PT – Valeria de Paiva and Alexandre Rademaker (2012) Revisiting a Brazilian wordnet. In Proceedings of Global Wordnet Conference, Matsue. Global Wordnet Association. (also with Gerard de Melo's contribution)
  • sloWNet – Fišer, Darja, Novak, Jernej, and Erjavec, Tomaž (2012) sloWNet 3.0: development, extension and cleaning. In Proceedings of the 6th International Global Wordnet Conference (GWC 2012). The Global WordNet Association, pp. 113-117
  • SALDO – Borin, Lars and Forsberg, Markus and Lönngren, Lennart (2013) SALDO: a touch of yin to WordNet's yang. Language Resources and Evaluation 47(4):1191–1211, 2013 http://dx.doi.org/10.1007/s10579-013-9233-4
  • Thai Wordnet – Thoongsup S., Charoenporn T., Robkop K., Sinthurahat T., Mokarat C., Sornlertlamvanich V., Isahara H. (2009) Thai Wordnet Construction Proceedings of The 7th Workshop on Asian Language Resources (ALR7), Joint conference of the 47th Annual Meeting of the Association for Computational Linguistics (ACL) and the 4th International Joint Conference on Natural Language Processing (IJCNLP) Suntec, Singapore

Word Segmentation

Not all languages feature unambiguous word boundaries and thus the following algorithms are used to perform word segmentation for their respective languages.

  • Chinese word segmentation performed by Stanford Chinese Word Segmenter using the Peking University Standard (http://nlp.stanford.edu/software/segmenter.shtml). Pi-Chuan Chang, Michel Galley and Chris Manning. 2008. Optimizing Chinese Word Segmentation for Machine Translation Performance. In WMT. http://nlp.stanford.edu/pubs/acl-wmt08-cws.pdf
  • Japanese word segmentation performed by KyTea (http://www.phontron.com/kytea/). Graham Neubig, Yosuke Nakata, Shinsuke Mori. Pointwise Prediction for Robust, Adaptable Japanese Morphological Analysis. The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT). Portland, Oregon, USA. June 2011
  • Thai word segmentation performed by Smart Word Analysis for THai (SWATH) (http://www.cs.cmu.edu/~paisarn/software.html) running in Part of Speech Bigram mode. Surapant Meknavin, Paisarn Charoenpornsawat, and Boonserm Kijsirikul, 1997. Feature-based Thai Word Segmentation. In Proceedings of the Natural Language Processing Pacific Rim Symposium 1997(NLPRS’97), Phuket, Thailand. http://www.cs.cmu.edu/~paisarn/papers/nlprs97.pdf
  • Vietnamese word segmentation performed by JVnSegmenter http://jvnsegmenter.sourceforge.net/. Cam-Tu Nguyen and Xuan-Hieu Phan, "JVnSegmenter: A Java-based Vietnamese Word Segmentation Tool", http://jvnsegmenter.sourceforge.net/, 2007.