GDELT monitors the world today in more than 400 written languages, while its collaborations with the Internet Archive's TV News Archive involve more than 150 spoken and onscreen languages and dialects. Multilingualism is thus at the center of everything GDELT does. When it comes to television news, this means that search must be able to properly handle spoken and onscreen broadcast transcripts that span many languages, often at the same time or in rapid succession and in the most surprising places. Spanner has a built-in multilingual fulltext tokenizer that allows us to simply load data directly into Spanner and leave tokenization entirely in its hands, rather than having to build our own massively multilingual normalization and tokenization infrastructure that must be constantly updated. Let's take a closer look at how well it works by using Gemini to craft a series of highly multilingual passages and compare its tokenization with that of Spanner's tokenizer. We find that Spanner does a superb job at handling even extremely multilingual passages that heavily intersperse both space-segmented and scriptio continua languages and handles all common scriptio continua languages well, though it does struggle to tokenize some less common long-tail scriptio continua languages. Most importantly, Spanner's multilingual tokenization is completely automatic, requiring no a priori knowledge of what languages to expect and allowing us to offload the entire tokenization process to Spanner.
To maximally stress Spanner's tokenizer, we'll ask Gemini 3.5 Flash to come up with some highly multilingual code-switched text:
give me a paragraph of text that mixes english, spanish russian, chinese, japanese, thai, vietnamese, burmese, and a bunch of other languages that interleaves them so i can test multilingual word segmentation software.
It returns the following passage, along with this description:
Here is a sample paragraph designed to challenge multilingual word segmentation software. It incorporates English, Spanish, Russian, Chinese, Japanese, Thai, Vietnamese, Burmese, Arabic, Hindi, and Korean.
To make the test more rigorous, some transitions between scripts include spaces, while others are deliberately run together (unspaced) to see if your segmenter can correctly identify script boundaries and language-specific word boundaries.
Key challenges within this text:
- English & Spanish (Latin to Latin): Transitioning from English to Spanish (complex, but hoy en día nosotros must analyze) tests if the tokenizer handles standard spaces and basic Latin diacritics.
- Latin to Cyrillic: The transition from English to Russian (analyze how искусственный интеллект) tests alphabetical script boundary detection.
- Cyrillic to Chinese (Spaced): The transition интеллект 能够 tests boundaries between a spaced, alphabetic script and an unspaced, logographic script.
- Chinese to Japanese (Unspaced): 能够处理各种不同言語の文字境界を contains no spaces. The system must segment the Chinese characters (Hanzi) and then transition into Japanese Kanji and Hiragana (言語の文字境界を) without a space separating the two languages.
- Japanese to Thai (Unspaced): The transition 境界をเรียนรู้ has no space, forcing the segmenter to split between Japanese Hiragana (を) and the Thai alphabet (เร).
- Thai (Unspaced Word Segmentation): เรียนรู้เพื่อแบ่งวรรคตอนคำศัพท์ is a continuous string of Thai. The software must segment this into individual Thai words without relying on spaces.
- Thai to Vietnamese: คำศัพท์ một cách chính xác nhất transitions from Thai to Vietnamese. Vietnamese is spaced by syllable rather than by word, which often tests whether tokenizers over-segment or group syllables into words.
- Vietnamese to Burmese (Unspaced): The transition nhấtစာလုံးများကို has no space, pushing the segmenter to split Latin characters from the Burmese script.
- Burmese (Unspaced Word Segmentation): စာလုံးများကိုခွဲခြားရန် requires segmenting Burmese syllables and words without spaces.
- Bi-directional (LTR to RTL) Transition: ခွဲခြားရန် وفي نفس الوقت transitions from Burmese (Left-to-Right) to Arabic (Right-to-Left), which tests the layout and boundary detection engine.
- RTL to Devanagari (Hindi): الوقت विभिन्न भाषाओं को transitions back from Arabic (RTL) to Hindi (LTR, Devanagari script).
- Hindi to Korean (Unspaced Korean): को 올바르게구분해야합니다. transitions into Korean Hangul. The Korean segment 올바르게구분해야합니다 ("must distinguish correctly") is deliberately written without standard spaces to test morphological analysis.
The text:
The global language system is complex, but hoy en día nosotros must analyze how искусственный интеллект 能够处理各种不同言語の文字境界をเรียนรู้เพื่อแบ่งวรรคตอนคำศัพท์ một cách chính xác nhấtစာလုံးများကိုခွဲခြားရန် وفي نفس الوقت विभिन्न भाषाओं को 올바르게구분해야합니다.
Gemini's segmentation of that text:
The,global,language,system,is,complex,\, ,but,hoy,en,día,nosotros,must,analyze,how,искусственный,интеллект,能够,处理,各种,不同,言語,の,文字,境界,を,เรียนรู้,เพื่อ,แบ่ง,วรรคตอน,คำศัพท์,một,cách,chính xác,nhất,စာလုံးများ,ကို,ခွဲခြား,ရန်,وفي,نفس,الوقت,विभिन्न,भाषाओं,को,올바르게,구분해야합니다,.
Thankfully, Spanner makes it trivial to visualize how its tokenizer splits a given text using a built-in debugging function (the passage to test goes between the quotes):
SELECT DEBUG_TOKENLIST(TOKENIZE_FULLTEXT(""))
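This yields the following tokenization. Spanner handles the combined set of languages impressively, though it splits some scriptio continua words more finely than Gemini does (for example, 各种 becomes 各, 种 and วรรคตอน becomes วรรค, ตอน) and leaves the Burmese run စာလုံးများကိုခွဲခြားရန် as a single unsplit token: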
"the(boundary), global, language, system, is, complex, but, hoy, en, día, nosotros, must, analyze, how, искусственный, интеллект, 能够, 处理, 各, 种, 不同, 言語, の, 文字, 境界, を, เรียน, รู้, เพื่อ, แบ่ง, วรรค, ตอน, คำ, ศัพท์, một, cách, chính, xác, nhất, စာလုံးများကိုခွဲခြားရန်, وفي, نفس, الوقت, विभिन्न, भाषाओं, को, 올바르게, 구분, 해야, 합니다(end_boundary)"
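Comparing the two outputs by eye gets tedious as the passages grow. The following is a minimal Python sketch, not part of the original workflow, that parses a DEBUG_TOKENLIST string and lists only the spans where it disagrees with a reference segmentation. The helper names (parse_debug_tokenlist, diff_segmentations) are hypothetical, and the parser assumes the output format seen in this post: comma-separated tokens with (boundary) and (end_boundary) annotations on the first and last token.

```python
import difflib
import re

# Matches the "(boundary)" / "(end_boundary)" annotations seen in this post.
_ANNOTATION = re.compile(r"\((?:end_)?boundary\)$")


def parse_debug_tokenlist(raw):
    """Turn a DEBUG_TOKENLIST string into a plain list of tokens."""
    body = raw.strip().strip('"')
    tokens = [_ANNOTATION.sub("", piece.strip()) for piece in body.split(",")]
    return [token for token in tokens if token]


def diff_segmentations(reference, candidate):
    """Return (reference_span, candidate_span) pairs where the two differ.

    The reference is lowercased first, because the Spanner output shown in
    this post is lowercased while the Gemini output keeps the original case.
    """
    ref = [token.lower() for token in reference]
    matcher = difflib.SequenceMatcher(a=ref, b=candidate, autojunk=False)
    return [
        (ref[i1:i2], candidate[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# A short excerpt of the first passage, used purely as an illustration.
spanner = parse_debug_tokenlist('"the(boundary), global, 各, 种, 不同(end_boundary)"')
gemini = ["The", "global", "各种", "不同"]
for ref_span, cand_span in diff_segmentations(gemini, spanner):
    print(ref_span, "->", cand_span)
```

Lowercasing with str.lower() is only an approximation of whatever normalization Spanner applies, so a few tokens may be reported as differences even when the segmentation itself agrees.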
Let's ask for a larger and more linguistically diverse text block:
give me a block of text that mixes brief snippets in each of the top 100 languages of the world (and make sure to include ALL scriptio continua languages) and interleaves them so i can test multilingual word segmentation software. give me the original block of text for me to test. then give me a gold standard version that has commas between all of the "words" so i can test my segmentation software. finally, give me an explanation broken down by language that shows each snippet broken out and what each word/piece is/means and why it is segmented that way.
The text:
hello world hola mundo नमस्ते दुनिया مرحبا العالم ওহে বিশ্ব olá mundo привет мир ہیلو دنیا halo dunia hallo welt bonjour monde 我们学习 日本語を話す ਸਤਿ ਸ੍ਰੀ ਅਕਾਲ halo donya 侬好上海 హలో ప్రపంచం नमस्कार जग merhaba dünya 안녕 세상 xin chào வணக்கம் உலகம் ciao mondo سلام دنیا witaj świecie привіт світ નમસ્તે દુનિયા ನಮಸ್ಕಾರ ವಿಶ್ವ ഹലോ ലോകം നമസ്കാരം വിശ്വം salam dünya 大家好 halo dunya salut lume hallo wereld pẹlẹ o ndewo ụwa sannu duniya ሰላም ዓለም akkam jirtu silav cîhan kamusta mundo سلام نړۍ هيلو دنيا salom dunyo salaam adduunka γειά σου ahoj světe helló világ hej världen שלום עולם sawubona mhlaba molo hlabathi прывітанне свет сәлем әлем здравей свят здраво свете bok svijete ahoj svet hej verden hei maailma hej verden hola món labas pasauli sveika pasaule tere maailm გამარჯობა სამყარო ბარև աշխարհ салам дүйнө салом ҷαҳон salam dünýä сайн уу ياخشىمۇسىز دۇنيا kaixo mundua ola mundo dia duit helo byd halò saoghal ave mundus saluton mondo hallo wêreld halo kalibutan manao ahoana kia ora aloha honua හෙලෝ ලෝකය नमस्ते संसार 汝好 侬好 大家好 你好 përshëndetje botë здраво свет pozdravljen svet merħba dinja ภาษาไทยน่ารู้ ພາສາລາວດີ ភាសាខ្មែរពិរោះ မြန်မာစာလှတယ် བོད་སྐད་ཡིན།
Gemini's segmentation:
hello, world, hola, mundo, नमस्ते, दुनिया, مرحبا, العالم, ওহে, বিশ্ব, olá, mundo, привет, мир, ہیلو, دنیا, halo, dunia, hallo, welt, bonjour, monde, 我们, 学习, 日本語, を, 話す, ਸਤਿ, ਸ੍ਰੀ, ਅਕਾਲ, halo, donya, 侬好, 上海, హలో, ప్రపంచం, नमस्कार, जग, merhaba, dünya, 안녕, 세상, xin, chào, வணக்கம், உலகம், ciao, mondo, سلام, دنیا, witaj, świecie, привіт, світ, નમસ્તે, દુનિયા, ನಮಸ್ಕಾರ, ವಿಶ್ವ, ഹലോ, ലോകം, നമസ്കാരം, വിശ്വം, salam, dünya, 大家, 好, halo, dunya, salut, lume, hallo, wereld, pẹlẹ, o, ndewo, ụwa, sannu, duniya, ሰላም, ዓለም, akkam, jirtu, silav, cîhan, kamusta, mundo, سلام, نړۍ, هيلو, دنيا, salom, dunyo, salaam, adduunka, γειά, σου, ahoj, světe, helló, világ, hej, världen, שלום, עולם, sawubona, mhlaba, molo, hlabathi, прывітанне, свет, сәлем, әлем, здравей, свят, здраво, свете, bok, svijete, ahoj, svet, hej, verden, hei, maailma, hej, verden, hola, món, labas, pasauli, sveika, pasaule, tere, maailm, გამარჯობა, სამყარო, ბარև, աշխարհ, салам, дүйнө, салом, ҷαҳон, salam, dünýä, сайн, уу, ياخشىمۇسىز, دۇنيا, kaixo, mundua, ola, mundo, dia, duit, helo, byd, halò, saoghal, ave, mundus, saluton, mondo, hallo, wêreld, halo, kalibutan, manao, ahoana, kia, ora, aloha, honua, හෙලෝ, ලෝකය, नमस्ते, संसार, 汝好, 侬好, 大家, 好, 你好, përshëndetje, botë, здраво, свет, pozdravljen, svet, merħba, dinja, ภาษา, ไทย, น่า, รู้, ພາສາ, ລາວ, ດີ, ភាសា, ខ្មែរ, ពិរោះ, မြန်မာစာ, လှ, တယ်, བོད་སྐད་, ཡིན, །
Spanner's tokenization can be seen below. Again, it handles the fully mixed set of languages well, though it leaves two of the long-tail scriptio continua languages, Lao (ພາສາລາວດີ) and Burmese (မြန်မာစာလှတယ်), as single unsplit tokens:
hello(boundary), world, hola, mundo, नमस्ते, दुनिया, مرحبا, العالم, ওহে, বিশ্ব, olá, mundo, привет, мир, ہیلو, دنیا, halo, dunia, hallo, welt, bonjour, monde, 我们, 学习, 日本, 語, を, 話す, ਸਤਿ, ਸ੍ਰੀ, ਅਕਾਲ, halo, donya, 侬, 好, 上海, హలో, ప్రపంచం, नमस्कार, जग, merhaba, dünya, 안녕, 세상, xin, chào, வணக்கம், உலகம், ciao, mondo, سلام, دنیا, witaj, świecie, привіт, світ, નમસ્તે, દુનિયા, ನಮಸ್ಕಾರ, ವಿಶ್ವ, ഹലോ, ലോകം, നമസ്കാരം, വിശ്വം, salam, dünya, 大家, 好, halo, dunya, salut, lume, hallo, wereld, pẹlẹ, o, ndewo, ụwa, sannu, duniya, ሰላም, ዓለም, akkam, jirtu, silav, cîhan, kamusta, mundo, سلام, نړۍ, هيلو, دنيا, salom, dunyo, salaam, adduunka, γειά, σου, ahoj, světe, helló, világ, hej, världen, שלום, עולם, sawubona, mhlaba, molo, hlabathi, прывітанне, свет, сәлем, әлем, здравей, свят, здраво, свете, bok, svijete, ahoj, svet, hej, verden, hei, maailma, hej, verden, hola, món, labas, pasauli, sveika, pasaule, tere, maailm, გამარჯობა, სამყარო, ბარեւ, աշխարհ, салам, дүйнө, салом, ҷαҳон, salam, dünýä, сайн, уу, ياخشىمۇسىز, دۇنيا, kaixo, mundua, ola, mundo, dia, duit, helo, byd, halò, saoghal, ave, mundus, saluton, mondo, hallo, wêreld, halo, kalibutan, manao, ahoana, kia, ora, aloha, honua, හෙලෝ, ලෝකය, नमस्ते, संसार, 汝, 好, 侬, 好, 大家, 好, 你, 好, përshëndetje, botë, здраво, свет, pozdravljen, svet, merħba, dinja, ภาษา, ไทย, น่า, รู้, ພາສາລາວດີ, ភាសា, ខ្មែរ, ពិរោះ, မြန်မာစာလှတယ်, བོད, སྐད, ཡིན(end_boundary)"
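One rough way to spot which scripts were left unsplit is to count tokens per script in each segmentation and flag the scripts where the candidate produced fewer tokens than the reference. The sketch below is an illustration under stated assumptions rather than anything Spanner provides: the function names are hypothetical, and the "script" label is simply the first word of the Unicode character name of a token's first letter, which is a coarse stand-in for a real script property.

```python
import unicodedata
from collections import Counter


def script_of(token):
    """Rough script label: first word of the Unicode name of the first letter."""
    for char in token:
        if unicodedata.category(char).startswith("L"):
            name = unicodedata.name(char, "")
            return name.split(" ")[0] if name else "UNKNOWN"
    return "UNKNOWN"


def under_segmented_scripts(reference, candidate):
    """Map script -> (reference_count, candidate_count) where candidate has fewer."""
    ref_counts = Counter(script_of(token) for token in reference)
    cand_counts = Counter(script_of(token) for token in candidate)
    return {
        script: (count, cand_counts[script])
        for script, count in ref_counts.items()
        if cand_counts[script] < count
    }


# Thai, Lao and Burmese excerpts from the end of the passage above.
gemini = ["ภาษา", "ไทย", "น่า", "รู้", "ພາສາ", "ລາວ", "ດີ", "မြန်မာစာ", "လှ", "တယ်"]
spanner = ["ภาษา", "ไทย", "น่า", "รู้", "ພາສາລາວດີ", "မြန်မာစာလှတယ်"]
print(under_segmented_scripts(gemini, spanner))
```

Because this compares only totals, a script that is split more finely in one place and left unsplit in another can cancel out, so it works best as a first-pass filter before looking at the tokens themselves.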
Another passage:
give me a block of text that mixes brief snippets of 3-4 words each in each of the top 100 languages of the world (and make sure to include ALL scriptio continua languages) and interleaves them so i can test multilingual word segmentation software. give me the original block of text for me to test. then give me a gold standard version that has commas between all of the "words" so i can test my segmentation software. finally, give me an explanation broken down by language that shows each snippet broken out and what each word/piece is/means and why it is segmented that way.
The text:
We are traveling today to London我们今天去北京旅游Nosotros viajamos hoy a Madrid私は今日お寿司を食べますวันนี้พวกเราไปเที่ยวกรุงเทพມື້ນີ້ພວກເຮົາໄປທ່ຽວວຽງຈັນကျွန်တော်ယနေ့ရန်ကုန်သို့သွားမည်ថ្ងៃនេះពួកយើងទៅលេងភ្នំពេញདེ་རིང་ང་ཚོ་ལྷ་སར་འགྲོ་གི་ཡིན།ꦢꦶꦤꦲꦶꦏꦶꦲꦏꦸꦭꦸꦔꦩꦼꦚꦁꦱꦺꦴꦭꦺꦴᬤᬶᬦᬦᬶᬓᬶᬢᬶᬳᬂᬮᬸᬂᬲᬸᬃᬓᬧᬾᬓᬾᬦ᭄हम आज दिल्ली की यात्रा करेंगेআমরা আজ ঢাকা ভ্রমণ করতে যাচ্ছিنحن نسافر اليوم إلى القاهرةہم آج لاہور جا رہے ہیںМы сегодня летим в МосквуNós viajamos hoje para LisboaHari ini kami pergi ke JakartaWir besuchen heute den Hauptbahnhof우리는 오늘 서울로 여행을 갑니다אנחנו נוסעים היום אל ירושליםእኛ፡ዛሬ፡ወደ፡አዲስ፡አበባ፡እንሄዳለንHôm nay chúng tôi đi Hà NộiΕμείς σήμερα ταξιδεύουμε στην ΑθήναBiz bugün güzel Ankara'ya hızlıca gidiyoruz
Gemini segmentation:
We, are, traveling, today, to, London, 我们, 今天, 去, 北京, 旅游, Nosotros, viajamos, hoy, a, Madrid, 私, は, 今日, お寿司, を, 食べます, วันนี้, พวกเรา, ไป, เที่ยว, กรุงเทพ, ມື້ນີ້, ພວກເຮົາ, ໄປ, ທ່ຽວ, ວຽງຈัน, ကျွန်တော်, ယနေ့, ရန်ကုန်, သို့, သွား, မည်, ថ្ងៃនេះ, ពួកយើង, ទៅ, លេង, ភ្នំពេញ, དེ་རིང་, ང་ཚོ་, ལྷ་སར་, འགྲོ་, གི་ཡིན, ꦢꦶꦤ, ꦲꦶꦏꦶ, ꦲꦏꦸ, ꦭꦸꦔ, ꦩꦼꦚꦁ, ꦱꦺꦴꦭꦺꦴ, ᬤᬶᬦ, ᬦᬶᬓᬶ, ᬢᬶᬳᬂ, ᬮᬸᬂᬲᬸᬃ, ᬓ, ᬧᬾᬓᬾᬦ᭄, हम, आज, दिल्ली, की, यात्रा, करेंगे, আমরা, আজ, ঢাকা, ভ্রমণ, করতে, যাচ্ছি, نحن, نسافر, اليوم, إلى, القاهرة, ہم, آج, لاہور, جا, رہے, ہیں, Мы, сегодня, летим, в, Москву, Nós, viajamos, hoje, para, Lisboa, Hari, ini, kami, pergi, ke, Jakarta, Wir, besuchen, heute, den, Hauptbahnhof, 우리는, 오늘, 서울로, 여행을, 갑니다, אנחנו, נוסעים, היום, אל, ירושלים, እኛ, ዛሬ, ወደ, አዲስ, አበባ, እንሄዳለን, Hôm nay, chúng tôi, đi, Hà Nội, Εμείς, σήμερα, ταξιδεύουμε, στην, Αθήνα, Biz, bugün, güzel, Ankara'ya, hızlıca, gidiyoruz
Spanner tokenization:
"we(boundary), are, traveling, today, to, london, 我们, 今天, 去, 北京, 旅游, nosotros, viajamos, hoy, a, madrid, 私, は, 今日, お, 寿司, を, 食べ, ます, วัน, นี้, พวก, เรา, ไป, เที่ยว, กรุงเทพ, ມື້ນີ້ພວກເຮົາໄປທ່ຽວວຽງຈັນ, ကျွန်တော်ယနေ့ရန်ကုန်သို့သွားမည်, ថ្ងៃ, នេះ, ពួក, យើង, ទៅ, លេង, ភ្នំពេញ, དེ, རིང, ང, ཚོ, ལྷ, སར, འགྲོ, གི, ཡིན, ꦢꦶꦤꦲꦶꦏꦶꦲꦏꦸꦭꦸꦔꦩꦼꦚꦁꦱꦺꦴꦭꦺꦴᬤᬶᬦᬦᬶᬓᬶᬢᬶᬳᬂᬮᬸᬂᬲᬸᬃᬓᬧᬾᬓᬾᬦ᭄हम, आज, दिल्ली, की, यात्रा, करेंगेআমরা, আজ, ঢাকা, ভ্রমণ, করতে, যাচ্ছিنحن, نسافر, اليوم, إلى, القاهرةہم, آج, لاہور, جا, رہے, ہیںмы, сегодня, летим, в, москвуnós, viajamos, hoje, para, lisboahari, ini, kami, pergi, ke, jakartawir, besuchen, heute, den, hauptbahnhof, 우리, 는, 오늘, 서울, 로, 여행, 을, 갑니다, אנחנו, נוסעים, היום, אל, ירושליםእኛ, ዛሬ, ወደ, አዲስ, አበባ, እንሄዳለንhôm, nay, chúng, tôi, đi, hà, nộiεμείσ, σήμερα, ταξιδεύουμε, στην, αθήναbiz, bugün, güzel, ankara, ya, hızlıca, gidiyoruz(end_boundary)"
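In the Spanner output above, a number of tokens span a script boundary where the input had no space between snippets, such as करेंगेআমরা, москвуnós and ירושליםእኛ. A small Python sketch like the following can flag such tokens automatically. The function names are hypothetical, the script label is again just the first word of each letter's Unicode character name, and Han and kana are grouped together on the assumption that a single Japanese token legitimately mixes them.

```python
import unicodedata

# Japanese tokens routinely mix Han ideographs and kana, so treat them as one group.
_EAST_ASIAN = {"CJK", "HIRAGANA", "KATAKANA", "KATAKANA-HIRAGANA"}


def scripts_in(token):
    """Set of rough script labels for the letters in a token (marks are ignored)."""
    scripts = set()
    for char in token:
        if not unicodedata.category(char).startswith("L"):
            continue
        name = unicodedata.name(char, "")
        label = name.split(" ")[0] if name else "UNKNOWN"
        scripts.add("CJK/KANA" if label in _EAST_ASIAN else label)
    return scripts


def mixed_script_tokens(tokens):
    """Tokens whose letters come from more than one script group."""
    return [token for token in tokens if len(scripts_in(token)) > 1]


# A handful of tokens taken from the Spanner output above.
sample = ["करेंगेআমরা", "москвуnós", "食べ", "ירושליםእኛ", "오늘"]
for token in mixed_script_tokens(sample):
    print(token, sorted(scripts_in(token)))
```

A flagged token is not necessarily a tokenizer error, since the source text itself may mix scripts inside a word, but it is a cheap signal for where to look.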
Another:
The quick brown fox jumps我每天早上喝咖啡私は熱いお茶を飲むแมวสีน้ำตาลกระโดดข้ามรั้วແມວສີນ້ຳຕານເຕັ້ນຂ້າມຮົ້ວឆ្មាពណ៌ត្នោតលោតរំលងរបងကျွန်တော်ပန်းသီးစားချင်တယ်ང་ཚོས་ཇ་དྲོན་མོ་འཐུང་གི་ཡོད།ꦏꦸꦭꦱꦼꦤꦼꦁꦩꦔꦤ꧀ꦱꦼꦒꦒꦺꦴꦫꦺꦁᬢᬶᬢᬶᬬᬂᬮᬓᬃᬮᬸᬯᬲ᭄ᬓᬧᬲᬶᬄᮊᮥᮛᮤᮀᮛᮨᮞᮨᮕ᮪ᮓᮠᮁᮞᮍᮥᮌᮧᮛᮨᮀCon cáo nâu nhảy qua hàng rào الثعلب البني السريع يقفز عاليا השועל החום המהיר קופץ גבוה तेज़ भूरी लोमड़ी कूदती है 빠른 갈색 여우가 높이 뛴다 Быстрая бурая лиса прыгает высоко Der schnelle braune Fuchs springt Hızlı kahverengi tilki yükseğe zıplar Mbweha wa kahawia haraka anaruka வேகமான பழுப்பு நரி உயரமாக குதிக்கிறது ᐃᓄᒃ ᑕᑯᔪᖅ ᐊᖏᔪᖅ ᕿᒻᒥᖅ ᐅᓪᓗᒥ Η γρήγορη καφέ αλεπού πηδάει La rapida bruna vulpo saltas ፈጣኑ ቡናማ ቀበሮ ወደላይ ይዘላል
Gemini:
The, quick, brown, fox, jumps, 我, 每天, 早上, 喝, 咖啡, 私, は, 熱い, お茶, を, 飲む, แมว, สีน้ำตาล, กระโดด, ข้าม, รั้ว, ແມວ, ສີນ້ຳຕານ, ເຕັ້ນ, ຂ້າມ, ຮົ້ວ, ឆ្មា, ពណ៌ត្នោត, លោត, រំលង, របង, ကျွန်တော်, ပန်းသီး, စား, ချင်, တယ်, ང་ཚོས, ཇ, དྲོན་མོ, འཐུང་གི, ཡོད།, ꦏꦸꦭ, ꦱꦼꦤꦼꦁ, ꦩꦔꦤ꧀, ꦱꦼꦒ, ꦒꦺꦴꦫꦺꦁ, ᬢᬶᬢᬶᬬᬂ, ᬮᬓᬃ, ᬮᬸᬯᬲ᭄, ᬓ, ᬧᬲᬶᬄ, ᮊᮥᮛᮤᮀ, ᮛᮨᮞᮨᮕ᮪, ᮓᮠᮁ, ᮞᮍᮥ, ᮌᮧᮛᮨᮀ, Con, cáo, nâu, nhảy, qua, hàng rào, الثعلب, البني, السريع, يقفز, عاليا, השועל, החום, המהיר, קופץ, גבוה, तेज़, भूरी, लोमड़ी, कूदती, है, 빠른, 갈색, 여우, 가, 높이, 뛴다, Быстрая, бурая, лиса, прыгает, высоко, Der, schnelle, braune, Fuchs, springt, Hızlı, kahverengi, tilki, yükseğe, zıplar, Mbweha, wa, kahawia, haraka, anaruka, வேகமான, பழுப்பு, நரி, உயரமாக, குதிக்கிறது, ᐃᓄᒃ, ᑕᑯᔪᖅ, ᐊᖏᔪᖅ, ᕿᒻᒥᖅ, ᐅᓪᓗᒥ, Η, γρήγορη, καφέ, αλεπού, πηδάει, La, rapida, bruna, vulpo, saltas, ፈጣኑ, ቡናማ, ቀበሮ, ወደላይ, ይዘላል
Spanner:
"the(boundary), quick, brown, fox, jumps, 我, 每天, 早上, 喝, 咖啡, 私, は, 熱い, お茶, を, 飲む, แมว, สี, น้ำตาล, กระโดด, ข้าม, รั้ว, ແມວສີນ້ໍາຕານເຕັ້ນຂ້າມຮົ້ວ, ឆ្មា, ពណ៌, ត្នោត, លោត, រំលង, របង, ကျွန်တော်ပန်းသီးစားချင်တယ်, ང, ཚོས, ཇ, དྲོན, མོ, འཐུང, གི, ཡོད, ꦏꦸꦭꦱꦼꦤꦼꦁꦩꦔꦤ꧀ꦱꦼꦒꦒꦺꦴꦫꦺꦁᬢᬶᬢᬶᬬᬂᬮᬓᬃᬮᬸᬯᬲ᭄ᬓᬧᬲᬶᬄᮊᮥᮛᮤᮀᮛᮨᮞᮨᮕ᮪ᮓᮠᮁᮞᮍᮥᮌᮧᮛᮨᮀcon, cáo, nâu, nhảy, qua, hàng, rào, الثعلب, البني, السريع, يقفز, عاليا, השועל, החום, המהיר, קופץ, גבוה, तेज़, भूरी, लोमड़ी, कूदती, है, 빠른, 갈색, 여우, 가, 높이, 뛴다, быстрая, бурая, лиса, прыгает, высоко, der, schnelle, braune, fuchs, springt, hızlı, kahverengi, tilki, yükseğe, zıplar, mbweha, wa, kahawia, haraka, anaruka, வேகமான, பழுப்பு, நரி, உயரமாக, குதிக்கிறது, ᐃᓄᒃ, ᑕᑯᔪᖅ, ᐊᖏᔪᖅ, ᕿᒻᒥᖅ, ᐅᓪᓗᒥ, η, γρήγορη, καφέ, αλεπού, πηδάει, la, rapida, bruna, vulpo, saltas, ፈጣኑ, ቡናማ, ቀበሮ, ወደላይ, ይዘላል(end_boundary)"
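To put a number on how closely two segmentations of the same text agree, one common approach is to compare the character offsets at which each places a word boundary and compute precision, recall and F1 over those offsets. The sketch below is a hypothetical helper, not something used to produce the results in this post, and it assumes both token lists concatenate to exactly the same string, which may require lowercasing and removing spaces inside tokens first.

```python
def boundary_offsets(tokens):
    """Character offsets at which one token ends and the next begins."""
    offsets, position = set(), 0
    for token in tokens[:-1]:
        position += len(token)
        offsets.add(position)
    return offsets


def boundary_f1(reference, candidate):
    """Return (precision, recall, f1) of candidate boundaries against reference."""
    if "".join(reference) != "".join(candidate):
        raise ValueError("segmentations do not cover the same text")
    ref = boundary_offsets(reference)
    cand = boundary_offsets(candidate)
    if not ref and not cand:
        return 1.0, 1.0, 1.0
    shared = len(ref & cand)
    precision = shared / len(cand) if cand else 0.0
    recall = shared / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if shared else 0.0
    return precision, recall, f1


# The unspaced Korean run from the first passage, as split by each system.
gemini = ["올바르게", "구분해야합니다"]
spanner = ["올바르게", "구분", "해야", "합니다"]
precision, recall, f1 = boundary_f1(gemini, spanner)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

A low precision with perfect recall, as in this example, means the candidate found every reference boundary and added extra ones, which matches the pattern of Spanner splitting more finely than Gemini rather than missing words.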
Another:
The red cat 我喜欢猫 मेरा नाम राज Casa muy grande Chat noir ici كتاب جميل جدا আমি ভাত খাই Я люблю тебя Gato branco dorme میرا نام علی Saya makan nasi Ich bin hier 猫が好き तो मुलगा आहे నేను బాగున్నాను ఇప్పుడు Ben çok yorgunum நான் நன்றாக இருக்கிறேன் 我食咗飯 Tôi thích thể thao Mahal kita po 나는 밥을 먹는다 من رفتم خانه Ina son ku انا بحب مصر Mimi ninapenda wewe ꦲꦏꦸꦩꦔꦤ꧀ꦱꦼꦒ Mangio una mela ਮੈਂ ਠੀਕ ਹਾਂ હું જાઉં છું ฉันกินข้าว እኔ እወዳለሁ በጣም ನಾನು ಹೋಗುತ್ತೇನೆ ಈಗ हमार नाम बा 我吃過飯 𪜶是學生 𠊎食飽了 ကျွန်တော်ထမင်းစားတယ် Mo fẹran rẹ Men seni sevaman हम आबि रहल مان ويندو آهيان Ani nan deema ഞാൻ പോകുന്നു ഇപ്പോൾ ᮃᮘ᮪ᮓᮤᮔᮥᮏᮥᮒᮥᮃᮀ A chọrọ m Eu merg acolo ମୁଁ ଖାଉଛି ଏବେ Ik ben blij Ez te hezdikim ຂ້ອຍກິນເຂົ້າ ខ្ញុំញ៉ាំបាយ ང་བོད་པ་ཡིན ང་འབྲུག་པ་ཨིན ᬢᬶᬬᬂᬫᬤᬵᬃᬦᬲᬶ මම කෑම කනවා Waan ku jeclahay Gihigugma ko ikaw म घर जान्छु Gou aeu gwn Tiako ianao izao মই ভাত খাওঁ Ndinokuda chaizvo nhasi Mido yidi ma Mám tě rád Είμαι πολύ καλά Ngiyakuthanda kakhulu wena Ñuqa munani qamta Мен сені сүйемін مەن سېنى سۆيىمەن Én is itt Ndagukunda cyane pe Me pɛ sɛ Ayayaten ka unay Ndiyakuthanda kakhulu mna ᨡ᩶ᩣᨠᩥ᩠ᨶᨡᩮᩢ᩶ᩣ Mwen renmen ou Мин сине яратам အဲစသုၚ်ပုၚ် Palangga ko ikaw Men seni söýýärin ယအီၣ်မ့ၣ် ᱤᱧ ᱫᱚ ᱪᱟᱞᱟᱜ Ndingwenda gwika uguo Ek is lief Би чамд хайртай Ke a tsamaya Che rohayhu iterei Naya munasmawa jichha Kuv hlub koj Ja lubię to Jag gillar dig Ja ťa ľúbim Jeg elsker dig Minä rakastan sinua Jeg elsker deg Ja te volim Ја те волим Molt de temps Aš myliu tave მე შენ მიყვარხარ Я тебе люблю Jaz te ljubim Mən səni sevirəm
Gemini:
The, red, cat, 我, 喜欢, 猫, मेरा, नाम, राज, Casa, muy, grande, Chat, noir, ici, كتاب, جميل, جدا, আমি, ভাত, খাই, Я, люблю, тебя, Gato, branco, dorme, میرا, نام, علی, Saya, makan, nasi, Ich, bin, hier, 猫, が, 好き, तो, मुलगा, आहे, నేను, బాగున్నాను, ఇప్పుడు, Ben, çok, yorgunum, நான், நன்றாக, இருக்கிறேன், 我, 食, 咗, 飯, Tôi, thích, thể thao, Mahal, kita, po, 나는, 밥을, 먹는다, من, رفتم, خانه, Ina, son, ku, انا, بحب, مصر, Mimi, ninapenda, wewe, ꦲꦏꦸ, ꦩꦔꦤ꧀, ꦱꦼꦒ, Mangio, una, mela, ਮੈਂ, ਠੀਕ, ਹਾਂ, હું, જાઉં, છું, ฉัน, กิน, ข้าว, እኔ, እወዳለሁ, በጣም, ನಾನು, ಹೋಗುತ್ತೇನೆ, ಈಗ, हमार, नाम, बा, 我, 吃, 過, 飯, 𪜶, 是, 學生, 𠊎, 食, 飽, 了, ကျွန်တော်, ထမင်း, စား, တယ်, Mo, fẹran, rẹ, Men, seni, sevaman, हम, आबि, रहल, مان, ويندو, آهيان, Ani, nan, deema, ഞാൻ, പോകുന്നു, ഇപ്പോൾ, ᮃᮘ᮪ᮓᮤ, ᮔᮥᮏᮥ, ᮒᮥᮃᮀ, A, chọrọ, m, Eu, merg, acolo, ମୁଁ, ଖାଉଛି, ଏବେ, Ik, ben, blij, Ez, te, hezdikim, ຂ້ອຍ, ກິນ, ເຂົ້າ, ខ្ញុំ, ញ៉ាំ, បាយ, ང་, བོད་པ་, ཡིན, ང་, འབྲུག་པ་, ཨིན, ᬢᬶᬬᬂ, ᬫᬤᬵᬃ, ᬦᬲᬶ, මම, කෑම, කනවා, Waan, ku, jeclahay, Gihigugma, ko, ikaw, म, घर, जान्छु, Gou, aeu, gwn, Tiako, ianao, izao, মই, ভাত, খাওঁ, Ndinokuda, chaizvo, nhasi, Mido, yidi, ma, Mám, tě, rád, Είμαι, πολύ, καλά, Ngiyakuthanda, kakhulu, wena, Ñuqa, munani, qamta, Мен, сені, сүйемін, مەن, سېنى, سۆيىمەن, Én, is, itt, Ndagukunda, cyane, pe, Me, pɛ, sɛ, Ayayaten, ka, unay, Ndiyakuthanda, kakhulu, mna, ᨡ᩶ᩣ, ᨠᩥ᩠ᨶ, ᨡᩮᩢ᩶ᩣ, Mwen, renmen, ou, Мин, сине, яратам, အဲ, စသုၚ်, ပုၚ်, Palangga, ko, ikaw, Men, seni, söýýärin, ယ, အီၣ်, မ့ၣ်, ᱤᱧ, ᱫᱚ, ᱪᱟᱞᱟᱜ, Ndingwenda, gwika, uguo, Ek, is, lief, Би, чамд, хайртай, Ke, a, tsamaya, Che, rohayhu, iterei, Naya, munasmawa, jichha, Kuv, hlub, koj, Ja, lubię, to, Jag, gillar, dig, Ja, ťa, ľúbim, Jeg, elsker, dig, Minä, rakastan, sinua, Jeg, elsker, deg, Ja, te, volim, Ја, те, волим, Molt, de, temps, Aš, myliu, tave, მე, შენ, მიყვარხარ, Я, тебе, люблю, Jaz, te, ljubim, Mən, səni, sevirəm
Spanner:
"the(boundary), red, cat, 我, 喜欢, 猫, मेरा, नाम, राज, casa, muy, grande, chat, noir, ici, كتاب, جميل, جدا, আমি, ভাত, খাই, я, люблю, тебя, gato, branco, dorme, میرا, نام, علی, saya, makan, nasi, ich, bin, hier, 猫, が, 好き, तो, मुलगा, आहे, నేను, బాగున్నాను, ఇప్పుడు, ben, çok, yorgunum, நான், நன்றாக, இருக்கிறேன், 我, 食, 咗, 飯, tôi, thích, thể, thao, mahal, kita, po, 나는, 밥, 을, 먹는다, من, رفتم, خانه, ina, son, ku, انا, بحب, مصر, mimi, ninapenda, wewe, ꦲꦏꦸꦩꦔꦤ꧀ꦱꦼꦒ, mangio, una, mela, ਮੈਂ, ਠੀਕ, ਹਾਂ, હું, જાઉં, છું, ฉัน, กิน, ข้าว, እኔ, እወዳለሁ, በጣም, ನಾನು, ಹೋಗುತ್ತೇನೆ, ಈಗ, हमार, नाम, बा, 我, 吃, 過, 飯, 𪜶, 是, 學生, 𠊎, 食, 飽, 了, ကျွန်တော်ထမင်းစားတယ်, mo, fẹran, rẹ, men, seni, sevaman, हम, आबि, रहल, مان, ويندو, آهيان, ani, nan, deema, ഞാന്, പോകുന്നു, ഇപ്പോള്, ᮃᮘ᮪ᮓᮤᮔᮥᮏᮥᮒᮥᮃᮀ, a, chọrọ, m, eu, merg, acolo, ମୁଁ, ଖାଉଛି, ଏବେ, ik, ben, blij, ez, te, hezdikim, ຂ້ອຍກິນເຂົ້າ, ខ្ញុំ, ញ៉ាំ, បាយ, ང, བོད, པ, ཡིན, ང, འབྲུག, པ, ཨིན, ᬢᬶᬬᬂᬫᬤᬵᬃᬦᬲᬶ, මම, කෑම, කනවා, waan, ku, jeclahay, gihigugma, ko, ikaw, म, घर, जान्छु, gou, aeu, gwn, tiako, ianao, izao, মই, ভাত, খাওঁ, ndinokuda, chaizvo, nhasi, mido, yidi, ma, mám, tě, rád, είμαι, πολύ, καλά, ngiyakuthanda, kakhulu, wena, ñuqa, munani, qamta, мен, сені, сүйемін, مەن, سېنى, سۆيىمەن, én, is, itt, ndagukunda, cyane, pe, me, pɛ, sɛ, ayayaten, ka, unay, ndiyakuthanda, kakhulu, mna, ᨡ᩶ᩣᨠᩥ᩠ᨶᨡᩮᩢ᩶ᩣ, mwen, renmen, ou, мин, сине, яратам, အဲစသုၚ်ပုၚ်, palangga, ko, ikaw, men, seni, söýýärin, ယအီၣ်မ့ၣ်, ᱤᱧ, ᱫᱚ, ᱪᱟᱞᱟᱜ, ndingwenda, gwika, uguo, ek, is, lief, би, чамд, хайртай, ke, a, tsamaya, che, rohayhu, iterei, naya, munasmawa, jichha, kuv, hlub, koj, ja, lubię, to, jag, gillar, dig, ja, ťa, ľúbim, jeg, elsker, dig, minä, rakastan, sinua, jeg, elsker, deg, ja, te, volim, ја, те, волим, molt, de, temps, aš, myliu, tave, მე, შენ, მიყვარხარ, я, тебе, люблю, jaz, te, ljubim, mən, səni, sevirəm(end_boundary)"
And finally:
the quick brown fox 我在北京工作 यह एक किताब है el sol brilla le chat noir هذا ولد ذكي আমি ভাত খাই o dia está bom красивый синий цветок یہ میرا گھر ہے saya suka membaca der große hund 美しいリンゴを食べる ਮੈਂ ਸੇਬ ਖਾਂਦਾ ਹਾਂ मी आंबा खातो నేను అన్నం తింటున్నాను bu çok güzel நான் பழம் சாப்பிடுகிறேன் 我想去旅行 tôi thích đọc sách gusto ko ito 나는 밥을 먹는다 من کتاب میخوانم ina son abinci mimi ninapenda kusoma ꦏꦸꦭꦱꦼꦤꦼꦁꦱꦼꦒ il cielo è azzurro ನಾನು ಪುಸ್ತಕ ಓದುತ್ತೇನೆ હું કેરી ખાઉં છું ฉันชอบกินข้าว እኔ ምግብ እወዳለሁ men kitob oʻqiyman lubię czytać książki mo fẹ́ rà á я люблю читати atụla egwu ọ bụla mən kitab oxuyuram fata citește o carte हम आम खात बानी ကျွန်တော်စာဖတ်နေတယ် 食飽吂 το βιβλίο είναι καλό ez egy szép nap en stor röd bil de kat slaapt hier mám rád teplé léto زه کتاب لولم مان سنڌي ڳالهائيندڙ آهيان ez kurdî dizanim ngithanda ukufunda izincwadi dit is baie goed הילד אוהב לקרוא offset ខ្ញុំស្រឡាញ់ភាសាខ្មែរ ຂ້ອຍຮັກປະເທດລາວ ང་བོད་སྐད་ཤེས་ aan wada hadalno म नेपाली सिक्दैछु මම සිංහල කතා කරමි minä puhun suomea pekný teplý deň en stor hund jeg liker å lese saya suka belajar ես սիրում եմ քեզ მე მიყვარს საქართველო би ном уншдаг ман китоб мехонам мен кітап оқимын men kitap okaýaryn мен китеп окуйм мин китап укыйм مەن كىتاب ئوقۇيمەن аз чета книга ја читам књигу ja čitam knjigu ja čitam knjigu јас читам книга aš skaitau knygą es lasu grāmatu ma loen raamatut jo llegeixo un llibre eu leo un libro nik liburua irakurtzen dut y ddraig goch fawr is maith liom leabhair unë lexoj një libër jaz berem knjigo jien naqra ktieb ég les bók he pai te ra aloha au ia oe ñuqa ñawinchani liwruta che alee peteĩ aranduka ndinoda kuverenga mabhuku ndimakonda kuwerenga mabuku tia mamaky boky aho bëgg naa jàng téere ndithanda ukufunda iincwadi ᬢᬶᬬᬂᬤᭂᬫᭂᬦ᭄ᬫᬘ अहंपुस्तकंपठामि
Gemini:
the, quick, brown, fox, 我, 在, 北京, 工作, यह, एक, किताब, है, el, sol, brilla, le, chat, noir, هذا, ولد, ذكي, আমি, ভাত, খাই, o, dia, está, bom, красивый, синий, цветок, یہ, میرا, گھر, ہے, saya, suka, membaca, der, große, hund, 美しい, リンゴ, を, 食べる, ਮੈਂ, ਸੇਬ, ਖਾਂਦਾ, ਹਾਂ, मी, आंबा, खातो, నేను, అన్నం, తింటున్నాను, bu, çok, güzel, நான், பழம், சாப்பிடுகிறேன், 我, 想, 去, 旅行, tôi, thích, đọc, sách, gusto, ko, ito, 나는, 밥을, 먹는다, من, کتاب, میخوانم, ina, son, abinci, mimi, ninapenda, kusoma, ꦏꦸꦭ, ꦱꦼꦤꦼꦁ, ꦱꦼꦒ, il, cielo, è, azzurro, ನಾನು, ಪುಸ್ತಕ, ಓದುತ್ತೇನೆ, હું, કેરી, ખાઉં, છું, ฉัน, ชอบ, กิน, ข้าว, እኔ, ምግብ, እወዳለሁ, men, kitob, oʻqiyman, lubię, czytać, książki, mo, fẹ́, rà, á, я, люблю, читати, atụla, egwu, ọ, bụla, mən, kitab, oxuyuram, fata, citește, o, carte, हम, आम, खात, बानी, ကျွန်တော်, စာ, ဖတ်, နေတယ်, 食, 飽, 吂, το, βιβλίο, είναι, καλό, ez, egy, szép, nap, en, stor, röd, bil, de, kat, slaapt, hier, mám, rád, teplé, léto, زه, کتاب, لولم, مان, سنڌي, ڳالهائيندڙ, آهيان, ez, kurdî, dizanim, ngithanda, ukufunda, izincwadi, dit, is, baie, goed, הילด, אוהב, לקרוא, ខ្ញុំ, ស្រឡាញ់, ភាសា, ខ្មែរ, ຂ້ອຍ, ຮັກ, ປະເທດ, ລາວ, ང་, བོད་སྐད་, ཤེས་, aan, wada, hadalno, म, नेपाली, सिक्दैछु, මම, සිංහල, කතා, කරමි, minä, puhun, suomea, pekný, teplý, deň, en, stor, hund, jeg, liker, å, lese, saya, suka, belajar, ես, սիրում, եմ, քեզ, მე, მიყვარს, საქართველო, би, ном, уншдаг, ман, китоб, мехонам, мен, кітап, оқимын, men, kitap, okaýaryn, мен, китеп, окуйм, мин, китап, укыйм, مەن, كىتاب, ئوقۇيمەن, аз, чета, книга, ја, читам, књигу, ja, čitam, knjigu, ja, čitam, knjigu, јас, читам, книга, aš, skaitau, knygą, es, lasu, grāmatu, ma, loen, raamatut, jo, llegeixo, un, llibre, eu, leo, un, libro, nik, liburua, irakurtzen, dut, y, ddraig, goch, fawr, is, maith, liom, leabhair, unë, lexoj, një, libër, jaz, berem, knjigo, jien, naqra, ktieb, ég, les, bók, he, pai, te, ra, aloha, au, ia, oe, ñuqa, ñawinchani, liwruta, che, alee, peteĩ, aranduka, ndinoda, kuverenga, mabhuku, ndimakonda, kuwerenga, mabuku, tia, mamaky, boky, 
aho, bëgg, naa, jàng, téere, ndithanda, ukufunda, iincwadi, ᬢᬶᬬᬂ, ᬤᭂᬫᭂᬦ᭄, ᬫᬘ, अहं, पुस्तकं, पठामि
Spanner:
"the(boundary), quick, brown, fox, 我, 在, 北京, 工作, यह, एक, किताब, है, el, sol, brilla, le, chat, noir, هذا, ولد, ذكي, আমি, ভাত, খাই, o, dia, está, bom, красивый, синий, цветок, یہ, میرا, گھر, ہے, saya, suka, membaca, der, große, hund, 美しい, リンゴ, を, 食べる, ਮੈਂ, ਸੇਬ, ਖਾਂਦਾ, ਹਾਂ, मी, आंबा, खातो, నేను, అన్నం, తింటున్నాను, bu, çok, güzel, நான், பழம், சாப்பிடுகிறேன், 我, 想, 去, 旅行, tôi, thích, đọc, sách, gusto, ko, ito, 나는, 밥, 을, 먹는다, من, کتاب, میخوانم, ina, son, abinci, mimi, ninapenda, kusoma, ꦏꦸꦭꦱꦼꦤꦼꦁꦱꦼꦒ, il, cielo, è, azzurro, ನಾನು, ಪುಸ್ತಕ, ಓದುತ್ತೇನೆ, હું, કેરી, ખાઉં, છું, ฉัน, ชอบ, กิน, ข้าว, እኔ, ምግብ, እወዳለሁ, men, kitob, oʻqiyman, lubię, czytać, książki, mo, fẹ́, rà, á, я, люблю, читати, atụla, egwu, ọ, bụla, mən, kitab, oxuyuram, fata, citește, o, carte, हम, आम, खात, बानी, ကျွန်တော်စာဖတ်နေတယ်, 食, 飽, 吂, το, βιβλίο, είναι, καλό, ez, egy, szép, nap, en, stor, röd, bil, de, kat, slaapt, hier, mám, rád, teplé, léto, زه, کتاب, لولم, مان, سنڌي, ڳالهائيندڙ, آهيان, ez, kurdî, dizanim, ngithanda, ukufunda, izincwadi, dit, is, baie, goed, הילד, אוהב, לקרוא, offset, ខ្ញុំ, ស្រឡាញ់, ភាសា, ខ្មែរ, ຂ້ອຍຮັກປະເທດລາວ, ང, བོད, སྐད, ཤེས, aan, wada, hadalno, म, नेपाली, सिक्दैछु, මම, සිංහල, කතා, කරමි, minä, puhun, suomea, pekný, teplý, deň, en, stor, hund, jeg, liker, å, lese, saya, suka, belajar, ես, սիրում, եմ, քեզ, მე, მიყვარს, საქართველო, би, ном, уншдаг, ман, китоб, мехонам, мен, кітап, оқимын, men, kitap, okaýaryn, мен, китеп, окуйм, мин, китап, укыйм, مەن, كىتاب, ئوقۇيمەن, аз, чета, книга, ја, читам, књигу, ja, čitam, knjigu, ja, čitam, knjigu, јас, читам, книга, aš, skaitau, knygą, es, lasu, grāmatu, ma, loen, raamatut, jo, llegeixo, un, llibre, eu, leo, un, libro, nik, liburua, irakurtzen, dut, y, ddraig, goch, fawr, is, maith, liom, leabhair, unë, lexoj, një, libër, jaz, berem, knjigo, jien, naqra, ktieb, ég, les, bók, he, pai, te, ra, aloha, au, ia, oe, ñuqa, ñawinchani, liwruta, che, alee, peteĩ, aranduka, ndinoda, kuverenga, mabhuku, ndimakonda, kuwerenga, mabuku, tia, mamaky, boky, 
aho, bëgg, naa, jàng, téere, ndithanda, ukufunda, iincwadi, ᬢᬶᬬᬂᬤᭂᬫᭂᬦ᭄ᬫᬘ, अहंपुस्तकंपठामि(end_boundary)"