Friday, June 3, 2022
This week, we introduced an algorithmic improvement that identifies documents where the title
element is written in a different language or script from its content, and chooses a title that
matches the language and script of the document. This is based on the general principle
that a document’s title should be written in the language or script of its primary content.
It’s one of the reasons why we might go beyond title elements for web result titles.
Multilingual titles
Multilingual titles repeat the same phrase in two different languages or scripts.
The most popular pattern is to append an English version to the original title text.
गीतांजलि की जीवनी – Geetanjali Biography in Hindi
In this example, the title consists of two parts (divided by a dash), and they express the
same content in different languages (Hindi and English). While the title is in both languages,
the document itself is written only in Hindi. Our system detects such inconsistency and might use
only the Hindi headline text, like:
गीतांजलि की जीवनी
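As a rough illustration only (this is not the actual system), the idea of keeping the part of a title that matches the page's script can be sketched in Python. The function names `main_script` and `trim_title`, the separator set, and the use of Unicode character names as a stand-in for script detection are all assumptions made for this example.

```python
import re
import unicodedata
from collections import Counter

# Assumed separators: a hyphen, en dash, em dash, or pipe surrounded by spaces.
SEPARATORS = re.compile(r"\s+[-\u2013\u2014|]\s+")


def main_script(text):
    """Return the most common script among letters and marks, or None.

    The script is approximated by the first word of the Unicode character
    name (for example "DEVANAGARI" or "LATIN"), which is a heuristic.
    """
    counts = Counter()
    for ch in text:
        if unicodedata.category(ch)[0] in ("L", "M"):
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else None


def trim_title(title, body):
    """Keep only the title segments whose script matches the body text."""
    body_script = main_script(body)
    parts = [p.strip() for p in SEPARATORS.split(title) if p.strip()]
    kept = [p for p in parts if main_script(p) == body_script]
    # Fall back to the original title when no segment, or every segment, matches.
    if kept and len(kept) < len(parts):
        return " \u2013 ".join(kept)
    return title


title = "गीतांजलि की जीवनी \u2013 Geetanjali Biography in Hindi"
body = "यह पृष्ठ हिंदी में लिखा गया है"
print(trim_title(title, body))
```

This sketch compares scripts only, so it would not separate two languages that share a script (for example English and French); that would need a language identification step.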
Latin scripted titles
Transliteration is when content in one language is written using the script or alphabet of
a different language. For example, consider a page title for a song written in Hindi
but transliterated to use Latin characters rather than Hindi’s native Devanagari script:
jis desh me holi kheli jati hai
In such a case, our system tries to find an alternative title using the script that’s predominant on the page, which in this case could be:
जिस देश में होली खेली जाती है
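A site owner could run a similar check on their own pages. The following is a minimal sketch, not the actual system: it assumes that the first word of each character's Unicode name is a good enough proxy for its script, and the helper names `dominant_script` and `title_matches_body` are hypothetical.

```python
import unicodedata
from collections import Counter


def dominant_script(text):
    """Return the most frequent script among letters and marks, or None.

    Heuristic: the script is taken from the first word of the Unicode
    character name, such as "LATIN" or "DEVANAGARI".
    """
    counts = Counter()
    for ch in text:
        # Only letters (L*) and combining marks (M*) carry script information.
        if unicodedata.category(ch)[0] in ("L", "M"):
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else None


def title_matches_body(title, body):
    """Return True when the title and body share a dominant script.

    Text with no letters at all is treated as matching, since there is
    nothing to compare.
    """
    title_script = dominant_script(title)
    body_script = dominant_script(body)
    return title_script is None or body_script is None or title_script == body_script


title = "jis desh me holi kheli jati hai"
body = "जिस देश में होली खेली जाती है"
print(dominant_script(title), dominant_script(body), title_matches_body(title, body))
```

A `False` result here only flags a script mismatch; finding a suitable alternative title in the page's own script is a separate problem that this sketch does not attempt.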
Summary
In general, our systems tend to use the title element of the page. In cases with multilingual
or transliterated titles, our systems may seek alternatives that match the predominant language or script of the
page. This is why it’s a good practice to provide a title that matches the language and/or the
script of the page’s main content.
We welcome further feedback in our forum, including existing threads on this topic in English and Japanese.