[lg policy] Searching for Information in the Tower of Babel

Wed Mar 14 15:23:30 UTC 2018

 Searching for Information in the Tower of Babel

By Martin White <https://www.cmswire.com/author/martin-white/> | *Mar 13,
2018*

[image: Pieter Bruegel the Elder, The Tower of Babel. Caption: Enterprise
search already poses enough challenges. Add in multilingual search, and the
challenges only grow. Photo: public domain]

Only in the last few years have we started to recognize the impacts and
implications of employees in multinational organizations working in more
than one language
<https://harzing.com/research/language-in-international-business>.

The concept of a definitive single corporate language is fast disappearing
as these organizations adopt collaboration as the default mode of work and
introduce multiple social media channels. And while employees may indicate
language skills on their profiles, the level of competence (a term whose
definition is open to debate) at speaking, reading, writing and
understanding a foreign language may vary considerably.

So it stands to reason that employees in multilingual organizations are
enlisting search applications to locate information in a wide range of
languages, but I suspect these organizations lack a strategy that commits
the resources and actions needed to make this possible.
* Accidental Corporate Language Policies *

All too often intranet teams define a corporate language policy almost by
accident as requests come in from around the world to be able to publish in
local languages. Language policies don't only cover which languages will be
supported but also which pairs of languages will be supported.

Machine translation is adequate in some contexts, but insufficient in the
case of contracts and other official documents as well as for press
releases from publicly traded companies. A slight machine mistranslation of
a corporate press release could have unexpected and undesirable effects on
the share price.
* The Difference Between Multilingual and Cross-Lingual Search *

Multilingual and cross-lingual search are two very different processes that
are often confused. Multilingual search is where a query in English will
search content in English, a query in French will search for content in
French, and so on. Ideally the index for each language will be of a similar
quality, having been created by the appropriate stemming and lemmatization
tools and with equally appropriate stop words. Integrating content in
English, French and German into the same index is a very poor approach,
because stemming rules and stop words differ from language to language.
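
To make the case for one index per language concrete, here is a minimal
Python sketch. The stop-word lists, sample documents and function names are
all invented for illustration, and the "index" is just an in-memory
dictionary rather than anything a real search product would use. It shows
why the word "die" must be treated differently in English and German.

```python
from collections import defaultdict

# Tiny illustrative stop-word lists (assumption: real lists are far longer).
STOP_WORDS = {
    "en": {"the", "a", "of", "in", "and", "is"},
    "fr": {"le", "la", "les", "de", "du", "et", "est"},
    "de": {"der", "die", "das", "und", "ist", "von"},
}


def tokenize(text, lang):
    """Lowercase, split on whitespace, and drop that language's stop words."""
    words = [w.strip(".,;:!?") for w in text.lower().split()]
    return [w for w in words if w and w not in STOP_WORDS[lang]]


def build_indexes(docs):
    """docs is a list of (doc_id, lang, text); returns one index per language."""
    indexes = {lang: defaultdict(set) for lang in STOP_WORDS}
    for doc_id, lang, text in docs:
        for term in tokenize(text, lang):
            indexes[lang][term].add(doc_id)
    return indexes


def search(indexes, query, lang):
    """Return ids of documents in `lang` that contain every query term."""
    terms = tokenize(query, lang)
    if not terms:
        return set()
    postings = [indexes[lang].get(t, set()) for t in terms]
    return set.intersection(*postings)


# Hypothetical sample documents.
DOCS = [
    ("d1", "en", "The contract is in the archive"),
    ("d2", "fr", "Le contrat est dans les archives"),
    ("d3", "de", "Die Hand ist kalt"),  # "die" is a German article here
    ("d4", "en", "The machine will die without power"),
]
INDEXES = build_indexes(DOCS)
```

In this sketch a search for "die" in the English index finds d4, while the
same query against the German index is discarded as a stop word. A single
shared index would have to pick one behaviour for both languages.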

Cross-lingual search is where a search term in English is used and the
search application uses taxonomies, thesauri and maybe machine translation
to match the meaning of the term in multiple languages. This is seriously
challenging, not just from the query/index management standpoint but also
in deciding how to present the results, for which there are many options.
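
As a rough illustration of the cross-lingual idea, the Python sketch below
expands an English term through a small hand-built thesaurus and groups the
hits by language, which is one of the many possible ways to present
results. The thesaurus entries, documents and function names are
hypothetical; a production system would more likely draw on a managed
taxonomy or machine translation.

```python
# Hypothetical thesaurus: English concept -> equivalent terms per language.
THESAURUS = {
    "contract": {
        "en": ["contract", "agreement"],
        "fr": ["contrat"],
        "de": ["vertrag"],
    },
    "invoice": {"en": ["invoice"], "fr": ["facture"], "de": ["rechnung"]},
}

# Hypothetical documents: id -> (language, lowercased text).
DOCS = {
    "d1": ("en", "signed supplier agreement for 2018"),
    "d2": ("fr", "contrat de location signé"),
    "d3": ("de", "vertrag mit dem lieferanten"),
    "d4": ("de", "rechnung für märz"),
}


def expand(term):
    """Look up equivalents; unknown terms are searched in English only."""
    return THESAURUS.get(term.lower(), {"en": [term.lower()]})


def cross_lingual_search(term):
    """Return {language: [doc_ids]} so results can be grouped by language."""
    expansions = expand(term)
    grouped = {}
    for doc_id, (lang, text) in sorted(DOCS.items()):
        words = text.split()
        if any(t in words for t in expansions.get(lang, [])):
            grouped.setdefault(lang, []).append(doc_id)
    return grouped
```

Here cross_lingual_search("contract") returns one document each for
English, French and German, even though only the English document could
have been matched by an English synonym.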
* The Special Challenges of Multilingual Documents *

When in doubt, assume documents contain more than one language. A document
in German may quote the text of a local contract in French and have an
executive summary in English. Further adding complexity is the practice of
adding metadata in English to a German document because English is the
‘corporate’ language. These situations pose a challenge not only to the
language identification algorithms in the search application but also to
the ranking of the document.

Mixing languages has implications for the weighting of terms in retrieval.
A short snippet of French appearing in predominantly German text would give
more weight to the French words, since they are relatively uncommon. But
mix those in with documents written predominantly in French, and the short
French snippets now have much less weight.
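
The weighting effect can be shown with inverse document frequency (IDF),
one common way of scoring how rare a term is. The Python sketch below uses
a smoothed IDF formula and invented sample documents; real search engines
use their own variants, so treat the numbers as illustrative only.

```python
import math


def idf(term, docs):
    """Smoothed inverse document frequency: log((N + 1) / (df + 1)) + 1."""
    df = sum(1 for d in docs if term in d.lower().split())
    return math.log((len(docs) + 1) / (df + 1)) + 1


# Hypothetical corpus: mostly German, one document quoting French.
german_docs = [
    "der vertrag ist gültig",
    "der bericht zitiert le contrat im anhang",
    "die rechnung ist bezahlt",
    "der lieferant ist neu",
]
# Hypothetical corpus written predominantly in French.
french_docs = [
    "le contrat est signé",
    "le contrat de location",
    "le rapport sur le contrat",
    "le contrat du fournisseur",
]

idf_german_only = idf("contrat", german_docs)
idf_mixed = idf("contrat", german_docs + french_docs)
```

With these sample documents the weight of "contrat" is roughly 1.92 in the
German-only collection and roughly 1.41 once the French documents share the
same index, so the same French snippet counts for less.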

Metadata presents a special problem. A document in Chinese may end up at
the top of a list of relevant documents if it has been helpfully tagged in
English and given an English title, and if the metadata is given
substantial weight in the ranking algorithm.
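
A small Python sketch of field-weighted scoring shows how this can happen.
The field names, weights and documents are assumptions made up for the
example, and the scoring is a plain weighted count of matching words, not
any particular vendor's ranking algorithm.

```python
def score(doc, query_terms, weights):
    """Weighted count of query-term matches across the document's fields."""
    total = 0.0
    for field, weight in weights.items():
        words = doc.get(field, "").lower().split()
        total += weight * sum(words.count(t) for t in query_terms)
    return total


def rank(docs, query_terms, weights):
    """Return document ids ordered from highest to lowest score."""
    return sorted(
        docs, key=lambda d: score(docs[d], query_terms, weights), reverse=True
    )


# Hypothetical documents: a Chinese report with English metadata and an
# English document whose body actually discusses the topic.
docs = {
    "zh_report": {
        "title": "Annual Safety Report",
        "tags": "safety report annual",
        "body": "年度安全报告",
    },
    "en_notes": {
        "title": "Site inspection notes",
        "tags": "",
        "body": "the safety report covers the annual safety audit "
                "and the report findings",
    },
}

query = ["safety", "report"]
metadata_heavy = {"title": 5.0, "tags": 3.0, "body": 1.0}
body_focused = {"title": 1.0, "tags": 0.5, "body": 1.0}

top_metadata_heavy = rank(docs, query, metadata_heavy)[0]
top_body_focused = rank(docs, query, body_focused)[0]
```

With the metadata-heavy weights the Chinese report ranks first for an
English query; with the body-focused weights the English document does.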
* The Joys of Query Language Identification *

Although a number of language detectors are available, most need at least
200 characters to confirm a language. Microblog posts are often shorter
than that, though langid <http://www.aclweb.org/anthology/P12-3005> does a
pretty good job with them.

Identifying a language prior to indexing is not going to create any latency
the user will be aware of. That is not the case with a query, where the
identification has to be done on the fly. In these instances, it may help
to prompt the user to ensure the language identified for the query term(s)
is correct. A common use case is where an employee is working in France, is
logged in to the French language pages of the intranet, but then conducts a
search in English. A single word may not be a major problem, but a noun
phrase could be — check this point with your search vendor.
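
The fallback-and-prompt idea can be sketched in Python as follows. This is
not how langid or any vendor's detector works: it is a deliberately crude
stop-word counter, with invented word lists and a hypothetical min_hits
threshold, used only to show when the application might fall back to the
interface language and ask the user to confirm.

```python
# Tiny illustrative stop-word lists (assumption: not a real detector).
STOP_WORDS = {
    "en": {"the", "of", "and", "for", "in", "to"},
    "fr": {"le", "la", "les", "des", "et", "pour", "dans"},
    "de": {"der", "die", "das", "und", "für", "mit"},
}


def identify_query_language(query, ui_lang, min_hits=2):
    """Return (language, needs_confirmation) for a search query.

    When the query offers too little evidence, or two languages tie,
    fall back to the interface language and flag it for a user prompt.
    """
    words = query.lower().split()
    hits = {
        lang: sum(w in stops for w in words)
        for lang, stops in STOP_WORDS.items()
    }
    ranked = sorted(hits.items(), key=lambda kv: kv[1], reverse=True)
    (best, best_hits), (_, second_hits) = ranked[0], ranked[1]
    if best_hits < min_hits or best_hits == second_hits:
        return ui_lang, True
    return best, False
```

For an employee on the French pages, identify_query_language("budget",
"fr") gives ("fr", True), meaning "assume French but ask", while a longer
English query such as "the terms of the contract for the supplier" gives
("en", False).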
* Plan for Multilingual Search *

Carol Peters, Martin Braschler and Paul Clough published the definitive
text on multilingual search <https://www.springer.com/gb/book/9783642230073> in
2010. The book runs over 200 pages, which may give you an indication of how
complex this is to implement. A corporate language policy is an important
umbrella, as is identifying and assessing the extent of bilingual or even
trilingual content.

If multilingual search is important to your organization, make sure you
understand how your current application is managing the process before
writing a specification for a new search application. As is so often the
case with search, the devil is in the details.
* About the Author *

Martin White is Managing Director of Intranet Focus, Ltd. and is based in
Horsham, UK. An information scientist by profession, he has been involved
in information retrieval and search for nearly four decades as a
consultant, author and columnist.

-- 
=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+

 Harold F. Schiffman

Professor Emeritus of
 Dravidian Linguistics and Culture
Dept. of South Asia Studies
University of Pennsylvania
Philadelphia, PA 19104-6305

Phone:  (215) 898-7475
Fax:  (215) 573-2138

Email:  haroldfs at gmail.com
http://ccat.sas.upenn.edu/~haroldfs/
