<div dir="ltr">
<div class="gmail-col-md-12"> <h1> Searching for Information in the Tower of Babel </h1> </div> <div class="gmail-content-body gmail-col-md-8 gmail-col-sm-6 gmail-col-xs-12"> <div class="gmail-article-author"> <img src="https://www.cmswire.com/~/media/2bb8eb50539946c5a9e76c71995d784b.jpg?w=200&h=200&as=1&hash=48132188BB4390D3B44CE1EA5323D0E883C11710" alt="Martin White" class="gmail-author-avatar" width="200" height="200"> <p> By <span class="gmail-author"><a href="https://www.cmswire.com/author/martin-white/" title="Martin White's profile" rel="author">Martin White</a></span> <span class="gmail-separator">|</span> <em>Mar 13, 2018</em> </p> <p class="gmail-category gmail-hidden-sm gmail-hidden-xs"> CHANNEL: <a href="https://www.cmswire.com/information-management/">Information Management</a> </p> </div> <div class="gmail-article-body"> <figure class="gmail-image-figure gmail-image-figure_featured-image gmail-article-vert gmail-insidevert1"> <img src="https://www.cmswire.com/~/media/4de168997c3e493c9a862af1aded4473.jpg?mw=320&mh=240&hash=16602CF34E61E957474F7F9EB83C6056AA9AD6F7" alt="Pieter Bruegel the Elder, The Tower of Babel" class="gmail-block"> <figcaption> <span class="gmail-photo-cutline"><i class="gmail-fa gmail-fa-camera"></i>Enterprise search already poses enough challenges. Add in multilingual search, and the challenges only grow</span> <span class="gmail-photo-credit"> PHOTO:
public domain </span> </figcaption> </figure> <p>Only in the last few years have we started to recognize the impact and implications of employees in multinational organizations <a href="https://harzing.com/research/language-in-international-business" target="_blank">working in more than one language</a>. </p> <p>The
concept of a definitive single corporate language is fast disappearing
as these organizations adopt collaboration as the default mode of work
and introduce multiple social media channels. And while employees may
indicate language skills on their profiles, the level of competence (a
term whose definition is open to debate) at speaking, reading, writing
and understanding a foreign language may vary considerably. </p> <p>So
it stands to reason that employees in multilingual organizations are
enlisting search applications to locate information in a wide range of
languages, but I suspect these organizations lack a strategy that
commits the resources and actions needed to make this possible. </p> <strong> <h2>Accidental Corporate Language Policies</h2> </strong> <p>All
too often intranet teams define a corporate language policy almost by
accident as requests come in from around the world to be able to publish
in local languages. Language policies don't only cover which languages
will be supported but also which pairs of languages will be supported. </p><div id="gmail-div-gpt-ad-1336852434508-3" class="gmail-article-vert gmail-insidevert2"> <div id="gmail-google_ads_iframe_/1003060/Inline-300x250--01_0__container__" style="border-color:currentcolor;border-style:none;border-width:0pt"></div></div> <p>Machine
translation is adequate in some contexts but insufficient for contracts
and other official documents, and for press releases from publicly
listed companies. A slight machine mistranslation of a
corporate press release could have unexpected and undesirable effects on
the share price. </p> <strong> <h2>The Difference Between Multilingual and Cross-Lingual Search</h2> </strong> <p>Multilingual
and cross-lingual search are two very different processes that are
often confused. In multilingual search, a query in English searches
content in English, a query in French searches content in French, and so
on. Ideally the index for each language will be of similar quality,
having been built with the appropriate stemming and lemmatization tools
and equally appropriate stop words. Integrating content in English,
French and German into the same index is a very poor approach, because
stemming rules and stop words are specific to each language. </p> <p>In cross-lingual search, a search term is entered in
English and the search application uses taxonomies, thesauri and
perhaps machine translation to match the meaning of the term in multiple
languages. This is seriously challenging, not just from the query/index
management standpoint but also because it raises the question of how to
present the results, for which there are many options. </p> <strong> <h2>The Special Challenges of Multilingual Documents</h2> </strong> <p>When
in doubt, assume documents contain more than one language. A document
in German may quote the text of a local contract in French and have an
executive summary in English. A further complication is the practice
of adding metadata in English to a German document because English is
the ‘corporate’ language. These situations pose a challenge not only to
the language identification algorithms in the search application but
also to the ranking of the document. </p> <p>Mixing languages has
implications for the weighting of terms in retrieval. In a collection of
predominantly German text, the words in a short French snippet are
relatively rare, so they receive more weight. Mix those documents into a
collection written predominantly in French, and the same French words
become common and carry much less weight. </p> <p>Metadata presents a special
problem. A document in Chinese may end up at the top of a list of
relevant documents for a query in English if it has been helpfully
tagged in English and given an English title, and the metadata is given
substantial weight in the
ranking algorithm. </p> <strong> <h2>The Joys of Query Language Identification</h2> </strong> <p>Although
a number of language detectors are available, most need at least 200
characters to confirm a language. Microblog posts are often shorter than
that, though <a href="http://www.aclweb.org/anthology/P12-3005" target="_blank">langid</a> does a pretty good job. </p> <p>Identifying
a language prior to indexing is not going to create any latency the
user will be aware of. That is not the case with a query, where the
identification has to be done on the fly. In these instances, it may
help to prompt the user to confirm that the language identified for the
query term(s) is correct. A common use case is an employee who is working in
France and is logged in to the French-language pages of the intranet, but
then conducts a search in English. A single word may not be a major
problem, but a noun phrase could be, so check this point with your search
vendor.</p> <strong> <h2>Plan for Multilingual Search</h2> </strong> <p>Carol Peters, Martin Braschler and Paul Clough published the definitive text on <a href="https://www.springer.com/gb/book/9783642230073" target="_blank">multilingual search</a> in
2012. The book runs over 200 pages, which may give you an indication of
how complex this is to implement. A corporate language policy is
an important umbrella, as is identifying and assessing the extent
of bilingual or even trilingual content. </p> <p>If multilingual search
is important to your organization, make sure you understand how your
current application is managing the process before writing a
specification for a new search application. As is so often the case with
search, the devil is in the details.</p> </div> <h2>About the Author</h2> <p>Martin
White is Managing Director of Intranet Focus, Ltd. and is based in
Horsham, UK. An information scientist by profession, he has been
involved in information retrieval and search for nearly four decades as a
consultant, author and columnist.</p></div>
</div>