17.1298, Qs: Anyone to Trade Multilingual Dictionary Databases?

Fri Apr 28 00:49:32 UTC 2006

LINGUIST List: Vol-17-1298. Thu Apr 27 2006. ISSN: 1068 - 4875.

Subject: 17.1298, Qs: Anyone to Trade Multilingual Dictionary Databases?

Moderators: Anthony Aristar, Wayne State U <aristar at linguistlist.org>
            Helen Aristar-Dry, Eastern Michigan U <hdry at linguistlist.org>

Reviews (reviews at linguistlist.org) 
        Sheila Dooley, U of Arizona  
        Terry Langendoen, U of Arizona  

Homepage: http://linguistlist.org/

The LINGUIST List is funded by Eastern Michigan University, Wayne
State University, and donations from subscribers and publishers.

Editor for this issue: James Rider <rider at linguistlist.org>
================================================================  

We'd like to remind readers that the responses to queries are usually
best posted to the individual asking the question. That individual is
then strongly encouraged to post a summary to the list. This policy was
instituted to help control the huge volume of mail on LINGUIST; so we
would appreciate your cooperating with it whenever it seems appropriate.

In addition to posting a summary, we'd like to remind people that it
is usually a good idea to personally thank those individuals who have
taken the trouble to respond to the query.

To post to LINGUIST, use our convenient web form at
http://linguistlist.org/LL/posttolinguist.html.

===========================Directory==============================  

1)
Date: 23-Apr-2006
From: Joel Shapiro < jrs_14618 at yahoo.com >
Subject: Anyone to Trade Multilingual Dictionary Databases? 

-------------------------Message 1 ---------------------------------- 
Date: Thu, 27 Apr 2006 20:46:55
From: Joel Shapiro < jrs_14618 at yahoo.com >
Subject: Anyone to Trade Multilingual Dictionary Databases? 

Hello All, 

I am a Windows Automated Robot Script Programmer 
with an interest in multi-lingual applications. 
I program my robot scripts using a powerful 
automated robot scripting tool named Macro 
Scheduler by Mjtnet (www.mjtnet.com) 

My current project has the objective of enabling
the user to perform very effective web search
engine queries, even in languages with which the
user is totally unfamiliar.

In a nutshell, users E-mail my computer (server) a
search engine request: a list of search terms in their
native language font or characters.  My automated
robot script does a dictionary word-for-word or
word-for-short-phrase translation of the words in the
E-mail request into the user's designated or 'target'
language, for a search engine search in the native
font or characters of the target language.

Currently the databases for my dictionaries in
several languages are in (English) Excel
spreadsheets because, without getting too technical,
Macro Scheduler has specialized commands that make
interacting with Excel a trivial proposition versus
one that would otherwise be complicated.
Fortunately, non-English Unicode text characters keep
their attributes just fine in English Excel.  Thus,
using Excel as a text parsing and calculation
intermediary is recommended by other Macro Scheduler
programmers.

The user can also designate the target language
to be the same as that of the E-mail request.  In
this case words/terms from the request are inserted
directly into the search URL (which I will describe
in more detail shortly) and no dictionary
translation is required.

In the current application my automated robot 
E-Mails the results back to the user in an attached 
Excel spreadsheet. 

The key to the effectiveness of my multi-lingual 
search engine interface is the establishment of 
dictionaries in all languages in their respective 
native Unicode font or character sets not just 
with respect to ''regular'' dictionary words but 
geographic location and proper (i.e. people's) 
names as well. 

The crux of my post is to ask whether anyone who has
developed an application, or for that matter just
extensively uses an application, in 'native' Unicode
fonts or characters would be receptive to
the idea of trading word databases with me.

This could make, say, your present Russian program
or application (Cyrillic text characters) truly
multi-lingual/multi-national ... perhaps with a
little help from an automated robot script with
respect to either gleaning the words/terms or
making other languages' characters applicable in
your program.

I will address these topics in further detail 
shortly but first ... 

Because Google is the current world search
engine leader, it was my first choice for
implementing my automated robot scripts.

Google provides and advertises an API or 
''Application Programming Interface'' which provides 
the user essentially some robot capability for 
automated searches of their famed search engine.  I 
naively figured Google had no qualms about or opposition
to automated scripts interfacing with their search
engine, provided the number of accesses does not exceed
the limit Google sets for their API.

In other words, I figured that as long as the user is not
directly interacting with Google's main page, it would be
a ''wash'' whether the search goes through their API or
through my robot script, which has much more in the way
of custom specialized functionality and capability.

Wrong! 

Google, in its Terms of Service verbiage,
specifically prohibits automated robot activity
or interaction with its services by its users
unless authorized by Google.

For a few moments after I read Google's
explicit prohibition it didn't make sense.  But
then it occurred to me that Google's main order of
business is their search engine and their
carnivorous assimilation of data from its users.
With ''third party'' automated script robots
such as mine, the explicit association between
the user's search request and the user's IP
address is lost ... and with it one of the crown
jewels of Google's company interests that
separates it from other search engine providers.

Interestingly, other search engines I've 
investigated appear not to have such explicit 
Terms of Service prohibitions as Google against 
automated scripts accessing them.  Perhaps the 
others have other primary business interests and 
directions where the association of user and IP 
address is not so paramount. 

Also, I found the same concept of ''packing''
the ''search URL'' to be even easier with other search
engines!  Where Google requires a different search
URL ''string'' for each language, as will shortly be
described, other search engines have one search URL
''template'' or cookie cutter format where all the
robot has to do is plug in the Unicode characters
for any language in a standard search URL and ...
Voilà! It works!
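
As a rough illustration of the single ''template'' idea (in
Python rather than Macro Scheduler), the sketch below plugs
percent-encoded search terms into one URL pattern. The host
search.example.com, the template layout and the function name
are assumptions made up for this example, not the format of
any real search engine.

```python
from urllib.parse import quote_plus

# Hypothetical template; a real engine's URL layout would have to be checked.
SEARCH_TEMPLATE = "http://search.example.com/search?q={terms}"


def build_search_url(terms, template=SEARCH_TEMPLATE):
    """Percent-encode each term as UTF-8 and join the terms with '+'."""
    encoded = "+".join(quote_plus(term) for term in terms)
    return template.format(terms=encoded)


# The same template serves English and Greek terms alike.
print(build_search_url(["Marco", "Polo"]))
print(build_search_url(["Μάρκο", "Πόλο"]))
```

The point of the sketch is only that one template can serve
every language once the terms are encoded consistently.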

So, where the following examples are all with 
respect to Google, the actual robot searches will 
not be using Google but other search engines. 
However, importantly the underlying concept and 
mechanics are all the same. 

As I mentioned earlier my Multi-Lingual Macro 
Scheduler automated robot search engine 
interface has the following format: 

The user sends my computer (server) an E-mail
with a list of words for a Google search in his/her
preferred or native language in the native Unicode
font or character set and designates the language
for which the search engine (Google) search is
to be performed.

The robot automatically scans for new E-mails.
Upon recognizing a valid request (a valid user
login and password, a language that is operational,
a request format the robot can act on, etc.),
the first thing the robot does is
make a word-for-word or word-for-short-phrase
dictionary translation of the word list.

These will be the search engine (Google) search 
terms ... again, in the target language's native 
font or character set and in the order the user 
lists them in the request. 

The request and the target language can be the 
same.  In this case no dictionary translation need 
take place and the words from the request are 
directly transferred ''as is'' to the search 
processing portion of the application. 
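
The translation step described above might look roughly like
the following Python sketch. The tiny English-to-Greek table,
the language codes and the function name are all made up for
illustration; the real dictionaries live in Excel and are far
larger. Terms with no dictionary entry are passed through
unchanged, which is one possible policy among several.

```python
# Tiny made-up dictionary keyed by (source, target) language codes.
DICTIONARIES = {
    ("en", "el"): {
        "marco": "Μάρκο",
        "polo": "Πόλο",
    },
}


def translate_terms(terms, source_lang, target_lang, dictionaries):
    """Translate term by term, keeping the order given in the request.

    If source and target languages match, the terms are returned as is.
    Terms missing from the dictionary are passed through unchanged.
    """
    if source_lang == target_lang:
        return list(terms)
    dictionary = dictionaries.get((source_lang, target_lang), {})
    return [dictionary.get(term.lower(), term) for term in terms]


print(translate_terms(["Marco", "Polo"], "en", "el", DICTIONARIES))
```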

Probably most of you reading this post are aware 
Google has a ''main page'' for various languages 
in a continuing worldwide collaborative effort. 
The portal to this capability is selecting the 
''Language Tools'' link on the ''regular'' English 
Google web page: www.google.com 

Interestingly, after running a Google search on
any one of its various foreign language main pages,
the result URLs not only contain the search words/terms
in the native font, but also consistently keep the
same format for each language.

With Google, every language has its own search URL,
in which the search terms can be replaced by English ones.

For instance, the Urdu search string for the famous
world traveler and explorer Marco Polo, doing a
search using English characters, is:

http://www.google.com/search?hl=ur&q=Marco+Polo&btnG=%D8%AA%D9%84%D8%... 

The Greek search string for Marco Polo using 
English characters is: 

http://www.google.com/search?hl=el&q=Marco+Polo&btnG=%CE%91%CE%BD%CE%... 

Google provides by default the first 10 results on
the first result page.  The ''next 10'' Google
URLs for Urdu and Greek respectively are:

http://www.google.com/search?q=Marco+Polo&hl=ur&lr=lang_en&start=10&sa=N 
http://www.google.com/search?q=Marco+Polo&hl=el&lr=lang_id&start=10&sa=N 

Likewise I've found there is an equivalent of these 
standard ''next 10'' URLs in other search engines as 
well. 
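
A minimal Python sketch of generating such ''next 10'' URLs is
given below. It mirrors only the q, hl and start parameters
visible in the examples above and leaves out lr and sa; the
function name and the zero-based page numbering are
assumptions for illustration.

```python
from urllib.parse import urlencode


def result_page_url(query, interface_lang, page, per_page=10):
    """Build the URL of a result page; page 0 is the first page."""
    params = {"q": query, "hl": interface_lang, "start": page * per_page}
    return "http://www.google.com/search?" + urlencode(params)


# Second result page (results 11 to 20) for the Urdu and Greek interfaces.
print(result_page_url("Marco Polo", "ur", 1))
print(result_page_url("Marco Polo", "el", 1))
```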

Once my robot has parsed the search words or terms 
from the E-mail request and performed a dictionary 
translation if required, it ''plugs in'' the terms in 
the search URL and deploys it bypassing the need to 
interact with Google's main page for the given 
language or, for that matter, the main page of any 
search engine. 

The Marco and Polo separated by a plus '+' sign
are replaced respectively with the native Unicode
renditions of Marco and Polo in Urdu and Greek.
Deploying the search URL in the respective native 
font/characters renditions of Marco and Polo will 
yield different, often more effective results 
depending on the context. 
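
Native characters cannot appear raw in a URL; they travel as
percent-encoded bytes, which is what sequences such as
%CE%91%CE%BD in the Greek example above appear to be (UTF-8
bytes of Greek letters). The Python sketch below shows the
round trip for a Greek rendition of the name; the variable
names are made up for this example.

```python
from urllib.parse import quote_plus, unquote_plus

native = "Μάρκο Πόλο"  # Greek rendition of "Marco Polo"

# Each Greek letter becomes two UTF-8 bytes written as %XX%XX;
# the space between the words becomes '+'.
encoded = quote_plus(native)
print(encoded)

# Decoding recovers the original text exactly.
decoded = unquote_plus(encoded)
print(decoded == native)
```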

More importantly, where just text parsing and
processing is the objective, not only do I not need
to interface with a search engine's main page ...
I don't need to use a graphic browser such as
Microsoft Internet Explorer (IE), Firefox, Netscape
etc. to deploy the search URLs.

Macro Scheduler has an HTTPRequest command which
gleans the text, whether it be standard ASCII
English text or the Unicode text for various
foreign languages, in a fraction of a second versus
waiting for the graphics of a web page to stabilize in
standard browsers.
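
For readers without Macro Scheduler, a comparable text-only
fetch can be sketched with Python's standard library. This is
not the HTTPRequest command itself, only an analogue; the
function name, the User-Agent string and the fallback to
UTF-8 are assumptions for this example.

```python
from urllib.request import Request, urlopen


def fetch_text(url, timeout=10):
    """Fetch a URL without a browser and return its body as text."""
    request = Request(url, headers={"User-Agent": "example-fetcher/0.1"})
    with urlopen(request, timeout=timeout) as response:
        # Use the charset declared by the server, else assume UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Whether a given site permits automated fetching at all is a
separate question, as the Terms of Service discussion above
shows.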

For applications where pure text and no graphical 
(i.e. picture) aspects are involved, a Macro 
Scheduler solution is an order of magnitude more 
efficient and robust than an automated robot 
solution that interacts with a browser. 

The results of the search URL are URLs of web pages
that contain and/or pertain to the search criteria.
My robot recognizes these URLs and, in a most
expedited and efficient manner, again using Macro
Scheduler's HTTPRequest command, fetches each of
the result URLs and finds instances of the words
and terms of the search request and their frequency in
the result pages.
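
The counting step can be sketched in a few lines of Python.
This version does plain case-insensitive substring counting,
which is an assumption; it would, for example, also count
''polo'' inside a longer word. The function name is made up
for illustration.

```python
def term_frequencies(page_text, terms):
    """Count case-insensitive occurrences of each term in page_text."""
    lowered = page_text.casefold()
    return {term: lowered.count(term.casefold()) for term in terms}


sample = "Marco Polo left Venice. Years later Marco returned."
print(term_frequencies(sample, ["Marco", "Polo", "Genoa"]))
```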

The result URLs and presence/frequency data of the 
search terms are ported into an Excel spreadsheet 
and E-mailed back to the user as an attachment. 
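
As a stand-in for the Excel attachment, the sketch below
builds the same kind of table as CSV text, which Excel can
open. Using CSV instead of a native spreadsheet, and the
column layout (one row per result URL, one column per search
term), are assumptions for this example.

```python
import csv
import io


def build_report(rows, terms):
    """Turn (url, {term: count}) pairs into CSV text, one row per URL."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["url"] + list(terms))
    for url, counts in rows:
        writer.writerow([url] + [counts.get(term, 0) for term in terms])
    return buffer.getvalue()


rows = [("http://example.com/a", {"Marco": 2, "Polo": 1})]
print(build_report(rows, ["Marco", "Polo"]))
```

When saving such text to a file for Excel, writing it with a
Unicode encoding that includes a byte order mark (for
instance Python's "utf-8-sig") is a common way to help Excel
detect the non-English characters.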

Macro Scheduler also has specialized commands that
make scanning, receiving and sending E-mail
trivial as well.

I hope in this post I have adequately conveyed the 
gist of my Multi-Lingual Automated Robot Search 
Engine Interface (MLARSEI).  However feature-rich
I can make it, it is inherently limited by the
extent of the dictionaries.

Thank you for your interest and consideration. 

Regards, 

Joel S. 
Rochester, New York 
jrs_14618 at yahoo.com 

Linguistic Field(s): Translation

-----------------------------------------------------------
LINGUIST List: Vol-17-1298