17.1298, Qs: Anyone to Trade Multilingual Dictionary Databases?
linguist at LINGUISTLIST.ORG
Fri Apr 28 00:49:32 UTC 2006
LINGUIST List: Vol-17-1298. Thu Apr 27 2006. ISSN: 1068 - 4875.
Subject: 17.1298, Qs: Anyone to Trade Multilingual Dictionary Databases?
Moderators: Anthony Aristar, Wayne State U <aristar at linguistlist.org>
Helen Aristar-Dry, Eastern Michigan U <hdry at linguistlist.org>
Reviews (reviews at linguistlist.org)
Sheila Dooley, U of Arizona
Terry Langendoen, U of Arizona
Homepage: http://linguistlist.org/
The LINGUIST List is funded by Eastern Michigan University, Wayne
State University, and donations from subscribers and publishers.
Editor for this issue: James Rider <rider at linguistlist.org>
================================================================
We'd like to remind readers that the responses to queries are usually
best posted to the individual asking the question. That individual is
then strongly encouraged to post a summary to the list. This policy was
instituted to help control the huge volume of mail on LINGUIST; so we
would appreciate your cooperating with it whenever it seems appropriate.
In addition to posting a summary, we'd like to remind people that it
is usually a good idea to personally thank those individuals who have
taken the trouble to respond to the query.
To post to LINGUIST, use our convenient web form at
http://linguistlist.org/LL/posttolinguist.html.
===========================Directory==============================
1)
Date: 23-Apr-2006
From: Joel Shapiro < jrs_14618 at yahoo.com >
Subject: Anyone to Trade Multilingual Dictionary Databases?
-------------------------Message 1 ----------------------------------
Date: Thu, 27 Apr 2006 20:46:55
From: Joel Shapiro < jrs_14618 at yahoo.com >
Subject: Anyone to Trade Multilingual Dictionary Databases?
Hello All,
I am a Windows Automated Robot Script Programmer
with an interest in multi-lingual applications.
I program my robot scripts using a powerful
automated robot scripting tool named Macro
Scheduler by Mjtnet (www.mjtnet.com).
My current project aims to enable the user to
perform very effective web search engine queries
in languages with which the user may be entirely
unfamiliar.
In a nutshell, users E-mail my computer (server)
a list of search terms in their native language's
font or characters. My automated robot script
performs a dictionary word-for-word or
word-for-short-phrase translation of the words in
the E-mail request into the user's designated or
'target' language, and the search engine search is
then performed in the native font or characters of
the target language.
Currently the databases for my dictionaries in
several languages are kept in (English) Excel
spreadsheets because, without getting too
technical, Macro Scheduler has specialized
commands that make interacting with Excel trivial
rather than complicated.
Fortunately, non-English Unicode text characters
keep their attributes just fine in English Excel.
Thus, using Excel as a text-parsing and calculation
intermediary is recommended by other Macro
Scheduler programmers.
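Purely for illustration, here is a rough Python sketch of
the kind of dictionary lookup table involved. My own robot
is a Macro Scheduler script, so this is only a conceptual
parallel; the openpyxl library, the file name and the
two-column layout are all assumptions on my part.

    # Illustrative sketch: read a two-column dictionary spreadsheet
    # (source word, target word) into an in-memory lookup table.
    # The file name and layout are hypothetical.
    from openpyxl import load_workbook

    def load_dictionary(path):
        wb = load_workbook(path, read_only=True)
        sheet = wb.active
        lookup = {}
        for source, target in sheet.iter_rows(min_row=2, max_col=2, values_only=True):
            if source and target:
                # Unicode text survives the round trip through Excel intact.
                lookup[str(source).strip().lower()] = str(target).strip()
        return lookup

    dictionary = load_dictionary("en_el_dictionary.xlsx")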
The user can also designate the target language
to be the same as that of the E-mail request. In
this case the words/terms from the request are used
directly in the search URL (which I will describe
in more detail shortly) and no dictionary
translation is required.
In the current application my automated robot
E-Mails the results back to the user in an attached
Excel spreadsheet.
The key to the effectiveness of my multi-lingual
search engine interface is the establishment of
dictionaries in all languages in their respective
native Unicode fonts or character sets, not just
for ''regular'' dictionary words but for geographic
locations and proper (i.e. people's) names as well.
The crux of my post is to ask whether anyone who
has developed an application, or for that matter
simply makes extensive use of one, in 'native'
Unicode fonts or characters would be receptive to
the idea of trading your word database with mine.
This could make, say, your present Russian
(Cyrillic text characters) program or application
truly multi-lingual/multi-national ... perhaps with
a little help from an automated robot script,
either for gleaning the words/terms or for making
other languages' characters usable in your program.
I will address these topics in further detail
shortly but first ...
Because Google is the current world search
engine leader, it was/is my first choice for
implementing my automated robot scripts.
Google provides and advertises an API, or
''Application Programming Interface'', which
essentially gives the user some robot capability
for automated searches of their famed search
engine. I naively figured Google had no qualms
about automated scripts interfacing with their
search engine provided the number of accesses does
not exceed the limit Google sets for their API.
In other words, I figured that whether the user
interacts with Google's main page directly, via
their API, or via my robot script (which has much
more in the way of custom, specialized
functionality and capability), it would be a
''wash''.
Wrong!
Google, in its Terms of Service, specifically
prohibits automated robot activity or interaction
with its services by its users unless authorized
by Google.
For a few moments after I read Google's explicit
prohibition it didn't make sense. But then it
occurred to me that Google's main order of
business is its search engine and its carnivorous
assimilation of data from its users.
With ''third party'' automated robot scripts such
as mine, the explicit association between the
user's search request and the user's IP address
is lost ... and with it one of the crown jewels
of Google's company interests that separate it
from other search engine providers.
Interestingly, other search engines I've
investigated appear not to have explicit Terms of
Service prohibitions, as Google does, against
automated scripts accessing them. Perhaps the
others have other primary business interests and
directions in which the association of user and
IP address is not so paramount.
Also, I found that the same concept of ''packing''
the ''search URL'' is even easier with other search
engines! Where Google requires a different search
URL ''string'' for each language, as will shortly be
described, other search engines have one search URL
''template'' or cookie-cutter format where all the
robot has to do is plug the Unicode characters
for any language into a standard search URL and ...
Voila! It works!
So, while the following examples are all with
respect to Google, the actual robot searches will
not use Google but other search engines.
Importantly, however, the underlying concept and
mechanics are the same.
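As a rough illustration of that cookie-cutter idea, here is
a short Python sketch. The host name and parameter names
are hypothetical placeholders, not any particular engine's
real format, and my actual robot is written in Macro
Scheduler rather than Python.

    # Illustrative "cookie cutter" search URL template.
    # Host and parameter names are hypothetical placeholders.
    SEARCH_TEMPLATE = "http://search.example.com/search?lang={lang}&q={terms}"

    def build_search_url(terms, lang):
        # Join the terms with '+' and drop them into the standard template;
        # the same template serves every language. (Percent-encoding of
        # non-ASCII terms is shown in a later sketch.)
        return SEARCH_TEMPLATE.format(lang=lang, terms="+".join(terms))

    print(build_search_url(["Marco", "Polo"], "el"))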
As I mentioned earlier my Multi-Lingual Macro
Scheduler automated robot search engine
interface has the following format:
The user sends my computer (server) an E-mail
with a list of words for a Google search in his/her
preferred or native language, in the native Unicode
font or character set, and designates the language
for which the search engine (Google) search is
to be performed.
The robot automatically scans for new E-mails,
and upon recognizing a valid request (valid user
login and password, a language that is operational,
a request format the robot can act on, etc.), the
first thing it does is make a word-for-word or
word-for-short-phrase dictionary translation of
the word list.
These will be the search engine (Google) search
terms ... again, in the target language's native
font or character set and in the order the user
lists them in the request.
The request language and the target language can
be the same. In this case no dictionary translation
need take place and the words from the request are
transferred directly, ''as is'', to the search
processing portion of the application.
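Again purely as a conceptual parallel (the function and
variable names are made up, and passing unknown words
through unchanged is a simplifying assumption of mine),
the translation step looks roughly like this in Python:

    # Conceptual sketch of the word-for-word translation step.
    # "dictionary" is the lookup table loaded in the earlier sketch;
    # words with no entry are passed through unchanged (an assumption).
    def translate_terms(words, dictionary, same_language=False):
        if same_language:
            # Request and target language are the same: pass terms through "as is".
            return list(words)
        return [dictionary.get(w.lower(), w) for w in words]

    terms = translate_terms(["famous", "explorer"], dictionary)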
Probably most of you reading this post are aware
Google has a ''main page'' for various languages
in a continuing worldwide collaborative effort.
The portal to this capability is selecting the
''Language Tools'' link on the ''regular'' English
Google web page: www.google.com
Interestingly, after performing a Google search
from any one of its various foreign-language main
pages, the result URLs contain not only the search
words/terms in the native font, but the URLs for
each language also consistently maintain their
format. With Google, every language has its own
search URL, and the search terms in it can even be
given in English characters.
For instance, the Urdu search string for famous
world traveler and explorer Marco Polo, using
English characters, is:
http://www.google.com/search?hl=ur&q=Marco+Polo&btnG=%D8%AA%D9%84%D8%...
The Greek search string for Marco Polo using
English characters is:
http://www.google.com/search?hl=el&q=Marco+Polo&btnG=%CE%91%CE%BD%CE%...
Google provides by default the first 10 results on
the first result page. The ''next 10'' Google
URLs for Urdu and Greek respectively are:
http://www.google.com/search?q=Marco+Polo&hl=ur&lr=lang_en&start=10&sa=N
http://www.google.com/search?q=Marco+Polo&hl=el&lr=lang_id&start=10&sa=N
Likewise I've found there is an equivalent of these
standard ''next 10'' URLs in other search engines as
well.
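Based on the URL pattern shown above, paging through the
results is just a matter of bumping the ''start'' offset.
A small Python sketch of the same idea (my robot does this
in Macro Scheduler):

    # Sketch of paging through results by bumping the "start" offset,
    # following the pattern of the Google URLs quoted above.
    from urllib.parse import urlencode

    def result_page_urls(query, lang, pages=3):
        urls = []
        for page in range(pages):
            params = {"q": query, "hl": lang, "start": page * 10, "sa": "N"}
            urls.append("http://www.google.com/search?" + urlencode(params))
        return urls

    for url in result_page_urls("Marco Polo", "ur"):
        print(url)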
Once my robot has parsed the search words or terms
from the E-mail request and performed a dictionary
translation if required, it ''plugs'' the terms into
the search URL and deploys it, bypassing the need to
interact with Google's main page for the given
language or, for that matter, the main page of any
search engine.
The Marco and Polo delimited by a plus '+' sign
are replaced respectively with the native Unicode
renditions of Marco and Polo in Urdu and Greek.
Deploying the search URL with the respective native
font/character renditions of Marco and Polo will
yield different, often more effective results
depending on the context.
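Substituting the native-script terms amounts to
percent-encoding them into the q parameter. A short Python
sketch of that step (the Greek rendering of Marco Polo
below is only an illustrative example, not taken from my
dictionaries):

    # Sketch: substituting native-script terms into the q parameter.
    # The Greek rendering of "Marco Polo" is only an illustrative example.
    from urllib.parse import quote_plus

    greek_terms = ["Μάρκο", "Πόλο"]
    q = "+".join(quote_plus(t) for t in greek_terms)
    url = "http://www.google.com/search?hl=el&q=" + q
    print(url)  # the Greek terms appear percent-encoded in the URL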
More importantly, where just text parsing and
processing is the objective, not only do I not need
to interface with a search engine's main page ...
I don't need to use a graphical browser such as
Microsoft Internet Explorer (IE), Firefox, Netscape
etc. to deploy the search URLs.
Macro Scheduler has an HTTPRequest command which
gleans the text, whether it be standard ASCII
English text or the Unicode text of various
foreign languages, in a fraction of a second versus
waiting for the graphics of a web page to stabilize
in standard browsers.
For applications where pure text and no graphical
(i.e. picture) aspects are involved, a Macro
Scheduler solution is an order of magnitude more
efficient and robust than an automated robot
solution that interacts with a browser.
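By way of comparison only (my robot uses Macro Scheduler's
HTTPRequest command; the Python below is merely the
equivalent idea, with a hypothetical helper name):

    # Equivalent idea: fetch the raw page text directly,
    # with no browser and no rendering step.
    import urllib.request

    def fetch_text(url):
        with urllib.request.urlopen(url, timeout=10) as response:
            raw = response.read()
        # Decode as UTF-8 so non-English Unicode text survives intact.
        return raw.decode("utf-8", errors="replace")

    page_text = fetch_text(url)  # 'url' built as in the earlier sketches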
The results of the search URL are URLs of web pages
that contain and/or pertain to the search criteria.
My robot recognizes these URLs and, in a most
expedited and efficient manner, again using Macro
Scheduler's HTTPRequest command, fetches each result
URL and finds instances of the words and terms of
the search request and their frequency in the
result pages.
The result URLs and the presence/frequency data for
the search terms are ported into an Excel spreadsheet
and E-mailed back to the user as an attachment.
Macro Scheduler also has specialized commands that
make scanning, receiving and sending E-mail
trivial as well.
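A last Python sketch of the tallying step, again only a
conceptual parallel to the Macro Scheduler robot; the
openpyxl library, the output file name and the simple
substring count are all assumptions on my part:

    # Count each (translated) search term in each result page
    # and write one row per result URL; e-mailing the file back
    # as an attachment is a separate step not shown here.
    from openpyxl import Workbook

    def tally_results(result_urls, terms, out_path="results.xlsx"):
        wb = Workbook()
        sheet = wb.active
        sheet.append(["Result URL"] + terms)      # header row
        for url in result_urls:
            text = fetch_text(url).lower()        # fetch_text from the earlier sketch
            counts = [text.count(t.lower()) for t in terms]
            sheet.append([url] + counts)
        wb.save(out_path)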
I hope in this post I have adequately conveyed the
gist of my Multi-Lingual Automated Robot Search
Engine Interface (MLARSEI). However feature-rich
I can make it, it is inherently limited by the
extent of its dictionaries.
Thank you for your interest and consideration.
Regards,
Joel S.
Rochester, New York
jrs_14618 at yahoo.com
Linguistic Field(s): Translation
-----------------------------------------------------------
LINGUIST List: Vol-17-1298