[Lexicog] Re: Lexique Pro file from scratch

Mike Maxwell maxwell at LDC.UPENN.EDU
Wed Feb 2 13:13:35 UTC 2005


Christopher Manning wrote:
> ...Imagine a tool
> for entering dictionary entries that looks more like a Word document but
> you put pieces of text in fields by selecting them and choosing a field
> (much as you apply styles in MS Word).  Perhaps it would color code
> pieces of text for different fields, and would object if there were text
> other than field-separator punctuation that wasn't assigned to a
> field. A cleverer program might even attempt to automatically assign
> text for new entries to fields based on assumed consistency in content
> and punctuation with existing entries.  This tool would load and output
> dictionaries in a structured format such as Standard Format or XML.

If I understand correctly, you're talking about converting existing
dictionaries which are Word docs or some such.  (Hence the part about
selecting pieces of text.)

Out of necessity, I've been doing a form of the 'cleverer program'
approach to convert existing dictionaries from HTML to XML, and I can
say that it's non-trivial.
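
To make the idea concrete, here is a minimal sketch of that kind of
conversion, not the parser I actually use.  It assumes a made-up layout
in which bold text is the headword, italic text is the part of speech,
and unformatted text is the definition; the names FIELD_FOR_TAG,
EntryParser and entry_to_xml are invented for the illustration.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Assumed (hypothetical) mapping from HTML formatting to dictionary fields.
# None stands for text that is not inside <b> or <i>.
FIELD_FOR_TAG = {"b": "headword", "i": "pos", None: "definition"}


class EntryParser(HTMLParser):
    """Collect (style, text) segments from one HTML dictionary entry."""

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open formatting tags
        self.segments = []   # (style, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag in ("b", "i"):
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            style = self.stack[-1] if self.stack else None
            self.segments.append((style, text))


def entry_to_xml(html_entry):
    """Turn one formatted entry into an XML string, one element per field."""
    parser = EntryParser()
    parser.feed(html_entry)
    parser.close()
    entry = ET.Element("entry")
    for style, text in parser.segments:
        ET.SubElement(entry, FIELD_FOR_TAG[style]).text = text
    return ET.tostring(entry, encoding="unicode")


print(entry_to_xml("<p><b>kata</b> <i>n.</i> word; utterance</p>"))
```

This works only as long as every entry follows the assumed layout,
which is exactly the assumption that breaks down in practice.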

In part this is due to inconsistencies in the dictionaries I've seen
(one dictionary used a different format for words beginning with the
letter 'N'--I suspect the lexicographer started there, and later changed
his mind), and in part it's due to the large variety of formats.  Just
when I think my parser handles all the cases, I find a new one.
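
One way to find the odd formats before the parser trips over them is to
reduce each entry to a "signature" of its formatting and count how often
each signature occurs.  The sketch below is only an illustration under
that assumption: signature and census are hypothetical helper names, and
the sample entries are invented.

```python
import re
from collections import Counter

# Matches either a tag (opening or closing) or a run of text between tags.
TAG_OR_TEXT = re.compile(r"<(/?)(\w+)[^>]*>|([^<]+)")


def signature(html_entry):
    """Reduce an entry to its sequence of opening tags and text runs ('#')."""
    parts = []
    for closing, tag, text in TAG_OR_TEXT.findall(html_entry):
        if tag and not closing:
            parts.append(tag.lower())
        elif text and text.strip():
            parts.append("#")
    return " ".join(parts)


def census(entries):
    """Count how many entries share each formatting signature."""
    return Counter(signature(e) for e in entries)


# Invented sample data: the third entry puts the part of speech first.
entries = [
    "<b>kata</b> <i>n.</i> word",
    "<b>lari</b> <i>v.</i> to run",
    "<i>n.</i> <b>nama</b> name",
]
for sig, count in census(entries).most_common():
    print(count, sig)
```

Signatures that occur only a handful of times are the ones worth
inspecting by hand, since they are either errors or a format the parser
does not yet know about.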

I think the problem with starting with unstructured data, or implicitly
structured data (such as the HTML dictionaries, where various kinds of
formatting have been used to mimic traditional printed dictionary
formats), is that people can be (1) inconsistent, and (2) imaginative.

That said, there's obviously a need for importation schemes.  It's just
not easy: the more closely the tool whose data you're importing
approximates free-form entry, the more inconsistency you'll find, and
the greater the difficulty in importing.
--
	Mike Maxwell
	Linguistic Data Consortium
	maxwell at ldc.upenn.edu






