[Corpora-List] comparison of language varieties

Tue Jul 15 09:41:09 UTC 2008

Dear all,

This is a general query about comparing language variety corpora, following Asim's questions (see below).

I am looking for any automated corpus studies and tools for comparing the varieties of a language, in order to take them as a basis for further research on the development of tools for the systematic and automated comparison of linguistic varieties on the basis of text corpora.

Up to now I have contacted researchers from several variety corpus projects, e.g. the 'International Corpus of English' (ICE), the 'Trésor de la Langue Française informatisé' (TLFi), or the 'Proyecto para el Estudio Sociolingüístico del Español de España y América' (PRESEEA).

I was pointed to semi-automatic studies on the lexical level, e.g. at the Centro de Linguística da Universidade de Lisboa (CLUL).

As far as I can see, there have not been any publications on automated comparison tools for higher levels of linguistic description, e.g. collocations, syntactic differences, or even the textual level.

So I'd appreciate references to such studies, starting from the lexical level.

In addition, I'd be grateful for any other ideas on contrasting 'similar' corpora / data sets, which might also come from quite different research fields.

I will post a summary with the replies I get.

Thank you for any hints,

Stefanie

--

Stefanie Anstein
Institute for Specialised Communication and Multilingualism
EURAC research
Viale Druso 1, I-39100 Bolzano
t +39 0471 055 135
f +39 0471 055 199
stefanie.anstein at eurac.edu
www.eurac.edu

From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of Asim
Sent: Tuesday, 27 May, 2008 19:41
To: corpora at uib.no
Subject: [Corpora-List] request for parsing and making the data in a form to be used by wordsmith

Hello

I am working on Pakistani English. I have compiled a 2.1-million-word corpus of written Pakistani English. It is the first ever corpus of Pakistani English.

I want to study the features of the Pakistani variety of English. Could anyone tell me how to locate them? Any suggestions would be welcome.

I have tagged it and am now trying to analyse it using both top-down and bottom-up approaches.

I want to study verb particles, and for this I want to parse the data, as I think parsing is the only way to confirm whether a given word is a preposition or a particle. If there is any other way, apart from manual study, please tell me and I will be obliged.

Another issue: when I use freely available online demo parsers, like LFG, how can I store the results so that they can be used with WordSmith 4 to locate all the particles in my data?

Is there any solution?

Wish to hear from you soon.

Regards

Asim
