[Corpora-List] Linguistics, corpus linguistics, and diglossia
Adam Kilgarriff
adam at lexmasterclass.com
Thu Dec 16 06:06:56 UTC 2010
Mike
It's all about getting the right corpus. It's almost always harder to get
informal than formal text types. The spoken-conversation part of the BNC is
a great role model.
A delight of the web is that it has lots of informal language in it,
especially in blogs and similar, so, with a little application, we can gather
text of informal types. Our work on web corpus collection always has that
in mind.
> saying that corpus linguistics was exactly the wrong way to build a
> dictionary
That's just a counsel of failure. What does she propose doing instead?
Guess (sorry, introspect - mustn't be rude)? Copy existing? Ask her friends?
adam
On 15 December 2010 23:40, Mike Maxwell <maxwell at umiacs.umd.edu> wrote:
> I was talking this afternoon with a lexicographer who is working on western
> Panjabi (the variety--or varieties--spoken in Pakistan, and written in a
> Perso-Arabic script). She was saying that corpus linguistics was exactly
> the wrong way to build a dictionary of colloquial Panjabi, because of a
> somewhat diglossic situation: the written/standardized language is not what
> most people speak.
>
> There are of course many diglossic language situations around the world,
> particularly in situations where a single "language" has been written for
> centuries or millennia. I put "language" in scare quotes because of course
> all languages will have changed over that period of time, to the point of
> non-mutual intelligibility (if you can find any 2000-year-old speakers :-)).
>
> At any rate, this certainly matters if you're trying to do dictionaries--or
> any other study of the spoken or colloquial language, or non-standard
> dialects. I don't recall seeing much discussion of the issues of doing
> corpus linguistics in diglossic languages, the following being one
> exception:
> @article{fonseca2003radical,
> title={{On the radical difference between the subject personal pronouns in
> written and spoken European French}},
> author={Fonseca-Greber, B. and Waugh, L.R.},
> journal={Language and Computers},
> volume={46},
> number={1},
> pages={225--240},
> issn={0921-5034},
> year={2003},
> publisher={Rodopi}
> }
> They resort to some small corpora of transcribed spoken French, and remark
> that they know about some usages that are not attested in these corpora.
> --
> Mike Maxwell
> maxwell at umiacs.umd.edu
> "A library is the best possible imitation, by human beings,
> of a divine mind, where the whole universe is viewed and
> understood at the same time... we have invented libraries
> because we know that we do not have divine powers, but we
> try to do our best to imitate them." --Umberto Eco
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
--
========================================
Adam Kilgarriff <http://www.kilgarriff.co.uk/>
adam at lexmasterclass.com
Director  Lexical Computing Ltd <http://www.sketchengine.co.uk/>
Visiting Research Fellow  University of Leeds <http://leeds.ac.uk>
*Corpora for all* with the Sketch Engine <http://www.sketchengine.co.uk>
*DANTE: a lexical database for English* <http://www.webdante.com>
========================================