[Corpora-List] Linguistics, corpus linguistics, and diglossia

Adam Kilgarriff adam at lexmasterclass.com
Thu Dec 16 06:06:56 UTC 2010


Mike

It's all about getting the right corpus.  It's almost always harder to get
informal than formal text types.  The spoken-conversation part of the BNC is
a great role-model.

A delight of the web is that it has lots of informal language in it,
especially in blogs and similar, so, with a little application, we can gather
text of informal types.  Our work on web corpus collection always has that
in mind.

> saying that corpus linguistics was exactly the wrong way to build a
> dictionary

That's just a counsel of failure.  What does she propose doing instead?
Guess (sorry, introspect - mustn't be rude)? Copy existing? Ask her friends?

adam

On 15 December 2010 23:40, Mike Maxwell <maxwell at umiacs.umd.edu> wrote:

> I was talking this afternoon with a lexicographer who is working on western
> Panjabi (the variety--or varieties--spoken in Pakistan, and written in a
> Perso-Arabic script).  She was saying that corpus linguistics was exactly
> the wrong way to build a dictionary of colloquial Panjabi, because of a
> somewhat diglossic situation: the written/standardized language is not what
> most people speak.
>
> There are of course many diglossic language situations around the world,
> particularly in situations where a single "language" has been written for
> centuries or millennia.  I put "language" in scare quotes because of course
> all languages will have changed over that period of time, to the point of
> non-mutual intelligibility (if you can find any 2000 year old speakers :-)).
>
> At any rate, this certainly matters if you're trying to do dictionaries--or
> any other study of the spoken or colloquial language, or non-standard
> dialects.  I don't recall seeing much discussion of the issues of doing
> corpus linguistics in diglossic languages, the following being one
> exception:
> @article{fonseca2003radical,
>  title={{On the radical difference between the subject personal pronouns in
> written and spoken European French}},
>  author={Fonseca-Greber, B. and Waugh, L.R.},
>  journal={Language and Computers},
>  volume={46},
>  number={1},
>  pages={225--240},
>  issn={0921-5034},
>  year={2003},
>  publisher={Rodopi}
> }
> They resort to some small corpora of transcribed spoken French, and remark
> that they know about some usages that are not attested in these corpora.
> --
>        Mike Maxwell
>        maxwell at umiacs.umd.edu
>        "A library is the best possible imitation, by human beings,
>        of a divine mind, where the whole universe is viewed and
>        understood at the same time... we have invented libraries
>        because we know that we do not have divine powers, but we
>        try to do our best to imitate them." --Umberto Eco
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>



-- 
========================================
Adam Kilgarriff <http://www.kilgarriff.co.uk/>
adam at lexmasterclass.com
Director, Lexical Computing Ltd <http://www.sketchengine.co.uk/>
Visiting Research Fellow, University of Leeds <http://leeds.ac.uk>

*Corpora for all* with the Sketch Engine <http://www.sketchengine.co.uk>
*DANTE: a lexical database for English* <http://www.webdante.com>
========================================