[Corpora-List] Do we still need language corpora?

Janne Bondi Johannessen jannebj at iln.uio.no
Fri Feb 4 14:53:12 UTC 2011


Dear all.
I have been responsible for developing many corpora at the University of
Oslo, and I can safely say that hardly any of them faces competition from
the web. Leaving aside the question of user interface, many features that
are important for users (linguists, text researchers and language
technologists) are not present in web documents. Here are some:

- spoken language
- dialects
- speech situations
- dialogue
- source and translated texts
- free choice of text types and genres
- grammatical annotation (and other linguistic annotation)
- background information on the text producers (age, gender, mother tongue,
place of birth, place of residence, education, etc.)

For more information on our corpora, see:
http://www.hf.uio.no/iln/english/about/organization/text-laboratory/

Best wishes,
Janne Bondi Johannessen



2011/2/4 Mark Davies <Mark_Davies at byu.edu>

> Martin,
>
> I would imagine that one motivation for the question is the availability of
> "corpora" like Google/Web and Google Books. Of course, one needs to
> distinguish between:
>
> corpus = textual corpus (i.e. words and sentences + metadata)
> and
> corpus = textual corpus + architecture and interface for accessing the
> information
>
> Many wonderful textual corpora are "trapped" inside an architecture and
> interface that don't allow users to do much with them. As everyone dealing
> with "Web as Corpus" knows, effectively and efficiently using
> Web/Google/Books data -- especially via the native Google interface -- is a
> real challenge.
>
> Two pages that might be relevant:
>
> http://corpus.byu.edu/coha/compare-googleBooks.asp
>
> http://corpus.byu.edu/coca/compare-google.asp
>
> Best,
>
> Mark D.
>
> ============================================
> Mark Davies
> Professor of (Corpus) Linguistics
> Brigham Young University
> (phone) 801-422-9168 / (fax) 801-422-0906
> Web: http://davies-linguistics.byu.edu
>
> ** Corpus design and use // Linguistic databases **
> ** Historical linguistics // Language variation **
> ** English, Spanish, and Portuguese **
> ============================================
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>



-- 
Janne Bondi Johannessen
Professor, The Text Laboratory, ILN, http://www.hf.uio.no/tekstlab/
President, NEALT, http://omilia.uio.no/nealt/
University of Oslo
P.O.Box 1102 Blindern, N-0317 Oslo, Norway
Tel: +47 22 85 68 14, mob.: +47 928 966 34