[Corpora-List] Do we still need language corpora?

Alberto Simões albie at alfarrabio.di.uminho.pt
Fri Feb 4 16:36:32 UTC 2011

Yes, we do!

On 04/02/2011 14:53, Janne Bondi Johannessen wrote:
> Dear all.
> I have been responsible for developing many corpora at the University of
> Oslo, and I can safely say that hardly any of them has
> any competition from the web. Leaving aside the question of user
> interface, there are many features that are not present in web
> documents, and that are important for users (linguists, text researchers
> and language technologists). Here are some:
> - spoken language
> - dialects
> - speech situations
> - dialogue
> - source and translated texts
> - free choice of text types and genres
> - grammatical annotation (and other linguistic annotation)
> - background information on the text producers (age, gender, mother
> tongue, place of birth, place of residence, education, etc.)
> For more information on our corpora, see:
> http://www.hf.uio.no/iln/english/about/organization/text-laboratory/
> Best wishes,
> Janne Bondi Johannessen
> 2011/2/4 Mark Davies <Mark_Davies at byu.edu>
>     Martin,
>     I would imagine that one motivation for the question is the
>     availability of "corpora" like Google/Web and Google Books. Of
>     course, one needs to distinguish between:
>     corpus = textual corpus (i.e. words and sentences + metadata)
>     and
>     corpus = textual corpus + architecture and interface for accessing
>     the information
>     Many wonderful textual corpora are "trapped" inside an architecture
>     and interface that don't allow users to do much with them. As
>     everyone dealing with "Web as Corpus" knows, effectively and
>     efficiently using Web/Google/Books data -- especially via the native
>     Google interface -- is a real challenge.
>     Two pages that might be relevant:
>     http://corpus.byu.edu/coha/compare-googleBooks.asp
>     http://corpus.byu.edu/coca/compare-google.asp
>     Best,
>     Mark D.
>     ============================================
>     Mark Davies
>     Professor of (Corpus) Linguistics
>     Brigham Young University
>     (phone) 801-422-9168 / (fax) 801-422-0906
>     Web: http://davies-linguistics.byu.edu
>     ** Corpus design and use // Linguistic databases **
>     ** Historical linguistics // Language variation **
>     ** English, Spanish, and Portuguese **
>     ============================================
>     _______________________________________________
>     Corpora mailing list
>     Corpora at uib.no <mailto:Corpora at uib.no>
>     http://mailman.uib.no/listinfo/corpora
> --
> Janne Bondi Johannessen
> Professor, The Text Laboratory, ILN, http://www.hf.uio.no/tekstlab/
> President, NEALT, http://omilia.uio.no/nealt/
> University of Oslo
> P.O.Box 1102 Blindern, N-0317 Oslo, Norway
> Tel: +47 22 85 68 14, mob.: +47 928 966 34

Alberto Simões
