[Corpora-List] British corpus containing instances of profanity?

Mark Davies Mark_Davies at byu.edu
Tue Feb 25 21:58:33 UTC 2014


You might look at just the GB (Great Britain) portion of the 1.9-billion-word GloWbE corpus (Global Web-Based English): http://corpus2.byu.edu/glowbe/.

The GB portion of GloWbE contains about 400 million words.

Your desiderata:

- Be very recent (after 2000), since the phenomenon on which I focus is a relatively new one
GloWbE-UK is from 2012-2013

- Focus on the U.K.
Yep

- Be composed of naturally occurring conversations to be able to grasp instances of profanity
Mainly blogs; minimal if any censorship

- Provide at least basic information on the informants (such as age, gender, location, socio-economic situation, ethnic origin...)
- Provide contextual information regarding the conversation and the link(s) between speakers
There are direct links to the actual web pages (blogs); you'd have to look at the "About" (etc.) pages of the blogs to see the background of the speakers.

It's not a perfect source, but it might work...

Best,

Mark Davies

============================================
Mark Davies
Professor of Linguistics / Brigham Young University
http://davies-linguistics.byu.edu/
** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================

From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of Michaël GAUTHIER
Sent: Tuesday, February 25, 2014 7:21 AM
To: Corpora at uib.no
Subject: [Corpora-List] British corpus containing instances of profanity?

Dear all,

I am contacting the whole CORPORA list to try to get information on a corpus which could suit my needs, because up to now, all my efforts to find corresponding ones have been in vain.

I am a PhD student investigating the use and perception of profanity among British speakers. One immediate difficulty is that instances of profanity are not easy to record, but there are other factors I need to take into consideration. Ideally, the corpus would have to:

- Be very recent (after 2000), since the phenomenon on which I focus is a relatively new one
- Focus on the U.K.
- Be composed of naturally occurring conversations to be able to grasp instances of profanity
- Provide at least basic information on the informants (such as age, gender, location, socio-economic situation, ethnic origin...)
- Provide contextual information regarding the conversation and the link(s) between speakers

I know this is a lot to ask, but these requirements are the ones I have in the most ideal situation. As I said, all the corpora I have been reviewing up to now do not correspond. A short list of the main corpora I have reviewed would be: the BNC, Bank of English, Collins Corpus (this one seems great, with 5 billion words, but it is apparently only available to the lexicographers from Collins, I contacted them but got no answer...), COLT, CANCODE, Longman British Spoken Corpus, Limerick Corpus, Scottish Corpus of texts and speech, IViE, London-Lund Corpus of Spoken English, Cambridge English Corpus (same thing as the Collins Corpus...), International Corpus of English, Diachronic Corpus of Present-day Spoken English, British English Speech Dat.

That is it for the main ones, but as I said, none of them corresponded perfectly. I would therefore be more than happy if someone could point to a corpus I may have missed, even if it does not perfectly correspond. At this point, any new hint would be very welcome. If nothing comes up, I think I will have to "sacrifice" some of my requirements to be able to carry out this study. It is a pilot study, so that would not be too tragic a situation, but if I have the opportunity to find something which corresponds perfectly, so much the better!

Sorry for the length of this email, I just tried to be as clear as possible... I hope I was...

Thank you in advance for any idea/hint/plan/solution/revelation any one of you may have!

Best regards

Michaël GAUTHIER
Université Lumière Lyon 2
France