[Corpora-List] labels of COLT files in BNC spoken
Sebastian Hoffmann
sebhoff at es.unizh.ch
Thu Nov 13 13:54:50 UTC 2003
At 1:00 PM +0000 11/13/03, Eric Atwell wrote:
>Lou,
>thanks for this expert clarification.
>Demo chatbots trained with a variety of BNC files are now on my web-page
>http://www.comp.leeds.ac.uk/eric/ and we can add more ....
>
>- I have a follow-up question: can you suggest any specific BNC spoken
> files which illustrate particularly "interesting" / idiosyncratic
> language use? For example, the BNC file with the most swearing? :)
> We want to identify a selection of "unusual" files, to train
> a collection of noticeably different chatbots.
>
>thanks
>
>Eric
>
Eric,
Here's some output from BNCweb which will probably help you with your
search for the text with the most swearing - however, I didn't spend
much time compiling the list of "bad words"... ;-)
Your query
"<stext>#((fuck|fucks|fucking|fucked|shit|arsehole|bastard|cunt|dickhead|bitch|prick))"
returned 4032 matches in 144 different texts
It was most frequently found in the following files (only texts with
at least three occurrences are considered)
Name of Text | Number of words | Number of hits | Freq. pmw
KE5          |           5,121 |             92 |  17965.24
KDA          |          75,783 |          1,098 |  14488.74
KP9          |           6,963 |             71 |  10196.75
KD9          |          13,908 |            124 |   8915.73
KE1          |          21,001 |            180 |   8571.02
KR2          |           8,090 |             69 |   8529.05
KPH          |          12,070 |             75 |   6213.75
KPT          |           7,553 |             41 |   5428.31
KDN          |          46,326 |            251 |   5418.12
KP4          |          34,712 |            182 |   5243.14
KCU          |          53,859 |            279 |   5180.19
KPP          |           8,112 |             42 |   5177.51
KNV          |           7,853 |             37 |   4711.58
KP7          |           1,938 |              9 |   4643.96
KSU          |           2,388 |             11 |   4606.37
KB4          |             902 |              4 |   4434.59
KR1          |           5,453 |             24 |   4401.25
KP0          |           7,869 |             34 |   4320.75
KSP          |           1,543 |              6 |   3888.53
KPG          |          45,229 |            145 |   3205.91
Your query was least frequently found in the following files (only
texts with at least one occurrence are considered)
Name of Text | Number of words | Number of hits | Freq. pmw
KRT          |         158,430 |              1 |      6.31
KCT          |         104,104 |              1 |      9.61
KBW          |         123,017 |              2 |     16.26
KDM          |         115,661 |              2 |     17.29
KBH          |          51,340 |              1 |     19.48
KC2          |          47,809 |              1 |     20.92
KS7          |          43,335 |              1 |     23.08
KBB          |          81,085 |              2 |     24.67
KDV          |          29,392 |              1 |     34.02
KCS          |          25,055 |              1 |     39.91
FUK          |          20,220 |              1 |     49.46
KR0          |          20,183 |              1 |     49.55
JYN          |          19,468 |              1 |     51.37
KB2          |          37,597 |              2 |     53.20
KBF          |         111,948 |              6 |     53.60
KP1          |          70,999 |              4 |     56.34
KDJ          |          17,227 |              1 |     58.05
K6W          |          17,142 |              1 |     58.34
FUL          |          16,591 |              1 |     60.27
HMA          |          16,298 |              1 |     61.36
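In case it helps when comparing files of very different sizes: the "Freq. pmw" column reads as hits per million words, i.e. hits divided by the number of words, times 1,000,000. The following is only a minimal Python sketch under that assumption; the function name freq_pmw is illustrative and not part of BNCweb, and the figures are copied from the two tables above.

```python
def freq_pmw(hits: int, words: int) -> float:
    """Normalised frequency: hits per million words, rounded to two decimals.

    Assumes "Freq. pmw" means hits / words * 1,000,000.
    """
    if words <= 0:
        raise ValueError("words must be positive")
    return round(hits / words * 1_000_000, 2)


# (hits, words) pairs copied from the tables above.
rows = {
    "KE5": (92, 5_121),
    "KDA": (1_098, 75_783),
    "KRT": (1, 158_430),
}

for name, (hits, words) in rows.items():
    print(name, freq_pmw(hits, words))
```

Note that very short files (KB4 has only 902 words) reach a high normalised frequency with a handful of hits, which is presumably why the first list is restricted to texts with at least three occurrences.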
As the following list shows, "fucking" alone accounts for more than
50% of all instances:
There are 11 types and 4032 tokens in your sorted query result
No. | Lexical item | No. of occurrences | Percent
  1 | fucking      |               2162 |  53.62%
  2 | shit         |                701 |  17.39%
  3 | fuck         |                579 |  14.36%
  4 | bastard      |                198 |   4.91%
  5 | bitch        |                138 |   3.42%
  6 | cunt         |                 95 |   2.36%
  7 | fucked       |                 63 |   1.56%
  8 | prick        |                 34 |   0.84%
  9 | arsehole     |                 29 |   0.72%
 10 | dickhead     |                 23 |   0.57%
 11 | fucks        |                 10 |   0.25%
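The "Percent" column can be re-derived from the occurrence counts alone, which is a quick way to check that the list is complete. This is a small Python sketch, assuming each percentage is simply the item's count divided by the total number of tokens; the helper name shares is hypothetical, and the counts are copied from the list above in rank order.

```python
# Occurrence counts from the "No. of occurrences" column, ranks 1 to 11.
occurrences = [2162, 701, 579, 198, 138, 95, 63, 34, 29, 23, 10]


def shares(counts):
    """Return each count as a percentage of the total, rounded to two decimals."""
    total = sum(counts)
    return [round(100 * n / total, 2) for n in counts]


total = sum(occurrences)
percentages = shares(occurrences)

print("tokens:", total)
for rank, (n, pct) in enumerate(zip(occurrences, percentages), start=1):
    print(f"{rank:>2} {n:>5} {pct:6.2f}%")
```

The counts sum to the 4032 tokens reported for the query, so no lexical item is missing from the list.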
If you'd like me to compile similar information for different lists
of lexical items, just let me know.
Best,
Sebastian
--
Sebastian Hoffmann
Englisches Seminar der Univ. Zürich
Plattenstrasse 47
CH-8032 Zürich
Tel: +41-1-634 3551
Fax: +41-1-634 4908