[Corpora-List] labels of COLT files in BNC spoken

Sebastian Hoffmann sebhoff at es.unizh.ch
Thu Nov 13 13:54:50 UTC 2003


At 1:00 PM +0000 11/13/03, Eric Atwell wrote:
>Lou,
>thanks for this expert clarification.
>Demo chatbots trained with a variety of BNC files are now on my web-page
>http://www.comp.leeds.ac.uk/eric/  and we can add more ....
>
>- I have a follow-up question:  can you suggest any specific BNC spoken
>   files which illustrate particularly "interesting" / idiosyncratic
>   language use?  For example, the BNC file with the most swearing? :)
>   We want to identify a selection of "unusual" files, to train
>   a collection of noticeably different chatbots.
>
>thanks
>
>Eric
>

Eric,
Here's some output from BNCweb that should help you find the text
with the most swearing. Note, however, that I didn't spend much time
compiling the list of "bad words"... ;-)

Your query "<stext>#((fuck|fucks|fucking|fucked|shit|arsehole|
bastard|cunt|dickhead|bitch|prick))" returned 4032 matches in 144
different texts

It was most frequently found in the following files (only texts with
at least three occurrences are considered)

  Name of Text | Number of words | Number of hits | Freq. pmw
  KE5	5,121	92	17965.24
  KDA	75,783	1,098	14488.74
  KP9	6,963	71	10196.75
  KD9	13,908	124	8915.73
  KE1	21,001	180	8571.02
  KR2	8,090	69	8529.05
  KPH	12,070	75	6213.75
  KPT	7,553	41	5428.31
  KDN	46,326	251	5418.12
  KP4	34,712	182	5243.14
  KCU	53,859	279	5180.19
  KPP	8,112	42	5177.51
  KNV	7,853	37	4711.58
  KP7	1,938	9	4643.96
  KSU	2,388	11	4606.37
  KB4	902	4	4434.59
  KR1	5,453	24	4401.25
  KP0	7,869	34	4320.75
  KSP	1,543	6	3888.53
  KPG	45,229	145	3205.91
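
Note that the ranking above is by normalized frequency ("Freq. pmw", hits per million words), not by raw hit count: KDA has by far the most hits (1,098) but comes second because it is a much longer text than KE5. If you want to recompute or re-rank the figures yourself, a minimal Python sketch could look like the following. The function name `freq_pmw` and the `rows` list are my own illustration, not part of BNCweb, and I am assuming that pmw is simply hits divided by words, times one million, which agrees with the figures in the table.

```python
def freq_pmw(hits: int, words: int) -> float:
    """Return the number of hits per million words of text."""
    if words <= 0:
        raise ValueError("words must be positive")
    return hits / words * 1_000_000


# (text id, number of words, number of hits), copied from the table above
rows = [
    ("KDA", 75783, 1098),
    ("KE5", 5121, 92),
    ("KP9", 6963, 71),
]

# Rank by normalized frequency rather than by raw hit count
ranked = sorted(rows, key=lambda r: freq_pmw(r[2], r[1]), reverse=True)

for name, words, hits in ranked:
    print(f"{name}\t{words:,}\t{hits:,}\t{freq_pmw(hits, words):.2f}")
```

For very short texts (e.g. KB4 with 902 words and 4 hits) the normalized figure rests on very few occurrences, so a minimum-hits threshold like the one BNCweb applies is worth keeping.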

Your query was least frequently found in the following files (only
texts with at least one occurrence are considered)

  Name of Text | Number of words | Number of hits | Freq. pmw
  KRT	158,430	1	6.31
  KCT	104,104	1	9.61
  KBW	123,017	2	16.26
  KDM	115,661	2	17.29
  KBH	51,340	1	19.48
  KC2	47,809	1	20.92
  KS7	43,335	1	23.08
  KBB	81,085	2	24.67
  KDV	29,392	1	34.02
  KCS	25,055	1	39.91
  FUK	20,220	1	49.46
  KR0	20,183	1	49.55
  JYN	19,468	1	51.37
  KB2	37,597	2	53.20
  KBF	111,948	6	53.60
  KP1	70,999	4	56.34
  KDJ	17,227	1	58.05
  K6W	17,142	1	58.34
  FUL	16,591	1	60.27
  HMA	16,298	1	61.36

As the following list shows, more than half of all instances (53.62%)
are accounted for by "fucking":

There are 11 types and 4032 tokens in your sorted query result
No. | Lexical item | No. of occurrences | Percent
1	fucking	2162	53.62%
2	shit	701	17.39%
3	fuck	579	14.36%
4	bastard	198	4.91%
5	bitch	138	3.42%
6	cunt	95	2.36%
7	fucked	63	1.56%
8	prick	34	0.84%
9	arsehole	29	0.72%
10	dickhead	23	0.57%
11	fucks	10	0.25%
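
The percentages in this breakdown are each item's share of the 4032 tokens. Should you want to produce the same kind of breakdown for your own word lists, here is a rough Python sketch. The counts are copied from the list above; the variable names are hypothetical and this is not how BNCweb itself computes the table.

```python
# Occurrences per lexical item, copied from the BNCweb breakdown above
counts = {
    "fucking": 2162,
    "shit": 701,
    "fuck": 579,
    "bastard": 198,
    "bitch": 138,
    "cunt": 95,
    "fucked": 63,
    "prick": 34,
    "arsehole": 29,
    "dickhead": 23,
    "fucks": 10,
}

total = sum(counts.values())  # number of tokens
ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)

print(f"There are {len(counts)} types and {total} tokens")
for rank, (word, n) in enumerate(ranked, start=1):
    # Share of all tokens, as a percentage with two decimals
    print(f"{rank}\t{word}\t{n}\t{n / total:.2%}")
```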

If you'd like me to compile similar information for different lists
of lexical items, just let me know.

Best,
Sebastian


--


Sebastian Hoffmann
Englisches Seminar der Univ. Zürich
Plattenstrasse 47
CH-8032 Zürich
Tel: +41-1-634 3551
Fax: +41-1-634 4908


