[Corpora-List] "Tajweed" in English dictionaries and corpora
Michael Rundell
michael.rundell at lexmasterclass.com
Sat Mar 2 14:48:56 UTC 2013
Eric, Mark
A couple of points:
(1) corpus support for tajwid/tajweed
You're right - BNC isn't helpful here. But most UK publishers would be
using corpora 10 or 20 times larger than BNC now (and more up to date -
BNC's constituent texts are 22 years old minimum - and that could have a
bearing here). We have 'reasonable' evidence for tajwid/tajweed, and good
evidence for 'fiqh', so that's going in too.
(2) dictionary inclusion criteria.
You mention the OED's (see also this flowchart:
http://oxforddictionaries.com/words/how-a-new-word-enters-an-oxford-dictionary),
but none of this is quite fit for purpose when we move from print to
digital. Traditional criteria were established in the context of space being
(very) finite, so you had to be selective (and try to achieve balance across
domains). Now all bets are off, and we're trying to come up with robust new
criteria. My starting point would be rather to ask: is there a good reason
for NOT including this (then look at some of the reasons you might want to
exclude things). The Oxford flowchart implies an old-style 'gatekeeper'
role, and its 'research process' seems to entail a long probationary period
where a word has to prove itself. But I doubt this would impress younger
searchers: in the old days, if a word wasn't in 'the dictionary', people
might say 'well, it can't be a good word then'. Now they're more likely to
say 'well, this can't be a good dictionary then'.
Michael
----- Original Message -----
From: "Mark Davies" <Mark_Davies at byu.edu>
To: "CORPORA discussion forum" <corpora at uib.no>; "Eric Atwell"
<E.S.Atwell at leeds.ac.uk>; "Michael Rundell"
<michael.rundell at lexmasterclass.com>
Sent: Friday, March 01, 2013 11:10 PM
Subject: RE: [Corpora-List] "Tajweed" in English dictionaries and corpora
Eric Atwell wrote:
>> Michael, you said "Thanks for 'tajweed', which corpus data suggests we
>> should include" - what corpus data? Presumably not the BNC.
The upcoming two-billion-word corpus of GLObal Web-Based English (GloWbE --
available in May 2013), has about 300 tokens for "tajweed":
http://corpus2.byu.edu/web/?c=web&q=21412274
Not surprisingly, it is most common (by normalized frequency) in Pakistan,
then Bangladesh, then Tanzania, and then the UK. Both the UK and US
components of GloWbE have about 400 million words, but the UK has 101 tokens
of "tajweed", while the US only has 1 token.
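
To make the normalization explicit, here is a minimal sketch in Python of
the per-million-words calculation behind the UK/US comparison. It assumes
each component is exactly 400 million words (the message only says "about"),
and the name per_million is illustrative, not part of any corpus tool.

```python
def per_million(tokens: int, corpus_words: int) -> float:
    """Normalized frequency: occurrences per million words of text."""
    return tokens * 1_000_000 / corpus_words


# Assumption: both components are rounded to 400 million words.
SECTION_WORDS = 400_000_000

# Raw token counts for "tajweed" as quoted in the message.
counts = {"UK": 101, "US": 1}

# Convert each raw count to a rate so sections can be compared directly.
rates = {country: per_million(n, SECTION_WORDS) for country, n in counts.items()}

for country, rate in rates.items():
    print(f"{country}: {rate:.4f} per million words")
```

With equal-sized components the raw counts already compare fairly, but the
normalized rate is what allows comparison against components of different
sizes, such as those for Pakistan, Bangladesh or Tanzania.
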
One of the advantages of GloWbE (when it is available -- soon) will be the
ability to easily compare frequency across twenty different countries, as we
see in the "tajweed" example.
Mark D.
============================================
Mark Davies
Professor of Linguistics / Brigham Young University
http://davies-linguistics.byu.edu/
** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
=============================================