[Corpora-List] "Tajweed" in English dictionaries and corpora

Michael Rundell michael.rundell at lexmasterclass.com
Sat Mar 2 14:48:56 UTC 2013


Eric, Mark

A couple of points:
(1) corpus support for tajwid/tajweed
You're right - BNC isn't helpful here. But most UK publishers would be 
using corpora 10 or 20 times larger than BNC now (and more up to date - 
BNC's constituent texts are 22 years old minimum - and that could have a 
bearing here). We have 'reasonable' evidence for tajwid/tajweed, and good 
evidence for 'fiqh', so that's going in too.

(2) dictionary inclusion criteria.
You mention the OED's (see also this flowchart: 
http://oxforddictionaries.com/words/how-a-new-word-enters-an-oxford-dictionary), 
but none of this is quite fit for purpose when we move from print to 
digital. Traditional criteria were established in the context of space being 
(very) finite, so you had to be selective (and try to achieve balance across 
domains). Now all bets are off, and we're trying to come up with robust new 
criteria. My starting point would be rather to ask: is there a good reason 
for NOT including this (then look at some of the reasons you might want to 
exclude things). The Oxford flowchart implies an old-style 'gatekeeper' 
role, and its 'research process' seems to entail a long probationary period 
where a word has to prove itself. But I doubt this would impress younger 
searchers: in the old days, if a word wasn't in 'the dictionary', people 
might say 'well, it can't be a good word then'. Now they're more likely to 
say 'well, this can't be a good dictionary then'.

Michael


----- Original Message ----- 
From: "Mark Davies" <Mark_Davies at byu.edu>
To: "CORPORA discussion forum" <corpora at uib.no>; "Eric Atwell" 
<E.S.Atwell at leeds.ac.uk>; "Michael Rundell" 
<michael.rundell at lexmasterclass.com>
Sent: Friday, March 01, 2013 11:10 PM
Subject: RE: [Corpora-List] "Tajweed" in English dictionaries and corpora


Eric Atwell wrote:

>> Michael, you said "Thanks for 'tajweed', which corpus data suggests we 
>> should include" - what corpus data?  Presumably not the BNC.

The upcoming two billion word corpus of GLObal Web-Based English (GloWbE -- 
available in May 2013), has about 300 tokens for "tajweed":

http://corpus2.byu.edu/web/?c=web&q=21412274

Not surprisingly, it is the most common (normalized frequency) in Pakistan, 
then Bangladesh, then Tanzania, and then the UK. Both the UK and US 
components of GloWbE have about 400 million words, but the UK has 101 tokens 
of "tajweed", while the US only has 1 token.
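
As a rough illustration of the normalized comparison above, here is a 
minimal Python sketch. It assumes both components are exactly 400 million 
words (the message only says "about"), and the name per_million is purely 
illustrative, not part of any corpus tool.

```python
def per_million(tokens: int, corpus_size: int) -> float:
    """Normalized frequency: occurrences per million words of corpus."""
    return tokens * 1_000_000 / corpus_size


# Assumption: both components treated as exactly 400 million words.
UK_SIZE = 400_000_000
US_SIZE = 400_000_000

uk = per_million(101, UK_SIZE)  # 101 tokens of "tajweed" in the UK component
us = per_million(1, US_SIZE)    # 1 token in the US component

print(f"UK: {uk:.4f} per million words")
print(f"US: {us:.4f} per million words")
print(f"UK/US ratio: {uk / us:.0f}")
```

Because the two components are taken to be the same size, the ratio of 
normalized frequencies equals the ratio of raw counts; normalization 
matters more when comparing countries whose components differ in size.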

One of the advantages of GloWbE (when it is available -- soon) will be the 
ability to easily compare frequency across twenty different countries, as we 
see in the "tajweed" example.

Mark D.

============================================
Mark Davies
Professor of Linguistics / Brigham Young University
http://davies-linguistics.byu.edu/

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================= 

