[Corpora-List] Re: Corpora Digest, Vol 23, Issue 17
pinuccia
g.balossi at libero.it
Sun May 17 16:44:42 UTC 2009
Dear Prof. Nogara,
I have just saved your file, which I will now open. Thank you very much.
G Balossi
-----Original Message-----
From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On behalf of
corpora-request at uib.no
Sent: Sunday, 17 May 2009 15:00
To: corpora at uib.no
Subject: Corpora Digest, Vol 23, Issue 17
Today's Topics:
1. Temporarily away (Arjan Loeffen)
2. Re: extrapolating to 1 million (James L. Fidelholtz)
3. Re: extrapolating to 1 million (Oliver Mason)
4. CFP SIGIR 2009 Workshop on the Future of IR Evaluation - New
Deadline (Jaap Kamps)
5. Re: extrapolating to 1 million (Orion Montoya)
6. Software engineer- National Centre for Text Mining,
University of Manchester (Sophia Ananiadou)
----------------------------------------------------------------------
Message: 1
Date: Sat, 16 May 2009 15:19:33 +0200 (CEST)
From: Arjan Loeffen <arjan.loeffen at validvision.com>
Subject: [Corpora-List] Temporarily away
To: <corpora at uib.no>
I am away until 24 May 2009 inclusive.
For urgent questions, please contact Maarten Kroon,
maarten.kroon at validvision.com, or send an SMS to 06-12918997.
Regards,
Arjan
------------------------------
Message: 2
Date: Sat, 16 May 2009 12:30:27 -0500
From: "James L. Fidelholtz" <fidelholtz at gmail.com>
Subject: Re: [Corpora-List] extrapolating to 1 million
To: Adam Kilgarriff <adam at lexmasterclass.com>
Cc: corpora at uib.no
Hi, Adam,
Well, I think and believe, so to speak, that that may depend on just how
large the numbers are: if we're talking about *the* most frequent words (pun
intended) (eg, so-called 'stop words'), we can be fairly sure of finding at
least similar frequencies. I'll grant you some exceptions for, say, that
book written so as not to have any e's. If that's your corpus (and other
similar texts), then 'the' won't appear, and so won't be the commonest word.
For any other non-artificial corpus, it would be surprising indeed if 'the'
weren't the most, or at worst second-most, common word (of course, talking
about English corpora). On the other hand, we would *expect* a decent amount
of variation in the place of, say, the twentieth word, and not be surprised
if in another biggish corpus it came in 15th or 25th. If it came in, say,
1000th in a second corpus, we could be damned sure that it's a very 'bursty'
sort of word, and we just got unlucky (or lucky?) in our first corpus.
Likewise, the further down the list we go, the bigger we would expect the
variation to be.
I'm not disputing, really, what you say. It's partly a question of focus,
perhaps, and the question of 'burstiness' is obviously very important in,
say, dividing up subcorpora by the field they belong to. I'm just saying, I
guess, that linguists may very well study 'stop words' in depth, while
corpus linguists are most unlikely to study them, since they usually look
for other things in their corpora.
Jim
On 5/15/09, Adam Kilgarriff <adam at lexmasterclass.com> wrote:
>
> Sorry, Jim, the law of large numbers is not relevant, as it assumes
> independent effects. In language, effects are never independent (for
> details see "Language is never, ever, ever random"
> <http://kilgarriff.co.uk/Publications/2005-K-lineer.pdf>).
>
> So the short answer to Tina's question -
>
> "Could you tell me what the frequency would be in a corpus of 1 million if
> I extrapolated from the frequency of 20 in a corpus of 300K?"
>
> is "no". It all depends on the structure and composition of the 300,000-word
> corpus, the structure and composition of the (probably hypothetical) 1M
> corpus, how 'bursty' the word is, and how the two corpora relate to each
> other. (For burstiness, see Ken Church's "Empirical estimates of
> adaptation: the chance of two noriegas is closer to p/2 than p^2".) If the
> word in question is term-like and all 20 occurrences come from one doc, then
> it is likely that the frequency in the 1M corpus will be 20 (if we include
> the doc in the first corpus) or 0 (if we don't).
>
> Extrapolation of frequencies from corpora is a risky business, highly
> dependent on the sampling procedure for the corpus and the nature of the
> term for which the frequency is being extrapolated. It's generally safer to
> extrapolate on the basis of document frequencies (eg, how many docs the
> word/term appears in) than word/term frequencies, though still, think hard
> about the nature of the corpus and its claims to representativeness.
>
> Adam
>
> 2009/5/15 James L. Fidelholtz <fidelholtz at gmail.com>
>
>
>> Hi, Lluis, Tina, & Al.,
>>
>> Firstly, the math is a little kinky (though Lluis is right--it's roughly
>> OK): it should be 20 * 1M/300K, or 66.7.
>>
>> The point Lluis makes about the corpus containing more rarer words as we
>> augment the size of the corpus is, of course, correct. Nevertheless (here I
>> haven't done much work, but I just appeal to common sense and the 'law of
>> large numbers' (not sure this is relevant, but 300K is a *pretty* large
>> number)), we should expect, even with more obscure words to muddy up the
>> picture, that the percentage of *common* words in the 300K corpus should be
>> roughly the same in a corpus of 1M words, especially (but not quite only,
>> for the more common words) if the corpora are selected from similar
>> universes. Naturally, different selection criteria might affect even very
>> common words, and it has been shown many times that the 'rarer' the words
>> are, the more variable the exact percentage can be, but I wouldn't expect a
>> priori that ever bigger corpora should lower the percentages of common (or
>> even necessarily of rare) words. Indeed, for the hapax legomena, say, that
>> enter in the new 'complement' to the corpus, their percentage even
>> *increases* from 0 to 0.0001%, correspondingly more for the other new
>> words.
>>
>> Of course there can always be variations in the percentages. But, equally
>> always, we *expect* that our sampling of the universe will give us for a
>> word W something reasonably close to its real percentage frequency. And
>> that when we repeat the process (or augment it), we will again get
>> reasonably close to its 'real' frequency, so that we expect both
>> frequencies to be close to each other. The real world often lets us down
>> (and don't bet the family farm on any of this), but I guess statisticians
>> tend to be optimists in this regard. And mathematicians even more (after
>> all, we have an edge, and so tend to gain 5 family farms for each one we
>> lose). In this sense, think: bell curve, which, with the appropriate
>> tweaks, is the exact representation of what our expectations should be in
>> a particular case.
>>
>> Jim
>>
>> On 5/15/09, Lluís Padró <padro at lsi.upc.edu> wrote:
>>
>>> Tina Waldman wrote:
>>>
>>> Dear members
>>> Could you tell me what the frequency would be in a corpus of 1 million if
>>> I extrapolated from the frequency of 20 in a corpus of 300K?
>>>
>>> Would it be 60 - 20 x 3?
>>>
>>>
>>> As a rough estimate, that may work.
>>>
>>> Nevertheless, due to Zipf's law, when you go from 300K to 1M you're
>>> getting lots of previously unseen words with very low frequencies, but
>>> they modify the probability distribution.
>>>
>>> For this and other reasons, relative frequencies tend to be less stable
>>> than such a rough estimate suggests when you use larger corpora.
>>>
>>> You can find out more about it in:
>>> Baroni M., Evert S., "Words and echoes: assessing and mitigating the
>>> non-randomness problem in word frequency distribution modeling". In:
>>> Proceedings of ACL 2007, Prague, 23rd-30th June 2007. East Stroudsburg,
>>> PA: ACL, pp. 904-911.
>>>
>>> best,
>>>
>>>
>>> --
>>> ------------------------------
>>> *Lluís Padró* Despatx ?-S112, Campus Nord UPC, C/ Jordi Girona 1-3,
>>> 08034 Barcelona, Spain
>>> Tel: +34 934 134 015 Fax: +34 934 137 833
>>> padro at lsi.upc.edu www.lsi.upc.edu/~padro
>>> ------------------------------
>>> UNIVERSITAT POLITÈCNICA DE CATALUNYA Dept. Llenguatges i Sistemes
>>> Informàtics TALP Research Center
>>> ------------------------------
>>>
>>
>> _______________________________________________ Corpora mailing list
>> Corpora at uib.no
>> http://mailman.uib.no/listinfo/corpora
>>
>>
>>
>
>
> --
> James L. Fidelholtz
> Posgrado en Ciencias del Lenguaje
> Instituto de Ciencias Sociales y Humanidades
> Benemérita Universidad Autónoma de Puebla, MÉXICO
>
>
>
>
> --
> ================================================
> Adam Kilgarriff
> http://www.kilgarriff.co.uk
> Lexical Computing Ltd http://www.sketchengine.co.uk
> Lexicography MasterClass Ltd http://www.lexmasterclass.com
> Universities of Leeds and Sussex adam at lexmasterclass.com
> ================================================
>
>
--
James L. Fidelholtz
Posgrado en Ciencias del Lenguaje
Instituto de Ciencias Sociales y Humanidades
Benemérita Universidad Autónoma de Puebla, MÉXICO
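
The back-of-the-envelope arithmetic in this thread, and Adam's
document-frequency caveat, can be sketched in a few lines of Python (a toy
illustration; all numbers are hypothetical, not from any real corpus):

```python
def extrapolate_naive(freq, size, target_size):
    """Scale a raw count linearly with corpus size (the naive estimate)."""
    return freq * target_size / size

# Tina's numbers: 20 occurrences in 300K tokens, scaled to 1M tokens.
print(extrapolate_naive(20, 300_000, 1_000_000))  # roughly 66.7, not 60

# Adam's caveat, restated with document frequencies: if all 20 occurrences
# come from a single document in a (hypothetical) 600-document corpus, the
# safer quantity to scale is "appears in 1 of 600 documents".
print(extrapolate_naive(1, 600, 2_000))  # roughly 3.3 documents out of 2,000
```

Even the document-frequency estimate assumes the two corpora were sampled the
same way; as the thread stresses, no linear scaling can substitute for knowing
how the corpora were built.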
------------------------------
Message: 3
Date: Sat, 16 May 2009 20:16:26 +0100
From: Oliver Mason <O.Mason at bham.ac.uk>
Subject: Re: [Corpora-List] extrapolating to 1 million
To: "James L. Fidelholtz" <fidelholtz at gmail.com>
Cc: corpora at uib.no
But what's the point in extrapolating? It's scientifically unsound: you are
basically inventing occurrences whose distribution you cannot know. Any
claims based on that are extremely shaky, if not outright invalid.
So the right answer should be: "No, you can't, and you should tell anybody
who asks you to do this that it is not appropriate."
Otherwise, why would we ever have needed to build corpora other than Brown
or LOB? Just extrapolate...
Oliver
--
Dr Oliver Mason
Technical Director of the Centre for Corpus Research
School of English, Drama, and ACS
The University of Birmingham
Birmingham B15 2TT
------------------------------
Message: 4
Date: Sat, 16 May 2009 22:11:58 +0200
From: Jaap Kamps <kamps at science.uva.nl>
Subject: [Corpora-List] CFP SIGIR 2009 Workshop on the Future of IR
Evaluation - New Deadline
To: corpora at uib.no
SIGIR 2009 Workshop on the Future of IR Evaluation
July 23, Boston
http://staff.science.uva.nl/~kamps/ireval/
Submissions due: June 15
Call for Papers
Evaluation is at the core of information retrieval: virtually all
progress owes directly or indirectly to test collections built within
the so-called Cranfield paradigm. In recent years, however, IR
researchers have routinely been pursuing tasks outside the traditional
paradigm, taking a broader view of tasks, users, and context.
There is a fast-moving evolution in content, from traditional static
text to diverse forms of dynamic, collaborative, and multilingual
information sources. Industry, too, is embracing "operational"
evaluation based on the analysis of endless streams of queries and
clicks.
We invite the submission of papers that think outside the box:
- Are you working on an interesting new retrieval task or aspect? Or
on its broader task or user context? Or on a complete system with
novel interface? Or on interactive/adaptive search? Or ...?
Please explain why this is of interest, and what would be an
appropriate way of evaluating.
- Do you feel that the current evaluation tools fail to do justice to
your research? Is there a crucial aspect missing? Or are you
interested in specific, rare, phenomena that have little impact on
the average scores? Or ...? Please explain why this is of
interest, and what would be an appropriate way of evaluating.
- Do you have concrete ideas how to evaluate such a novel IR task? Or
ideas for new types of experimental or operational evaluation? Or
new measures or ways of re-using existing data? Or ...? Please
explain why this is of interest, and what would be an appropriate
way of evaluating.
The workshop brings together all stakeholders, ranging from those with
novel evaluation needs, such as a PhD candidate pursuing a new
IR-related problem, to senior IR evaluation experts. Desired outcomes
are insight into how to make IR evaluation more "realistic," and at
least one concrete idea for a retrieval track or task (at CLEF, INEX,
NTCIR, TREC) that would not have happened otherwise.
Help us shape the future of IR evaluation!
- Submit a short 2-page poster or position paper explaining your key
  wishes or key points,
- and take an active part in the discussion at the workshop.
The *revised* deadline is Monday June 15, 2009; further submission
details are on http://staff.science.uva.nl/~kamps/ireval/
Shlomo Geva, INEX & QUT, Australia
Jaap Kamps, INEX & University of Amsterdam, The Netherlands
Carol Peters, CLEF & ISTI-CNR, Italy
Tetsuya Sakai, NTCIR & Microsoft Research Asia, China
Andrew Trotman, INEX & University of Otago, New Zealand
Ellen Voorhees, TREC/TAC & NIST, USA
------------------------------
Message: 5
Date: Sat, 16 May 2009 19:04:46 -0400
From: Orion Montoya <orion at mdcclv.com>
Subject: Re: [Corpora-List] extrapolating to 1 million
To: corpora at uib.no
On May 16, 2009, at 3:16 PM, Oliver Mason wrote:
> But what's the point in extrapolating? It's scientifically unsound...
> So the right answer should be: "No, you can't, and you should tell
> anybody who asks you to do this that this is not appropriate."
I've grown increasingly curious over the years, as people have posted
frequency queries here, about what applications people are using *any*
frequency data for, for which a high level of accuracy is important --
and how a degraded level of accuracy would be detectable.
I don't mean to question or deny the whole premise of frequency info.
I know frequency can be very useful for all kinds of guessing and
deciding and other NLP things. I mean to ask why, and to whom, it
matters whether something shows up as the 10,000th-most-frequent token
in a corpus of a given size, rather than the 15,999th, or the
26,000th. Certainly, the first few tiers matter: top 10, top 100, top
1,000, top 10,000; but it seems to me that the farther you get down
the list -- the farther out on this maybe-logarithmic scale -- the
less meaningful any degree of accuracy becomes.
Given the arbitrariness of what might be included in any given corpus,
any overweening degree of precision seems likely to point to false (or
meaningless) conclusions about the "language," and only really to
reflect the composition of the corpus. This is all the more reason to
heed Adam's advice about document frequency over raw whole-corpus
counts.
Is there a crucial application I am not thinking of?
Yrs,
Orion
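
Orion's intuition about ranks deep in the list can be checked with a small
simulation (a toy sketch, not from the thread; the vocabulary size, sample
sizes, and pure-Zipf assumption are all arbitrary choices):

```python
import random

random.seed(1)
VOCAB = 50_000
# Zipf-like "language": P(word at true rank r) proportional to 1/r.
weights = [1 / (r + 1) for r in range(VOCAB)]

def observed_ranks(n_tokens):
    """Sample n_tokens words and return each word's observed frequency rank."""
    counts = {}
    for w in random.choices(range(VOCAB), weights=weights, k=n_tokens):
        counts[w] = counts.get(w, 0) + 1
    ordered = sorted(counts, key=counts.get, reverse=True)
    return {w: rank for rank, w in enumerate(ordered, start=1)}

# Two independent 300K-token "corpora" from the same distribution.
a = observed_ranks(300_000)
b = observed_ranks(300_000)
for true_rank in (1, 10, 1_000, 10_000):
    w = true_rank - 1  # word whose true rank is true_rank
    print(true_rank, a.get(w), b.get(w))
```

In runs like this, the top-ranked words come out in essentially the same
order in both samples, while words whose true rank is in the thousands tend
to drift substantially between samples, and, further down still, may fail to
appear at all.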
------------------------------
Message: 6
Date: Sun, 17 May 2009 11:03:57 +0100
From: "Sophia Ananiadou" <Sophia.Ananiadou at manchester.ac.uk>
Subject: [Corpora-List] Software engineer- National Centre for Text
Mining, University of Manchester
To: "corpora at uib.no" <corpora at uib.no>
The National Centre for Text Mining (www.nactem.ac.uk) seeks to appoint a
self-motivated and experienced software engineer with expertise in the
design and development of web application user interfaces. The National
Centre for Text Mining, hosted by the School of Computer Science, provides
next-generation text mining services to the community, using natural
language processing techniques to build advanced search systems in a
number of domains. You will be part of a strong and dynamic team of text
miners and software engineers.
You should have an MSc and a good first degree (minimum 2:1) in Computer
Science or Software Engineering; at least 4 years of software development
in web interfaces and/or visualisation; experience working in a Linux or
*NIX environment; demonstrable achievements in developing applications,
web-based interfaces, or visualisation tools; experience in Java/JSP,
JavaScript, PHP, Python, Ajax, C, C++, XHTML, or related technologies; and
experience building data visualisation systems.
The post is available immediately for a period of 2 years.
Further details:
http://www.manchester.ac.uk/_contentlibrary/_vacancies/furtherparticularsmax10mbpdf,154565,en.pdf
Reference: EPS/90598
The closing date for applications is: 31 May 2009
=========================================================
Dr Sophia Ananiadou, Reader in Text Mining, School of Computer Science
Director, National Centre for Text Mining, www.nactem.ac.uk
Manchester Interdisciplinary Biocentre, www.mib.ac.uk
University of Manchester
131 Princess Street, M1 7DN
http://personalpages.manchester.ac.uk/staff/sophia.ananiadou/
sophia.ananiadou at manchester.ac.uk
tel: +44 161 306 3092
----------------------------------------------------------------------
Send Corpora mailing list submissions to
corpora at uib.no
To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.uib.no/listinfo/corpora
or, via email, send a message with subject or body 'help' to
corpora-request at uib.no
You can reach the person managing the list at
corpora-owner at uib.no
When replying, please edit your Subject line so it is more specific
than "Re: Contents of Corpora digest..."
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora
End of Corpora Digest, Vol 23, Issue 17
***************************************