[Corpora-List] CFP: The fifth Web as Corpus workshop, 7 September, 2009

Serge Sharoff s.sharoff at leeds.ac.uk
Mon Jan 19 15:35:50 UTC 2009


Call for Papers

We invite papers on various topics concerning the use of Web resources
for corpus research and NLP applications, including (but not limited to)
the following: 
      * linguistic Web crawler technology and Web corpus collection
        projects 
      * applications of Web-derived corpora and other kinds of Web data 
      * how far does the “easy way” get you? (using search engines, or
        Google's n-gram lists; we are particularly interested in a
        critical discussion of the usefulness and limitations of such
        approaches) 
      * methods and tools for “cleaning” Web pages to turn them into a
        corpus 
      * automatic linguistic annotation of Web data: tokenisation, POS
        tagging, lemmatisation, semantic tagging, etc. (established
        tools often perform very poorly on Web data) 
      * search engine architectures for linguists: bringing linguistics
        to commercial search engines, or high-performance search
        technology to linguistics? 
      * search engine-related topics such as result ranking (e.g. how to
        identify “typical” uses rather than returning 50 very similar
        matches on the first page) 
      * duplicate detection, interactive query refinement, etc. 
      * reviews and clever uses of search engine APIs (Google, Yahoo,
        Altavista, and in particular Microsoft's current generous Live
        Search API) 

The workshop will be held on 7 September, 2009, in San Sebastian,
preceding SEPLN, the Spanish NLP conference:
http://ixa2.si.ehu.es/sepln2009/ 

We particularly welcome submissions on the use of languages other than
English. One of the bottlenecks in corpus linguistic research on a
particular language is the availability of corpora for that language:
translation studies for, say, Ukrainian or Vietnamese are limited by
the lack of diverse corpora for these languages. The Web offers an
opportunity to alleviate this bottleneck, but we still do not know many
parameters of what is there and how useful it is for translation,
language teaching, linguistics research, etc.

The deadline for submissions is 17 April, 2009.

For more information about the workshop and submission procedure, see
our webpage:
http://www.sigwac.org.uk/



