[Corpora-List] Corpora Digest, Vol 39, Issue 31

Thu Sep 30 04:45:27 UTC 2010

Hello!
I am working on adverbials in Pakistani English for my M.Phil thesis. The idea 
is to establish Pakistani English as an independent variety. For this purpose I 
have a tagged corpus of Pakistani Written English. I would like to get this data 
parsed. Can you help me in this regard?
Please reply and comment.

Nasir Ali
M.Phil Scholar
Dept. Applied Linguistics
G.C University Faisalabad.

Lecturer English 
Department of Humanities and Social Sciences
National Textile University Faisalabad
Pakistan

________________________________
From: "corpora-request at uib.no" <corpora-request at uib.no>
To: corpora at uib.no
Sent: Wed, September 29, 2010 3:00:01 PM
Subject: Corpora Digest, Vol 39, Issue 31

Today's Topics:

  1.  3 Research Internships at FBK-irst (Marcello Federico)
  2.  Job announcement (Stefanie Dipper)
  3. Re:  Help Regarding Cognates Identification (chris brew)
  4.  A word of advice needed on Pakistani English (Ahmad Ahmad)
  5.  Release of SemEval2010-Task1 datasets (Coreference
      Resolution in Multiple Languages) (Marta Recasens)
  6.  thesis / internship in MT inquiry for a student speaking
      Polish (native) English, French, Spanish, and basic Swedish
      (Sandra Weiss)
  7. Re:  Looking for toolkit for verb tense detection
      (Amaç Herdağdelen)
  8. Re:  Looking for toolkit for verb tense detection (Alexander Yeh)
  9. Re:  Looking for toolkit for verb tense detection
      (liqiearth at gmail.com)
  10. Re:  Looking for toolkit for verb tense detection
      (liqiearth at gmail.com)
  11.  Phonemic variations across languages (Yuri Tambovtsev)

----------------------------------------------------------------------

Message: 1
Date: Tue, 28 Sep 2010 14:32:59 +0000
From: Marcello Federico <federico at fbk.eu>
Subject: [Corpora-List] 3 Research Internships at FBK-irst
To: "corpora at uib.no" <corpora at uib.no>

The "Human Language Technologies" Research Unit of the Bruno Kessler Foundation 
(FBK) is seeking candidates for three research internship positions:

-  two in the field of Statistical Machine Translation 
-  one in the field of Content Processing. 

The internships are intended to provide a strong theoretical and experimental 
background to graduate students interested in applying later for a PhD 
scholarship.

Requirements, salary, and application procedure are specified in:
http://risorseumane.fbk.eu/sites/risorseumane.fbk.eu/files/Call%20HLT_INTERNSHIP2010.pdf

Closing date: 12 October 2010

------------------------------

Message: 2
Date: Tue, 28 Sep 2010 18:12:49 +0200 (CEST)
From: Stefanie Dipper <dipper at linguistics.rub.de>
Subject: [Corpora-List] Job announcement
To: Corpora at uib.no

#########################
University: Ruhr-University Bochum
Department: Linguistics Department
Job Rank: 2 Project researchers, salary scale 13 TV-L (65%, PhD positions)
Specialty Areas: Computational Linguistics, Corpus Linguistics
Application Deadline: 1-Nov-2010 (open until filled)
Duration: Initial contract 2 years, with potential extension for an 
additional year
#########################

We are looking for two enthusiastic PhD students, prepared to collaborate 
in a team. The positions are within an interdisciplinary project group, 
funded by the DFG (German Research Foundation), which involves people from 
medieval studies, historical linguistics, and computational linguistics. 
The project deals with texts from Early New High German (14th-16th 
centuries).

The project has three main goals: (i) developing automatic methods for the 
analysis of historical language data (e.g. POS tagging, normalization, 
alignment); (ii) computing similarity of historical dialects (at different 
linguistic levels); (iii) adapting and extending an existing corpus tool 
for the data.

Applicants for both positions should have graduated from a master's 
program, or expect to graduate in the near future. They are expected to 
pursue a PhD within the project context. Both positions require a 
background in corpus and/or computational linguistics, and good 
programming skills (e.g. Java, Perl, Python; experience in software 
development on Linux systems is a plus). Knowledge of German is desirable, 
and interest in working with historical language data is indispensable.

Ideal candidates will have a strong background in two or more of the 
following areas:

- Development of corpus tools (annotation tools, search tools)
- (Semi-)automatic annotation (POS tagging, chunking, alignment, etc.)
- Data mining, clustering
- Historical linguistics

The positions are at the rank of 'Wissenschaftliche Mitarbeiter'. The 
salary is determined by the pay scale for German public employees 
('Entgeltgruppe' TV-L E13, 65%). The initial contract length is two years, 
with a potential extension for an additional year.

Applications received by November 1, 2010 will receive full consideration. 
However, the search will remain open until the positions are filled. 
Candidates should send:

- a short letter of interest
- a detailed CV with a summary of research experiences/interests
- one or two sample publications or a copy of the master's thesis
- the names and contact information of two referees
- diploma copies.

Please submit PDF files only. Email address for inquiries and 
applications: dipper AT linguistics.rub.de

The Ruhr-University Bochum is an equal opportunity employer. Severely 
disabled persons with equal qualifications will be prioritized.

Jun.Prof. Dr. Stefanie Dipper
Institute of Linguistics
Ruhr-University Bochum
D - 44780 Bochum
Germany
http://www.linguistics.ruhr-uni-bochum.de/~dipper

------------------------------

Message: 3
Date: Tue, 28 Sep 2010 08:41:59 -0400
From: chris brew <cbrew at ling.osu.edu>
Subject: Re: [Corpora-List] Help Regarding Cognates Identification
To: Padmini priyadharsini <padminipriyadharsini at gmail.com>
Cc: corpora at uib.no

I understand the definition of "cognate" to be about the history of words,
not just about similarities in the surface form. There are at least three
ways that words can come to be similar across languages:

1) Words have a common ancestry: language A and language B have words that
can be traced back to the same root word in some ancestor language. These
cases can be interesting, like étoile and star, or routine, like night and
nacht. In the interesting one, a systematic process has happened that makes
the letter before the t turn up as e-acute in French but s in English.

2) Language A borrows the word from language B.

3) An accident happens. The words in language A and language B look the
same, but there is no common ancestry and no borrowing.

Most people call pairs that fit under case (1) cognates, and the other two
"false cognates". It is a very interesting problem to write programs to
detect and take advantage of systematic sound correspondences like the
star/étoile thing. Kondrak has worked on this extensively.
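
To illustrate why surface form alone is a weak signal here, the following
minimal sketch compares the two example pairs with a normalized edit
distance. It is only a baseline written for this illustration, not Kondrak's
method or any published cognate detector; the function names `levenshtein`
and `surface_similarity` are made up for the example.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            # deletion, insertion, substitution (or match)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def surface_similarity(a, b):
    """1.0 for identical strings, 0.0 when nothing lines up."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a.lower(), b.lower()) / longest


# The two cognate pairs mentioned above.
for pair in [("night", "nacht"), ("étoile", "star")]:
    print(pair, round(surface_similarity(*pair), 2))
```

The routine pair night/nacht scores much higher than étoile/star, although
both are cognates in sense (1), which is why a detector would need to model
systematic sound correspondences rather than raw string overlap.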

There was a good workshop on computational approaches to this stuff at ACL
2007:

@InProceedings{nerbonne-ellison-kondrak:2007:CompHistPhon,
  author    = {Nerbonne, John  and  Ellison, T. Mark  and  Kondrak, Grzegorz},
  title    = {Computing and Historical Phonology},
  booktitle = {Proceedings of Ninth Meeting of the ACL Special
Interest Group in Computational Morphology and Phonology},
  month    = {June},
  year      = {2007},
  address  = {Prague, Czech Republic},
  publisher = {Association for Computational Linguistics},
  pages    = {1--5},
  url      = {http://www.aclweb.org/anthology/W/W07/W07-1301}
}

Personally, I wouldn't want to call borrowings cognates, and I would tend to
see references to named entities as similar to borrowings, because very
often the borrowed word is unchanged, or changed only as much as necessary
to make it minimally acceptable phonologically, so the Japanese word
"sekkusu" is just the way the language borrows the English word "sex", not
the trace of some historically exciting process.

On Tue, Sep 28, 2010 at 1:23 AM, Padmini priyadharsini <
padminipriyadharsini at gmail.com> wrote:

> Hi All,
>
> Kindly let me know the available tools and techniques used for
> cognate identification.
>
> I will be summarizing all the replies to the list.
>
> Thanks,
> Padmini
>
> --
> Life is beautiful and enjoy its simplicity :)
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>

-- 
Chris Brew, Ohio State University

------------------------------

Message: 4
Date: Mon, 27 Sep 2010 05:36:58 -0800
From: Ahmad Ahmad <escholer at gmail.com>
Subject: [Corpora-List] A word of advice needed on Pakistani English
To: corpora at uib.no

Dear Fellows,

I am working on vowels in Pakistani English for my M Phil dissertation at
GC University Faisalabad. The idea is to work on the vowels of PE to show
that it is an indigenous variety. It is a corpus-based study using PRAAT to
find the formant values of the vowels of PE.

Please comment if you can be of any help to me in this regard. Details of
the dissertation proposal can be shared.

Regards,

Hafiz Ahmad Bilal
M Phil Scholar
Dept of Applied Linguistics
GC University, Faisalabad

Lecturer, Dept of English
University of Sargodha

------------------------------

Message: 5
Date: Tue, 28 Sep 2010 17:02:10 +0200
From: Marta Recasens <mrecasens at ub.edu>
Subject: [Corpora-List] Release of SemEval2010-Task1 datasets
    (Coreference Resolution in Multiple Languages)
To: corpora at uib.no

We are pleased to announce the release of the SemEval2010-Task1 datasets for 
coreference resolution in multiple languages including Catalan, Dutch, German, 
Italian, and Spanish. The English data will be released by LDC in early 2011.

The data is now freely available for downloading from the task website, at 
http://stel.ub.edu/semeval2010-coref/download/

General formatting is shared by all languages and is inspired by the previous 
CoNLL shared tasks (2008/2009 editions: http://ufal.mff.cuni.cz/conll2009-st). 
The data are displayed in a uniform column-based format with information about 
coreference, lemma, PoS, morphological features, head, dependency relations, 
NEs, and semantic dependencies. Both gold-standard and automatically predicted 
information are provided (availability depending on the language). The existence 
of gold and automatic preprocessing as well as of already published results 
makes it an ideal resource for training and testing coreference resolution 
systems, especially when cross-language portability is to be achieved.

REFERENCE
Marta Recasens, Lluís Màrquez, Emili Sapena, M. Antònia Martí, Mariona Taulé, 
Véronique Hoste, Massimo Poesio, and Yannick Versley. 2010. SemEval-2010 Task 1: 
Coreference Resolution in Multiple Languages. In Proceedings of the 5th 
International Workshop on Semantic Evaluation (SemEval-2010), ACL 2010, pages 
1-8, Uppsala, Sweden.

We will be happy to publish your results or articles related to any of these 
datasets on our website: http://stel.ub.edu/semeval2010-coref/posttask
Please feel free to let us know.

ORGANIZERS
* Marta Recasens, M. Antònia Martí, Mariona Taulé
  University of Barcelona
  {mrecasens,amarti,mtaule}@ub.edu
* Lluís Màrquez, Emili Sapena
  Technical University of Catalonia
  {lluism,esapena}@lsi.upc.edu
*  Massimo Poesio
  University of Essex / University of Trento
*  Véronique Hoste
  University College Ghent
*  Yannick Versley
  University of Tübingen

------------------------------

Message: 6
Date: Tue, 28 Sep 2010 22:54:51 +0200
From: Sandra Weiss <sandre17 at gmail.com>
Subject: [Corpora-List] thesis / internship in MT inquiry for a
    student speaking Polish (native) English, French, Spanish, and basic
    Swedish
To: Corpora at uib.no

Dear Corpora members,

I am a Polish student looking for an internship related to Machine
Translation that could be the basis for my master's thesis.

So far I have obtained a BA in French and Spanish at an English university,
which involved a lot of translation between the two languages and into
English (including a year of French-Spanish translation and vice versa, as
well as into English, at a French university, Jean Moulin 3 in Lyon).

I have taken courses in Machine Translation, where I worked mainly on
pre-editing and post-editing of texts run through MT engines and also worked
with the translation memory tool WordFast. I have also completed single
courses in General Linguistics.

I am now a second-year student in the Language and Culture in Europe
master's programme at Linköping University in Sweden, and I am in the
process of learning C and Perl.

I would like to work in the field of MT in the future, which is why this
semester I am preparing to take up a master's thesis next semester,
hopefully related to MT.

I am willing to take on any courses or training if required.

Any help or information will be very much appreciated.

Kind regards,

Sandra Weiss
Master student of Language and Culture in Europe
Linkoping, Sweden
tel: 0046760812503

------------------------------

Message: 7
Date: Wed, 29 Sep 2010 00:44:01 +0200
From: Amaç Herdağdelen <amac at herdagdelen.com>
Subject: Re: [Corpora-List] Looking for toolkit for verb tense
    detection
To: Corpora at uib.no, "Qi Li" <liqiearth at gmail.com>

Hello Qi Li,

I think you might be interested in Nodebox's Linguistics Library: 
http://nodebox.net/code/index.php/Linguistics#verb_conjugation It provides a 
simple interface that does what you want (in Python).

Here is the example from the documentation.

----
print en.verb.is_tense("wasn't", "1st singular past", negated=True)
print en.verb.is_present("does", person=1)
print en.verb.is_present_participle("doing")
print en.verb.is_past_participle("done")
>>> True
>>> False
>>> True
>>> True
----

Cheers,

Amaç Herdağdelen

On Mon, 27 Sep 2010 21:55:15 +0200, Qi Li <liqiearth at gmail.com> wrote:

> Hi Corpora members,
>
> Does anyone know if there are any toolkits for word tense detection? I'm
> doing research on an IE system and need the tense of verbs as a feature in
> a classifier. A lot of papers mention word tense, but none of them say
> how they got it; it seems to be easy to do, even with high performance.
>
> Thanks much for any help.
>
> best,
>
> Qi Li
> ==========================
> Department of Computer Sci.
> Graduate Center, CUNY
> Email: liqiearth at gmail.com

------------------------------

Message: 8
Date: Tue, 28 Sep 2010 19:20:09 -0400
From: Alexander Yeh <asy at mitre.org>
Subject: Re: [Corpora-List] Looking for toolkit for verb tense
    detection
To: Qi Li <liqiearth at gmail.com>
Cc: "Corpora at uib.no" <Corpora at uib.no>

Qi Li wrote:
> Hi Corpora members,
>
> Does anyone know if there are any toolkits for word tense detection?
> I'm doing research on an IE system and need the tense of verbs as a
> feature in a classifier. A lot of papers mention word tense, but none
> of them say how they got it; it seems to be easy to do, even with high
> performance.

Many part-of-speech (p-o-s) finders will differentiate between certain 
tenses of verbs.

For example, part-of-speech finders trained on the Penn Treebank will 
try to distinguish between the following (from
http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html):

27.     VB      Verb, base form
28.     VBD     Verb, past tense
29.     VBG     Verb, gerund or present participle
30.     VBN     Verb, past participle
31.     VBP     Verb, non-3rd person singular present
32.     VBZ     Verb, 3rd person singular present
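
As a rough sketch of how these tags could be turned into a tense feature:
the mapping below follows the tag descriptions in the list, while the label
strings, the function name `verb_forms`, and the hand-tagged input are
assumptions made for illustration. In practice the (token, tag) pairs would
come from a tagger trained on the Penn Treebank.

```python
# Penn Treebank verb tags mapped to coarse tense/form labels.
# The label strings are illustrative, not part of the tagset itself.
PTB_VERB_FORMS = {
    "VB": "base",
    "VBD": "past",
    "VBG": "gerund_or_present_participle",
    "VBN": "past_participle",
    "VBP": "present_non_3sg",
    "VBZ": "present_3sg",
}


def verb_forms(tagged_tokens):
    """Return (token, form) pairs for every token that carries a verb tag."""
    return [
        (token, PTB_VERB_FORMS[tag])
        for token, tag in tagged_tokens
        if tag in PTB_VERB_FORMS
    ]


# Hand-tagged example sentence (hypothetical tagger output).
tagged = [
    ("She", "PRP"), ("walked", "VBD"), ("home", "NN"),
    ("and", "CC"), ("is", "VBZ"), ("sleeping", "VBG"),
]
print(verb_forms(tagged))
```

Note that the tags describe the form of a single word, so compound tenses
such as "is sleeping" still have to be assembled from neighbouring tags.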

Hope this helps
-Alex Yeh

>
> Thanks much for any help.
>
> best,
>
> Qi Li
> ==========================
> Department of Computer Sci.
> Graduate Center, CUNY
> Email: liqiearth at gmail.com <mailto:liqiearth at gmail.com>

------------------------------

Message: 9
Date: Tue, 28 Sep 2010 23:07:09 +0000
From: liqiearth at gmail.com
Subject: Re: [Corpora-List] Looking for toolkit for verb tense
    detection
To: "Amaç Herdağdelen" <amac at herdagdelen.com>,
    Corpora at uib.no

Hi,
Thanks a lot, I'll check it. If there is a Java version, that would be great. I 
don't know why nobody mentions it in papers. Is it so simple that people 
always do it themselves?

Anyway, thanks again

Qi Li

------------------------------

Message: 10
Date: Wed, 29 Sep 2010 04:25:23 +0000
From: liqiearth at gmail.com
Subject: Re: [Corpora-List] Looking for toolkit for verb tense
    detection
To: "Alexander Yeh" <asy at mitre.org>
Cc: "Corpora at uib.no" <Corpora at uib.no>

That makes sense! Thanks a lot, a POS tagger is more informative than I thought.

Best,

Qi Li

------------------------------

Message: 11
Date: Wed, 29 Sep 2010 16:24:08 +0700
From: "Yuri Tambovtsev" <yutamb at mail.ru>
Subject: [Corpora-List] Phonemic variations across languages
To: <corpora at uib.no>

Dear Corpora List members,

Do you use the coefficient of variation and Chi-square to study the 
functioning of gerunds, participles, phonemes or prepositions in language? In 
fact, applying the coefficient of variation and Chi-square to investigate the 
variation of linguistic elements in language may stop endless debates about 
language variation problems, because these problems can then be successfully 
solved. I use the coefficient of variation and Chi-square myself, and they 
have proved quite useful. With their help I also studied the variation of 
phonemes and groups of phonemes (labials, velars, sonorants, fricatives, etc.) 
across languages. Usually they are used to study the variation of phonemes in 
texts, and I did that as well.

I wonder if you have read my publications? Who is doing research in the same 
area? I call this area phonostatistical typology. We have also studied 
different texts of English, American, Russian and Ukrainian authors from the 
point of view of the use of different linguistic units. The results of the 
statistical study of the corpora have been published.

Looking forward to hearing from you soon at yutamb at mail.ru

Yours sincerely,
Yuri Tambovtsev, Novosibirsk, Russia
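
For readers unfamiliar with the two measures, here is a minimal sketch of
how they could be computed for one phoneme group across several text
samples. The counts are invented for illustration, and the choices made here
(population standard deviation, a uniform expectation for Chi-square) are
assumptions, not necessarily the procedure used in the publications
mentioned above.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean, in percent."""
    return 100.0 * pstdev(values) / mean(values)


def chi_square(observed, expected):
    """Pearson's chi-square statistic: sum of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Invented counts of labial consonants per 1,000 phonemes in four samples.
labials = [120, 130, 110, 140]

cv = coefficient_of_variation(labials)

# Compare the observed counts with a uniform expectation (the mean count).
expected = [mean(labials)] * len(labials)
chi2 = chi_square(labials, expected)

print(round(cv, 2), round(chi2, 2))
```

A low coefficient of variation and a small Chi-square value both indicate
that the group is distributed fairly evenly across the samples.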

----------------------------------------------------------------------
Send Corpora mailing list submissions to
    corpora at uib.no

To subscribe or unsubscribe via the World Wide Web, visit
    http://mailman.uib.no/listinfo/corpora
or, via email, send a message with subject or body 'help' to
    corpora-request at uib.no

You can reach the person managing the list at
    corpora-owner at uib.no

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Corpora digest..."

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora

End of Corpora Digest, Vol 39, Issue 31
***************************************
