[Corpora-List] Corpora Digest, Vol 39, Issue 31
nasir saim
nasir_saim at yahoo.com
Thu Sep 30 04:45:27 UTC 2010
Hello!
I am working on adverbials in Pakistani English for my M.Phil thesis. The idea
is to establish Pakistani English as an independent variety. For this purpose I
have a tagged corpus of Pakistani written English. I would like to get this data
parsed. Can you help me in this regard?
Please reply and comment.
Nasir Ali
M.Phil Scholar
Dept. Applied Linguistics
G.C University Faisalabad.
Lecturer English
Department of Humanities and Social Sciences
National Textile University Faisalabad
Pakistan
________________________________
From: "corpora-request at uib.no" <corpora-request at uib.no>
To: corpora at uib.no
Sent: Wed, September 29, 2010 3:00:01 PM
Subject: Corpora Digest, Vol 39, Issue 31
Today's Topics:
1. 3 Research Internships at FBK-irst (Marcello Federico)
2. Job announcement (Stefanie Dipper)
3. Re: Help Regarding Cognates Identification (chris brew)
4. A word of advice needed on Pakistani English (Ahmad Ahmad)
5. Release of SemEval2010-Task1 datasets (Coreference
Resolution in Multiple Languages) (Marta Recasens)
6. thesis / internship in MT inquiry for a student speaking
Polish (native) English, French, Spanish, and basic Swedish
(Sandra Weiss)
7. Re: Looking for toolkit for verb tense detection
(Amaç Herdağdelen)
8. Re: Looking for toolkit for verb tense detection (Alexander Yeh)
9. Re: Looking for toolkit for verb tense detection
(liqiearth at gmail.com)
10. Re: Looking for toolkit for verb tense detection
(liqiearth at gmail.com)
11. Phonemic variations across languages (Yuri Tambovtsev)
----------------------------------------------------------------------
Message: 1
Date: Tue, 28 Sep 2010 14:32:59 +0000
From: Marcello Federico <federico at fbk.eu>
Subject: [Corpora-List] 3 Research Internships at FBK-irst
To: "corpora at uib.no" <corpora at uib.no>
The "Human Language Technologies" Research Unit of the Bruno Kessler Foundation
(FBK) is seeking candidates for three research internship positions:
- two in the field of Statistical Machine Translation
- one in the field of Content Processing.
The internships are intended to provide a strong theoretical and experimental
background to graduate students interested in applying later for a PhD
scholarship.
Requirements, salary, and application procedure are specified in:
http://risorseumane.fbk.eu/sites/risorseumane.fbk.eu/files/Call%20HLT_INTERNSHIP2010.pdf
Closing date: 12 October 2010
------------------------------
Message: 2
Date: Tue, 28 Sep 2010 18:12:49 +0200 (CEST)
From: Stefanie Dipper <dipper at linguistics.rub.de>
Subject: [Corpora-List] Job announcement
To: Corpora at uib.no
#########################
University: Ruhr-University Bochum
Department: Linguistics Department
Job Rank: 2 Project researchers, salary scale 13 TV-L (65%, PhD positions)
Specialty Areas: Computational Linguistics, Corpus Linguistics
Application Deadline: 1-Nov-2010 (open until filled)
Duration: Initial contract 2 years, with potential extension for an
additional year
#########################
We are looking for two enthusiastic PhD students, prepared to collaborate
in a team. The positions are within an interdisciplinary project group,
funded by the DFG (German Research Foundation), which involves people from
medieval studies, historical linguistics, and computational linguistics.
The project deals with texts from Early New High German (14th-16th
centuries).
The project has three main goals: (i) developing automatic methods for the
analysis of historical language data (e.g. POS tagging, normalization,
alignment); (ii) computing similarity of historical dialects (at different
linguistic levels); (iii) adapting and extending an existing corpus tool
for the data.
Applicants for both positions should have graduated from a master's
program, or expect to graduate in the near future. They are supposed to
pursue a PhD within the project context. Both positions require a
background in corpus and/or computational linguistics, and good
programming skills (e.g. Java, Perl, Python; experience in software
development on Linux systems is a plus). Knowledge of German is desirable,
and interest in working with historical language data is indispensable.
Ideal candidates will have a strong background in two or more of the
following areas:
- Development of corpus tools (annotation tools, search tools)
- (Semi-)automatic annotation (POS tagging, chunking, alignment, etc.)
- Data mining, clustering
- Historical linguistics
The positions are at the rank of 'Wissenschaftliche Mitarbeiter'. The
salary is determined by the pay scale for German public employees
('Entgeltgruppe' TV-L E13, 65%). The initial contract length is two years,
with a potential extension for an additional year.
Applications received by November 1, 2010 will receive full consideration.
However, the search will remain open until the positions are filled.
Candidates should send:
- a short letter of interest
- a detailed CV with a summary of research experiences/interests
- one or two sample publications or a copy of the master thesis
- the names and contact information of two referees
- diploma copies.
Please submit PDF files only. Email address for inquiries and
applications: dipper AT linguistics.rub.de
The Ruhr-University Bochum is an equal opportunity employer. Severely
disabled persons with equal qualifications will be prioritized.
Jun.Prof. Dr. Stefanie Dipper
Institute of Linguistics
Ruhr-University Bochum
D - 44780 Bochum
Germany
http://www.linguistics.ruhr-uni-bochum.de/~dipper
------------------------------
Message: 3
Date: Tue, 28 Sep 2010 08:41:59 -0400
From: chris brew <cbrew at ling.osu.edu>
Subject: Re: [Corpora-List] Help Regarding Cognates Identification
To: Padmini priyadharsini <padminipriyadharsini at gmail.com>
Cc: corpora at uib.no
I understand the definition of "cognate" to be about the history of words,
not just about similarities in the surface form. There are at least three
ways that words can come to be similar across languages:
1) Words have a common ancestry: language A and language B have words that
can be traced back to the same root word in some ancestor language. These
cases can be interesting, like étoile and star, or routine, like night and
nacht. In the interesting one, a systematic process has happened that makes
the letter before the t turn up as e-acute in French but s in English.
2) Language A borrows the word from language B.
3) An accident happens. The words in language A and language B look the
same, but there is no common ancestry and no borrowing.
Most people call pairs that fit under case (1) cognates, and the other two
"false cognates".
It is a very interesting problem to write programs to detect and take
advantage of systematic sound correspondences like the star/etoile thing.
Kondrak has worked on this extensively.
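As a rough illustration of why surface similarity alone is not enough, here is
a minimal sketch of a plain edit-distance baseline. It is not Kondrak's method
and it knows nothing about sound correspondences; the function names
edit_distance and surface_similarity are made up for this example. It scores
night/nacht as fairly similar, but it cannot tell a true cognate from a
borrowing or an accident, which is exactly the distinction made above.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def surface_similarity(a, b):
    """Hypothetical score in [0, 1]: 1 minus the length-normalised distance."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


print(edit_distance("night", "nacht"))       # two substitutions: i/a and g/c
print(surface_similarity("night", "nacht"))
```

A pair like star/étoile gets a low score from this baseline even though it is
a real cognate pair, so a detector that models systematic correspondences is
needed for the interesting cases.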
There was a good workshop on computational approaches to this stuff at ACL
2007:
@InProceedings{nerbonne-ellison-kondrak:2007:CompHistPhon,
author = {Nerbonne, John and Ellison, T. Mark and Kondrak, Grzegorz},
title = {Computing and Historical Phonology},
booktitle = {Proceedings of Ninth Meeting of the ACL Special
Interest Group in Computational Morphology and Phonology},
month = {June},
year = {2007},
address = {Prague, Czech Republic},
publisher = {Association for Computational Linguistics},
pages = {1--5},
url = {http://www.aclweb.org/anthology/W/W07/W07-1301}
}
Personally, I wouldn't want to call borrowings cognates, and I would tend to
see references to named entities as similar to borrowings, because very
often the borrowed word is unchanged, or changed only as much as necessary
to make it minimally acceptable phonologically, so the Japanese word
"sekkusu" is just the way the language borrows the English word "sex", not
the trace of some historically exciting process.
On Tue, Sep 28, 2010 at 1:23 AM, Padmini priyadharsini <
padminipriyadharsini at gmail.com> wrote:
> Hi All,
>
> Kindly let me know the available tools and techniques used for
> cognate identification.
>
> I will summarize all the replies to the list.
>
> Thanks,
> Padmini
>
> --
> Life is beautiful and enjoy its simplicity :)
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
--
Chris Brew, Ohio State University
------------------------------
Message: 4
Date: Mon, 27 Sep 2010 05:36:58 -0800
From: Ahmad Ahmad <escholer at gmail.com>
Subject: [Corpora-List] A word of advice needed on Pakistani English
To: corpora at uib.no
Dear Fellows
I am working on vowels in Pakistani English for my M Phil dissertation at
GC University Faisalabad. The idea is to work on the vowels of PE to show
that it is an indigenous variety. It is a corpus-based study using PRAAT to
find the formant values of the vowels of PE.
Do comment if you can be of any help to me in this regard. Details of the
proposal of dissertation may be shared.
Regards
Hafiz Ahmad Bilal
M Phil Scholar
Dept of Applied Linguistics
GC University, Faisalabad
Lecturer, Dept of English
University of Sargodha
------------------------------
Message: 5
Date: Tue, 28 Sep 2010 17:02:10 +0200
From: Marta Recasens <mrecasens at ub.edu>
Subject: [Corpora-List] Release of SemEval2010-Task1 datasets
(Coreference Resolution in Multiple Languages)
To: corpora at uib.no
We are pleased to announce the release of the SemEval2010-Task1 datasets for
coreference resolution in multiple languages including Catalan, Dutch, German,
Italian, and Spanish. The English data will be released by LDC early 2011.
The data is now freely available for downloading from the task website, at
http://stel.ub.edu/semeval2010-coref/download/
General formatting is shared by all languages and is inspired by the previous
CoNLL shared tasks (2008/2009 editions: http://ufal.mff.cuni.cz/conll2009-st).
The data are displayed in a uniform column-based format with information about
coreference, lemma, PoS, morphological features, head, dependency relations,
NEs, and semantic dependencies. Both gold-standard and automatically predicted
information are provided (availability depending on the language). The existence
of gold and automatic preprocessing as well as of already published results
makes it an ideal resource for training and testing coreference resolution
systems, especially when cross-language portability is to be achieved.
REFERENCE
Marta Recasens, Lluís Màrquez, Emili Sapena, M. Antònia Martí, Mariona Taulé,
Véronique Hoste, Massimo Poesio, and Yannick Versley. 2010. SemEval-2010 Task 1:
Coreference Resolution in Multiple Languages. In Proceedings of the 5th
International Workshop on Semantic Evaluation (SemEval-2010), ACL 2010, pages
1-8, Uppsala, Sweden.
We will be happy to publish your results or articles related to any of these
datasets on our website: http://stel.ub.edu/semeval2010-coref/posttask
Please feel free to let us know.
ORGANIZERS
* Marta Recasens, M. Antònia Martí, Mariona Taulé
University of Barcelona
{mrecasens,amarti,mtaule}@ub.edu
* Lluís Màrquez, Emili Sapena
Technical University of Catalonia
{lluism,esapena}@lsi.upc.edu
* Massimo Poesio
University of Essex / University of Trento
* Véronique Hoste
University College Ghent
* Yannick Versley
University of Tübingen
------------------------------
Message: 6
Date: Tue, 28 Sep 2010 22:54:51 +0200
From: Sandra Weiss <sandre17 at gmail.com>
Subject: [Corpora-List] thesis / internship in MT inquiry for a
student speaking Polish (native) English, French, Spanish, and basic
Swedish
To: Corpora at uib.no
Dear Corpora members,
I am a Polish student looking for an internship related to Machine
Translation that could be the basis for my master's thesis.
So far I have obtained a BA in French and Spanish at an English university,
which involved a lot of translation between the two languages and into
English (including a year of French-Spanish translation and vice versa, also
into English, at a French university, Jean Moulin 3 in Lyon).
I have taken courses in Machine Translation, where I worked mainly on
pre-editing and post-editing of texts run through MT engines and also worked
with the translation memory WordFast. I have also completed single courses in
General Linguistics.
I am now a second-year student in the Language and Culture in Europe master's
programme at Linköping University in Sweden, and I am in the process of
learning the C language and Perl.
I would like to work in the field of MT in the future, which is why this
semester I am preparing to take up a master's thesis next semester,
hopefully related to MT.
I am willing to take on any courses or training if required.
Any help or information will be much appreciated.
Kind regards,
Sandra Weiss
Master student of Language and Culture in Europe
Linköping, Sweden
tel: 0046760812503
------------------------------
Message: 7
Date: Wed, 29 Sep 2010 00:44:01 +0200
From: Amaç Herdağdelen <amac at herdagdelen.com>
Subject: Re: [Corpora-List] Looking for toolkit for verb tense
detection
To: Corpora at uib.no, "Qi Li" <liqiearth at gmail.com>
Hello Qi Li,
I think you might be interested in Nodebox's Linguistics Library:
http://nodebox.net/code/index.php/Linguistics#verb_conjugation It provides a
simple interface that does what you want (in Python).
Here is the example from the documentation.
----
print en.verb.is_tense("wasn't", "1st singular past", negated=True)
print en.verb.is_present("does", person=1)
print en.verb.is_present_participle("doing")
print en.verb.is_past_participle("done")
>>> True
>>> False
>>> True
>>> True
----
Cheers,
Amaç Herdağdelen
On Mon, 27 Sep 2010 21:55:15 +0200, Qi Li <liqiearth at gmail.com> wrote:
> Hi Corpora members,
>
> Does anyone know of any toolkits for word tense detection? I'm doing
> research on an IE system and need the tense of verbs as a feature in a
> classifier. A lot of papers mention word tense, but nobody says how they
> got the tense; it seems easy to do, and even to get high performance.
>
> Thanks much for any help.
>
> best,
>
> Qi Li
> ==========================
> Department of Computer Sci.
> Graduate Center, CUNY
> Email: liqiearth at gmail.com
------------------------------
Message: 8
Date: Tue, 28 Sep 2010 19:20:09 -0400
From: Alexander Yeh <asy at mitre.org>
Subject: Re: [Corpora-List] Looking for toolkit for verb tense
detection
To: Qi Li <liqiearth at gmail.com>
Cc: "Corpora at uib.no" <Corpora at uib.no>
Qi Li wrote:
> Hi Corpora members,
>
> Does anyone know of any toolkits for word tense detection? I'm doing
> research on an IE system and need the tense of verbs as a feature in a
> classifier. A lot of papers mention word tense, but nobody says how they
> got the tense; it seems easy to do, and even to get high performance.
Many part-of-speech (p-o-s) taggers will differentiate between certain
tenses of verbs.
For example, part-of-speech taggers trained on the Penn Treebank will
try to distinguish between the following (from
http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html):
27. VB Verb, base form
28. VBD Verb, past tense
29. VBG Verb, gerund or present participle
30. VBN Verb, past participle
31. VBP Verb, non-3rd person singular present
32. VBZ Verb, 3rd person singular present
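A minimal sketch of how these tags could be turned into a tense feature,
assuming the text has already been run through some Penn Treebank style tagger
and is available as (word, tag) pairs. The mapping labels and the function name
verb_tense_features are made up for illustration and are not part of any
toolkit.

```python
# Hypothetical mapping from Penn Treebank verb tags to coarse tense labels.
PTB_VERB_TENSE = {
    "VB": "base",
    "VBD": "past",
    "VBG": "gerund-or-present-participle",
    "VBN": "past-participle",
    "VBP": "present",
    "VBZ": "present",
}


def verb_tense_features(tagged_tokens):
    """Return (word, tense label) pairs for the verbs in a tagged sentence."""
    return [(word, PTB_VERB_TENSE[tag])
            for word, tag in tagged_tokens
            if tag in PTB_VERB_TENSE]


# Hand-tagged example sentence; a real pipeline would take tagger output here.
tagged = [("She", "PRP"), ("walked", "VBD"), ("home", "NN"),
          ("and", "CC"), ("sings", "VBZ")]
print(verb_tense_features(tagged))
```

Note that the tags only describe the form of each verb token. Compound tenses
such as "has walked" (VBZ followed by VBN) would need an extra step that looks
at the auxiliary as well.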
Hope this helps
-Alex Yeh
------------------------------
Message: 9
Date: Tue, 28 Sep 2010 23:07:09 +0000
From: liqiearth at gmail.com
Subject: Re: [Corpora-List] Looking for toolkit for verb tense
detection
To: "=?utf-8?B?QW1hw6cgSGVyZGHEn2RlbGVu?=" <amac at herdagdelen.com>,
Corpora at uib.no
Hi,
Thanks a lot, I'll check it. If there is a Java version, that would be great. I
don't know why nobody mentions it in papers; is it so simple to do that people
always do it themselves?
Anyway, thanks again
Qi Li
------------------------------
Message: 10
Date: Wed, 29 Sep 2010 04:25:23 +0000
From: liqiearth at gmail.com
Subject: Re: [Corpora-List] Looking for toolkit for verb tense
detection
To: "Alexander Yeh" <asy at mitre.org>
Cc: "Corpora at uib.no" <Corpora at uib.no>
That makes sense! Thanks a lot, a POS tagger is more informative than I thought.
Best,
Qi Li
------------------------------
Message: 11
Date: Wed, 29 Sep 2010 16:24:08 +0700
From: "Yuri Tambovtsev" <yutamb at mail.ru>
Subject: [Corpora-List] Phonemic variations across languages
To: <corpora at uib.no>
Dear Corpora List members,
Do you use the coefficient of variation and Chi-square to study the
functioning of gerunds, participles, phonemes or prepositions in language? In
fact, applying the coefficient of variation and Chi-square to investigate the
variation of linguistic elements may stop endless debates about language
variation problems, because those problems can be successfully solved. I use
the coefficient of variation and Chi-square myself, and they have proved quite
useful. With their help I have also studied the variation of phonemes and
groups of phonemes (labials, velars, sonorants, fricatives, etc.) across
languages. Usually they have been used to study the variation of phonemes in
texts, and I did that as well. I wonder if you have read my publications? Who
is researching in the same area? I call this area phonostatistical typology. We
have also studied different texts by English, American, Russian and Ukrainian
authors from the point of view of the use of different linguistic units. The
results of the statistical study of the corpora were published by us.
Looking forward to hearing from you soon at yutamb at mail.ru
Yours sincerely,
Yuri Tambovtsev, Novosibirsk, Russia
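For readers unfamiliar with the two measures, here is a minimal sketch of how
they can be computed. The counts are invented for illustration and are not
taken from any of the studies mentioned above, and the function names are
hypothetical. The coefficient of variation is taken here as the population
standard deviation divided by the mean, and the Chi-square statistic compares
observed counts with the counts expected if the unit were equally frequent in
every text.

```python
import statistics


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    return statistics.pstdev(values) / statistics.mean(values)


def chi_square_statistic(observed, expected):
    """Pearson's Chi-square statistic: sum of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Invented counts of labial consonants in three equal-sized text samples.
labial_counts = [120, 100, 80]
# Expected counts if labials were equally frequent in all three samples.
expected_counts = [statistics.mean(labial_counts)] * len(labial_counts)

print(coefficient_of_variation(labial_counts))
print(chi_square_statistic(labial_counts, expected_counts))
```

The Chi-square value still has to be compared with a critical value for the
appropriate degrees of freedom (here, two) before drawing any conclusion about
whether the samples differ.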
----------------------------------------------------------------------
Send Corpora mailing list submissions to
corpora at uib.no
To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.uib.no/listinfo/corpora
or, via email, send a message with subject or body 'help' to
corpora-request at uib.no
You can reach the person managing the list at
corpora-owner at uib.no
When replying, please edit your Subject line so it is more specific
than "Re: Contents of Corpora digest..."
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora
End of Corpora Digest, Vol 39, Issue 31
***************************************