[Corpora-List] Greek corpus

Daniel Zeman zeman at ufal.mff.cuni.cz
Wed Feb 9 19:21:43 UTC 2011


Hi, there is a resource called the Greek Dependency Treebank (GDT). It was 
part of the CoNLL 2007 shared task, but the license was granted only for 
the shared task. If you get in touch with the ILSP, they may be able to 
tell you how to obtain it:

Prokopis Prokopidis, Elina Desypri, Maria Koutsombogera, Haris 
Papageorgiou, and Stelios Piperidis. Theoretical and Practical Issues in 
the Construction of a Greek Dependency Treebank. In Montserrat Civit, 
Sandra Kübler, and Ma. Antonia Martí, editors, Proceedings of The Fourth 
Workshop on Treebanks and Linguistic Theories (TLT 2005), pages 149-160, 
Barcelona, Spain, December 2005. Universitat de Barcelona.

* (c) 2005-2007, by the Institute for Language and Speech Processing. 
ILSP owns the copyright to all automatic and manually-validated 
annotations in the GDT.

Best,
Dan

On 9.2.2011 19:57, Taras Zagibalov wrote:
> Yes, Fran, the first thing I did was delete everything between < and >.
> Thank you, anyway.
>
> Taras
>
> 2011/2/9 Francis Tyers <ftyers at prompsit.com>:
>> Are you sure those aren't tags? Try grepping out any line with '<' and
>> see what you get. But yeah, Europarl isn't a balanced corpus by any
>> means.
>>
>> For Wikipedia, it depends, you'll probably get stuff like science words
>> fairly high, or history.
>>
>> If you want a corpus of news text about the Balkans, you could do worse
>> than SETIMES http://www.statmt.org/setimes/
>>
>> But if you want a BNC-style "balanced corpus" of Greek, then I have no
>> idea sorry! Those things usually don't come cheap/free :)
>>
>> Fran
>>
>> On Wed, 09 Feb 2011 at 17:43 +0000, Taras Zagibalov wrote:
>>> My worst fears came true: for the europarl corpus, the most frequent
>>> words (among others) are NAME, SPEAKER, AFFILIATION, LANGUAGE (in
>>> English, all capitals). As for Greek words, among the most frequent
>>> ones is for example Επιτροπής (Commission).
>>> That's not very good if you want to have a corpus that represents a
>>> language in general.
>>> Does anyone perhaps know of a collection of generic texts in Greek?
>>>
>>> As for a Wikipedia-based corpus, it most probably has the same problem
>>> as Europarl: it's too genre/style specific.
>>>
>>> Regards,
>>> Taras
>>>
>>>
>>> 2011/2/9 Francis Tyers<ftyers at prompsit.com>
>>>> You can always try the Greek Wikipedia:
>>>>
>>>> http://dumps.wikimedia.org/elwiki/20110203/
>>>>
>>>> There are a few tools around for converting it into text.
>>>>
>>>> Fran
>>>>
>>>> On Wed, 09 Feb 2011 at 15:36 +0000, Taras Zagibalov wrote:
>>>>> Thank you Fran and Alberto,
>>>>>
>>>>>
>>>>> The Europarl corpus is fine and I will use it. But I assume it's quite
>>>>> specific in terms of style (official, I assume). Is there a corpus of
>>>>> more generic language? Perhaps a collection of modern literature or
>>>>> web-based content (blogs, forums etc.)?
>>>>>
>>>>>
>>>>> Thank you.
>>>>>
>>>>>
>>>>> Taras
>>>>>
>>>>> 2011/2/9 Alberto Simões<albie at alfarrabio.di.uminho.pt>
>>>>>          Dear Taras,
>>>>>
>>>>>           EuroParl [1] and JRC-Acquis [2] include Greek versions.
>>>>>
>>>>>          [1] http://www.statmt.org/europarl/
>>>>>          [2] http://wt.jrc.it/lt/Acquis/
>>>>>
>>>>>          Hope this helps
>>>>>          Alberto
>>>>>
>>>>>
>>>>>
>>>>>          On 09/02/2011 14:44, Taras Zagibalov wrote:
>>>>>
>>>>>
>>>>>                  Dear list members,
>>>>>
>>>>>                  Do you know any freely available plain text modern
>>>>>                  Greek corpus?
>>>>>                  Preferably in Unicode.
>>>>>
>>>>>                  Best regards,
>>>>>                  Taras Zagibalov
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>                  _______________________________________________
>>>>>                  Corpora mailing list
>>>>>                  Corpora at uib.no
>>>>>                  http://mailman.uib.no/listinfo/corpora
>>>>>
>>>>>          --
>>>>>          Alberto Simões

-- 
RNDr. Daniel Zeman, Ph.D.
ÚFAL MFF, Univerzita Karlova, Praha
http://ufal.mff.cuni.cz/~zeman/
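
The check discussed in the quoted thread (dropping markup lines that begin with '<' before counting words, so that tag names such as SPEAKER or NAME do not show up among the most frequent words) could be sketched roughly as below. This is only an illustration under stated assumptions: the function name word_frequencies and the sample lines are made up for the example, and the real Europarl files may need more careful handling.

```python
import re
from collections import Counter

# Assumption: markup sits on its own line starting with '<', as suggested
# in the thread; any leftover inline <...> spans are stripped as well.
TAG_LINE = re.compile(r"^\s*<")
INLINE_TAG = re.compile(r"<[^>]*>")
WORD = re.compile(r"[^\W\d_]+")  # runs of Unicode letters (covers Greek)


def word_frequencies(lines):
    """Count lowercased word forms, ignoring markup lines and inline tags."""
    counts = Counter()
    for line in lines:
        if TAG_LINE.match(line):
            continue  # skip whole-line markup
        text = INLINE_TAG.sub(" ", line)
        counts.update(w.lower() for w in WORD.findall(text))
    return counts


# Hypothetical sample, not taken from the actual corpus.
sample = [
    '<SPEAKER ID=1 NAME="Example" LANGUAGE="EL">',
    "Η Επιτροπή και η έκθεση της Επιτροπής <P>",
    "<CHAPTER ID=1>",
]
freq = word_frequencies(sample)
print(freq.most_common(3))
```

For a real file, the same function can be fed an open file handle, e.g. `word_frequencies(open(path, encoding="utf-8"))`, where `path` is whatever local file holds the Greek side of the corpus.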

