[Corpora-List] Greek corpus

Thu Feb 10 09:01:28 UTC 2011

Hello,

As far as I know, there is no freely available Greek corpus. However, you
might want to contact Dr. Dionisis Goutsos, who is responsible for the SEK
project at the University of Athens. His email address is on the SEK contact
page <http://sek.edu.gr/contact.php>. SEK is the result of a cooperation
between the universities of Athens and Cyprus, and it can be searched on the
web at <http://www.sek.edu.gr>.
It contains both written and oral sources from Greece and Cyprus; the full
list is here <http://www.sek.edu.gr/dl/SekFileList.pdf>. The SEK corpus
might be made available for research purposes, and Dr. Goutsos would be the
best person to ask.

As others have mentioned, ILSP also has a corpus, but it is not freely
available either. It too has a public search interface
<http://hnc.ilsp.gr/en/default.asp>. ILSP might also make the corpus
available for research purposes, but I am not sure.

Best,
Valentini Mellas

On Wed, Feb 9, 2011 at 9:21 PM, Daniel Zeman <zeman at ufal.mff.cuni.cz> wrote:

> Hi, there is the Greek Dependency Treebank (GDT). It was part of the
> CoNLL 2007 shared task, but the license was granted only for the shared
> task. Maybe if you get in touch with the ILSP, they will tell you how
> to obtain it:
>
> Prokopis Prokopidis, Elina Desypri, Maria Koutsombogera, Haris
> Papageorgiou, and Stelios Piperidis. Theoretical and Practical Issues in the
> Construction of a Greek Dependency Treebank. In Montserrat Civit, Sandra
> Kübler, and Ma. Antonia Martí, editors, Proceedings of The Fourth Workshop
> on Treebanks and Linguistic Theories (TLT 2005), pages 149-160, Barcelona,
> Spain, December 2005. Universitat de Barcelona.
>
> * (c) 2005-2007, by the Institute for Language and Speech Processing. ILSP
> owns the copyright to all automatic and manually-validated annotations in
> the GDT.
>
> Best,
> Dan
>
> On 9.2.2011 19:57, Taras Zagibalov wrote:
>
>> Yes, Fran, the first thing I did was delete everything between <>.
>> Thank you, anyway.
>>
>> Taras
>>
>> 2011/2/9 Francis Tyers <ftyers at prompsit.com>:
>>
>>> Are you sure those aren't tags? Try grepping out any line with '<' and
>>> see what you get. But yeah, Europarl isn't a balanced corpus by any
>>> means.
>>>
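A minimal Python sketch of the check suggested above: it drops every line
that contains '<' before counting words, so that markup such as SPEAKER or
NAME does not enter the frequency list. The function name count_words and
the sample lines are made up for illustration and are not taken from the
actual Europarl files; treating any line with '<' as markup is an
assumption that may also discard some real text.

```python
import re
from collections import Counter


def count_words(lines):
    """Count word frequencies, skipping any line that contains '<'."""
    counts = Counter()
    for line in lines:
        if "<" in line:
            # Assumption: a line with '<' is markup, not running text.
            continue
        # \w+ matches Unicode letters, so Greek words are kept whole.
        counts.update(re.findall(r"\w+", line.lower()))
    return counts


# Hypothetical sample input, only loosely modelled on tagged corpus lines.
sample = [
    '<SPEAKER ID=1 NAME="Example" LANGUAGE="EL">',
    "Η Επιτροπή και το Κοινοβούλιο",
    "<P>",
    "η Επιτροπή",
]

freq = count_words(sample)
print(freq.most_common(3))
```

On a real file, one could pass a file handle opened with encoding="utf-8"
in place of the sample list.
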
>>> For Wikipedia, it depends, you'll probably get stuff like science words
>>> fairly high, or history.
>>>
>>> If you want a corpus of news text about the Balkans, you could do worse
>>> than SETIMES http://www.statmt.org/setimes/
>>>
>>> But if you want a BNC-style "balanced corpus" of Greek, then I have no
>>> idea, sorry! Those things usually don't come cheap/free :)
>>>
>>> Fran
>>>
>>> On Wed, 09 Feb 2011 at 17:43 +0000, Taras Zagibalov wrote:
>>>
>>>> My worst fears came true: for the Europarl corpus, the most frequent
>>>> words (among others) are NAME, SPEAKER, AFFILIATION, LANGUAGE (in
>>>> English, all capitals). As for Greek words, among the most frequent
>>>> ones is, for example, Επιτροπής (Commission).
>>>> That's not very good if you want a corpus that represents the
>>>> language in general.
>>>> Does anyone know of a collection of generic texts in Greek?
>>>>
>>>> As for a Wikipedia-based corpus, it most probably has the same problem
>>>> as Europarl: it's too genre/style specific.
>>>>
>>>> Regards,
>>>> Taras
>>>>
>>>>
>>>> 2011/2/9 Francis Tyers <ftyers at prompsit.com>
>>>>
>>>>> You can always try the Greek Wikipedia:
>>>>>
>>>>> http://dumps.wikimedia.org/elwiki/20110203/
>>>>>
>>>>> There are a few tools around for converting it into text.
>>>>>
>>>>> Fran
>>>>>
>>>>> On Wed, 09 Feb 2011 at 15:36 +0000, Taras Zagibalov wrote:
>>>>>
>>>>>> Thank you Fran and Alberto,
>>>>>>
>>>>>>
>>>>>> The Europarl corpus is fine and I will use it. But I expect it's quite
>>>>>> specific in terms of style (official, I assume). Is there a corpus of
>>>>>> more generic language? Perhaps a collection of modern literature or
>>>>>> web-based content (blogs, forums etc.)?
>>>>>>
>>>>>>
>>>>>> Thank you.
>>>>>>
>>>>>>
>>>>>> Taras
>>>>>>
>>>>>> 2011/2/9 Alberto Simões <albie at alfarrabio.di.uminho.pt>
>>>>>>         Dear Taras,
>>>>>>
>>>>>>         EuroParl [1] and JRC-Acquis [2] include Greek versions.
>>>>>>
>>>>>>         [1] http://www.statmt.org/europarl/
>>>>>>         [2] http://wt.jrc.it/lt/Acquis/
>>>>>>
>>>>>>         Hope this helps
>>>>>>         Alberto
>>>>>>
>>>>>>
>>>>>>
>>>>>>         On 09/02/2011 14:44, Taras Zagibalov wrote:
>>>>>>
>>>>>>
>>>>>>                 Dear list members,
>>>>>>
>>>>>>                 Do you know of any freely available plain-text
>>>>>>                 Modern Greek corpus?
>>>>>>                 Preferably in Unicode.
>>>>>>
>>>>>>                 Best regards,
>>>>>>                 Taras Zagibalov
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>                 _______________________________________________
>>>>>>                 Corpora mailing list
>>>>>>                 Corpora at uib.no
>>>>>>                 http://mailman.uib.no/listinfo/corpora
>>>>>>
>>>>>>         --
>>>>>>         Alberto Simões
>>>>>>
>
> --
> RNDr. Daniel Zeman, Ph.D.
> ÚFAL MFF, Univerzita Karlova, Praha
> http://ufal.mff.cuni.cz/~zeman/
>