[Corpora-List] Greek corpus

Alberto Simões albie at alfarrabio.di.uminho.pt
Wed Feb 9 18:20:48 UTC 2011



On 09/02/2011 17:43, Taras Zagibalov wrote:
> My worst fears came true: for the europarl corpus, the most frequent
> words (among others) are NAME, SPEAKER, AFFILIATION, LANGUAGE (in
> English, all capitals). As for Greek words, among the most frequent
> ones is for example Επιτροπής (Commission).

The English uppercase strings are to be expected. They come from the
annotation markup, not from the text itself.
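
For what it's worth, here is a minimal sketch of dropping the annotation lines before counting words. It assumes the markup sits on its own lines starting with '<' (that is how I recall the Europarl source files, but please check your copy). The function name and the sample lines are only illustrative.

```python
import re
import unicodedata
from collections import Counter

# Assumption: annotation lines consist of a single <...> tag,
# e.g. <CHAPTER ID=1>, <SPEAKER ID=1 NAME="..." LANGUAGE="EL">, <P>.
MARKUP_LINE = re.compile(r"^\s*<[^>]*>\s*$")
# Runs of Unicode letters (no digits, no underscore).
WORD = re.compile(r"[^\W\d_]+")


def count_words(lines):
    """Count lowercased words, skipping lines that are pure markup."""
    counts = Counter()
    for line in lines:
        if MARKUP_LINE.match(line):
            continue
        # NFC keeps accented Greek letters as single code points.
        line = unicodedata.normalize("NFC", line)
        counts.update(w.lower() for w in WORD.findall(line))
    return counts


# Hypothetical sample, not taken from the real corpus.
sample = [
    '<CHAPTER ID=1>',
    '<SPEAKER ID=1 NAME="Παράδειγμα" LANGUAGE="EL">',
    'Η έκθεση της Επιτροπής.',
    '<P>',
    'Η γνώμη της Επιτροπής.',
]
counts = count_words(sample)
print(counts.most_common(3))
```

This only removes the markup tokens from the counts. It does nothing about the genre bias of the corpus.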

But in any case you are correct: you are dealing with a corpus biased
towards a specific domain.

Unfortunately I do not have anything better to share :) Sorry.

> That's not very good if you want to have a corpus that represents a
> language in general.
> Perhaps someone knows of a collection of generic texts in Greek?
>
> As for Wikipedia-based corpus, it most probably has the same problem
> as europarl - it's too genre/style specific.
>
> Regards,
> Taras
>
>
> 2011/2/9 Francis Tyers<ftyers at prompsit.com>
>>
>> You can always try the Greek Wikipedia:
>>
>> http://dumps.wikimedia.org/elwiki/20110203/
>>
>> There are a few tools around for converting it into text.
>>
>> Fran
>>
>> On Wed, 09 Feb 2011 at 15:36 +0000, Taras Zagibalov wrote:
>>> Thank you Fran and Alberto,
>>>
>>>
>>> The europarl corpus is fine and I will use it. But I assume it's quite
>>> specific in terms of style (official, presumably). Is there a corpus of
>>> more generic language? Perhaps a collection of modern literature or
>>> web-based content (blogs, forums etc.)?
>>>
>>>
>>> Thank you.
>>>
>>>
>>> Taras
>>>
>>> 2011/2/9 Alberto Simões<albie at alfarrabio.di.uminho.pt>
>>>          Dear Taras,
>>>
>>>           EuroParl [1] and JRC-Acquis [2] include Greek versions.
>>>
>>>          [1] http://www.statmt.org/europarl/
>>>          [2] http://wt.jrc.it/lt/Acquis/
>>>
>>>          Hope this helps
>>>          Alberto
>>>
>>>
>>>
>>>          On 09/02/2011 14:44, Taras Zagibalov wrote:
>>>
>>>
>>>                  Dear list members,
>>>
>>>                  Do you know any freely available plain text modern
>>>                  Greek corpus?
>>>                  Preferably in Unicode.
>>>
>>>                  Best regards,
>>>                  Taras Zagibalov
>>>
>>>
>>>
>>>
>>>                  _______________________________________________
>>>                  Corpora mailing list
>>>                  Corpora at uib.no
>>>                  http://mailman.uib.no/listinfo/corpora
>>>
>>>          --
>>>          Alberto Simões
>>>
>>
>>
>

-- 
Alberto Simões
