[Corpora-List] Greek corpus

Taras Zagibalov taras8055 at gmail.com
Wed Feb 9 18:57:55 UTC 2011


Yes, Fran, the first thing I did was delete everything between <>.
Thank you, anyway.

Taras
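
The clean-up step mentioned above (deleting everything between <> before counting words) could look roughly like the following. This is only a minimal sketch, assuming the markup is plain <...> spans; the function names and the sample lines are hypothetical and made up for illustration, not taken from the Europarl files.

```python
import re
from collections import Counter

# Matches anything between angle brackets, e.g. <SPEAKER ID=1 NAME="...">.
TAG_RE = re.compile(r"<[^>]*>")
# Runs of letters in any script (Greek included); digits and underscores excluded.
WORD_RE = re.compile(r"[^\W\d_]+")


def strip_markup(line):
    """Replace every <...> span in the line with a single space."""
    return TAG_RE.sub(" ", line)


def word_frequencies(lines):
    """Count word tokens after removing <...> markup from each line."""
    counts = Counter()
    for line in lines:
        counts.update(WORD_RE.findall(strip_markup(line)))
    return counts


# Hypothetical sample lines (assumption: invented here, not real corpus data).
sample = [
    '<SPEAKER ID=1 LANGUAGE="EL" NAME="Example">',
    "Η έκθεση της Επιτροπής",
    "<P>",
    "της Επιτροπής",
]
freqs = word_frequencies(sample)
print(freqs.most_common(5))
```

If words such as NAME or SPEAKER still rank high after this step, they are presumably not inside <...> spans in the input, which is worth checking before blaming the corpus itself.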

2011/2/9 Francis Tyers <ftyers at prompsit.com>:
> Are you sure those aren't tags? Try grepping out any line with '<' and
> see what you get. But yeah, Europarl isn't a balanced corpus by any
> means.
>
> For Wikipedia, it depends, you'll probably get stuff like science words
> fairly high, or history.
>
> If you want a corpus of news text about the Balkans, you could do worse
> than SETIMES http://www.statmt.org/setimes/
>
> But if you want a BNC-style "balanced corpus" of Greek, then I have no
> idea sorry! Those things usually don't come cheap/free :)
>
> Fran
>
> On Wed, 9 Feb 2011 at 17:43 +0000, Taras Zagibalov wrote:
>> My worst fears came true: for the europarl corpus, the most frequent
>> words (among others) are NAME, SPEAKER, AFFILIATION, LANGUAGE (in
>> English, all capitals). As for Greek words, among the most frequent
>> ones is for example Επιτροπής (Commission).
>> That's not very good if you want to have a corpus that represents a
>> language in general.
>> Does anyone know of a collection of generic texts in Greek?
>>
>> As for a Wikipedia-based corpus, it most probably has the same problem
>> as Europarl: it's too genre/style specific.
>>
>> Regards,
>> Taras
>>
>>
>> 2011/2/9 Francis Tyers <ftyers at prompsit.com>
>> >
>> > You can always try the Greek Wikipedia:
>> >
>> > http://dumps.wikimedia.org/elwiki/20110203/
>> >
>> > There are a few tools around for converting it into text.
>> >
>> > Fran
>> >
>> > On Wed, 9 Feb 2011 at 15:36 +0000, Taras Zagibalov wrote:
>> > > Thank you Fran and Alberto,
>> > >
>> > >
>> > > The Europarl corpus is fine and I will use it. But I assume it's quite
>> > > specific in terms of style (official, presumably). Is there a corpus of
>> > > more generic language? Perhaps a collection of modern literature or
>> > > web-based content (blogs, forums, etc.)?
>> > >
>> > >
>> > > Thank you.
>> > >
>> > >
>> > > Taras
>> > >
>> > > 2011/2/9 Alberto Simões <albie at alfarrabio.di.uminho.pt>
>> > >         Dear Taras,
>> > >
>> > >          EuroParl [1] and JRC-Acquis [2] include Greek versions.
>> > >
>> > >         [1] http://www.statmt.org/europarl/
>> > >         [2] http://wt.jrc.it/lt/Acquis/
>> > >
>> > >         Hope this helps
>> > >         Alberto
>> > >
>> > >
>> > >
>> > >         On 09/02/2011 14:44, Taras Zagibalov wrote:
>> > >
>> > >
>> > >                 Dear list members,
>> > >
>> > >                 Do you know of any freely available plain-text Modern
>> > >                 Greek corpus?
>> > >                 Preferably in Unicode.
>> > >
>> > >                 Best regards,
>> > >                 Taras Zagibalov
>> > >
>> > >
>> > >
>> > >
>> > >                 _______________________________________________
>> > >                 Corpora mailing list
>> > >                 Corpora at uib.no
>> > >                 http://mailman.uib.no/listinfo/corpora
>> > >
>> > >         --
>> > >         Alberto Simões
>> > >
>> > >
>> > >
>> >
>> >
>>
>
>
>


