[Corpora-List] Greek corpus

Taras Zagibalov taras8055 at gmail.com
Wed Feb 9 17:43:51 UTC 2011


My worst fears came true: in the europarl corpus, the most frequent
words include NAME, SPEAKER, AFFILIATION and LANGUAGE (in English,
all capitals). Among the most frequent Greek words is, for example,
Επιτροπής (Commission).
That's not very good if you want a corpus that represents the
language in general.
Does anyone know of a collection of generic texts in Greek?
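
For what it's worth, the capitalised English tokens can be filtered out
before counting. Below is a minimal sketch, assuming they come from
markup lines that sit on their own and start with "<" (the sample lines
and the function name word_frequencies are made up for illustration,
they are not taken from the corpus):

```python
import re
from collections import Counter

# Assumption: markup sits on its own lines beginning with "<",
# e.g. <SPEAKER ID=1 NAME="..." LANGUAGE="EL">; adjust if your files differ.
TAG_LINE = re.compile(r"^\s*<[^>]*>\s*$")
# [^\W\d_]+ matches runs of letters only (Unicode-aware in Python 3).
WORD = re.compile(r"[^\W\d_]+")


def word_frequencies(lines):
    """Count lowercased word tokens, skipping markup-only lines."""
    counts = Counter()
    for line in lines:
        if TAG_LINE.match(line):
            continue  # drop the whole markup line
        counts.update(w.lower() for w in WORD.findall(line))
    return counts


# Hypothetical sample lines, for illustration only.
sample = [
    '<SPEAKER ID=1 NAME="Example" LANGUAGE="EL">',
    "Η έκθεση της Επιτροπής",
    "<P>",
    "της Επιτροπής",
]
freq = word_frequencies(sample)
print(freq.most_common(3))
```

This only removes the markup tokens. Domain words such as Επιτροπής
will still be near the top of the list, so the genre problem remains.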

As for a Wikipedia-based corpus, it most probably has the same problem
as europarl: it's too genre/style specific.

Regards,
Taras


2011/2/9 Francis Tyers <ftyers at prompsit.com>
>
> You can always try the Greek Wikipedia:
>
> http://dumps.wikimedia.org/elwiki/20110203/
>
> There are a few tools around for converting it into text.
>
> Fran
>
> On Wed, 09 Feb 2011 at 15:36 +0000, Taras Zagibalov wrote:
> > Thank you Fran and Alberto,
> >
> >
> > The europarl corpus is fine and I will use it. But I assume it's quite
> > specific in terms of style (official, I assume). Is there a corpus of
> > more generic language? Perhaps a collection of modern literature or
> > web-based content (blogs, forums, etc.)?
> >
> >
> > Thank you.
> >
> >
> > Taras
> >
> > 2011/2/9 Alberto Simões <albie at alfarrabio.di.uminho.pt>
> >         Dear Taras,
> >
> >          EuroParl [1] and JRC-Acquis [2] include Greek versions.
> >
> >         [1] http://www.statmt.org/europarl/
> >         [2] http://wt.jrc.it/lt/Acquis/
> >
> >         Hope this helps
> >         Alberto
> >
> >
> >
> >         On 09/02/2011 14:44, Taras Zagibalov wrote:
> >
> >
> >                 Dear list members,
> >
> >                 Do you know of any freely available plain-text Modern
> >                 Greek corpus?
> >                 Preferably in Unicode.
> >
> >                 Best regards,
> >                 Taras Zagibalov
> >
> >
> >
> >
> >                 _______________________________________________
> >                 Corpora mailing list
> >                 Corpora at uib.no
> >                 http://mailman.uib.no/listinfo/corpora
> >
> >         --
> >         Alberto Simões
> >
> >
> >
>
>
