[Corpora-List] Looking for Igbo, Hausa, and Yoruba Corpora

Jimmy O'Regan joregan at gmail.com
Sat Feb 25 20:23:06 UTC 2012


On 25 February 2012 19:31, Fink, Clayton R. <finkcr1 at jhuapl.edu> wrote:
> There's a BBC Hausa service and a Yoruba-language Wikipedia, so there are
> some possibilities for those languages. Igbo seems to be a real problem,
> though, in terms of finding text corpora.
>

There's an Igbo Wikipedia: http://ig.wikipedia.org/wiki/Ih%C3%BC_Mbu

> I'm interested, mostly, in training up language id models that I can use on
> names. I have some small corpora of first names and surnames scraped off of
> the Web, but it might be interesting to have some larger corpora to work
> from.

Kevin Scannell's language id model set
(http://nltk.googlecode.com/svn/trunk/nltk_data/packages/corpora/langid.zip)
includes a trigram model for Igbo.
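
For language id on names specifically, a character-trigram scorer is small enough to sketch. What follows is only an illustration under stated assumptions, not the format or method of the model set above: the three name lists are tiny, hand-picked toy data, the function names (trigrams, train, log_prob, identify) are made up for this example, and scoring is a plain add-one-smoothed trigram frequency profile rather than a full conditional n-gram model.

```python
import math
from collections import Counter


def trigrams(text):
    """Return the character trigrams of text, padded with spaces at both ends."""
    padded = "  " + text.lower() + " "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def train(samples):
    """Count character trigrams over an iterable of strings."""
    counts = Counter()
    for sample in samples:
        counts.update(trigrams(sample))
    return counts


def log_prob(text, counts, vocab_size):
    """Add-one smoothed log-probability of text under a trigram frequency profile."""
    total = sum(counts.values())
    return sum(
        math.log((counts[t] + 1) / (total + vocab_size)) for t in trigrams(text)
    )


def identify(text, models):
    """Return the language code whose profile gives text the highest score."""
    vocab = set()
    for counts in models.values():
        vocab.update(counts)
    return max(models, key=lambda lang: log_prob(text, models[lang], len(vocab)))


# Toy training data (an assumption for illustration only; real use needs
# far larger name lists or running text per language).
names = {
    "ig": ["Chukwuemeka", "Ngozi", "Chinedu", "Nkechi", "Obinna"],
    "yo": ["Oluwaseun", "Adebayo", "Babatunde", "Folasade", "Ayodele"],
    "ha": ["Abubakar", "Danjuma", "Aliyu", "Hauwa", "Garba"],
}
models = {lang: train(samples) for lang, samples in names.items()}

print(identify("Chukwuma", models))
```

With lists this small the result says little about real accuracy; the point is only that short strings such as names still yield several trigrams each, which is why larger per-language corpora should help.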


-- 
<Sefam> Are any of the mentors around?
<jimregan> yes, they're the ones trolling you
