[My-hm] Hmong Corpus

David Mortensen davidmortensen at gmail.com
Fri May 29 07:34:03 UTC 2015


I would like to announce the availability of an approximately 15 million word corpus of Hmong (mostly Hmong Daw/White Hmong but with some Mong Leng/Green Hmong).

* It was “scraped” from the long-running soc.culture.hmong (SOC) Usenet group, which is still used today (primarily through the Google Groups interface).
* It consists of 13,355 plain text files with no annotations.
* This corpus is most useful if you know Hmong or are performing an analysis that doesn’t require labelled data.
* Each file consists of all or part of a thread.
* Measures were taken to automatically filter out English and Lao posts. These measures were largely, but not completely, successful.
* Measures were also taken to eliminate quoted text, which caused a high level of redundancy in the raw data files. These measures were much more successful than the language filtering attempts.
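
For anyone who wants a quick first look at the data, the description above (a zip archive of unannotated plain text files) suggests something like the following Python sketch. It is only an illustration under several assumptions that the announcement does not confirm: that the files inside the archive end in ".txt", that they are UTF-8 encoded, and that splitting on whitespace is an acceptable rough tokenization. The function name count_tokens is made up for this example.

```python
import zipfile
from collections import Counter


def count_tokens(zip_path, suffix=".txt", encoding="utf-8"):
    """Return (files_read, Counter of lowercased whitespace-delimited tokens).

    Assumptions (not confirmed by the corpus announcement): corpus files
    end in `suffix` and are encoded as `encoding`.
    """
    counts = Counter()
    files_read = 0
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            # Skip directory entries and anything that is not a text file.
            if not name.endswith(suffix):
                continue
            # Undecodable bytes are replaced rather than raising an error.
            text = archive.read(name).decode(encoding, errors="replace")
            counts.update(text.lower().split())
            files_read += 1
    return files_read, counts
```

Calling count_tokens("sch_corpus-2.zip") on the downloaded archive would then give a file count and a token frequency table. Because the English and Lao filtering was not completely successful, the most frequent tokens may include some non-Hmong words.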

A small number of investigators have already used this corpus and found it useful. I am making it available to you to use in your research free of charge but with no warranty regarding its usefulness for any purpose. It can be downloaded at the following link:

http://www.davidmortensen.org/corpora/sch_corpus-2.zip

Even with compression, the file is large. Let me know if you have difficulty downloading it.

David R. Mortensen
