[Corpora-List] Seeking bilingual corpora, colloquial register, gaming

Alex Juan alhelsal at posgrado.upv.es
Mon Aug 6 09:58:39 UTC 2012


Dear all,

I am looking for bilingual/multilingual corpora that could be classified as
UGC, that is, user-generated content. This includes (but is not limited to)
chat conversations, support forum conversations, and phone/SMS/email
transcripts.

As you know, the language here is not always "standard": this content may
be rich in abbreviations and may also contain spelling mistakes, figurative
language, and swearwords. If there are also collections or repositories of
keywords (aka "seed" words) used in similar studies, those would also be of
help. In the first instance, the languages of interest are German and
English, with the items of the corpora or repositories aligned with one
another.

I am attempting to build a DE<>EN MT prototype for the gaming domain.

Does anyone know of such a corpus? Any information or pointers will be
appreciated (even if they come from specialists in other HLT fields, such
as sentiment analysis or the semantic web).

Thanks.
-- 
Alex Juan