[Corpora-List] World of Warcraft Corpus

Ivan Krišto ivan.kristo at gmail.com
Tue Sep 10 07:51:16 UTC 2013


On 09/09/2013 07:46 AM, liling tan wrote:
> Dear all,
>
> Does anyone know of any compilation of World of Warcraft (WoW) chat
> corpus?
>
> Any suggestions/advice on how to collect a WoW chat corpus?

Hello!

Here is a suggestion for how to collect such a corpus:
- download recorded gameplay videos from YouTube (there should be plenty
  of them),
- extract the chat text using OCR.

This isn't a simple method, but it also isn't as hard as it seems.
First you need to choose a good tool to download YT videos (due to a
recent update of YT policy, this isn't a trivial task... maybe a Firefox
video downloader plugin will still do the trick).
Break the videos into images (or use the videos directly, but I prefer
images). I use ffmpeg for this.
Then you need to define the part of the screen where the chat is (to
reduce noise and speed up the process). Crop the chat rectangle from the
images (I use ImageMagick for this). You could also boost the contrast
of those images for better OCR results (ImageMagick can do this too).
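
The frame extraction and cropping steps could be scripted roughly like
this. It is only a sketch: the function names, the rate of one frame per
second, and the way the chat rectangle is passed in are my assumptions,
and -normalize is just one of several ImageMagick options for stretching
contrast. On ImageMagick 7 the program is called "magick" rather than
"convert".

```python
import subprocess


def ffmpeg_frames_cmd(video, out_pattern, fps=1):
    """Build an ffmpeg command that saves `fps` frames per second as images.

    `out_pattern` is an ffmpeg image pattern such as "frames/f_%05d.png".
    """
    return ["ffmpeg", "-i", video, "-vf", f"fps={fps}", out_pattern]


def crop_cmd(src, dst, width, height, x, y):
    """Build an ImageMagick command that crops the chat rectangle.

    The rectangle is `width` x `height` pixels with its top-left corner
    at (`x`, `y`). -normalize stretches the contrast of the cropped image.
    """
    geometry = f"{width}x{height}+{x}+{y}"
    return ["convert", src, "-crop", geometry, "+repage", "-normalize", dst]


def run(cmd):
    """Run a command and raise an error if it fails."""
    subprocess.run(cmd, check=True)
```

For example, run(ffmpeg_frames_cmd("wow.mp4", "frames/f_%05d.png"))
followed by run(crop_cmd(...)) for each frame; the rectangle coordinates
have to be measured once per video layout.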
Then use some OCR software to extract the chat text (Tesseract should be
fine -- http://code.google.com/p/tesseract-ocr/ ; but you have a lot of
options:
http://en.wikipedia.org/wiki/Comparison_of_optical_character_recognition_software
).
At the end, you need to filter out duplicates (more than one frame will
contain the same messages). This is also easy.
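
A minimal sketch of the duplicate filtering, assuming the OCR output of
each frame is available as one string and that exact matching (after
collapsing whitespace) is good enough. The function name is made up for
illustration.

```python
def dedupe_chat_lines(frames):
    """Return chat lines in first-seen order, skipping lines seen before.

    `frames` is an iterable of OCR outputs, one string per frame.
    """
    seen = set()
    result = []
    for text in frames:
        for line in text.splitlines():
            # Collapse runs of whitespace so spacing differences between
            # frames do not hide a duplicate.
            line = " ".join(line.split())
            if line and line not in seen:
                seen.add(line)
                result.append(line)
    return result
```

Note that exact matching also drops messages that were genuinely sent
twice, and it will not catch duplicates that OCR read slightly
differently in two frames; some fuzzy comparison may be needed there.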

In summary, if you manage to find OCR software that behaves like:
    ocr chat-window.jpg > chat.txt
then, not counting the lines of code needed to filter out duplicates and
the manual work of downloading the videos, you'll have only 3-5 lines of
code :).
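
Tesseract can be made to behave roughly like that "ocr" command by
passing "stdout" as its output name. Below is a small sketch of such a
wrapper; the function names and the default language "eng" are my
assumptions, and the tesseract program must be installed separately.

```python
import subprocess


def tesseract_cmd(image, lang="eng"):
    """Build a Tesseract command that prints recognized text to stdout."""
    return ["tesseract", image, "stdout", "-l", lang]


def ocr(image, lang="eng"):
    """Run Tesseract on one image and return the recognized text."""
    done = subprocess.run(
        tesseract_cmd(image, lang),
        check=True,
        capture_output=True,
        text=True,
    )
    return done.stdout
```

Calling ocr("chat-window.jpg") then returns the text that the command
above would have written to chat.txt.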


  Regards,
    Ivan Krišto
