[Corpora-List] World of Warcraft Corpus (Ivan Krišto)

Tue Sep 10 10:48:48 UTC 2013

Dear liling,

There are many extensions and apps that you can use to download YouTube
videos easily, such as Downloadhelper for Firefox or YouTube Downloader.
With Downloadhelper and others you can also queue up new videos to
download. YouTube also creates automatic playlists, which is definitely
helpful.

As Laura Christopherson has pointed out, protecting identities may be an
issue. WoW users may choose to use the Real ID system, which lets them
chat with friends on the same game network (Battle.net <http://battle.net>),
including those not playing WoW but, e.g., Starcraft 2 (I think). As far as
I remember, that means that their real names are displayed instead of their
character names. It is probably a good idea to send Blizzard an email
asking what you are legally allowed to do in terms of using game
content for research. Since gameplay videos, which are public, are allowed
to be published on YouTube, I don't expect major problems. You may
even decide not to limit yourself to WoW chats but to include more games
within the Battle.net network.
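
One way to reduce the identity risk before storing any chat is to replace speaker names with stable pseudonyms. The following is only a minimal sketch: the `[Channel] [Name]: message` line layout, the `pseudonymize` helper and the `salt` parameter are assumptions made for illustration, not something taken from WoW documentation or from Laura's procedure.

```python
import hashlib
import re

# Assumed layout: the speaker is the bracketed name directly before a colon,
# e.g. "[2. Trade] [Thrall]: WTS sword". This is a guess, not a WoW spec.
SPEAKER = re.compile(r"\[([^\[\]]+)\]:")


def pseudonymize(line: str, salt: str) -> str:
    """Replace the first speaker name in a chat line with a salted hash."""

    def replace(match: "re.Match[str]") -> str:
        name = match.group(1)
        digest = hashlib.sha256((salt + name).encode("utf-8")).hexdigest()
        return f"[user_{digest[:8]}]:"

    # Only the first match is treated as the speaker; later "[x]:" patterns
    # inside the message text are left untouched.
    return SPEAKER.sub(replace, line, count=1)
```

The same name always maps to the same pseudonym as long as the salt is kept, so conversations stay analysable. Names mentioned inside the message text are not touched by this sketch and would need separate handling.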

Good luck with this very interesting project!

Gideon Kotzé
gidi8ster at gmail.com
www.gideonkotze.nl

On Tue, Sep 10, 2013 at 12:00 PM, <corpora-request at uib.no> wrote:

> Send Corpora mailing list submissions to
>         corpora at uib.no
>
> To subscribe or unsubscribe via the World Wide Web, visit
>         http://mailman.uib.no/listinfo/corpora
> or, via email, send a message with subject or body 'help' to
>         corpora-request at uib.no
>
> You can reach the person managing the list at
>         corpora-owner at uib.no
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of Corpora digest..."
>
> Today's Topics:
>
>    1. Re:  Corpora Digest, Vol 75, Issue 9 (Laura Christopherson)
>    2.  A dependency parser for Arabic (Jack Alan)
>    3. Re:  A dependency parser for Arabic (Kevin Gimpel)
>    4.  Fwd: [clcs-sdl-chercheurs] colloque international
>       Interactions Multimodales Par Ecran (IMPEC 2014) (Angus Grieve-Smith)
>    5. Re:  A dependency parser for Arabic (Yuval Marton)
>    6. Re:  A dependency parser for Arabic (Yuval Marton)
>    7.  Call for Participation: SPMRL 2013 - EMNLP-Workshop on
>       Statistical Parsing of Morphologically Rich Languages
>       (irehbein at uni-potsdam.de)
>    8. Re:  World of Warcraft Corpus (Ivan Krišto)
>
>
> ---------- Forwarded message ----------
> From: Laura Christopherson <llchrist at email.unc.edu>
> To: corpora at uib.no
> Cc:
> Date: Mon, 09 Sep 2013 09:08:12 -0400
> Subject: Re: [Corpora-List] Corpora Digest, Vol 75, Issue 9
> Hi All and liling,
>
> I'm happy to talk with you about WoW corpus creation off list. Just email
> me back.
>
> A few things to note:
> - Bots are disallowed, and if you use one you can be banned from WoW.
> - Richer WoW chat is a matter of having toons (avatars) at all experience
> levels who can travel to multiple zones/cities and engage in multiple types
> of activities (and thus communicate in a variety of chat channels). And if
> you can get chat from multiple servers and both factions, great.
> - Some IRB committees may feel uncomfortable with this because you do have
> to have a userid/password to get into WoW. So you will need to ensure you
> are taking extra precautions to protect identities. I can tell you more
> about what I put in my IRB application, which was exempted.
> - I'm happy to send you my dissertation which includes details on how the
> corpus was collected. My corpus included non-WoW-chat texts, so you can
> just ignore all the parts that don't pertain to WoW chat collection.
>
> Thanks,
> Laura Christopherson
>
> On 9/9/13 6:00 AM, corpora-request at uib.no wrote:
>
>> Message: 5
>> Date: Mon, 9 Sep 2013 07:46:51 +0200
>> From: liling tan<alvations at gmail.com>
>> Subject: [Corpora-List] World of Warcraft Corpus
>> To:corpora at uib.no
>>
>> Dear all,
>>
>> Does anyone know of any compilation of World of Warcraft (WoW) chat
>> corpus?
>>
>> Any suggestions/advice on how to collect a WoW chat corpus?
>>
>> Best Regards,
>> liling
>>
>> ------------------------------
>>
>> Message: 6
>> Date: Mon, 9 Sep 2013 08:12:39 +0200
>> From: Daniel Stein<danielstein81 at gmail.com>
>> Subject: Re: [Corpora-List] World of Warcraft Corpus
>> To: liling tan<alvations at gmail.com>, "corpora at uib.no" <corpora at uib.no>
>>
>> Dear liling,
>>
>> Maybe this is interesting for you:
>>
>> Laura Christopherson: What are people really saying in World of Warcraft
>> Chat? (http://dl.acm.org/citation.cfm?id=1920331.1920490)
>>
>> Kind Regards
>> Daniel
>>
>
>
>
>
>
> ---------- Forwarded message ----------
> From: Jack Alan <j.o.alan2012 at gmail.com>
> To: corpora at uib.no
> Cc:
> Date: Mon, 9 Sep 2013 22:12:19 +0100
> Subject: [Corpora-List] A dependency parser for Arabic
> Hi everyone,
>
> I wonder if someone has come across a dependency parser for Arabic. I've no
> access to any resources provided by LDC, so I'm looking for something
> **opensource**, i.e. free.
>
> By the way, I'm using AMIRA[1] to perform tokenization. So, I want to feed
> the tokenized text into the dependency parser to do the job.
>
> Could anyone point me to the proper tool to use, if any?
>
> Jack
>
>
> Ref:
> [1] Diab, Mona. "Second generation AMIRA tools for Arabic processing:
> Fast and robust tokenization, POS tagging, and base phrase chunking." *2nd
> International Conference on Arabic Language Resources and Tools*. 2009.
>
>
>
> ---------- Forwarded message ----------
> From: Kevin Gimpel <kgimpel at cs.cmu.edu>
> To: Jack Alan <j.o.alan2012 at gmail.com>
> Cc: corpora at uib.no
> Date: Mon, 9 Sep 2013 18:23:06 -0500
> Subject: Re: [Corpora-List] A dependency parser for Arabic
> Hi Jack,
> TurboParser (http://www.ark.cs.cmu.edu/TurboParser/) includes a
> pretrained model for Arabic. (Not sure how the AMIRA tokenization differs
> from the tokenization of the CoNLL-X data used to train this model, but
> others might know.)
> The Stanford parser (http://nlp.stanford.edu/software/lex-parser.shtml) also
> has an Arabic model. You can get dependencies from the phrase structure
> parses, though not typed dependencies (
> http://nlp.stanford.edu/software/parser-arabic-faq.shtml#j).
> Kevin
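
For feeding already tokenized text into a parser, a common interchange format is the ten-column, tab-separated CoNLL-X layout mentioned above. The sketch below is an assumption-laden illustration: `to_conllx` is a hypothetical helper, and filling every column after the word form with `_` only fixes the file shape. A pretrained model will normally also expect POS tags in columns 4 and 5, so those would have to come from the tagger.

```python
from typing import List


def to_conllx(tokens: List[str]) -> str:
    """Render one tokenized sentence as CoNLL-X rows.

    Columns: ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL,
    PHEAD, PDEPREL. Everything except ID and FORM is left as "_".
    """
    rows = []
    for index, form in enumerate(tokens, start=1):
        columns = [str(index), form] + ["_"] * 8
        rows.append("\t".join(columns))
    # A blank line terminates each sentence in CoNLL-X files.
    return "\n".join(rows) + "\n\n"
```

Calling `to_conllx` once per sentence and concatenating the results gives a file in the expected shape. Whether the tokens match the tokenization the model was trained on is a separate question, as noted above.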
>
>
>> _______________________________________________
>> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>> Corpora mailing list
>> Corpora at uib.no
>> http://mailman.uib.no/listinfo/corpora
>>
>>
>
>
> ---------- Forwarded message ----------
> From: Angus Grieve-Smith <grvsmth at panix.com>
> To: corpora at uib.no
> Cc:
> Date: Mon, 09 Sep 2013 19:47:22 -0400
> Subject: [Corpora-List] Fwd: [clcs-sdl-chercheurs] colloque international
> Interactions Multimodales Par Ecran (IMPEC 2014)
>
>
>
> -------- Original Message --------
> Subject: [clcs-sdl-chercheurs] colloque international Interactions
> Multimodales Par Ecran (IMPEC 2014)
> Date: Mon, 9 Sep 2013 16:57:59 +0200
> From: Samira Ibnelkaïd <samiraibnelkaid at gmail.com>
>
> Dear colleagues,
>
> Please find here http://impec2014.sciencesconf.org/ the call for
> submissions in French and in English for the first international conference
> on *Multimodal screen-based interactions* which will be held in Lyon,
> France, from July 2 to 4, 2014.
>
> Thank you for distributing this information.
>
>    On behalf of the organizing committee
> Samira Ibnelkaïd
> Doctorante en Sciences du Langage
> Laboratoire ICAR - Université Lumière Lyon 2
>
>
>
>
>
>
> --
> 				-Angus B. Grieve-Smith
> 				grvsmth at panix.com
>
>
>
>
>
> ---------- Forwarded message ----------
> From: Yuval Marton <yuvalmarton at gmail.com>
> To: Kevin Gimpel <kgimpel at cs.cmu.edu>
> Cc: "corpora at uib.no" <corpora at uib.no>, Jack Alan <j.o.alan2012 at gmail.com>
> Date: Mon, 9 Sep 2013 17:00:17 -0700
> Subject: Re: [Corpora-List] A dependency parser for Arabic
> Jack,
>
> You might want to check out the Columbia CATiB parser (same group who
> developed Amira)
>
> http://www1.ccls.columbia.edu/~ymarton/#_Teaching
> (resources and tools section)
>
> It is one of the best dep parsers for Arabic to date, just evaluated in
> the EMNLP 2013 SPMRL shared task.
>
> I can provide you with more details if you email me directly.
>
> -Yuval
>
> --- Pardon typos, sent from my phone ---
>
>
>
>
> ---------- Forwarded message ----------
> From: Yuval Marton <yuvalmarton at gmail.com>
> To: Kevin Gimpel <kgimpel at cs.cmu.edu>
> Cc: Nizar Habash <habash at ccls.columbia.edu>, "corpora at uib.no" <
> corpora at uib.no>, Sarah Alkuhlani <sma2149 at columbia.edu>, Owen Rambow <
> rambow at ccls.columbia.edu>, Jack Alan <j.o.alan2012 at gmail.com>
> Date: Mon, 9 Sep 2013 20:16:34 -0700
> Subject: Re: [Corpora-List] A dependency parser for Arabic
> Hi Jack,
>
> Just to add to my previous answer:
>
> Here's the related publication of mine:
> Yuval Marton, Nizar Habash and Owen Rambow. “Dependency Parsing of Modern
> Standard Arabic with Lexical and Inflectional Features”. Computational
> Linguistics, Volume 39, Issue 1. Online version posted November 13, 2012.
> http://www.mitpressjournals.org/doi/abs/10.1162/COLI_a_00138
>
> Follow this link for the EMNLP 2013  SPMRL workshop shared task benchmark
> (to be published soon) :
> http://www.spmrl.org/spmrl2013.html
>
> Anyone who is interested in trying the parser out, please email me
> directly (until we update the official page).
>  The installation assumes you have MADA (morphological analyzer) and a
> few other tools installed, but once installed, it provides an end-to-end
> pipeline from raw text to POS tags and dependency parses.
>
>
> Best,
>
> -Yuval
>
>
>
>
>
>
>
> ---------- Forwarded message ----------
> From: irehbein at uni-potsdam.de
> To: corpora at uib.no
> Cc: irehbein at uni-potsdam.de
> Date: Tue, 10 Sep 2013 08:01:30 +0200
> Subject: [Corpora-List] Call for Participation: SPMRL 2013 -
> EMNLP-Workshop on Statistical Parsing of Morphologically Rich Languages
> *****************************************************************************
> SPMRL 2013 - EMNLP-Workshop on Statistical
> Parsing of Morphologically Rich Languages
> *****************************************************************************
>
> ENDORSED BY SIGPARSE
>
> The 4th Workshop on Statistical Parsing of Morphologically Rich Languages
> will be held in conjunction with the 2013 Conference on Empirical Methods
> in Natural Language Processing (EMNLP 2013) which will take place on
> October 18th, 2013 in Seattle, Washington.
>
> Please note that the workshop takes place BEFORE the main conference
>
> SPMRL 2013 will also host the first SHARED TASK on parsing morphologically
> rich languages (see section below).
>
>
> Workshop Description
> --------------------
> The SPMRL series of workshops provides a forum for research in parsing
> morphologically-rich languages, with the goal of identifying cross-cutting
> issues in the annotation and parsing methodology for such languages, which
> typically have more flexible word order and/or higher word-form variation
> than English.
>
> Website  http://www.spmrl.org
>
>
> Keynote Speaker
> ---------------
> Julia Hockenmaier (University of Illinois at Urbana-Champaign)
>
>
> Chairs
> ------
> The workshop will be chaired by Djamé Seddah and Yuval Marton.
>
>
> Accepted Papers
> ---------------
> LITHUANIAN DEPENDENCY PARSING WITH RICH MORPHOLOGICAL FEATURES
> Jurgita Kapociute-Dzikiene, Joakim Nivre and Algis Krupavicius
>
> PARSING CROATIAN AND SERBIAN BY USING CROATIAN DEPENDENCY TREEBANKS
> Zeljko Agic, Danijela Merkler and Dasa Berovic
>
> A CROSS-TASK FLEXIBLE TRANSITION MODEL FOR ARABIC TOKENIZATION, AFFIX
> DETECTION, AFFIX LABELING, POS
> Stephen Tratz
>
> WORKING WITH A SMALL DATASET - SEMI-SUPERVISED DEPENDENCY PARSING FOR IRISH
> Teresa Lynn, Jennifer Foster and Mark Dras
>
> AN EMPIRICAL STUDY ON THE EFFECT OF MORPHOLOGICAL AND LEXICAL FEATURES IN
> PERSIAN DEPENDENCY PARSING
> Mojtaba Khallash, Ali Hadian and Behrouz Minaei-Bidgoli
>
> CONSTRUCTING A PRACTICAL CONSTITUENT PARSER FROM A JAPANESE TREEBANK WITH
> FUNCTION LABELS
> Takaaki Tanaka and Masaaki Nagata
>
> CONTEXT BASED MORPHOLOGICAL ANALYZER FOR HINDI AND ITS EFFECT ON HINDI
> DEPENDENCY PARSING
> Deepak Kumar Malladi and Prashanth Mannem
>
> REPRESENTATION OF MORPHOSYNTACTIC UNITS AND COORDINATION STRUCTURES IN THE
> TURKISH DEPENDENCY TREEBANK
> Umut Sulubacak and Gülsen Eryigit
>
> A STATISTICAL APPROACH TO PREDICTION OF EMPTY CATEGORIES IN HINDI
> DEPENDENCY TREEBANK
> Puneeth Kukkadapu and Prashanth Mannem
>
>
>
> SPMRL 2013 SHARED TASK
> ----------------------
> The fourth SPMRL workshop will also host the first shared task on parsing
> morphologically rich languages:
>
> The primary goal of the shared task on parsing morphologically rich
> languages is to bring forward work on parsing morphologically ambiguous
> input in both dependency and constituency parsing, and to show the state
> of the art for MRLs. In the longer term,  we aim to provide streamlined
> data sets and  evaluation metrics, thus improving the comparability of
> cross-linguistic work on parsing MRLs.  The shared task will feature
> tracks in constituency parsing and in dependency parsing, in gold as well
> as in realistic scenarios (the realistic scenario will have no gold
> tokenization, no gold part-of-speech tags and morphological features).
>
> Website  http://www.spmrl.org/spmrl2013-sharedtask.html
>
>
> Workshop Organizers
> -------------------
> Yoav Goldberg (Bar Ilan University, Israel)
> Yuval Marton (Microsoft, WA)
> Ines Rehbein (Potsdam University, Germany)
> Yannick Versley (Tübingen University, Germany)
>
>
> Shared Task Organizers
> ----------------------
> Sandra Kübler (Indiana University, US)
> Djamé Seddah (Université Paris Sorbonne & INRIA's Alpage Project, France)
> Reut Tsarfaty (Weizmann Institute of Science, Israel)
>
>
> Program Committee
> -----------------
> Mohammed Attia (Dublin City University, Ireland)
> Bernd Bohnet (University of Birmingham, UK)
> Marie Candito (University of Paris 7, France)
> Aoife Cahill (Educational Testing Service, US)
> Ozlem Cetinoglu (University of Stuttgart, Germany)
> Jinho Choi (University of Colorado at Boulder, US)
> Grzegorz Chrupala (Saarland University, Germany)
> Benoit Crabbé (University of Paris 7, France)
> Gülsen Cebiroglu Eryigit (Istanbul Technical University, Turkey)
> Michael Elhadad (Ben Gurion University, Israel)
> Richard Farkas (University of Szeged, Hungary)
> Jennifer Foster (Dublin City University, Ireland)
> Josef van Genabith (Dublin City University, Ireland)
> Koldo Gojenola (University of the Basque Country, Spain)
> Spence Green (Stanford University, US)
> Samar Husain (Potsdam University, Germany)
> Sandra Kübler (Indiana University, US)
> Jonas Kuhn (University of Stuttgart, Germany)
> Alberto Lavelli (FBK-irst, Italy)
> Joseph Le Roux (Université Paris-Nord, France)
> Wolfgang Maier (University of Düsseldorf, Germany)
> Takuya Matsuzaki (University of Tokyo, Japan)
> Joakim Nivre (Uppsala University, Sweden)
> Kemal Oflazer (Carnegie Mellon University, Qatar)
> Adam Przepiorkowski (ICS PAS, Poland)
> Owen Rambow (Columbia University, US)
> Kenji Sagae (University of Southern California, US)
> Benoit Sagot (Inria Rocquencourt, France)
> Djamé Seddah (Inria Rocquencourt, France)
> Reut Tsarfaty (Weizmann Institute of Science, Israel)
> Lamia Tounsi (Dublin City University, Ireland)
> Daniel Zeman (Charles University, Czechia)
>
>
>
> ENDORSEMENT
>
> This workshop is endorsed by THE ACL SIGPARSE interest group.
>
> For their precious help preparing the SPMRL 2013 Shared Task and for
> allowing their data to be part of it, we warmly thank the Linguistic Data
> Consortium, the Knowledge Center for Processing Hebrew (MILA), the Ben
> Gurion University, Columbia University, Institute of Computer Science
> (Polish Academy of Sciences), Korea Advanced Institute of Science and
> Technology, University of the Basque Country, University of Lisbon,
> Uppsala University, University of Stuttgart, University of Szeged and
> University Paris Diderot (Paris 7).
>
>
>
>
>
>
>
>
>
>
>
>
>
> ---------- Forwarded message ----------
> From: "Ivan Krišto" <ivan.kristo at gmail.com>
> To: liling tan <alvations at gmail.com>
> Cc: corpora at uib.no
> Date: Tue, 10 Sep 2013 09:51:16 +0200
> Subject: Re: [Corpora-List] World of Warcraft Corpus
> On 09/09/2013 07:46 AM, liling tan wrote:
> > Dear all,
> >
> > Does anyone know of any compilation of World of Warcraft (WoW) chat
> > corpus?
> >
> > Any suggestions/advice on how to collect a WoW chat corpus?
>
> Hello!
>
> Here is a suggestion for how to collect a corpus:
> - download recorded gameplay videos from YouTube (there should be plenty
> of them),
> - extract the chat using OCR.
>
> This isn't a simple method, but it also isn't as hard as it seems.
> First you need to choose a good tool to download YT videos (due to a recent
> update of YT policy, this isn't a trivial task... maybe a Firefox video
> downloader plugin will still do the trick).
> Break the videos into images (or use the videos directly, but I prefer
> images). I use ffmpeg for this.
> Then you need to define the part of the screen where the chat is (to reduce
> noise and speed up the process). Crop the chat rectangle from the images
> (I use ImageMagick for this). You could also boost the contrast of those
> images for better OCR results (ImageMagick can do this).
> Then use some OCR software to extract the chat text (Tesseract should be
> fine -- http://code.google.com/p/tesseract-ocr/ ; but you have a lot of
> options:
> http://en.wikipedia.org/wiki/Comparison_of_optical_character_recognition_software
> ).
> At the end, you need to filter out duplicates (more than one frame will
> contain the same messages). This is also easy.
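
The frame, crop and OCR steps above can be sketched as three command lines built in Python. This is a rough illustration under stated assumptions: the helper names, the file names, the one-frame-per-second rate and the crop box are all made up for the example, and the ImageMagick binary is assumed to be called `convert` (newer installs use `magick`).

```python
from typing import List, Tuple


def frame_command(video: str, frame_dir: str, fps: int = 1) -> List[str]:
    """ffmpeg call that writes `fps` still images per second of video."""
    pattern = f"{frame_dir}/frame_%05d.png"
    return ["ffmpeg", "-i", video, "-vf", f"fps={fps}", pattern]


def crop_command(frame: str, out: str, box: Tuple[int, int, int, int]) -> List[str]:
    """ImageMagick call: crop the chat box, convert to gray, stretch contrast."""
    width, height, x, y = box
    geometry = f"{width}x{height}+{x}+{y}"
    return ["convert", frame, "-crop", geometry, "+repage",
            "-colorspace", "Gray", "-normalize", out]


def ocr_command(image: str, out_base: str) -> List[str]:
    """Tesseract call; the recognized text lands in `out_base` + ".txt"."""
    return ["tesseract", image, out_base]
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the tools are installed. The crop box has to be measured per video, because the chat window position depends on each player's interface layout.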
>
> In summary, if you manage to find OCR software which behaves like:
> ocr chat-window.jpg > chat.txt
> then, not counting the lines of code needed to filter out duplicates and
> the manual work to download videos, you'll have only 3-5 lines of code :).
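
The duplicate filter itself could look roughly like the following. It is a minimal sketch that assumes each frame's OCR output is available as one string. The function names are hypothetical, and exact matching after whitespace and case normalization is a simplification.

```python
import re
from typing import Iterable, List


def normalize(line: str) -> str:
    """Collapse whitespace and case so trivial OCR differences still match."""
    return re.sub(r"\s+", " ", line).strip().casefold()


def unique_chat_lines(frame_texts: Iterable[str]) -> List[str]:
    """Keep the first occurrence of every chat line, in frame order."""
    seen = set()
    kept = []
    for text in frame_texts:
        for line in text.splitlines():
            key = normalize(line)
            if key and key not in seen:
                seen.add(key)
                kept.append(line.strip())
    return kept
```

Two caveats apply. A message that a player really typed twice is also dropped, and OCR noise can make the same message look different from frame to frame, so some fuzzy comparison may be needed in practice.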
>
>
>   Regards,
>     Ivan Krišto
>
>
>
>
>