[Corpora-List] Constructing a parallel Arabic-English corpus that can be freely distributed without cost

Ana Frankenberg ana.frankenberg at gmail.com
Tue Dec 10 20:23:46 UTC 2013


Dear Kais
I find your most expensive option, paying for the translations, somewhat
at odds with the idea that corpora should be based on naturally occurring
texts. If you want to use the corpus to study translation, then I think it
should be based on existing translations produced for authentic purposes
rather than for the "artificial" purpose of building a parallel corpus.
Who would your translators be? How many translators would you employ?
Questions such as these should be taken into account. But if you are
interested in parallel alignment for other reasons, then paying for the
translations might be an (expensive) option. I hope this helps you decide.
Best wishes
Ana



On Tue, Dec 10, 2013 at 10:02 AM, Kais Dukes <sckd at leeds.ac.uk> wrote:

> I'm performing a small feasibility study to understand how expensive it
> would be to build a parallel Arabic-English corpus. I'm aware that such
> resources already exist (e.g. LDC), but these don't suit my purpose. I want
> to develop something free that can be easily downloaded and used without
> cost by the wider research community, e.g. under a Creative Commons or
> other open source license. With regard to Arabic genre/dialect, I'm only
> interested in Modern Standard Arabic (MSA) in its standard register (so
> generally not social media, blogs, etc.). Ideally, I'm looking for
> well-written news articles in Arabic by a prominent news agency, or
> something of comparable quality.
> Some options I could pursue, from most expensive to least expensive:
> 1. Collect some Arabic news articles online, and then pay to have these
> translated into English (either by a professional translation service or
> via crowdsourcing). Sources could include Al Jazeera, Associated Press,
> Arabic Wikinews etc.
>
> 2. Use the United Nations as a source of parallel translated texts. My
> only concern with this option is that these texts sound quite specific in
> terms of genre compared to more general news articles, so might not be an
> ideal solution for what I want to achieve.
>
> 3. Use some other high-quality source of Arabic-English (free and easily
> available) parallel text that I've not thought of for Modern Standard
> Arabic.
>
> My aim is to work out whether option 1 is the only way to develop and
> publish (to the research community) a free, high-quality Arabic-English
> parallel corpus, or whether there is something I'm missing. I would also
> like to have the corpus sentence-aligned, although this is something I
> could do myself (semi-automatically, with manual correction) if the two
> corpora form a good structural translation pair.
>
> In a nutshell, my question is: is my best option to pay to have
> high-quality Arabic news articles translated into English and then
> distribute the resulting corpus as a free resource, or is there a better
> (high-quality) starting point for MSA news?
>
> Any advice is most welcome. For option 1, it would also be great to get a
> general feel for high-quality translation costs, e.g. in dollars per word.
> Also, if anyone is interested in helping with this effort, please do get
> in touch.
> Kind Regards,
> -- Kais Dukes
> School of Computing
> University of Leeds
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>