34.1584, Confs: Second Workshop on Corpus Generation and Corpus Augmentation for Machine Translation

The LINGUIST List linguist at listserv.linguistlist.org
Sat May 20 03:05:02 UTC 2023


LINGUIST List: Vol-34-1584. Sat May 20 2023. ISSN: 1069 - 4875.

Moderator: Malgorzata E. Cavar, Francis Tyers (linguist at linguistlist.org)
Managing Editor: Lauren Perkins
Team: Helen Aristar-Dry, Steven Franks, Everett Green, Joshua Sims, Daniel Swanson, Matthew Fort, Maria Lucero Guillen Puon, Zackary Leech, Lynzie Coburn
Jobs: jobs at linguistlist.org | Conferences: callconf at linguistlist.org | Pubs: pubs at linguistlist.org

Homepage: http://linguistlist.org

Editor for this issue: Everett Green <everett at linguistlist.org>
================================================================


Date: 20-May-2023
From: John Ortega [j.ortega at northeastern.edu]
Subject: Second Workshop on Corpus Generation and Corpus Augmentation for Machine Translation


Second Workshop on Corpus Generation and Corpus Augmentation for
Machine Translation
Short Title: CoCo4MT

Date: 04-Sep-2023 - 05-Sep-2023
Location: Macau @ MT SUMMIT 2023, China
Contact: John Ortega
Contact Email: coco4mt-2023-organizers at googlegroups.com
Meeting URL: https://sites.google.com/view/coco4mt

Linguistic Field(s): Applied Linguistics; Computational Linguistics;
Morphology; Text/Corpus Linguistics; Translation

Meeting Description:

The Second Workshop on Corpus Generation and Corpus Augmentation for
Machine Translation (CoCo4MT) @MT-SUMMIT XIX
The 19th Machine Translation Summit
Sep 4-8, 2023, Macau SAR, China
https://sites.google.com/view/coco4mt

SCOPE

It is a well-known fact that machine translation systems, especially
those that use deep learning, require massive amounts of data. For
many languages, several kinds of resources are not available in
human-created form. The types of resources that do exist include
monolingual and multilingual corpora, translation memories, and
lexicons. Parallel resources are generally created for formal
purposes, such as parliamentary collections, whereas monolingual
resources tend to come from more informal situations. The quality and
abundance of resources, including corpora, created for formal purposes
are generally higher than those of resources created for informal
purposes. Additionally, corpora for low-resource languages (languages
with fewer digital resources available) tend to be less abundant and
of lower quality.

CoCo4MT is a workshop centered around research that focuses on manual
and automatic corpus creation, cleansing, and augmentation techniques
specifically for machine translation. We accept work that covers any
language (including sign language) but we are specifically interested
in those submissions that explicitly report on work with languages
with limited existing resources (low-resource languages). Since
techniques from high-resource languages are generally statistical in
nature and could be used as generic solutions for any language, we
welcome submissions on high-resource languages also.

CoCo4MT aims to encourage research on new and undiscovered techniques.
We hope that the methods presented at this workshop will lead to the
development of high-quality corpora that will, in turn, lead to
high-performing MT systems. We also hope that submissions will provide
high-quality corpora that are publicly available for download and can
be used to improve machine translation performance, thus encouraging
new dataset creation for multiple languages and establishing the
workshop as a general venue to consult for corpus needs in the future.
The workshop’s success will be measured by the following key
performance indicators:

- Promotes the ongoing increase in quality of machine translation
systems when measured by standard measurements,
- Provides a meeting place for collaboration from several research
areas to increase the availability of commonly used corpora and new
corpora,
- Drives innovation to address the need for higher quality and
abundance of low-resource language data.

Topics of interest include:

- Difficulties with using existing corpora (e.g., political
considerations or domain limitations) and their effects on final MT
systems,
- Strategies for collecting new MT datasets (e.g., via crowdsourcing),
- Data augmentation techniques,
- Data cleansing and denoising techniques,
- Quality control strategies for MT data,
- Exploration of datasets for pretraining or auxiliary tasks for
training MT systems.

SHARED TASK

To encourage research on corpus construction for low-resource machine
translation, we introduce a shared task focused on identifying
high-quality instances that should be translated into a target
low-resource language. Participants are provided access to multi-way
corpora in the high-resource languages of English, Spanish, German,
Korean, and Indonesian and, using these, are required to identify
beneficial instances that, when translated into the low-resource
languages of Cebuano, Gujarati, and Burmese, lead to high-performing
MT systems. More details on data, evaluation, and submission can be
found on the website (https://sites.google.com/view/coco4mt) or by
emailing coco4mt-shared-task at googlegroups.com.
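
As a purely illustrative sketch of what "identifying beneficial
instances" could look like, the snippet below greedily picks source
sentences that add the most previously unseen word types, up to a
fixed budget. This is not the official baseline or evaluation
procedure of the shared task; the function name `select_instances`,
the whitespace tokenization, and the `budget` parameter are all
assumptions made for the example.

```python
from typing import List, Set


def select_instances(sentences: List[str], budget: int) -> List[int]:
    """Greedily pick up to `budget` sentence indices that add the most unseen word types.

    Hypothetical helper for illustration only; not part of the shared task materials.
    """
    # Naive tokenization (assumption): lowercase and split on whitespace.
    token_sets: List[Set[str]] = [set(s.lower().split()) for s in sentences]
    covered: Set[str] = set()
    chosen: List[int] = []
    remaining = set(range(len(sentences)))

    while remaining and len(chosen) < budget:
        # Pick the sentence contributing the most new word types;
        # iterating in sorted order makes ties go to the lowest index.
        best = max(sorted(remaining), key=lambda i: len(token_sets[i] - covered))
        if not token_sets[best] - covered:
            break  # no remaining sentence adds anything new
        chosen.append(best)
        covered |= token_sets[best]
        remaining.remove(best)
    return chosen


if __name__ == "__main__":
    pool = ["the cat sat", "the cat sat", "a dog barked loudly", "the dog"]
    print(select_instances(pool, budget=2))
```

Vocabulary coverage is only one possible selection signal; actual
submissions may rely on entirely different criteria, and the task
website remains the authoritative source for data and evaluation
details.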



