28.523, FYI: Text Normalization Training Data Available

The LINGUIST List linguist at listserv.linguistlist.org
Wed Jan 25 23:28:00 UTC 2017


LINGUIST List: Vol-28-523. Wed Jan 25 2017. ISSN: 1069 - 4875.

Moderators: linguist at linguistlist.org (Damir Cavar, Malgorzata E. Cavar)
Reviews: reviews at linguistlist.org (Helen Aristar-Dry, Robert Coté,
                                   Michael Czerniakowski)
Homepage: http://linguistlist.org

Editor for this issue: Kenneth Steimel <ken at linguistlist.org>
================================================================


Date: Wed, 25 Jan 2017 18:27:51
From: Richard Sproat [rws at xoba.com]
Subject: Text Normalization Training Data Available

 
Please visit https://github.com/rwsproat/text-normalization-data to download
the data. The data in that repository are the training, development, and test
data used in Sproat and Jaitly (2016) (https://arxiv.org/abs/1611.00068).

The data are in two subdirectories: en_with_types for English and
ru_with_types for Russian.

The following divisions of data were used:

Training: output-000[0-8]?-of-00100
Runtime eval: output-0009[0-4]-of-00100
Test data: output-0009[5-9]-of-00100
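
As a rough illustration of the divisions above, here is a minimal Python sketch
that maps a shard file name to its division. It assumes the patterns are
shell-style globs (so "?" matches one character) and that all 100 shards
output-00000-of-00100 through output-00099-of-00100 are present. The names
SPLIT_PATTERNS, split_of, and shard_names are made up for this example and are
not part of the released data or code.

```python
from fnmatch import fnmatchcase

# Patterns copied from the announcement, read as shell-style globs (assumption).
SPLIT_PATTERNS = {
    "train": "output-000[0-8]?-of-00100",
    "dev": "output-0009[0-4]-of-00100",   # "runtime eval" in the announcement
    "test": "output-0009[5-9]-of-00100",
}


def split_of(filename):
    """Return the division a shard file name belongs to, or None if none match."""
    for split, pattern in SPLIT_PATTERNS.items():
        if fnmatchcase(filename, pattern):
            return split
    return None


def shard_names(total=100):
    """Generate the assumed shard names output-00000-of-00100, and so on."""
    return ["output-%05d-of-%05d" % (i, total) for i in range(total)]


if __name__ == "__main__":
    counts = {}
    for name in shard_names():
        key = split_of(name)
        counts[key] = counts.get(key, 0) + 1
    print(counts)
```

Under these assumptions the patterns cover shards 00000-00089 for training,
00090-00094 for runtime evaluation, and 00095-00099 for testing.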

In practice, for the results reported in the paper, only the first 100,002
lines of output-00099-of-00100 were used for English, and the first 100,007
lines of output-00099-of-00100 for Russian.
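
To reproduce that truncation, something like the following Python sketch could
be used. It is only an illustration: the function name head_lines, the
LINES_USED table, the UTF-8 encoding, and the file path in the usage comment
are assumptions, not details stated in the announcement.

```python
from itertools import islice

# Line counts quoted in the announcement, keyed by subdirectory name.
LINES_USED = {
    "en_with_types": 100_002,
    "ru_with_types": 100_007,
}


def head_lines(path, n, encoding="utf-8"):
    """Return the first n lines of a text file, without trailing newlines.

    The encoding default is an assumption; adjust it if the files differ.
    """
    with open(path, encoding=encoding) as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]


# Hypothetical usage, assuming the repository has been cloned locally:
#   lines = head_lines("en_with_types/output-00099-of-00100",
#                      LINES_USED["en_with_types"])
```

Using islice stops reading after n lines, so the rest of the shard is never
loaded into memory.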

Lines with "<eos>" in two columns mark the end of a sentence. All other lines
have three columns: the first is the "semiotic class" (Taylor, 2009), the
second is the input token, and the third is the output, following the paper
cited above.
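
A minimal Python sketch of a reader for this layout follows. The column
separator is not stated in this announcement, so the tab default is an
assumption, and the names Token and read_sentences are invented for the
example.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    semiotic_class: str  # first column
    source: str          # second column: the input token
    target: str          # third column: the reference output


def read_sentences(lines: Iterable[str], sep: str = "\t") -> Iterator[List[Token]]:
    """Group three-column rows into sentences, splitting on <eos> rows.

    sep defaults to a tab, which is an assumption about the file format.
    """
    sentence: List[Token] = []
    for raw in lines:
        row = raw.rstrip("\n")
        if not row:
            continue  # tolerate stray blank lines
        fields = row.split(sep)
        if fields == ["<eos>", "<eos>"]:
            yield sentence
            sentence = []
        elif len(fields) == 3:
            sentence.append(Token(*fields))
        else:
            raise ValueError("unexpected number of columns: %r" % row)
    if sentence:
        # Keep a trailing sentence that lacks a closing <eos> row.
        yield sentence
```

Because it is a generator over any iterable of lines, the reader can be applied
directly to an open file handle without holding a whole shard in memory.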

All text is from Wikipedia. All data were extracted on 2016/04/08 and run
through the Google Kestrel TTS text normalization system (Ebden and Sproat,
2015), so the notions of "token" and "semiotic class", as well as the
reference output, are all Kestrel's.

Disclaimer:

This is not an official Google product.

References:

Ebden, Peter and Sproat, Richard. 2015. The Kestrel TTS text normalization
system. Natural Language Engineering. 21(3).

Sproat, Richard and Jaitly, Navdeep. 2016. RNN Approaches to Text
Normalization: A Challenge. Released on arXiv.org:
https://arxiv.org/abs/1611.00068

Taylor, Paul. 2009. Text-to-Speech Synthesis. Cambridge University Press,
Cambridge.
 



Linguistic Field(s): Computational Linguistics

Subject Language(s): English (eng)
                     Russian (rus)





 


