[Corpora-List] Transcript of Modern Standard Arabic with Pauses Marked

Claire Brierley C.Brierley at leeds.ac.uk
Wed Mar 13 13:41:32 UTC 2013


Hello List members

Researchers at the Universities of Leeds and Jordan are looking for a small test corpus (5,000 to 10,000 words) of Modern Standard Arabic (MSA) marked up with prosodic boundaries. These boundaries should delineate well-formed, meaningful chunks and should not represent disfluencies.

In essence, we would like each word tagged as either a break (pause) or a non-break, for use in Arabic phrase break prediction. This work relates to our EPSRC-funded project (starting in April): "Natural Language Processing Working Together with Arabic and Islamic Studies".

Such a corpus could be created (or may already have been created) by annotators listening to well-formed speech recordings of, say, broadcast news, documentaries, and audiobooks, marking "perceived pauses", and checking inter-annotator agreement - rather like the Spoken English Corpus. Breaks will need to be marked more frequently than punctuation, which can be sparse in MSA.
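As a rough illustration of the agreement check mentioned above, here is a minimal Python sketch that computes Cohen's kappa over two annotators' break/non-break labels for the same word sequence. Kappa is only one common choice of agreement measure and is an assumption here, not necessarily the measure used for the Spoken English Corpus; the function name and the toy labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length label sequences.

    Assumption: both annotators labelled the same words in the same
    order, one label ("break" or "non-break") per word.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")

    n = len(labels_a)
    # Proportion of words on which the annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Agreement expected by chance, from each annotator's label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    categories = set(count_a) | set(count_b)
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy example (hypothetical labels, not real annotation data).
annotator_1 = ["non-break", "break", "non-break", "break"]
annotator_2 = ["non-break", "break", "break", "non-break"]
print(cohen_kappa(annotator_1, annotator_2))
```

Because non-breaks heavily outnumber breaks, raw percentage agreement would look high even for careless annotation; a chance-corrected measure such as kappa is less flattering in that situation.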

If anyone has such a dataset (or something similar, with "well-formed" pauses identified) and is willing to share it, we would love to hear from you.

As an example, the data below from Leeds' Corpus of Contemporary Arabic shows a single MSA sentence of 48 words. Only two of these are followed by punctuation - and we have identified these as breaks. We have also tagged a few other words as likely boundary locations.


يَخُوضُ non-break       VERB
مُنْتَخَبُ      non-break       NOUN
الْكُوَيْتِ     non-break       NOUN
الْوَطَنِيِّ    non-break       NOUN
لِكُرَةِ        non-break       NOUN
الْقَدَمِ       non-break       NOUN
مُبَارَاتَهُ    non-break       NOUN
الْثَّالِثَةِ   non-break       NOMINAL
وَالْمُهِمَّةِ  break   NOMINAL
الْسَّاعَةِ     non-break       NOUN
الْسَّابِعَةِ   non-break       NOMINAL
وَالْنِّصْفِ    non-break       NOUN
مَسَاءً non-break       NOUN
الْيَوْمَ       non-break       ADVERB
أَمَامَ non-break       ADVERB
الْيَمَنِ       break   NOUN
عَلَى   non-break       PREPOSITION
اسْتَادِ        non-break       NOUN
نَادِي  non-break       NOUN
الْكُوَيْتِ     break   NOUN
فِيْ    non-break       PREPOSITION
الْجَوْلَةِ     non-break       NOUN
الثَّالِثَةِ    non-break       NOMINAL
لِمُبَارَيَاتِ  non-break       NOUN
بُطُولَةِ       non-break       NOUN
كَأْسِ  non-break       NOUN
الْخَلِيْجِ     non-break       NOUN
الْعَرَبِيِّ    non-break       NOUN
الْسَّادِسَةِ   non-break       NOMINAL
عَشْرةَ non-break       NOMINAL
لِكُرَةِ        non-break       NOUN
الْقَدَمِ       break   NOUN
الَّتِيْ        non-break       PRONOUN
تَسْتَمِرُّ     non-break       VERB
بِالْكُوَيْتِ   non-break       NOUN
حَتَّى  non-break       PREPOSITION
يَوْمِ  non-break       ADVERB
11      non-break       NOMINAL
الْجَارِي       break   NOUN    ،       |
وَيَسْبِقُهَا   non-break       VERB
مُبَارَاةِ      non-break       NOUN
الْبَحْرَيْنِ   non-break       NOUN
وَالْسَّعُودِيَّةِ      non-break       NOUN
الَّتِيْ        non-break       PRONOUN
سَتَبْدَأُ      non-break       VERB
الْسَّاعَةَ     non-break       ADVERB
الْخَامِسَةِ    non-break       NOMINAL
وَالْنِّصْفِ    break   NOUN    .       ||


This is the sort of data we are looking for - though it doesn't need to be fully vowelised.
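For anyone wanting to check whether their own data fits this layout, here is a minimal Python sketch that reads the column format used in the sample: word, break label, part-of-speech tag, and, where present, a punctuation mark and a boundary symbol ("|" or "||"). The names Token, parse_line, parse_sample and break_summary are hypothetical, and the sketch assumes the columns are separated by whitespace and that no field contains internal spaces.

```python
from typing import NamedTuple, Optional


class Token(NamedTuple):
    word: str                    # surface form (vowelised or not)
    label: str                   # "break" or "non-break"
    pos: str                     # coarse part-of-speech tag
    punct: Optional[str] = None  # punctuation following the word, if any
    mark: Optional[str] = None   # boundary symbol such as "|" or "||", if any


def parse_line(line):
    """Parse one whitespace-separated line of the sample format."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 columns: {line!r}")
    word, label, pos = fields[:3]
    if label not in ("break", "non-break"):
        raise ValueError(f"unknown label {label!r} in line: {line!r}")
    punct = fields[3] if len(fields) > 3 else None
    mark = fields[4] if len(fields) > 4 else None
    return Token(word, label, pos, punct, mark)


def parse_sample(text):
    """Parse a block of lines, skipping blank ones."""
    return [parse_line(line) for line in text.splitlines() if line.strip()]


def break_summary(tokens):
    """Return (words, breaks, breaks that coincide with punctuation)."""
    breaks = [t for t in tokens if t.label == "break"]
    with_punct = [t for t in breaks if t.punct is not None]
    return len(tokens), len(breaks), len(with_punct)
```

Comparing the second and third numbers returned by break_summary gives a quick sense of how many breaks a punctuation-only approach would miss.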

Claire Brierley [C.Brierley at leeds.ac.uk]
Senior Research Fellow
Computing
University of Leeds, UK

