[Corpora-List] Transcript of Modern Standard Arabic with Pauses Marked
Claire Brierley
C.Brierley at leeds.ac.uk
Wed Mar 13 13:41:32 UTC 2013
Hello List members
Researchers at the Universities of Leeds and Jordan are looking for a small test corpus (5,000 to 10,000 words) of Modern Standard Arabic (MSA) marked up with prosodic boundaries. These boundaries should delineate well-formed, meaningful chunks and should not represent disfluencies.
In essence, we would like each word tagged either as a break/pause or non-break for Arabic phrase break prediction. This work relates to our EPSRC-funded project (starting in April): "Natural Language Processing Working Together with Arabic and Islamic Studies".
Such a corpus could be created (or may already have been created) by annotators listening to well-formed speech recordings of, say, broadcast news, documentaries, and audiobooks, marking "perceived pauses", and checking inter-annotator agreement, rather like the Spoken English Corpus. Breaks will need to be marked more frequently than punctuation, which can be sparse in MSA.
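To illustrate the agreement check mentioned above: a minimal sketch, assuming two annotators have labelled the same word sequence as break or non-break and that agreement is measured with Cohen's kappa. The choice of kappa, the function name and the toy labels are assumptions made for illustration, not part of any existing annotation scheme.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty sequence")
    n = len(labels_a)

    # Proportion of words on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Agreement expected by chance, from each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        counts_a[label] * counts_b[label] for label in set(counts_a) | set(counts_b)
    ) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for eight consecutive words (B = break, N = non-break).
annotator_1 = ["N", "N", "B", "N", "B", "N", "N", "B"]
annotator_2 = ["N", "N", "B", "N", "N", "N", "N", "B"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

Because non-breaks greatly outnumber breaks, raw percentage agreement would look high even for careless annotation; a chance-corrected measure is one way to account for that.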
If anyone has such a dataset (or something similar, with "well-formed" pauses identified) and is willing to share it, we would love to hear from you.
As an example, in the data below from Leeds' Corpus of Contemporary Arabic, we have a single MSA sentence of 48 words. Only two of these words are followed by punctuation, and we have tagged both as breaks. We have also tagged four other words as likely boundary locations.
يَخُوضُ non-break VERB
مُنْتَخَبُ non-break NOUN
الْكُوَيْتِ non-break NOUN
الْوَطَنِيِّ non-break NOUN
لِكُرَةِ non-break NOUN
الْقَدَمِ non-break NOUN
مُبَارَاتَهُ non-break NOUN
الْثَّالِثَةِ non-break NOMINAL
وَالْمُهِمَّةِ break NOMINAL
الْسَّاعَةِ non-break NOUN
الْسَّابِعَةِ non-break NOMINAL
وَالْنِّصْفِ non-break NOUN
مَسَاءً non-break NOUN
الْيَوْمَ non-break ADVERB
أَمَامَ non-break ADVERB
الْيَمَنِ break NOUN
عَلَى non-break PREPOSITION
اسْتَادِ non-break NOUN
نَادِي non-break NOUN
الْكُوَيْتِ break NOUN
فِيْ non-break PREPOSITION
الْجَوْلَةِ non-break NOUN
الثَّالِثَةِ non-break NOMINAL
لِمُبَارَيَاتِ non-break NOUN
بُطُولَةِ non-break NOUN
كَأْسِ non-break NOUN
الْخَلِيْجِ non-break NOUN
الْعَرَبِيِّ non-break NOUN
الْسَّادِسَةِ non-break NOMINAL
عَشْرةَ non-break NOMINAL
لِكُرَةِ non-break NOUN
الْقَدَمِ break NOUN
الَّتِيْ non-break PRONOUN
تَسْتَمِرُّ non-break VERB
بِالْكُوَيْتِ non-break NOUN
حَتَّى non-break PREPOSITION
يَوْمِ non-break ADVERB
11 non-break NOMINAL
الْجَارِي break NOUN ، |
وَيَسْبِقُهَا non-break VERB
مُبَارَاةِ non-break NOUN
الْبَحْرَيْنِ non-break NOUN
وَالْسَّعُودِيَّةِ non-break NOUN
الَّتِيْ non-break PRONOUN
سَتَبْدَأُ non-break VERB
الْسَّاعَةَ non-break ADVERB
الْخَامِسَةِ non-break NOMINAL
وَالْنِّصْفِ break NOUN . ||
This is the sort of data we are looking for - though it doesn't need to be fully vowelised.
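For anyone preparing data in this layout, here is a minimal sketch of how such lines could be read, assuming whitespace-separated fields in the order word, break label, part-of-speech tag, followed optionally by a punctuation mark and a boundary symbol ("|" or "||"). The names Token, parse_line and parse_corpus are hypothetical, and the boundary symbols are stored as raw strings because their exact meaning is not stated here.

```python
from typing import List, NamedTuple, Optional


class Token(NamedTuple):
    word: str
    is_break: bool
    pos: str
    punct: Optional[str]     # trailing punctuation mark, if any
    boundary: Optional[str]  # raw boundary symbol such as "|" or "||", if any


def parse_line(line: str) -> Token:
    """Parse one 'word label POS [punct [boundary]]' line."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 fields, got: {line!r}")
    word, label, pos = fields[:3]
    if label not in ("break", "non-break"):
        raise ValueError(f"unknown label {label!r} in: {line!r}")
    punct = fields[3] if len(fields) > 3 else None
    boundary = fields[4] if len(fields) > 4 else None
    return Token(word, label == "break", pos, punct, boundary)


def parse_corpus(text: str) -> List[Token]:
    """Parse every non-empty line of a corpus string."""
    return [parse_line(line) for line in text.splitlines() if line.strip()]


# Five lines copied from the sample sentence above.
SAMPLE = """\
حَتَّى non-break PREPOSITION
يَوْمِ non-break ADVERB
11 non-break NOMINAL
الْجَارِي break NOUN ، |
وَيَسْبِقُهَا non-break VERB
"""

tokens = parse_corpus(SAMPLE)
n_breaks = sum(t.is_break for t in tokens)
print(f"{len(tokens)} words, {n_breaks} break(s)")
```

Keeping the break label as a boolean per word matches the binary classification task described earlier, while the optional fields preserve the punctuation so that punctuation-only baselines can be compared against the annotated breaks.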
Claire Brierley [C.Brierley at leeds.ac.uk]
Senior Research Fellow
Computing
University of Leeds, UK