<div dir="ltr"><div dir="ltr"><div><br></div><div>Call for papers<br></div><div><br></div><div>Grammar Data Mining (GDM): Extracting Linguistic Features From Grammatical Descriptions</div><div><br></div><div>September 5-6, 2019 - Varna, Bulgaria</div><div><br></div><div>Submission deadline: 30 June 2019</div><div><br></div><div>Link: <a href="https://spraakbanken.gu.se/lsi/sharedtask/">https://spraakbanken.gu.se/lsi/sharedtask/</a></div><div><br></div><div>Description</div><div>-----------</div><div><br></div><div>The present Workshop/Shared Task seeks to transform a large set of</div><div>digitized publications describing the grammars of the languages of the</div><div>world into structured databases that will enable comparison of</div><div>different languages at an unprecedented breadth and depth.</div><div><br></div><div>There are some 6 500 languages in the world and information about</div><div>their grammatical characteristics is available in book form for over</div><div>4 000 of them. Until recently, extraction of information from grammars</div><div>has been done exclusively through manual collection. This procedure is</div><div>naturally bounded by the limits of human capacities, and as such can</div><div>only target a relatively small number of languages/characteristics, and</div><div>only at a substantial investment of time.</div><div><br></div><div>We are now entering a phase where it is practical to use NLP tools for</div><div>a number of similar tasks. A computer may minimally infer some</div><div>characteristics of the language described simply by counting words</div><div>used in a grammatical description, e.g., a high frequency of the term</div><div>‘suffix’ likely indicates that the language being described uses a lot</div><div>of suffixes. 
Further, there are less straightforward or more detailed</div><div>characteristics traditionally of interest to linguists, such as where</div><div>the verb is placed in the sentence (beginning, middle, end), the</div><div>existence and use of participles, possessive constructions,</div><div>evidentiality and so on. Any techniques from the NLP toolbox such as</div><div>tf-idf weighting, tagging, parsing and vector spaces may be used in</div><div>combination and as input in more sophisticated Machine Learning</div><div>approaches.</div><div><br></div><div>In this shared task we provide a subset of the World Atlas of Language</div><div>Structures (WALS, <a href="http://wals.info">http://wals.info</a>) along with the digitized sources</div><div>from which the features were drawn. Sources are provided in raw text</div><div>form. The task is to infer WALS datapoints from the raw text data of</div><div>the digitized grammatical descriptions.</div><div><br></div><div>Training Data</div><div>-------------</div><div>10 000 datapoints spanning 191 languages and 100 features along with their</div><div>value and source(s) are given as training in the following form:</div><div><br></div><div><br></div><div>Language ISO 639-3 Feature<span style="white-space:pre"> </span>                 Value<span style="white-space:pre">      </span>             Source</div><div>----------------------------------------------------------------------------------------</div><div>Macushi  mbc<span style="white-space:pre">        </span>   31A Sex-based and             Non-sex-based       Abbott-1991[105-106]</div><div>                   Non-sex-based Gender Systems</div><div>Macushi  mbc       57A Position of Pronominal    Possessive prefixes Abbott-1991[85,101];</div><div>                   Possessive Affixes                                Williams-1932[61];</div><div><span style="white-space:pre">         </span>                                                     
Carson-1982[104-106]</div><div>E. Oromo hae       118A Predicative Adjectives   Mixed               Owens-1985</div><div>E. Oromo hae       9A The Velar Nasal            No velar nasal      Owens-1985[10]</div><div>...      ...       ...                           ...                 ...</div><div><br></div><div><br></div><div>Features and values are defined as per WALS</div><div>(<a href="http://wals.info">http://wals.info</a>). Sources are semicolon-separated and optionally</div><div>indicate a page range in square brackets. Each source maps uniquely to</div><div>an entry with bibliographical details in a BibTeX file and to a</div><div>full text of the source in question. The full text is an OCR of a scan</div><div>of the original source (varying quality) and contains no</div><div>formatting. OCR errors are present, especially for IPA or</div><div>non-ASCII-script text in a vernacular.  A total of 443 source</div><div>texts are supplied.</div><div><br></div><div>The training data can be downloaded at</div><div><a href="http://stp.lingfil.uu.se/~harald/grammar-data-mining.zip">http://stp.lingfil.uu.se/~harald/grammar-data-mining.zip</a></div><div><br></div><div>Task</div><div>----</div><div><br></div><div>The task is to provide the Value for an unseen Language-Feature-Source</div><div>triple.</div><div><br></div><div>No language-specific data source external to the training data</div><div>(such as the classification of a language, other sources for a language</div><div>etc.) may be used. However, other open generic linguistic data sources</div><div>may be utilized (such as the raw text of the corresponding WALS</div><div>chapter, a list of linguistic terms etc.).</div><div><br></div><div>Not every possible value for every feature is attested in the training</div><div>data set but systems should nevertheless strive to be able to output</div><div>any of the possible values for a feature as defined in WALS. 
It is</div><div>not obligatory that the training set values are utilized at all.</div><div><br></div><div>Submission Instructions</div><div>-----------------------</div><div><br></div><div>Authors should submit a paper of up to 8 pages conforming to the RANLP</div><div>style guidelines (see <a href="http://lml.bas.bg/ranlp2019/submissions.php">http://lml.bas.bg/ranlp2019/submissions.php</a>)</div><div>describing their technical solution to the specific task. The</div><div>submission should contain a link to a runnable version (e.g. on</div><div><a href="http://github.com">github.com</a>) of the authors’ solution. This runnable should output a</div><div>Value (and nothing else) upon running the system: given a</div><div>language code, the feature of interest, and the source document(s), the</div><div>system should output the feature value as exemplified below:</div><div><br></div><div>>>>python grammar-data-mining.py "hae" "118A Predicative Adjectives" "Owens-1985; Heine 1981"</div><div>Mixed </div><div><br></div><div>Submission is electronic, using the Softconf submission system for the Grammar Data Mining Workshop at <a href="https://www.softconf.com/ranlp2019/GDM/">https://www.softconf.com/ranlp2019/GDM/</a></div><div><br></div><div>Papers must be written in English.</div><div><br></div><div>Submitted papers will be peer-reviewed by three experts from a related field.</div><div><br></div><div>At least one author of each accepted paper is required to register for</div><div>the RANLP 2019 conference, attend the workshop, and present the paper.</div><div><br></div><div><br></div><div>Important Dates</div><div>---------------</div><div><br></div><div>Workshop paper submission deadline:     30 June 2019</div><div>Workshop paper acceptance notification: 28 July 2019</div><div>Workshop paper camera-ready version:    20 August 2019</div><div>Workshop:                               5-6 September 
2019</div><div><br></div><div><br></div><div>Evaluation</div><div>----------</div><div><br></div><div>Each submission will be evaluated against a test set of 1000 random</div><div>datapoints drawn from the same origin as the training data set. The</div><div>test set will not be made available until after submission. Aspects</div><div>other than accuracy (such as running time) will not be evaluated.</div><div><br></div><div><br></div><div>Programme Committee</div><div>-------------------</div><div><br></div><div>Guillaume Segerer (CNRS, LLACAN, France)</div><div>Harald Hammarström (Department of Linguistics and Philology, Uppsala University, Sweden)</div><div>Markus Forsberg (Språkbanken, University of Gothenburg, Sweden)</div><div>Søren Wichmann (Leiden University Centre for Linguistics, Netherlands)</div><div>Shafqat Mumtaz Virk (Språkbanken, University of Gothenburg, Sweden)</div><div>Zeljko Agic (IT University of Copenhagen, Denmark)</div><div>Erich Round (University of Queensland, Australia)</div><div>Sebastian Nordhoff (LangSci Press, Germany) </div><div><br></div><div>Venue</div><div>-----</div><div><br></div><div>The workshop will be co-located with RANLP <a href="http://lml.bas.bg/ranlp2019">http://lml.bas.bg/ranlp2019</a> in Bulgaria</div><div>and will take place in Hotel "Cherno More", Varna, the main RANLP-2019 conference venue.</div><div><br></div></div></div>