[Corpora-List] ATALA Workshop, Role of typography and punctuation in natural language processing

Ghassan Mourad Ghassan.Mourad at paris4.sorbonne.fr
Wed Jul 16 13:05:27 UTC 2003


         CALL FOR WORKSHOP PAPERS

(Please accept my apologies if you receive multiple copies of this
message.)
-------------------------------------------------------------------------------------------------------

ATALA Workshop

**************************************
22 November 2003
ENST, 46, rue Barrault (49, rue Vergnault), 75013 Paris
****************************************************
Title :
Role of typography and punctuation in natural language processing
(text segmentation, prosody, syntactic analysis, information retrieval, 
coding in multilingual systems, ...)

Organisation : Ghassan Mourad & Jean-Pierre Descles
Laboratory : LaLICC  (UMR 8139 Paris-Sorbonne / CNRS)

Conference call

Objective:
Even though punctuation and typography are seldom regarded as subjects of 
instruction, we can hardly deny their role in reading and writing. This is 
also true for natural language processing, where punctuation plays an 
important role.
Typographical and punctuation signs are “natural tags” of information, and 
indicators on which most processing should rely. It is essential to 
inventory and study the issues that arise in the multilingual, 
multi-script, and multi-encoding phases of processing.

The ATALA workshop is particularly concerned with current research on 
punctuation, typography, coding and transcription issues in linguistics and 
language processing, and with existing work in this specific domain or 
directly related to it.

Issues:
Linguistic engineering and language processing are confronted with new 
issues.  It is now necessary to work not only on isolated sentences or 
utterances, but also on entire structured or unstructured texts; for 
example, texts from the Internet or from document bases stored by companies 
or administrations, encyclopaedias or even dictionary articles.
Moreover, texts are rarely tagged or digitised. However, text processing 
requires pre-processing before syntactic, semantic and pragmatic analysis 
can be conducted. In particular, each text has two structures: formal and 
discursive. The latter depends on the former. The formal structure 
expresses a certain intended meaning; it results from the coding in a 
typographical system and from “text-setting” or text layout.
The pre-processing of a text must exploit the formal structure (locating 
titles and subtitles; splitting the text into sentences, paragraphs, 
utterances, propositions, words; identifying quotations; identifying list 
items; taking spatial layout into account; locating images, diagrams, 
captions, boxes, ...) before executing other tasks or exploiting the 
discursive structure (identifying temporal, spatial, topic and event 
frames; relations between concepts, terms, events; anaphoric links; 
enunciative phenomena, ...).

Without full control over the exploitation of formal structure, text 
processing will not really be operational. Obviously, this issue did not 
arise when we worked only on isolated sentences. However, for semantic 
analysis, text must be segmented into linguistic units that are larger or 
smaller than the standard sentence, by taking into account semiotic marks 
that the computer can recognise clearly and formally. Punctuation and all 
typographic signs (indices) are still the most relevant elements, since 
they can provide sharp indications for formal text segmentation and 
structuring; these indications are the foundation of automatic textual 
linguistics.

We can distinguish three types of approaches to segmentation:
(a)     Numerical approaches (neural networks, N-grams, Markov models, ...);
(b)     Finite automata and regular expression approaches (for instance 
INTEX);
(c)     Contextual exploration approaches based on punctuation marks (for 
instance SegATex).
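
As a rough illustration of approach (b), the Python sketch below splits a 
text into sentences with a regular expression keyed on punctuation marks. 
It is not how INTEX or SegATex work; the function name segment_sentences 
and the small abbreviation list are assumptions made only for this example.

```python
import re

# Hypothetical abbreviation list; a real system would need a far larger one.
ABBREVIATIONS = {"e.g.", "i.e.", "etc.", "cf.", "Dr.", "Mr.", "Mrs."}

# Candidate boundary: whitespace that follows ., ! or ? and precedes
# an uppercase ASCII letter or an opening quotation mark.
BOUNDARY = re.compile(r'(?<=[.!?])\s+(?=[A-Z"«“])')


def segment_sentences(text):
    """Split text at punctuation-based boundaries, then undo splits
    that were triggered by a known abbreviation."""
    sentences = []
    for piece in BOUNDARY.split(text.strip()):
        if not piece:
            continue
        # If the previous segment ends with an abbreviation, the split
        # was spurious: glue the two pieces back together.
        if sentences and sentences[-1].split()[-1] in ABBREVIATIONS:
            sentences[-1] = sentences[-1] + " " + piece
        else:
            sentences.append(piece)
    return sentences


if __name__ == "__main__":
    sample = "Dr. Smith arrived. He spoke at length! Was it clear?"
    for sentence in segment_sentences(sample):
        print(sentence)
```

The abbreviation check shows why the full stop is ambiguous: the same mark 
closes a sentence and ends an abbreviation, so a purely pattern-based rule 
needs some contextual correction.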

Traditional theories of punctuation (treatises, handbooks) are generally 
normative and do not allow the expression of precise rules that could lead 
to automatic segmentation. Furthermore, these treatises did not consider 
the semantic analysis of highly polysemous marks such as the comma, 
semicolon, colon, dash, parentheses, ... However, these marks play a very 
important role in semantic structuring; analysing them makes it possible to 
improve the segmentation process and the discursive structuring of text.
Text processing tools offer enormous potential for typographic variation: 
for example, highlighting a term being quoted, illustrating a point, or 
disambiguating an expression, ... To quote Ch. Gouriou: “For every problem 
posed by the transcription of thought, typography must provide at least 
one solution; it offers several as soon as it is called upon to bring out 
nuances or subtleties.” However, the interpretation to be given to these 
variations is not uniform and depends on other contextual (typographic and 
punctuation) elements; for example, an italicized expression does not have 
the same value (meaning) depending on whether it is capitalized or enclosed 
in quotation marks. It is indeed a combination of typographic marks, 
variable from text to text, that gives an occurrence of typographic change 
its value. Text processing must resolve these linguistic and computational 
issues.

Theme:
Submissions may also address cross-domain topics related to:

-       Formal segmentation of text,
-       Text discursive segmentation based on punctuation and typography marks,
-       “Textual architecture”,
-       The role of punctuation (particularly the comma) in syntactic 
analysis,
-       Contribution of punctuation to the coding of prosody and 
contribution of typography to the coding of intonation,
-       Contribution of punctuation to the identification of proper 
names, compound words, abbreviations, initials, ...
-       Comparison of punctuation in various linguistic systems (Arabic, 
Chinese, ...),
-       Coding and transcription issues in various linguistic systems,
-       ...


Modalities :
Submission : a 2-4 page summary.
We ask authors to indicate whether their submission:
-       presents work in progress or is a position paper;
-       presents completed theoretical or applied work.
A 2-4 page summary must be sent before 30 September 2003 by e-mail in 
plain text, .rtf, .doc or .pdf format to:
Ghassan.Mourad at paris4.sorbonne.fr
and
Jean-Pierre.Descles at paris4.sorbonne.fr

Acceptance notifications will be sent by 20 October 2003.

****************************************************************************************


Ghassan Mourad
ISHA, Paris - Sorbonne
Laboratoire LaLICC (Langage, Logique, Informatique, Cognition et Communication)
(UMR 8139 Paris-Sorbonne / CNRS)
http://www.lalic.paris4.sorbonne.fr/
96, Bd Raspail
75006 Paris
France
tel : 01 44 39 35 90
fax : 01 44 39 35 91


