[Corpora-List] ATALA Workshop, Role of typography and punctuation in natural language processing
Ghassan Mourad
Ghassan.Mourad at paris4.sorbonne.fr
Wed Jul 16 13:05:27 UTC 2003
CALL FOR WORKSHOP PAPERS
(Please accept my apologies if you receive multiple copies of this
message.)
-------------------------------------------------------------------------------------------------------
ATALA Workshop
**************************************
22 November 2003
ENST, 46, rue Barrault (49, rue Vergnault), 75013 Paris
****************************************************
Title :
Role of typography and punctuation in natural language processing
(text segmentation, prosody, syntactic analysis, information retrieval,
coding in multilingual systems, ...)
Organisation : Ghassan Mourad & Jean-Pierre Descles
Laboratory : LaLICC (UMR 8139 Paris-Sorbonne / CNRS)
Conference call
Objective:
Even though punctuation and typography are rarely treated as subjects to be
taught, we can hardly deny their role in reading and writing. This is also
true for natural language processing, where punctuation plays an important
role. Typographical and punctuation signs are natural tags of information,
and indicators on which most of the processing should rely. It is essential
to inventory and study all the issues that arise in the multilingual,
multi-script, and multi-encoding phases of processing.
The ATALA workshop is particularly concerned with current research on
punctuation, typography, coding and transcription issues in linguistics and
language processing, and with existing work in this restricted domain or in
directly related areas.
Issues:
Linguistic engineering and language processing are confronted with new
issues. Indeed, it is now necessary to work not only on isolated sentences
or utterances, but on entire structured or unstructured texts too; for
example, texts from the Internet or from document-bases stored by companies
or administrations, encyclopaedias or even dictionary articles.
Moreover, texts are rarely tagged or digitised. Yet text processing
requires pre-processing before syntactic, semantic and pragmatic analysis
can be carried out. In particular, each text has two structures: formal and
discursive. The latter depends on the former. The formal structure
expresses an intended meaning; it results from the coding in a
typographical system and from typesetting or text layout.
The pre-processing of a text must exploit the formal structure (locating
titles and sub-titles; segmenting the text into sentences, paragraphs,
utterances, propositions and words; identifying quotations; identifying
item lists; taking spatial layout into account; locating images, diagrams,
captions and boxes, ...) before executing other tasks or exploiting the
discursive structure (identifying temporal, spatial, topic and event
frames; relations between concepts, terms and events; anaphoric links;
enunciative phenomena, ...).
Without full control over the exploitation of formal structure, text
processing will not really be operational. Obviously, this issue did not
arise when we worked only on isolated sentences. For semantic analysis,
however, text must be segmented into linguistic units that are larger or
smaller than the standard sentence, by taking into account semiotic marks
that the computer can recognise clearly and formally. Punctuation and all
typographic signs (indices) remain the most relevant elements, since they
provide sharp indications for formal text segmentation and structuring;
these indications are the foundation of automatic textual linguistics.
We can distinguish between three types of approaches for segmentation:
(a) Numerical approaches (neural networks, n-grams, Markov models, ...);
(b) Finite automata and regular expressions approaches (for instance
INTEX);
(c) Contextual exploration approaches based on punctuation marks (for
instance SegATex).
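
As a purely illustrative sketch of how approaches (b) and (c) can be
combined, the Python fragment below uses a regular expression to find
candidate boundaries at punctuation marks and then inspects the left and
right context before accepting each one. It is not the method of INTEX or
SegATex; the function name, the abbreviation list and the two context rules
are assumptions made only for this example.

```python
import re

# Hypothetical abbreviation list, chosen only for this illustration.
ABBREVIATIONS = {"e.g.", "i.e.", "etc.", "cf.", "Dr.", "Mr.", "Mrs."}

# Candidate boundary: '.', '!' or '?', optionally followed by a closing
# quote or bracket, then whitespace.
CANDIDATE = re.compile(r'[.!?]["\')\]»]?\s+')


def segment(text):
    """Split text into sentences at punctuation marks, checking context."""
    sentences = []
    start = 0
    for match in CANDIDATE.finditer(text):
        end = match.end()
        # Left context: the token that ends with the punctuation mark.
        left_tokens = text[start:match.start() + 1].split()
        left_token = left_tokens[-1] if left_tokens else ""
        # Right context: the first character after the whitespace.
        right_char = text[end:end + 1]
        # Rule 1 (assumed): a known abbreviation does not end a sentence.
        if left_token in ABBREVIATIONS:
            continue
        # Rule 2 (assumed): a lowercase continuation does not start one.
        if right_char and right_char.islower():
            continue
        sentences.append(text[start:end].strip())
        start = end
    # Whatever remains after the last accepted boundary is the final unit.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    sample = ("Dr. Smith studies punctuation, e.g. the comma. "
              "It matters! Does it? Yes.")
    for sentence in segment(sample):
        print(sentence)
```

The two rules are deliberately naive: they only show that the same mark
(here the full stop) receives a different value depending on its
neighbours, which is the point made about polysemous marks below.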
Traditional accounts of punctuation (treatises, handbooks) are generally
normative and do not allow the expression of precise rules that could lead
to automatic segmentation. Furthermore, these treatises did not consider
the semantic analysis of highly polysemous marks such as the comma,
semicolon, colon, dash, parentheses, ... Yet these marks play a very
important role in semantic structuring; analysing them makes it possible to
improve the segmentation process and the discursive structuring of text.
Text processing tools offer enormous potential for typographic variation;
for example, highlighting a term being quoted, giving an example, or
disambiguating an expression, ... To quote Ch. Gouriou: « A tout problème
que pose la transcription de la pensée, la typographie se doit d'apporter
au moins une solution ; elle en offre plusieurs dès que l'on la sollicite
de faire valoir des nuances ou des subtilités » ("To every problem posed by
the transcription of thought, typography must bring at least one solution;
it offers several as soon as it is asked to bring out nuances or
subtleties"). However, the interpretation to be given to these variations
is not regular and depends on other contextual (typographic and
punctuation) elements; for example, an italicized expression does not have
the same value (meaning) depending on whether it is capitalized or placed
between quotation marks. It is in fact the combination of typographic
marks, which varies from text to text, that determines the value of a given
typographic change. Text processing must resolve these linguistic and
computational issues.
Theme:
Submissions may also address cross-domain topics related to:
- Formal segmentation of text,
- Text discursive segmentation based on punctuation and typography marks,
- Textual architecture,
- The role of punctuation, particularly the comma, in syntactic
analysis,
- Contribution of punctuation to the coding of prosody and
contribution of typography to the coding of intonation,
- Contribution of punctuation to the identification of proper
names, compound words, abbreviations, initials,
- Comparison between punctuation in various linguistic systems (Arabic,
Chinese, ...),
- Coding and transcription issues in various linguistic systems,
- ...
Modalities :
Submission : a 2-4 page summary.
We ask authors to indicate if their submission:
- presents work in progress or is a position paper;
- presents completed theoretical or applied work.
A 2-4 page summary must be sent before 30 September 2003 by e-mail as
plain text, .rtf, .doc or .pdf to:
Ghassan.Mourad at paris4.sorbonne.fr
and
Jean-Pierre.Descles at paris4.sorbonne.fr
Acceptance notifications will be sent by 20 October 2003.
****************************************************************************************
Ghassan Mourad
ISHA, Paris - Sorbonne
Laboratoire LaLICC (Langage, Logique, Informatique, Cognition et Communication)
(UMR 8139 Paris-Sorbonne / CNRS)
http://www.lalic.paris4.sorbonne.fr/
96, Bd Raspail
75006 Paris
France
tel : 01 44 39 35 90
fax : 01 44 39 35 91