Corpora: PhD thesis

nicolas turenne turenne at liia.u-strasbg.fr
Thu Dec 21 09:55:55 UTC 2000


Hello,

My PhD thesis, defended at Strasbourg (France) on
24 November 2000 under the title:
Statistical Learning from Texts for Concept
Extraction from a Domain. Application to textual
Information Filtering.

is reachable at the web URL:
http://bach.u-strasbg.fr/LIIA/theses/these_eng.htm

under the field "knowledge acquisition".
(Please accept our apologies if you receive this
message more than once.)

ABSTRACT
The goal of this dissertation is to build an
automatic and approximate representation of the
meaning of a document. We try to adapt techniques
of automatic indexing to a non-indexed document
base. Classical techniques are based on vector
models. Each document is represented by certain
features, and a distance is defined between
documents. Access to relevant documents is based
on estimating the similarity between features. The
domain described by the documents is structured
into semantic fields by term clustering. These
techniques can be improved by making them able to
process non-indexed documents. Adapting linguistic
knowledge and analysing the relations revealed by
term co-occurrences should improve the results.
The growing number of electronic documents leads
to the storage of large, significant samples of
reusable data. Techniques to describe relations
between terms stem from mathematical methods
usually applied to structured and non-textual
data. Coupling specific knowledge about the data
with a methodology adapted to textual data should
improve classification results. We
try to justify three choices: first, taking
linguistic phenomena into account so as to reduce
the biases of descriptive statistics on term
occurrences; second, using a method based on graph
pattern extraction, which is intended to retrieve
conceptual relations between terms; third, making
results from automatic processing easier to
interpret through a consensus labelling of the
theme represented by a class. Interpreting classes
remains difficult because of the multiple points
of view or links that a user can imagine between
terms. More accurate classes should facilitate
interpretation, driven by a 3-level thesaurus,
which may be assigned to a conceptual structuring
of the terms of a domain.
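
As a rough illustration of term clustering from
co-occurrences, here is a minimal Python sketch. It
is not the method of the thesis: it simply counts
how many documents each pair of terms shares and
takes the connected components of the thresholded
co-occurrence graph as term classes, a much cruder
stand-in for graph pattern extraction. The function
names, the min_count threshold and the toy documents
are all assumptions made for this example.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_counts(documents):
    """Count in how many documents each unordered term pair co-occurs."""
    counts = defaultdict(int)
    for text in documents:
        # Naive whitespace tokenisation; no linguistic processing here.
        terms = sorted(set(text.lower().split()))
        for pair in combinations(terms, 2):
            counts[pair] += 1
    return counts


def term_classes(documents, min_count=2):
    """Group terms that share at least `min_count` documents.

    Classes are the connected components of the thresholded
    co-occurrence graph (an assumption of this sketch).
    """
    graph = defaultdict(set)
    for (a, b), n in cooccurrence_counts(documents).items():
        if n >= min_count:
            graph[a].add(b)
            graph[b].add(a)

    seen, classes = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        # Depth-first traversal collects one connected component.
        component, stack = set(), [start]
        while stack:
            term = stack.pop()
            if term not in component:
                component.add(term)
                stack.extend(graph[term] - component)
        seen |= component
        classes.append(component)
    return classes


# Hypothetical toy corpus, for illustration only.
docs = [
    "corpus term extraction",
    "term extraction clustering",
    "user profile filtering",
    "user profile learning",
]
print(term_classes(docs))
```

On this toy corpus the sketch yields two classes,
one grouping "extraction" with "term" and one
grouping "profile" with "user". Real corpora would
need the linguistic filtering discussed above before
such counts become meaningful.
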
Widespread use of the Internet increases the
exchange of electronic documents between users of
different websites. The development of software
systems for what is called "workflow" in intranets
improves the flow of documents between people and
departments. A system that can automatically learn
user profiles and exploit this knowledge to
disseminate information is indispensable. We try
to match a user's interests with classes of terms.
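
To make the idea of matching a user interest with
classes of terms concrete, here is a small Python
sketch. It assumes that a user profile is just a set
of terms and that the match is scored by Jaccard
overlap; both are assumptions chosen for
illustration, not the approach of the thesis. The
names jaccard, best_class and filter_documents are
hypothetical.

```python
def jaccard(a, b):
    """Jaccard overlap between two sets of terms (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_class(profile_terms, classes):
    """Return (index, score) of the term class closest to the profile."""
    if not classes:
        raise ValueError("at least one term class is required")
    profile = set(profile_terms)
    scores = [jaccard(profile, c) for c in classes]
    index = max(range(len(scores)), key=scores.__getitem__)
    return index, scores[index]


def filter_documents(documents, matched_class):
    """Keep documents that contain at least one term of the matched class."""
    return [d for d in documents if set(d.lower().split()) & matched_class]


# Hypothetical term classes and user profile, for illustration only.
classes = [{"extraction", "term"}, {"profile", "user"}]
profile = {"user", "filtering"}

index, score = best_class(profile, classes)
incoming = ["user profile learning", "corpus term extraction"]
print(index, score, filter_documents(incoming, classes[index]))
```

Here the profile shares one term with the second
class, so that class is selected and only the first
incoming document is passed on to the user.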

FIELD: Computer Science, Artificial Intelligence.

KEYWORDS: Terminology, Artificial Intelligence,
Corpus Processing, Lexicometry, Morphosyntactic
Schemes, Graph Patterns, Semi-Automatic Extraction
of Concepts, Term Clustering, Document Filtering,
Automatic Learning, User Profile, Statistical Data
Analysis, Information Retrieval.


