[Corpora-List] [software] Wiki2Tei converter 1.0
Sylvain Loiseau
sylvain.loiseau at u-paris10.fr
Wed Oct 10 18:34:33 UTC 2007
We are pleased to announce the first release of the Wiki2Tei software.
Wiki2Tei is a converter from the MediaWiki format to XML (TEI vocabulary).
The MediaWiki format is used by the Wikimedia Foundation wikis (Wikipedia,
Wikibooks, Wikisource) and by many other wikis running the MediaWiki software.
Large amounts of free, high-quality structured text are available in this
format. These texts are used more and more often in NLP (natural language
processing) projects. However, the MediaWiki parser is oriented towards
rendering, and the MediaWiki syntax is complex and hard to parse.
The Wiki2Tei converter makes the information carried by the wiki syntax
(structure, highlighting, etc.) available, and makes it possible to retrieve
the plain text properly. The conversion is intended to preserve all the
properties of the original text. Wiki2Tei is closely coupled with the
MediaWiki software, which allows it to convert all the features of the
MediaWiki syntax.
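To illustrate why a TEI encoding makes plain-text retrieval straightforward,
here is a minimal Python sketch using only the standard library. The sample
document is a hand-written, hypothetical TEI fragment, not actual Wiki2Tei
output, and the function names are ours; the only assumption about the
converted files is that they use the standard TEI namespace.

```python
import xml.etree.ElementTree as ET

# Standard TEI P5 namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical TEI fragment, written by hand for illustration only;
# it is NOT actual Wiki2Tei output.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <div>
        <head>History</head>
        <p>The <hi rend="bold">TEI</hi> vocabulary is <hi rend="italic">widely</hi> used.</p>
        <p>Second paragraph.</p>
      </div>
    </body>
  </text>
</TEI>"""


def plain_paragraphs(tei_xml):
    """Return the plain text of each <p>, with inline markup flattened."""
    root = ET.fromstring(tei_xml)
    paragraphs = []
    for p in root.iterfind(".//tei:p", TEI_NS):
        # itertext() yields text and tail nodes in document order,
        # so highlighted spans are kept in place without their tags.
        text = "".join(p.itertext())
        # Collapse runs of whitespace left over from the XML layout.
        paragraphs.append(" ".join(text.split()))
    return paragraphs


def section_heads(tei_xml):
    """Return the text of each <head>, i.e. the section titles."""
    root = ET.fromstring(tei_xml)
    return ["".join(h.itertext()).strip()
            for h in root.iterfind(".//tei:head", TEI_NS)]


if __name__ == "__main__":
    for line in plain_paragraphs(SAMPLE):
        print(line)
```

The same traversal can keep or discard the structural information (section
titles, highlighting) as a given NLP project requires.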
The Wiki2Tei converter provides a rich set of tools for converting MediaWiki
text from several sources (file, MediaWiki database) and for managing
collections of files to be converted. The TEI vocabulary used is documented,
in accordance with the TEI Guidelines, in an ODD document. The code is open
source and may be downloaded from the SourceForge download area:
http://sourceforge.net/projects/wiki2tei/
http://sourceforge.net/project/showfiles.php?group_id=198407
The web site contains full documentation and a "demo":
http://wiki2tei.sourceforge.net/
http://wiki2tei.sourceforge.net/demo/
A mailing list is available:
https://lists.sourceforge.net/lists/listinfo/wiki2tei-users
Best,
Bernard Desgraupes,
Sylvain Loiseau