[Corpora-List] Translator_HTML_to_XML
Scott James Cederberg
cederber at csli.Stanford.EDU
Fri May 2 23:48:43 UTC 2003
Hey there,
HTML is an SGML application; it permits some constructs (namely
opening tags without matching closing tags and attribute values
without surrounding quotation marks) that are not allowed in
well-formed XML. XHTML is a reformulation of HTML that has been
designed to be conforming XML.
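
To make the difference concrete, here is a minimal sketch in Python
(chosen only for illustration) that feeds an HTML-style fragment and
its XHTML-style equivalent to a standard XML parser. The fragments and
the helper name is_well_formed are made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragments: the first uses an unquoted attribute value and
# an unclosed <br> tag, both legal in HTML but not in XML.
HTML_FRAGMENT = '<p align=center>First line<br>second line</p>'
XHTML_FRAGMENT = '<p align="center">First line<br />second line</p>'


def is_well_formed(markup: str) -> bool:
    """Return True if the markup parses as well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    for label, fragment in (("HTML", HTML_FRAGMENT), ("XHTML", XHTML_FRAGMENT)):
        print(label, "well-formed XML?", is_well_formed(fragment))
```

This only checks well-formedness; it says nothing about validity
against a DTD.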
The osx program, part of the OpenSP package (a successor to James
Clark's sp package), can automatically convert SGML files to
corresponding XML files; you could give that a try. OpenSP is
maintained along with OpenJade (http://openjade.sourceforge.net/).
I believe osx is written in C++, though, not Java.
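
If you want to drive osx from a script, something like the following
sketch might do. It assumes the executable is installed under the name
"osx", that it writes the XML to standard output, and that the input
file has a DOCTYPE the parser can resolve (which may require an SGML
catalog). The function names and the extra_options parameter are
hypothetical; check the osx documentation for the options it accepts.

```python
import shutil
import subprocess
from pathlib import Path


def build_osx_command(sgml_path, executable="osx", extra_options=()):
    """Assemble the command line: executable, any options, then the input file."""
    return [executable, *extra_options, str(sgml_path)]


def convert_with_osx(sgml_path, xml_path, executable="osx", extra_options=()):
    """Run osx on sgml_path and save whatever it prints to xml_path.

    Returns (exit status, diagnostics written to standard error).
    """
    if shutil.which(executable) is None:
        raise FileNotFoundError(f"{executable!r} was not found on PATH")
    command = build_osx_command(sgml_path, executable, extra_options)
    # Assumption: the converted document arrives on standard output.
    result = subprocess.run(command, capture_output=True, check=False)
    Path(xml_path).write_bytes(result.stdout)
    return result.returncode, result.stderr.decode(errors="replace")
```

A call would look like convert_with_osx("page.html", "page.xml"); a
nonzero exit status or any diagnostics are worth reading before you
trust the output.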
You should be able to validate the resulting XML documents against
the XHTML DTDs, although I would imagine you'll need to make minor
changes for full validity.
Hope that helps.
Scott
On Sat, May 03, 2003 at 12:48:59AM +0200, wassim souayah wrote:
> Dear all,
>
> I'm working on an Internet Query System,
> Can somebody point me to : any system for translating
> HTML to XML (In Java)?
>
> Thanks a lot,
> wassim
>
>
>
> ************************************************
> Wassim Souayah
> Etudiant DEA
> Laboratoire de LARIS
> Sfax-TUNISIE
>
> Email : wsouayah at yahoo.fr
--
Scott Cederberg
Researcher
Infomap Project
Computational Semantics Lab
Center for the Study of Language and Information (CSLI)
Stanford University
http://infomap.stanford.edu/