[Corpora-List] Translator_HTML_to_XML

Scott James Cederberg cederber at csli.Stanford.EDU
Fri May 2 23:48:43 UTC 2003


Hey there,

    HTML is an SGML document type; it includes some features (namely
    opening tags appearing without closing tags and attribute values
    appearing without surrounding quotation marks) that do not
    constitute well-formed XML.  XHTML is precisely a version of HTML
    that has been designed to be conforming XML.
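
    To make the difference concrete, here is a minimal sketch in Java
    (since that is what you asked about) using only the standard JAXP
    parser.  The class name, method name, and sample strings are made
    up for illustration.  It only checks well-formedness, not validity
    against any DTD.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: reports whether a string parses as well-formed XML.
class WellFormedCheck {

    static boolean isWellFormedXml(String markup) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // The parser rejected the input: not well-formed XML.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // HTML style: unquoted attribute value, <br> and <p> never closed.
        String html = "<p class=intro>Hello<br>world";
        // XHTML style: quoted attribute value, every element closed.
        String xhtml = "<p class=\"intro\">Hello<br/>world</p>";

        System.out.println("HTML fragment well-formed?  " + isWellFormedXml(html));
        System.out.println("XHTML fragment well-formed? " + isWellFormedXml(xhtml));
    }
}
```

    The first fragment is fine as HTML but an XML parser rejects it;
    the second is accepted.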

    The osx program, part of the OpenSP package (a successor to James
    Clark's sp package), can automatically convert SGML files to
    corresponding XML files; you could give that a try.  OpenSP is
    maintained along with OpenJade (http://openjade.sourceforge.net/).
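
    If you want to drive osx from Java, something along these lines
    might work.  This is only a sketch under assumptions: that the
    converter is installed and on your PATH under the name "osx", and
    that your SGML catalog lets it find the HTML DTD named in each
    page's DOCTYPE.  The class and method names are hypothetical.

```java
import java.io.File;
import java.io.IOException;
import java.util.List;

// Hypothetical wrapper that runs the external osx converter on one file.
class OsxRunner {

    // Assumption: the OpenSP converter is reachable on PATH as "osx".
    static List<String> buildCommand(String inputPath) {
        return List.of("osx", inputPath);
    }

    // Runs osx on inputPath and saves its standard output to outputPath.
    // Returns the exit code of the osx process.
    static int convert(String inputPath, String outputPath)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(buildCommand(inputPath));
        // osx writes the generated XML to standard output.
        pb.redirectOutput(new File(outputPath));
        // Let parser warnings and errors show up on our own stderr.
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);
        return pb.start().waitFor();
    }
}
```

    You would call it as OsxRunner.convert("page.html", "page.xml")
    and check the return value and the messages on stderr.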

    I believe osx is written in C++ rather than Java, so from Java you
    would have to run it as an external program.

    You should be able to validate the resulting XML documents against
    the XHTML DTDs, although I would imagine you'll need to make minor
    changes for full validity.

    Hope that helps.

							Scott

On Sat, May 03, 2003 at 12:48:59AM +0200, wassim souayah wrote:
> Dear all,
>
> I'm working on an Internet query system.
> Can somebody point me to a system for translating
> HTML to XML (in Java)?
>
> Thanks a lot,
> wassim
>
>
>
> ************************************************
> Wassim Souayah
> Etudiant DEA
> Laboratoire de LARIS
> Sfax-TUNISIE
>
> Email : wsouayah at yahoo.fr

--
Scott Cederberg
Researcher

Infomap Project
Computational Semantics Lab
Center for the Study of Language and Information (CSLI)
Stanford University

http://infomap.stanford.edu/


