[Corpora-List] Translator_HTML_to_XML

Klaus Guenther klaus at capitalfocus.org
Fri May 2 23:52:10 UTC 2003


----- Original Message -----
From: "d'Armond Speers" <speersdl at msn.com>
To: <corpora at hd.uib.no>
Sent: Saturday, May 03, 2003 1:36 AM
Subject: Re: [Corpora-List] Translator_HTML_to_XML


>
> >Dear all,
> >
> >I'm working on an Internet Query System.
> >Can somebody point me to any system for translating
> >HTML to XML (in Java)?
>
> Hmm, HTML is a form of XML, isn't it?

HTML is an application of SGML, not of XML, so it is not a form of XML.
XHTML, however, is HTML reformulated as an XML application. One of the big
differences is that XML is a very strict language and doesn't tolerate
mistakes (e.g., unclosed tags, improperly nested tags). So a simple
transform isn't going to be enough. You need to parse the HTML and repair
the errors before you can declare it an XML document. Even a tag like <br>
will break an XML parser -- it needs to be written <br /> (or <br></br>).
And then in HTML you have all the unclosed <p> tags.
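
To see the strictness for yourself, here is a minimal sketch using only the
JAXP parser that ships with the JDK. The class and method names
(WellFormedCheck, isWellFormed) are made up for illustration. It only
detects the problem; it does not repair the HTML.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: reports whether a string parses as well-formed XML.
class WellFormedCheck {

    static boolean isWellFormed(String markup) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // The parser rejected the input (e.g. a mismatched or unclosed tag).
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // HTML-style <br> is never closed, so </p> does not match the open tag.
        System.out.println(isWellFormed("<p>line one<br>line two</p>"));
        // XHTML-style <br /> is an empty element, so the fragment is well-formed.
        System.out.println(isWellFormed("<p>line one<br />line two</p>"));
    }
}
```

The first call should report false and the second true, which is why the
clean-up step has to come before any XML tooling sees the page.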

I know there are converters. Macromedia Dreamweaver, for example, will
update your code to be XHTML compliant. So there must be add-ons. I can't
think of any for Java, but I'm sure they are out there.

hth

K.G.

> For converting between different XML specs (as defined by a DTD or XML
> Schema), you should take a look at XSLT (XSL Transformations).  This is an
> XML-based programming language.  There are quite a few XSLT processors out
> there that include Java libraries, such as Saxon and Xalan.  You write the
> XSLT, and apply the XSLT to the input XML to generate the output XML.
> Check out XML, XSL and XML Schemas at the W3C (www.w3.org).
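
For what it's worth, the XSLT route quoted above can be sketched with the
javax.xml.transform API that comes with the JDK, so no extra library is
needed to try it. The stylesheet, the element names (page, title, doc,
heading) and the class name XsltSketch are all invented for illustration.
Note that the input must already be well-formed XML.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical example: maps one made-up XML vocabulary onto another.
class XsltSketch {

    // Illustrative stylesheet: turns <page><title>..</title></page>
    // into <doc><heading>..</heading></doc>.
    static final String STYLESHEET =
          "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:template match='/page'>"
        + "<doc><heading><xsl:value-of select='title'/></heading></doc>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    static String transform(String inputXml) throws TransformerException {
        // Compile the stylesheet with whatever XSLT processor the JDK provides.
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        // Leave out the <?xml ...?> declaration to keep the output short.
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        StringWriter out = new StringWriter();
        transformer.transform(
            new StreamSource(new StringReader(inputXml)),
            new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        System.out.println(transform("<page><title>Corpora</title></page>"));
    }
}
```

Saxon or Xalan can be dropped in behind the same TransformerFactory
interface if you need a different processor.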
>
> >Thanks a lot,
> >wassim
>
> --
> d'Armond Speers, Ph.D.
> speersd at georgetown.edu


