[Corpora-List] Translator_HTML_to_XML

Burnard Towers lou.burnard at computing-services.oxford.ac.uk
Sat May 3 11:22:42 UTC 2003


Dave Raggett's tidy utility is THE best way of converting HTML to XML.
However, this is almost certainly not enough for your purposes, since
presumably you will want to use meaningful tagging in your XML if it is
to be part of a query system. In other words, you might find
<b>123</b> or <em>123</em> in your HTML or the XML version of it, where
your query system really wants to find <partNumber>123</partNumber>.

This is obviously easy to fix *if* the HTML or XML input is completely
regular, and you never find <b> or <em> used to mark things which are not
part numbers. But, of course, the world ain't like that, and you need fairly
sophisticated tools to add semantics to the purely presentational markup
that HTML will give you, even when it's converted to something that is valid
XML.
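
To make the problem concrete, here is a minimal sketch of the "easy" fix: a blanket rename of every <b> and <em> to <partNumber>, done with an XSLT stylesheet run from Java through the standard javax.xml.transform API. The class name NaiveRename, the method rename, and the sample input are made up for illustration, and the sketch assumes the elements are in no namespace (the XHTML that tidy produces normally carries the XHTML namespace, so the match patterns would need a prefix there).

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class NaiveRename {
    // Identity transform plus one rule: every <b> or <em> becomes <partNumber>.
    // Hypothetical stylesheet; element names are assumed to be in no namespace.
    static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:template match='@*|node()'>"
      + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
      + "</xsl:template>"
      + "<xsl:template match='b|em'>"
      + "<partNumber><xsl:apply-templates/></partNumber>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the stylesheet to a well-formed XML string and returns the result.
    static String rename(String xml) throws Exception {
        Transformer t = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(XSLT)));
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // The <em> here is plain emphasis, not a part number.
        String input = "<p>Part <b>123</b> is <em>not</em> in stock.</p>";
        System.out.println(rename(input));
    }
}
```

On this input the word "not" gets wrapped in <partNumber> as well, which is exactly the kind of false hit that irregular input produces.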

The good news is that the sophistication of such tools is increased by their
ability to act on XML structures. So for example, if the part numbers are
always in column two of a table, you can apply the renaming I described
above only to <b> or <em> elements appearing in the second column of a
table. There are lots of good XML-aware tools, many of them in Java,
which can do this kind of thing. And there is also XSLT, which is the
language I would recommend for such jobs.
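
As a rough sketch of the column-two case, the stylesheet below renames <b> and <em> only when they sit inside the second <td> of a row, and copies everything else unchanged. It is run from Java with the standard javax.xml.transform API. The class name ColumnTwoTagger, the method tag, and the sample table are invented for illustration; the sketch also assumes rows contain only <td> cells and that the elements are in no namespace (tidy's XHTML output would normally need a namespace prefix in the match patterns).

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class ColumnTwoTagger {
    // Identity transform plus a rule limited to the second <td> of each row.
    // Hypothetical stylesheet; assumes no namespace and no <th> cells in data rows.
    static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:template match='@*|node()'>"
      + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
      + "</xsl:template>"
      + "<xsl:template match='td[2]//b | td[2]//em'>"
      + "<partNumber><xsl:apply-templates/></partNumber>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the stylesheet to a well-formed XML string and returns the result.
    static String tag(String xml) throws Exception {
        Transformer t = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(XSLT)));
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Column one holds names, column two holds part numbers.
        String input =
            "<table>"
          + "<tr><td><b>Widget</b></td><td><b>123</b></td></tr>"
          + "<tr><td><em>Gadget</em></td><td><em>456</em></td></tr>"
          + "</table>";
        System.out.println(tag(input));
    }
}
```

The only change from a blanket rename is the match pattern: the structural context (td[2]) carries the semantics that the presentational tag alone does not.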


> -----Original Message-----
> From: owner-corpora at lists.uib.no [mailto:owner-corpora at lists.uib.no] On
> Behalf Of wassim souayah
> Sent: 02 May 2003 23:53
> To: corpora at hd.uib.no
> Subject: [Corpora-List] Translator_HTML_to_XML
>
>
> Dear all,
>
> I'm working on an Internet Query System.
> Can somebody point me to any system for translating
> HTML to XML (in Java)?
>
> Thanks a lot,
> wassim
>
>
>
> ************************************************
> Wassim Souayah
> Etudiant DEA
> Laboratoire de LARIS
> Sfax-TUNISIE
>
> Email : wsouayah at yahoo.fr
>
>


