[Corpora-List] Extracting only editorial content from a HTML page

Marco Baroni baroni at sslmit.unibo.it
Wed Aug 10 09:23:52 UTC 2005


If the original question (or at least Mike's new question) was how to
remove boilerplate (as opposed to cleaning HTML and representing HTML
document structure), we got reasonably good results with the "difference
between token count and tag count" method that has already been mentioned,
and that is implemented in Aidan Finn's BTE modules.

1) remove JavaScript and comments

2) transform all HTML tags into a special symbol, e.g. XXTAG

3) tokenize everything else (a rough tokenization method -- e.g., split on
whitespace -- is good enough for the purpose) and map each token to a
special symbol, e.g. TOKEN, so that you get a representation of the
document as:

...
XXTAG XXTAG TOKEN TOKEN TOKEN XXTAG XXTAG TOKEN TOKEN TOKEN
TOKEN TOKEN XXTAG TOKEN TOKEN XXTAG TOKEN TOKEN XXTAG
XXTAG XXTAG TOKEN TOKEN
...

4) for each possible beginning and ending point, compute
N(TOKEN) - N(XXTAG)
and choose the stretch where this quantity is highest.

5) print only the TOKENs from the chosen stretch.

Typically, the "low tag density" area found in this way does not contain 
boilerplate.
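
To make steps 1-5 concrete, here is a minimal sketch in Python (not the
Perl module mentioned below; the function name, the regex-based tag
splitting and the naive quadratic search over all start/end points are
just illustrative choices):

import re

def bte_extract(html):
    # a minimal sketch of the token/tag count difference method;
    # not the original BTE implementation
    # 1) remove javascript and comments
    html = re.sub(r'(?is)<script.*?</script>', ' ', html)
    html = re.sub(r'(?s)<!--.*?-->', ' ', html)
    # start after </head>, if present (see the notes below)
    m = re.search(r'(?i)</head>', html)
    if m:
        html = html[m.end():]
    # 2) + 3) map the document to a sequence of XXTAG/TOKEN symbols,
    # keeping the actual words so we can print them at the end
    is_token, words = [], []
    for piece in re.split(r'(<[^>]*>)', html):
        if piece.startswith('<'):
            is_token.append(False)       # XXTAG
        else:
            for w in piece.split():      # rough whitespace tokenization
                is_token.append(True)    # TOKEN
                words.append(w)
    # 4) naive O(n^2) search for the stretch maximizing
    #    N(TOKEN) - N(XXTAG)
    best = start = end = 0
    for i in range(len(is_token)):
        score = 0
        for j in range(i, len(is_token)):
            score += 1 if is_token[j] else -1
            if score > best:
                best, start, end = score, i, j + 1
    # 5) print only the TOKENs from the chosen stretch
    before = sum(is_token[:start])
    inside = sum(is_token[start:end])
    return ' '.join(words[before:before + inside])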

Notes on the implementation:

- it's probably safe to start looking for the best stretch from the first 
TOKEN after </head>

- you can make the implementation a lot more efficient by observing that
the best stretch will always begin at the beginning of a TOKEN sequence
and end at the end of a TOKEN sequence, and thus: 1) there is no need to
even consider stretches that begin/end with an XXTAG; 2) there is no need
to consider stretches that begin with a TOKEN preceded by another TOKEN or
end with a TOKEN followed by another TOKEN [indeed, the document can be
represented in a compact way as an array of counts where the even-numbered
positions -- treating 0 as even -- correspond to counts of sequences of
TOKEN (positive values) and the odd positions correspond to counts of
sequences of XXTAG (negative values)] {if you start counting from the
first TOKEN after </head>, the first slot will always be occupied by a
TOKEN}
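
As a sketch of this compact representation (again in Python, with
illustrative names; restricting the search to even positions follows from
the observation above):

def to_run_counts(is_token):
    # compress the symbol sequence into run counts: even positions
    # hold positive TOKEN-run lengths, odd positions hold negative
    # XXTAG-run lengths; slot 0 is a (possibly empty) TOKEN run
    counts = [0]
    token_slot = True
    for t in is_token:
        if t != token_slot:
            counts.append(0)
            token_slot = t
        counts[-1] += 1 if t else -1
    return counts

def best_stretch(counts):
    # consider only stretches that start and end on even (TOKEN-run)
    # positions; still quadratic, but in the number of runs, which is
    # much smaller than the number of symbols
    best = start = end = 0
    for i in range(0, len(counts), 2):
        score = 0
        for j in range(i, len(counts)):
            score += counts[j]
            if j % 2 == 0 and score > best:
                best, start, end = score, i, j
    return start, end   # inclusive indices into counts

From start and end it is easy to recover which TOKENs to print: sum the
positive counts before start to get the offset of the first word, and the
positive counts in counts[start:end+1] to get how many words to take.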

- I wrote a module in Perl that includes large chunks of Finn's BTE code
but uses the previous observations to speed up the computation, and, at
least in my experiments, it is much faster than Finn's original
implementation: if anybody is interested, I can send it to them (I plan to
make it public, eventually, but at the moment it is "scantily
documented", to say the least...)

- we have only experimented with German, Italian and English (and my
script returns latin1 output); languages that do not use whitespace to
separate tokens would require a more intelligent tokenization strategy,
and I have no idea whether the method would still work on agglutinative
languages, where the number of "tokens" is likely to be lower than for the
synthetic languages we worked on.

Notes on the results:

- the method is good if you are collecting a lot of documents from 
different sites, since it does not depend on site-specific 
characteristics; of course, if you are collecting documents from a few
sites, you are likely to get better results by studying their structure 
and extracting the interesting contents with appropriate regular 
expressions.

- If you are collecting a very large corpus, the method has an advantage
over looking for sentences/paragraphs that are repeated across documents
in that you only need to analyze one document at a time. This means that
the demand on RAM is more or less constant, no matter how large the corpus
is, and that if you have X machines available you can split the corpus
into X parts and process them separately in parallel (of course, methods
requiring global statistics could also be parallelized, but that would
require more work).

- The method tends to have better precision than recall: you get a section
of a page that is relatively clean, but it is possible that this is just a
fragment of the clean part of the page, which could be larger than what
you got.

- Because of the previous observation, the method works well if you are 
sampling from many pages, and it's fine to have "snapshots" of all the 
pages. If, instead, you are interested in the structure of single pages, 
the method is probably too destructive for your purposes.

- Depending on implementation details (in particular, the stretch you
choose when more than one stretch has the highest value for the quantity
above), you can get funny phenomena at the beginning and end of the
extracted stretch (e.g., an incomplete sentence at the beginning, and a
bit of boilerplate at the end).

Because of these factors, I would recommend the BTE approach (with a
faster re-implementation than the original one) to those who are
interested in sampling from many pages in a scalable way, are not
interested in the structure of single pages, and can live with a certain
amount of noise. In short, if you are building a large web corpus from a
large variety of sources, it could be the right way to go; if you are
building a smaller corpus from more controlled sources, you can probably
get better results with a more ad-hoc method.

Regards,

Marco


