[sw-l] Are We Going in the Wrong Direction?

Sandy Fleming sandy at FLEIMIN.DEMON.CO.UK
Fri Dec 10 14:14:54 UTC 2004

Dan wrote:

> And how does this influence our searches one way or another? Right now,
> I can do a search in SWML looking for signs which have, say, a head and
> a contact symbol, regardless of linear order. The hierarchy in the DOM
> gives me that for free.

But you have to distinguish the DOM from the XML (or SWML). Just because
something isn't stored as XML doesn't mean you can't have a DOM.

>                         The complexity comes when I look, say, for
> contact at the temple, which means the contact symbol is in a region at
> a certain range of distances (either city-block or Euclidean) from the
> head symbol. Using scalable fonts and linear representations do nothing
> to simplify that problem.

Neither does using SWML, does it? Don't forget that simple text matching
algorithms (such as exact matches) are based on the storage format of the
text. The more sophisticated an algorithm you want, the less it has to do
with the storage format, and it will be complex no matter how the text is
stored. You can't reject an idea on the basis that it's complex to do things
that are inherently complex.
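
As a rough illustration of that point, here is a minimal Python sketch of the
"contact near the head" search. It works on symbols already loaded into
memory, so it is the same whether they came from SWML or from something more
compact. The Symbol class, the "head" and "contact" labels and the distance
range are all hypothetical, made up for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class Symbol:
    kind: str  # hypothetical category label, e.g. "head" or "contact"
    x: int     # horizontal position within the sign
    y: int     # vertical position within the sign


def city_block(a, b):
    """City-block (Manhattan) distance between two symbols."""
    return abs(a.x - b.x) + abs(a.y - b.y)


def euclidean(a, b):
    """Straight-line distance between two symbols."""
    return math.hypot(a.x - b.x, a.y - b.y)


def has_contact_near_head(sign, lo, hi, metric=euclidean):
    """True if some contact symbol lies between lo and hi from some head."""
    heads = [s for s in sign if s.kind == "head"]
    contacts = [s for s in sign if s.kind == "contact"]
    return any(lo <= metric(h, c) <= hi for h in heads for c in contacts)


# A sign is just a list of symbols here; order does not matter.
sign = [Symbol("head", 100, 100), Symbol("contact", 130, 90)]
print(has_contact_near_head(sign, 20, 40))
```

Nothing in the search itself refers to how the sign was stored; the
complexity is in choosing the region and the metric.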

> Not trying to _make_ things difficult, but I find that oversimplifying
> a solution at the outset simply delays the headaches. :-)

Now Dan, there's a difference between simplifying and oversimplifying! I
never even used that second word! :)

Consider this. A count of signs and symbols in a SW text compared to the
equivalent oral-language text suggests that signs take more symbols to write
than words take characters, but we seem to need only about half as many signs
in a sign language as words in an oral language to express the same text. So
as far as information content goes, a symbol in SW might, very roughly, be
considered equivalent to a character in an oral-language text.

I used SW-Edit to save some symbols in a SWML file and it seems that one
symbol takes roughly 200 characters to store in SWML. So a novel of 500,000
characters or about 0.5 megabytes is liable to take about 100 megabytes to
store as SWML. Is that what we want? Perhaps it's undersimplifying at the
outset that will cause headaches later  :)

This is why I'm wary of moving from dictionary projects to word processors
without reconsidering SWML.

OK, so a symbol character plus two dimensions, e.g. (s134;663), could take about
eight characters to store, but it's a big improvement! I'd love to be able to
get over having to store the dimensions too but I don't see a good way.
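
The arithmetic behind these figures can be checked with a few lines of
Python. This is only a back-of-envelope sketch: it assumes one byte per
character and one SW symbol per oral-language character, as estimated above,
and the constant names are invented for the example.

```python
NOVEL_CHARS = 500_000          # characters in the oral-language novel
SWML_CHARS_PER_SYMBOL = 200    # rough figure observed from SW-Edit output
COMPACT_CHARS_PER_SYMBOL = 8   # symbol character plus two dimensions


def size_mb(symbols, chars_per_symbol):
    """Storage in megabytes, assuming one byte per stored character."""
    return symbols * chars_per_symbol / 1_000_000


swml_mb = size_mb(NOVEL_CHARS, SWML_CHARS_PER_SYMBOL)
compact_mb = size_mb(NOVEL_CHARS, COMPACT_CHARS_PER_SYMBOL)

print(f"SWML:    {swml_mb:.0f} MB")
print(f"compact: {compact_mb:.0f} MB")
print(f"ratio:   {swml_mb / compact_mb:.0f}x")
```

On these assumptions the compact form comes to about 4 megabytes against
about 100 for SWML, a factor of 25.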

I suppose a good proof of concept of this would be to show that a "simple"
(I'd rather say "compact") method can be converted to SWML and vice versa
without problems, showing that everything SWML stores, this stores just as
well. This would also show that we can create a DOM without XML. I'll think
about that!
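
A minimal sketch of such a round trip might look like the following. It is
not real SWML: the "sign" and "symbol" element names and their attributes are
hypothetical stand-ins, and the compact format is assumed to be
space-separated tokens of one symbol character followed by two numbers
without zero padding. The tree it builds uses Python's ElementTree, which is
DOM-like rather than the W3C DOM API.

```python
import re
import xml.etree.ElementTree as ET

# One symbol character, then "x;y" (assumed layout, e.g. "s134;663").
TOKEN = re.compile(r"(\S)(\d+);(\d+)")


def parse_compact(text):
    """Turn 's134;663 t120;640' into a list of (char, x, y) tuples."""
    symbols = []
    for token in text.split():
        m = TOKEN.fullmatch(token)
        if m is None:
            raise ValueError(f"bad token: {token!r}")
        symbols.append((m.group(1), int(m.group(2)), int(m.group(3))))
    return symbols


def to_compact(symbols):
    """Inverse of parse_compact."""
    return " ".join(f"{c}{x};{y}" for c, x, y in symbols)


def to_xml(symbols):
    """Build an in-memory tree using a hypothetical XML layout."""
    sign = ET.Element("sign")
    for c, x, y in symbols:
        ET.SubElement(sign, "symbol", {"char": c, "x": str(x), "y": str(y)})
    return sign


def from_xml(sign):
    """Read the same hypothetical layout back into tuples."""
    return [(e.get("char"), int(e.get("x")), int(e.get("y")))
            for e in sign.iter("symbol")]


compact = "s134;663 t120;640"
tree = to_xml(parse_compact(compact))          # a tree without stored XML
xml_text = ET.tostring(tree, encoding="unicode")
round_tripped = to_compact(from_xml(ET.fromstring(xml_text)))
print(round_tripped == compact)
```

A real test would have to cover everything SWML actually stores, not just a
symbol and two numbers, but the shape of the proof would be the same.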

Thanks for these comments - they've made me remember that it's important these
days to be able to have a DOM, whether derived from XML or not.

