Corpora: Relative text length
Tolkin, Steve
Steve.Tolkin at FMR.COM
Thu Apr 25 19:21:07 UTC 2002
The following was copied from
http://www.microsoft.com/sql/techinfo/productdoc/2000/books.asp
on 2002-04-25.
It is based on the documentation for Microsoft SQL Server.
The download sizes show that, after compression, the nine language
versions come to about the same number of bytes (35.8 to 41.72 MB).
However, English is slightly smaller than the others (35.8 MB),
and Japanese is somewhat larger (41.72 MB).
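To make the comparison concrete, here is a small sketch that expresses
each download size relative to English. The sizes are the ones quoted
from the Microsoft page in this message; the variable names are only
illustrative, and these are sizes of compressed .cab files, not of the
text itself.

```python
# Compressed download sizes in MB, as quoted from the Microsoft page.
sizes_mb = {
    "English": 35.8,
    "Chinese (Simplified)": 37.1,
    "Chinese (Traditional)": 37.88,
    "French": 38.05,
    "German": 38.5,
    "Italian": 37.17,
    "Japanese": 41.72,
    "Korean": 37.89,
    "Spanish": 37.62,
}

# Ratio of each language's size to the English size.
baseline = sizes_mb["English"]
ratios = {lang: round(size / baseline, 3) for lang, size in sizes_mb.items()}

# Print the languages from smallest to largest ratio.
for lang, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{lang:<22} {sizes_mb[lang]:>6.2f} MB  {ratio:.3f} x English")
```

On these figures every non-English version is within about 8% of
English except Japanese, which is roughly 17% larger.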
People who are truly interested in this topic should download all
these files, expand them, remove everything but the text, and then
report the results back.
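A rough sketch of the "remove everything but the text" step follows.
It assumes the downloaded files have already been expanded into HTML
pages (that extraction step is not shown), and the names TextExtractor
and text_stats are my own, not part of any Microsoft tool. It uses only
the Python standard library.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def text_stats(html, encoding="utf-8"):
    """Return character, byte, and word counts for the visible text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so layout does not inflate the counts.
    text = " ".join("".join(parser.parts).split())
    return {
        "characters": len(text),
        "bytes": len(text.encode(encoding)),
        "words": len(text.split()),
    }


print(text_stats("<html><p>Hello <b>world</b></p></html>"))
```

The byte count depends on the chosen encoding, so the same encoding
should be used for every language being compared. The whitespace-based
word count is not meaningful for Chinese or Japanese, which would need
a proper segmenter.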
I recall reading elsewhere that using text for lexicographic
purposes (counting words, characters, etc.) is allowed under any
interpretation of copyright.
<quote>
SQL Server 2000 Books Online (Updated)
Posted: February 20, 2002
Download
Language                 Size        Time @ 28.8 kbps
English                  35.8 MB     2 hr 54 min
Chinese (Simplified)     37.1 MB     3 hr 0 min
Chinese (Traditional)    37.88 MB    3 hr 4 min
French                   38.05 MB    3 hr 5 min
German                   38.5 MB     3 hr 7 min
Italian                  37.17 MB    3 hr 0 min
Japanese                 41.72 MB    3 hr 23 min
Korean                   37.89 MB    3 hr 4 min
Spanish                  37.62 MB    3 hr 3 min
Download the updated documentation for Microsoft SQL Server 2000. SQL
Server Books Online (Updated) includes the complete documentation that
shipped with SQL Server 2000 plus minor revisions.
SQL Server Books Online (Updated) is available for download as a
cabinet file (.cab). This file contains multiple files that have been
compressed into one extractable file. You can extract the compressed
files by using an expansion utility such as Expand.exe,
...
</quote>
Hopefully helpfully yours,
Steve
--
Steven Tolkin steve.tolkin at fmr.com 617-563-0516
Fidelity Investments 82 Devonshire St. V8D Boston MA 02109
There is nothing so practical as a good theory. Comments are by me,
not Fidelity Investments, its subsidiaries or affiliates.