Corpora: Relative text length

Tolkin, Steve Steve.Tolkin at FMR.COM
Thu Apr 25 19:21:07 UTC 2002


The following was copied from
http://www.microsoft.com/sql/techinfo/productdoc/2000/books.asp
on 2002-04-25.
It is based on the documentation for Microsoft SQL Server.
This data shows that, after compression, the languages
produce about the same number of bytes.
However, English is the smallest (35.8 MB); the others fall
between 37.1 and 38.5 MB, except Japanese, which is noticeably
larger (41.72 MB).
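
To make the comparison concrete, here is a small sketch that expresses each
download size as a ratio to the English one. The sizes are the ones listed
on the download page quoted below; the names sizes_mb and relative_to are
just illustrative, and the ratios describe whole compressed downloads, not
the text alone.

```python
# Download sizes in MB, as listed on the page quoted in this message.
sizes_mb = {
    "English": 35.8,
    "Chinese (Simplified)": 37.1,
    "Chinese (Traditional)": 37.88,
    "French": 38.05,
    "German": 38.5,
    "Italian": 37.17,
    "Japanese": 41.72,
    "Korean": 37.89,
    "Spanish": 37.62,
}


def relative_to(sizes, baseline="English"):
    """Return each size divided by the size of the baseline language."""
    base = sizes[baseline]
    return {lang: size / base for lang, size in sizes.items()}


ratios = relative_to(sizes_mb)

# Print the languages from smallest to largest download.
for lang, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{lang:<22} {sizes_mb[lang]:6.2f} MB  {ratio:.3f}")
```

On these figures the non-English downloads other than Japanese come out
roughly 4% to 8% larger than English, and Japanese roughly 16% to 17% larger.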

People who are truly interested in this topic should download all
these files, expand them, remove everything but the text, and then
report the results back.
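
For anyone who wants to try this, the "remove everything but the text" step
might look roughly like the sketch below. It assumes the expanded download
has already been turned into a directory of HTML files, which is my
assumption and not something the download page states. The names TextOnly,
visible_text and measure are hypothetical, and the encoding argument will
need to be set per language.

```python
from html.parser import HTMLParser
from pathlib import Path


class TextOnly(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def visible_text(html):
    """Return the text of an HTML string with whitespace runs collapsed.

    Text nodes are joined with a space, so a word split by inline markup
    (for example <b>w</b>ord) will be counted as two pieces.
    """
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def measure(root, encoding="utf-8"):
    """Total characters and whitespace-separated tokens under a directory.

    Assumes the files are HTML (*.htm, *.html) in the given encoding.
    """
    chars = tokens = 0
    for path in Path(root).rglob("*.htm*"):
        text = visible_text(path.read_text(encoding=encoding, errors="replace"))
        chars += len(text)
        tokens += len(text.split())
    return chars, tokens
```

The token count is only meaningful for languages that separate words with
spaces; for Chinese and Japanese the character count is the more useful
figure.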

I recall reading elsewhere that using text for lexicographic
purposes (counting words, characters, etc.) is allowed under any
interpretation of copyright.

<quote>
SQL Server 2000 Books Online (Updated)

Posted: February 20, 2002

 Language                Size        Download time @ 28.8 kbps

 English                 35.8 MB     2 hr 54 min
 Chinese (Simplified)    37.1 MB     3 hr 0 min
 Chinese (Traditional)   37.88 MB    3 hr 4 min
 French                  38.05 MB    3 hr 5 min
 German                  38.5 MB     3 hr 7 min
 Italian                 37.17 MB    3 hr 0 min
 Japanese                41.72 MB    3 hr 23 min
 Korean                  37.89 MB    3 hr 4 min
 Spanish                 37.62 MB    3 hr 3 min


Download the updated documentation for Microsoft SQL Server 2000. SQL
Server Books Online (Updated) includes the complete documentation that
shipped with SQL Server 2000 plus minor revisions.

SQL Server Books Online (Updated) is available for download as a
cabinet file (.cab). This file contains multiple files that have been
compressed into one extractable file. You can extract the compressed
files by using an expansion utility such as Expand.exe,
 ...
</quote>

Hopefully helpfully yours,
Steve
--
Steven Tolkin          steve.tolkin at fmr.com      617-563-0516
Fidelity Investments   82 Devonshire St. V8D     Boston MA 02109
There is nothing so practical as a good theory.  Comments are by me,
not Fidelity Investments, its subsidiaries or affiliates.


