[Corpora-List] what are the news about the origin of Basque? - link to corpus
Tiago Tresoldi
tresoldi at gmail.com
Tue Mar 30 15:37:28 UTC 2010
Hello,
> Dear Corpora colleagues, are there any corpora of Basque? what are the news about the origin of Basque? We found it is close to Turkic languages by the phono-typological features. Is it still connected with the Caucasian languages? Looking forward to hearing from you to
I extracted some corpora and language models from Wikipedia a while
ago, including one for Basque. They are hosted at SourceForge; you can
download them from
http://en.wikipedia.org/wiki/User:Tresoldi#Wikipedia_as_a_corpus
Please note that the corpus probably contains some English sentences
(mostly MediaWiki strings) and other noise. Here is an extract of the
clean version; check whether it is what you want/need:
===
tiago at samosata:~/progetti/wikicorpus$ zcat eu.clean.gz | head -n 5
<s> argizagiak aztertzen ditu astronomiak . </s>
<s> irudian , hale-bopp kometa zerua zeharkatzen , beste argizagi
askorekin batera . </s>
<s> astronomia ( grekerazko ἄστρον , astron ; argizagi , zeruko
objektu eta νόμος , nomos , arau , lege hitzetatik : argizagien legea
) zeruko objektu edo argizagiak ( hala nola izarrak , planetak ,
kometak , galaxiak ) eta lurraren atmosferatik kanpo gertatzen diren
fenomenoak ( hondoko erradiazio kosmikoa , esaterako ) aztertzen
dituen zientzia da . </s>
<s> aldi berean , astronomiaren adarra den kosmologiak unibertsoaren
sorrera eta bilakaera ere ikertzen du . </s>
<s> zientzia independentea bada ere , besteak beste fisika , kimika ,
geologia eta meteorologia zientzietako metodo , teoria eta emaitzak
ere erabiltzen ditu bere ikerketak aurrera eramateko . </s>
===
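
In case it helps, here is a minimal Python sketch for reading this format. It assumes that the file is UTF-8 text and that each sentence is delimited by whitespace-separated <s> and </s> markers, as in the extract above. The function names (iter_sentences, read_corpus) are my own for illustration and are not part of any released tool. Sentences are collected between the markers rather than line by line, so an extract wrapped by a mail client is handled too.

```python
import gzip
from typing import Iterable, Iterator, List


def iter_sentences(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield one token list per <s> ... </s> span.

    Tokens are split on whitespace. A sentence may span several
    input lines; anything outside the markers is ignored.
    """
    tokens: List[str] = []
    inside = False
    for line in lines:
        for tok in line.split():
            if tok == "<s>":
                # Start a fresh sentence (drops any unterminated one).
                tokens, inside = [], True
            elif tok == "</s>":
                if inside:
                    yield tokens
                tokens, inside = [], False
            elif inside:
                tokens.append(tok)


def read_corpus(path: str) -> Iterator[List[str]]:
    """Stream tokenized sentences from a gzip-compressed corpus file.

    Assumption: the file is UTF-8 encoded text.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        yield from iter_sentences(handle)
```

For example, sum(len(s) for s in read_corpus("eu.clean.gz")) would give a rough token count without loading the whole file into memory.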
Best regards,
Tiago Tresoldi
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora