[Corpora-List] Legal-domain corpora
Jernej Vicic
jernej.vicic at pef.upr.si
Wed Oct 18 15:45:44 UTC 2006
You can try JRC-Acquis:
JRC-Acquis: a large aligned parallel corpus in 21 languages, freely
available
SIZE AND FORMAT
- 21 languages (all 20 official EU languages plus Romanian)
- Average corpus size: 8.8 million words per language
- XML Format according to TEI P4, UTF-8-encoded
- Modular: download the languages you need.
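As a rough illustration of the format point above, here is a minimal sketch of reading paragraphs from a UTF-8 encoded, TEI P4-style file with Python's standard library. The inline SAMPLE document, its ids and its element layout (a TEI.2 root with numbered p elements in the body) are assumptions made for illustration only; the actual JRC-Acquis files may be structured differently, so check a downloaded file before relying on this.

```python
import xml.etree.ElementTree as ET

# Hypothetical TEI P4-style sample. The real JRC-Acquis layout may differ.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<TEI.2 id="sample-en" lang="en">
  <teiHeader>
    <fileDesc><titleStmt><title>Sample act</title></titleStmt></fileDesc>
  </teiHeader>
  <text>
    <body>
      <p n="1">Article 1</p>
      <p n="2">This Regulation shall enter into force on the day of its publication.</p>
    </body>
  </text>
</TEI.2>
"""


def paragraphs(xml_bytes):
    """Return (paragraph number, text) tuples for every <p> element."""
    root = ET.fromstring(xml_bytes)
    return [(p.get("n"), "".join(p.itertext()).strip()) for p in root.iter("p")]


def word_count(xml_bytes):
    """Count whitespace-separated tokens across all paragraphs."""
    return sum(len(text.split()) for _, text in paragraphs(xml_bytes))


if __name__ == "__main__":
    data = SAMPLE.encode("utf-8")  # files on disk would be read as bytes
    for number, text in paragraphs(data):
        print(number, text)
    print("words:", word_count(data))
```

Passing bytes rather than a decoded string lets the parser honour the encoding declared in the XML header.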
LANGUAGES
Czech, Danish, Dutch, English, Estonian, German, Greek, Finnish, French,
Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese,
Romanian, Slovak, Slovene, Spanish, Swedish.
TEXT TYPES
- Documents on contents, principles and political objectives of the EU
  Treaties
- EU legislation
- Declarations
- Resolutions
- Acts
- International agreements.
PARAGRAPH ALIGNMENT
- Paragraph-aligned for all 210 language pairs
- Paragraphs are sentence parts, sentences, or groups of sentences
- 2 alternative alignments: using Vanilla and HunAlign
- Ca. 270,000 alignments per language pair.
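The figure of 210 language pairs is consistent with counting unordered pairs among the 21 languages listed above. The short sketch below checks that arithmetic. Treating a pair as unordered, and multiplying by the approximate 270,000 alignments per pair to get a rough overall total, are my own readings of the announcement and not figures stated in it.

```python
from itertools import combinations

# The 21 languages named in the announcement.
LANGUAGES = [
    "Czech", "Danish", "Dutch", "English", "Estonian", "German", "Greek",
    "Finnish", "French", "Hungarian", "Italian", "Latvian", "Lithuanian",
    "Maltese", "Polish", "Portuguese", "Romanian", "Slovak", "Slovene",
    "Spanish", "Swedish",
]

# Unordered pairs: (Czech, Danish) and (Danish, Czech) count once.
PAIRS = list(combinations(LANGUAGES, 2))

# Approximate figure from the announcement ("Ca. 270,000").
ALIGNMENTS_PER_PAIR = 270_000

# Rough estimate only; the actual number varies by language pair.
ROUGH_TOTAL_ALIGNMENTS = len(PAIRS) * ALIGNMENTS_PER_PAIR

if __name__ == "__main__":
    print(len(LANGUAGES), "languages")
    print(len(PAIRS), "language pairs")
    print(ROUGH_TOTAL_ALIGNMENTS, "alignments in total (rough estimate)")
```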
MANUAL SUBJECT DOMAIN CLASSIFICATION
- Manually classified according to EUROVOC subject domains
- Categories selected from 6000 hierarchically organised, wide-coverage classes.
USE / DOWNLOAD
- Download from http://langtech.jrc.it/JRC-Acquis.html
- Usage free for research purposes.
Seth Grimes wrote:
>Hello all,
>
> I'm researching legal-domain application of NLP with machine
>learning. What annotated corpora are available in this domain, either for
>free or for a license fee? I'd be interested in --
>
>- legislation and statutes
>- case law
>- briefs, depositions & testimony, crime reports, and evidentiary
>materials
>- court judgments
>- patent filings
>
>-- and also in parallel, multi-lingual corpora, for instance that might
>have been created in the EU, Switzerland, Canada, and other areas with
>multiple official languages.
>
> I've been told that news-media text can provide good training
>material for the legal domain. I'd also be interested in hearing
>reactions to that claim, especially if anyone has formally studied the
>question.
>
> Thanks very much for all help,
>
> Seth
>
>
>--
>Seth Grimes Alta Plana Corp, analytical computing & data management
> Intelligent Enterprise magazine (CMP), Contributing Editor
>grimes at altaplana.com http://altaplana.com 301-270-0795
>
>
>