[Corpora-List] Legal-domain corpora

Jernej Vicic jernej.vicic at pef.upr.si
Wed Oct 18 15:45:44 UTC 2006


You can try JRC-Acquis:

JRC-Acquis: a large aligned parallel corpus in 21 languages, freely 
available

SIZE AND FORMAT

- 21 languages (all 20 official EU languages plus Romanian)
- Average corpus size: 8.8 million words per language
- XML Format according to TEI P4, UTF-8-encoded
- Modular: download the languages you need.
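
For anyone who wants to look inside the files, here is a minimal sketch of reading paragraphs from a UTF-8 encoded TEI P4 document with Python's standard library. The sample document, its id, and the use of an "n" attribute on each paragraph are assumptions for illustration only; check the real files for the exact element layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature TEI P4 document; the real corpus files may be
# structured differently (assumed here: <p> elements carrying an "n" attribute).
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<TEI.2 id="sample-doc" lang="sl">
  <text>
    <body>
      <p n="1">Člen 1.</p>
      <p n="2">Drugi odstavek.</p>
    </body>
  </text>
</TEI.2>"""


def paragraphs(xml_bytes):
    """Return (n, text) tuples for every <p> element in the document."""
    root = ET.fromstring(xml_bytes)
    return [(p.get("n"), "".join(p.itertext()).strip()) for p in root.iter("p")]


# Parse from bytes so the XML declaration's encoding is honoured.
paras = paragraphs(SAMPLE.encode("utf-8"))
for n, text in paras:
    print(n, text)
```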

LANGUAGES

Czech, Danish, Dutch, English, Estonian, German, Greek, Finnish, French,
Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese,
Romanian, Slovak, Slovene, Spanish, Swedish.

TEXT TYPES

- Documents on contents, principles and political objectives of the EU Treaties
- EU legislation
- Declarations
- Resolutions
- Acts
- International agreements.

PARAGRAPH ALIGNMENT

- Paragraph-aligned for all 210 language pairs
- Paragraphs are sentence parts, sentences, or groups of sentences
- Two alternative alignments, produced with the Vanilla and HunAlign aligners
- Ca. 270,000 alignments per language pair.
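
The figure of 210 language pairs follows from the 21 languages listed above: every unordered pair is counted once, so 21 * 20 / 2 = 210. A small sketch that checks this (the variable names are only illustrative):

```python
from itertools import combinations

# The 21 languages named in the LANGUAGES section above.
LANGUAGES = [
    "Czech", "Danish", "Dutch", "English", "Estonian", "German", "Greek",
    "Finnish", "French", "Hungarian", "Italian", "Latvian", "Lithuanian",
    "Maltese", "Polish", "Portuguese", "Romanian", "Slovak", "Slovene",
    "Spanish", "Swedish",
]

# Each unordered pair of distinct languages appears exactly once.
pairs = list(combinations(LANGUAGES, 2))

print(len(LANGUAGES), "languages,", len(pairs), "language pairs")
```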

MANUAL SUBJECT DOMAIN CLASSIFICATION

- Manually classified according to EUROVOC subject domains
- Classes are selected from a wide-coverage hierarchy of 6000 classes.

USE / DOWNLOAD

- Download from http://langtech.jrc.it/JRC-Acquis.html
- Usage free for research purposes.



Seth Grimes wrote:

>Hello all,
>
>	I'm researching legal-domain application of NLP with machine
>learning.  What annotated corpora are available in this domain, either for
>free or for a license fee?  I'd be interested in --
>
>- legislation and statutes
>- case law
>- briefs, depositions & testimony, crime reports, and evidentiary
>materials
>- court judgments
>- patent filings
>
>-- and also in parallel, multi-lingual corpora, for instance that might
>have been created in the EU, Switzerland, Canada, and other areas with
>multiple official languages.
>
>	I've been told that news-media text can provide good training
>material for the legal domain.  I'd also be interested in hearing
>reactions to that claim, especially if anyone has formally studied the
>question.
>
>	Thanks very much for all help,
>
>					Seth
>
>
>--
>Seth Grimes   Alta Plana Corp, analytical computing & data management
>              Intelligent Enterprise magazine (CMP), Contributing Editor
>grimes at altaplana.com       http://altaplana.com    301-270-0795
>
>  
>


