[Corpora-List] corpus of plain text docs in English

Tue Apr 5 10:13:20 UTC 2011

You could try the Corpus of Modern Scottish Writing (1700-1945) which has 
a range of text types going back to the 18th century. At the moment the 
texts can only be downloaded one by one - so you could work on a subcorpus 
to start with - but a bulk download should be made available in the not 
too distant future. See http://www.scottishcorpus.ac.uk/cmsw/ where you can 
view digital facsimiles, transcriptions and plain text, and also download 
plain text files.

Hope this helps,

John Corbett

From: corpora-request at uib.no
To: corpora at uib.no
Date: 05/04/2011 18:02
Subject: Corpora Digest, Vol 46, Issue 6
Sent by: corpora-bounces at uib.no

Today's Topics:

   1.  corpus of plain text docs in English (petar at lml.bas.bg)
   2. Re:  corpus of plain text docs in English (Mark Davies)
   3.  Call for Papers: "Language Technology for a Multilingual
      Europe" (David Vilar)
   4.  CFP SIGIR 2011 Workshop on "entertain me": Supporting
      Complex Search Tasks (Jaap Kamps)

----------------------------------------------------------------------

Message: 1
Date: Fri, 1 Apr 2011 10:13:28 +0300
From: petar at lml.bas.bg
Subject: [Corpora-List] corpus of plain text docs in English
To: Corpora at uib.no

Dear Corpora members,

I am working on a domain-specific machine translation project. I am
looking for a corpus of plain text (historical) documents in English. I
would like to test whether a standard n-gram model, trained on such
docs, could be used to improve other machine translation techniques
designed specifically for historical docs. Would you recommend some corpora?

Thank you.

Best regards,
Petar Mitankin

------------------------------

Message: 2
Date: Mon, 4 Apr 2011 08:43:17 -0600
From: Mark Davies <Mark_Davies at byu.edu>
Subject: Re: [Corpora-List] corpus of plain text docs in English
To: "petar at lml.bas.bg" <petar at lml.bas.bg>, "Corpora at uib.no"
                 <Corpora at uib.no>

Petar,

I'm not sure how far back you want the texts. If it's just to the early 
1800s or so, you might check the links at the 400 million word Corpus of 
Historical American English (http://corpus.byu.edu/coha): Help / 
Composition of Corpus. It provides suggestions for some nice text 
archives, like Project Gutenberg, Making of America, etc.

For anything farther back than the early 1800s, you could just use the 
older texts from Project Gutenberg, or the many online archives of authors 
of Early Modern English. If your library is a member, you'll also want to 
check the huge collection at Early English Books Online (EEBO) for the 
machine readable (as opposed to the PDF image) texts.

Best,

Mark Davies

============================================
Mark Davies
Professor of (Corpus) Linguistics
Brigham Young University
(phone) 801-422-9168 / (fax) 801-422-0906

http://davies-linguistics.byu.edu

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================ 

------------------------------

Message: 3
Date: Tue, 05 Apr 2011 10:52:13 +0200
From: David Vilar <david.vilar at dfki.de>
Subject: [Corpora-List] Call for Papers: "Language Technology for a
                 Multilingual Europe"
To: CORPORA at UIB.NO

PDF Version with complete information:
http://www.dfki.de/~davi01/cfp/ws-cfp.en.pdf

Apologies if you receive multiple copies of this call.

Call for Papers: "Language Technology for a Multilingual Europe"
================================================================

Overview
--------

The Workshop aims at bringing various groups together who are concerned
with the broad topic of "Language Technology for a multilingual Europe".
This encompasses on the one hand representatives from research and
development in the field of language technologies, on the other hand
users from quite diverse areas. Two examples of the application of
language technology are (automatic / machine) translation, and processing
of texts from the humanities with methods from language technology, like
automatic topic indexing, text mining, integrating numerous texts and
additional information across languages etc.

These kinds of application areas and research and development in
language technology have in common that they rely on resources (lexica,
corpora, grammars, ontologies etc.), or that they produce these
resources. A multilingual Europe, being supported by language
technology, is only possible if an adequate, interoperable
infrastructure of resources, including the related tooling, is available
for all European languages.

In addition it is necessary that the aforementioned and other
communities of developers and users of language technology stand as one,
homogeneous community.  Only in this way will it be possible to assure
the long-term political acceptance of the topic "language technology" in
Europe.

Topics
------

The workshop aims at bringing research and development from academia and
industry together, to discuss the aforementioned technical and political
prerequisites for language technology in Europe. Submissions may touch
on the following or other aspects of this overall topic:

- Research and development of language technology in various areas
   (Human Language Technology, ICT, eHumanities, ...)
- Infrastructure for resources in language technology
- Prerequisites for interoperability of language technology based
   applications
- Language technology and standardization
- "Political perspectives" about requirements and the usefulness of
   language technology, from the perspective of research, industry and
   various user communities.

Important dates
---------------

Deadline for submission of abstracts: May 15th 2011
Notification of acceptance: June 15th 2011
Workshop: September 27th, the Tuesday before the GSCL conference

-- 
David Vilar Torres
DFKI GmbH, Alt-Moabit 91c, 10559 Berlin
Tel. (+49) 30 238 95 1845

------------------------------

Message: 4
Date: Tue, 05 Apr 2011 11:11:26 +0200
From: Jaap Kamps <kamps at science.uva.nl>
Subject: [Corpora-List] CFP SIGIR 2011 Workshop on "entertain me":
                 Supporting Complex Search Tasks
To: corpora at uib.no

SIGIR 2011 Workshop on "entertain me": Supporting Complex Search Tasks
July 28, Beijing
http://staff.science.uva.nl/~kamps/entertainme/

Call for Papers: deadline June 3

* A Workshop on a Single Query ?!?

Searchers with a complex information need typically slice-and-dice their 
problem into several queries and subqueries, and laboriously combine the 
answers post hoc to solve their tasks.  This workshop invites discussion 
about any technique, knowledge representation, model or technology to 
integrate the search results into a coherent session on a level of 
abstraction which matches the original information need.

Consider planning a social event on the last day of SIGIR, in the 
unknown city of Beijing, factoring in distances, timing, and preferences 
on budget, cuisine, and entertainment.  A system supporting the entire 
search episode should "know" a lot, either from profiles or implicit 
information, or from explicit information in the query or from feedback. 
This may lead to the (interactive) construction of a complexly 
structured query, but sometimes the most obvious query for a complex 
need is dead simple: "entertain me."  Rather than returning 
ten-blue-lines in response to a 2.4-word query, the desired system 
should support searchers during their whole task or search episode, by 
iteratively constructing a complex query or search strategy, by 
exploring the result-space at every stage, and by combining the partial 
answers into a coherent whole.

Although a SIGIR Workshop devoted to a single query may seem 
extravagant, this query is just one example of the general problem of 
supporting simple and common requests that express complex and dynamic 
needs.

* Social Evening Program

Many interesting ideas will come out of the workshop, but how do we know 
if they are any good?  We will have a special breakout group designing a 
mock-up for solving the "entertain me" query, charting out the 
background information (implicit and explicit context), the different 
sources (maps, web, social, news, ...), and the needed components and 
interaction.  A group of local Peking University grad students is 
available to serve as oracles for local information.

The scientific evaluation of the resulting "entertainment plan" will be 
done by executing it in the evening after the workshop, with all 
participants.

- Are you willing and able to sponsor the social event?  Please contact 
the organizers for details.
- Do you want to take part?  Read the Call for Submissions and contribute!

* Call for Submissions

We invite the submission of papers that think outside the box, from any 
aspect of relevance to the workshop's theme, including:

- information seeking behavior, interaction, berry-picking;
- information needs and ways of articulating them;
- implicit and explicit feedback;
- exploiting collection structure and semantic annotations;
- exploratory search, HCI, UI and UX design;
- situated search (maps, Geo, customized, personalized, ...);
- entertainment search (broadcasters, content owners, network operators, 
device manufacturers).

We aim to bring together a varied group of researchers covering both 
user and system centered approaches, and together work on ways to make 
IR systems support searchers when interactively solving a complex task, 
such as the entertain me planning problem.

Help us shape the future of IR!

- Submit a short 2-page poster or position paper relevant to 
supporting complex tasks, e.g., one that identifies specific research 
problems and use cases, develops models or theory of complex tasks and 
interaction, discusses novel interfaces or system components, examines 
ways of evaluating, and/or reports on preliminary experiments,

- and take an active part in the discussion at the Workshop.

The deadline is June 3, 2011; submission details and further 
information are at http://staff.science.uva.nl/~kamps/entertainme/

Nick Belkin (Rutgers)
Charlie Clarke (Waterloo)
Ning Gao (Peking University)
Jaap Kamps (Amsterdam)
Jussi Karlgren (SICS)

----------------------------------------------------------------------
Send Corpora mailing list submissions to
                 corpora at uib.no

To subscribe or unsubscribe via the World Wide Web, visit
                 http://mailman.uib.no/listinfo/corpora
or, via email, send a message with subject or body 'help' to
                 corpora-request at uib.no

You can reach the person managing the list at
                 corpora-owner at uib.no

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Corpora digest..."

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora

End of Corpora Digest, Vol 46, Issue 6
**************************************
