[Corpora-List] corpus of plain text docs in English
JCorbett at umac.mo
Tue Apr 5 10:13:20 UTC 2011
You could try the Corpus of Modern Scottish Writing (1700-1945) which has
a range of text types going back to the 18th century. At the moment the
texts can only be downloaded one by one - so you could work on a subcorpus
to start with - but a bulk download should be made available in the not
too distant future. See http://www.scottishcorpus.ac.uk/cmsw/ where you
can view digital facsimiles, transcriptions and plain text, and also
download plain text files.
Hope this helps,
John Corbett
From: corpora-request at uib.no
To: corpora at uib.no
Date: 05/04/2011 18:02
Subject: Corpora Digest, Vol 46, Issue 6
Sent by: corpora-bounces at uib.no
Today's Topics:
1. corpus of plain text docs in English (petar at lml.bas.bg)
2. Re: corpus of plain text docs in English (Mark Davies)
3. Call for Papers: "Language Technology for a Multilingual
Europe" (David Vilar)
4. CFP SIGIR 2011 Workshop on "entertain me": Supporting
Complex Search Tasks (Jaap Kamps)
----------------------------------------------------------------------
Message: 1
Date: Fri, 1 Apr 2011 10:13:28 +0300
From: petar at lml.bas.bg
Subject: [Corpora-List] corpus of plain text docs in English
To: Corpora at uib.no
Dear Corpora members,
I am working on a domain-specific machine translation project. I am
looking for a corpus of plain text (historical) documents in English. I
would like to test whether a standard n-gram model, trained on such
docs, could be used to improve other machine translation techniques
designed specifically for historical docs. Would you recommend some corpora?
Thank you.
Best regards,
Petar Mitankin
------------------------------
Message: 2
Date: Mon, 4 Apr 2011 08:43:17 -0600
From: Mark Davies <Mark_Davies at byu.edu>
Subject: Re: [Corpora-List] corpus of plain text docs in English
To: "petar at lml.bas.bg" <petar at lml.bas.bg>, "Corpora at uib.no"
<Corpora at uib.no>
Petar,
I'm not sure how far back you want the texts. If it's just to the early
1800s or so, you might check the links at the 400-million-word Corpus of
Historical American English (http://corpus.byu.edu/coha): Help /
Composition of Corpus. It provides suggestions for some nice text
archives, like Project Gutenberg, Making of America, etc.
For anything farther back than the early 1800s, you could just use the
older texts from Project Gutenberg, or the many online archives of authors
of Early Modern English. If your library is a member, you'll also want to
check the huge collection at Early English Books Online (EEBO) for the
machine readable (as opposed to the PDF image) texts.
Best,
Mark Davies
============================================
Mark Davies
Professor of (Corpus) Linguistics
Brigham Young University
(phone) 801-422-9168 / (fax) 801-422-0906
http://davies-linguistics.byu.edu
** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================
------------------------------
Message: 3
Date: Tue, 05 Apr 2011 10:52:13 +0200
From: David Vilar <david.vilar at dfki.de>
Subject: [Corpora-List] Call for Papers: "Language Technology for a
Multilingual Europe"
To: CORPORA at UIB.NO
PDF Version with complete information:
http://www.dfki.de/~davi01/cfp/ws-cfp.en.pdf
Apologies if you receive multiple copies of this call.
Call for Papers: "Language Technology for a Multilingual Europe"
================================================================
Overview
--------
The Workshop aims to bring together various groups concerned with the
broad topic of "Language Technology for a multilingual Europe". These
include, on the one hand, representatives from research and development
in the field of language technologies and, on the other hand, users from
quite diverse areas. Two examples of the application of language
technology are (automatic / machine) translation and the processing of
texts from the humanities with methods from language technology, such as
automatic topic indexing, text mining, integrating numerous texts and
additional information across languages etc.
These kinds of application areas and research and development in
language technology have in common that they rely on resources (lexica,
corpora, grammars, ontologies etc.), or that they produce these
resources. A multilingual Europe, being supported by language
technology, is only possible if an adequate, interoperable
infrastructure of resources, including the related tooling, is available
for all European languages.
In addition, it is necessary that the aforementioned and other
communities of developers and users of language technology stand as one
homogeneous community. Only in this way will it be possible to assure
the long-term political acceptance of the topic "language technology" in
Europe.
Topics
------
The workshop aims at bringing research and development from academia and
industry together to discuss the aforementioned technical and political
prerequisites for language technology in Europe. Submissions may touch
on the following or other aspects of this overall topic:
- Research and development of language technology in various areas
(Human Language Technology, ICT, eHumanities, ...)
- Infrastructure for resources in language technology
- Prerequisites for interoperability of language technology based
applications
- Language technology and standardization
- "Political perspectives" about requirements and the usefulness of
language technology, from the perspective of research, industry and
various user communities.
Important dates
---------------
Deadline for submission of abstracts: May 15th 2011
Notification of acceptance: June 15th 2011
Workshop: September 27th, the Tuesday before the GSCL conference
--
David Vilar Torres
DFKI GmbH, Alt-Moabit 91c, 10559 Berlin
Tel. (+49) 30 238 95 1845
--------------- Legal Note ---------------
Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern
Geschaeftsfuehrung: Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster
(Vorsitzender), Dr. Walter Olthoff
Vorsitzender des Aufsichtsrats: Prof. Dr. h.c. Hans A. Aukes
Amtsgericht Kaiserslautern, HRB 2313
------------------------------
Message: 4
Date: Tue, 05 Apr 2011 11:11:26 +0200
From: Jaap Kamps <kamps at science.uva.nl>
Subject: [Corpora-List] CFP SIGIR 2011 Workshop on "entertain me":
Supporting Complex Search Tasks
To: corpora at uib.no
SIGIR 2011 Workshop on "entertain me": Supporting Complex Search Tasks
July 28, Beijing
http://staff.science.uva.nl/~kamps/entertainme/
Call for Papers: deadline June 3
* A Workshop on a Single Query ?!?
Searchers with a complex information need typically slice-and-dice their
problem into several queries and subqueries, and laboriously combine the
answers post hoc to solve their tasks. This workshop invites discussion
about any technique, knowledge representation, model or technology to
integrate the search results into a coherent session on a level of
abstraction which matches the original information need.
Consider planning a social event on the last day of SIGIR, in the
unknown city of Beijing, factoring in distances, timing, and preferences
on budget, cuisine, and entertainment. A system supporting the entire
search episode should "know" a lot, either from profiles or implicit
information, or from explicit information in the query or from feedback.
This may lead to the (interactive) construction of a complexly
structured query, but sometimes the most obvious query for a complex
need is dead simple: "entertain me." Rather than returning
ten-blue-lines in response to a 2.4-word query, the desired system
should support searchers during their whole task or search episode, by
iteratively constructing a complex query or search strategy, by
exploring the result-space at every stage, and by combining the partial
answers into a coherent whole.
Although a SIGIR Workshop devoted to a single query may seem
extravagant, this query is just one example of the general problem of
supporting simple and common requests that express complex and dynamic
needs.
* Social Evening Program
Many interesting ideas will come out of the workshop, but how do we know
if they are any good? We will have a special breakout group designing a
mock-up for solving the "entertain me" query, charting out the
background information (implicit and explicit context), the different
sources (maps, web, social, news, ...), and the needed components and
interaction. A group of local Peking University grad students is
available to serve as oracles for local information.
The scientific evaluation of the resulting "entertainment plan" will be
done by executing it in the evening after the workshop, with all
participants.
- Are you willing and able to sponsor the social event? Please contact
the organizers for details.
- Do you want to take part? Read the Call for Submission and contribute!
* Call for Submissions
We invite the submission of papers that think outside the box, from any
aspect of relevance to the workshop's theme, including:
- information seeking behavior, interaction, berry-picking;
- information needs and ways of articulating them;
- implicit and explicit feedback;
- exploiting collection structure and semantic annotations;
- exploratory search, HCI, UI and UX design;
- situated search (maps, Geo, customized, personalized, ...);
- entertainment search (broadcasters, content owners, network operators,
device manufacturers).
We aim to bring together a varied group of researchers covering both
user and system centered approaches, and together work on ways to make
IR systems support searchers when interactively solving a complex task,
such as the entertain me planning problem.
Help us shape the future of IR!
- Submit a short 2-page poster or position paper relevant to
supporting complex tasks, e.g., one that identifies specific research
problems and use-cases, develops models/theory of complex tasks and
interaction, discusses novel interfaces or system components, examines
ways of evaluating, and/or reports on preliminary experiments,
- and take an active part in the discussion at the Workshop.
The deadline is Monday June 3, 2011; submission details and further
information are at http://staff.science.uva.nl/~kamps/entertainme/
Nick Belkin (Rutgers)
Charlie Clarke (Waterloo)
Ning Gao (Peking University)
Jaap Kamps (Amsterdam)
Jussi Karlgren (SICS)
----------------------------------------------------------------------
Send Corpora mailing list submissions to
corpora at uib.no
To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.uib.no/listinfo/corpora
or, via email, send a message with subject or body 'help' to
corpora-request at uib.no
You can reach the person managing the list at
corpora-owner at uib.no
When replying, please edit your Subject line so it is more specific
than "Re: Contents of Corpora digest..."
End of Corpora Digest, Vol 46, Issue 6
**************************************
_______________________________________________
UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora