[Corpora-List] All English Text Messaging Corpus?

Khurshid Ahmad kahmad at scss.tcd.ie
Mon Apr 11 13:43:34 UTC 2011


Dear Laura
I am writing to support Rich.

The USPTO documents are in legal English and are written by patent
attorneys. These documents form a representative sample of American
English and, to a lesser extent, of other national varieties of written
English.  There is a distinction between patent claims and granted
patents.  For authenticity, I prefer the granted patents, as these
documents have been reviewed by more than one person. I think the USPTO
allows you to make that distinction in its retrieval engine.


Yes, there has been a proliferation of such documents, and there may be
some laxity by some attorneys in some documents.  Large corporations like
IBM and Google file these patents, and the assignees of the patents
include the US Defence Forces. I infer from the named entities on these
documents that some care and attention has been paid to the legal
arguments presented in them; and apart from the diagrams in the patent
documents, we have written English. The whole point of corpus linguistics
is that some texts within a collection will be outliers of that
collection.  Literary critics will not allow for outliers, but dictionary
makers and information extraction folk love the outliers.

I think it is an excellent idea to be so focussed on building a corpus.
The world, and its scholars, are so fixated on newspaper and newswire
texts that any other variety is seldom considered.

Good luck

> Hi Laura,
>
> I don't know of any text message sources exactly like what you are
> describing.  But there is a huge, partially structured text database for
> US patent documents, nearly all in English I suppose, which have all been
> critiqued by expert examiners, as edited in the process of negotiating a
> patent claim set - all in English.  You can create databases of patent
> documents on your desktop by downloading the free web client software Elk
> for Patents (EfP), which is built on the English Logic Kernel (Elk), as
> described in US Patent 7,209,923.  The patent is posted on the web site as
> well.  It teaches ways to combine corpus analysis methods with relational
> and object-oriented database technologies.  See my website to download and
> try the free program.
>
> 		EnglishLogicKernel dot com
>
> One advantage of choosing the patent database is that every document is
> constrained by the patenting process by experts in each patent's specific
> technologies, and the vocabulary of words defined modus ponens after
> careful debate and crafting of each claim sentence.  For example, no
> really effective syntax parser for English has reached widespread usage,
> with the best of the performers being the Link Grammar Processor (LGP),
> IMHO.  Using the vocabulary of non-noise words defined in patent claims,
> the English analyst can relate those claim words and phrases to specific
> objects as they have been described by sentences in the much more verbose
> specification part of the patent document.  This provides an ideal, large,
> partially structured database and processing environment in which to
> analyze the English of claim language.
>
> HTH,
> -Rich
>
>
> Sincerely,
> Rich Cooper
> EnglishLogicKernel.com
> Rich AT EnglishLogicKernel DOT com
> 9 4 9 \ 5 2 5 - 5 7 1 2
>
> -----Original Message-----
> From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of
> Christopherson, Laura
> Sent: Saturday, April 09, 2011 12:35 PM
> To: corpora at uib.no
> Subject: [Corpora-List] All English Text Messaging Corpus?
>
> Hi All,
>
> Do any of you know of a text messaging corpus only in English that is not
> a collection of someone's personal (and/or family/friends') messages?
>
> Thanks,
>

