[Corpora-List] Query on the use of Google for corpus research

Linda Bawcom linda.bawcom at sbcglobal.net
Thu Jun 2 12:02:35 UTC 2005


Dear Nancy,

Although I have not been following this thread too closely, since I am not using the web as a corpus, your reference to a tool for near-duplicate detection caught my eye, as my small corpus (approx. 120,000 running words) is drawn from Lexis Nexis newspaper articles. Although I tried to avoid it as much as possible, a number of the articles are compilations from newswire services and/or major newspapers (and who knows how many others are compilations without saying so).

While I realize that I can detect some duplication through my concordance lines, for the kind of research I'm doing it would be extremely helpful to know exactly how much has been borrowed. (For that reason, I was even entertaining the thought of running the articles through a program used to identify plagiarism, but unfortunately such programs are rather pricey, and to the best of my knowledge the University of Houston doesn't have one, at least not one available to adjuncts!)
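(For anyone following along, a rough-and-ready way to estimate "how much has been borrowed" between two articles, without a commercial plagiarism tool, is word n-gram "shingling" plus Jaccard similarity. The sketch below is only illustrative; the sample sentences and the choice of n are made up, not taken from any actual corpus.)

```python
# A minimal sketch of near-duplicate detection via word n-gram
# "shingling" and Jaccard set similarity. Illustrative only: the
# example texts and shingle size n are arbitrary assumptions.

def shingles(text, n=3):
    """Return the set of word n-grams (shingles) in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets (0 = disjoint, 1 = identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Two hypothetical newswire variants of the same story.
article1 = "The storm made landfall on the coast early Tuesday morning officials said"
article2 = "The storm made landfall on the coast early Tuesday evening officials said"

overlap = jaccard(shingles(article1), shingles(article2))
print(f"shingle overlap: {overlap:.2f}")
```

A high overlap score flags a likely newswire duplicate worth inspecting by hand; running every pair of articles through this gives a rough map of how much of the corpus is recycled text.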

Truthfully, it may not matter in the end, but I would like to cover the possibility. Any advice or suggestions from you, or from anyone else who has used newspaper articles for their corpus, would be very much appreciated.

Best wishes,
Linda Bawcom (currently at the University of Liverpool)



