[Corpora-List] data set

Trevor Jenkins trevor.jenkins at suneidesis.com
Sat Aug 31 15:35:57 UTC 2013


On 31 Aug 2013, at 13:43, Rezan Moradi <rizan_rm1989 at yahoo.com> wrote:

> I'm studying the field of "Expert Finding" and I have some background in it. Now I want to use language models, but language models need a suitable data set in text format. My main problem is the lack of such a data set. I need a data set containing a large number of papers in .txt format, where each paper consists of a title, keywords, abstract, author name(s) and main text. The data set I used previously consists of title, abstract and author name(s) only.
> Any help or hint at the existence of such a data set will be appreciated.
> Thank you very much.

You don't say what language(s), domain(s), currency, status or count of the papers you are looking for. However, open access journals might provide a starting point for creating a dataset appropriate to your needs. The Directory of Open Access Journals (http://www.doaj.org/) will provide you with sources of such papers. Their list covers domains from Physics to Film to 19th century (Western/European) History to Medicine to Sociology to Linguistics to Translation and many others, with overlaps between them, and in various languages including English and Danish.

It is unlikely that you will find any in plain text format. You should expect to convert from PDF, HTML, or whatever file format the publisher uses (for example, several DOAJ offerings use EPUB format).
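
For the HTML case, the conversion to plain text can be roughly sketched with nothing but the Python standard library. This is only an illustration under assumptions: the names TextExtractor and html_to_text are hypothetical, the list of block-level tags is a guess rather than anything a particular publisher guarantees, and real journal pages will need extra work to separate title, keywords, abstract, authors and main text. PDF and EPUB are not covered here.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text from an HTML page (hypothetical helper)."""

    # Content of these tags is never shown to a reader, so drop it.
    SKIP = {"script", "style"}
    # Assumed set of tags that should start a new line in the output.
    BLOCK = {"title", "p", "div", "br", "li", "tr",
             "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self._chunks.append("\n")

    def handle_data(self, data):
        # Keep text only when we are not inside <script> or <style>.
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        raw = "".join(self._chunks)
        # Collapse runs of whitespace and drop empty lines.
        lines = (" ".join(line.split()) for line in raw.splitlines())
        return "\n".join(line for line in lines if line)


def html_to_text(html):
    """Return the visible text of an HTML string, one block per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    sample = ("<html><body><h1>A Paper</h1>"
              "<p>First   paragraph.</p></body></html>")
    print(html_to_text(sample))
```

Each downloaded page could be passed through html_to_text and the result written to a .txt file. Splitting that text into title, abstract and so on depends on each publisher's markup and is left open here.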

Regards, Trevor.

<>< Re: deemed!
