[Corpora-List] Serbian resources wanted

Vlado Keselj vlado at cs.dal.ca
Sat Feb 23 00:09:03 UTC 2013


Hi Martin,

Resources prepared in our paper:

Vlado Keselj and Danko Sipka. A Suffix Subsumption-based Approach
to Building Stemmers and Lemmatizers for Highly Inflectional
Languages with Sparse Resources.  In INFOTHECA, Journal of
Informatics and Librarianship, No 1-2, Volume IX, May 2008.

are available at:
http://web.cs.dal.ca/~vlado/nlp/2007-sr/

Among other resource files, they include lists of lemmatized words:

list-l:    47489 lemmas (0.47 MB)
list-w:   675140 word-forms (7.3 MB)
list-w-l: 696454 word-form/lemma pairs (14.6 MB)
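
As a rough illustration of how the pair list could be used to get
lemma frequencies from a tokenized corpus, here is a minimal Python
sketch. It assumes, without having checked the actual files, that
each line of list-w-l holds one word-form and one lemma separated by
whitespace; the function names and the tie-breaking rule are
hypothetical. Since there are more pairs than word-forms, some forms
must map to several lemmas, so the sketch simply picks the
alphabetically first lemma in that case.

```python
from collections import Counter, defaultdict


def load_pairs(lines):
    """Build a word-form -> set-of-lemmas map.

    Assumption: each line is "<word-form> <lemma>", whitespace-separated.
    Lines that do not have exactly two fields are skipped.
    """
    form_to_lemmas = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue
        form, lemma = parts
        form_to_lemmas[form.lower()].add(lemma.lower())
    return form_to_lemmas


def lemma_counts(tokens, form_to_lemmas):
    """Count lemmas for a sequence of tokens.

    Unknown forms are counted under their lowercased surface form.
    Ambiguous forms take the alphabetically first lemma, which is a
    crude placeholder for real disambiguation.
    """
    counts = Counter()
    for token in tokens:
        key = token.lower()
        lemmas = form_to_lemmas.get(key)
        counts[min(lemmas) if lemmas else key] += 1
    return counts
```

With a real file, load_pairs could be given an open file handle
(opened with the file's actual encoding), and the resulting counts
compared against a reference lemma frequency list.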

Regards,
Vlado


On Fri, 22 Feb 2013, Adam Kilgarriff wrote:

> Hi Martin,
> 
> we have a Serbian corpus in the Sketch Engine so all she needs to do is
> upload her corpus and then run 'keywords' to compare hers with the
> reference.
> 
> The one that is currently available is not lemmatised, so comparisons there
> would be wordform-based. However, we are lemmatising and POS-tagging a newer,
> bigger dataset (courtesy of Nikola Ljubešić) as we speak, so we can make that
> available too; then she can get key lemmas.  If you or she ask, we can make
> a big sample of the lemmatised material available at a day or two's notice.
> 
> Best
> 
> Adam
> 
> 
> On 22 February 2013 15:39, Martin Wynne <martin.wynne at it.ox.ac.uk> wrote:
> 
> > I would like to pose a question on behalf of a student who would like to
> > generate keywords by comparing her corpus of contemporary online personal
> > ads in Serbian with a reference corpus.
> >
> > Does anyone know of any freely available wordlists for the modern Serbian
> > language? Ideally, we'd like a lemma frequency list generated from a
> > general reference corpus, although lists from various other text types
> > could be useful. We'd be interested if there is a corpus available to use
> > as well.
> >
> > Many thanks for any help.
> >
> >
> > --
> > Martin Wynne
> > IT Services, University of Oxford
> > Oxford e-Research Centre
> > Faculty of Linguistics, Philology and Phonetics
> >
> > martin.wynne at it.ox.ac.uk
> >
> >
> >
> > _______________________________________________
> > UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> > Corpora mailing list
> > Corpora at uib.no
> > http://mailman.uib.no/listinfo/corpora
> >
> 
> 
> 
> -- 
> ========================================
> Adam Kilgarriff <http://www.kilgarriff.co.uk/>
> adam at lexmasterclass.com
> Director, Lexical Computing Ltd <http://www.sketchengine.co.uk/>
> Visiting Research Fellow, University of Leeds <http://leeds.ac.uk>
> 
> *Corpora for all* with the Sketch Engine <http://www.sketchengine.co.uk>
> *DANTE: a lexical database for English* <http://www.webdante.com>
> ========================================
> 