Corpora: protein name list
Philip Resnik
resnik at umiacs.umd.edu
Thu Nov 1 16:46:06 UTC 2001
> I am collecting protein name list for bioinformatics research.
> Does anyone know of public protein name list?
You might find GenBank useful (http://www.ncbi.nlm.nih.gov/Genbank/).
In particular, there is a protein database "compiled from a variety of
sources, including SwissProt, PIR, PRF, PDB" (see the information at
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein).
Also, the SWISS-PROT database can be downloaded; see the "downloading"
link at http://www.expasy.ch/sprot/sprot-top.html. Either of these
should provide a source from which a protein name list could be
extracted.
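In case it is useful, here is a minimal sketch in Python of how such an
extraction might go for the SWISS-PROT flat file. It assumes the usual
flat-file layout: each line begins with a two-letter line code, "DE" lines
carry the description (protein name) starting at column 6, and "//" ends
an entry. The function name and the sample entries are hypothetical, made
up purely for illustration, and the exact content of the DE lines varies
between releases, so treat this as a starting point rather than a
definitive parser.

```python
from typing import Iterable, Iterator


def iter_descriptions(lines: Iterable[str]) -> Iterator[str]:
    """Yield the joined DE (description) text of each flat-file entry."""
    parts = []
    for line in lines:
        code = line[:2]
        if code == "DE":
            # Data starts after the two-letter code and three spaces.
            parts.append(line[5:].strip())
        elif code == "//":
            # End of entry: emit whatever description was collected.
            if parts:
                yield " ".join(parts)
            parts = []
    if parts:  # last entry lacked a terminator line
        yield " ".join(parts)


# Hypothetical sample entries, not taken from the real database.
sample = """\
ID   EXAMPLE1_HUMAN  STANDARD;  PRT;  100 AA.
DE   HYPOTHETICAL EXAMPLE PROTEIN 1 (EXAMPLE
DE   SYNONYM).
SQ   SEQUENCE   100 AA;
//
ID   EXAMPLE2_HUMAN  STANDARD;  PRT;  50 AA.
DE   HYPOTHETICAL EXAMPLE PROTEIN 2.
//
""".splitlines()

names = list(iter_descriptions(sample))
for name in names:
    print(name)
```

For the real file you would pass an open file object instead of the
sample lines. Splitting each description into a primary name and its
synonyms is a further step that depends on the release's DE conventions.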
You might also be interested in a nice paper by Vasileios
Hatzivassiloglou, Pablo A. Duboue and Andrey Rzhetsky: Disambiguating
Proteins, Genes, and RNA in Text: A Machine Learning Approach, in
Proceedings of the 9th International Conference on Intelligent Systems
for Molecular Biology, Tivoli Gardens, Denmark, July 21--25, 2001
(http://www.cs.columbia.edu/~pablo/publications/ISMB2001disambiguation.pdf).
They apply supervised learning techniques to the disambiguation of
textual references, which you might find important, since many
occurrences in text of items on a protein name list may actually refer
to the related gene or RNA rather than to the protein. (I myself am
exploring the use of named-entity tagging techniques for similar
purposes.)
I hope this helps. I'd be grateful if you'd post or forward any
useful replies you receive!
Best,
Philip
----------------------------------------------------------------
Philip Resnik, Assistant Professor
Department of Linguistics and Institute for Advanced Computer Studies
1401 Marie Mount Hall            UMIACS phone: (301) 405-6760
University of Maryland           Linguistics phone: (301) 405-8903
College Park, MD 20742 USA       Fax: (301) 314-2644 / (301) 405-7104
http://umiacs.umd.edu/~resnik    E-mail: resnik at umiacs.umd.edu