Corpora: protein name list

Philip Resnik resnik at umiacs.umd.edu
Thu Nov 1 16:46:06 UTC 2001


>   I am collecting a protein name list for bioinformatics research.
>   Does anyone know of a public protein name list?

You might find GenBank useful (http://www.ncbi.nlm.nih.gov/Genbank/).
In particular, there is a protein database "compiled from a variety of
sources, including SwissProt, PIR, PRF, PDB" (see the information at
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein).
Also, the SWISS-PROT database can be downloaded; see the "downloading"
link at http://www.expasy.ch/sprot/sprot-top.html.  Either of these
should provide a source from which a protein name list could be
extracted.
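
As a rough sketch of the extraction step (not a tested tool), the
Python below pulls the description text out of a SWISS-PROT-style flat
file. It assumes the flat-file layout in which each line begins with a
two-letter code, "DE" lines carry the protein description, and "//"
ends an entry. The function name and the two sample entries are
invented for illustration and are not real database records.

```python
from typing import Iterable, Iterator


def protein_descriptions(lines: Iterable[str]) -> Iterator[str]:
    """Yield one description string per entry of a SWISS-PROT-style flat file."""
    parts = []
    for line in lines:
        if line.startswith("DE "):
            # Drop the two-letter line code and the surrounding whitespace.
            parts.append(line[2:].strip())
        elif line.startswith("//"):
            # End of entry: join continuation lines, drop trailing periods.
            if parts:
                yield " ".join(parts).rstrip(".")
            parts = []
    # Handle a final entry that lacks the "//" terminator.
    if parts:
        yield " ".join(parts).rstrip(".")


# Invented sample entries (assumption: old-style DE lines, where synonyms
# appear in parentheses and long descriptions continue on further DE lines).
SAMPLE = """\
ID   EXA_TEST       STANDARD;      PRT;   100 AA.
AC   X00001;
DE   EXAMPLE PROTEIN A.
OS   Hypothetical organism.
//
ID   EXB_TEST       STANDARD;      PRT;   200 AA.
DE   EXAMPLE KINASE B
DE   (EXAMPLE SYNONYM).
//
"""

names = list(protein_descriptions(SAMPLE.splitlines()))
for name in names:
    print(name)
```

On a real download you would pass the open file object instead of
SAMPLE.splitlines(). Splitting the parenthesized synonyms into separate
list items is left out here, since how to treat them depends on your
purposes.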

You might also be interested in a nice paper by Vasileios
Hatzivassiloglou, Pablo A. Duboue and Andrey Rzhetsky: Disambiguating
Proteins, Genes, and RNA in Text: A Machine Learning Approach, in
Proceedings of the 9th International Conference on Intelligent Systems
for Molecular Biology, Tivoli Gardens, Denmark, July 21--25, 2001
(http://www.cs.columbia.edu/~pablo/publications/ISMB2001disambiguation.pdf).
They apply supervised learning techniques to the disambiguation of
textual references, which may matter for your purposes: many
occurrences in text of items from a protein name list actually refer
to the related gene or RNA rather than to the protein.  (I myself am
exploring the use of named-entity tagging techniques for similar
purposes.)

I hope this helps.  I'd be grateful if you'd post or forward any
useful replies you receive!

Best,

  Philip

  ----------------------------------------------------------------
  Philip Resnik, Assistant Professor
  Department of Linguistics and Institute for Advanced Computer Studies

  1401 Marie Mount Hall            UMIACS phone: (301) 405-6760
  University of Maryland           Linguistics phone: (301) 405-8903
  College Park, MD 20742 USA	   Fax: (301) 314-2644 / (301) 405-7104
  http://umiacs.umd.edu/~resnik	   E-mail: resnik at umiacs.umd.edu


