[Corpora-List] Text Mining for Trend Analysis
王經篤
wangjingdoo at gmail.com
Fri Sep 17 14:47:41 UTC 2010
Dear all,
I am working on extracting maximal repeated patterns
from text and computing the frequency distribution of
these patterns over time (the "pattern history").
A web site for browsing pattern histories is available at
http://120.108.115.115/TM/Search_PubMed_Simple.php
The pattern histories on that site were extracted from PubMed
medical articles (from 1990 to 2009):
3,225,549 articles containing 677,728,269 words
(over 600 million words).
Note that the extracted patterns include not only
single words but also multi-word phrases,
e.g. "patients with squamous cell carcinoma of the head and neck".
To be more specific, any segment (a sequence of words) within a sentence
in the corpus is extracted if that segment appears at least twice;
the corresponding frequency distribution of that segment
over time, defined as its "pattern history", is then computed.
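As a rough illustration of the idea only, here is a minimal Python sketch
that counts every word segment per year and keeps the segments seen at
least twice. It is not the system behind the web site: it uses naive
n-gram counting rather than true maximal-repeat extraction, and the
function name pattern_history, the (year, sentence) input format, and the
max_len cap are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def pattern_history(docs, min_count=2, max_len=5):
    """Return {segment: {year: count}} for segments seen >= min_count times.

    docs is an iterable of (year, sentence) pairs (hypothetical format).
    max_len caps the segment length in words; it is an assumption added
    to keep this naive enumeration small, not part of the original method.
    """
    history = defaultdict(Counter)
    for year, sentence in docs:
        words = sentence.lower().split()
        # Enumerate every contiguous word segment of up to max_len words.
        for start in range(len(words)):
            stop = min(start + max_len, len(words))
            for end in range(start + 1, stop + 1):
                segment = " ".join(words[start:end])
                history[segment][year] += 1
    # Keep only segments that repeat somewhere in the corpus.
    return {
        segment: dict(years)
        for segment, years in history.items()
        if sum(years.values()) >= min_count
    }
```

Given timestamped sentences, the returned dictionary maps each repeated
segment to its yearly counts; segments that occur only once are dropped.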
I am looking for more retrospective (historical, chronological)
corpora, publications, or literature to make my experiments more
robust, and I am seeking linguistic experts to cooperate with who
can provide timestamped text.
In return, I will provide them with the pattern histories extracted
from their corpora.
Please let me know if you have textual data (a corpus) with timestamps.
Yours faithfully,
P.S. An abstract describing this work is attached.
--
Jing-Doo Wang
Assistant Professor
Department of Computer Science and Information Engineering
Asia University, Taiwan.
886-4-23323456-ext 1847
http://asia.edu.tw/~jdwang
jdwang at asia.edu.tw
wangjingdoo at gmail.com
-------------- next part --------------
A non-text attachment was scrubbed...
Name: PatternHistoryExtraction_Abstract.pdf
Type: application/pdf
Size: 23223 bytes
Desc: not available
URL: <http://listserv.linguistlist.org/pipermail/corpora/attachments/20100917/a81228e4/attachment-0001.pdf>