[Corpora-List] Time costs between manual pos tagging of English and Chinese corpus

Xing Fukun xingfukun001 at gmail.com
Wed Nov 17 15:23:15 UTC 2010


Dear all,
Has anybody compared the time costs of manual POS tagging for English and Chinese corpora?

I haven’t made any such comparison myself, but I wonder whether there is a difference. A possible reason is that English offers more contextual clues (especially formal or syntactic ones) for determining the POS than Chinese does. Because Chinese has fewer formal or syntactic clues, the annotator has to rely on semantic clues, and sometimes these are not clear enough to rely on.

For example, take “改革很重要” (Reform is very important || To reform is very important). In Chinese, both verbs and nouns can occupy the subject position, so there is no formal clue to the POS of “改革” (reform). Relying on semantics is also difficult: “改革” can be interpreted as either an object or an action in this context. So it is hard to tag the word.

In English it is different. If “reform” is the subject without “to”, it is a noun; if it is the subject with “to”, it is a verb. There are enough formal clues to determine the POS of “reform”.

In this sense, I think it is easier to POS-tag raw English text and perhaps more difficult to tag Chinese, so the time cost of building a Chinese corpus may be higher than for an English one. This is just my guess, without any experiment or investigation. If you know more about this, I would like to hear it.
Thank you in advance.
 




Xing Fukun
2010-11-17
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora

