[Corpora-List] a corpus of Polish chat conversations (long note)

Leszek Szymański l_sz at poczta.fm
Mon Dec 7 15:45:51 UTC 2009



Dear Colleagues,

Below is a brief description of my corpus-based research on Internet chat communication. The research formed the basis of my doctoral dissertation.

I collected a corpus of Polish text-based Internet chat conversations held between February 20, 2004, 5:32 p.m., and March 27, 2006, 10:54 p.m. The raw material needed editing, so I removed all utterances produced by bots, all software information, and the time stamps at the beginning of each line. I also repaired the corrupted letters with diacritic marks, which occurred throughout the material. This resulted in a corpus of 1,629,823 running words. The average word length was 4.36 letters.
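
The cleaning steps described above can be illustrated with a minimal sketch. It is not the procedure used in the dissertation: the log format ("[17:32] <nick> message"), the bot nicknames, and the function names are all assumptions made for the example, and the simple tokenizer keeps only runs of letters, so emoticons and digits are dropped.

```python
import re

# Assumed (hypothetical) log format: "[17:32] <nick> message text".
TIMESTAMP = re.compile(r"^\[\d{1,2}:\d{2}(?::\d{2})?\]\s*")
SPEAKER = re.compile(r"^<([^>]+)>\s*(.*)$")
BOT_NICKS = {"ChanServ", "QuizBot"}  # hypothetical bot nicknames


def clean_lines(raw_lines):
    """Strip time stamps, skip software messages and bot utterances."""
    for line in raw_lines:
        line = TIMESTAMP.sub("", line.strip())
        match = SPEAKER.match(line)
        if match is None:
            continue  # joins, quits and other software information
        nick, text = match.groups()
        if nick in BOT_NICKS:
            continue
        yield nick, text


def tokenize(text):
    """Return runs of letters, including letters with Polish diacritics."""
    return re.findall(r"[^\W\d_]+", text)


def average_word_length(words):
    """Mean number of letters per running word."""
    return sum(len(w) for w in words) / len(words) if words else 0.0


raw = [
    "[17:32] <ania> czesc wszystkim",
    "[17:32] *** kuba has joined #polska",
    "[17:33] <QuizBot> Pytanie 1",
    "[17:34] <kuba> hej :)",
]
cleaned = list(clean_lines(raw))
words = [w for _, text in cleaned for w in tokenize(text)]
```

Here `len(words)` corresponds to the number of running words and `average_word_length(words)` to the average word length reported above.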

Then I compared my corpus with a standardized one. My corpus contained a larger proportion of short words than the standardized one. I observed the highest frequencies for two- and three-letter words (compared to six-letter words in the standardized corpus). Other statistics are as follows:

tokens: 1,629,823

types: 140,991

type/token ratio: 8.65%
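
For readers less familiar with these measures: tokens are the running words and types are the distinct word forms, so the ratio of 140,991 to 1,629,823 is about 8.65%. The sketch below shows one way to compute the three figures. Lowercasing before counting is an assumption of the example, and the function name is hypothetical.

```python
from collections import Counter


def type_token_stats(words):
    """Return (types, tokens, type/token ratio in percent)."""
    counts = Counter(w.lower() for w in words)  # assumption: case-insensitive types
    tokens = sum(counts.values())
    types = len(counts)
    ratio = 100.0 * types / tokens if tokens else 0.0
    return types, tokens, ratio


sample = ["hej", "Hej", "co", "tam", "co"]
types, tokens, ratio = type_token_stats(sample)

# The ratio implied by the two corpus figures quoted above.
corpus_ratio = round(100.0 * 140991 / 1629823, 2)
```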

I also analyzed key words (those with unusually high frequencies compared to the standardized corpus). Those included:

a) forms without conventional meanings (ornaments or emoticons)

b) forms spelled without diacritic marks (Polish words)

c) users' nicknames

d) foreign words

e) short forms

f) onomatopoeic words
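
Key words of this kind are commonly identified with a keyness statistic. The note above does not say which statistic was used, so the sketch below swaps in the log-likelihood (G2) measure widely used in corpus linguistics, purely as an illustration. The function names, the toy frequency lists, and the threshold of 3.84 (the chi-square critical value for p < 0.05 with one degree of freedom) are assumptions of the example.

```python
import math


def log_likelihood(freq_study, freq_ref, size_study, size_ref):
    """G2 keyness of a word given its frequency in two corpora."""
    total = size_study + size_ref
    combined = freq_study + freq_ref
    expected_study = size_study * combined / total
    expected_ref = size_ref * combined / total
    g2 = 0.0
    if freq_study > 0:
        g2 += freq_study * math.log(freq_study / expected_study)
    if freq_ref > 0:
        g2 += freq_ref * math.log(freq_ref / expected_ref)
    return 2.0 * g2


def keywords(study_counts, ref_counts, threshold=3.84):
    """Words overused in the study corpus, sorted by descending keyness."""
    size_study = sum(study_counts.values())
    size_ref = sum(ref_counts.values())
    if size_study == 0 or size_ref == 0:
        raise ValueError("both frequency lists must be non-empty")
    result = []
    for word, freq in study_counts.items():
        ref_freq = ref_counts.get(word, 0)
        # Keep only words that are relatively more frequent in the study corpus.
        if freq / size_study <= ref_freq / size_ref:
            continue
        score = log_likelihood(freq, ref_freq, size_study, size_ref)
        if score >= threshold:
            result.append((word, score))
    return sorted(result, key=lambda item: item[1], reverse=True)


# Toy frequency lists, invented for the example.
chat = {"hehe": 50, "i": 50}
reference = {"i": 100, "dom": 100}
key = keywords(chat, reference)
```

In this toy example only "hehe" is returned, because "i" has the same relative frequency in both lists.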

My research focused on three issues. First, I analyzed the spelling (especially nonstandard spelling) of words and the punctuation used in the chat room. Secondly, I studied in detail the lexical elements used in speech acts in Internet-based chat rooms. The research covered words used in greetings, farewells, thanks, and apologies, as well as vulgarisms. I also analyzed foreign words, short forms, and the users' nicknames. Thirdly, I investigated whether chat room communication is a hybrid of spoken and written communication, which led me to characterize the genre of Internet chat with reference to the characteristics of speech and writing.

I made the following observations (in short). First, nonstandard writing is purposeful. Some of the conventions used help the users render spoken communication in writing. There are also word forms in which the users' haste is visible (spelling without diacritic marks, misspellings), since for the users it is more important to communicate quickly and expressively than to pay attention to correct spelling. Unconventional spelling is also a way of expressing group membership. Furthermore, strong attempts at informality may be observed in Internet chat communication. These are manifested in, for example, shortened word forms, wordplay, or the interspersing of foreign words. Despite a considerable level of informality, this type of communication is relatively polite, with little use of vulgarity. In addition, the chatters include certain information about themselves in their nicknames, such as the person's age, origin, or type of Internet connection. I also established that chat room communication is not a hybrid of speech and writing. I describe Internet chat as a written communication channel which attempts to signal its informality. To the best of my knowledge, this research is the first empirical study of Polish Internet chats.




Should anyone be interested in the research, please do not hesitate to contact me directly by email at l_sz at poczta.fm. My material and dissertation are (unfortunately) available only in Polish.

Regards, 


Leszek