[Corpora-List] RES: Corpora Digest, Vol 12, Issue 2
José Lopes Moreira Filho
jlopes at corpuslg.org
Mon Jun 2 19:22:22 UTC 2008
Hi
I am a Corpus Linguistics student/researcher. Experience with: Shell, PHP,
ASP, VB6, VB.NET, and learning C#. I also have some applications built with
VB.net:
http://www.corpuslg.org/software/temp/kwicg1.1.exe
http://www.corpuslg.org/software/developer/?postid=17
I think you don't need to clean text to take word frequency with C# or
VB.NET.
The 'Split' or 'Replace' process are slow compared to what you can do using
'StringBuilder', 'RegularExpressions', and 'Dictionaries'.
Take a look at the simple vb.net code below. It 'cleans' (actually extracts)
text without using 'replace' or 'Split' functions.
The same approach is possible for a C# function.
VB.NET CODE:
'+-------------------------------------------------------------------------
Imports System.Text
Imports System.Text.RegularExpressions
Imports System.Collections.Generic
'+-------------------------------------------------------------------------
'--------------------------------------------------------------------------
Private function Get_Words (strText as string ) as String
Dim strbText As New StringBuilder
Dim strbResult As New StringBuilder
Dim WordIndex As MatchCollection
Dim regexpWord As String = "\b\w+-\w+\b|\b\w+\'\w+\b|\b\w+\'\b|\b\w+\b"
Dim i As Integer
strbText.Append (strText)
WordIndex = Regex.Matches(strbText.ToString, regexpWord)
For i = 0 To WordIndex.Count - 1
strbresult.AppendLine (wordindex(i).Value )
Next
Get_Words = strbresult.ToString
strbresult = nothing
strbText =Nothing
WordIndex = Nothing
regexpword = Nothing
i = Nothing
End function
'---------------------------------------------------------------------------
In order to count words, you can use a 'Dictionary(Of String, Long)' object.
Instead of passing the words to a Stringbuilder object (as in the code
above), you pass the extracted words to a Dictionary of String, Long.
I hope this can shed some light on the issue.
Regards,
José.
-----Mensagem original-----
De: corpora-request at uib.no [mailto:corpora-request at uib.no]
Enviada em: segunda-feira, 2 de junho de 2008 10:00
Para: corpora at uib.no
Assunto: Corpora Digest, Vol 12, Issue 2
Today's Topics:
1. Cleaning text to take word frequency (Trevor Jenkins)
2. Cleaning text to take word frequency (Alexandre Rafalovitch)
3. Workshop CFP (Corrected submission date):Psychocomputational
Models of Human Language Acquisition (PsychoCompLA-2008)
(pcomp_AT_hunter.cuny.edu)
4. PALC 2009 First Call (Stanislaw Roszkowski)
----------------------------------------------------------------------
Message: 1
Date: Sun, 1 Jun 2008 12:57:13 +0000 (GMT)
From: Trevor Jenkins <trevor.jenkins_AT_suneidesis.com>
Subject: [Corpora-List] Cleaning text to take word frequency
To: Corpora list <corpora_AT_uib.no>
On Sun, 1 Jun 2008, True Friend <true.friend2004_AT_gmail.com> wrote:
> ... version in C# of a Perl script a respected subscriber of this list
> (Alexander Schutz) ... now I am trying to programm myself so I tried to
> implement that idea in C#. I have done that all and it works also but it
> does not give me 100% frequency of the word as the Perl script does.
That is possible. In fact doesn't surprise me at all.
> ... The resulting string array was cleaned from such characters but I
> couldn't get the 100% result. The frequency of most words are less than
> that of Perl script (which does the same thing). ...
I'm neither a perl wizard or a C# tune-smith (I still use Snobol4) but I'd
suspect a major difference in the way the two language process text. For
my money I'd believe perl is giving you a more accurate result because the
language itself was designed to process text. I'd further believer that C#
(as Microsoft's attempt to have their own Java) doesn't deal with
character and/or textual data in the same way. What perl accepts as text
C# may well be ditching. You may be right by citing the System.split()
function; check very carefully what that function is intended to do and
then compare it with how the similar, but not necessarily identical,
function in perl works. Assume absolutely nothing about the functionality
of either language or of functions with the same name. If in doubt blame
C# for the discrepancy.
Regards, Trevor
<>< Re: deemed!
------------------------------
Message: 2
Date: Sun, 1 Jun 2008 09:30:03 -0400
From: "Alexandre Rafalovitch" <arafalov_AT_gmail.com>
Subject: [Corpora-List] Cleaning text to take word frequency
To: "True Friend" <true.friend2004_AT_gmail.com>
Cc: corpora_AT_uib.no
The way I would have approached this is by finding which words
generate count discrepancies and also exist in one, but not another
version of the result. Then, I would look for those words in the text
and see what context they are in.
What I suspect you will find is that your partial reimplementation of
perl's :punct: class is causing problems. I would either do a complete
reimplementation of that (see:
http://en.wikipedia.org/wiki/Regular_expression ) or look into C#'s
regular expressions, which I am sure will contain the same definition
of the :punct: class.
Finally, if you are working with languages other than English, you
most certainly should look into regular expression libraries. They
take into account Unicode's rules as well, something you really don't
want to have to duplicate in your own code.
Regards,
Alex.
--
Personal blog: http://blog.outerthoughts.com/
Research group: http://www.clt.mq.edu.au/Research/
On Sun, Jun 1, 2008 at 7:07 AM, True Friend <true.friend2004_AT_gmail.com>
wrote:
>
> HI
> I am a corpus linguistics student and learning C# for this purpose as
well.
> I've created a simple application to find the frequency of a given word in
> two files. Actually this simple application is a practice version in C# of
a
> Perl script a respected subscriber of this list (Alexander Schutz) written
> for me on my request on this list. I needed it then, now I am trying to
> programm myself so I tried to implement that idea in C#.
------------------------------
Message: 3
Date: Sun, 1 Jun 2008 21:31:02 -0400 (EDT)
From: pcomp_AT_hunter.cuny.edu
Subject: [Corpora-List] Workshop CFP (Corrected submission
date):Psychocomputational Models of Human Language Acquisition
(PsychoCompLA-2008)
To: Corpora_AT_uib.no
Apologies for multiple postings.
************************ 2nd Call for Abstracts ****************************
Psychocomputational Models of Human Language Acquisition (PsychoCompLA-2008)
July 23rd at CogSci 2008 - Washington, D.C.
Submission Deadline: June 15, 2008
http://www.colag.cs.hunter.cuny.edu/psychocomp/
Workshop Topic:
The workshop is devoted to psychologically-motivated computational models of
language acquisition. That is, models that are compatible with research in
psycholinguistics, developmental psychology and linguistics.
Invited Speakers:
* Rens Bod, Institute for Logic, Language and Computation, University of
Amsterdam, Netherlands
* Damir Cavar, University of Indiana, USA and Zadar University, Croatia
* Gary Marcus, New York University, USA
* Jeffery Lidz, University of Maryland, USA
* Gary Marcus, New York University, USA
* Josh Tenenbaum, Massachusetts Institute of Technology, USA
Workshop History:
This is the fourth meeting of the Psychocomputational Models of Human
Language Acquisition workshop following PsychoCompLA-2004, held in Geneva,
Switzerland as part of the 20th International Conference on Computational
Linguistics (COLING-2004), PsychoCompLA-2005 as part of the 43rd Annual
Meeting of the Association for Computational Linguistics (ACL-2005) held in
Ann Arbor, Michigan where the workshop shared a joint session with the Ninth
Conference on Computational Natural Language Learning (CoNLL-2005), and
PsychoCompLA-2007 held in Nashville, Tennessee as part of the 29th meeting
of the Cognitive Science Society (CogSci-2007).
Workshop Description:
The workshop will present research and foster discussion centered around
psychologically-motivated computational models of language acquisition, with
an emphasis on the acquisition of syntax. In recent decades there has been a
thriving research agenda that applies computational learning techniques to
emerging natural language technologies and many meetings, conferences and
workshops in which to present such research. However, there have been only a
few (but growing number of) venues in which psychocomputational models of
how humans acquire their native language(s) are the primary focus.
Psychocomputational models of language acquisition are of particular
interest in light of recent results in developmental psychology that suggest
that very young infants are adept at detecting statistical patterns in an
audible input stream. Though, how children might plausibly apply statistical
'machinery' to the task of grammar acquisition, with or without an innate
language component, remains an open and important question. One effective
line of investigation is to computationally model the acquisition process
and determine interrelationships between a model and linguistic or
psycholinguistic theory, and/or correlations between a model's performance
and data from linguistic environments that children are exposed to.
Special Theme:
Although the workshop program speaks to many facets of psychocomputational
language acquisition modeling, the theme of the workshop this year is:
* Computational resources: How much is just right, and does it matter?
The computational resources (e.g., number of calculations per input datum,
size of memory store, etc.) employed by current psychocomputational modeling
efforts vary tremendously from model to model. However, two important
questions have rarely been addressed. How well do a particular acquisition
model's resources parallel the resources employed by a human language
learner? And, how relevant (or not) is it to establish such a relationship?
Topics and Goals:
Abstracts that present research on (but not necessarily limited to) the
following topics are welcome:
* Models that address the acquisition of word-order;
* Models that combine parsing and learning;
* Formal learning-theoretic and grammar induction models that incorporate
psychologically plausible constraints;
* Comparative surveys that critique previously reported
studies;
* Models that have a cross-linguistic or bilingual perspective;
* Models that address learning bias in terms of innate
linguistic knowledge versus statistical regularity in the
input;
* Models that employ language modeling techniques from corpus linguistics;
* Models that employ techniques from machine learning;
* Models of language change and its effect on language
acquisition or vice versa;
* Models that employ statistical/probabilistic grammars;
* Computational models that can be used to evaluate existing linguistic or
developmental theories (e.g., principles & parameters, optimality theory,
construction grammar, etc.)
* Empirical models that make use of child-directed corpora such as CHILDES.
This workshop intends to bring together researchers from cognitive
psychology, computational linguistics, other computer/mathematical sciences,
linguistics and psycholinguistics working on all areas of language
acquisition. Diversity and cross-fertilization of ideas is the central goal.
Workshop Organizer:
William Gregory Sakas, City University of New York
(sakas at hunter.cuny.edu)
Workshop Co-organizer:
David Guy Brizan, City University of New York
(dbrizan at gc.cuny.edu)
Program Committee:
Rens Bod, Institute for Logic, Language and Computation, University of
Amsterdam, Netherlands
David Guy Brizan, City University of New York, USA
Damir Cavar, University of Indiana, USA and Zadar University, Croatia
Gary Marcus, New York University
Nick Chater, University of College London, UK
Alex Clark, Royal Holloway University of London, UK
Rick Dale, University of Memphis, USA
Jeffery Lidz, University of Maryland, USA
Gary Marcus, New York University, USA
Lisa Pearl, University of California, Irvine, USA
William Gregory Sakas, City University of New York, USA
Josh Tenenbaum, Massachusetts Institute of Technology, USA
Charles D. Yang, University of Pennsylvania, USA
Submission details:
Authors are invited to submit abstracts of 1 page plus 1 page for data and
other supplementary materials. Abstracts should be anonymous, clearly titled
and no more than 500 words in length. Text of the abstract should fit on one
page, with a second page for examples, table, figures, references, etc. The
following formats are accepted: PDF, PS, and MS Word. Please include a cover
sheet (as a separate attachment) containing the title of your submission,
your name, contact details and affiliation. Send your submission
electronically to
Email: Psycho.Comp_AT_hunter.cuny.edu.
with "PsychoCompLA-2008 Submission" somewhere in the subject line.
Publication:
The accepted abstracts will appear in the online workshop proceedings. Full
papers of accepted abstracts will be considered in Fall 2008 for inclusion
in an issue of the new Cognitive Science Society Journal - topiCS - whose
focus will be psychocomputational modeling of human language acquisition.
Submission deadline: June 15, 2008
Contact: Psycho.Comp_AT_hunter.cuny.edu
with "PsychoCompLA-2008" somewhere in the subject line.
------------------------------
Message: 4
Date: Mon, 2 Jun 2008 10:38:53 +0200 (CEST)
From: "Stanislaw Roszkowski" <roszkowski_AT_uni.lodz.pl>
Subject: [Corpora-List] PALC 2009 First Call
To: Corpora_AT_uib.no
PALC 2009 FIRST CALL FOR PAPERS
The Department of English Language and Applied Linguistics is proud to
announce that the 7th international conference on PRACTICAL APPLICATIONS
IN LANGUAGE AND COMPUTERS (PALC 2009) will be held over 3 days, 6 to 8
April 2009 (arrival day 5 April) at the Lodz University Conference Centre
in Lodz, Poland.
For over a decade, the PALC conferences have served the international
community of corpus linguists by providing a useful forum for the exchange
of views and ideas on how corpora and computational tools can be
effectively employed to explore and advance our understanding of language.
The topics of the conference include, but are not limited to, the following:
Contrastive Studies and Language Corpora
Discourse and Language Corpora
ESP and Language Corpora
Expert, Retrieval and Analytical Systems
FLA/SLA and Language Corpora
Language Teaching Materials and Language (Learner) Corpora/ICT
Virtual Learning Environments
E-testing
Large (multilingual/multimodal) Corpora
Lexicography and Language Corpora
Cognition, Computers and Language
Computer Translation Tools
Machine Translation, Machine-aided Translation, Translation and Corpora
E-books and Corpora and Literature
Workshop sessions:
Combinatorics (patterning) in specialized discourses (organized by
Stanis?aw Go?d?-Roszkowski
E-learning (organized by Jacek Wali?ski & Przemek Krakowian)
Exploring National Corpora (organized by Piotr P?zik and ?ukasz Dró?d?)
Official language of the conference will be English.
PLENARY SPEAKERS
The following scholars have accepted our invitation to address the
conference as plenary speakers:
Mark Davies, Brigham Young University, Provo, USA
Ken Hyland, University of London, UK
Ramesh Krishnamurthy, Aston University, Birmingham, UK
Margaret Rogers, University of Surrey, Guildford, UK
Terttu Nevalainen, University of Helsinki, Finland
ABSTRACTS
Abstracts of papers should be up to 500 words long and forwarded (by
e-mail or fax) to the organisers (see below). Deadline for submission is
31 December 2008. Presentations should last 30 minutes including
demonstrations, questions and discussion.
PROCEEDINGS
A selection of conference papers will be published with the Peter Lang as
part of the ?ód? Studies in Language series. Deadline for the submission
of papers is 1 July 2009.
COST
The cost of conference registration is 200 euros (170 euros on or before
15 February). This includes a conference pack, coffee breaks and
participation in sessions.
Participants can book subsidised accommodation at the Lodz University
Conference Centre complex, where the conference will be held
(http://www.csk.uni.lodz.pl/).
A conference package is available (including registration fee,
accommodation [3 nights], full board and conference dinner) at 470 euros.
Please, note that all accommodation bookings at the conference centre are
handled by the organizers and the conference centre will not accept
reservations coming directly from participants.
Alternative accommodation options can be found at http://www.hotel.lodz.pl/
We have got a limited number of bursaries for colleagues from low-GNP
countries (including Poland) and full-time students. Please apply by
emailing us at palc_AT_uni.lodz.pl.
PAYMENT
Payment should be by bank transfer:
BANK: PKO S.A. II O/Lodz
ACCOUNT No:14124030281111001004347782
IBAN: PL14124030281111001004347782
SWFT: PKO PPL PW
Alternatively, cash payment can be made on arrival. This will incur an
additional charge of 25 euros. We regret that neither cheques nor credit
cards can be handled.
IMPORTANT DATES
Abstracts due: 31 December 2008
Notification of acceptance 31 January 2009
Early bird registration ends 15 February, 2009
Submission of conference papers 1 July 2009
ORGANISING COMMITTEE
Prof. Barbara Lewandowska-Tomaszczyk
Dr Stanis?aw Go?d?-Roszkowski
Dr Przemys?aw Krakowian
Dr Krzysztof Kredens
Dr Jacek Wali?ski
Mr ?ukasz Dró?d?
Ms Anna Kami?ska
Mr Piotr P?zik
Department of English Language and Applied Linguistics
University of Lodz
Email: palc_AT_uni.lodz.pl; roszkowski_AT_uni.lodz.pl
http://palc.ia.uni.lodz.pl
phone 48 42 6655220
fax 48 42 6655221
--
Dr Stanislaw Gozdz-Roszkowski
Assistant Professor
Department of English and Applied Linguistics
University of Lodz
Kosciuszki 65
90-514, Lodz, Poland
tel. 48 42 6655220
fax 48 42 6655221
cell phone 48 603 965 867
http://www.filolog.uni.lodz.pl/kja/Roszkowski.htm
http://www.linkedin.com/in/stanislawgozdzroszkowski
----------------------------------------------------------------------
Send Corpora mailing list submissions to
corpora at uib.no
To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.uib.no/listinfo/corpora
or, via email, send a message with subject or body 'help' to
corpora-request at uib.no
You can reach the person managing the list at
corpora-owner at uib.no
When replying, please edit your Subject line so it is more specific
than "Re: Contents of Corpora digest..."
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora
End of Corpora Digest, Vol 12, Issue 2
**************************************
--
No virus found in this incoming message.
Checked by AVG.
Version: 7.5.524 / Virus Database: 269.24.4/1478 - Release Date: 2/6/2008
07:12
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora
More information about the Corpora
mailing list