[Corpora-List] Methodology for capturing corpus from paper to computer
rita.calabrese at libero.it
Sat Jul 16 09:05:02 UTC 2011
Dear Sammy Danso,
>the quickest way to build an electronic corpus from printed materials is to
>use an OCR (Optical Character Recognition) application. You can download such
>software free of charge from the following website:
>
>http://softi-freeocr.softonic.it/download
>
>Best wishes
>
>Rita Calabrese
>University of Salerno
>via Ponte don Melillo
>84084 Fisciano (SA)
>ITALY
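
OCR output usually needs some tidying before it can serve as corpus text. The following is a minimal Python sketch of that post-processing step, assuming the OCR program has already produced plain text; the function name `clean_ocr_text` and the cleaning rules are illustrative assumptions, not part of any particular OCR package.

```python
import re


def clean_ocr_text(raw: str) -> str:
    """Tidy plain-text OCR output (hypothetical helper, stdlib only)."""
    # Rejoin words that were hyphenated across a line break: "cor-\npus" -> "corpus".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Treat one or more blank lines as a paragraph break.
    paragraphs = re.split(r"\n\s*\n", text)
    # Inside each paragraph, unwrap line breaks and collapse runs of whitespace.
    cleaned = [" ".join(p.split()) for p in paragraphs]
    return "\n\n".join(p for p in cleaned if p)


if __name__ == "__main__":
    sample = "An elec-\ntronic  corpus\nfrom print.\n\n\nSecond   page."
    print(clean_ocr_text(sample))
```

Note that the hyphen rule is deliberately crude: a genuine compound that happens to break at a line end (for example "well-" followed by "known") would lose its hyphen, so the result should still be proofread against the printed source.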
>----Original message----
>From: corpora-request at uib.no
>Date: 16/07/2011 10.49
>To: <corpora at uib.no>
>Subject: Corpora Digest, Vol 49, Issue 19
>
>Today's Topics:
>
> 1. Linguamática V3N2 - CFP (Alberto Simões)
> 2. Reminder - CL2011 Workshop: Dealing with spelling variation
> in historical corpora - Call for participation (Alistair Baron)
> 3. Re: Pashto (was: Which Statistical Test is Suitable)
> (fatima zuhra)
> 4. Re: Which Statistical Test is Suitable (fatima zuhra)
> 5. Re: Which Statistical Test is Suitable (True Friend)
> 6. R: Corpora Digest, Vol 49, Issue 16 (rita.calabrese at libero.it)
>
>
>----------------------------------------------------------------------
>
>Message: 1
>Date: Fri, 15 Jul 2011 11:43:54 +0100
>From: Alberto Simões <albie at alfarrabio.di.uminho.pt>
>Subject: [Corpora-List] Linguamática V3N2 - CFP
>To: undisclosed-recipients:;
>
>----------------------------------------------------------------------------
> Linguamática
> ISSN: 1647-0818
> http://www.linguamatica.com/
>----------------------------------------------------------------------------
> (español, português, galego, català, euskara, english)
>
>
> PETICIÓN DE ARTÍCULOS [en español]
>
> Linguamática, Revista para el Procesamiento Automático de las
> Lenguas Ibéricas (ISSN 1647-0818), está abierta a la recepción de
> artículos para el volumen 3, número 2 de la revista.
>
> Los artículos serán publicados en forma electrónica y puestos a
> disposición de la comunidad científica con licencia Creative
> Commons.
>
> Temas de interés:
> * Morfología, sintaxis y semántica computacional
> * Traducción automática y herramientas de ayuda a la traducción
> * Terminología y lexicografía computacional
> * Síntesis y reconocimiento del habla
> * Extracción de información
> * Respuesta automática a preguntas
> * Lingüística de corpus
> * Bibliotecas digitales
> * Evaluación de sistemas de procesamiento del lenguaje natural
> * Herramientas y recursos públicos o cooperativos
> * Servicios lingüísticos en la red
> * Ontologías y representación del conocimiento
> * Métodos estadísticos aplicados a la lengua
> * Herramientas de apoyo para la enseñanza de lenguas
>
> Envío de artículos
>
> Los artículos tienen que enviarse en PDF mediante el sistema
> electrónico de la revista (http://www.linguamatica.com/). Aunque
> el número de páginas de los artículos sea flexible, se sugiere
> que no excedan las 20 páginas. Los artículos tienen que
> identificarse debidamente. Del mismo modo, los comentarios de los
> miembros del comité científico serán debidamente firmados.
>
> Los artículos deberán ser escritos en portugués, gallego,
> castellano, vasco o catalán, o bien en inglés. Con todo, se
> invita a los autores a presentar sus contribuciones en una de las
> lenguas de la Península Ibérica siempre que sea posible. Sólo se
> publicarán artículos en inglés cuando ninguno de los autores
> tenga competencia lingüística en alguna de las lenguas preferidas
> de la revista (es decir, en portugués, gallego, castellano, vasco
> o catalán) y siempre que los editores consideren que el artículo
> sea relevante para la revista.
>
> Los artículos tienen que seguir el formato de la revista. Existen
> modelos LaTeX, Microsoft Word y OpenOffice.org en la página de
> Linguamática (http://www.linguamatica.com/).
>
> Fechas importantes
> * Envío de artículos hasta: 31 de octubre de 2011
> * Resultados de la selección: 30 de noviembre de 2011
> * Versión final: 15 de diciembre de 2011
> * Publicación de la revista: diciembre de 2011
>
> La información sobre los Editores y el Comité Científico de
> Linguamática se encuentra al final de este texto.
>
> Contacto
> Para cualquier cuestión, puede dirigirse a: editores at linguamatica.com
>
>
>----------------------------------------------------------------------------
>
> CHAMADA DE ARTIGOS [em português]
>
> A Linguamática, Revista para o Processamento Automático das
> Línguas Ibéricas (ISSN 1647-0818), está aberta à recepção de
> artigos para o volume 3, número 2 da revista.
>
> Os artigos serão publicados electronicamente e colocados à
> disposição da comunidade científica com licença Creative Commons.
>
> Temas de interesse:
> * Morfologia, sintaxe e semântica computacional
> * Tradução automática e ferramentas de ajuda à tradução
> * Terminologia e lexicografia computacional
> * Síntese e reconhecimento da fala
> * Extracção/recolha de informação
> * Resposta automática a perguntas
> * Linguística de corpus
> * Bibliotecas digitais
> * Avaliação de sistemas de processamento de linguagem natural
> * Ferramentas e recursos públicos ou cooperativos
> * Serviços linguísticos na rede
> * Ontologias e representação do conhecimento
> * Métodos estatísticos aplicados à língua
> * Ferramentas de apoio ao ensino de línguas
>
> Envio de artigos
>
> Os artigos devem ser enviados em PDF utilizando o sistema
> electrónico da revista (http://www.linguamatica.com/). Embora o
> número de páginas dos artigos seja flexível, sugere-se que não
> excedam as 20 páginas. Os artigos devem ser devidamente
> identificados. Do mesmo modo, os comentários dos membros do
> comité científico serão devidamente assinados.
>
> Os artigos deverão ser escritos em português, galego, castelhano,
> catalão, basco ou inglês. Contudo, convidam-se os autores a
> apresentar as suas contribuições numa das línguas da Península
> Ibérica sempre que tal seja possível. Só serão publicados
> artigos em inglês quando nenhum dos autores tiver competência
> linguística numa das línguas preferidas da revista (ou seja,
> português, galego, castelhano, basco ou catalão) e sempre que os
> editores considerem o artigo relevante para ser publicado na
> revista.
>
> Os artigos têm de seguir o formato da revista. Existem modelos
> LaTeX, Microsoft Word e OpenOffice.org na página da Linguamática
> (http://www.linguamatica.com/).
>
> Datas importantes
> * Envio de artigos até: 31 de outubro de 2011
> * Resultados da selecção até: 30 de novembro de 2011
> * Versão final até: 15 de dezembro de 2011
> * Publicação da revista: dezembro de 2011
>
> A informação sobre os Editores e a Comissão Científica da
> Linguamática encontra-se no final deste texto.
>
> Contacto
> Para qualquer questão deve dirigir-se a: editores at linguamatica.com
>
>----------------------------------------------------------------------------
>
>
> PETICIÓN DE ARTIGOS [en galego]
>
> Linguamática, Revista para o Procesamento Automático das Linguas
> Ibéricas (ISSN 1647-0818), está aberta á recepción de artigos para
> o volume 3, número 2 da revista.
>
> Os artigos serán publicados de forma electrónica e postos ao
> dispor da comunidade científica con licenza Creative Commons.
>
> Temas de interese:
> * Morfoloxía, sintaxe e semántica computacional
> * Tradución automática e ferramentas de axuda á tradución
> * Terminoloxía e lexicografía computacional
> * Síntese e recoñecemento de fala
> * Extracción de información
> * Resposta automática a preguntas
> * Lingüística de corpus
> * Bibliotecas dixitais
> * Avaliación de sistemas de procesamento de linguaxe natural
> * Ferramentas e recursos públicos ou cooperativos
> * Servizos lingüísticos na rede
> * Ontoloxías e representación do coñecemento
> * Métodos estatísticos aplicados á lingua
> * Ferramentas de apoio ao ensino das linguas
>
> Envío de Artigos
>
> Os artigos deben de enviarse en PDF mediante o sistema
> electrónico da revista (http://www.linguamatica.com/). Aínda que
> o número de páxinas dos artigos sexa flexíbel, suxírese que non
> excedan das 20 páxinas. Os artigos teñen que identificarse
> debidamente. Do mesmo modo, os comentarios dos membros do comité
> científico serán debidamente asinados.
>
> Os artigos deberán ser escritos en portugués, galego, castelán,
> éuscaro ou catalán, ou ben en inglés. Con todo, convídase aos
> autores a presentar as súas contribucións nunha das linguas da
> Península Ibérica sempre que sexa posíbel. Só se publicarán
> artigos en inglés cando ningún dos seus autores teña competencia
> lingüística nalgunha das linguas preferidas da revista (isto é,
> en portugués, galego, castelán, éuscaro ou catalán) e sempre que
> os editores consideren que o artigo é relevante para a revista.
>
> Os artigos teñen que seguir o formato da revista. Existen modelos
> LaTeX, Microsoft Word e OpenOffice.org na páxina de Linguamática
> (http://www.linguamatica.com/).
>
> Datas importantes
> * Envío de artigos até: 31 de outubro de 2011
> * Resultados da selección: 30 de novembro de 2011
> * Versión final: 15 de decembro de 2011
> * Publicación da revista: decembro de 2011
>
> A información sobre os Editores e o Comité Científico de
> Linguamática atópase ao final deste texto.
>
> Contacto
> Para calquera cuestión, pode dirixirse a: editores at linguamatica.com
>
>----------------------------------------------------------------------------
>
> PETICIÓ D'ARTICLES [en català]
>
> Linguamática, Revista per al Processament Automàtic de les
> Llengües Ibèriques (ISSN 1647-0818), està oberta a la recepció
> d'articles per al volum 3, número 2 de la revista.
>
> Els articles seran publicats en forma electrònica i posats a
> disposició de la comunitat científica amb llicència Creative
> Commons.
>
> Temes d'interès:
> * Morfologia, sintaxi i semàntica computacional
> * Traducció automàtica i eines d'ajuda a la traducció
> * Terminologia i lexicografia computacional
> * Síntesi i reconeixement de parla
> * Extracció d'informació
> * Resposta automàtica a preguntes
> * Lingüística de corpus
> * Biblioteques digitals
> * Avaluació de sistemes de processament del llenguatge natural
> * Eines i recursos lingüístics públics o cooperatius
> * Serveis lingüístics en xarxa
> * Ontologies i representació del coneixement
> * Mètodes estadístics aplicats a la llengua
> * Eines d'ajut per a l'ensenyament de llengües
>
> Enviament d'articles
>
> Els articles s'han d'enviar en PDF mitjançant el sistema
> electrònic de la revista (http://www.linguamatica.com/). Tot i
> que el nombre de pàgines dels articles sigui flexible es
> suggereix que no ultrapassin les 20 pàgines. Els articles s'han
> d'identificar degudament. Igualment, els comentaris dels membres
> del comitè científic seran degudament signats.
>
> Els articles han de ser escrits en portuguès, gallec, castellà,
> basc o català, o bé en anglès. Tot i així, es convida els autors
> a presentar les seves contribucions en una de les llengües de la
> Península Ibèrica sempre que això sigui possible. Només es
> publicaran els articles en anglès quan cap dels seus autors
> tingui competència lingüística en alguna de les llengües
> preferides de la revista (és a dir, en portuguès, gallec, basc,
> castellà o català) i sempre que els editors considerin que
> l'article és rellevant per a la revista.
>
> Els articles han de seguir el format de la revista. Es poden
> trobar models LaTeX, Microsoft Word i OpenOffice.org a la pàgina
> de Linguamática (http://www.linguamatica.com/).
>
> Dades importants
> * Enviament d'articles fins a: 31 d'octubre de 2011
> * Resultats de la selecció: 30 de novembre de 2011
> * Versió final: 15 de desembre de 2011
> * Publicació de la revista: desembre de 2011
>
> La informació sobre els Editors i el Comitè Científic de
> Linguamática es troba al final d'aquest text.
>
> Contacte
> Per a qualsevol qüestió, pot adreçar-se a: editores at linguamatica.com
>
>----------------------------------------------------------------------------
>
> ARTIKULU ESKAERA [euskaraz]
>
> Iberiar penintsulako hizkuntzei dagokienean, hizkuntza naturalen
> prozedura komunitatean dagoen hutsunea betetzea litzateke
> Linguamática izeneko aldizkariaren helburu nagusiena. Helburu
> nagusi hau buru, aurretik aipaturiko edozein hizkuntzen prozedura
> landuko duten artikuluak argitaratuko dira.
>
> Linguamática aldizkaria irekia da oso. Artikuluak elektronikoki
> argitaratuko dira, eta komunitate zientifikoaren eskura egongo
> dira honako lizentziarekin; Creative Commons.
>
> Gai interesgarriak:
> * Morfologia, sintaxia eta semantika konputazionala.
> * Itzulpen automatikoa eta itzulpengintzarako lagungarriak diren tresnak.
> * Terminologia eta lexikologia konputazionala.
> * Mintzamenaren sintesia eta ikuskapena.
> * Informazio ateratzea.
> * Galderen erantzun automatikoa.
> * Corpus-aren linguistika.
> * Liburutegi digitalak.
> * Hizkuntza naturalaren prozedura sistemaren ebaluaketa.
> * Tresna eta baliabide publikoak edo kooperatiboak.
> * Zerbitzu linguistikoak sarean.
> * Ezagutzaren ontologia eta adierazpideak.
> * Hizkuntzean oinarrituriko metodo estatistikoak.
> * Hizkuntzen irakaskuntzarako laguntza tresnak.
>
> Artikuluak PDF formatoan eta aldizkariaren sistema elektronikoaren
> bidez bidali behar dira. Orri kopurua malgua den arren, 20 orri
> baino gehiago ez idaztea komeni da. Artikuluak behar bezala
> identifikatu behar dira. Era berean, zientzi batzordeko kideen
> iruzkinak ere sinaturik egon beharko dira.
>
> Artikulua idazterako garaian, erabilitako hizkuntzari dagokionean,
> honako hizkuntza hauek erabili daitezke; portugesa, galiziera,
> gaztelania, euskara, eta katalana.
>
> Artikuluek, aldizkariaren formato grafikoa jarraitu behar
> dute. ``Linguamática'' orrian LaTeX, Microsoft Word eta
> OpenOffice.org ereduak aurki ditzakegu.
>
> Data garrantzitsuak
> * Artikuluak bidali ahal izateko epea: 2011ko urriaren 31.
> * Hautapen-prozesuaren jakinarazpena: 2011ko azaroaren 30a.
> * Azken bertsioaren bidalketa: 2011ko abenduaren 15a.
> * Argitarapena aldizkarian: 2011ko abendua.
>
> Edozein zalantza argitzeko, hona hemen helbide hau:
> editores at linguamatica.com.
>
>----------------------------------------------------------------------------
>
> CALL FOR PAPERS [in English]
>
> Linguamática, Journal of Automatic Processing of Iberian Languages
> (ISSN 1647-0818), is open for reception of articles for the third
> volume, second issue.
>
> Papers will be published in electronic form and freely available
> online under a Creative Commons Attribution License.
>
> Topics of interest:
> * Computational morphology, syntax and semantics
> * Machine translation and computer-assisted translation
> * Computational terminology and lexicography
> * Speech analysis and synthesis
> * Information extraction
> * Question answering systems
> * Corpus linguistics
> * Digital libraries
> * Evaluation of natural language processing systems
> * Public or cooperative linguistic tools and resources
> * Linguistic services on the Internet
> * Ontologies and knowledge representation
> * Statistical methods in natural language processing
> * Computer-assisted language learning
>
> Notes for contributors
>
> Authors should send the originals in electronic format as a PDF
> file through the Linguamática site (http://www.linguamatica.com/).
> Submissions should not exceed 20 pages and must include author
> identification. Equally, reviewers will sign their comments.
>
> Submissions should be written in one of the main languages of the
> Iberian Peninsula (Portuguese, Galician, Spanish, Basque or
> Catalan), or in English. Authors able to write in one of the
> Iberian languages are encouraged to do so. Articles written in
> English will only be published in the case that none of the
> authors is competent in any of the Journal's preferred languages
> (Portuguese, Galician, Spanish, Basque and Catalan) and provided
> that the editors consider the article to be relevant to the
> Journal.
>
> Make sure the submitted file follows the formatting rules of the
> Journal. Check the LaTeX, Microsoft Word or OpenOffice templates
> on the Linguamática site (http://www.linguamatica.com/).
>
> Important dates
> * Deadline for submitting papers: 31 October 2011
> * Notification of acceptance: 30 November 2011
> * Deadline for submitting the final version: 15 December 2011
> * Publication date: December 2011
>
> The names of the Linguamática Editors and Scientific Committee
> members are at the end of this text.
>
> Contact information
> For more information please e-mail: editores at linguamatica.com
>
>
>----------------------------------------------------------------------------
>
>
> EDITORS
> * Alberto Simões (Universidade do Minho)
> * José João Almeida (Universidade do Minho)
> * Xavier Gómez Guinovart (Universidade de Vigo)
>
> SCIENTIFIC COMMITTEE
> * Alberto Álvarez Lugrís (Universidade de Vigo)
> * Aline Villavicêncio (Universidade Federal do Rio Grande do Sul)
> * Álvaro Sanroman (Universidade do Minho)
> * Ana Frankenberg-Garcia (Universidade Nova de Lisboa)
> * Anselmo Peñas (Universidad Nacional de Educación a Distancia)
> * Antón Santamarina (Universidade de Santiago de Compostela)
> * Antonio Moreno Sandoval (Universidad Autónoma de Madrid)
> * António Teixeira (Universidade de Aveiro)
> * Arantza Díaz de Ilarraza (Euskal Herriko Unibertsitatea)
> * Belinda Maia (Universidade do Porto)
> * Carmen García Mateo (Universidade de Vigo)
> * Diana Santos (Linguateca/FCCN)
> * Ferran Pla (Universitat Politècnica de València)
> * Gael Harry Dias (Universidade Beira Interior)
> * Gerardo Sierra (Universidad Nacional Autónoma de México)
> * German Rigau (Euskal Herriko Unibertsitatea)
> * Helena de Medeiros Caseli (Universidade Federal de São Carlos)
> * Horacio Saggion (Universitat Pompeu Fabra)
> * Iñaki Alegria (Euskal Herriko Unibertsitatea)
> * Joaquim Llisterri (Universitat Autònoma de Barcelona)
> * José Carlos Medeiros (Porto Editora)
> * José Paulo Leal (Universidade do Porto)
> * Joseba Abaitua (Universidad de Deusto)
> * Juan-Manuel Torres-Moreno (Université d'Avignon et des Pays de
> Vaucluse)
> * Kepa Sarasola (Euskal Herriko Unibertsitatea)
> * Lluís Padró (Universitat Politècnica de Catalunya)
> * Maria das Graças Volpe Nunes (Universidade de São Paulo)
> * Mercè Lorente Casafont (Universitat Pompeu Fabra)
> * Mikel Forcada (Universitat d'Alacant)
> * Patrícia Cunha França (Universidade do Minho)
> * Pablo Gamallo Otero (Universidade de Santiago de Compostela)
> * Salvador Climent Roca (Universitat Oberta de Catalunya)
> * Susana Afonso (The University of Sheffield)
> * Tony Berber Sardinha (Pontifícia Universidade Católica de São
> Paulo)
>
>
>
>
>------------------------------
>
>Message: 2
>Date: Fri, 15 Jul 2011 15:03:28 +0100
>From: Alistair Baron <a.baron at comp.lancs.ac.uk>
>Subject: [Corpora-List] Reminder - CL2011 Workshop: Dealing with
> spelling variation in historical corpora - Call for participation
>To: Corpora Mailing List <corpora at uib.no>, VARD Mailing List
> <vard at comp.lancs.ac.uk>
>
>A reminder for those attending the Corpus Linguistics conference in
>Birmingham next week: we are running a hands-on workshop on VARD and
>historical (and other) spelling variation. Whilst it is by no means
>compulsory to pre-register, it will help our planning if you could send us
>an email (see details below) if you intend to participate.
>
>Thank you and apologies for the repeat message,
>Alistair Baron
>
>===================================================
>
>CALL FOR PARTICIPATION
>
>WORKSHOP:
>Dealing with spelling variation in historical corpora:
>Using VARD to standardise spelling variants from the EmodE period.
>
>Corpus Linguistics 2011, Birmingham, UK - 20-22 July 2011
>http://www.cl2011.org.uk/
>
>===================================================
>
>At the upcoming Corpus Linguistics 2011 conference in Birmingham (20-22 July
>2011), we will be holding a hands-on workshop titled "Dealing with spelling
>variation in historical corpora: Using VARD to standardise spelling variants
>from the EmodE period". The workshop will be centered around the VARD
>(VARiant Detector) tool (http://ucrel.lancs.ac.uk/vard) and its use with
>Early Modern English (EmodE) corpora. Participants will have the opportunity
>to use the software to standardise spelling variation in provided texts, or,
>if desired, users can bring their own texts containing spelling variation
>from any source (e.g. historical, SMS or other CMC corpora). As well as a
>presentation about the use of VARD in historical corpus linguistics, the
>workshop will also include a presentation from Anu Lehto from the University
>of Helsinki concerning the use of VARD to produce a standardised version for
>the release of the Early Modern English Medical Texts (EMEMT) corpus (
>http://www.helsinki.fi/varieng/CoRD/corpora/CEEM/EMEMTindex.html).
>
>By the end of the workshop, participants will understand how to use the VARD
>software to standardise spelling variants in EmodE corpora, how to export
>both original and standardised versions for use in other corpus linguistic
>software and how much training is required for their own corpora.
>Participants will be provided with copies of our previous studies on
>standardising historical corpora, a copy of the VARD software for academic
>use and a user manual.
>
>The workshop will be part of the Corpus Linguistics 2011 conference (
>http://www.cl2011.org.uk/) and anybody interested in attending the workshop
>is required to be registered to attend the main conference. The workshop
>will be two hours in length, with the preliminary programme indicating a
>start of 4pm on Wednesday 20th July.
>
>We are asking anybody wishing to attend the workshop to pre-register to
>allow us to plan for numbers and equipment. As there are no computer labs
>available at the conference venue, participants are asked to bring their own
>laptops where possible. To express your interest in the workshop, please
>email Alistair Baron:
>
>a.baron at comp.lancs.ac.uk
>
>with the following details:
>
>Name
>Affiliation
>Are you able to bring your own laptop?
>Bibliographical details of own text - and corpus details, if part of a
>corpus (only necessary if bringing data with you).
>
>
>Please feel free to circulate this call for participation to anybody who may
>be interested. We apologise for any cross-postings.
>
>Alistair Baron, Paul Rayson, Dawn Archer
>Workshop organisers
>
>
>
>--
>Alistair Baron
>Research Associate
>C28, School of Computing and Communications, Infolab 21, Lancaster
>University, LA1 4WA
>T: +44 (0)1524 510348
>E: a.baron at comp.lancs.ac.uk
>W: http://www.comp.lancs.ac.uk/~barona
>
>------------------------------
>
>Message: 3
>Date: Fri, 15 Jul 2011 22:05:25 -0700 (PDT)
>From: fatima zuhra <fateeshah at yahoo.com>
>Subject: Re: [Corpora-List] Pashto (was: Which Statistical Test is
> Suitable)
>To: Mike Maxwell <maxwell at umiacs.umd.edu>
>Cc: corpora at uib.no
>
>> You're referring to the two Unicode characters Arabic Kaaf (U+0643) and
>> Arabic Letter Keheh (U+06A9, also commonly called Kaf or Kaaf), right?
>
>I am referring to variations such as ?????????????? and ?????. It is a single
>word (meaning "mirror"), written in two styles. In the first occurrence, the
>second-last grapheme is made longer. In the same way, "kaaf", "baa", "meem"
>and many more graphemes are sometimes written longer and sometimes shorter. For
>software, these are two different words.
>
>> I would guess you've also observed lots of variation in the various yehs,
>> right? Arabic yeh, Farsi yeh, yeh with tail,...
>
>Yes. The same word is usually written with varying "yehs". The data I
>extracted contain frequent examples of this variation, e.g. ????? and ?????,
>which mean "population". Both are variants of a single word.
>
>> Do you know of any corpora that deal with Pashto spelling variation? For
>> instance, a bitext with found spellings aligned with "correct" spellings.
>
>To my knowledge, there is no Pashto corpus that deals with Pashto spelling
>variation. My Ph.D. supervisor and I have been working on Pashto corpora since
>2006. I used a corpus containing 1.225 million words of Pashto text, developed
>by Mohammad Abid Khan (my Ph.D. supervisor) and me (work on this corpus was
>presented at Corpus Linguistics 2009). It is, however, not an aligned corpus.
>I extracted words from the corpus and then observed a lot of spelling
>variation.
>
>Regards.
>
>Fatima Tuz Zuhra
>Ph.D. Scholar and Lecturer,
>Department of Computer Science,
>University of Peshawar, Pakistan.
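
The point that software treats such variants as two different words can be illustrated with a small Python sketch. It builds a comparison key for each word so that variant spellings fall into one group. Two assumptions are made here that the message itself does not confirm: that the "longer" graphemes are encoded with the tatweel character (U+0640), and that Arabic kaf (U+0643) may safely be folded into keheh (U+06A9). The helper names are hypothetical. The various yehs are deliberately left untouched, since folding them could merge letters that are distinct in Pashto.

```python
import unicodedata

TATWEEL = "\u0640"  # ARABIC TATWEEL, a purely visual stretching character

# Assumed folding table: Arabic kaf is rewritten as keheh.
# Extend or shrink this table to fit your own data.
FOLD = {
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
}


def normalise_word(word: str) -> str:
    """Return a comparison key for an Arabic-script word (hypothetical helper)."""
    word = unicodedata.normalize("NFC", word)
    word = word.replace(TATWEEL, "")
    return "".join(FOLD.get(ch, ch) for ch in word)


def group_variants(words):
    """Map each comparison key to the set of surface spellings that share it."""
    groups = {}
    for w in words:
        groups.setdefault(normalise_word(w), set()).add(w)
    return groups


if __name__ == "__main__":
    # Three spellings of one illustrative word: with kaf, with keheh,
    # and with keheh followed by a tatweel.
    forms = ["\u0643\u062A\u0627\u0628", "\u06A9\u062A\u0627\u0628",
             "\u06A9\u0640\u062A\u0627\u0628"]
    for key, variants in group_variants(forms).items():
        print(key, len(variants))
```

Keeping the original spellings in the groups, rather than overwriting them, means the variation itself stays available for study.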
>
>--- On Thu, 7/14/11, Mike Maxwell <maxwell at umiacs.umd.edu> wrote:
>
>
>From: Mike Maxwell <maxwell at umiacs.umd.edu>
>Subject: Pashto (was: Which Statistical Test is Suitable)
>To: "fatima zuhra" <fateeshah at yahoo.com>
>Cc: corpora at uib.no
>Date: Thursday, July 14, 2011, 9:04 AM
>
>
>On 7/13/2011 11:40 PM, fatima zuhra wrote:
>> One of my works was concerned with extracting individual words from a
>> written Pashto corpus. The system I used for extracting individual
>> Pashto words gave me such variations of the same word that looked the
>> same at the first glance (e.g. the grapheme "kaaf" may be written a bit
>> longer than how it is written currently in the Urdu spelling of "Shakir"
>> in your name, which will result in a variation of this spelling). Are
>> you considering these variations or some others?
>
>You're referring to the two Unicode characters Arabic Kaaf (U+0643) and Arabic
>Letter Keheh (U+06A9, also commonly called Kaf or Kaaf), right?
>
>I would guess you've also observed lots of variation in the various yehs,
>right? Arabic yeh, Farsi yeh, yeh with tail,...
>
>Do you know of any corpora that deal with Pashto spelling variation? For
>instance, a bitext with found spellings aligned with "correct" spellings. I'm
>not sure what "correct" spelling would mean in this context, but perhaps the
>spelling according to some dictionary (of course allowing for the various
>inflected forms of words).
>-- Mike Maxwell
> maxwell at umiacs.umd.edu
> "My definition of an interesting universe is
> one that has the capacity to study itself."
> --Stephen Eastmond
>
>------------------------------
>
>Message: 4
>Date: Fri, 15 Jul 2011 22:11:36 -0700 (PDT)
>From: fatima zuhra <fateeshah at yahoo.com>
>Subject: Re: [Corpora-List] Which Statistical Test is Suitable
>To: True Friend <true.friend2004 at gmail.com>
>Cc: corpora at uib.no
>
>I guess another type of such spelling variation is the use of "do chashmi
>hay" instead of "hay". For example, in my name, some people write the "hay" in
>"Zuhra" as ?. Other people write the same "hay" as "do chashmi hay", i.e. ?.
>There are also other examples of such variation, e.g. in "chaahaiye" (which
>means "need") and "rahain" (which means "remain").
>
>Regards.
>
>Fatima Tuz Zuhra
>Ph.D. Scholar and Lecturer,
>Department of Computer Science,
>University of Peshawar, Pakistan.
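
When two spellings look alike on screen, listing their code points shows which letter was actually typed. The Python sketch below does this with the standard `unicodedata` module. The function name `describe` is hypothetical, and the sample string assumes that the two forms of "hay" discussed above are encoded as U+06C1 and U+06BE, which is common in Urdu text but cannot be confirmed from this message.

```python
import unicodedata


def describe(text: str):
    """List (code point, Unicode name) for every character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in text]


if __name__ == "__main__":
    # Assumed encodings of "hay" and "do chashmi hay".
    for code, name in describe("\u06C1\u06BE"):
        print(code, name)
```

Running `describe` over two competing spellings of the same word makes it easy to see exactly where they differ.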
>
>
>--- On Thu, 7/14/11, True Friend <true.friend2004 at gmail.com> wrote:
>
>
>From: True Friend <true.friend2004 at gmail.com>
>Subject: Re: [Corpora-List] Which Statistical Test is Suitable
>To: "fatima zuhra" <fateeshah at yahoo.com>
>Cc: "corpora" <corpora at uib.no>
>Date: Thursday, July 14, 2011, 12:36 PM
>
>
>
>Dear Corpora Members
>Thanks for your responses. I am actually doing research on the spelling
>alternation of ? alif and ? hay (two Urdu letters). There has been a long
>debate among scholars about which word should be written with which letter. For
>example, the word Ghonsa (English: punch) can be written as ?????? (ending in
>alif) or as ?????? (ending in hay) with no change in meaning. In most cases the
>frequencies are clearly different: there is a clear preference for the alif or
>the hay variant. In some cases, however, the frequencies are very close. I have
>selected the words whose two variants have very close frequencies (with no
>change in meaning, of course), and now I want to summarize the group
>behaviour by applying a correlation formula or similar. An example of such
>variant spellings is as follows:
>
>
>
>
>
>
>Alif Variant   Freq.   Hay Variant   Freq.
>????           587     ????          508
>???            97      ???           116
>??????         586     ??????        725
>As you can see, the frequencies are close; my aim is to summarize the group
>behaviour. The point is to show general public usage: despite the rules
>available, people are unsure how to spell these words.
>Hopefully this explains why I asked.
>--
>
>Muhammad Shakir Aziz ???? ???? ????
>Masters in Applied Linguistics
>Translator, Course Developer, Linguist for Urdu, Punjabi and English
>Urdu:- http://awaz-e-dost.blogspot.com/
>English:- http://linguisticslearner.blogspot.com/
>Facebook:- http://www.facebook.com/truefriend2004
>Skype:- true_friend2004
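
As one possible way of summarizing the three frequency pairs in the table above, each pair can be tested against an even 50/50 split with a chi-square goodness-of-fit test (1 degree of freedom). This is only a sketch of a standard technique, not a test that anyone in this thread recommended; the function name is hypothetical and only the Python standard library is used.

```python
import math

# Frequency pairs (alif variant, hay variant) copied from the table above.
PAIRS = [(587, 508), (97, 116), (586, 725)]


def chi_square_equal_split(a: int, b: int):
    """Chi-square goodness-of-fit test (1 d.f.) against a 50/50 split."""
    total = a + b
    expected = total / 2
    chi2 = ((a - expected) ** 2 + (b - expected) ** 2) / expected
    # For 1 degree of freedom the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


if __name__ == "__main__":
    for a, b in PAIRS:
        chi2, p = chi_square_equal_split(a, b)
        share = a / (a + b)
        print(f"{a:4d} vs {b:4d}: alif share = {share:.3f}, "
              f"chi2 = {chi2:.2f}, p = {p:.4f}")
```

With these counts the statistic is about 5.70 for the first pair, 1.69 for the second and 14.74 for the third, against a 5% critical value of 3.84. So only the second pair is statistically indistinguishable from an even split, even though all three look close at first glance.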
>
>------------------------------
>
>Message: 5
>Date: Sat, 16 Jul 2011 10:23:29 +0500
>From: True Friend <true.friend2004 at gmail.com>
>Subject: Re: [Corpora-List] Which Statistical Test is Suitable
>To: fatima zuhra <fateeshah at yahoo.com>
>Cc: corpora at uib.no
>
>Well, currently I want to focus on alif and hay variation only, because it
>is the most obvious and most confusing variation.
>Regards
>--
>Muhammad Shakir Aziz ???? ???? ????
>Masters in Applied Linguistics
>Translator, Course Developer, Linguist for Urdu, Punjabi and English
>Urdu:- http://awaz-e-dost.blogspot.com/
>English:- http://linguisticslearner.blogspot.com/
>Facebook:- http://www.facebook.com/truefriend2004
>Skype:- true_friend2004
>
>------------------------------
>
>Message: 6
>Date: Sat, 16 Jul 2011 10:49:40 +0200 (CEST)
>From: "rita.calabrese at libero.it" <rita.calabrese at libero.it>
>Subject: [Corpora-List] R: Corpora Digest, Vol 49, Issue 16
>To: <corpora at uib.no>
>
>
>Dear Sammy Danso,
>the quickest way to build an electronic corpus from printed materials is to
>use an OCR (Optical Character Recognition) application. You can download such
>software free of charge from the following website:
>
>http://softi-freeocr.softonic.it/download
>
>Best wishes
>
>Rita Calabrese
>University of Salerno
>via Ponte don Melillo
>84084 Fisciano (SA)
>ITALY
>
>
>
>>----Original message----
>>From: corpora-request at uib.no
>>Date: 14/07/2011 5.40
>>To: <corpora at uib.no>
>>Subject: Corpora Digest, Vol 49, Issue 16
>>
>>Today's Topics:
>>
>> 1. Re: Typing Urdu text in LaTeX (Paul Johnston)
>> 2. Re: Typing Urdu text in LaTeX (Alberto Simões)
>> 3. Re: Typing Urdu text in LaTeX (manaal faruqui)
>> 4. Hebrew texts in Latin lettrs (Yuri Tambovtsev)
>> 5. Re: Hebrew texts in Latin lettrs (Nomi Guthmann)
>> 6. First Call for Papers: 8th Workshop on Syntax & Semantics
>> (WoSS8) (Géraldine Walther)
>> 7. Re: Which Statistical Test is Suitable (Geoffrey Sampson)
>> 8. Re: Which Statistical Test is Suitable (Geoffrey Sampson)
>> 9. Methodology for capturing corpus from paper to computer
>> (Samuel Danso)
>> 10. Re: Which Statistical Test is Suitable (chris brew)
>> 11. Re: Which Statistical Test is Suitable (chris brew)
>> 12. Re: Which Statistical Test is Suitable (maxwell)
>> 13. Re: Which Statistical Test is Suitable (maxwell)
>> 14. Re: Which Statistical Test is Suitable (John F. Sowa)
>> 15. The ACL Anthology Searchbench is online (Ulrich Schaefer)
>> 16. Re: Methodology for capturing corpus from paper to computer
>> (Ana Julia)
>> 17. Re: Which Statistical Test is Suitable (fatima zuhra)
>>
>>
>>----------------------------------------------------------------------
>>
>>Message: 1
>>Date: Tue, 12 Jul 2011 12:00:28 +0000
>>From: Paul Johnston <paul.johnston at manchester.ac.uk>
>>Subject: Re: [Corpora-List] Typing Urdu text in LaTeX
>>To: manaal faruqui <manaalfar at gmail.com>, "corpora at uib.no"
>> <corpora at uib.no>
>>
>>Try something along the lines of
>>
>>\documentclass[11pt]{article}
>>\usepackage{arabtex}
>>\begin{document}
>>\begin{RLtext}
>>\seturdu
>>abcdefgijklmnop
>>\end{RLtext}
>>\end{document}
>>
>>I don't pretend to speak Urdu but it compiles and looks reasonable.
>>
>>Paul
>>
>>-----Original Message-----
>>From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of
>>manaal faruqui
>>Sent: 12 July 2011 11:53
>>To: corpora at uib.no
>>Subject: Re: [Corpora-List] Typing Urdu text in LaTeX
>>
>>I am using the transliteration given here:
>>http://en.wikipedia.org/wiki/ArabTeX
>>
>>On Tue, Jul 12, 2011 at 4:21 PM, manaal faruqui <manaalfar at gmail.com> wrote:
>>> Hi All,
>>>
>>> I have to write a report in which I need to insert Urdu in Latex.
>>> I have used \usepackage{arabtex} and I am trying to use
>>>
>>> \texturdu{} to write the Urdu words, but it's saying that it's an
>>> "Undefined control sequence".
>>>
>>> I am using the transliteration given here:
>>>
>>> and the sty file from here:
>>> http://www.tex.ac.uk/tex-archive/language/arabtex/texinput/arabtex.sty
>>>
>>> Please help.
>>>
>>> Thanks a lot,
>>> Manaal Faruqui
>>> 4th year UG student
>>> IIT Kharagpur, India
>>> http://cse.iitkgp.ac.in/~manaalf
>>>
>>
>>_______________________________________________
>>UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>>Corpora mailing list
>>Corpora at uib.no
>>http://mailman.uib.no/listinfo/corpora
>>
>>
>>
>>------------------------------
>>
>>Message: 2
>>Date: Wed, 13 Jul 2011 11:20:38 +0100
>>From: Alberto Simões <albie at alfarrabio.di.uminho.pt>
>>Subject: Re: [Corpora-List] Typing Urdu text in LaTeX
>>To: corpora at uib.no
>>
>>Hello
>>
>>I know nothing about Urdu, but if you are able to type Urdu
>>characters directly in UTF-8, you can use XeLaTeX to typeset them.
>>
>>If this is a possibility, let me know and I'll help with the XeLaTeX
>>document structure.
>>
>>All the best,
>>Alberto
>>
>>On 12/07/2011 13:00, Paul Johnston wrote:
>>> Try something along the lines of
>>>
>>> \documentclass[11pt]{article}
>>> \usepackage{arabtex}
>>> \begin{document}
>>> \begin{RLtext}
>>> \seturdu
>>> abcdefgijklmnop
>>> \end{RLtext}
>>> \end{document}
>>>
>>> I don't pretend to speak Urdu but it compiles and looks reasonable.
>>>
>>> Paul
>>>
>>> -----Original Message-----
>>> From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of
>>> manaal faruqui
>>> Sent: 12 July 2011 11:53
>>> To: corpora at uib.no
>>> Subject: Re: [Corpora-List] Typing Urdu text in LaTeX
>>>
>>> I am using the transliteration given here:
>>> http://en.wikipedia.org/wiki/ArabTeX
>>>
>>> On Tue, Jul 12, 2011 at 4:21 PM, manaal faruqui <manaalfar at gmail.com>
>>> wrote:
>>>> Hi All,
>>>>
>>>> I have to write a report in which I need to insert Urdu in Latex.
>>>> I have used \usepackage{arabtex} and I am trying to use
>>>>
>>>> \texturdu{} to write the Urdu words, but it's saying that it's an
>>>> "Undefined control sequence".
>>>>
>>>> I am using the transliteration given here:
>>>>
>>>> and the sty file from here:
>>>> http://www.tex.ac.uk/tex-archive/language/arabtex/texinput/arabtex.sty
>>>>
>>>> Please help.
>>>>
>>>> Thanks a lot,
>>>> Manaal Faruqui
>>>> 4th year UG student
>>>> IIT Kharagpur, India
>>>> http://cse.iitkgp.ac.in/~manaalf
>>>>
>>>
>>> _______________________________________________
>>> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>>> Corpora mailing list
>>> Corpora at uib.no
>>> http://mailman.uib.no/listinfo/corpora
>>>
>>
>>--
>>Alberto Simoes
>>CCTC-UM / CEHUM
>>
>>
>>
>>------------------------------
>>
>>Message: 3
>>Date: Wed, 13 Jul 2011 15:55:01 +0530
>>From: manaal faruqui <manaalfar at gmail.com>
>>Subject: Re: [Corpora-List] Typing Urdu text in LaTeX
>>To: albie at alfarrabio.di.uminho.pt
>>Cc: corpora at uib.no
>>
>>Thanks all, the problem was solved by the method Paul suggested. :)
>>
>>Manaal
>>
>>2011/7/13 Alberto Simões <albie at alfarrabio.di.uminho.pt>:
>>> Hello
>>>
>>> I know nothing about Urdu, but if you are able to type Urdu
>>> characters directly in UTF-8, you can use XeLaTeX to typeset them.
>>>
>>> If this is a possibility, let me know and I'll help with the XeLaTeX
>>> document structure.
>>>
>>> All the best,
>>> Alberto
>>>
>>> On 12/07/2011 13:00, Paul Johnston wrote:
>>>>
>>>> Try something along the lines of
>>>>
>>>> \documentclass[11pt]{article}
>>>> \usepackage{arabtex}
>>>> \begin{document}
>>>> \begin{RLtext}
>>>> \seturdu
>>>> abcdefgijklmnop
>>>> \end{RLtext}
>>>> \end{document}
>>>>
>>>> I don't pretend to speak Urdu but it compiles and looks reasonable.
>>>>
>>>> Paul
>>>>
>>>> -----Original Message-----
>>>> From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of
>>>> manaal faruqui
>>>> Sent: 12 July 2011 11:53
>>>> To: corpora at uib.no
>>>> Subject: Re: [Corpora-List] Typing Urdu text in LaTeX
>>>>
>>>> I am using the transliteration given here:
>>>> http://en.wikipedia.org/wiki/ArabTeX
>>>>
>>>> On Tue, Jul 12, 2011 at 4:21 PM, manaal faruqui<manaalfar at gmail.com>
>>>> wrote:
>>>>>
>>>>> Hi All,
>>>>>
>>>>> I have to write a report in which I need to insert Urdu in Latex.
>>>>> I have used \usepackage{arabtex} and I am trying to use
>>>>>
>>>>> \texturdu{} to write the Urdu words, but it's saying that it's an
>>>>> "Undefined control sequence".
>>>>>
>>>>> I am using the transliteration given here:
>>>>>
>>>>> and the sty file from here:
>>>>> http://www.tex.ac.uk/tex-archive/language/arabtex/texinput/arabtex.sty
>>>>>
>>>>> Please help.
>>>>>
>>>>> Thanks a lot,
>>>>> Manaal Faruqui
>>>>> 4th year UG student
>>>>> IIT Kharagpur, India
>>>>> http://cse.iitkgp.ac.in/~manaalf
>>>>>
>>>>
>>>> _______________________________________________
>>>> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>>>> Corpora mailing list
>>>> Corpora at uib.no
>>>> http://mailman.uib.no/listinfo/corpora
>>>>
>>>
>>> --
>>> Alberto Simoes
>>> CCTC-UM / CEHUM
>>>
>>> _______________________________________________
>>> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>>> Corpora mailing list
>>> Corpora at uib.no
>>> http://mailman.uib.no/listinfo/corpora
>>>
>>
>>
>>
>>------------------------------
>>
>>Message: 4
>>Date: Wed, 13 Jul 2011 18:02:36 +0700
>>From: "Yuri Tambovtsev" <yutamb at mail.ru>
>>Subject: [Corpora-List] Hebrew texts in Latin lettrs
>>To: <corpora at uib.no>
>>
>>Dear Corpora colleagues, do you know any websites of Hebrew texts in Latin
>>letters? I cannot read Hebrew letters. However, I'd like to compare Hebrew
>>sound chains with those I have in about 300 world languages. Looking forward
>>to hearing from you soon (yutamb at mail.ru). Yours sincerely, Yuri Tambovtsev,
>>Novosibirsk, Russia
>>
>>------------------------------
>>
>>Message: 5
>>Date: Wed, 13 Jul 2011 15:01:25 +0300
>>From: Nomi Guthmann <nomi.guthmann at googlemail.com>
>>Subject: Re: [Corpora-List] Hebrew texts in Latin lettrs
>>To: Yuri Tambovtsev <yutamb at mail.ru>
>>Cc: corpora at uib.no
>>
>>Hi Yuri,
>>
>>The Hebrew Treebank corpus from the Mila Knowledge Center for Processing
>>Hebrew has a transliterated version. It is available here
>>http://www.mila.cs.technion.ac.il/mila/eng/resources_treebank.html
>>The transcription that was used is described in
>>http://www.cs.technion.ac.il/~winter/Corpus-Project/paper.pdf
>>
>>Noemie
>>
>>2011/7/13 Yuri Tambovtsev <yutamb at mail.ru>
>>
>>> Dear Corpora colleagues, do you know any websites of Hebrew texts in Latin
>>> letters? I cannot read Hebrew letters. However, I'd like to compare Hebrew
>>> sound chains with those I have in about 300 world languages. Looking forward
>>> to hearing from you soon (yutamb at mail.ru). Yours sincerely, Yuri
>>> Tambovtsev, Novosibirsk, Russia
>>>
>>> _______________________________________________
>>> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>>> Corpora mailing list
>>> Corpora at uib.no
>>> http://mailman.uib.no/listinfo/corpora
>>>
>>>
>>
>>------------------------------
>>
>>Message: 6
>>Date: Wed, 13 Jul 2011 15:44:35 +0200
>>From: Géraldine Walther <geraldine.walther at linguist.jussieu.fr>
>>Subject: [Corpora-List] First Call for Papers: 8th Workshop on Syntax
>> & Semantics (WoSS8)
>>To: corpora at uib.no
>>
>>[Apologies for cross-postings]
>>
>>***FIRST CALL FOR PAPERS***
>>
>>8th Workshop on Syntax & Semantics (WoSS)
>>November 17th-18th, 2011
>>Paris, France
>>
>>*****
>>
>>We invite PhD students to send abstracts for twenty-minute talks followed by
>>a ten-minute discussion or poster presentations on any aspect of theoretical
>>linguistics for the 8th Workshop on Syntax & Semantics (WoSS).
>>
>>WoSS is a series of rotating workshops organized by PhD students from
>>'neighbouring' universities (see list below) for PhD students working in
>>different domains of generative linguistics, in a broad sense, e.g. syntax,
>>semantics, pragmatics, morphology, phonetics, phonology, language acquisition,
>>computational linguistics, etc.
>>
>>The institutions behind WoSS are:
>>The University of Nantes
>>The University of the Basque Country in Vitoria-Gasteiz (EHU)
>>The Universities of Catalonia (UAB, UB, UPF, URV)
>>The Universities of Paris 3, Paris 7, Paris 8
>>The Universities of Madrid (IUOG, UAM, UCM)
>>The University of Siena
>>
>>This year's WoSS is co-organized by University Paris Diderot (Paris 7) and
>>University Paris Vincennes St-Denis (Paris 8), and will take place at the CNRS
>>`Pouchet' building, 59 rue Pouchet, 75017 Paris, on November 17th-18th, 2011.
>>
>>Submission instructions
>>
>>Abstracts must be anonymous and at most two pages long, examples and
>>references included, on an A4 sheet with one-inch (2.54 cm) margins and
>>12-point Times New Roman font, single spacing.
>>
>>Submissions are limited to one individual and one joint abstract per author,
>>or two joint abstracts per author. The abstracts must be submitted over
>>EasyChair as a PDF attachment by the 31st of August.
>>
>>https://www.easychair.org/conferences/?conf=woss8
>>
>>Accepted papers will be presented orally or as posters depending on the nature
>>and quality of the work. You may, however, specifically indicate whether you
>>would prefer to present your paper as an oral presentation or as a poster.
>>
>>INVITED SPEAKERS
>>
>>We have the pleasure to announce that the following speakers will be giving
>>an invited talk at WoSS8:
>>
>>Paolo Acquaviva (University College Dublin)
>>Bob Borsley (University of Essex)
>>Philippe Schlenker (ENS-NYU)
>>
>>Important dates:
>>
>>Deadline for submission: August 31, 2011
>>
>>Notification of acceptance: October 7, 2011
>>
>>Scientific Committee:
>>
>>Xiaoliang HUANG, Paris 7.
>>Christophe ONAMBELE, Paris 8.
>>Marie PHILIPPE, Paris 8.
>>Géraldine WALTHER, Paris 7.
>>Grégoire WINTERSTEIN, Paris 7.
>>
>>A WoSS8 website is currently under construction and will be available soon.
>>
>>More information about previous WoSS workshops can be found at:
>>
>>http://www.woss7.univ-nantes.fr/
>>
>>If you have any questions, please contact us at:
>>
>>woss8paris at gmail.com
>>
>>
>>
>>
>>------------------------------
>>
>>Message: 7
>>Date: Wed, 13 Jul 2011 15:34:05 +0100
>>From: Geoffrey Sampson <grs2 at sussex.ac.uk>
>>Subject: Re: [Corpora-List] Which Statistical Test is Suitable
>>To: True Friend <true.friend2004 at gmail.com>
>>Cc: corpora <corpora at uib.no>, corpora at lists.uib.no
>>
>>Dear Muhammad Shakir Aziz,
>>
>>I don't see that anyone else has responded to your query, so let me do so,
>>rather late. I would say that no kind of statistical test could possibly
>>indicate whether variant spellings were errors, or allowable alternatives;
>>because this question is not to do with numbers. It is a question about
>>where authority over the norms of the language you are concerned with is
>>felt to lie, and what that authority says about orthography. Some
>>languages, at some periods, tolerate a wide variety of alternative
>>spellings for given words, while other languages (or the same languages at
>>other periods) may have extremely tightly-defined norms and strong social
>>sanctions against violating them. Carrying out statistical calculations on
>>tables of the incidence of alternatives would not tell you anything about
>>this, I believe.
>>
>>Geoffrey Sampson
>>
>>
>>
>>
>>------------------------------
>>
>>Message: 9
>>Date: Wed, 13 Jul 2011 15:55:53 +0100
>>From: "Samuel Danso" <scsod at leeds.ac.uk>
>>Subject: [Corpora-List] Methodology for capturing corpus from paper to
>> computer
>>To: "'corpora'" <corpora at uib.no>
>>
>>Dear All
>>
>>Please advise on methodology for capturing paper forms into a computer
>>corpus.
>>
>>
>>
>>My research involves a collection of 10,000 Verbal Autopsy interviews with the
>>mother or a close relative of the deceased, currently on paper forms. How should
>>I have these typed into a PC? Double entry by two independent clerks costs twice
>>as much as single entry (with checking by managers); is it really necessary?
>>
>>
>>
>>Sammy Danso,
>>
>>Leeds University, UK and Kintampo Health Centre, Ghana
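
One way to approach the double-entry question above might be to have a small random sample of forms keyed twice and measure how often the two keyings disagree, before paying for double entry of all 10,000 forms. The Python sketch below only illustrates that idea; it is not a method proposed in the thread. The function names and the sample sentences are hypothetical. It uses the standard-library difflib module to align two keyings of the same form word by word.

```python
import difflib


def word_disagreements(entry_a: str, entry_b: str) -> list[tuple[str, str]]:
    """Return (clerk A words, clerk B words) for every stretch where two keyings differ."""
    a, b = entry_a.split(), entry_b.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # One side may be empty when a clerk inserted or dropped words.
            diffs.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return diffs


def disagreement_rate(pairs: list[tuple[str, str]]) -> float:
    """Share of words on which the two keyings disagree, over a sample of forms."""
    mismatched = 0
    total = 0
    for entry_a, entry_b in pairs:
        total += max(len(entry_a.split()), len(entry_b.split()))
        for words_a, words_b in word_disagreements(entry_a, entry_b):
            mismatched += max(len(words_a.split()), len(words_b.split()))
    return mismatched / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical keyings of the same paper form by two clerks.
    clerk_a = "the child had fever for three days before death"
    clerk_b = "the child had fever for tree days before death"
    print(word_disagreements(clerk_a, clerk_b))
    print(disagreement_rate([(clerk_a, clerk_b)]))
```

If the rate measured on such a sample is low enough for the intended analysis, single entry with spot checks may be defensible; what counts as "low enough" is a research decision that the code cannot make.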
>>
>>
>>
>>
>>
>>
>>------------------------------
>>
>>Message: 10
>>Date: Wed, 13 Jul 2011 11:17:03 -0400
>>From: chris brew <cbrew at acm.org>
>>Subject: Re: [Corpora-List] Which Statistical Test is Suitable
>>To: Geoffrey Sampson <grs2 at sussex.ac.uk>
>>Cc: corpora <corpora at uib.no>, corpora at lists.uib.no
>>
>>I partially agree with Geoffrey Sampson's points. It is certainly true that
>>a table of numbers, in isolation, tells you nothing about the question you
>>are asking, for the reasons that Professor Sampson gives. And statistical
>>tests will not change this situation. To make progress, you need to be
>>precise about what you intend to count as a "spelling error". You could for
>>example reframe the problem as "how likely is it that the numbers that we
>>observe are due to random mistakes in typing?", then proceed to make a
>>mathematical model of typing errors. Or you could contrast the typing error
>>hypothesis with an alternative hypothesis and frame the question as "Are the
>>numbers that we observe more likely to be the result of typing errors or
>>more likely to be due to the existence in the writing population of two
>>groups of people, one of which always tries to spell the word one way, and
>>one of which tries to spell it the other way". It will take some clear
>>thinking to get this comparison right, because you have to make a precise
>>quantitative judgement on things like the prior probability of finding
>>groups that spell differently in the way we hypothesize. From experience of
>>US/UK spelling differences, I believe that it would be a tricky and subtle
>>matter to come up with suitably precise and useful hypotheses. No surprise
>>there, as linguists we are used to working with challenging and complex
>>data.
>>
>>But, if you do manage to set up sufficiently precise hypotheses, and
>>associate numbers with the hypotheses, statistical reasoning definitely can
>>help. That's what it is for. This kind of thinking is the basis for all
>>statistical tests that I am aware of. What you are never going to find is a
>>statistical test that frees you from the necessity of making (or finding in
>>the work of other scholars) a precise and careful analysis of the problem
>>you are trying to solve.
>>
>>Chris
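
The comparison of hypotheses described above can be sketched in Python as follows. This is a minimal illustration under strong assumptions, not a recommended test: the data are invented per-writer counts of two spellings A and B, and the slip rate `eps` and the prior share of B-spellers `share_b` are fixed by hand, which is exactly the hard modelling step the message warns about. All names are hypothetical.

```python
import math


def log_binom_pmf(k: int, n: int, p: float) -> float:
    """Log-probability of k successes in n trials with success probability p (0 < p < 1)."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))


def loglik_typing_errors(writers: list[tuple[int, int]], eps: float) -> float:
    """H1: every writer aims at spelling A; B occurs only as a slip with probability eps."""
    return sum(log_binom_pmf(b, a + b, eps) for a, b in writers)


def loglik_two_groups(writers: list[tuple[int, int]], eps: float, share_b: float) -> float:
    """H2: each writer aims at A or at B (B with prior share_b) and slips with probability eps."""
    total = 0.0
    for a, b in writers:
        n = a + b
        aims_a = math.log(1 - share_b) + log_binom_pmf(b, n, eps)
        aims_b = math.log(share_b) + log_binom_pmf(a, n, eps)
        # Log-sum-exp over the two possible targets, for numerical stability.
        m = max(aims_a, aims_b)
        total += m + math.log(math.exp(aims_a - m) + math.exp(aims_b - m))
    return total


if __name__ == "__main__":
    # Invented counts: (tokens spelled A, tokens spelled B) for each writer.
    writers = [(20, 0), (18, 1), (0, 15), (1, 19)]
    h1 = loglik_typing_errors(writers, eps=0.05)
    h2 = loglik_two_groups(writers, eps=0.05, share_b=0.5)
    print("log-likelihood ratio (H2 - H1):", h2 - h1)
```

A positive difference means the invented data are better explained by two groups of writers than by slips alone, but only relative to the assumed values of `eps` and `share_b`.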
>>
>>On Wed, Jul 13, 2011 at 10:34 AM, Geoffrey Sampson <grs2 at sussex.ac.uk> wrote:
>>
>>> Dear Muhammad Shakir Aziz,
>>>
>>> I don't see that anyone else has responded to your query, so let me do so,
>>> rather late. I would say that no kind of statistical test could possibly
>>> indicate whether variant spellings were errors, or allowable alternatives;
>>> because this question is not to do with numbers. It is a question about
>>> where authority over the norms of the language you are concerned with is
>>> felt to lie, and what that authority says about orthography. Some
>>> languages, at some periods, tolerate a wide variety of alternative
>>> spellings for given words, while other languages (or the same languages at
>>> other periods) may have extremely tightly-defined norms and strong social
>>> sanctions against violating them. Carrying out statistical calculations on
>>> tables of the incidence of alternatives would not tell you anything about
>>> this, I believe.
>>>
>>> Geoffrey Sampson
>>>
>>>
>>> _______________________________________________
>>> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>>> Corpora mailing list
>>> Corpora at uib.no
>>> http://mailman.uib.no/listinfo/corpora
>>>
>>
>>
>>
>>--
>>Chris Brew, Ohio State University
>>
>>------------------------------
>>
>>Message: 12
>>Date: Wed, 13 Jul 2011 12:52:27 -0400
>>From: maxwell <maxwell at umiacs.umd.edu>
>>Subject: Re: [Corpora-List] Which Statistical Test is Suitable
>>To: chris brew <cbrew at acm.org>
>>Cc: corpora <corpora at uib.no>, corpora at lists.uib.no
>>
>>I am not at all familiar with the literature, but it's possible that the
>>literature people have looked at spelling (non-)standardization in the
>>period in English between, say, Chaucer (when not only was every writer a
>>law unto himself, but an individual writer might have a lot of variation),
>>up into the era of spelling standardization (when individual writers could
>>be law-abiding citizens or outlaws :-)). Perhaps similar sorts of things
>>happened in other languages that underwent standardization (mostly European
>>languages, I'm guessing).
>>
>>If they have worked on this, a place to start a literature search might be
>>the ALLC (Association for Linguistic and Literary Computing) and the
>>Association for Computing in the Humanities. The two orgs have met for
>>joint conferences in the last decade, I believe.
>>
>> Mike Maxwell
>>
>>
>>
>>------------------------------
>>
>>Message: 14
>>Date: Wed, 13 Jul 2011 13:19:14 -0400
>>From: "John F. Sowa" <sowa at bestweb.net>
>>Subject: Re: [Corpora-List] Which Statistical Test is Suitable
>>To: corpora at uib.no
>>
>>On 7/13/2011 11:17 AM, chris brew wrote:
>>> I partially agree with Geoffrey Sampson's points. It is certainly true
>>> that a table of numbers, in isolation, tells you nothing about the
>>> question you are asking, for the reasons that Professor Sampson gives.
>>> And statistical tests will not change this situation...
>>
>>All statistical methods are based on some model about the processes
>>that generate the data. And as the statistician George Box observed:
>>
>> All models are wrong, but some are useful.
>>
>>Geoffrey Sampson:
>>> It is a question about where authority over the norms of the language
>>> you are concerned with is felt to lie, and what that authority says
>>> about orthography.
>>
>>Yes, and those authorities could be authors, dictionaries, or some
>>official legislation.
>>
>>CB
>>> But, if you do manage to set up sufficiently precise hypotheses,
>>> and associate numbers with the hypotheses, statistical reasoning
>>> definitely can help.
>>
>>I agree that statistics can help. But there are many models for
>>generating statistics. Should you give higher weights to typing
>>mistakes, dictionaries, legislation, or common usage?
>>
>>John
>>
>>
>>
>>
>>------------------------------
>>
>>Message: 15
>>Date: Wed, 13 Jul 2011 21:26:45 +0200
>>From: Ulrich Schaefer <ulrich.schaefer at dfki.de>
>>Subject: [Corpora-List] The ACL Anthology Searchbench is online
>>To: corpora at uib.no
>>
>>Dear all,
>>
>>the ACL Anthology Searchbench is online at http://aclasb.dfki.de (also
>>reachable from the ACL Anthology start page aclweb.org/anthology --
>>thanks to Min-Yen Kan for integrating it!).
>>
>>The Searchbench combines semantic, full text and bibliographic search
>>in more than 19,000 Computational Linguistics papers of the ACL
>>Anthology from the past 47 years, including the complete Journal.
>>
>>Highlights are
>>
>>- "statements" search: you can search for subject-predicate-object
>> triples in millions of sentences, where predicates can also be
>> synonyms, and taking passives and sentence negation into account
>>
>>- combination with bibliographic and full text filters
>>
>>- search result (filter) URLs can be bookmarked or emailed
>>
>>- display of search result sentences in original PDF layout.
>> This requires the Adobe Acrobat Reader browser plug-in with
>> Preferences/Search/"external highlight server" enabled and doesn't
>> work well on older, scanned papers (page should always be correct).
>>
>>The Searchbench itself requires a recent web browser with JavaScript
>>enabled. For details, see "Help" at the left bottom of the Searchbench
>>user interface.
>>
>>The Searchbench is not perfect -- it is a milestone in an ongoing
>>research project (TAKE). There was no manual correction of OCR or NLP
>>errors. Missing author affiliation data of 2010 and 2011 papers will
>>be added later.
>>
>>However, we hope you find it a useful tool also for your scientific
>>work. Your feedback is welcome ("Feedback" button at left bottom)!
>>
>>
>>-- The TAKE Searchbench team Ulrich Schäfer, Bernd Kiefer, Christian
>>Spurk, Jörg Steffen and Rui Wang
>>
>> ...with thanks to all others who have contributed to this endeavor
>> (see "About" at left bottom, also contains a link to the ACL paper
>> describing the Searchbench internals).
>>
>>The Searchbench has been developed in the context of the BMBF-funded
>>project TAKE, the DFG Cluster of Excellence on Multimodal Computing
>>and Interaction (MMCI) and the international DELPH-IN collaboration.
>>
>>--
>>Dr. Ulrich Schäfer  http://dfki.de/~uschaefer  phone:+49681857755154
>> DFKI Language Technology Lab, D-66123 Saarbruecken, Germany
>>-------------------------------------------------------------------
>> Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
>> Trippstadter Strasse 122, D-67663 Kaiserslautern, Germany
>> Geschaeftsfuehrung: Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster
>>(Vorsitzender), Dr. Walter Olthoff. Vorsitzender des Aufsichtsrats:
>>Prof. Dr. h.c. Hans A. Aukes. Amtsgericht Kaiserslautern, HRB 2313
>>
>>
>>
>>
>>------------------------------
>>
>>Message: 16
>>Date: Wed, 13 Jul 2011 12:24:48 -0300
>>From: "Ana Julia" <anajulia at corpuslg.org>
>>Subject: Re: [Corpora-List] Methodology for capturing corpus from
>> paper to computer
>>To: "Samuel Danso" <scsod at leeds.ac.uk>
>>Cc: corpora at uib.no
>>
>>Dear Samuel
>>
>>I have faced something similar,
>>and my solution was to read all the reports aloud (because they were
>>handwritten) to my IBM Via Voice program. I couldn't think of any better
>>strategy at the time... let's see if colleagues have any better solutions
>>
>>regards,
>>
>>Ana Julia Perrotti-Garcia
>>Scientia Vinces Serv. Trad. Ltda
>>Translators of Dental and Medical Texts
>>Italiano > Español > Português <> English
>>Proficiency in English (CPE) University of Cambridge UK
>>Visit our webpage at www.scientiavinces.com/ana/
>>São Paulo, Brazil
>>
>>
>>----- Original Message -----
>> From: Samuel Danso
>> To: 'corpora'
>> Sent: Wednesday, July 13, 2011 11:55 AM
>> Subject: [Corpora-List] Methodology for capturing corpus from paper
>tocomputer
>>
>>
>> Dear All
>>
>> Please advise on a methodology for capturing paper forms into a computer
>> corpus.
>>
>> My research involves a collection of 10,000 Verbal Autopsy interviews with
>> the mother or a close relative of the deceased, currently on paper forms.
>> How should I have these typed onto a PC? Double entry by two independent
>> clerks costs twice as much as single entry (with checking by managers); is
>> it really necessary?
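
One way to approach the cost question empirically is to have a small sample of forms double-keyed and measure how often the two clerks disagree. The following is a minimal Python sketch of that comparison, not a method proposed in this thread. The function names, the word-level tokenisation and the 2% review threshold are all assumptions made for illustration.

```python
import difflib


def disagreement_rate(entry_a: str, entry_b: str) -> float:
    """Fraction of word tokens on which two transcriptions disagree.

    Hypothetical helper: tokens are whitespace-separated words, and the
    rate is measured against the longer of the two transcriptions.
    """
    words_a = entry_a.split()
    words_b = entry_b.split()
    matcher = difflib.SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    # Sum the sizes of all matching runs of words (the final dummy block has size 0).
    matched = sum(block.size for block in matcher.get_matching_blocks())
    longest = max(len(words_a), len(words_b))
    return 0.0 if longest == 0 else 1.0 - matched / longest


def flag_for_review(pairs, threshold=0.02):
    """Return the IDs of forms whose two entries differ by more than `threshold`.

    `pairs` is an iterable of (form_id, clerk_a_text, clerk_b_text) tuples.
    The 2% threshold is an arbitrary illustrative value.
    """
    return [form_id for form_id, a, b in pairs
            if disagreement_rate(a, b) > threshold]


if __name__ == "__main__":
    sample = [
        ("form-001", "the child had fever", "the child had fever"),
        ("form-002", "the child had fever for three days",
                     "the child had fever for tree days"),
    ]
    for form_id, a, b in sample:
        print(form_id, round(disagreement_rate(a, b), 3))
    print("needs review:", flag_for_review(sample))
```

If the disagreement rate on the sample turns out to be very low, that may support single entry with spot checks; a high rate would suggest the extra cost of double entry is buying real accuracy.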
>>
>>
>>
>> Sammy Danso,
>>
>> Leeds University, UK and Kintampo Health Centre, Ghana
>>
>>
>>------------------------------------------------------------------------------
>>
>>
>> _______________________________________________
>> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>> Corpora mailing list
>> Corpora at uib.no
>> http://mailman.uib.no/listinfo/corpora
>>------------------------------
>>
>>Message: 17
>>Date: Wed, 13 Jul 2011 20:40:09 -0700 (PDT)
>>From: fatima zuhra <fateeshah at yahoo.com>
>>Subject: Re: [Corpora-List] Which Statistical Test is Suitable
>>To: True Friend <true.friend2004 at gmail.com>
>>Cc: corpora at uib.no
>>
>>Dear Muhammad Shakir Aziz,
>>Can you please provide an example (or two) of words that have two
>>spellings? I have worked with Pashto text and have observed that a single
>>Pashto word may be spelled in several (more than two) ways.
>>One of my projects was concerned with extracting individual words from a
>>written Pashto corpus. The system I used for extracting individual Pashto
>>words gave me variations of the same word that looked the same at first
>>glance (e.g. the grapheme "kaaf" may be written a bit longer than it is
>>currently written in the Urdu spelling of "Shakir" in your name, which
>>results in a variant of this spelling). Are you considering these variations
>>or some others?
>>
>>Regards.
>>Fatima Tuz Zuhra
>>Ph.D. Scholar and Lecturer, Department of Computer Science,
>>University of Peshawar, Pakistan.
>>--- On Sun, 7/10/11, True Friend <true.friend2004 at gmail.com> wrote:
>>
>>From: True Friend <true.friend2004 at gmail.com>
>>Subject: [Corpora-List] Which Statistical Test is Suitable
>>To: "corpora" <corpora at uib.no>, corpora at lists.uib.no
>>Date: Sunday, July 10, 2011, 8:23 PM
>>
>>Dear Members
>>I am working on a research paper on spelling variation. In my language,
>>Urdu, some words have two spellings. For example, the data might look like
>>this:
>>
>>  Word   Spelling 1   Spelling 2
>>  X          24           40
>>  Y         600          200
>>  Z         300         1000
>>
>>Now, what I want to show is that alternate spellings do exist for this group
>>of words and that they are not just spelling errors. Can I use a correlation
>>formula to show that the two spellings are related?
>>Waiting for your suggestions.
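
To make the question concrete, here is a minimal Python sketch that computes a Pearson correlation coefficient over the example counts in the table above, along with each word's share of Spelling 1. It only illustrates the arithmetic and does not answer which test is suitable. The variable and function names are assumptions, and with only three words the coefficient is far too unstable to support any conclusion.

```python
import math

# Example counts from the table: word -> (Spelling 1 count, Spelling 2 count)
counts = {"X": (24, 40), "Y": (600, 200), "Z": (300, 1000)}


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


spelling_1 = [s1 for s1, _ in counts.values()]
spelling_2 = [s2 for _, s2 in counts.values()]

# Correlation between the two columns of counts across words.
r = pearson_r(spelling_1, spelling_2)

# Share of each word's occurrences written with Spelling 1.
shares = {word: s1 / (s1 + s2) for word, (s1, s2) in counts.items()}

if __name__ == "__main__":
    print(f"Pearson r = {r:.3f}")
    for word, share in shares.items():
        print(f"{word}: {share:.1%} Spelling 1")
```

The per-word shares are included because raw counts are dominated by how frequent each word is, so a correlation over counts alone says little about whether the minority spelling is a systematic variant or an error.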
>>
>>Regards
>>--
>>Muhammad Shakir Aziz
>>
>>Masters in Applied Linguistics
>>Translator, Course Developer, Linguist for Urdu, Punjabi and English
>>
>>Urdu:- http://awaz-e-dost.blogspot.com/
>>
>>English:- http://linguisticslearner.blogspot.com/
>>
>>Facebook:- http://www.facebook.com/truefriend2004
>>
>>Skype:- true_friend2004
>>
>>
>>
>>
>>----------------------------------------------------------------------
>>Send Corpora mailing list submissions to
>> corpora at uib.no
>>
>>To subscribe or unsubscribe via the World Wide Web, visit
>> http://mailman.uib.no/listinfo/corpora
>>or, via email, send a message with subject or body 'help' to
>> corpora-request at uib.no
>>
>>You can reach the person managing the list at
>> corpora-owner at uib.no
>>
>>When replying, please edit your Subject line so it is more specific
>>than "Re: Contents of Corpora digest..."
>>
>>
>>_______________________________________________
>>Corpora mailing list
>>Corpora at uib.no
>>http://mailman.uib.no/listinfo/corpora
>>
>>
>>End of Corpora Digest, Vol 49, Issue 16
>>***************************************
>>
>
>
>
>
>
>
>End of Corpora Digest, Vol 49, Issue 19
>***************************************
>