German MOR
Brian MacWhinney
macw at cmu.edu
Fri Sep 30 17:38:54 UTC 2011
Dear Rasmus,
I had hoped to work on the German tagger during August, but I never managed to find enough time. I have just spent three hours getting the categories documented and synchronized, and I have updated the version of the German (deu) MOR on the web. It is clear that we need to use real German umlauts (ä, ö, ü) and the scharfes s (ß) consistently. We also need to decide how to treat the capitalization of common nouns. German is practically the only language that capitalizes them, and this makes tagging more difficult because proper nouns can no longer be readily distinguished from common nouns. As a result, all proper nouns have to be listed explicitly, which is a big job that could never be close to complete. Some of the corpora have already been converted to a format that uses lowercase for common nouns. Your "Anfang" problem stems from this confusion.
If you or others are willing to help on the refinement of the German MOR, that would be great. The first decision would be about the common nouns. If we could agree to put these into lowercase, then we could move on to the next steps. For existing corpora that have common nouns in caps, we can configure the CLAN program called LOWCASE to convert them to lowercase.
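To make the trade-off concrete, here is a minimal Python sketch of the lowercasing idea. It is not the CLAN LOWCASE program and makes no claim about how LOWCASE works; the function name, the regular expression, and the tiny proper-noun list are assumptions made up for illustration. It lowercases every word unless that word appears in a list of proper nouns to protect, which is exactly why such a list is needed.

```python
import re

# Hypothetical stand-in for a proper-noun list; a real list would be far longer.
PROPER_NOUNS = {"Rasmus", "Berlin", "Anna"}

# A word is a run of letters, including German umlauts and ß.
WORD = re.compile(r"[A-Za-zÄÖÜäöüß]+")


def lowercase_common_nouns(line, proper_nouns=PROPER_NOUNS):
    """Lowercase every word that is not listed as a proper noun."""
    def fix(match):
        word = match.group(0)
        if word in proper_nouns:
            return word          # keep listed proper nouns as they are
        return word.lower()      # everything else becomes lowercase
    return WORD.sub(fix, line)


print(lowercase_common_nouns("Am Anfang war Anna in Berlin ."))
# am anfang war Anna in Berlin .
```

Any proper noun missing from the list would be lowercased along with the common nouns, so the quality of the result depends directly on how complete the list is.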
Regarding your question about when to update your version of CLAN, I would say that updating every 2-3 months is a good policy. Whenever you download a new version of a MOR grammar, you should also update CLAN.
Regarding the minMOR mentioned in the Stephany file, it does not exist as far as I know.
Best regards,
-- Brian MacWhinney
On Sep 29, 2011, at 3:40 AM, RSteinkrauss wrote:
> Hello,
>
> we are trying to tag a German corpus morphologically with the German
> MOR grammar from the website and are experiencing some problems with
> that. Sometimes a word is not recognized although it is in the
> lexicon, and different CLAN versions are yielding different results -
> notably, older CLAN versions (Dec 2009) detect more than the newest
> version.
>
> For example, while the noun "Anfang" is part of the lexicon (file
> n.cut), it is not recognized:
> ?|Anfang
>
> When writing it with a lower-case letter (which is ungrammatical for
> nouns in German), it is recognized - three times, once as a noun and
> twice as a verb form:
> v|anfangen^n|anfang^an#v|fangen
> However, while "anfangen" is a verb in German, "anfang" is not an
> existing form of that verb.
>
> And, to give an example of the differences between versions, the older
> CLAN version would add the gender &M to the noun:
> v|anfangen^n|anfang&M^an#v|fangen
> The newest version does not do this (see above).
>
> Can anyone help us with this? We would be happy to invest time into
> improving the German tagger, but we are not sure how to go about this
> and would first like to sort out the errors in the existing MOR
> grammar. Any hints are greatly appreciated!
>
> On a related note: Is the info on the German minMOR grammar mentioned
> in the file
> http://childes.psy.cmu.edu/intro/stephany.pdf
> still correct?
>
> Thanks!
> Rasmus Steinkrauss
>
> --
> You received this message because you are subscribed to the Google Groups "chibolts" group.
> To post to this group, send email to chibolts at googlegroups.com.
> To unsubscribe from this group, send email to chibolts+unsubscribe at googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/chibolts?hl=en.
>
>
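The analysis strings quoted in the message above, such as v|anfangen^n|anfang&M^an#v|fangen, can be taken apart mechanically when comparing the output of two CLAN versions. The following is a minimal Python sketch, not part of CLAN. It assumes, based on the examples in the message, that "^" separates alternative readings, "#" follows a prefix, "|" separates the part of speech from the stem, and "&" introduces a fused feature such as gender. The function name and the dictionary keys are hypothetical.

```python
def parse_analysis(analysis):
    """Split one MOR analysis string into its alternative readings."""
    readings = []
    for alt in analysis.split("^"):              # "^" separates alternatives
        prefix, _, rest = alt.rpartition("#")    # "an#v|fangen" -> prefix "an"
        pos, _, stem_part = rest.partition("|")  # part of speech before "|"
        stem, *features = stem_part.split("&")   # "&M" marks a fused feature
        readings.append({
            "prefix": prefix or None,            # None when there is no prefix
            "pos": pos,
            "stem": stem,
            "features": features,
        })
    return readings


for reading in parse_analysis("v|anfangen^n|anfang&M^an#v|fangen"):
    print(reading)
```

Run on the older version's output, this yields three readings: a verb "anfangen", a noun "anfang" with the feature M, and a verb "fangen" with the prefix "an". An unrecognized word such as ?|Anfang comes back as a single reading whose part of speech is "?".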