[Corpora-List] transform UMLS into a bilingual lexicon
Pierre Zweigenbaum
pz at limsi.fr
Tue Nov 17 15:27:13 UTC 2009
Hi Emmanuel,
On Monday, 16 November 2009, at 17:15:12, Emmanuel Prochasson wrote:
> Dear all,
>
> I need a medical, specialised bilingual lexicon to run an experiment and
> plan on building one from the UMLS metathesaurus (French and English
> parts). However, even though it looks like it can be done with some
> time, the UMLS was not really designed for that and the file
> specification makes the process not straightforward.
>
> I think some of you might have had this problem before: does anyone
> have a solution for that (such as a simple script to run on the UMLS
> file)?
>
> Regards,
A common way to do this is to load the files into a relational
database such as MySQL and then use SQL statements to build
what you need. For instance, here is one of the SQL scripts
used to prepare the data in:
@InProceedings{Langlais:EACL2009,
  author    = {Philippe Langlais and François Yvon and Pierre Zweigenbaum},
  title     = {Improvements in Analogical Learning: Application to Translating Multi-Terms of the Medical Domain},
  booktitle = {Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)},
  year      = 2009,
  address   = {Athens, Greece},
  publisher = {Association for Computational Linguistics},
  pages     = {487--495},
  url       = {http://www.aclweb.org/anthology/E09-1056},
}
-- Collect pairs of UMLS terms (T1, T2)
-- where (T1, T2) are respectively in languages (L1, L2),
-- here through versions of the MeSH thesaurus (M: French, N: English)
-- see http://www.nlm.nih.gov/research/umls/metab3.html for TTY abbreviations
SET character_set_results = 'utf8';

SELECT M.CUI, M.STR, M.TTY, N.STR, N.TTY,
       M.LAT, M.TS, M.LUI, M.STT, M.SUI, M.ISPREF, M.SAB, M.CODE,
       N.LAT, N.TS, N.LUI, N.STT, N.SUI, N.ISPREF, N.SAB, N.CODE
FROM MRCONSO M
-- the WHERE conditions on N below discard unmatched rows,
-- so this LEFT JOIN behaves like an inner join
LEFT JOIN MRCONSO N
  ON M.CUI = N.CUI AND M.SDUI = N.SDUI   -- same concept, same source descriptor
WHERE M.SAB = 'MSHFRE'                   -- French MeSH
  AND N.SAB = 'MSH'                      -- English MeSH
  AND M.ISPREF = 'Y'                     -- preferred atoms only
  AND N.ISPREF = 'Y'
  AND M.STT = 'PF'                       -- preferred form of the term
  AND N.STT = 'PF'
  AND M.TTY = 'MH'                       -- MeSH main headings
  AND N.TTY = 'MH'
;
You'll notice that instead of simply collecting any pair of strings
(STR) in two languages that share the same concept identifier (CUI),
this script enforces more constraints on pairs of strings: both must
come from versions of the same source vocabulary (French and
English MeSH), and each must be a preferred string for its concept.
The hope is to find actual term translations rather than merely
co-referring terms. I append a simpler one below, which does not
filter on MeSH terms.
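If setting up MySQL is inconvenient, the same idea can probably be tried with Python and its standard sqlite3 module. What follows is only a minimal sketch under assumptions: the column list is the MRCONSO.RRF layout as I understand it (pipe-delimited, 18 fields, trailing pipe; please check it against the documentation of your release), the helper names load_mrconso and mesh_pairs are made up for illustration, and the query is a shortened form of the MeSH script above that returns only the CUI and the two strings.

```python
import sqlite3

# Assumed column order of MRCONSO.RRF; verify against your UMLS release.
MRCONSO_COLUMNS = [
    "CUI", "LAT", "TS", "LUI", "STT", "SUI", "ISPREF", "AUI", "SAUI",
    "SCUI", "SDUI", "SAB", "TTY", "CODE", "STR", "SRL", "SUPPRESS", "CVF",
]

# Shortened version of the MeSH query: French and English MeSH main headings
# that share a concept (CUI) and a source descriptor (SDUI).
MESH_PAIRS_SQL = """
SELECT M.CUI, M.STR, N.STR
FROM MRCONSO M
JOIN MRCONSO N ON M.CUI = N.CUI AND M.SDUI = N.SDUI
WHERE M.SAB = 'MSHFRE' AND N.SAB = 'MSH'
  AND M.ISPREF = 'Y' AND N.ISPREF = 'Y'
  AND M.STT = 'PF' AND N.STT = 'PF'
  AND M.TTY = 'MH' AND N.TTY = 'MH'
"""


def load_mrconso(conn, lines):
    """Create an MRCONSO table and fill it from an iterable of RRF lines."""
    column_defs = ", ".join(name + " TEXT" for name in MRCONSO_COLUMNS)
    conn.execute("CREATE TABLE MRCONSO (%s)" % column_defs)
    placeholders = ", ".join("?" for _ in MRCONSO_COLUMNS)
    # Each line ends with a trailing '|', so the split yields one extra
    # empty field; keep only the first 18. Blank lines are skipped.
    rows = (
        line.rstrip("\n").split("|")[: len(MRCONSO_COLUMNS)]
        for line in lines
        if line.strip()
    )
    conn.executemany("INSERT INTO MRCONSO VALUES (%s)" % placeholders, rows)
    # The self-join is on CUI, so an index there helps on the full file.
    conn.execute("CREATE INDEX idx_mrconso_cui ON MRCONSO (CUI)")
    conn.commit()


def mesh_pairs(conn):
    """Return (CUI, French string, English string) tuples."""
    return conn.execute(MESH_PAIRS_SQL).fetchall()
```

With a real release, something like `load_mrconso(conn, open("MRCONSO.RRF", encoding="utf-8"))` on a connection from `sqlite3.connect("umls.db")` should be enough to get started; both file names are placeholders for wherever your copies live.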
We can talk offline if you wish; we might come up
with queries better suited to your actual needs.
Best,
Pierre.
-- Collect pairs of UMLS terms (T1, T2)
-- where (T1, T2) are respectively in languages (L1, L2)
-- (here M: English, N: French, from any source vocabulary)
-- see http://www.nlm.nih.gov/research/umls/metab3.html for TTY abbreviations
SELECT M.CUI, M.STR, M.TTY, N.STR, N.TTY,
       M.LAT, M.TS, M.LUI, M.STT, M.SUI, M.ISPREF, M.SAB, M.CODE,
       N.LAT, N.TS, N.LUI, N.STT, N.SUI, N.ISPREF, N.SAB, N.CODE
FROM MRCONSO M
-- the WHERE conditions on N below discard unmatched rows,
-- so this LEFT JOIN behaves like an inner join
LEFT JOIN MRCONSO N
  ON M.CUI = N.CUI AND M.CODE = N.CODE   -- same concept, same source code
WHERE M.LAT = 'ENG'
  AND N.LAT = 'FRE'
  AND M.TTY != 'PM'                      -- skip machine permutations
  AND N.TTY != 'PM'
;
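To go from the rows returned by this kind of query to an actual lexicon file, one possible next step is sketched below in Python with the standard sqlite3 module. It is an illustration under assumptions, not a finished tool: it expects an MRCONSO table already loaded into SQLite, it keeps only the two strings from the query above, and the function names build_lexicon and write_tsv as well as the tab-separated output format are my own choices.

```python
import sqlite3
from collections import defaultdict

# Reduced form of the simpler query: distinct (English, French) string pairs
# sharing a concept (CUI) and a source code (CODE), machine permutations excluded.
PAIRS_SQL = """
SELECT DISTINCT M.STR, N.STR
FROM MRCONSO M
JOIN MRCONSO N ON M.CUI = N.CUI AND M.CODE = N.CODE
WHERE M.LAT = 'ENG' AND N.LAT = 'FRE'
  AND M.TTY != 'PM' AND N.TTY != 'PM'
"""


def build_lexicon(conn):
    """Map each English string to the sorted list of French strings paired with it."""
    lexicon = defaultdict(set)
    for english, french in conn.execute(PAIRS_SQL):
        lexicon[english].add(french)
    return {english: sorted(targets) for english, targets in lexicon.items()}


def write_tsv(lexicon, out):
    """Write one 'english<TAB>french' line per pair to a text file object."""
    for english in sorted(lexicon):
        for french in lexicon[english]:
            out.write(f"{english}\t{french}\n")
```

An English string may map to several French strings, which is why the values are lists. Whether to lower-case or otherwise normalise the strings before grouping depends on the experiment, so the sketch leaves them untouched.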
--
Pierre Zweigenbaum
----
LIMSI - CNRS
Groupe ILES / Dépt. Communication Homme-Machine
Tel: (+33) (0)1 69 85 80 04; Fax: (+33) (0)1 69 85 80 88
Email: pz at limsi.fr; Web: http://www.limsi.fr/~pz/
Location: Bâtiment 508, Université Paris-Sud 11
Postal address: LIMSI, BP 133, 91403 ORSAY Cedex, France
----
ERTIM, Institut National des Langues et Civilisations Orientales
----