Correspondence analysis for data mining with applications in medicine

Annie Morin, IRISA, France, amorin@irisa.fr

Correspondence analysis • Statistical visualization method for displaying the associations between the levels of a two-way contingency table and the distances between the categories of each variable => exploratory method • Usually, chi-square test for independence in a contingency table

CA • Duality between the rows and the columns • Use of the row profiles and of the column profiles • Use of the chi-square distance (distributional equivalence) • Factorial analysis method (eigenvalues of an ad hoc matrix) and reduction of dimensionality

Example : Frequency table

Row-profiles

Column-profiles

[Figure: plot showing documents D1 to D4 and the words animal, heart, forest, surgery]

Distances • Between two columns • Between two rows
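The chi-square distance compares two profiles while weighting each squared difference by the inverse of the corresponding marginal mass. The following is a minimal sketch in plain Python, assuming a small made-up table of four documents by four words; the counts and the function names are illustrative and do not come from the slides.

```python
# Hypothetical counts: rows = documents D1..D4, columns = words.
words = ["animal", "heart", "forest", "surgery"]
table = [
    [4, 0, 3, 0],   # D1
    [0, 5, 0, 4],   # D2
    [3, 1, 4, 0],   # D3
    [0, 4, 1, 5],   # D4
]

n = sum(sum(row) for row in table)                       # grand total
row_mass = [sum(row) / n for row in table]               # marginal weight of each row
col_mass = [sum(row[j] for row in table) / n
            for j in range(len(words))]                  # marginal weight of each column


def row_profile(i):
    """Row i divided by its total, so the profile sums to 1."""
    total = sum(table[i])
    return [x / total for x in table[i]]


def col_profile(j):
    """Column j divided by its total, so the profile sums to 1."""
    total = sum(row[j] for row in table)
    return [row[j] / total for row in table]


def chi2_dist_rows(i, k):
    """Chi-square distance between row profiles i and k (weights 1/column mass)."""
    p, q = row_profile(i), row_profile(k)
    return sum((a - b) ** 2 / m for a, b, m in zip(p, q, col_mass)) ** 0.5


def chi2_dist_cols(j, l):
    """Chi-square distance between column profiles j and l (weights 1/row mass)."""
    p, q = col_profile(j), col_profile(l)
    return sum((a - b) ** 2 / m for a, b, m in zip(p, q, row_mass)) ** 0.5
```

With these made-up counts, `chi2_dist_rows(0, 2)` is smaller than `chi2_dist_rows(0, 1)`, because D1 and D3 use the same words in similar proportions.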

Diagonalization of a « covariance matrix » to find the eigenvalues and corresponding eigenvectors • λ1 ≥ λ2 ≥ … ≥ λp • Inertia of the cloud is ∑λi = χ² / n • Distance to the independence model
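The link between the eigenvalues and the chi-square statistic can be checked numerically. The sketch below assumes NumPy and a made-up 4 x 4 table; instead of diagonalizing the matrix explicitly, it takes the singular value decomposition of the standardized residuals, whose squared singular values are the eigenvalues. The variable names are illustrative.

```python
import numpy as np

# Hypothetical contingency table (rows = documents, columns = words).
N = np.array([[4, 0, 3, 0],
              [0, 5, 0, 4],
              [3, 1, 4, 0],
              [0, 4, 1, 5]], dtype=float)

n = N.sum()                 # grand total
P = N / n                   # correspondence matrix (relative frequencies)
r = P.sum(axis=1)           # row masses
c = P.sum(axis=0)           # column masses
E = np.outer(r, c)          # frequencies expected under independence

# Standardized residuals: deviation from the independence model.
S = (P - E) / np.sqrt(E)

# Squared singular values of S are the eigenvalues, in decreasing order.
singular_values = np.linalg.svd(S, compute_uv=False)
eigenvalues = singular_values ** 2

# Pearson chi-square statistic of the table.
chi2 = n * ((P - E) ** 2 / E).sum()

total_inertia = eigenvalues.sum()
print(np.isclose(total_inertia, chi2 / n))
```

The last eigenvalue is numerically zero: a table with I rows and J columns has at most min(I, J) - 1 non-trivial axes.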

Simultaneous representation • Of the row and column profiles on the same factorial plane • Validity of the representation : • Inertia : contributions describe the proportion of the variance of an axis provided by each element (row or column profile) in building that axis • Quality of representation of each element by the axes
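Both diagnostics can be derived from the principal coordinates. The sketch below assumes NumPy, a made-up table, and the usual definitions: the contribution of an element to an axis is its mass times its squared coordinate divided by the eigenvalue, and the quality (squared cosine) is its squared coordinate divided by its squared chi-square distance to the centroid. All names are illustrative.

```python
import numpy as np

# Hypothetical contingency table (rows = documents, columns = words).
N = np.array([[4, 0, 3, 0],
              [0, 5, 0, 4],
              [3, 1, 4, 0],
              [0, 4, 1, 5]], dtype=float)

P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)          # row and column masses
E = np.outer(r, c)
S = (P - E) / np.sqrt(E)                     # standardized residuals
U, sing, Vt = np.linalg.svd(S, full_matrices=False)

k = 2                                        # keep the first factorial plane
lam = sing[:k] ** 2                          # eigenvalues of the kept axes

# Principal coordinates of rows (F) and columns (G) on the kept axes.
F = (U[:, :k] * sing[:k]) / np.sqrt(r)[:, None]
G = (Vt.T[:, :k] * sing[:k]) / np.sqrt(c)[:, None]

# Contributions: share of each axis's inertia due to each element.
row_contrib = (r[:, None] * F ** 2) / lam    # each column sums to 1
col_contrib = (c[:, None] * G ** 2) / lam    # each column sums to 1

# Quality: squared cosine of each row profile with each kept axis.
F_all = (U * sing) / np.sqrt(r)[:, None]     # coordinates on all axes
row_dist2 = (F_all ** 2).sum(axis=1)         # squared distance to the centroid
row_cos2 = F ** 2 / row_dist2[:, None]       # row sums are at most 1
```

A row whose `row_cos2` values sum to nearly 1 over the two axes is well represented on the plane; a low sum means its position on the map should be read with caution.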

Applications in medicine • Pharmacology • Therapeutic trials (to avoid double blind procedures) : CA allows the physician to follow the evolution of the illness and/or of the therapy • Textual analysis : reports, business intelligence, bibliometrics

Application to mucoviscidosis • Mucoviscidosis (cystic fibrosis) : rare disease • No specific keywords • No specific journals • Goal : To define a minimum common vocabulary for the researchers working on mucoviscidosis (clinicians, geneticists, etc.)

SURGEON WORDS • GENETICS WORDS • TOPIC WORDS • HYPOTHESIS : THE TYPICAL WORDS FOR A GIVEN TOPIC ARE INDEPENDENT OF THE TECHNIQUES

Processing • First step of the study : to create a “kernel” base which contains the references of scientific documents used by people working on the disease => 612 publications

30 axes with a positive side and a negative one • Each side of each axis is characterized by the words with a high relative contribution to the inertia (greater than a threshold).

DATA • Two-way table crossing the 612 documents (abstracts) and 850 words • CA on this two-way table

Dimension of a word • The words of a topic are one-dimensional • The words of a field are multidimensional • The dimension of a word is the number of axes on which this word has a high relative contribution to inertia • If we want to find the minimum common vocabulary, the dimension of a word must be high
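Counting the dimension of a word amounts to counting the axes on which its relative contribution exceeds a threshold. The sketch below assumes the contributions are already available as one list per word; the words, the values, the threshold and the function name are all hypothetical, and the two sides of each axis are not distinguished.

```python
def word_dimensions(contrib, threshold):
    """Return, for each word, the number of axes where its relative
    contribution to inertia is strictly above the threshold.

    contrib maps a word to a list with one contribution per axis.
    """
    return {
        word: sum(1 for value in per_axis if value > threshold)
        for word, per_axis in contrib.items()
    }


# Hypothetical relative contributions on four axes.
contrib = {
    "word_a": [0.30, 0.25, 0.02, 0.20],   # high on three axes
    "word_b": [0.01, 0.02, 0.40, 0.03],   # high on one axis only
    "word_c": [0.15, 0.18, 0.12, 0.16],   # high on every axis
}

dims = word_dimensions(contrib, threshold=0.10)

# Words that qualify for a common vocabulary under a chosen minimum dimension.
common_vocabulary = sorted(w for w, d in dims.items() if d >= 3)
```

Here `word_b` behaves like a one-dimensional topic word, while `word_a` and `word_c` are multidimensional and would be kept.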

MUCOVISCIDOSIS BASE

81 words have a dimension greater than 10

Is a high dimension a sufficient condition to characterize the disease? To check it, we use other thematic databases and in each of them, we count the number of documents with at least two words among the previous 81 words.
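The check described above is a simple count per database. A minimal sketch, assuming each document is a plain string, that words are separated by whitespace, and that "at least two words" means two distinct words; the documents and the word list are placeholders, not the actual 81 words.

```python
def count_matching_documents(documents, keywords, min_hits=2):
    """Count documents containing at least min_hits distinct keywords."""
    keyword_set = {k.lower() for k in keywords}
    count = 0
    for text in documents:
        tokens = set(text.lower().split())      # naive whitespace tokenization
        if len(tokens & keyword_set) >= min_hits:
            count += 1
    return count


# Placeholder data: w1..w3 stand for words of the selected vocabulary.
keywords = ["w1", "w2", "w3"]
documents = [
    "w1 w2 x y",     # two keywords: retrieved
    "w1 x",          # one keyword: not retrieved
    "w1 w2 w3",      # three keywords: retrieved
    "x y z",         # no keyword: not retrieved
]

retrieved = count_matching_documents(documents, keywords)
retrieval_rate = retrieved / len(documents)
```

Running the same count on each thematic database and comparing the rates shows whether the vocabulary retrieves the target base without also retrieving the others.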

5 thematic databases • BREAST CANCER : 9871 documents • POLYAMINES : 12726 documents • LEUCOCYTE INFILTRATED TUMOR : 586 documents • ACUTE LYMPHOBLASTIC LEUKEMIA : 2063 documents • MUCOVISCIDOSIS : 612 documents

RETRIEVAL STATISTICS WITH THE 81 WORDS

CA of the 5 databases and 81 words

20 remaining words

Retrieval statistics with these 20 words

Conclusion • CA is a very powerful method to display the associations among variables • It can be used with large datasets (one of the dimensions must be « tractable »)

Thanks to Michel Kerbaol for allowing me to use his data on mucoviscidosis • Michel.Kerbaol@univ-rennes1.fr • Software : Qnomis
