|
1
|
- Speaker: Xin Yan
- Joint work with Prof. Shaw-Hwa Lo and Prof. Tian Zheng
- This research was supported in part by the National Science Foundation
|
|
2
|
- Background information
- The process of gene expression:
- DNA chips and Microarrays
- A technology that enables researchers to examine transcription levels
for thousands of genes at one time.
- cDNA microarrays and oligonucleotide microarrays
|
|
3
|
- A typical setting of two-category cancer classification
|
|
4
|
- Previous gene selection methods in cancer classification
- Golub et al., 1999
- Neighborhood analysis
- Measure of correlation
- The set of informative genes consists of the n/2 genes closest to a
class vector high in class 1 and the n/2 genes closest to a class vector
high in class 2
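The correlation measure in Golub et al. (1999) is the signal-to-noise ratio P(g, c) = (μ1(g) − μ2(g)) / (σ1(g) + σ2(g)). The following is a minimal sketch of ranking genes with it, not the authors' code. The function names, the class labels 1 and 2, and the use of the sample standard deviation (`ddof=1`) are assumptions made for illustration.

```python
import numpy as np

def signal_to_noise(X, y):
    """Per-gene P(g, c) = (mu1 - mu2) / (sigma1 + sigma2).

    X: (samples, genes) expression matrix; y: labels in {1, 2}.
    Assumption: sample standard deviation (ddof=1).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    c1, c2 = X[y == 1], X[y == 2]
    return (c1.mean(axis=0) - c2.mean(axis=0)) / (
        c1.std(axis=0, ddof=1) + c2.std(axis=0, ddof=1)
    )

def informative_genes(X, y, n):
    """Return (genes high in class 1, genes high in class 2), n // 2 each."""
    order = np.argsort(signal_to_noise(X, y))
    high_in_class2 = order[: n // 2]        # most negative scores
    high_in_class1 = order[::-1][: n // 2]  # most positive scores
    return high_in_class1, high_in_class2
```

A positive score marks a gene that is higher in class 1, and a negative score marks one that is higher in class 2.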
|
|
5
|
- Tusher et al., 2001
- Significance Analysis of Microarrays (SAM)
- Measure of relative difference
- Gene-specific scatter s(i) is the standard deviation of repeated
expression measurements
- s0 is chosen to minimize the coefficient of variation.
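The SAM relative difference in Tusher et al. (2001) is d(i) = (x̄_I(i) − x̄_U(i)) / (s(i) + s0), where s(i) pools the within-group sums of squares. The sketch below computes d(i) for a given s0. The function name and argument layout are assumptions, and the search for the s0 that minimizes the coefficient of variation is deliberately not implemented here.

```python
import numpy as np

def sam_relative_difference(X_I, X_U, s0=0.0):
    """Relative difference d(i) for every gene.

    X_I, X_U: (samples, genes) matrices for the two states.
    s0: fudge constant, passed in by the caller (not estimated here).
    """
    X_I = np.asarray(X_I, dtype=float)
    X_U = np.asarray(X_U, dtype=float)
    n1, n2 = X_I.shape[0], X_U.shape[0]
    m1, m2 = X_I.mean(axis=0), X_U.mean(axis=0)
    # Pooled gene-specific scatter s(i)
    a = (1.0 / n1 + 1.0 / n2) / (n1 + n2 - 2)
    ss = ((X_I - m1) ** 2).sum(axis=0) + ((X_U - m2) ** 2).sum(axis=0)
    s = np.sqrt(a * ss)
    return (m1 - m2) / (s + s0)
```

A larger s0 shrinks |d(i)| most for genes with small scatter, which is why it stabilizes the ranking of low-variance genes.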
|
|
6
|
- Dataset used in our analysis
- Efron et al. (2001)
- Training:
- 24 type I (less serious stomach cancer) samples
- 24 type II (more serious stomach cancer) samples
- Van’t Veer et al. (2002)
- Training:
- 44 good prognosis breast cancer patients
- 34 poor prognosis breast cancer patients
|
|
7
|
- Gene Selection by GPAS
- Discretizing
- Assume each gene has 3 states, a, b, and c, indicating under-expressed,
normally expressed, and over-expressed, respectively.
- K-means, with initial points learned from the data
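The discretizing step can be illustrated with a one-dimensional 3-means on a single gene's expression values. This is only a sketch: the slide does not say how the initial points are learned, so starting from the 10%, 50%, and 90% quantiles is an assumption, as is the function name.

```python
import numpy as np

def discretize_gene(x, n_iter=100):
    """Map one gene's expression values to states 'a', 'b', 'c' via 1-D 3-means."""
    x = np.asarray(x, dtype=float)
    # Assumption: initial centers taken from data quantiles.
    centers = np.quantile(x, [0.1, 0.5, 0.9])
    for _ in range(n_iter):
        labels = np.abs(x[:, None] - centers[None, :]).argmin(axis=1)
        new_centers = np.array([
            x[labels == k].mean() if np.any(labels == k) else centers[k]
            for k in range(3)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Sort centers so that 'a' is the lowest cluster and 'c' the highest.
    centers = np.sort(centers)
    labels = np.abs(x[:, None] - centers[None, :]).argmin(axis=1)
    return np.array(list("abc"))[labels]
```

For example, `discretize_gene([-2.1, -1.9, -2.0, 0.1, -0.1, 0.0, 2.0, 1.9, 2.1])` assigns the first three values to `a`, the middle three to `b`, and the last three to `c`.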
|
|
8
|
- Multiple gene profiles
- Randomly pick 10 genes out of all genes. For a given sample, the vector
of the corresponding gene states is called a multiple gene profile.
- For a given sample, there are a total of N = 3^10 possible gene
profiles; denote the sample space Ω = {g1, g2, …, gN}.
- Important statistics
- Gene Profile Total Difference (GPTD)
- Where
- ni1 = count of observed profile = gi in cancer I samples
- ni2 = count of observed profile = gi in cancer II samples
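The counts ni1 and ni2 can be tabulated directly from the discretized data, as in the sketch below. The GPTD formula itself is not reproduced in this text, so `total_difference` is only a stand-in that assumes a sum of squared count differences; it should not be read as the definition used in the talk. All function names are hypothetical.

```python
from collections import Counter

def profile_counts(states, labels, genes):
    """Count observed multi-gene profiles per class.

    states: one sequence of gene states ('a'/'b'/'c') per sample
    labels: class per sample, 1 (cancer I) or 2 (cancer II)
    genes:  indices of the selected genes (10 in the talk)
    """
    n1, n2 = Counter(), Counter()
    for row, lab in zip(states, labels):
        profile = tuple(row[g] for g in genes)
        (n1 if lab == 1 else n2)[profile] += 1
    return n1, n2

def total_difference(n1, n2):
    # ASSUMPTION: sum over observed profiles of (ni1 - ni2)^2.
    # The statistic on the original slide may differ.
    return sum((n1[p] - n2[p]) ** 2 for p in set(n1) | set(n2))
```

Only observed profiles need to be stored: with 10 genes there are 3^10 = 59,049 possible profiles, far more than the number of samples, so almost all counts are zero.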
|
|
9
|
- Gene Profile Association Score (GPAS)
- Difference of GPTD
- Definition of GPAS
- The adjustment term ensures that GPAS has an expectation of 0 under
the null hypothesis. Please refer to our paper for detailed
inference on GPAS.
|
|
10
|
- Example:
- Randomly draw 10 genes, for instance, k: (6,7,8,…15) -> (1,2,…10).
GPASk1 = 0; GPASk2 = (1-0)(1-0) + (0-1)(0-1) = 2; GPASk3 = 0, since no
profiles are replicated when the first or the third gene is excluded.
- Delete the second gene, k: (6,8,…15) -> (1,3,…10), and repeat the above
procedure.
- Genes finally remaining for k: (6,9) -> (1,3)
- Repeat the process 500,000 times.
- Count the survival frequency for each gene, counts = (F1, …, F2638)
- Choose genes based on significance of counts.
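The draw, delete, and count loop in this example can be sketched as follows. The scoring function is supplied by the caller because the GPAS formula is not reproduced in this text. Deleting the gene with the largest score follows the example above; stopping when no score is positive is an assumption. The names `survival_counts` and `score` are hypothetical.

```python
import random
from collections import Counter

def survival_counts(n_genes, score, subset_size=10, n_repeats=1000, seed=0):
    """Repeat random draws with backward deletion and count surviving genes.

    score(kept, g) -> float: association score of gene g within the
    currently kept genes (a stand-in for GPAS).
    """
    rng = random.Random(seed)
    survival = Counter()
    for _ in range(n_repeats):
        kept = rng.sample(range(n_genes), subset_size)
        while kept:
            scores = [score(kept, g) for g in kept]
            worst = max(range(len(kept)), key=scores.__getitem__)
            if scores[worst] <= 0:  # assumed stopping rule
                break
            del kept[worst]         # delete the highest-scoring gene
        survival.update(kept)       # genes that survived this draw
    return survival
```

With a toy score that returns 1.0 for uninformative genes and 0.0 for informative ones, only the informative genes accumulate survival counts, which is the behavior the final frequency ranking relies on.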
|
|
11
|
|
|
12
|
- Validation Process and Results
- Efron data
- In each turn, 20 of the 24 type I samples and 20 of the 24 type II
samples were randomly chosen for training.
- Based on the 40 training samples, the top 50 genes were selected by
each of Gene Voting, t-test, correlation, and GPAS.
- The remaining 8 samples were classified using both Golub’s classifier
and diagonal linear discriminant analysis (DLDA).
- In total, 10 turns were made.
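One turn of this validation scheme can be sketched as a stratified random split followed by DLDA on the held-out samples. This is an illustrative sketch, not the code used for the reported tables. The function names, equal class priors, and the pooled per-gene variance estimate are assumptions, and gene selection (done on the training samples only) is left out.

```python
import numpy as np

def stratified_split(y, n_train_per_class, rng):
    """Randomly pick a fixed number of training samples from each class."""
    y = np.asarray(y)
    train = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=k, replace=False)
        for c, k in n_train_per_class.items()
    ])
    test = np.setdiff1d(np.arange(len(y)), train)
    return train, test

def dlda_fit(X, y):
    """Class means plus one pooled variance per gene (diagonal covariance)."""
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    resid = X - means[np.searchsorted(classes, y)]
    var = (resid ** 2).sum(axis=0) / (len(y) - len(classes))
    return classes, means, var

def dlda_predict(model, X):
    """Assign each sample to the class with the smallest scaled distance."""
    classes, means, var = model
    dist = (((X[:, None, :] - means[None, :, :]) ** 2) / var).sum(axis=2)
    return classes[dist.argmin(axis=1)]
```

For the setting above, `stratified_split(y, {1: 20, 2: 20}, np.random.default_rng(0))` leaves 8 test samples; repeating it with different seeds gives the separate turns.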
|
|
13
|
- Table I: Test Set Error for Efron data
|
|
14
|
- Breast Cancer Data
- In each turn, 34 of the 44 good prognosis samples and 26 of the 34
poor prognosis samples were randomly chosen for training.
- Based on the 60 training samples, the top 50 genes were selected by
each of Gene Voting, t-test, correlation, and GPAS.
- The remaining 18 samples were classified using both Golub’s
classification method and diagonal linear discriminant analysis (DLDA).
- In total, 15 turns were made.
- Remark: Both data sets were standardized before analysis. For the
breast cancer data, we used a set of 4918 genes obtained after
preliminary processing.
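Golub's classification method is a weighted vote: each selected gene g votes a_g (x_g − b_g), where a_g is the signal-to-noise score and b_g is the midpoint of the two class means. The sketch below assumes class labels 1 and 2, the sample standard deviation, and a tie broken in favor of class 1; the function names are hypothetical, and the prediction-strength threshold of the original method is omitted.

```python
import numpy as np

def weighted_voting_fit(X, y):
    """Per-gene vote weight a_g and decision boundary b_g from training data."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    c1, c2 = X[y == 1], X[y == 2]
    m1, m2 = c1.mean(axis=0), c2.mean(axis=0)
    weight = (m1 - m2) / (c1.std(axis=0, ddof=1) + c2.std(axis=0, ddof=1))
    boundary = (m1 + m2) / 2.0
    return weight, boundary

def weighted_voting_predict(model, X):
    """Sum the votes for each class and return the winning class per sample."""
    weight, boundary = model
    votes = weight * (np.asarray(X, dtype=float) - boundary)
    v1 = np.clip(votes, 0, None).sum(axis=1)   # votes for class 1
    v2 = np.clip(-votes, 0, None).sum(axis=1)  # votes for class 2
    return np.where(v1 >= v2, 1, 2)
```

The model is fitted on the training samples restricted to the selected genes and then applied to the held-out samples of the same turn.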
|
|
15
|
- Table II: Test Set Error for Breast Cancer Data
|
|
16
|
|
|
17
|
|
|
18
|
- Discussion
- Major advantage of GPAS
- While agreeing with marginal methods (such as the t-test p-value), GPAS
takes into account the effects of other genes within the multi-gene
profile.
- Informative genes picked up by GPAS despite a comparatively large
p-value may interest biologists who have prior knowledge of candidate
genes; their effect cannot be seen when looking at a single gene only.
- Major disadvantages
- GPAS is derived from a greedy learning process. To make sense of the
data, a huge number of repetitions is needed. Computation time in C
was about 6 hours for a 4918 × 78 data set, running 500,000 different
10-gene profiles.
|
|
19
|
- Future Objectives
- When running the GPAS process, we observed that certain genes were
always returned together with certain other genes. Based on this
phenomenon, several clusters can be produced by treating it as a problem
similar to the “financial basket”.
- Hopefully, these clusters will help biologists explore the genetic
networks for a particular molecular type of cancer.
- Acknowledgment
- Thanks to Prof. Efron and Prof. Tibshirani for providing the 4918 ×
78 breast cancer data.
- The End
|