1
 Greedy Learning from Multiple Gene Profiles for Selecting Informative Genes in Molecular Cancer Classification
  • Speaker: Xin Yan


  • Joint work with Prof. Shaw-Hwa Lo and Prof. Tian Zheng
  • This research was supported in part by the National Science Foundation
2
  • Background information
    • The process of gene expression:






    • DNA chips and Microarrays
      • A technology that makes it possible to examine transcription levels for thousands of genes at one time.
      • cDNA microarrays and oligonucleotide microarrays


3
  • A typical setting of two-category cancer classification








4
  • Previous gene selection methods in cancer classification
    • Golub et al., 1999
      • Neighborhood analysis
      • Measure of correlation



      • The set of informative genes consists of the n/2 genes closest to a class vector that is high in class 1 and the n/2 genes closest to a class vector that is high in class 2.
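A minimal Python sketch of this selection rule, assuming the measure of correlation is the signal-to-noise statistic of Golub et al. (1999), P(g, c) = (mu1 - mu2) / (sigma1 + sigma2). The function names and the `expr` / `labels` data layout are hypothetical, chosen only for illustration.

```python
from statistics import mean, stdev

def signal_to_noise(class1_values, class2_values):
    """(mu1 - mu2) / (sigma1 + sigma2) for one gene.
    Assumes at least 2 samples per class and a nonzero spread."""
    return (mean(class1_values) - mean(class2_values)) / (
        stdev(class1_values) + stdev(class2_values))

def select_informative(expr, labels, n):
    """expr: dict gene -> list of expression values (one per sample).
    labels: list of 1 or 2, one per sample (assumed layout).
    Returns (genes high in class 1, genes high in class 2), n // 2 each."""
    scores = {}
    for gene, values in expr.items():
        c1 = [v for v, y in zip(values, labels) if y == 1]
        c2 = [v for v, y in zip(values, labels) if y == 2]
        scores[gene] = signal_to_noise(c1, c2)
    ranked = sorted(scores, key=scores.get)       # most negative first
    half = n // 2
    high_in_class2 = ranked[:half]                # strongly negative scores
    high_in_class1 = ranked[len(ranked) - half:]  # strongly positive scores
    return high_in_class1, high_in_class2
```

A positive score means the gene is more highly expressed in class 1, a negative score in class 2, so the two halves are taken from opposite ends of the ranking.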
5
    • Tusher et al., 2001
      • Significance Analysis of Microarrays (SAM)


      • Measure of relative difference



      • Gene-specific scatter s(i) is the standard deviation of repeated expression measurements:



      • s0 is chosen to minimize the coefficient of variation of the relative difference.
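A minimal Python sketch of the relative difference, assuming the definitions published in Tusher et al. (2001): d(i) = (mean_I(i) - mean_U(i)) / (s(i) + s0), with s(i) the pooled scatter over the two groups. The function name is hypothetical, and s0 is simply passed in; the search for the s0 that minimizes the coefficient of variation is not shown.

```python
from math import sqrt
from statistics import mean

def relative_difference(x_i, x_u, s0=0.0):
    """SAM-style relative difference for one gene.
    x_i, x_u: repeated expression measurements in the two states.
    s0: small positive constant added to the gene-specific scatter."""
    n1, n2 = len(x_i), len(x_u)
    m1, m2 = mean(x_i), mean(x_u)
    # Pooled sum of squares around each group's own mean.
    ss = sum((x - m1) ** 2 for x in x_i) + sum((x - m2) ** 2 for x in x_u)
    a = (1.0 / n1 + 1.0 / n2) / (n1 + n2 - 2)
    s = sqrt(a * ss)  # gene-specific scatter s(i)
    return (m1 - m2) / (s + s0)
```

Increasing s0 shrinks |d(i)| most for genes with small scatter, which is what keeps low-expression genes with tiny s(i) from dominating the ranking.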
6
  • Dataset used in our analysis
    • Efron et al. (2001)
      • Training:
        • 24 type I (less serious stomach cancer) samples
        • 24 type II (more serious stomach cancer) samples
    • Van’t Veer et al. (2002)
      • Training:
        • 44 good prognosis breast cancer patients
        • 34 poor prognosis breast cancer patients


7
  • Gene Selection by GPAS
    • Discretizing
      • Assume each gene has 3 states: a, b, and c, indicating under-expressed, normally expressed, and over-expressed, respectively.
      • K-means, with initial points learned from the data
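A minimal Python sketch of the discretizing step for a single gene, using one-dimensional k-means with three clusters. How the initial points are learned from the data is not specified here, so starting from the minimum, median, and maximum of the gene's values is an assumption, as is the function name.

```python
from statistics import mean, median

def discretize_gene(values, n_iter=100):
    """Map each expression value of one gene to state 'a', 'b', or 'c'
    (under-, normally, over-expressed) with 1-D k-means, k = 3."""
    # Assumed initialisation: data-driven starting centers.
    centers = [min(values), median(values), max(values)]
    assign = [0] * len(values)
    for _ in range(n_iter):
        # Assignment step: nearest center for every value.
        assign = [min(range(3), key=lambda k: abs(v - centers[k]))
                  for v in values]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for k in range(3):
            members = [v for v, a in zip(values, assign) if a == k]
            new_centers.append(mean(members) if members else centers[k])
        if new_centers == centers:
            break  # converged
        centers = new_centers
    # Label clusters by increasing center: lowest is 'a', highest is 'c'.
    order = sorted(range(3), key=lambda k: centers[k])
    state_of = {k: "abc"[rank] for rank, k in enumerate(order)}
    return [state_of[a] for a in assign]
```

Applying this gene by gene turns the expression matrix into a matrix of states, which is the input assumed by the profile-based statistics.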
8
    • Multiple gene profiles
      • Randomly pick 10 genes out of all genes. For a given sample, the vector of the corresponding gene states is called a multiple gene profile.
      • For a given sample there are N = 3^10 possible gene profiles; denote the sample space by Ω = {g1, g2, …, gN}.
    • Important statistics
      • Gene Profile Total Difference (GPTD)




        • Where


        • ni1 = number of cancer I samples whose observed profile equals gi
        • ni2 = number of cancer II samples whose observed profile equals gi
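The counts ni1 and ni2 can be tabulated as in the Python sketch below. It covers only the counting step, not GPTD itself. The data layout (`states[s][g]` holding the state of gene g in sample s, labels coded 1 and 2) and the function name are assumptions made for illustration.

```python
from collections import Counter

def profile_counts(states, labels, genes):
    """Count observed multiple gene profiles in each cancer class.
    states: list of per-sample sequences of gene states ('a', 'b', 'c').
    labels: list of 1 (cancer I) or 2 (cancer II), one per sample.
    genes:  indices of the selected genes (e.g. 10 random genes).
    Returns (n1, n2): Counters mapping profile tuple -> count."""
    n1, n2 = Counter(), Counter()
    for sample, label in zip(states, labels):
        profile = tuple(sample[g] for g in genes)
        (n1 if label == 1 else n2)[profile] += 1
    return n1, n2
```

Only profiles that actually occur are stored, so the table has at most as many entries as there are samples even though 3^10 profiles are possible.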
9
      • Gene Profile Association Score (GPAS)
        • Difference of GPTD







        • Definition of GPAS



          • The adjustment term ensures that GPAS has an expectation of 0 under the null hypothesis. Please refer to our paper for detailed inference about GPAS.

10
  • Example:




    • Randomly draw 10 genes, for instance k: (6, 7, 8, …, 15) -> (1, 2, …, 10). GPASk1 = 0; GPASk2 = (1-0)(1-0) + (0-1)(0-1) = 2; GPASk3 = 0, since no profiles are replicated when the first or the third gene is excluded.
    • Delete the second gene, k: (6, 8, …, 15) -> (1, 3, …, 10), and repeat the above procedure.
    • Genes finally remaining for k: (6, 9) -> (1, 3)
    • Repeat the process 500,000 times.
    • Count the survival frequency of each gene: counts = (F1, …, F2638)
    • Choose genes based on the significance of the counts.
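The draw, delete, and count loop can be sketched in Python as follows. This is only a sketch under stated assumptions: `score_fn` is a hypothetical stand-in for the GPAS computation, which is not reproduced here, and both the deletion rule (drop the gene with the largest score while that score is positive) and the stopping rule are read off the example rather than taken from a full specification.

```python
import random
from collections import Counter

def greedy_survival_counts(n_genes, score_fn, n_repeats,
                           subset_size=10, seed=0):
    """Count how often each gene survives backward elimination.
    score_fn(genes) -> list of scores, one per gene in `genes`
    (hypothetical placeholder for the per-gene GPAS values)."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_repeats):
        # Randomly draw a subset of genes.
        genes = rng.sample(range(n_genes), subset_size)
        while len(genes) > 1:
            scores = score_fn(genes)
            worst = max(range(len(genes)), key=scores.__getitem__)
            if scores[worst] <= 0:   # assumed stopping rule
                break
            del genes[worst]         # assumed deletion rule
        counts.update(genes)         # the remaining genes "survive"
    return [counts[g] for g in range(n_genes)]
```

The returned list plays the role of the counts (F1, …); genes would then be ranked by how unusually high their survival frequency is.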
12
  • Validation Process and Results
    • Efron data
        • In each round, 20 samples were randomly chosen from the 24 type I samples and another 20 from the 24 type II samples.
        • Based on the 40 training samples, the top 50 genes were selected by each of Gene Voting, t-test, correlation, and GPAS.
        • The remaining 8 samples were classified using both Golub’s classifier and diagonal linear discriminant analysis (DLDA).
        • In total, 10 rounds were run.
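One round of this per-class random split can be sketched in Python as below. The function name and the label coding (1 for type I, 2 for type II) are assumptions for illustration; the gene selection and classification steps are not shown.

```python
import random

def stratified_split(labels, n_train_per_class, rng):
    """Randomly pick a fixed number of training samples from each class.
    labels: class label per sample.
    n_train_per_class: dict class -> number of training samples.
    Returns (train_indices, test_indices)."""
    train, test = [], []
    for cls, n_train in n_train_per_class.items():
        idx = [i for i, y in enumerate(labels) if y == cls]
        chosen = set(rng.sample(idx, n_train))
        train += [i for i in idx if i in chosen]
        test += [i for i in idx if i not in chosen]
    return train, test

# One round on a 24 + 24 design: 40 training and 8 test samples.
labels = [1] * 24 + [2] * 24
train, test = stratified_split(labels, {1: 20, 2: 20}, random.Random(0))
```

Repeating the call with a fresh random state for each round gives the repeated random splits; the same function covers a 34-of-44 and 26-of-34 design by changing `n_train_per_class`.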
13
    • Table I: Test Set Error for Efron data
14
    • Breast Cancer Data
        • In each round, 34 samples were randomly chosen from the 44 good prognosis samples and another 26 from the 34 poor prognosis samples.
        • Based on the 60 training samples, the top 50 genes were selected by each of Gene Voting, t-test, correlation, and GPAS.
        • The remaining 18 samples were classified using both Golub’s classification method and diagonal linear discriminant analysis (DLDA).
        • In total, 15 rounds were run.
    • Remark: Both data sets were standardized before analysis. For the breast cancer data, we used a set of 4918 genes obtained after preliminary processing.
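For reference, a minimal Python sketch of a DLDA classifier in its textbook form: class means per gene, one pooled within-class variance per gene, equal class priors, and assignment to the class with the smallest variance-scaled squared distance. The function names are hypothetical, and this is not necessarily the exact implementation used for the reported results.

```python
from statistics import mean

def dlda_fit(X, y):
    """X: list of samples, each a list of (selected) gene values.
    y: class label per sample.
    Returns (class means, pooled per-gene variances).
    Assumes every gene has nonzero pooled variance."""
    classes = sorted(set(y))
    n_genes = len(X[0])
    means = {
        c: [mean(row[j] for row, lab in zip(X, y) if lab == c)
            for j in range(n_genes)]
        for c in classes
    }
    dof = len(X) - len(classes)
    variances = []
    for j in range(n_genes):
        ss = sum((row[j] - means[lab][j]) ** 2 for row, lab in zip(X, y))
        variances.append(ss / dof)
    return means, variances

def dlda_predict(model, x):
    """Assign x to the class with the smallest scaled squared distance."""
    means, variances = model
    return min(means, key=lambda c: sum(
        (xj - mj) ** 2 / vj for xj, mj, vj in zip(x, means[c], variances)))
```

Because the covariance matrix is taken to be diagonal, each gene contributes independently, which keeps the classifier usable when there are far more genes than samples.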


15
    • Table II: Test Set Error for Breast Cancer Data



18
  • Discussion
    • Major advantage of GPAS
      • While agreeing with marginal methods (such as the t-test p-value), GPAS takes into account the effects of other genes within the multi-gene profile.
      • Informative genes picked up by GPAS despite a comparatively large marginal p-value may be of interest to biologists who have some preliminary knowledge of interesting genes; such genes cannot be detected by looking at a single gene only.
    • Major disadvantages
      • GPAS is derived from a greedy learning process. To make sense of the data, a huge number of repetitions is needed. Computation time in C was about 6 hours for a 4918 × 78 data set, running 500,000 different 10-gene profiles.


19
  • Future Objectives
    • When we ran the GPAS process, we observed that certain genes were always returned together with certain other genes. Based on this phenomenon, several clusters can be produced by treating this as a problem similar to the “financial basket”.
    • Hopefully, these clusters will help biologists explore the genetic networks for a particular type of molecular cancer.
  • Acknowledgement
    • Thanks to Prof. Efron and Prof. Tibshirani for providing the 4918 × 78 breast cancer data.




    • The End