July 2, 2015

Introduction

Goal

  • Genomic data is high-dimensional:
    • thousands of features
    • from a few to many samples, often with many covariates
  • Extracting and representing the main patterns of the data is complex
  • New package for visualization of such high-dimensional datasets
    → for exploratory analysis of genomic data in R

Example dataset: ALL dataset (Chiaretti, et al. 2004)

  • microarrays of patients with Acute Lymphoblastic Leukemia
  • annotation:
    • 12143 unique genes, gene annotation with hgu95av2.db
      PROBEID    ENTREZID  SYMBOL   GENENAME
      1000_at    5595      MAPK3    mitogen-activated protein kinase 3
      1001_at    7075      TIE1     tyrosine kinase with immunoglobulin-like and EGF-like domains 1
      1002_f_at  1557      CYP2C19  cytochrome P450, family 2, subfamily C, polypeptide 19
    • 128 patients, 21 covariates:
      cod   sex  age  BT  remissionType
      1005  M    53   B2  achieved
      1010  M    19   B2  achieved
      1003  M    31   T   achieved

esetVis package

  • input: Bioconductor ExpressionSet object
  • 3 types of visualizations:
    • spectral map (Lewi, P. J. 1976): mpm package → esetSpectralMap
    • t-Distributed Stochastic Neighbor Embedding (t-SNE) (Van der Maaten and Hinton 2008): Rtsne package → esetTsne
    • linear discriminant analysis (Fisher, R. A. 1936): MASS package → esetLda
  • 2 implementations:
    • static with ggplot2 package
    • (potentially) interactive with ggvis package

Why use ExpressionSet?

  • combines 3 (or more) layers of a genomic experiment in one R object:
    • data, e.g. change of gene expression under a given condition → slot: assayData
    • gene annotation, e.g. symbol, probe set ID, family → slot: featureData
    • sample annotation, e.g. treatment, time point, concentration → slot: phenoData
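
A minimal sketch of how the three layers fit together, assuming the Biobase package is installed. The toy matrix, the gene and sample names, and the `treatment` and `SYMBOL` columns are invented for illustration and are not part of the ALL dataset.

```r
library(Biobase)

set.seed(1)

# layer 1: expression data, genes in rows and samples in columns (toy values)
exprsMat <- matrix(
    rnorm(6 * 4), nrow = 6,
    dimnames = list(paste0("gene", 1:6), paste0("sample", 1:4))
)

# layer 2: gene annotation, one row per gene (hypothetical symbols)
geneAnnot <- data.frame(
    SYMBOL = paste0("SYM", 1:6),
    row.names = rownames(exprsMat)
)

# layer 3: sample annotation, one row per sample (hypothetical covariate)
sampleAnnot <- data.frame(
    treatment = c("A", "A", "B", "B"),
    row.names = colnames(exprsMat)
)

# combine the three layers in one object
eset <- ExpressionSet(
    assayData = exprsMat,
    featureData = AnnotatedDataFrame(geneAnnot),
    phenoData = AnnotatedDataFrame(sampleAnnot)
)

# subsetting keeps data and annotation in sync
esetSub <- eset[1:3, eset$treatment == "B"]

dim(exprs(esetSub))  # expression values of the subset
fData(esetSub)       # gene annotation of the subset
pData(esetSub)       # sample annotation of the subset
```

Keeping the layers in one object means that a filter on genes or samples is applied to the data and to both annotation tables in a single step.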

Spectral map

Default parameters

print(esetSpectralMap(eset = ALL))
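
As a rough illustration of the idea behind a spectral map (Lewi 1976), the sketch below log-transforms a simulated intensity matrix, centers it by rows and by columns, and extracts the main axes with a singular value decomposition. It is a simplified, unweighted version in base R, not the code of the mpm package or of esetSpectralMap; the data and all object names are invented.

```r
set.seed(42)

# simulated positive intensities: 50 genes x 8 samples (toy data)
nGenes <- 50
nSamples <- 8
intensities <- matrix(
    rlnorm(nGenes * nSamples, meanlog = 5, sdlog = 1), nrow = nGenes,
    dimnames = list(
        paste0("gene", seq_len(nGenes)),
        paste0("sample", seq_len(nSamples))
    )
)

# step 1: log-transform
logX <- log2(intensities)

# step 2: double centering, i.e. remove gene (row) means, then sample (column) means
rowCentered <- logX - rowMeans(logX)
doubleCentered <- sweep(rowCentered, 2, colMeans(rowCentered))

# step 3: singular value decomposition of the centered matrix
sv <- svd(doubleCentered)

# proportion of variance carried by each axis
varExplained <- sv$d^2 / sum(sv$d^2)

# coordinates of the samples on the first two axes
sampleCoords <- sv$v[, 1:2] %*% diag(sv$d[1:2])
rownames(sampleCoords) <- colnames(intensities)

round(varExplained[1:2], 3)
head(sampleCoords)
```

After double centering, overall differences in gene or sample intensity are removed, so the axes reflect contrasts between genes and samples rather than absolute levels.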

Custom sample annotation

print(esetSpectralMap(eset = ALL, 
    title = paste("Acute lymphoblastic leukemia",
        "dataset \n Spectral map",
        "\n Sample annotation"),
    # sample annotation
    colorVar = "BT", 
    color = colorPalette,
    shapeVar = "sex",
    sizeVar = "age", 
    sizeRange = c(2,6),
    # outlying features
    topSamples = 0, 
    topGenes = 0, 
    cloudGenes = FALSE)
)

Label outlying genes/samples

print(esetSpectralMap(eset = ALL, 
    title = paste("Acute lymphoblastic leukemia",
        "dataset \n Spectral map \n",
        "Label outlying samples and genes"),
    colorVar = "BT", color = colorPalette, 
    shapeVar = "sex", sizeVar = "age", 
    topGenes = 10, topGenesVar = "SYMBOL",
    topSamples = 15, topSamplesVar = "cod"
    )
)

Group genes into gene sets - Extract gene set annotation

  • Why: gives additional biological meaning to the plot
  • getGeneSetsForPlot: formats gene set annotation for the plot (uses the MLP package)
  • Example:
    extract gene sets from the Gene Ontology and KEGG databases for the ALL dataset:
geneSets <- getGeneSetsForPlot(
    entrezIdentifiers = fData(ALL)$ENTREZID,
    species = "Human",
    geneSetSource = c('GOBP', 'GOMF', 'GOCC', 'KEGG'),
    useDescription = TRUE)
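
To make the shape of such an annotation concrete, here is a small base R sketch. It assumes that a gene set annotation can be represented as a named list of character vectors of gene identifiers; the set names and the identifier "9999999" are hypothetical, and the three annotated genes are the ones shown in the annotation table of the example dataset.

```r
# gene annotation for three probe sets (values taken from the table above)
featureAnnot <- data.frame(
    PROBEID = c("1000_at", "1001_at", "1002_f_at"),
    ENTREZID = c("5595", "7075", "1557"),
    SYMBOL = c("MAPK3", "TIE1", "CYP2C19"),
    stringsAsFactors = FALSE
)

# hypothetical gene sets: set name -> Entrez identifiers of its members
# ("9999999" is a made-up identifier that is absent from the annotation)
geneSetsCustom <- list(
    "kinase activity (hypothetical set)" = c("5595", "7075", "9999999"),
    "drug metabolism (hypothetical set)" = c("1557")
)

# number of annotated genes falling in each set
nGenesPerSet <- vapply(
    geneSetsCustom,
    function(ids) sum(featureAnnot$ENTREZID %in% ids),
    integer(1)
)

# gene symbols per set, matched through the Entrez identifier
symbolsPerSet <- lapply(
    geneSetsCustom,
    function(ids) featureAnnot$SYMBOL[featureAnnot$ENTREZID %in% ids]
)

nGenesPerSet
symbolsPerSet
```

The matching goes through one identifier column, which is why the plotting call needs to know which annotation variable holds the identifiers used in the gene sets.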

Add gene set annotation to the spectral map

print(esetSpectralMap(eset = ALL, 
    title = paste("Acute lymphoblastic leukemia",
        "dataset \n Spectral map \n",
        "Gene set annotation"),
    colorVar = "BT", color = colorPalette, 
    shapeVar = "sex", 
    sizeVar = "age", 
    topGenes = 0,
    geneSets = geneSets, geneSetsVar = "ENTREZID", 
    geneSetsMaxNChar = 30))

T-Distributed Stochastic Neighbor Embedding

print(esetTsne(eset = ALL, 
    title = paste("Acute lymphoblastic leukemia",
        "dataset \n Tsne"),
    colorVar = "BT", color = colorPalette,
    shapeVar = "sex",
    sizeVar = "age", sizeRange = c(2, 6),
    topSamplesVar = "cod"
))
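
For readers who want to see the underlying dimension reduction on its own, the sketch below calls the Rtsne package directly on simulated data, with samples as observations. It assumes Rtsne is installed; the data, the group shift, and the choices `perplexity = 5` and `pca = FALSE` are illustrative and are not claimed to be the settings used by esetTsne.

```r
library(Rtsne)

set.seed(123)

# simulated expression matrix: 100 genes x 30 samples (toy data)
nGenes <- 100
nSamples <- 30
exprsMat <- matrix(
    rnorm(nGenes * nSamples), nrow = nGenes,
    dimnames = list(
        paste0("gene", seq_len(nGenes)),
        paste0("sample", seq_len(nSamples))
    )
)

# shift the first 20 genes in the last 15 samples to create two groups
exprsMat[1:20, 16:30] <- exprsMat[1:20, 16:30] + 3

# t-SNE expects observations in rows, so the matrix is transposed;
# the perplexity has to be small enough for the number of samples
fit <- Rtsne(
    t(exprsMat), dims = 2, perplexity = 5,
    pca = FALSE, check_duplicates = FALSE
)

# two-dimensional coordinates of the samples
coords <- data.frame(
    sample = colnames(exprsMat),
    X1 = fit$Y[, 1],
    X2 = fit$Y[, 2]
)
head(coords)
```

Because t-SNE is stochastic, the coordinates change from run to run unless the seed is fixed, and only the relative positions of the samples are meaningful.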

Linear discriminant analysis

# run the analysis
pathOutputEsetLda <- 
    myObjectPath("outputEsetLda.RData")

if(createObjects){
    outputEsetLda <- esetLda(eset = ALL,
        ldaVar = "BT",
        title = paste(
            "Acute lymphoblastic leukemia",
            "dataset \n",
            "Linear discriminant analysis",
            "BT variable"),
        colorVar = "BT", color = colorPalette,
        shapeVar = "sex",
        sizeVar = "age", sizeRange = c(2, 6),
        topSamplesVar = "cod", 
        topGenesVar = "SYMBOL",
        returnAnalysis = TRUE)
    save(outputEsetLda, 
        file = pathOutputEsetLda)
} else if (!exists("outputEsetLda")) {
    # reuse the saved analysis instead of recomputing it
    load(pathOutputEsetLda)
}

# extract and print the ggplot object
print(outputEsetLda$plot)
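
The sketch below shows the kind of analysis that sits underneath, using `lda` from the MASS package on a small simulated matrix. The data, the group labels and the number of genes are invented; the real dataset has many more genes than samples, which this toy example deliberately avoids.

```r
library(MASS)

set.seed(1)

# three hypothetical groups of 10 samples each
group <- factor(rep(c("B1", "B2", "T"), each = 10))

# simulated expression matrix: 5 genes x 30 samples (toy data)
nGenes <- 5
exprsMat <- matrix(
    rnorm(nGenes * length(group)), nrow = nGenes,
    dimnames = list(
        paste0("gene", seq_len(nGenes)),
        paste0("sample", seq_along(group))
    )
)

# make two genes informative for the grouping
exprsMat["gene1", group == "B2"] <- exprsMat["gene1", group == "B2"] + 3
exprsMat["gene2", group == "T"] <- exprsMat["gene2", group == "T"] + 3

# lda expects observations in rows, so samples go in rows
x <- t(exprsMat)
fit <- lda(x = x, grouping = group)

# sample coordinates on the discriminant axes
scores <- predict(fit, newdata = x)$x

# contribution of each gene to the discriminant axes
loadings <- fit$scaling

head(scores)
loadings
```

With three groups there are at most two discriminant axes, so the samples can be plotted in two dimensions without any further reduction.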

Interactive visualization with ggvis

  • typePlot argument: 'static' (by default) or 'interactive'
  • example on the ALL dataset

Thanks!

Conclusion

Chiaretti et al. 2004. “Gene Expression Profile of Adult T-Cell Acute Lymphocytic Leukemia Identifies Distinct Subsets of Patients with Different Response to Therapy and Survival.” Blood 103.

Fisher, R. A. 1936. “The Use of Multiple Measurements in Taxonomic Problems.” Annals of Eugenics 7: 179–88.

Lewi, P. J. 1976. “Spectral Mapping, a Technique for Classifying Biological Activity Profiles of Chemical Compounds.” Arzneimittel Forschung (Drug Research) 26: 1295–1300.

Van der Maaten and Hinton. 2008. “Visualizing High-Dimensional Data Using T-SNE.” Journal of Machine Learning Research 9: 2579–2605.