Lingli He1, Kai Song1, Sitan Qiao1, Yabin Chen3, Jiang Li1, Lin Qi1, and Xin Wang1, 2, 3

1 Department of Surgery, The Chinese University of Hong Kong, Hong Kong SAR, China. 2 Li Ka Shing Institute of Health Sciences, The Chinese University of Hong Kong, Hong Kong SAR, China. 3 Research Institute, The Chinese University of Hong Kong, Shenzhen, China.

1 Introduction

This vignette shows how to perform multi-omics colorectal cancer subtyping using sparse multiple canonical correlation analysis (sparse mCCA; Witten and Tibshirani 2009) and weighted averaging. Paired mRNA expression, microRNA expression, and DNA methylation data from the TCGA-COAD and TCGA-READ datasets were used to train the multi-omics colorectal cancer classifier. The package accepts any combination of mRNA expression, microRNA expression, and DNA methylation data as input.

2 Package installation

Please run all analyses in this vignette under R version 2.10 or later. Before installing the MSCRCclassifier package, install the R packages caret and naivebayes, which are available directly from CRAN (Comprehensive R Archive Network). The devtools package is also needed to install MSCRCclassifier from GitHub:

options(repos = c(CRAN = "https://cloud.r-project.org/"))
# install the CRAN dependencies; devtools provides install_github()
install.packages(c("caret", "naivebayes", "devtools"))
library(caret)
library(naivebayes)
library(devtools)
# install the "MSCRCclassifier" package
install_github("CityUHK-CompBio/MSCRCclassifier")
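
Before installing anything, it can help to see which of the packages loaded in this vignette are already present. The following is a minimal sketch; `missing_pkgs` is a hypothetical helper written for illustration and is not part of MSCRCclassifier.

```r
# Packages this vignette loads; devtools is only needed for install_github()
required_pkgs <- c("caret", "naivebayes", "devtools", "dplyr")

# Hypothetical helper: return the subset of `pkgs` that is not installed
missing_pkgs <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_pkgs(required_pkgs)
if (length(to_install) > 0) {
  message("Not yet installed: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install from CRAN
} else {
  message("All required packages are already installed.")
}
```

The install call is left commented out so that running the sketch has no side effects beyond loading namespaces.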

3 A quick start

3.1 Projecting each omics data into a unified space and integrating them

The example dataset used in this analysis comes from a microarray experiment on 566 colon cancer patients, available under the GEO accession number GSE39582 (Marisa et al. 2013).

  • Input data can be any combination of mRNA expression, microRNA expression, and DNA methylation data.
  • If the input, whether single-omics or multi-omics, includes mRNA or microRNA expression, these data must be log2-transformed before integration.
  • All types of input data should be pre-processed and z-score normalized.
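
The pre-processing requirements above can be sketched on synthetic data as follows. This is only an illustration under stated assumptions: the pseudocount of 1 and the choice to z-score each feature across samples are assumptions, since the vignette does not specify either, and the matrix layout (features in rows, samples in columns) mirrors how GSE39582_expr is used below.

```r
set.seed(1)
# Synthetic raw intensities: 6 features (rows) x 4 samples (columns)
raw <- matrix(rexp(24, rate = 0.01), nrow = 6,
              dimnames = list(paste0("gene", 1:6), paste0("S", 1:4)))

# log2 transformation; the pseudocount of 1 is an assumption to avoid log2(0)
log_expr <- log2(raw + 1)

# z-score each feature across samples: scale() works column-wise,
# so transpose, scale, and transpose back
z_expr <- t(scale(t(log_expr)))

round(rowMeans(z_expr), 10)  # per-feature means after normalization
apply(z_expr, 1, sd)         # per-feature standard deviations
```
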
options(knitr.duplicate.label = "allow")
library(MSCRCclassifier)
library(dplyr)
#> Warning: package 'dplyr' was built under R version 4.3.3
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
# Load example mRNA expression profile
data("GSE39582_expr")
# Load projection matrices derived from 315 TCGA-COAD and TCGA-READ samples
data("projection_mxs")
dim(projection_mxs$ws[[1]])  
#> [1] 951 196
# Project the omics data into a unified space
mRNAexprCCA <- t(GSE39582_expr) %*% projection_mxs$ws[[1]]
mRNAexprCCA <- scale(mRNAexprCCA)
mRNAexprCCA[1:5,1:5] # samples in rows and projected components in columns
#>                 [,1]       [,2]        [,3]      [,4]       [,5]
#> GSM971957  0.1101710  0.1097191  0.06029913 -1.238951 -0.1891825
#> GSM971958  0.4386480  0.5585471  1.04738767 -2.080878 -0.7772895
#> GSM971959 -1.1243716 -1.0543246  0.54838954  1.235461  0.3554117
#> GSM971960  1.6357041  1.5142172 -1.42486629  1.064347 -1.5720998
#> GSM971961  0.2828736  0.2701505 -0.28705388 -1.389689  0.3078082
  
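# Weight applied to the projected mRNA data in the weighted average.
# With a single omics layer, scale() cancels this constant factor,
# so data_input equals mRNAexprCCA here.
a1 <- 0.4
data_input <- scale(a1 * mRNAexprCCA)
colnames(data_input) <- paste0("X", seq_len(ncol(projection_mxs$ws[[1]])))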
data_input[1:5,1:5]
#>                   X1         X2          X3        X4         X5
#> GSM971957  0.1101710  0.1097191  0.06029913 -1.238951 -0.1891825
#> GSM971958  0.4386480  0.5585471  1.04738767 -2.080878 -0.7772895
#> GSM971959 -1.1243716 -1.0543246  0.54838954  1.235461  0.3554117
#> GSM971960  1.6357041  1.5142172 -1.42486629  1.064347 -1.5720998
#> GSM971961  0.2828736  0.2701505 -0.28705388 -1.389689  0.3078082
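
When more than one omics layer is available, the same projection step can be repeated per layer and the results combined. The sketch below uses synthetic data only and rests on several assumptions that this vignette does not confirm: that the layers are combined as a weighted sum of the scaled projections, that the weights for microRNA and methylation are 0.3 each (only 0.4 for mRNA appears above), and that each layer has its own projection matrix. `project_layer` is a hypothetical helper, not a package function.

```r
set.seed(42)
n_samples <- 10
n_comp <- 5   # toy size; the vignette uses 196 components

# Hypothetical helper: project a features-x-samples matrix and standardize
project_layer <- function(x, w) scale(t(x) %*% w)

# Synthetic z-scored data and projection matrices for three omics layers
sizes <- c(mRNA = 30, miRNA = 20, methylation = 25)
layers <- lapply(sizes, function(p) matrix(rnorm(p * n_samples), nrow = p))
ws <- lapply(sizes, function(p) matrix(rnorm(p * n_comp), nrow = p))

projected <- Map(project_layer, layers, ws)

# Illustrative weights: 0.4 for mRNA follows the vignette, the rest are assumed
weights <- c(mRNA = 0.4, miRNA = 0.3, methylation = 0.3)

# Weighted sum of the projected layers, then column-wise standardization
combined <- Reduce(`+`, Map(function(m, a) a * m, projected, weights))
data_input <- scale(combined)
colnames(data_input) <- paste0("X", seq_len(n_comp))
dim(data_input)
```

With a single layer the weight has no effect, because `scale()` divides out any positive constant factor; the weights only matter once two or more layers are summed.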

3.2 Multi-omics colorectal cancer subtype classification

The classifyMSCRC function requires an expression matrix with samples in rows and multi-omics features in columns. The column names of the expression profile should be X1, X2, …, X196. The code chunk below demonstrates how to perform classification using primary colorectal cancer example data.
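
The input layout described above can be checked before classification. The following is a minimal sketch; `check_mscrc_input` is a hypothetical helper that is not part of MSCRCclassifier, and requiring a numeric matrix (rather than, say, a data frame) is an assumption that may be stricter than what classifyMSCRC itself accepts.

```r
# Hypothetical helper: verify samples-in-rows layout with columns X1..X196
check_mscrc_input <- function(x, n_features = 196) {
  expected <- paste0("X", seq_len(n_features))
  if (!is.matrix(x) || !is.numeric(x)) {
    stop("input must be a numeric matrix with samples in rows")
  }
  if (!identical(colnames(x), expected)) {
    stop("columns must be named X1, X2, ..., X", n_features, " in that order")
  }
  invisible(TRUE)
}

# Toy matrix with 3 samples and the expected 196 columns
toy <- matrix(rnorm(3 * 196), nrow = 3,
              dimnames = list(paste0("sample", 1:3), paste0("X", 1:196)))
check_mscrc_input(toy)

# A matrix without the expected column names is rejected
bad <- toy
colnames(bad) <- NULL
tryCatch(check_mscrc_input(bad), error = function(e) conditionMessage(e))
```
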


# MSCRC prediction of primary colorectal cancer
result <- classifyMSCRC(data_input)
label <- result$label
prob <- result$prob %>%
  `colnames<-`(paste("MSCRC", 1:5, "_prob", sep = ""))
res <- data.frame(prob, subtype = paste("MSCRC", label, sep = "") ) %>%
  `rownames<-`(names(label))
head(res)
#>             MSCRC1_prob   MSCRC2_prob   MSCRC3_prob   MSCRC4_prob   MSCRC5_prob
#> GSM971957  4.142554e-63 6.260621e-168  1.257628e-41  1.000000e+00  2.772609e-40
#> GSM971958 6.262548e-114  0.000000e+00  1.030506e-08  1.000000e+00 5.121095e-143
#> GSM971959 1.766813e-126  1.000000e+00 9.327311e-248  0.000000e+00 5.320991e-140
#> GSM971960 5.292759e-194  1.000000e+00 1.351652e-263 1.658364e-285 2.211272e-128
#> GSM971961  4.432350e-73 1.189128e-284  1.000000e+00  7.527886e-15 1.601268e-263
#> GSM971962  8.771647e-88  0.000000e+00  1.000000e+00 4.865823e-118  0.000000e+00
#>           subtype
#> GSM971957  MSCRC4
#> GSM971958  MSCRC4
#> GSM971959  MSCRC2
#> GSM971960  MSCRC2
#> GSM971961  MSCRC3
#> GSM971962  MSCRC3
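
A probability table of this shape can be summarized per sample, for example to flag low-confidence calls. The sketch below uses a toy probability matrix rather than real classifier output. The 0.8 cutoff is an illustrative assumption, and taking the most probable column as the subtype is also an assumption, although it agrees with the labels shown above.

```r
# Toy posterior probabilities shaped like result$prob (3 samples x 5 subtypes)
prob <- matrix(c(0.01, 0.02, 0.02, 0.90, 0.05,
                 0.05, 0.85, 0.04, 0.03, 0.03,
                 0.10, 0.05, 0.70, 0.10, 0.05),
               nrow = 3, byrow = TRUE,
               dimnames = list(paste0("sample", 1:3),
                               paste0("MSCRC", 1:5, "_prob")))

# Most probable subtype per sample and its probability
best <- max.col(prob, ties.method = "first")
summary_df <- data.frame(
  subtype = paste0("MSCRC", best),
  max_prob = prob[cbind(seq_len(nrow(prob)), best)],
  row.names = rownames(prob)
)

# Flag calls below an illustrative confidence cutoff (0.8 is an assumption)
summary_df$confident <- summary_df$max_prob >= 0.8
summary_df
table(summary_df$subtype)
```
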

4 Session Info

#> R version 4.3.2 (2023-10-31 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 11 x64 (build 22631)
#> 
#> Matrix products: default
#> 
#> 
#> locale:
#> [1] LC_COLLATE=Chinese (Simplified)_China.utf8 
#> [2] LC_CTYPE=Chinese (Simplified)_China.utf8   
#> [3] LC_MONETARY=Chinese (Simplified)_China.utf8
#> [4] LC_NUMERIC=C                               
#> [5] LC_TIME=Chinese (Simplified)_China.utf8    
#> 
#> time zone: Asia/Hong_Kong
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] dplyr_1.1.4           MSCRCclassifier_0.1.0 devtools_2.4.5       
#> [4] usethis_3.1.0         naivebayes_1.0.0      caret_7.0-1          
#> [7] lattice_0.21-9        ggplot2_3.5.1         BiocStyle_2.30.0     
#> 
#> loaded via a namespace (and not attached):
#>  [1] tidyselect_1.2.1     timeDate_4041.110    fastmap_1.2.0       
#>  [4] promises_1.3.2       pROC_1.18.5          digest_0.6.37       
#>  [7] rpart_4.1.21         mime_0.12            timechange_0.3.0    
#> [10] lifecycle_1.0.4      ellipsis_0.3.2       survival_3.5-7      
#> [13] magrittr_2.0.3       compiler_4.3.2       rlang_1.1.4         
#> [16] sass_0.4.9           tools_4.3.2          yaml_2.3.10         
#> [19] data.table_1.16.2    knitr_1.48           htmlwidgets_1.6.4   
#> [22] curl_6.0.1           pkgbuild_1.4.4       plyr_1.8.9          
#> [25] pkgload_1.4.0        miniUI_0.1.1.1       withr_3.0.2         
#> [28] purrr_1.0.2          nnet_7.3-19          grid_4.3.2          
#> [31] stats4_4.3.2         urlchecker_1.0.1     profvis_0.4.0       
#> [34] xtable_1.8-4         colorspace_2.1-1     future_1.34.0       
#> [37] globals_0.16.3       scales_1.3.0         iterators_1.0.14    
#> [40] MASS_7.3-60          cli_3.6.3            rmarkdown_2.28      
#> [43] remotes_2.5.0        generics_0.1.3       rstudioapi_0.17.0   
#> [46] future.apply_1.11.3  reshape2_1.4.4       sessioninfo_1.2.2   
#> [49] cachem_1.1.0         stringr_1.5.1        splines_4.3.2       
#> [52] parallel_4.3.2       BiocManager_1.30.25  vctrs_0.6.5         
#> [55] hardhat_1.4.0        Matrix_1.6-1.1       jsonlite_1.8.9      
#> [58] bookdown_0.41        listenv_0.9.1        foreach_1.5.2       
#> [61] gower_1.0.1          jquerylib_0.1.4      recipes_1.1.0       
#> [64] glue_1.8.0           parallelly_1.38.0    codetools_0.2-19    
#> [67] lubridate_1.9.3      stringi_1.8.4        gtable_0.3.6        
#> [70] later_1.4.1          munsell_0.5.1        tibble_3.2.1        
#> [73] pillar_1.10.1        htmltools_0.5.8.1    ipred_0.9-15        
#> [76] lava_1.8.0           R6_2.5.1             evaluate_1.0.1      
#> [79] shiny_1.10.0         memoise_2.0.1        httpuv_1.6.15       
#> [82] bslib_0.8.0          class_7.3-22         Rcpp_1.0.13         
#> [85] nlme_3.1-163         prodlim_2024.06.25   xfun_0.48           
#> [88] fs_1.6.4             ModelMetrics_1.2.2.2 pkgconfig_2.0.3

References

Marisa, Laetitia, Aurélien de Reyniès, Alex Duval, Janick Selves, Marie Pierre Gaub, Laure Vescovo, Marie-Christine Etienne-Grimaldi, et al. 2013. “Gene Expression Classification of Colon Cancer into Molecular Subtypes: Characterization, Validation, and Prognostic Value.” PLoS Medicine 10 (5): e1001453.
Witten, Daniela M, and Robert J Tibshirani. 2009. “Extensions of Sparse Canonical Correlation Analysis with Applications to Genomic Data.” Statistical Applications in Genetics and Molecular Biology 8 (1).