Lingli He¹ and Xin Wang¹

¹ Department of Surgery, The Chinese University of Hong Kong, Hong Kong SAR, China.

contact: xinwang@cuhk.edu.hk
date: 2024-12-23
package: MSOCclassifier 0.1.0

1 Introduction

This vignette shows how to subtype high-grade serous ovarian cancer from multi-omics data using sparse multiple canonical correlation analysis (sparse mCCA; Witten and Tibshirani 2009) and a weighted average. The classifier was trained on paired mRNA expression, microRNA expression, DNA methylation, copy number variation, and mutation data from the TCGA-OV dataset. The package accepts any combination of these five data types as input.

2 Package installation

Run the analyses in this vignette under R version 2.10 or later. Before installing MSOCclassifier, install the R package caret, which is available from CRAN (Comprehensive R Archive Network). MSOCclassifier itself is installed from GitHub with devtools:

options(repos = c(CRAN = "https://cloud.r-project.org/"))
# install the dependencies from CRAN (devtools provides install_github)
install.packages(c("caret", "devtools"))
library(caret)
library(devtools)
# install the "MSOCclassifier" package from GitHub
install_github("Carpentierbio/MSOCclassifier")

3 A quick start

3.1 Projecting each omics data into a unified space and integrating them

The example dataset comes from the ICGC-OV cohort of 79 ovarian cancer patients and was downloaded from https://dcc.icgc.org/projects/OV-AU (this link may no longer be accessible).

Input data can be any combination of mRNA expression, microRNA expression, DNA methylation, copy number variation, and mutation data.
If the single-omics or integrated multi-omics data include mRNA or microRNA expression, a log₂ transformation is required before integration.
mRNA expression, microRNA expression, and DNA methylation input data should be pre-processed and z-score normalized.
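
As a rough illustration of these pre-processing steps, the sketch below applies a log₂ transformation and a per-feature z-score to a small toy matrix with features in rows and samples in columns. The toy values, the object names, and the pseudocount of 1 are assumptions made for illustration only; they are not part of MSOCclassifier.

```r
# Toy expression matrix: 3 features (rows) x 4 samples (columns); values are made up
toy_expr <- matrix(c( 0,  1,  3,   7,
                     15, 31, 63, 127,
                      2,  2,  5,   9),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(paste0("gene", 1:3), paste0("S", 1:4)))

# log2 transformation; the pseudocount of 1 (an assumption) avoids log2(0)
toy_log <- log2(toy_expr + 1)

# z-score each feature across samples; scale() works column-wise,
# hence the double transpose
toy_z <- t(scale(t(toy_log)))

round(rowMeans(toy_z), 10)  # per-feature means after scaling
apply(toy_z, 1, sd)         # per-feature standard deviations after scaling
```

The same `t(scale(t(x)))` idiom is used on the example data in the chunk that follows.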

options(knitr.duplicate.label = "allow")
library(MSOCclassifier)
library(dplyr)
#> Warning: package 'dplyr' was built under R version 4.3.3
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
# Load example multi-omics expression profile
data("rna_log_tpm_ICGC")
data("mir_ICGC")
data("methy_M_ICGC")
data("cnv_ICGC")
data("mut_ICGC")
# Load projection matrices derived from 226 TCGA-OV samples
data("TCGA_projection_mx")
# Load pre-processed TCGA-OV multi-omics data for feature selection in validation cohort
data("TCGAmRNAscaled")
data("TCGAmiRNAscaled")
data("TCGAmetscaled")
data("TCGAcnvscaled")
data("TCGAmutscaled")
# ensure that the number and order of features in the test data are identical to those in the training data (TCGA)
geneexp_bf_mapping = rna_log_tpm_ICGC[colnames(TCGAmRNAscaled), ]
geneexp_bf_mapping = t(scale(t(geneexp_bf_mapping)))
mirexp_bf_mapping = mir_ICGC[colnames(TCGAmiRNAscaled), ]
mirexp_bf_mapping = t(scale(t(mirexp_bf_mapping)))
methy_bf_mapping = methy_M_ICGC[colnames(TCGAmetscaled), ]
methy_bf_mapping = t(scale(t(methy_bf_mapping)))
cnv_bf_mapping = cnv_ICGC[colnames(TCGAcnvscaled), ]
cnv_bf_mapping = t(scale(t(cnv_bf_mapping)))
mut_bf_mapping = mut_ICGC[colnames(TCGAmutscaled), ]
mut_bf_mapping = t(scale(t(mut_bf_mapping)))
mut_bf_mapping[is.na(mut_bf_mapping)]=0
# Project each omics dataset into a unified space
mRNAexprCCA = t(geneexp_bf_mapping) %*% TCGA_projection_mx$ws[[1]]
mRNAexprCCA_2 = scale(mRNAexprCCA)
miRNAexprCCA = t(mirexp_bf_mapping) %*% TCGA_projection_mx$ws[[2]]
miRNAexprCCA_2 = scale(miRNAexprCCA)
methyexprCCA = t(methy_bf_mapping) %*% TCGA_projection_mx$ws[[3]]
methyexprCCA_2 = scale(methyexprCCA)
cnvexprCCA = t(cnv_bf_mapping) %*% TCGA_projection_mx$ws[[4]]
cnvexprCCA_2 = scale(cnvexprCCA)
mutexprCCA = t(mut_bf_mapping) %*% TCGA_projection_mx$ws[[5]]
mutexprCCA_2 = scale(mutexprCCA) # samples in rows and projected components in columns
# Multi-omics data fusion
a1 = a2 = a3 = a4 = a5 = 0.2
data_input = a1*mRNAexprCCA_2 + a2*miRNAexprCCA_2 + a3*methyexprCCA_2 + a4*cnvexprCCA_2 + a5*mutexprCCA_2
colnames(data_input) = paste("X",1:ncol(data_input),sep = "")
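
Because the input may be any combination of omics layers, the fusion step can in principle be written for a variable number of projected matrices. The sketch below is one possible way to do this, using small random matrices in place of the real projections. The helper name `fuse_omics` and the choice of equal weights renormalized over the available layers are assumptions, not something the package documents.

```r
# Hypothetical helper: weighted average of projected omics matrices.
# All matrices must share the same dimensions (samples x components).
fuse_omics <- function(omics_list,
                       weights = rep(1 / length(omics_list), length(omics_list))) {
  stopifnot(length(omics_list) == length(weights),
            abs(sum(weights) - 1) < 1e-8)
  dims <- lapply(omics_list, dim)
  stopifnot(all(vapply(dims, identical, logical(1), dims[[1]])))
  # multiply each matrix by its weight, then sum element-wise
  fused <- Reduce(`+`, Map(`*`, omics_list, weights))
  colnames(fused) <- paste0("X", seq_len(ncol(fused)))
  fused
}

set.seed(1)
# Toy stand-ins for two available layers (5 samples x 4 components each)
toy_mRNA  <- scale(matrix(rnorm(20), nrow = 5))
toy_methy <- scale(matrix(rnorm(20), nrow = 5))

toy_input <- fuse_omics(list(toy_mRNA, toy_methy))
dim(toy_input)
colnames(toy_input)
```

With five layers and the default weights, this reduces to the equal weighting of 0.2 per layer used above.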

3.2 Multi-omics high-grade serous ovarian cancer subtype classification

The classifyMSOC function requires an expression matrix with samples in rows and multi-omics features in columns. The column names of the expression profile should be X1, X2, …, X100. The code chunk below demonstrates how to perform classification using primary high-grade serous ovarian cancer example data.


# MSOC prediction of primary high-grade serous ovarian cancer
result <- classifyMSOC(data_input)
label <- result$label
prob <- result$prob %>%
  `colnames<-`(paste("MSOC", 1:5, "_prob", sep = ""))
res <- data.frame(prob, subtype = paste("MSOC", label, sep = "") ) %>%
  `rownames<-`(names(label))
head(res)
#>         MSOC1_prob MSOC2_prob MSOC3_prob MSOC4_prob MSOC5_prob      subtype
#> DO46325 0.11748928 0.21354437 0.05026826 0.21696765 0.40173043 MSOCCluster5
#> DO46326 0.06550225 0.07277660 0.07769433 0.48309416 0.30093266 MSOCCluster4
#> DO46327 0.21244671 0.24825725 0.08599393 0.05895582 0.39434629 MSOCCluster5
#> DO46328 0.59245296 0.03296241 0.26581691 0.08998380 0.01878392 MSOCCluster1
#> DO46329 0.12995571 0.36259823 0.06636200 0.17749717 0.26358689 MSOCCluster2
#> DO46330 0.28355179 0.06877282 0.58781733 0.01797763 0.04188043 MSOCCluster3
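
In the output above, each sample's subtype corresponds to the column with the largest probability. The sketch below checks this by hand on the first three printed rows, re-entered manually. The object names are illustrative, and whether classifyMSOC derives its label in exactly this way is an assumption.

```r
# First three rows of the printed output, re-entered by hand
toy_prob <- matrix(c(0.11748928, 0.21354437, 0.05026826, 0.21696765, 0.40173043,
                     0.06550225, 0.07277660, 0.07769433, 0.48309416, 0.30093266,
                     0.21244671, 0.24825725, 0.08599393, 0.05895582, 0.39434629),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(c("DO46325", "DO46326", "DO46327"),
                                   paste0("MSOC", 1:5, "_prob")))

# Index of the most probable subtype for each sample
top_idx <- max.col(toy_prob, ties.method = "first")
names(top_idx) <- rownames(toy_prob)
top_idx

# Number of samples assigned to each of the five subtypes
table(factor(top_idx, levels = 1:5))
```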

4 Session Info

#> R version 4.3.2 (2023-10-31 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 11 x64 (build 22631)
#> 
#> Matrix products: default
#> 
#> 
#> locale:
#> [1] LC_COLLATE=Chinese (Simplified)_China.utf8 
#> [2] LC_CTYPE=Chinese (Simplified)_China.utf8   
#> [3] LC_MONETARY=Chinese (Simplified)_China.utf8
#> [4] LC_NUMERIC=C                               
#> [5] LC_TIME=Chinese (Simplified)_China.utf8    
#> 
#> time zone: Asia/Hong_Kong
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] dplyr_1.1.4          MSOCclassifier_0.1.0 devtools_2.4.5      
#> [4] usethis_3.1.0        caret_6.0-94         lattice_0.21-9      
#> [7] ggplot2_3.5.1        BiocStyle_2.30.0    
#> 
#> loaded via a namespace (and not attached):
#>  [1] pROC_1.18.5          remotes_2.5.0        rlang_1.1.4         
#>  [4] magrittr_2.0.3       e1071_1.7-16         compiler_4.3.2      
#>  [7] callr_3.7.6          vctrs_0.6.5          reshape2_1.4.4      
#> [10] stringr_1.5.1        profvis_0.4.0        pkgconfig_2.0.3     
#> [13] fastmap_1.2.0        ellipsis_0.3.2       utf8_1.2.4          
#> [16] promises_1.3.2       rmarkdown_2.28       prodlim_2024.06.25  
#> [19] sessioninfo_1.2.2    ps_1.8.0             purrr_1.0.2         
#> [22] xfun_0.48            cachem_1.1.0         jsonlite_1.8.9      
#> [25] recipes_1.1.0        later_1.4.1          parallel_4.3.2      
#> [28] R6_2.5.1             bslib_0.8.0          stringi_1.8.4       
#> [31] parallelly_1.38.0    pkgload_1.4.0        rpart_4.1.21        
#> [34] lubridate_1.9.3      jquerylib_0.1.4      Rcpp_1.0.13         
#> [37] bookdown_0.41        iterators_1.0.14     knitr_1.48          
#> [40] future.apply_1.11.3  httpuv_1.6.15        Matrix_1.6-1.1      
#> [43] splines_4.3.2        nnet_7.3-19          timechange_0.3.0    
#> [46] tidyselect_1.2.1     rstudioapi_0.17.0    yaml_2.3.10         
#> [49] timeDate_4041.110    codetools_0.2-19     miniUI_0.1.1.1      
#> [52] curl_6.0.1           processx_3.8.4       listenv_0.9.1       
#> [55] pkgbuild_1.4.4       tibble_3.2.1         plyr_1.8.9          
#> [58] shiny_1.10.0         withr_3.0.2          evaluate_1.0.1      
#> [61] future_1.34.0        desc_1.4.3           survival_3.5-7      
#> [64] proxy_0.4-27         urlchecker_1.0.1     pillar_1.9.0        
#> [67] BiocManager_1.30.25  foreach_1.5.2        stats4_4.3.2        
#> [70] generics_0.1.3       munsell_0.5.1        scales_1.3.0        
#> [73] globals_0.16.3       xtable_1.8-4         class_7.3-22        
#> [76] glue_1.7.0           tools_4.3.2          data.table_1.16.2   
#> [79] ModelMetrics_1.2.2.2 gower_1.0.1          fs_1.6.4            
#> [82] grid_4.3.2           ipred_0.9-15         colorspace_2.1-1    
#> [85] nlme_3.1-163         cli_3.6.3            fansi_1.0.6         
#> [88] lava_1.8.0           gtable_0.3.6         sass_0.4.9          
#> [91] digest_0.6.37        htmlwidgets_1.6.4    memoise_2.0.1       
#> [94] htmltools_0.5.8.1    lifecycle_1.0.4      hardhat_1.4.0       
#> [97] mime_0.12            MASS_7.3-60

References

Witten, Daniela M, and Robert J Tibshirani. 2009. “Extensions of Sparse Canonical Correlation Analysis with Applications to Genomic Data.” Statistical Applications in Genetics and Molecular Biology 8 (1).