Lingli He1 and Xin Wang1

1 Department of Surgery, The Chinese University of Hong Kong, Hong Kong SAR, China.

1 Introduction

This vignette shows how to perform multi-omics subtyping of high-grade serous ovarian cancer using sparse multiple canonical correlation analysis (sparse mCCA; Witten and Tibshirani 2009) and weighted averaging. The classifier was trained on paired mRNA expression, microRNA expression, DNA methylation, copy number variation, and mutation data from the TCGA-OV dataset. The package accepts any combination of these five data types as input.

2 Package installation

Run all analyses in this vignette under R version 2.10 or later. Before installing MSOCclassifier, install the R package caret from CRAN (Comprehensive R Archive Network). MSOCclassifier itself is then installed from GitHub using devtools:

options(repos = c(CRAN = "https://cloud.r-project.org/"))
install.packages(c("caret", "devtools")) # devtools provides install_github()
library(caret)
library(devtools)
# install the "MSOCclassifier" package
install_github("Carpentierbio/MSOCclassifier")

3 A quick start

3.1 Projecting each omics data into a unified space and integrating them

The example dataset used in this analysis comes from the ICGC-OV cohort of 79 ovarian cancer patients and was downloaded from https://dcc.icgc.org/projects/OV-AU (this link may no longer be accessible).

  • Input data can be any combination of mRNA expression, microRNA expression, DNA methylation, copy number variation, and mutation data.
  • If the input data, whether single-omics or multi-omics, include mRNA or microRNA expression, those values must be log2-transformed before integration.
  • mRNA expression, microRNA expression, and DNA methylation input data should be pre-processed and z-score normalized.
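
The log2 transformation and z-score normalization mentioned in the list above can be sketched on a toy matrix. This is only an illustration, not the package's own pre-processing: the matrix `tpm`, its values, and the pseudocount of 1 are assumptions made for the example.

```r
# Toy expression matrix (hypothetical values): features in rows, samples in columns
set.seed(1)
tpm <- matrix(rexp(12, rate = 0.1), nrow = 3,
              dimnames = list(paste0("gene", 1:3), paste0("S", 1:4)))

# log2 transformation; the pseudocount of 1 is an assumption to avoid log2(0)
log_tpm <- log2(tpm + 1)

# z-score each feature (row) across samples: scale() works on columns,
# so the matrix is transposed before and after scaling
z <- t(scale(t(log_tpm)))

round(rowMeans(z), 10)  # per-feature means after scaling
apply(z, 1, sd)         # per-feature standard deviations after scaling
```

A feature with zero variance across samples would produce NaN values at this step, which is presumably why the chunk below replaces missing values in the mutation matrix with 0.
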
options(knitr.duplicate.label = "allow")
library(MSOCclassifier)
library(dplyr)
#> Warning: package 'dplyr' was built under R version 4.3.3
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
# Load example multi-omics expression profile
data("rna_log_tpm_ICGC")
data("mir_ICGC")
data("methy_M_ICGC")
data("cnv_ICGC")
data("mut_ICGC")
# Load projection matrices derived from 226 TCGA-OV samples
data("TCGA_projection_mx")
# Load pre-processed TCGA-OV multi-omics data for feature selection in validation cohort
data("TCGAmRNAscaled")
data("TCGAmiRNAscaled")
data("TCGAmetscaled")
data("TCGAcnvscaled")
data("TCGAmutscaled")
# ensure that the number and order of features in the test data are identical to those in the training data (TCGA)
geneexp_bf_mapping = rna_log_tpm_ICGC[colnames(TCGAmRNAscaled), ]
geneexp_bf_mapping = t(scale(t(geneexp_bf_mapping)))
mirexp_bf_mapping = mir_ICGC[colnames(TCGAmiRNAscaled), ]
mirexp_bf_mapping = t(scale(t(mirexp_bf_mapping)))
methy_bf_mapping = methy_M_ICGC[colnames(TCGAmetscaled), ]
methy_bf_mapping = t(scale(t(methy_bf_mapping)))
cnv_bf_mapping = cnv_ICGC[colnames(TCGAcnvscaled), ]
cnv_bf_mapping = t(scale(t(cnv_bf_mapping)))
mut_bf_mapping = mut_ICGC[colnames(TCGAmutscaled), ]
mut_bf_mapping = t(scale(t(mut_bf_mapping)))
mut_bf_mapping[is.na(mut_bf_mapping)]=0
# Project each omics data type into a unified space
mRNAexprCCA = t(geneexp_bf_mapping) %*% TCGA_projection_mx$ws[[1]]
mRNAexprCCA_2 = scale(mRNAexprCCA)
miRNAexprCCA = t(mirexp_bf_mapping) %*% TCGA_projection_mx$ws[[2]]
miRNAexprCCA_2 = scale(miRNAexprCCA)
methyexprCCA = t(methy_bf_mapping) %*% TCGA_projection_mx$ws[[3]]
methyexprCCA_2 = scale(methyexprCCA)
cnvexprCCA = t(cnv_bf_mapping) %*% TCGA_projection_mx$ws[[4]]
cnvexprCCA_2 = scale(cnvexprCCA)
mutexprCCA = t(mut_bf_mapping) %*% TCGA_projection_mx$ws[[5]]
mutexprCCA_2 = scale(mutexprCCA) # samples in rows and projected components in columns
# Multi-omics data fusion
a1 = a2 = a3 = a4 = a5 = 0.2
data_input = a1*mRNAexprCCA_2 + a2*miRNAexprCCA_2 + a3*methyexprCCA_2 + a4*cnvexprCCA_2 + a5*mutexprCCA_2
colnames(data_input) = paste("X",1:ncol(data_input),sep = "")
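
Because the package accepts any combination of omics types, the fusion step above can be generalized to a subset of layers. The sketch below is one possible way to do this and is not part of MSOCclassifier: the helper `fuse_omics` is hypothetical, and rescaling the weights so that they sum to 1 is an assumption, not a documented requirement of the classifier.

```r
# Hypothetical helper (not part of MSOCclassifier): weighted average of any
# subset of projected, scaled omics matrices (samples x components).
fuse_omics <- function(projected, weights = NULL) {
  stopifnot(is.list(projected), length(projected) >= 1)
  # All layers must have identical dimensions
  dims <- vapply(projected, function(m) paste(dim(m), collapse = "x"), character(1))
  stopifnot(length(unique(dims)) == 1)
  # Default to equal weights, as in the five-omics example above
  if (is.null(weights)) weights <- rep(1, length(projected))
  stopifnot(length(weights) == length(projected), all(weights >= 0), sum(weights) > 0)
  weights <- weights / sum(weights)  # assumption: weights are rescaled to sum to 1
  fused <- Reduce(`+`, Map(function(m, w) w * m, projected, weights))
  colnames(fused) <- paste0("X", seq_len(ncol(fused)))
  fused
}

# Toy example with three random layers of 5 samples x 4 components
set.seed(42)
toy <- replicate(3, scale(matrix(rnorm(20), nrow = 5)), simplify = FALSE)
fused <- fuse_omics(toy)
dim(fused)
colnames(fused)
```

With all five projected matrices and the default weights, this reduces to the equal weighting of 0.2 used above. The helper does not check that samples appear in the same order in every layer, so that should be verified beforehand.
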
 

3.2 Multi-omics high-grade serous ovarian cancer subtype classification

The classifyMSOC function requires an expression matrix with samples in rows and multi-omics features in columns. The column names of the expression profile should be X1, X2, …, X100. The code chunk below demonstrates how to perform classification using primary high-grade serous ovarian cancer example data.


# MSOC prediction of primary high-grade serous ovarian cancer
result <- classifyMSOC(data_input)
label <- result$label
prob <- result$prob %>%
  `colnames<-`(paste("MSOC", 1:5, "_prob", sep = ""))
res <- data.frame(prob, subtype = paste("MSOC", label, sep = "") ) %>%
  `rownames<-`(names(label))
head(res)
#>         MSOC1_prob MSOC2_prob MSOC3_prob MSOC4_prob MSOC5_prob      subtype
#> DO46325 0.11748928 0.21354437 0.05026826 0.21696765 0.40173043 MSOCCluster5
#> DO46326 0.06550225 0.07277660 0.07769433 0.48309416 0.30093266 MSOCCluster4
#> DO46327 0.21244671 0.24825725 0.08599393 0.05895582 0.39434629 MSOCCluster5
#> DO46328 0.59245296 0.03296241 0.26581691 0.08998380 0.01878392 MSOCCluster1
#> DO46329 0.12995571 0.36259823 0.06636200 0.17749717 0.26358689 MSOCCluster2
#> DO46330 0.28355179 0.06877282 0.58781733 0.01797763 0.04188043 MSOCCluster3
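
The result table can be summarized further, for example by counting samples per subtype and flagging low-confidence calls. The sketch below uses a toy stand-in for `res` built from three of the rows printed above so that it runs without the package; the object `res_toy` and the 0.5 threshold are assumptions chosen for illustration.

```r
# Toy stand-in for `res`, copied from three rows of the output above
res_toy <- data.frame(
  MSOC1_prob = c(0.11748928, 0.06550225, 0.59245296),
  MSOC2_prob = c(0.21354437, 0.07277660, 0.03296241),
  MSOC3_prob = c(0.05026826, 0.07769433, 0.26581691),
  MSOC4_prob = c(0.21696765, 0.48309416, 0.08998380),
  MSOC5_prob = c(0.40173043, 0.30093266, 0.01878392),
  subtype = c("MSOCCluster5", "MSOCCluster4", "MSOCCluster1"),
  row.names = c("DO46325", "DO46326", "DO46328")
)

prob_cols <- grep("_prob$", colnames(res_toy), value = TRUE)
prob_mx <- as.matrix(res_toy[, prob_cols])

# Highest class probability per sample
res_toy$max_prob <- unname(apply(prob_mx, 1, max))
# Hypothetical confidence threshold; 0.5 is an arbitrary choice
res_toy$confident <- res_toy$max_prob >= 0.5

table(res_toy$subtype)                            # samples per subtype
res_toy[, c("subtype", "max_prob", "confident")]  # per-sample summary
```

The same steps should apply to the full `res` object, assuming its probability columns are named as in the chunk above.
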

4 Session Info

#> R version 4.3.2 (2023-10-31 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 11 x64 (build 22631)
#> 
#> Matrix products: default
#> 
#> 
#> locale:
#> [1] LC_COLLATE=Chinese (Simplified)_China.utf8 
#> [2] LC_CTYPE=Chinese (Simplified)_China.utf8   
#> [3] LC_MONETARY=Chinese (Simplified)_China.utf8
#> [4] LC_NUMERIC=C                               
#> [5] LC_TIME=Chinese (Simplified)_China.utf8    
#> 
#> time zone: Asia/Hong_Kong
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] dplyr_1.1.4          MSOCclassifier_0.1.0 devtools_2.4.5      
#> [4] usethis_3.1.0        caret_6.0-94         lattice_0.21-9      
#> [7] ggplot2_3.5.1        BiocStyle_2.30.0    
#> 
#> loaded via a namespace (and not attached):
#>  [1] pROC_1.18.5          remotes_2.5.0        rlang_1.1.4         
#>  [4] magrittr_2.0.3       e1071_1.7-16         compiler_4.3.2      
#>  [7] callr_3.7.6          vctrs_0.6.5          reshape2_1.4.4      
#> [10] stringr_1.5.1        profvis_0.4.0        pkgconfig_2.0.3     
#> [13] fastmap_1.2.0        ellipsis_0.3.2       utf8_1.2.4          
#> [16] promises_1.3.2       rmarkdown_2.28       prodlim_2024.06.25  
#> [19] sessioninfo_1.2.2    ps_1.8.0             purrr_1.0.2         
#> [22] xfun_0.48            cachem_1.1.0         jsonlite_1.8.9      
#> [25] recipes_1.1.0        later_1.4.1          parallel_4.3.2      
#> [28] R6_2.5.1             bslib_0.8.0          stringi_1.8.4       
#> [31] parallelly_1.38.0    pkgload_1.4.0        rpart_4.1.21        
#> [34] lubridate_1.9.3      jquerylib_0.1.4      Rcpp_1.0.13         
#> [37] bookdown_0.41        iterators_1.0.14     knitr_1.48          
#> [40] future.apply_1.11.3  httpuv_1.6.15        Matrix_1.6-1.1      
#> [43] splines_4.3.2        nnet_7.3-19          timechange_0.3.0    
#> [46] tidyselect_1.2.1     rstudioapi_0.17.0    yaml_2.3.10         
#> [49] timeDate_4041.110    codetools_0.2-19     miniUI_0.1.1.1      
#> [52] curl_6.0.1           processx_3.8.4       listenv_0.9.1       
#> [55] pkgbuild_1.4.4       tibble_3.2.1         plyr_1.8.9          
#> [58] shiny_1.10.0         withr_3.0.2          evaluate_1.0.1      
#> [61] future_1.34.0        desc_1.4.3           survival_3.5-7      
#> [64] proxy_0.4-27         urlchecker_1.0.1     pillar_1.9.0        
#> [67] BiocManager_1.30.25  foreach_1.5.2        stats4_4.3.2        
#> [70] generics_0.1.3       munsell_0.5.1        scales_1.3.0        
#> [73] globals_0.16.3       xtable_1.8-4         class_7.3-22        
#> [76] glue_1.7.0           tools_4.3.2          data.table_1.16.2   
#> [79] ModelMetrics_1.2.2.2 gower_1.0.1          fs_1.6.4            
#> [82] grid_4.3.2           ipred_0.9-15         colorspace_2.1-1    
#> [85] nlme_3.1-163         cli_3.6.3            fansi_1.0.6         
#> [88] lava_1.8.0           gtable_0.3.6         sass_0.4.9          
#> [91] digest_0.6.37        htmlwidgets_1.6.4    memoise_2.0.1       
#> [94] htmltools_0.5.8.1    lifecycle_1.0.4      hardhat_1.4.0       
#> [97] mime_0.12            MASS_7.3-60

References

Witten, Daniela M, and Robert J Tibshirani. 2009. “Extensions of Sparse Canonical Correlation Analysis with Applications to Genomic Data.” Statistical Applications in Genetics and Molecular Biology 8 (1).