BatchServer is a web server for batch effect evaluation, visualization, and correction.

Update log

[v1.0.2] 2022-03-11

Fixed a bug in the 'keep' option of Minus replacement after ComBat.

[v1.0.1] 2021-08-18

Fixed color error of UMAP label, etc.

It is mainly based on the following R packages, which are wrapped with `R/Shiny`:

- pvca: batch effect evaluation
- umap, ggplot2, plotly: batch effect visualization
- sva: batch effect correction (improved version)
- fitdistrplus, extraDistr: goodness-of-fit test

BatchServer is composed of the following three layers:

Please choose your data file

Separator: Comma, Semicolon, Tab, xls/xlsx

Missing value replacement: None, 1, 0, 10% of minimum, minimum

Quantile normalization

Log2 transform

Please choose your sample information file

Separator: Comma, Semicolon, Tab, xls/xlsx

PVCA

PVCA assesses the batch sources by fitting all "sources" as random effects, including two-way interaction terms, in a mixed model (which depends on the lme4 package) to selected principal components, which are obtained from the correlation matrix of the original data. Pierre Bushel (2019). pvca: Principal Variance Component Analysis (PVCA). R package version 1.24.0.

Select contributing effect column name(s)

Set the percentile value of the minimum amount of the variabilities that the selected principal components need to explain

Pieplot
Barplot
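
The percentile threshold described above can be illustrated with a small sketch. It is written in Python purely for illustration (BatchServer itself is implemented in R), and the function name `select_components` and the eigenvalues are hypothetical. It keeps the smallest number of leading principal components whose cumulative share of the total variance reaches the threshold; the pvca package may apply additional rules that are not reproduced here.

```python
def select_components(eigenvalues, threshold):
    """Return the smallest number of leading components whose cumulative
    share of the total variance reaches `threshold` (a fraction in (0, 1])."""
    if not 0 < threshold <= 1:
        raise ValueError("threshold must be in (0, 1]")
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for count, value in enumerate(ordered, start=1):
        cumulative += value / total
        # Small tolerance guards against floating-point rounding.
        if cumulative >= threshold - 1e-12:
            return count
    return len(ordered)


eigenvalues = [4.0, 3.0, 2.0, 1.0]  # hypothetical PCA eigenvalues
print(select_components(eigenvalues, 0.6))  # first two explain 70%, so 2
```

A higher threshold keeps more components, so more of the total variability is partitioned among the selected effects.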

UMAP

Uniform Manifold Approximation and Projection (UMAP) is a dimension reduction technique that can be used for visualisation, similarly to t-SNE, but also for general non-linear dimension reduction. Please note that umap does not allow missing values in the data matrix.

Missing value replacement: None, 1, 0, 10% of minimum, minimum

Number of nearest neighbors

Distance method: euclidean, manhattan, cosine, pearson

Number of iterations

Initial coordinates: spectral, random

Minimum distance in the final layout

Initial value of 'learning rate'

Learning rate

Number of non-neighbor points used per point and per iteration during layout optimization

Select contributing effect column name(s)

ComBat

The ComBat function adjusts for known batches using an empirical Bayesian framework, so a known batch variable is required in your dataset. Pay attention to the choice of [parametric estimate method], which has been improved compared with the original ComBat method. The option [automatic] decides, according to the data distribution, whether to set the parametric estimate method to parametric or nonparametric.

Select known batch effect column name

Select surrogate variable(s)

Parametric estimate method: automatic, parameter, noparameter

Only adjusts the mean of the batch effects across batches (default adjusts the mean and variance): No, Yes

Readme

Batch effects are unwanted data variations that may obscure biological signals, leading to bias or errors in subsequent data analyses. Effective evaluation and elimination of batch effects are necessary for omics data analysis. To facilitate the evaluation and correction of batch effects, we present BatchServer, an open-source, user-friendly, interactive graphical web platform based on R/Shiny for batch effect analysis. As the authors of the original ComBat have extensively investigated, if the experiment has not been properly designed, or if the batch design information is missing, no effective batch correction can be performed. An unbalanced batch-group design and inappropriate missing value imputation also pose challenges to effective batch effect correction.

Test data download

dataMatrix sampleInfo

Tutorial

Data input

Users are required to prepare and upload two files: a data file and a sample information file. These files can be tab-delimited, space-separated, comma-delimited, or Excel files. BatchServer also provides test data files (see the section above). Users can upload the two files in the Data Input menu, then click the Submit button. The data read module will read, process, and store the files for subsequent use.

Notice:

Missing value imputation

We offer several ways to impute missing values when needed. Although these methods have worked well for our test data, other methods may perform better on certain data sets. Users may refer to NAguideR, a web server tool that provides a recently compiled missing value imputation toolkit (http://www.omicsolution.org/wukong/NAguideR/).
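
The replacement options offered in the interface (None, 1, 0, 10% of minimum, minimum) can be sketched as follows. This is an illustration in Python, not BatchServer's R code: the function name `replace_missing` is hypothetical, missing values are represented as `None`, and taking the minimum over all observed values in the matrix (rather than per row or per column) is an assumption.

```python
def replace_missing(matrix, method="none"):
    """Replace None entries in a list-of-rows matrix.

    method: "none", "1", "0", "10% of minimum", or "minimum".
    Assumption: the minimum is taken over all observed values.
    """
    if method == "none":
        return [list(row) for row in matrix]
    if method == "1":
        fill = 1.0
    elif method == "0":
        fill = 0.0
    elif method in ("minimum", "10% of minimum"):
        observed = [v for row in matrix for v in row if v is not None]
        fill = min(observed) if method == "minimum" else 0.1 * min(observed)
    else:
        raise ValueError(f"unknown method: {method}")
    return [[fill if v is None else v for v in row] for row in matrix]


data = [[2.0, None, 8.0],
        [None, 4.0, 16.0]]  # hypothetical intensities
print(replace_missing(data, "minimum"))
print(replace_missing(data, "10% of minimum"))
```

Filling with a fraction of the minimum treats a missing value as a signal below the detection limit, which may or may not suit a given data set.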

Quantile normalization

In statistics, quantile normalization is a technique for making two distributions identical in statistical properties. It is a data handling technique that works well in practice on microarray and proteomics data.
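
A minimal sketch of quantile normalization, written in Python for illustration (the function name `quantile_normalize` and the sample values are hypothetical, and BatchServer's own R implementation may differ, for example in how it handles ties or missing values). Each sample's values are replaced by the mean of the values that share the same rank across all samples, so every sample ends up with the same distribution.

```python
def quantile_normalize(columns):
    """Quantile-normalize equal-length columns (one list per sample)."""
    n = len(columns[0])
    if any(len(col) != n for col in columns):
        raise ValueError("all columns must have the same length")
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean of the k-th smallest values across samples.
    reference = [sum(col[k] for col in sorted_cols) / len(columns)
                 for k in range(n)]
    result = []
    for col in columns:
        # Indices of this sample's values, ordered from smallest to largest.
        order = sorted(range(n), key=lambda i: col[i])
        normalized = [0.0] * n
        for rank, index in enumerate(order):
            normalized[index] = reference[rank]
        result.append(normalized)
    return result


sample_a = [5.0, 2.0, 3.0]
sample_b = [40.0, 10.0, 60.0]
print(quantile_normalize([sample_a, sample_b]))
# [[32.5, 6.0, 21.5], [21.5, 6.0, 32.5]]
```

Within each sample the ordering of features is preserved; only the values attached to each rank change.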

Log2 transform

The general reason to log-transform data (log2 or otherwise) is to make variation similar across orders of magnitude. It is not strictly required, but it usually makes downstream analysis more convenient.
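
A short Python illustration of this point (the function name `log2_transform` and the values are hypothetical): a doubling from 1 to 2 and a doubling from 1024 to 2048 differ by very different amounts on the raw scale, but by exactly 1 on the log2 scale.

```python
import math


def log2_transform(values):
    """Apply log2 to positive values; None (missing) is passed through."""
    out = []
    for v in values:
        if v is None:
            out.append(None)
        elif v <= 0:
            raise ValueError("log2 is undefined for non-positive values")
        else:
            out.append(math.log2(v))
    return out


raw = [1.0, 2.0, 1024.0, 2048.0]
print(log2_transform(raw))  # [0.0, 1.0, 10.0, 11.0]
```

Because log2 is undefined for zero and negative values, the order of missing value replacement and log transformation matters when zeros are used as a fill value.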

Batch effect estimation, visualization and correction

After uploading both the data and sample information files, users are advised to evaluate whether their data have batch effects using PVCA or UMAP, both of which are provided by the online server. Both methods visualize the batch effects. If the batch effect is substantial, the next step is to remove it using the improved ComBat.

Notice:

PVCA & UMAP

In most cases the user does not need to change the parameter settings.

Balanced batch-group design

If the batch-group design is balanced, the ComBat approach removes most of the variation attributable to the batch effect, increasing statistical power. If the design is unbalanced, however, the batch variation is underestimated and the corrected data still retain batch variation, which reduces statistical power.
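
Before running a correction, the batch-group design can be checked by cross-tabulating batches against biological groups. The Python sketch below is an illustration only: the function names and sample labels are hypothetical, and "balanced" is taken here to mean that every batch contains the groups in the same proportions, which is one reasonable reading rather than a definition given by BatchServer.

```python
from collections import Counter


def design_table(batches, groups):
    """Count samples for every (batch, group) combination."""
    counts = Counter(zip(batches, groups))
    return {(b, g): counts.get((b, g), 0)
            for b in sorted(set(batches)) for g in sorted(set(groups))}


def is_balanced(batches, groups):
    """True if group proportions are identical in every batch."""
    table = design_table(batches, groups)
    batch_totals = Counter(batches)
    group_totals = Counter(groups)
    n = len(batches)
    # Integer form of: count / n == (batch share) * (group share).
    return all(count * n == batch_totals[b] * group_totals[g]
               for (b, g), count in table.items())


batches = ["b1", "b1", "b2", "b2"]
print(is_balanced(batches, ["case", "control", "case", "control"]))  # True
print(is_balanced(batches, ["case", "case", "control", "control"]))  # False
```

In the second call each group sits entirely in one batch, so batch and group are confounded and no correction method can separate them.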

ComBat

In BatchServer we introduced autoComBat, a modified version of ComBat, the most widely adopted tool for batch effect correction. autoComBat can automatically determine whether to use the parametric or nonparametric Bayes method by means of the Kolmogorov-Smirnov goodness-of-fit test. The goal of ComBat is to remove all unwanted sources of variation while protecting the contrasts due to the primary variables included in the surrogate variables. This leads to the identification of features that are consistently different between groups, removing all common sources of latent variation.
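
The idea of choosing between the two estimation methods with a goodness-of-fit test can be sketched as follows. This Python example is not autoComBat's implementation: the function names are hypothetical, the sample is tested against a normal distribution only, and the decision uses the asymptotic 5% critical value 1.358 / sqrt(n). Which quantities BatchServer tests, and against which distributions, is not specified in this document.

```python
import math
from statistics import NormalDist, mean, stdev


def ks_statistic_normal(sample):
    """One-sample Kolmogorov-Smirnov statistic against a normal
    distribution whose mean and sd are estimated from the sample."""
    xs = sorted(sample)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    d = 0.0
    for i, x in enumerate(xs, start=1):
        cdf = fitted.cdf(x)
        # Largest gap between the empirical and the fitted CDF.
        d = max(d, i / n - cdf, cdf - (i - 1) / n)
    return d


def choose_estimate_method(sample, critical_coefficient=1.358):
    """Return "parametric" if the normal fit is not rejected at the
    asymptotic 5% level, otherwise "nonparametric"."""
    d = ks_statistic_normal(sample)
    critical = critical_coefficient / math.sqrt(len(sample))
    return "parametric" if d <= critical else "nonparametric"


near_normal = [NormalDist().inv_cdf((i + 0.5) / 200) for i in range(200)]
bimodal = [0.0] * 100 + [10.0] * 100
print(choose_estimate_method(near_normal))  # parametric
print(choose_estimate_method(bimodal))      # nonparametric
```

Because the mean and standard deviation are estimated from the same sample, the plain Kolmogorov-Smirnov critical value is conservative, so this sketch rejects the parametric assumption less often than an exact test would.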

Surrogate variables are covariates constructed directly from high-dimensional data (such as gene expression, RNA sequencing, methylation, or brain imaging data) that can be used in subsequent analyses to adjust for unknown, unmodeled, or latent sources of noise. Therefore, if sample sizes are large enough, it is recommended to model all available covariates that are expected to be significant. If no surrogate variables are set, only the effect of the known batch variable is removed, and all sources of latent biological variation remain in the data.

The ComBat function from the sva package has a parameter that adjusts only the mean of the batch effects across batches (by default, both the mean and the variance are adjusted). Setting this option to 'Yes' is recommended when milder batch effects are expected (so there is no need to adjust the variance), or when the variances are expected to differ across batches for biological reasons. For example, suppose a researcher wants to project a knock-down genomic signature into the TCGA data. In this case, the knock-down samples may be very similar to each other (low variance), whereas the signature will be at varying levels in the TCGA patient data. The variances may therefore be very different between the two batches (signature perturbation samples vs. TCGA), so adjusting only the mean of the batch effect across the samples might be desired.
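
The difference between adjusting only the mean and adjusting both mean and variance can be seen on a single feature. The Python sketch below is a plain per-batch location/scale adjustment with hypothetical names and values; it does not perform the empirical Bayes estimation that ComBat uses and is meant only to show what each setting changes.

```python
from statistics import mean, pstdev


def adjust_batches(values, batches, mean_only=False):
    """Align per-batch location (and optionally scale) for one feature.

    Illustration only: no empirical Bayes shrinkage is applied.
    """
    grand_mean = mean(values)
    by_batch = {}
    for v, b in zip(values, batches):
        by_batch.setdefault(b, []).append(v)
    stats = {b: (mean(vs), pstdev(vs)) for b, vs in by_batch.items()}
    # Pooled within-batch standard deviation, weighted by batch size.
    pooled_sd = (sum(len(vs) * stats[b][1] ** 2 for b, vs in by_batch.items())
                 / len(values)) ** 0.5
    adjusted = []
    for v, b in zip(values, batches):
        batch_mean, batch_sd = stats[b]
        if mean_only or batch_sd == 0:
            # Shift the batch to the grand mean; its spread is untouched.
            adjusted.append(v - batch_mean + grand_mean)
        else:
            # Also rescale the batch to the pooled spread.
            adjusted.append((v - batch_mean) / batch_sd * pooled_sd + grand_mean)
    return adjusted


values = [1.0, 2.0, 3.0, 10.0, 20.0, 30.0]
batches = ["a", "a", "a", "b", "b", "b"]
print(adjust_batches(values, batches, mean_only=True))
# [10.0, 11.0, 12.0, 1.0, 11.0, 21.0]
print(adjust_batches(values, batches, mean_only=False))
```

With `mean_only=True`, batch "a" keeps its narrow spread and batch "b" its wide spread, which is the behaviour one would want when the difference in variance is biological, as in the knock-down versus TCGA example.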

Data output

Users can examine and download the figures produced by the batch effect evaluation and assess the batch effects by manual inspection. The batch-effect-corrected data produced by the improved ComBat are also available for download.

References

Tomasz Konopka (2020). umap: Uniform Manifold Approximation and Projection. R package version 0.2.5.0. https://CRAN.R-project.org/package=umap
Pierre Bushel (2018). pvca: Principal Variance Component Analysis (PVCA). R package version 1.22.0.
Jeffrey T. Leek, W. Evan Johnson, Hilary S. Parker, Elana J. Fertig, Andrew E. Jaffe, John D. Storey, Yuqing Zhang and Leonardo Collado Torres (2019). sva: Surrogate Variable Analysis. R package version 3.30.1.
Johnson, W. E., Li, C. & Rabinovic, A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8, 118-127, doi:10.1093/biostatistics/kxj037 (2007).
Zhang, Y., Jenkins, D. F., Manimaran, S., Johnson, W. E. (2018). Alternative empirical Bayes models for adjusting for batch effects in genomic studies. BMC bioinformatics, 19 (1), 262.

Example

- Input data

- The screenshot of using PVCA

- The screenshot of using UMAP

- Parameters settings and result of using improved ComBat

Software author:

Tiansheng Zhu; tszhu @ fudan.edu.cn

License:

BatchServer is open-source software implemented in pure R, and the source code is freely available at https://github.com/zhutiansheng/BatchServer. BatchServer is currently supported by both the School of Computer Science of Fudan University (Zhou's lab: admis.fudan.edu.cn) and the School of Life Sciences of Westlake University (Guo's lab: www.guomics.com). The software is published in the Journal of Proteome Research, DOI: 10.1021/acs.jproteome.0c00488.