ProteomeExpert-Overview

Rapid progress in high-throughput mass spectrometry (MS)-based proteomics, such as data-independent acquisition (DIA), and its penetration into clinical studies have generated an increasing number of proteomic data sets containing hundreds to thousands of samples. To analyze these large proteomic data sets more efficiently and conveniently, we present ProteomeExpert, a web server-based software tool implemented in Docker that offers various analysis tools for experimental design, data mining, interpretation, and visualization of large proteomic data sets. ProteomeExpert can be deployed on any operating system with Docker installed. Availability: The Docker image of ProteomeExpert is freely available from https://hub.docker.com/r/lifeinfo/proteomeexpert. The source code of ProteomeExpert is also openly accessible at http://www.github.com/lifeinfo/ProteomeExpert/.

Modules include:

Experimental Design: Power Analysis & Batch Design
Data Upload: all modules rely on the data uploaded at this step, except Experimental Design and Other tools
Data Preprocessing
Quality Control: Missing Value & Pearson Correlation
Statistics
Machine Learning: Feature selection, Unsupervised learning and Supervised learning
Annotations
Other tools: Peptide to protein

ProteomeExpert-Experimental design

Power analysis

Power analysis is a statistical strategy for determining the sample size required to detect a preset effect with a given statistical test, such as the chi-square test or the t-test. In particular, pay attention to the difference between the calculated sample size and the actual sample size in an experiment. As observed empirically, when the expression of a protein is not that high, say less than 17 after log2 transformation, the actual sample size falls below the calculated one because of missing data, and the statistical power is compromised as well.

Power analysis, set parameters:
Number of Proteins (default=5000) is the estimated number of proteins identified in your dataset.
Mean abundance (default=13) is the average intensity (log2 transformed) of proteins in the experimental group.
Mean abundance 0 (default=13.5) is the average intensity (log2 transformed) of proteins in the control group.
Alpha (default=0.05) is the significance level of the test, i.e. the p-value threshold and the accepted probability of a false positive.
Beta (default=0.2) is the probability of accepting the null hypothesis even though it is false (i.e. a false negative), when the real difference is equal to the minimum effect size. Beta = 1 - power.
Standard deviation (default=0.75) has to be estimated for the measured variables; it is usually assumed to equal the standard deviation of the control group.
Click on Submit; the estimated sample size and figure will be shown on the right side of the browser window.
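
To make the roles of these parameters concrete, here is a minimal sketch of a per-group sample size estimate. It assumes the common normal-approximation formula for a two-sided, two-sample comparison, n = 2 (z_{1-alpha/2} + z_{1-beta})^2 sd^2 / delta^2; this formula and the function name samples_per_group are illustrative assumptions and are not taken from ProteomeExpert, which may use a different method (for example one that accounts for the number of proteins tested).

```python
import math
from statistics import NormalDist


def samples_per_group(mean_case, mean_control, sd, alpha=0.05, beta=0.2):
    """Approximate per-group sample size for a two-sided two-sample test.

    Assumption of this sketch: normal approximation,
    n = 2 * (z_{1-alpha/2} + z_{1-beta})^2 * sd^2 / delta^2.
    """
    delta = abs(mean_case - mean_control)  # effect size on the log2 scale
    if delta == 0:
        raise ValueError("the two group means must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(1 - beta)
    n = 2 * (z_alpha + z_beta) ** 2 * sd ** 2 / delta ** 2
    return math.ceil(n)  # round up to whole samples


# Defaults listed above: log2 means 13 and 13.5, standard deviation 0.75.
n_default = samples_per_group(13, 13.5, 0.75)
print(n_default)
```

A smaller difference between the two mean abundances, or a larger standard deviation, increases the estimate; a larger Beta (lower power) decreases it.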

Batch design

The main purpose of batch design is to allocate samples into balanced groups to minimize technical bias, such as instrumental variation, in large cohort studies.
Upload your sample data matrix in .txt or .csv format with the Browse button. Choose the separator for the file according to its format: Comma for .csv files; Semicolon, Comma or Tab for .txt files.
Select columns for balanced batch design means choosing the column names (attributes) of your sample dataset that need to be balanced.
Weights for columns sets the relative weight of each attribute when the attributes are balanced together. For example, submitting 1,1,2 for selected columns A, B and C gives column A a weight of 25%, column B 25% and column C 50%. Weights are normalized by dividing each weight by the sum of all n weights, where n is the number of attributes.
Number of samples in each batch means how many samples you expect in each balanced group.
Select numeric columns for balanced batch design means that any attribute of numeric type, such as age, tumor, node, metastasis, etc., must be declared here.
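
The weight example above can be checked with a short sketch. The helper names parse_weights and normalize_weights are hypothetical and only illustrate the normalization; they are not part of ProteomeExpert.

```python
def parse_weights(text):
    """Turn a comma-separated weight string such as "1,1,2" into floats."""
    return [float(part) for part in text.split(",")]


def normalize_weights(weights):
    """Divide each weight by the sum of all weights so that they add up to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return [w / total for w in weights]


columns = ["A", "B", "C"]  # the selected columns from the example above
shares = dict(zip(columns, normalize_weights(parse_weights("1,1,2"))))
print(shares)  # {'A': 0.25, 'B': 0.25, 'C': 0.5}
```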

Download the batch design test data

For this demo, we will be using a Delayed post-hypoxic leukoencephalopathy (not published) dataset comprising information on 168 samples, with the attributes “Sample ID”, “Type”, “Sex”, “Age”, “TNM” and “Fuhrman”. Download the batchdesign.csv file from “Online Help - Test data files used for batch design - Get”
Click on Browse.. to upload the batch_design.csv file, choose Comma as separator.
Select “Type”, “Sex”, “Age”, “TNM” and “Fuhrman” for balanced batch design.
Input weights “1,1,1,1,1” for “Type”, “Sex”, “Age”, “TNM” and “Fuhrman”.
Input “15” as number of samples in each batch.
Select “Age” and “TNM” as numeric columns for balanced batch design.
Click on Submit and wait for the result, which is shown on the right side. The batchId column gives the batch each sample is assigned to.
Click on Download to get the BatchDesignResult.txt file.

ProteomeExpert-Data upload

Overview

Data Upload

Data Upload is the core data input interface for uploading your own data files. The module accepts your protein matrix and sample annotation files (an experiment run sample file and, if available, an individual file) as the input data for most modules. It also interactively merges the sample and individual information into one file, which is required by some modules such as statistics, data mining and data preprocessing. There are two ways to upload data: the two files format and the three files format. Which one to choose depends on whether the sample file already stores all the annotation you need.

Two files format

Upload your files (protein, sample) in .txt or .csv format with the Browse button. Choose the separator for each file according to its format: Comma for .csv files; Semicolon, Comma or Tab for .txt files.

Three files format

Upload your files (protein, sample, individual) in .txt or .csv format with the Browse button. Choose the separator for each file according to its format: Comma for .csv files; Semicolon, Comma or Tab for .txt files. Select the sample id column for further analysis (the protein file should use the same sample ids as the sample file), and select the individual id/name column in both the sample file and the individual file as the reference for the data merge. The tool then merges the files into a template for further analysis.
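
The merge of sample and individual annotation can be pictured as a join on the shared individual id column. The sketch below is only an illustration of that idea: the function merge_annotations, the rule that sample values win on a column name clash, and all example values are assumptions, and the actual merge is done by the web tool.

```python
def merge_annotations(sample_rows, individual_rows, key="Individual_ID"):
    """Attach individual-level annotation to every sample row via a shared id."""
    by_id = {row[key]: row for row in individual_rows}
    merged = []
    for sample in sample_rows:
        # Samples without a matching individual keep only their own columns.
        combined = dict(by_id.get(sample[key], {}))
        combined.update(sample)  # assumption: sample values win on name clashes
        merged.append(combined)
    return merged


# Hypothetical rows; column names follow the tutorial, values are made up.
samples = [
    {"SampleName": "S1", "Individual_ID": "P01", "TissueType": "tumor"},
    {"SampleName": "S2", "Individual_ID": "P01", "TissueType": "normal"},
    {"SampleName": "S3", "Individual_ID": "P02", "TissueType": "tumor"},
]
individuals = [
    {"Individual_ID": "P01", "Sex": "F"},
    {"Individual_ID": "P02", "Sex": "M"},
]
merged = merge_annotations(samples, individuals)
print(merged[0])
```

Several samples may point to the same individual, which is why the individual rows are indexed by id first and then looked up once per sample.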

Tutorial

Two files format

Upload your protein matrix file and sample information file as follows:

Three files format

Download the protein, sample and individual test data
Download the test_prot.txt, test_sample.csv and test_individual.csv files from “Online Help - Test data files used for batch design - Get”. test_individual.csv contains information on 21 individuals, and test_sample.csv contains individual and sample information. All of the test data are from the Delayed post-hypoxic leukoencephalopathy (DPHL) dataset.
Select your protein file: click on the Browse.. to upload the test_prot.txt file, choose Tab as separator.
Select your sample file: click on the Browse.. to upload the test_sample.csv file, choose Comma as separator.
Select your individual file: click on the Browse.. to upload the test_individual.csv file, choose Comma as separator.
Annotate sample columns: select SampleName as sample id; select Individual_ID as individual id/name, then click on Submit.
Annotate individual columns: select Individual_ID as individual id/name, then click on Submit.
Click on Merge; the result is shown at the bottom of the page. After merging, all uploaded data are saved on the server for further analysis.

ProteomeExpert-Data preprocessing

Overview

Data Preprocessing

Data Preprocessing is used to transform data in accordance with modeling experiment conditions configured in the project.

The protein matrix analyzed in Data Preprocessing should be uploaded in “Data Upload - select your protein file”. Under Select matrix, choose uploadedProtMatrix for the following analysis. This module includes methods for log transformation, missing value substitution, normalization, batch effect adjustment (using ComBat), and replicate treatment (using the values of replicates to fill in missing values).

If you need a logarithmic transformation of the data, use Log Transform; Log2 and Log10 are available. Choose None to skip this step.
We provide four options for Missing Value Substitution: “1”, “0”, “10% of minimum” or “minimum”. Choose None to skip this step.
Three functions, i.e. “Quantile”, “Z-score” and “Max-Min”, can be used for data Normalization.

In addition, Technical Replicates and Biological Replicates can be replaced by their Mean or Median value. To remove unwanted batch effects, select the batch column name under Remove Batch Effects.
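
The following sketch illustrates what three of these options compute on a tiny matrix (rows are proteins, columns are samples, None marks a missing value). It is an illustration only. The function names, the use of the global minimum of the matrix for substitution, and the per-sample Max-Min scaling are assumptions of this sketch, not confirmed details of ProteomeExpert.

```python
import math


def log2_transform(matrix):
    """Log2-transform observed values and leave missing values (None) untouched."""
    return [[None if v is None else math.log2(v) for v in row] for row in matrix]


def fill_missing(matrix, method="minimum"):
    """Replace None using one of the four substitution options listed above."""
    observed = [v for row in matrix for v in row if v is not None]
    smallest = min(observed)  # assumption: minimum over the whole matrix
    fill = {"0": 0.0, "1": 1.0,
            "minimum": smallest, "10% of minimum": 0.1 * smallest}[method]
    return [[fill if v is None else v for v in row] for row in matrix]


def max_min_by_sample(matrix):
    """Rescale every column (sample) to the range [0, 1]."""
    scaled_columns = []
    for column in zip(*matrix):
        low, high = min(column), max(column)
        span = high - low
        scaled_columns.append([(v - low) / span if span else 0.0 for v in column])
    return [list(row) for row in zip(*scaled_columns)]


raw = [[8.0, 16.0],
       [None, 4.0],
       [2.0, 32.0]]
logged = log2_transform(raw)              # [[3.0, 4.0], [None, 2.0], [1.0, 5.0]]
filled = fill_missing(logged, "minimum")  # None becomes 1.0
scaled = max_min_by_sample(filled)
print(scaled)
```

In this sketch the log transform runs before substitution, so substituting “0” never leads to taking the logarithm of zero.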

Tutorial

Download the test data
For this demo, we will be using a protein matrix dataset comprising 3724 identified proteins from 24 samples. Download the test_prot.txt file from “Online Help - Test data files used for data console - The test protein matrix contains 24 DIA runs”
Upload the test file to “Data Upload - select your protein file”.
Go to Data Preprocessing and select “uploadedProtMatrix”, set parameters e.g.:
- Log Transform: Log2
- Missing Value Substitution: 0
- Normalization: Quantile
- Batch effects correction: None
- Technical Replicates: None
- Biological Replicates: None
Click on Submit and wait for the result, which is shown on the right side.
Click on Download to get the PreProcessed.txt file.

ProteomeExpert-Quality control

Overview

Quality control (QC)

'QC' shows the quality of your protein matrix, measured by the missing value ratio and reproducibility (Pearson Correlation). The protein matrix must be uploaded in 'Data Upload - select your protein file' before starting this step. Under Select matrix, choose uploadedProMatrix as the matrix to be analyzed, and under Select your interesting column name choose the attribute of interest; proteins are then counted in tiers according to the non-missing ratio of each protein. Then choose MissingValueExplore and Reproducibility as the modules you want to process. The MissValueExplore module is designed to explore missing data distributions, focusing on numeric missing data. Its report shows the missing data distributions in different tissue/disease types, in each row (protein) and in each column (sample). The Reproducibility module is designed to explore the correlation between each pair of samples using the Pearson correlation.
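
The two quality measures can be illustrated with a short sketch. The function names and the choice to correlate only positions where both samples have a value (pairwise-complete observations) are assumptions of this example; the module's own handling of missing values may differ.

```python
import math


def missing_ratio(values):
    """Fraction of missing entries (None) in one row or one column."""
    return sum(v is None for v in values) / len(values)


def pearson(x, y):
    """Pearson correlation over positions where both samples have a value."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two shared observations")
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    var_x = sum((a - mean_x) ** 2 for a, _ in pairs)
    var_y = sum((b - mean_y) ** 2 for _, b in pairs)
    # A constant sample has zero variance, so the correlation is undefined.
    return cov / math.sqrt(var_x * var_y)


# Made-up intensities of four proteins in two replicate runs.
run_a = [10.0, 12.0, None, 14.0]
run_b = [20.0, 24.0, 22.0, 28.0]
print(missing_ratio(run_a))
print(pearson(run_a, run_b))
```

A correlation close to 1 between replicate runs indicates good reproducibility, and a high missing ratio flags proteins or samples that may need imputation or removal.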

Tutorial

Select uploadedProMatrix as the matrix to be analyzed.
Select MissValueExplore and Reproducibility as the modules you want to process. Click on Submit; the results are shown on the right side of the page.

The screenshot of reproducibility

ProteomeExpert-Statistics

Overview

Here we provide some statistical methods for analyzing identified proteins:

t-test

A t-test is any statistical hypothesis test in which the test statistic follows a Student's t-distribution under the null hypothesis.

VolcanoPlot

A volcano plot is a type of scatterplot used to quickly identify changes in large datasets composed of replicate data. It plots significance against fold change on the y- and x-axes, respectively.

ViolinPlot

A violin plot is a method of plotting numeric data. It is a box plot with a rotated kernel density plot on each side. It is similar to a box plot, except that it also shows the probability density of the data at different values (in the simplest case this could be a histogram).

RadarMap

A radar chart is a graphical method of displaying multivariate data in the form of a two-dimensional chart of three or more quantitative variables represented on axes starting from the same point. The relative position and angle of the axes are typically uninformative.

Parameters

For t-test, we provide the following parameters to be set:

Type:
- Two-sided t-test, H0: mu=0, Ha: mu!=0
- One-sided t-test:
  - Less (H0: mu=0, Ha: mu<0)
  - Greater (H0: mu=0, Ha: mu>0)
- Click on Paired samples if the two groups consist of paired observations (for example, the same individuals measured twice), so that the test is applied to the differences within each pair.
- Click on Equal variance if the two samples are assumed to have equal variance.
Critical: For a single-group design, a standard value or population mean must be given, together with a set of quantitative observations. For a paired design, the differences within each pair must follow a normal distribution. For a two-group design, individuals must be independent of each other, and both groups must be drawn from normally distributed populations with homogeneous variance. These preconditions are needed because the t statistic follows the t-distribution only when they hold, and the t-test takes the t-distribution as its theoretical basis.
Confidence level: usually 95%
Adjust p-value method:
- none
- bonferroni
- hochberg
- hommel
- holm
- BH
- BY
- fdr
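
The option names above match the method names of R's p.adjust function, where fdr is an alias of BH. As an illustration of what such an adjustment does, here is a minimal sketch of the Bonferroni and Benjamini-Hochberg (BH) procedures. The function names are hypothetical, and the sketch is written from the published definitions of the two procedures rather than taken from ProteomeExpert.

```python
def adjust_bonferroni(p_values):
    """Multiply every p-value by the number of tests, capped at 1."""
    n = len(p_values)
    return [min(1.0, p * n) for p in p_values]


def adjust_bh(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(p_values)
    # Walk from the largest p-value down so the running minimum keeps the
    # adjusted values monotone.
    order = sorted(range(n), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0
    for position, index in enumerate(order):
        rank = n - position  # rank of this p-value in ascending order
        running_min = min(running_min, p_values[index] * n / rank)
        adjusted[index] = running_min
    return adjusted


p_raw = [0.01, 0.04, 0.03, 0.005]
print(adjust_bonferroni(p_raw))
print(adjust_bh(p_raw))
```

Bonferroni controls the family-wise error rate and is the more conservative of the two; BH controls the false discovery rate and is a common choice when thousands of proteins are tested at once.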

For VolcanoPlot, we provide the following parameters to be set:

Click on Already Log2 transformed protein matrix if your protein data have already been log2 transformed in Data Preprocessing.
Adjust p-value threshold.
Fold change threshold.
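
To show how the two thresholds interact, the sketch below computes the coordinates of one volcano plot point and labels it. It assumes log2-transformed group means, a strict comparison for the p-value and an inclusive comparison for the fold change; these choices and the function names are assumptions of this example, not the tool's documented behavior.

```python
import math


def classify(log2_fc, adj_p, fc_threshold=2.0, p_threshold=0.05):
    """Label a protein as up, down or not significant."""
    # A fold change threshold of 2 corresponds to |log2 fold change| >= 1.
    if adj_p < p_threshold and abs(log2_fc) >= math.log2(fc_threshold):
        return "up" if log2_fc > 0 else "down"
    return "not significant"


def volcano_point(mean_a_log2, mean_b_log2, adj_p):
    """Return (x, y, label) for one protein from log2 group means."""
    log2_fc = mean_a_log2 - mean_b_log2  # a difference of logs is a log ratio
    return log2_fc, -math.log10(adj_p), classify(log2_fc, adj_p)


print(volcano_point(15.0, 13.5, 0.001))
print(volcano_point(12.0, 14.0, 0.01))
print(volcano_point(13.0, 13.2, 0.001))
```

On data that have already been log2 transformed, the fold change is the difference of the group means rather than their ratio, which is why the Already Log2 transformed protein matrix option matters.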

Tutorial

The data analyzed in this part must first be uploaded in the Data Console part.
Select your interesting column name; this is usually the tissue/disease type. Here we choose TissueType.
Select the four statistical methods: t-test, Volcano Plot, Violin Plot and Radar Map.
Click on Submit.
Click on t-test and set the t-test parameters on the right side as follows:
- two.sided
- Equal variance
- Confidence level = 0.95
- Adjust p-value method: none
- Select the first group: Prostate cancer
- Select the second group: Benign prostate hyperplasia tissue
Click on Submit.
Click on VolcanoPlot and set the VolcanoPlot parameters on the right side as follows:
- Already Log2 transformed protein matrix
- Adjust p-value threshold: 0.05
- Fold change threshold: 2
Click on Submit.
Click on ViolinPlot to see the ViolinPlot figure.
Click on RadarMap to see the RadarMap figure.

ProteomeExpert-Machine learning

Overview

The Machine learning module includes feature selection, clustering and classification. In the feature selection module, users can apply filter methods to remove features with near-zero variance or high correlation, and can also apply additional feature selection methods: LASSO (Tibshirani, 1996), genetic algorithm, and random forest. Because classifying disease into subtypes is of great interest for clinical diagnosis and prognosis, users can perform various machine learning analyses: PCA, t-SNE, and UMAP for unsupervised analysis, and decision tree, random forest, and XGBoost for supervised learning.

Feature selection

Feature selection is the process of choosing a reduced number of explanatory variables to describe a response variable. It is especially important for high-dimensional datasets such as genomics and proteomics data. The main goal of proteomic biomarker discovery is to identify the proteins most strongly associated with the disease. Here we use three well-known feature selection methods: LASSO, genetic algorithm, and random forest. We describe how each method works and how to use it for biomarker discovery.

LASSO is short for Least Absolute Shrinkage and Selection Operator. We use the glmnet package in R: the cv.glmnet function chooses the most appropriate tuning parameter λ, which controls the strength of the penalty, with α = 1 for LASSO regularization. Once λ is set, the glmnet function performs the feature selection according to this λ.

Genetic Algorithm (GA) is a stochastic optimization method inspired by Charles Darwin's idea of natural selection. Here we use the GA to select the right number of proteins to find positive biomarkers. The fitness function is defined as the ROC divided by the number of features for binary classification, and as the accuracy divided by the number of features for multi-class classification. Selection, crossover and mutation are performed automatically by the ga function in the GA package.

Random forest is very popular in bioinformatics and has achieved strong results. The sbf and rfe functions (provided by the R package caret and used here with random forest models) are applied to perform feature selection using cross-validation.

Upload the test files in “Data Upload” then set parameters e.g.:

Unsupervised

The following algorithms (implementation mainly based on the R packages pheatmap, stats, tsne and umap) are provided for selection:

Heatmap
PCA
t-SNE
UMAP

Upload the test files in “Data Upload” then set parameters e.g.:

The result of PCA (demo).

Supervised

It includes the following algorithms:

Decision Tree: implementation based on R packages rpart and rpart.plot
Random forest: implementation based on R package randomForest
XGBoost: implementation based on R package xgboost

Upload the test files in “Data Upload” then set parameters e.g.:

The result of using decision tree (demo).

ProteomeExpert-Other tools

Overview

This module provides some optional tools which may be used in quantitative proteomics data analysis. Currently, only the peptide to protein function is available.

Procedure

The peptide to protein inference integrates data normalization, batch effect reduction, missing value imputation and the transformation of the peptide matrix into a protein matrix. We provide two alternative methods: the mean of the top 3 precursor intensities and the linear regression of the top 3 precursor intensities.

Steps of the linear regression of top 3 precursor intensity

1. Prepare the peptide matrix, where each row represents a peptide and each column represents a sample.
2. Log2-transform the precursor intensities.
3. Perform quantile normalization across all samples via the normalize.quantiles function from the Bioconductor R library preprocessCore.
4. Technical imputation: use technical replicates to fill in each other's NAs.
5. Batch correction based on user-defined batches using ComBat (R package sva).
6. Calculate the mean expression of each row to rank the peptide precursor intensities.
7. Order the peptide precursors within each protein group, first by number of NAs (intensity equal to 0, ascending) and then by mean expression (descending).
8. Keep no more than the top 3 precursors (number of NAs ascending, then mean expression descending) for each protein group.
9. To impute missing values at the protein level for samples in which multiple peptides were quantified, we build a linear model on the matrix from step 8. The top precursor of each protein group is always used as the dependent variable (y), and the remaining precursors with values greater than 0 are used as independent variables (X) in decreasing order of priority; y is then imputed using a linear combination of X. Only regression coefficients with P value <= 0.05 and R2 > 0.36 are accepted, rounded to two decimal places. The intensity of the top 1 peptide precursor is then used to represent the protein intensity. The lm function in R is used to build the model and perform the imputation.
10. Keep the proteins corresponding to the top 1 precursors, along with their intensities.

Steps of the mean of top 3 precursor intensity

Steps 1 to 8 are the same as above. Then, calculate the mean of the top 3 precursor intensities for each protein as its protein intensity. Finally, keep the proteins along with their intensities.
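
The ranking in steps 6 to 8 and the final averaging can be sketched as follows for a single protein group. This is an illustration under stated assumptions: the function name top3_mean is hypothetical, missing values are coded as 0 as in step 7, the row mean used for ranking includes those zeros, and the per-sample mean is taken over the observed (non-zero) top precursors only. The tool itself may treat these details differently.

```python
def top3_mean(precursors):
    """Protein intensity per sample from the top 3 precursors of one protein.

    precursors: list of rows, one row per precursor, one value per sample;
    a value of 0 stands for a missing intensity.
    """
    def n_missing(row):
        return sum(1 for v in row if v == 0)

    def mean_expression(row):
        return sum(row) / len(row)

    # Fewest missing values first, then highest mean expression.
    ranked = sorted(precursors, key=lambda row: (n_missing(row), -mean_expression(row)))
    top = ranked[:3]

    protein = []
    for j in range(len(top[0])):
        observed = [row[j] for row in top if row[j] != 0]
        protein.append(sum(observed) / len(observed) if observed else 0.0)
    return protein


# Made-up intensities: five precursors of one protein in three samples.
group = [
    [10.0, 12.0, 11.0],
    [14.0, 0, 13.0],
    [9.0, 9.0, 9.0],
    [20.0, 0, 0],
    [0, 15.0, 16.0],
]
protein_intensity = top3_mean(group)
print(protein_intensity)  # [9.5, 12.0, 12.0]
```

Here the first, third and fifth precursors are kept: two have no missing values, and among the precursors with one missing value the fifth has the higher mean expression.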

Tutorial

Download the peptides.txt, technical_replicas.txt and batch_file.txt files from “Online Help - Test data files used for peptide to protein inference - Get”
Select your peptide file: click on the Browse.. to upload the peptides.txt file, choose Tab as separator and select the Header checkbox.
Select your technical file: click on the Browse.. to upload the technical_replicas.txt file, choose Tab as separator.
Select your individual file: click on the Browse.. to upload the batch_file.txt file, choose Tab as separator.
Click Submit button.

Test data files used for peptide to protein inference

Test data files used for data upload

The test protein matrix contains 24 DIA runs. Missing value imputation is required in order to use all functions except quality control; Data Preprocessing is a simple way to do it.

Get

The test sample information file contains 24 DIA samples.

Get

The test individual file contains 21 individuals (Only for three files input).

Get

Test data files used for batch design

Get
