Welcome to ProteomeExpert, a user-friendly web server built with R/Shiny for quantitative proteomics data analysis. To analyze large-scale proteomic data sets more efficiently and conveniently, we present ProteomeExpert, a web server-based software tool implemented in Docker that offers various analysis tools for experimental design, data mining, interpretation, and visualization of large proteomic data sets.

image.png

 

Description:

Power analysis is a statistical tool that allows us to determine the sample size required to detect a preset effect under a given test statistic, such as the Chi-square test or t-test. In particular, we need to pay attention to the difference between the calculated sample size and the sample size actually realized in an experiment. As observed empirically, when the expression of a protein is low, say less than 17 after log2 transformation, missing data reduce the realized sample size, and the statistical power is compromised as well.

Reference:

Lynch Michael, Walsh Bruce. 1998. Genetics and Analysis of Quantitative Traits.

Results:

                

Description:

Data preview

Protein matrix preview

Sample annotation display

Description:

Annotate sample columns:


Annotate individual columns:




Result

Description:

Data Preprocessing is used to transform data in accordance with modeling experiment conditions configured in the project.

Download
Description:

This module is designed to explore missing data distributions, focusing on numeric missing data.



                
                


Description
The correlation between two variables reflects the degree to which the variables are related.

Description:

A t-test is any statistical hypothesis test in which the test statistic follows a Student's t-distribution under the null hypothesis.


Set parameters:
Description:

In statistics, a volcano plot is a type of scatter plot that is used to quickly identify changes in large datasets composed of replicate data. It plots significance versus fold-change on the y- and x-axes, respectively.


Set parameters for volcano plot

Set parameters for t test:
Description:

A violin plot is a method of plotting numeric data. It is a box plot with a rotated kernel density plot on each side. A violin plot is similar to a box plot, except that it also shows the probability density of the data at different values (in the simplest case this could be a histogram).


Description:

A radar chart is a graphical method of displaying multivariate data in the form of a two-dimensional chart of three or more quantitative variables represented on axes starting from the same point. The relative position and angle of the axes is typically uninformative.


Feature selection methods fall into three categories: filter, wrapper and embedded.

Summary:

                
Download
Description:

A heat map (or heatmap) is a graphical representation of data where the individual values contained in a matrix are represented as colors.


Description:

Principal component analysis (PCA) is an exploratory analysis tool that emphasizes variation and visualizes possible patterns underlying a dataset. It uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. Depending on the context, PCA is also referred to as eigenvalue decomposition, and the eigenvalues (a vector) and eigenvectors (a matrix) are often used to represent the data.

Mark 1: In a proteomic data matrix, missing data (control samples often have more missing values) play a role in determining the outcome of PCA.

Mark 2: If blank controls (AQUA) are available in the experiment, the coordinates of blank controls can tell the quality of the data.


Description:

T-distributed Stochastic Neighbor Embedding (t-SNE) is a nonlinear dimensionality reduction technique well-suited for embedding high-dimensional data for visualization in a low-dimensional space of two or three dimensions.

Reference:

L. van der Maaten, G. Hinton. Visualizing Data using t-SNE. Journal of Machine Learning Research, 2008.


Description:

UMAP (Uniform Manifold Approximation and Projection) is a novel manifold learning technique for dimension reduction. The UMAP algorithm is competitive with t-SNE for visualization quality, and arguably preserves more of the global structure with superior run time performance.

Reference:

McInnes, L, Healy, J, UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction, ArXiv e-prints 1802.03426,2018


Summary

(*Note: If you have a large matrix, the system may be slow; please be patient.)

Results


                  

                  

Summary

Input:


                
String


Database

  • UniProt: The mission of UniProt is to provide the scientific community with a comprehensive, high-quality and freely accessible resource of protein sequence and functional information.

  • String-db: Protein-Protein Interaction Networks.

  • KEGG: KEGG is a database resource for understanding high-level functions and utilities of the biological system, such as the cell, the organism and the ecosystem, from molecular-level information, especially large-scale molecular datasets generated by genome sequencing and other high-throughput experimental technologies.

  • GO: The Gene Ontology (GO) knowledgebase is the world's largest source of information on the functions of genes.

  • Reactome: Reactome is a free, open-source, curated and peer-reviewed pathway database.


Description:

Peptide2Protein provides a protein inference function; details are on the help page.

Summary

Download all protein matrix

Download

ProteomeExpert-Overview

The rapid progress of high-throughput mass spectrometry (MS)-based proteomics, such as data-independent acquisition (DIA), and its penetration into clinical studies have generated an increasing number of proteomic data sets containing hundreds to thousands of samples. To analyze these big proteomic data sets more efficiently and conveniently, we present ProteomeExpert, a web server-based software tool implemented in Docker that offers various analysis tools for experimental design, data mining, interpretation, and visualization of big proteomic data sets. ProteomeExpert can be deployed on any operating system with Docker installed. Availability: The Docker image of ProteomeExpert is freely available from https://hub.docker.com/r/lifeinfo/proteomeexpert. The source code of ProteomeExpert is also openly accessible at http://www.github.com/lifeinfo/ProteomeExpert/.


image.png

 

Modules include:

  • Experimental Design: Power Analysis & Batch Design
  • Data Upload: all modules rely on the data uploaded at this step, except Experimental Design and Other tools
  • Data Preprocessing
  • Quality Control: Missing Value & Pearson Correlation
  • Statistics
  • Machine Learning: Feature selection, Unsupervised learning and Supervised learning
  • Annotations
  • Other tools: Peptide to protein

ProteomeExpert-Experimental design

Power analysis

Power analysis is a statistical strategy that allows us to determine the sample size required to detect a preset effect under a given test statistic, such as the Chi-square test or t-test. In particular, we need to pay attention to the difference between the calculated sample size and the actual sample size in an experiment. As observed empirically, when the expression of a protein is low, say less than 17 after log2 transformation, missing data reduce the realized sample size, and the statistical power is compromised as well.

Power analysis, set parameters:
Number of Proteins (default=5000) is the estimated number of proteins identified in your dataset.
Mean abundance (default=13) is the average intensity (log2 transformed) of proteins in the experimental group.
Mean abundance 0 (default=13.5) is the average intensity (log2 transformed) of proteins in the control group.
Alpha (default=0.05) is the significance level of the test, i.e. the p-value threshold below which the null hypothesis is rejected.
Beta (default=0.2) is the probability of accepting the null hypothesis even though it is false (i.e. a false negative), when the real difference is equal to the minimum effect size. Beta = 1 - power.
Standard deviation (default=0.75) has to be estimated for the measured variables; it is usually assumed to equal the standard deviation of the control group.
Click on Submit and the estimated sample size and figure will be shown on the right side of the browser window.
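
To give a feel for how these parameters interact, the following is a minimal sketch of a textbook sample size calculation for a single protein, using the normal approximation for a two-sided, two-sample comparison. It is not ProteomeExpert's own calculation: the function name samples_per_group is hypothetical, and the Number of Proteins parameter (which may be used for multiple-testing considerations) is ignored here.

```python
import math
from statistics import NormalDist

def samples_per_group(mean_case, mean_control, sd, alpha=0.05, beta=0.2):
    """Approximate samples per group for a two-sided, two-sample comparison.

    Uses n = 2 * ((z_{1-alpha/2} + z_{1-beta}) * sd / effect)^2,
    the normal approximation (an assumption, not the tool's exact method).
    """
    effect = abs(mean_case - mean_control)
    if effect == 0:
        raise ValueError("the two group means must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(1 - beta)        # quantile for power = 1 - beta
    n = 2 * ((z_alpha + z_beta) * sd / effect) ** 2
    return math.ceil(n)

# Defaults listed above: means 13 vs 13.5 (log2), SD 0.75, alpha 0.05, beta 0.2
n_default = samples_per_group(13, 13.5, 0.75)
print(n_default)
```

Under these assumptions the defaults give 36 samples per group. Halving the difference between the two means roughly quadruples the required sample size, which is why low-abundance proteins with noisy measurements are costly to power.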

Batch design

The main purpose of batch design is to allocate samples into balanced groups to minimize technical bias, such as instrumental variation, in large cohort studies.
Upload your sample data matrix in .txt or .csv format with the Browse button. Choose the separator for the file according to its format: Comma for .csv; Semicolon, Comma or Tab for .txt.
Select columns for balanced batch design means choosing the column names (attributes) of your sample dataset that need to be balanced.
Weights for columns sets the relative weight of each attribute when several attributes are balanced together. For example, submitting 1,1,2 for the selected columns A, B, C means that column A has a weight of 25%, column B 25% and column C 50%. Weights are normalized to sum to 1, i.e. each weight is divided by the sum of all n weights, where n is the number of attributes.
Number of samples in each batch means how many samples you expect in each balanced group.
Select numeric columns for balanced batch design indicates which of the selected attributes are numeric, such as age or TNM (tumor, node, metastasis) stage.
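
The weight normalization described above can be sketched as follows. This is only an illustration consistent with the 1,1,2 example; the function name normalize_weights is hypothetical and the exact normalization used by the server is assumed, not confirmed.

```python
def normalize_weights(weights):
    """Scale attribute weights so that they sum to 1 (assumed normalization)."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return [w / total for w in weights]

# Columns A, B, C with submitted weights 1,1,2
weights = normalize_weights([1, 1, 2])
print(weights)
```

For the input 1,1,2 this yields 0.25, 0.25 and 0.5, matching the percentages given in the text.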

  1. Download the batch design test data

    For this demo, we use a Delayed post-hypoxic leukoencephalopathy (not published) dataset comprising information on 168 samples, with the attributes “Sample ID”, “Type”, “Sex”, “Age”, “TNM” and “Fuhrman”. Download the batchdesign.csv file from “Online Help - Test data files used for batch design - Get”.

  2. Click on Browse.. to upload the batch_design.csv file, choose Comma as separator.

  3. Select “Type”, “Sex”, “Age”, “TNM” and “Fuhrman” for balanced batch design.

  4. Input weights “1,1,1,1,1” for “Type”, “Sex”, “Age”, “TNM” and “Fuhrman”.

  5. Input “15” as number of samples in each batch.

  6. Select “Age” and “TNM” as numeric columns for balanced batch design.

  7. Click on Submit and wait for the result, which is shown on the right side. The batchId column shows the batch each sample is assigned to.

  8. Click on Download to get the BatchDesignResult.txt file.

ProteomeExpert-Data upload

Overview

Data Upload

Data Upload is the core data input interface for uploading your own data files. The Data Upload module accepts your protein matrix and sample annotation files (an experiment run sample file and, if available, an individual file) as the input data for most modules. Moreover, it interactively merges the sample and individual information into one file, which is required by some modules such as statistics, data mining and data preprocessing. There are two ways to upload data: the two files format and the three files format. Which one to choose depends on whether the sample file already contains all the annotation you need.

Two files format

Upload your files (protein, sample) in .txt or .csv format with the Browse button. Choose the separator for each file according to its format: Comma for .csv; Semicolon, Comma or Tab for .txt.

Three files format

Upload your files (protein, sample, individual) in .txt or .csv format with the Browse button. Choose the separator for each file according to its format: Comma for .csv; Semicolon, Comma or Tab for .txt. Select the sample id column for further analysis (the protein file should use the same sample ids as the sample file), and select the individual id/name column in both the sample file and the individual file as the reference for the data merge. This tool merges the files into one template for further analysis.

Tutorial

Two files format

Upload your protein matrix file and sample information file as follows: image.png

Three files format

  1. Download the protein, sample and individual test data
    Download the test_prot.txt, test_sample.csv and test_individual.csv files from “Online Help - Test data files used for data upload - Get”. The test_individual.csv file contains information on 21 individuals; test_sample.csv contains individual and sample information. All of the test data are from the Delayed post-hypoxic leukoencephalopathy (DPHL) dataset.
  2. Select your protein file: click on Browse.. to upload the test_prot.txt file, choose Tab as separator.
  3. Select your sample file: click on Browse.. to upload the test_sample.csv file, choose Comma as separator.
  4. Select your individual file: click on Browse.. to upload the test_individual.csv file, choose Comma as separator.
  5. Annotate sample columns: select SampleName as sample id; select Individual_ID as individual id/name, click on Submit.
  6. Annotate individual columns: select Individual_ID as individual id/name, click on Submit.
  7. Click on Merge; the result is shown at the bottom of the page. After the merge, all uploaded data are saved on the server for further analysis.

    image.png
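
Conceptually, the merge in the three files format joins each sample row with the annotation of the individual it came from. The sketch below illustrates this with plain dictionaries. The column names SampleName and Individual_ID are taken from the tutorial; the function name merge_annotations, the other columns and all values are hypothetical, and the server's actual merge logic may differ.

```python
def merge_annotations(samples, individuals, key="Individual_ID"):
    """Attach individual-level annotation to each sample row (left join on key)."""
    by_id = {row[key]: row for row in individuals}
    merged = []
    for sample in samples:
        extra = by_id.get(sample[key], {})  # no match: keep the sample row as is
        merged.append({**extra, **sample})  # sample columns win on name clashes
    return merged

# Hypothetical rows, for illustration only
samples = [
    {"SampleName": "S1", "Individual_ID": "P01", "TissueType": "tumor"},
    {"SampleName": "S2", "Individual_ID": "P01", "TissueType": "normal"},
    {"SampleName": "S3", "Individual_ID": "P02", "TissueType": "tumor"},
]
individuals = [
    {"Individual_ID": "P01", "Sex": "F", "Age": 61},
    {"Individual_ID": "P02", "Sex": "M", "Age": 54},
]

merged = merge_annotations(samples, individuals)
for row in merged:
    print(row)
```

Each sample keeps its own row, so an individual with several samples contributes its annotation to every one of them.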

ProteomeExpert-Data preprocessing

Overview

Data Preprocessing

Data Preprocessing is used to transform data in accordance with modeling experiment conditions configured in the project.

The protein matrix analyzed in Data Preprocessing should be uploaded in “Data Upload - select your protein file”. Under select matrix, choose uploadedProtMatrix for the following analysis. This module includes methods for log transformation, missing value substitution, normalization, batch effect adjustment (using ComBat), and replicate treatment (using values of replicates to fill in missing values).

  • If you need a logarithmic transformation of the data, select a Log Transform method. Log2 and Log10 are available. Choose None to skip this step.
  • We provide four options for Missing Value Substitution: “1”, “0”, “10% of minimum” or “minimum”. Choose None to skip this step.
  • Three methods, i.e. “Quantile”, “Z-score” and “Max-Min”, can be used for data Normalization.

In addition, Technical Replicates and Biological Replicates can be combined by their Mean or Median value. To remove unwanted batch effects, select the batch column name under Remove Batch Effects.
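
As a rough illustration of what these options do to a protein matrix, here is a minimal sketch of a log2 transform, missing value substitution and Z-score normalization. It rests on several assumptions that are not stated in the documentation: missing values are represented as None, the steps run in this order, the "minimum" is the smallest observed value in the whole matrix, and Z-scores are computed per sample (column). All function names are hypothetical.

```python
import math
from statistics import mean, pstdev

def log2_transform(matrix):
    """Log2-transform positive values; missing values (None) stay missing."""
    return [[math.log2(v) if v is not None and v > 0 else None for v in row]
            for row in matrix]

def substitute_missing(matrix, method="minimum"):
    """Fill missing values with one of the four options named in the text."""
    lowest = min(v for row in matrix for v in row if v is not None)
    fill = {"0": 0.0, "1": 1.0,
            "minimum": lowest, "10% of minimum": 0.1 * lowest}[method]
    return [[fill if v is None else v for v in row] for row in matrix]

def zscore_columns(matrix):
    """Z-score each sample (column); constant columns are mapped to 0."""
    scaled_columns = []
    for col in zip(*matrix):
        mu, sd = mean(col), pstdev(col)
        scaled_columns.append([(v - mu) / sd if sd > 0 else 0.0 for v in col])
    return [list(row) for row in zip(*scaled_columns)]

# Rows are proteins, columns are samples (toy intensities)
raw = [[1024, 4096, None],
       [256, None, 1024],
       [64, 256, 4096]]

logged = log2_transform(raw)
filled = substitute_missing(logged, method="minimum")
normalized = zscore_columns(filled)
print(filled)
```

Note that the order matters: substituting "0" before a log transform would produce undefined values, which is one reason the sketch transforms first.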

Tutorial

  1. Download the test data
    For this demo, we use a protein matrix dataset comprising 3724 identified proteins from 24 samples. Download the test_prot.txt file from “Online Help - Test data files used for data console - The test protein matrix contains 24 DIA runs”.
  2. Upload the test file to “Data Upload - select your protein file”.
  3. Go to Data Preprocessing and select “uploadedProtMatrix”, set parameters e.g.:
    • Log Transform: Log2
    • Missing Value Substitution: 0
    • Normalization: Quantile
    • Batch effects correction: None
    • Technical Replicates: None
    • Biological Replicates: None
  4. Click on Submit and wait for the result, which is shown on the right side.
  5. Click on Download to get the PreProcessed.txt file.

image.png

ProteomeExpert-Quality control

Overview

Quality control (QC)

QC shows the quality of your protein matrix, measured by the missing value ratio and reproducibility (Pearson correlation). The protein matrix must be uploaded in 'Data Upload - select your protein file' before starting this step. Under Select matrix, choose uploadedProMatrix as the matrix to be analyzed, and under Select your interesting column name choose the attribute of interest; proteins are then counted in tiers according to the non-missing ratio of each protein. Then choose MissingValueExplore and Reproducibility as the modules you want to run. The MissValueExplore module is designed to explore missing data distributions, focusing on numeric missing data. Its report shows the missing data distributions in different tissue/disease types as well as in each row (proteins) and each column (samples). The Reproducibility module is designed to explore correlations between samples using the Pearson correlation.
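
The two QC measures can be sketched in a few lines. This is an illustrative example only: missing values are assumed to be represented as None, the Pearson correlation is computed over the proteins observed in both samples (an assumption about how missing values are handled), and the function names are hypothetical.

```python
import math

def missing_ratio(values):
    """Fraction of missing (None) entries in one row or column."""
    return sum(v is None for v in values) / len(values)

def pearson(x, y):
    """Pearson correlation over positions observed in both samples."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    var_x = sum((a - mean_x) ** 2 for a, _ in pairs)
    var_y = sum((b - mean_y) ** 2 for _, b in pairs)
    return cov / math.sqrt(var_x * var_y)

# Two toy samples (columns) measured over four proteins
sample_a = [1.0, 2.0, 3.0, None]
sample_b = [2.0, 4.0, 6.0, 8.0]

ratio_a = missing_ratio(sample_a)
r_ab = pearson(sample_a, sample_b)
print(ratio_a, r_ab)
```

Technical replicates of the same sample would be expected to show a correlation close to 1, while a high missing ratio in one sample or protein is a signal to inspect it before downstream analysis.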

Tutorial

  1. Select uploadedProMatrix as the matrix to be analyzed.
  2. Select MissValueExplore and Reproducibility as the modules you want to run. Click on Submit; the results are shown on the right side of the page.

image.png

image.png

The screenshot of reproducibility

image.png

ProteomeExpert-Statistics

Overview

Here we provide some statistical methods for analyzing identified proteins:

  • t-test

A t-test is any statistical hypothesis test in which the test statistic follows a Student's t-distribution under the null hypothesis.

  • VolcanoPlot

Volcano plot is a type of scatterplot that is used to quickly identify changes in large datasets composed of replicate data. It plots significance versus fold-change on the y- and x-axes, respectively.

  • ViolinPlot

Violin plot is a method of plotting numeric data. It is a box plot with a rotated kernel density plot on each side. The violin plot is similar to box plots, except that it also shows the probability density of the data at different values (in the simplest case this could be a histogram).

  • RadarMap

Radar chart is a graphical method of displaying multivariate data in the form of a two-dimensional chart of three or more quantitative variables represented on axes starting from the same point. The relative position and angle of the axes is typically uninformative.

Parameters

For t-test, we provide the following parameters to be set:

  1. Type:

    • Two sided t-test, H0: mu=0, Ha: mu!=0
    • One sided t-test:
      • Less (H0: mu=0, Ha: mu<0)
      • Greater (H0: mu=0, Ha: mu>0)
    • Click on Paired samples if the two groups consist of matched pairs of observations (for example, measurements from the same individuals), so that the test is performed on the within-pair differences.
    • Click on Equal variance if the two samples are assumed to have equal variance.

    Critical: In a single-group design, a standard value or population mean must be given together with a set of quantitative observations. In a paired design, the differences within each pair must follow a normal distribution. In a two-group design, individuals must be independent of each other, and both groups must be drawn from normally distributed populations with homogeneous variances. These preconditions are needed because the t statistic follows the t-distribution only when they hold, and the t-test takes the t-distribution as its theoretical basis.

  2. Confidence level: usually 95%

  3. Adjust p-value method:

    • none
    • bonferroni
    • hochberg
    • hommel
    • holm
    • BH
    • BY
    • fdr
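
The method names above appear to mirror those accepted by R's p.adjust function, where "fdr" is an alias for "BH" (Benjamini-Hochberg). As an illustration of what such an adjustment does, here is a minimal sketch of the Benjamini-Hochberg procedure; the function name adjust_bh is hypothetical and this is not the server's own code.

```python
def adjust_bh(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

p_raw = [0.01, 0.04, 0.03, 0.005]
p_adj = adjust_bh(p_raw)
print(p_adj)
```

With thousands of proteins tested at once, choosing "none" leaves many false positives at a 0.05 threshold, whereas "bonferroni" is the most conservative option; "BH" is a common middle ground that controls the false discovery rate.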

For VolcanoPlot, we provide the following parameters to be set:

  1. Click on Already Log2 transformed protein matrix if your protein data have already been log2 transformed at Data Preprocessing.
  2. Adjust p-value threshold.
  3. Fold change threshold.
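
The two thresholds work together: a protein is highlighted only if it passes both. The sketch below shows one plausible way to apply them, assuming the fold change is compared on the log2 scale (a fold change threshold of 2 corresponds to an absolute log2 fold change of 1) and that significance is plotted as -log10 of the adjusted p-value, as is conventional for volcano plots. The function names and the exact inequality conventions are hypothetical.

```python
import math

def classify_protein(log2_fold_change, adjusted_p,
                     fc_threshold=2.0, p_threshold=0.05):
    """Label a protein as 'up', 'down' or 'not significant'."""
    log2_cutoff = math.log2(fc_threshold)
    if adjusted_p >= p_threshold or abs(log2_fold_change) < log2_cutoff:
        return "not significant"
    return "up" if log2_fold_change > 0 else "down"

def volcano_point(log2_fold_change, adjusted_p):
    """Coordinates of one protein: x = log2 fold change, y = -log10(p)."""
    return (log2_fold_change, -math.log10(adjusted_p))

label_up = classify_protein(1.5, 0.01)
label_down = classify_protein(-1.2, 0.03)
label_ns = classify_protein(0.5, 0.001)
print(label_up, label_down, label_ns, volcano_point(1.0, 0.01))
```

The third protein is highly significant but changes by less than twofold, so it is not highlighted; this is the main reason to combine both criteria.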

Tutorial

  1. The data analyzed in this part should have been uploaded in Data Console part.
  2. Select the column name of interest, which is usually the tissue/disease type. Here we choose TissueType.
  3. Click on the four statistical methods: t test, Volcano Plot, Violin Plot and Radar Map.
  4. Click on Submit.
  5. Click on t-test and set the t-test parameters on the right side as follows:
    • two.sided
    • Equal variance
    • Confidence level = 0.95
    • Adjust p-value method: none
    • Select the first group: Prostate cancer
    • Select the second group: Benign prostate hyperplasia tissue
  6. Click on Submit.
  7. Click on VolcanoPlot and set the VolcanoPlot parameters on the right side as follows:
    • Already Log2 transformed protein matrix
    • Adjust p-value threshold: 0.05
    • Fold change threshold: 2 image.png
  8. Click on Submit. image.png
  9. Click on ViolinPlot to see the ViolinPlot figure. image.png
  10. Click on RadarMap to see the RadarMap figure.

image.png

ProteomeExpert-Machine learning

Overview

The machine learning module includes feature selection, clustering and classification. In the feature selection module, users can not only apply filter methods to remove features with near-zero variance or high correlation, but also apply additional feature selection methods: LASSO (Tibshirani, 1996), genetic algorithm, and random forest. Because classifying disease into subtypes is of great interest for diagnosis and prognosis in clinical applications, users can perform various machine learning analyses: PCA, t-SNE, and UMAP for unsupervised analysis, and decision tree, random forest, and XGBoost for supervised learning.

Feature selection

Feature selection is the process of choosing a reduced number of explanatory variables to describe a response variable. It is even more important for high-dimensional datasets, such as genomics and proteomics data. The main goal of proteomics biomarker discovery is to identify the most important proteins associated with the disease. Here we use three well-known feature selection methods: LASSO, genetic algorithm, and random forest. We describe how each method works and how to use it for biomarker discovery.

LASSO is short for Least Absolute Shrinkage and Selection Operator. We use the glmnet package in R: the cv.glmnet function chooses the most appropriate tuning parameter λ, which controls the strength of the penalty, with α = 1 for LASSO regularization. Once λ is set, the glmnet function performs the feature selection according to this λ.

Genetic Algorithm (GA) is a stochastic optimization method inspired by Charles Darwin's idea of natural selection. Here we use the GA to select the right number of proteins as biomarkers. We define the fitness function as the ROC value divided by the number of features for two-class classification, and as the accuracy divided by the number of features for multi-class classification. Selection, crossover and mutation are done automatically by the ga function in the GA package.

Random forest is very popular in bioinformatics and has achieved excellent results. The functions sbf and rfe (provided by the caret package, here used with random forest models) are applied to perform feature selection with cross-validation.
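
The fitness function used for the genetic algorithm rewards classification performance while penalizing large feature sets. A minimal sketch of that idea follows; the function name ga_fitness, the bitmask representation and the toy scoring function are hypothetical, and the real implementation relies on the ga function of the R package GA.

```python
def ga_fitness(selected_mask, score_fn):
    """Performance score divided by the number of selected features.

    selected_mask: list of 0/1 flags, one per protein.
    score_fn: callable returning a performance score (e.g. an ROC-based
              value or accuracy) for the selected feature subset.
    """
    n_selected = sum(selected_mask)
    if n_selected == 0:
        return 0.0  # an empty feature set cannot classify anything
    return score_fn(selected_mask) / n_selected

# Toy scoring function: pretend any subset containing protein 0 scores 0.9
def toy_score(mask):
    return 0.9 if mask[0] else 0.6

fitness_small = ga_fitness([1, 0, 1, 0], toy_score)
fitness_large = ga_fitness([1, 1, 1, 1], toy_score)
print(fitness_small, fitness_large)
```

In this toy case both subsets score 0.9, but the smaller one has twice the fitness, so the search is pushed toward compact biomarker panels.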

Upload the test files in “Data Upload” then set parameters e.g.:
image.png

Unsupervised

The following algorithms (implementation mainly based on the R packages pheatmap, stats, tsne and umap) are provided for selection:

  • Heatmap
  • PCA
  • t-SNE
  • UMAP

Upload the test files in “Data Upload” then set parameters e.g.:
image.png

image.png
The result of PCA (demo).

Supervised

It includes the following algorithms:

  • Decision Tree: implementation based on R packages rpart and rpart.plot
  • Random forest: implementation based on R package randomForest
  • XGBoost: implementation based on R package xgboost

Upload the test files in “Data Upload” then set parameters e.g.:
image.png
The result of using decision tree (demo).

ProteomeExpert-Other tools

Overview

This module provides some optional tools which may be used in quantitative proteomics data analysis. Currently only the peptide to protein function is available.

Procedure

The peptide to protein inference integrates data normalization, batch effect reduction, missing value imputation and the transformation of the peptide matrix into a protein matrix. We provide two alternative methods: the mean of the top 3 precursor intensities and the linear regression of the top 3 precursor intensities.

Steps of the linear regression of top 3 precursor intensity

1. Prepare the peptide matrix, where each row represents a peptide and each column represents a sample.
2. Log2-transform the precursor intensities.
3. Perform quantile normalization across all samples via the normalize.quantiles function from the Bioconductor R library preprocessCore.
4. Technical imputation: use technical replicates to substitute each other's NAs.
5. Batch correction based on user-defined batches using ComBat.
6. Calculate the mean expression of each row to rank the peptide precursor intensities.
7. Order the peptide precursors within each protein group, first by number of NAs (intensity equal to 0, ascending) and then by mean expression (descending).
8. Keep no more than the top 3 precursors (ordered by number of NAs ascending, then by mean expression descending) for each protein group.
9. To impute missing protein-level values for samples in which multiple peptides were quantified, we build a linear model on the matrix from step 8, always using the top-ranked precursor of each protein group as the dependent variable (y) and the other precursors with values greater than 0 as independent variables (X), in decreasing order of priority. Missing values of y are then imputed using a linear combination of X. Only regression coefficients with P value <= 0.05 and R2 > 0.36 are accepted, rounded to two decimal places. The intensity value of the top 1 peptide precursor is then used to represent the protein intensity. We apply the lm function in R to build the model and perform the above imputation.
10. Keep the proteins corresponding to the top 1 precursors together with their intensities.

Steps of the mean of top 3 precursor intensity

Steps 1 to 8 are the same as above. Then, calculate the mean intensity of the top 3 precursors of each protein as its protein intensity. Finally, keep the proteins along with their intensities.
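
The ranking in steps 6 to 8 and the top 3 mean can be sketched for a single protein group as follows. This is an illustration under stated assumptions, not the server's code: missing values are encoded as 0 (as in step 7), the row mean used for ranking includes those zeros, and the per-sample protein intensity averages only the observed (non-zero) values among the kept precursors. The function name is hypothetical.

```python
def top3_mean_protein_intensity(precursors, top_n=3):
    """Protein intensity per sample from the top-ranked precursors.

    precursors: list of rows, one per precursor of a single protein group;
                each row holds one intensity per sample, 0 meaning missing.
    """
    def n_missing(row):
        return sum(v == 0 for v in row)

    def row_mean(row):
        return sum(row) / len(row)

    # Fewest missing values first, then highest mean expression
    ranked = sorted(precursors, key=lambda row: (n_missing(row), -row_mean(row)))
    kept = ranked[:top_n]

    protein = []
    for j in range(len(kept[0])):
        observed = [row[j] for row in kept if row[j] != 0]
        protein.append(sum(observed) / len(observed) if observed else 0)
    return protein

# Four precursors of one protein group measured in three samples (log2 scale)
precursors = [
    [10.0, 12.0, 0],     # one missing value
    [14.0, 16.0, 18.0],  # complete, highest mean
    [11.0, 13.0, 15.0],  # complete
    [20.0, 0, 0],        # two missing values, ranked last and dropped
]

protein_intensity = top3_mean_protein_intensity(precursors)
print(protein_intensity)
```

Ranking by completeness before abundance means a very intense but rarely detected precursor, like the last row here, does not dominate the protein estimate.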

Tutorial

  1. Download the peptides.txt, technical_replicas.txt and batch_file.txt files from “Online Help - Test data files used for peptide to protein inference - Get”
  2. Select your peptide file: click on the Browse.. to upload the peptides.txt file, choose Tab as separator and with Header checkbox selected.
  3. Select your technical file: click on the Browse.. to upload the technical_replicas.txt, choose Tab as separator.
  4. Select your individual file: click on Browse.. to upload the batch_file.txt file, choose Tab as separator.
  5. Click the Submit button. image.png

Test data files used for peptide to protein inference


The test peptide matrix contains 24 DIA runs.
Get
The test technical replicates file
Get
The test batch name file
Get

Test data files used for data upload


The test protein matrix contains 24 DIA runs. Missing value imputation (using data preprocessing is a simple way) is required in order to use all functions except quality control.
Get
The test sample information file contains 24 DIA samples.
Get
The test individual file contains 21 individuals (Only for three files input).
Get

Test data files used for batch design


Get

Please go to https://github.com/lifeinfo/proteomeExpert/issues