Welcome to ProteomeExpert, a user-friendly web server built with R/Shiny for quantitative proteomics data analysis. To analyze large-scale proteomic data sets more efficiently and conveniently, we present ProteomeExpert, a web server-based software tool implemented in Docker that offers various analysis tools for experimental design, data mining, interpretation, and visualization of large proteomic data sets.

image.png

 

Description:

Power analysis is a statistical method for determining the sample size required to detect a preset effect under a given statistical test, such as the Chi-square test or t-test. In particular, pay attention to the difference between the calculated sample size and the sample size actually realized in an experiment. Empirically, when the expression of a protein is low, say less than 17 after log2 transformation, missing data reduce the realized sample size, and the statistical power is compromised as well.

Reference:

Lynch Michael, Walsh Bruce. 1998. Genetics and Analysis of Quantitative Traits.

Results:

                

Description:

Data preview

Protein matrix preview

Sample annotation display

Description:

Annotate sample columns:


Annotate individual columns:




Result

Description:

Data Preprocessing is used to transform data in accordance with modeling experiment conditions configured in the project.

Download
Description:

This module is designed to explore missing data distributions, focusing on numeric missing data.



                
                


Description:
The correlation between two variables reflects the degree to which the variables are related.

Description:

A t-test is any statistical hypothesis test in which the test statistic follows a Student's t-distribution under the null hypothesis.


Set parameters:
Description:

In statistics, a volcano plot is a type of scatter plot used to quickly identify changes in large datasets composed of replicate data. It plots significance versus fold change on the y- and x-axes, respectively.
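
To make the axes concrete, here is a minimal Python sketch of how volcano plot coordinates could be derived from per-protein log2 fold changes and p-values. The function name, the labels and the cut-offs (`fc_cutoff`, `p_cutoff`) are assumptions chosen for illustration and are not taken from ProteomeExpert.

```python
import math


def volcano_points(log2_fold_changes, p_values, fc_cutoff=1.0, p_cutoff=0.05):
    """Return (x, y, label) per protein: x = log2 fold change, y = -log10(p).

    The cut-offs are illustrative assumptions, not ProteomeExpert defaults.
    """
    points = []
    for lfc, p in zip(log2_fold_changes, p_values):
        y = -math.log10(p)  # smaller p-values end up higher on the plot
        if p < p_cutoff and lfc >= fc_cutoff:
            label = "up"
        elif p < p_cutoff and lfc <= -fc_cutoff:
            label = "down"
        else:
            label = "ns"  # not significant or change too small
        points.append((lfc, y, label))
    return points


if __name__ == "__main__":
    for point in volcano_points([2.0, -1.5, 0.2], [0.001, 0.01, 0.5]):
        print(point)
```

Proteins in the upper left and upper right corners of such a plot are those with both a large fold change and a small p-value.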


Set parameters for volcano plot

Set parameters for t test:
Description:

A violin plot is a method of plotting numeric data. It is a box plot with a rotated kernel density plot on each side. Violin plots are similar to box plots, except that they also show the probability density of the data at different values (in the simplest case this could be a histogram).


Description:

A radar chart is a graphical method of displaying multivariate data in the form of a two-dimensional chart of three or more quantitative variables represented on axes starting from the same point. The relative position and angle of the axes is typically uninformative.


Feature selection methods fall into three categories: filter, wrapper and embedded.

Summary:

                
Download
Description:

A heat map (or heatmap) is a graphical representation of data where the individual values contained in a matrix are represented as colors.


Description:

Principal component analysis (PCA) is an exploratory analysis tool that emphasizes variation and visualizes possible patterns underlying a dataset. It uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. Depending on the context, PCA is also referred to as eigenvalue decomposition, and the eigenvalues (a vector) and eigenvectors (a matrix) are often used to represent the data.

Mark 1: In a proteomic data matrix, missing data (often more missing values for control samples) play a role in determining the outcome of PCA.

Mark 2: If blank controls (AQUA) are available in the experiment, the coordinates of blank controls can tell the quality of the data.
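
As a small illustration of the eigenvalue view of PCA, the Python sketch below estimates the first principal component of a complete (no missing values) data matrix by power iteration on the covariance matrix. It is a teaching sketch under stated assumptions, not the routine ProteomeExpert uses; the function name and the fixed starting vector are hypothetical choices, and missing values would have to be handled beforehand.

```python
import math


def first_principal_component(rows, iterations=500):
    """Estimate the first principal component by power iteration.

    rows: list of observations (samples), each a list of feature values.
    Returns (direction, explained_variance_ratio, scores).
    """
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    # Sample covariance matrix of the features
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Assumed starting vector; it must not be orthogonal to the true component
    vec = [float(j + 1) for j in range(d)]
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0:
            raise ValueError("data have no variance along the start vector")
        vec = [x / norm for x in nxt]
    # Rayleigh quotient gives the eigenvalue (variance along the component)
    eigenvalue = sum(vec[i] * cov[i][j] * vec[j]
                     for i in range(d) for j in range(d))
    explained = eigenvalue / sum(cov[i][i] for i in range(d))
    scores = [sum(r[j] * vec[j] for j in range(d)) for r in centered]
    return vec, explained, scores


if __name__ == "__main__":
    direction, ratio, scores = first_principal_component(
        [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]])
    print(direction, ratio, scores)
```

The scores are the coordinates of each sample along the component, which is what a PCA scatter plot displays for the first axis.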


Description:

T-distributed Stochastic Neighbor Embedding (t-SNE) is a nonlinear dimensionality reduction technique well-suited for embedding high-dimensional data for visualization in a low-dimensional space of two or three dimensions.

Reference:

L. van der Maaten, G. Hinton, Visualizing Data using t-SNE. Journal of Machine Learning Research, 2008.


Description:

UMAP (Uniform Manifold Approximation and Projection) is a novel manifold learning technique for dimension reduction. The UMAP algorithm is competitive with t-SNE for visualization quality, and arguably preserves more of the global structure with superior run time performance.

Reference:

L. McInnes, J. Healy, UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. ArXiv e-prints 1802.03426, 2018.


Summary

(*Note: If you have a large matrix, the system may be slow; please be patient.)

Results


                  

                  

Summary

Input:


                
String


Database

  • UniProt: The mission of UniProt is to provide the scientific community with a comprehensive, high-quality and freely accessible resource of protein sequence and functional information.

  • String-db: Protein-protein interaction networks.

  • KEGG: KEGG is a database resource for understanding high-level functions and utilities of the biological system, such as the cell, the organism and the ecosystem, from molecular-level information, especially large-scale molecular datasets generated by genome sequencing and other high-throughput experimental technologies.

  • GO: The Gene Ontology (GO) knowledgebase is the world's largest source of information on the functions of genes.

  • Reactome: Reactome is a free, open-source, curated and peer-reviewed pathway database.


Description:

Peptide2Protein provides a protein inference function; details are given on the help page.

Summary

Download all protein matrix

Download

ProteomeExpert-Overview

The rapid progress of high-throughput mass spectrometry (MS)-based proteomics, such as data-independent acquisition (DIA), and its penetration into clinical studies have generated an increasing number of proteomic data sets containing hundreds to thousands of samples. To analyze these large proteomic data sets more efficiently and conveniently, we present ProteomeExpert, a web server-based software tool implemented in Docker that offers various analysis tools for experimental design, data mining, interpretation, and visualization of large proteomic data sets. ProteomeExpert can be deployed on any operating system with Docker installed. Availability: The Docker image of ProteomeExpert is freely available from https://hub.docker.com/r/lifeinfo/proteomeexpert. The source code of ProteomeExpert is also openly accessible at http://www.github.com/lifeinfo/ProteomeExpert/.


image.png

 

Modules include:

  • Experimental Design: Power Analysis & Batch Design
  • Data Upload: all modules rely on the data uploaded at this step, except Experimental Design and Other tools
  • Data Preprocessing
  • Quality Control: Missing Value & Pearson Correlation
  • Statistics
  • Machine Learning: Feature selection, Unsupervised learning and Supervised learning
  • Annotations
  • Other tools: Peptide to protein

ProteomeExpert-Experimental design

Power analysis

Power analysis is a statistical strategy for determining the sample size required to detect a preset effect under a given statistical test, such as the Chi-square test or t-test. In particular, pay attention to the difference between the calculated sample size and the actual sample size in an experiment. Empirically, when the expression of a protein is low, say less than 17 after log2 transformation, missing data reduce the realized sample size, and the statistical power is compromised as well.

Power analysis, set parameters:
  • Number of Proteins (default=5000) is the estimated number of proteins identified in your dataset.
  • Mean abundance (default=13) is the average intensity (log2 transformed) of proteins in the experimental group.
  • Mean abundance 0 (default=13.5) is the average intensity (log2 transformed) of proteins in the control group.
  • Alpha (default=0.05) is the significance level of the test, i.e. the p-value threshold below which the null hypothesis is rejected.
  • Beta (default=0.2) is the probability of accepting the null hypothesis even though it is false (i.e. a false negative) when the real difference equals the minimum effect size. Beta = 1 - power.
  • Standard deviation (default=0.75) has to be estimated for the measured variable; it is usually assumed to equal the standard deviation of the control group.
Click on Submit, and the estimated sample size and figure will be shown on the right side of the browser window.
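
As a rough illustration of how these parameters combine, the Python sketch below uses the common normal-approximation formula for a two-sided, two-sample comparison, n = 2 (z_{1-alpha/2} + z_{1-beta})^2 sd^2 / delta^2 per group, where delta is the difference between the two mean abundances. This formula is an assumption made for illustration and is not necessarily the calculation ProteomeExpert performs; the function name is hypothetical, and the Number of Proteins parameter (for example, any multiple-testing adjustment) is not modeled.

```python
import math
from statistics import NormalDist


def sample_size_per_group(mean_case, mean_control, sd, alpha=0.05, beta=0.2):
    """Samples needed per group under a normal approximation (assumed formula)."""
    delta = abs(mean_case - mean_control)  # effect size on the log2 scale
    if delta == 0:
        raise ValueError("the two group means must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(1 - beta)        # power = 1 - beta
    n = 2 * ((z_alpha + z_beta) * sd / delta) ** 2
    return math.ceil(n)


if __name__ == "__main__":
    # Default values from the parameter list above
    print(sample_size_per_group(13, 13.5, 0.75, alpha=0.05, beta=0.2))
```

Under this approximation, halving the difference between the two means roughly quadruples the required sample size, which is why low-abundance proteins with missing values erode power so quickly.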

Batch design

The main purpose of batch design is to allocate samples into balanced groups to minimize technical bias, such as instrumental variation, in large cohort studies.
Upload your sample data matrix in .txt or .csv format with the Browse button. Choose the separator for the file according to its format: Comma for .csv files; Semicolon, Comma or Tab for .txt files.
  • Select columns for balanced batch design: choose the column names (attributes) of your sample dataset that need to be balanced.
  • Weights for columns: the relative weight of each selected attribute when the attributes are balanced together. For example, submitting 1,1,2 for the selected columns A, B and C gives column A a weight of 25%, column B 25% and column C 50%. Each weight is normalized as w_i / (w_1 + ... + w_n), where n is the number of attributes.
  • Number of samples in each batch: how many samples you expect in each balanced group.
  • Select numeric columns for balanced batch design: attributes of numeric type, such as age or tumor-node-metastasis (TNM) stage, must be marked here.
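
The weight normalization can be checked with a few lines of Python. This is only a worked example of the arithmetic, with a hypothetical function name.

```python
def normalize_weights(weights):
    """Scale raw column weights so that they sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return [w / total for w in weights]


if __name__ == "__main__":
    # Columns A, B, C weighted 1,1,2 as in the example above
    print(normalize_weights([1, 1, 2]))
```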

  1. Download the batch design test data

    For this demo, we will use a Delayed post-hypoxic leukoencephalopathy dataset (not published) comprising information on 168 samples, with the attributes “Sample ID”, “Type”, “Sex”, “Age”, “TNM” and “Fuhrman”. Download the batchdesign.csv file from “Online Help - Test data files used for batch design - Get”.

  2. Click on Browse.. to upload the batch_design.csv file, choose Comma as separator.

  3. Select “Type”, “Sex”, “Age”, “TNM” and “Fuhrman” for balanced batch design.

  4. Input weights “1,1,1,1,1” for “Type”, “Sex”, “Age”, “TNM” and “Fuhrman”.

  5. Input “15” as number of samples in each batch.

  6. Select “Age” and “TNM” as numeric columns for balanced batch design.

  7. Click on Submit and wait for the result, which is shown on the right side. The batchId column shows the batch each sample has been assigned to.

  8. Click on Download to get the BatchDesignResult.txt file.
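
To give a feel for what a balanced allocation means, the Python sketch below sorts samples by the selected categorical attributes and deals them out to batches in round-robin order, so each batch receives a near-equal share of every group. This is a deliberately simple stand-in, not ProteomeExpert's algorithm: it ignores column weights and numeric columns, and the function names and the "Sample ID" key are assumptions for illustration.

```python
import math
from collections import Counter


def assign_batches(samples, balance_keys, batch_size):
    """Map each sample ID to a batch number (1-based) by stratified round-robin.

    samples: list of dicts, each with a "Sample ID" key plus attribute columns.
    """
    n_batches = math.ceil(len(samples) / batch_size)
    # Sorting groups samples with identical attribute values next to each other
    ordered = sorted(samples, key=lambda s: tuple(str(s[k]) for k in balance_keys))
    # Dealing in turn spreads every group evenly over the batches
    return {s["Sample ID"]: i % n_batches + 1 for i, s in enumerate(ordered)}


def batch_sizes(assignment):
    """Count how many samples ended up in each batch."""
    return Counter(assignment.values())


if __name__ == "__main__":
    # Synthetic stand-in for the 168-sample demo file (values are made up)
    demo = [{"Sample ID": f"S{i:03d}",
             "Type": "Tumor" if i % 2 == 0 else "Normal",
             "Sex": "M" if i % 3 == 0 else "F"} for i in range(168)]
    print(batch_sizes(assign_batches(demo, ["Type", "Sex"], batch_size=15)))
```

With 168 samples and at most 15 samples per batch, this scheme needs 12 batches, so each batch ends up with 14 samples.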

ProteomeExpert-Data upload

Overview

Data Upload

Data Upload is the core data input interface for uploading your own data files. The Data Upload module lets you upload your protein matrix and sample annotation files (the experiment run sample file and, if available, the individual file) as the input data for most modules. Moreover, it interactively merges the sample and individual information into one file, which is required by some modules such as statistics, data mining and data preprocessing. There are two ways to upload data: the two files format and the three files format. Whether to choose the two or three files format depends on whether the sample file already contains all the annotation you need.

Two files format

Upload your files (protein, sample) in .txt or .csv format with the Browse button. Choose the separator for each file according to its format: Comma for .csv files; Semicolon, Comma or Tab for .txt files.

Three files format

Upload your files (protein, sample, individual) in .txt or .csv format with the Browse button. Choose the separator for each file according to its format: Comma for .csv files; Semicolon, Comma or Tab for .txt files. Select the sample id column for further analysis (the protein file should use the same sample ids as the sample file), and select the individual id/name column in both the sample file and the individual file as the reference for merging the data. This tool merges the files into a single template for further analysis.

Tutorial

Two files format

Upload your protein matrix file and sample information file as follows: image.png

Three files format

  1. Download the protein, sample and individual test data
    Download the test_prot.txt, test_sample.csv and test_individual.csv files from “Online Help - Test data files used for batch design - Get”. The test_individual.csv file contains information on 21 individuals, and the test_sample.csv file contains individual and sample information. All of the test data are from the Delayed post-hypoxic leukoencephalopathy (DPHL) dataset.
  2. Select your protein file: click on the Browse.. to upload the test_prot.txt file, choose Tab as separator.
  3. Select your sample file: click on the Browse.. to upload the test_sample.csv file, choose Comma as separator.
  4. Select your individual file: click on the Browse.. to upload the test_individual.csv file, choose Comma as separator.
  5. Annotate sample columns: select SampleName as sample id and Individual_ID as individual id/name, then click on Submit.
  6. Annotate individual columns: select Individual_ID as individual id/name, then click on Submit.
  7. Click on Merge; the result is shown at the bottom of the page. After merging, all uploaded data are saved on the server for further analysis.

    image.png
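
The merge in the three files format is conceptually a join of the sample file and the individual file on the individual id. A minimal Python sketch of that idea follows; the column names SampleName and Individual_ID come from the tutorial above, while the other columns, the inline data and the function name are made up for illustration and do not reflect the real test files or the server's implementation.

```python
import csv
import io

# Made-up miniature annotation files (assumed layout)
SAMPLE_CSV = """SampleName,Individual_ID,Batch
run01,P01,1
run02,P01,1
run03,P02,2
"""

INDIVIDUAL_CSV = """Individual_ID,Sex,Age
P01,F,54
P02,M,61
"""


def merge_annotations(sample_file, individual_file, key="Individual_ID"):
    """Attach individual-level columns to every sample row via the shared key."""
    individuals = {row[key]: row for row in csv.DictReader(individual_file)}
    merged = []
    for sample in csv.DictReader(sample_file):
        extra = individuals.get(sample[key], {})
        # Keep the sample columns and add the individual columns except the key
        merged.append({**sample, **{k: v for k, v in extra.items() if k != key}})
    return merged


if __name__ == "__main__":
    rows = merge_annotations(io.StringIO(SAMPLE_CSV), io.StringIO(INDIVIDUAL_CSV))
    for row in rows:
        print(row)
```

Several sample rows may point to the same individual, which is why the sample file, not the individual file, drives the merge.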

ProteomeExpert-Data preprocessing

Overview

Data Preprocessing

Data Preprocessing is used to transform data in accordance with modeling experiment conditions configured in the project.

The protein matrix analyzed in Data Preprocessing should first be uploaded in “Data Upload - select your protein file”. Under select matrix, choose uploadedProtMatrix for the following analysis. This module includes methods for log transformation, missing value substitution, normalization, batch effect adjustment (using ComBat), and replicate treatment (using the values of replicates to fill in missing values).

  • To apply a logarithmic transformation to the data, choose a method under Log Transform. Log2 and Log10 are available. Choose None to skip this step.
  • Four options are provided for Missing Value Substitution: “1”, “0”, “10% of minimum” or “minimum”. Choose None to skip this step.
  • Three methods, i.e. “Quantile”, “Z-score” and “Max-Min”, can be used for data Normalization.

In addition, Technical Replicates and Biological Replicates can be combined by their Mean or Median value. To correct for unwanted batch effects, select the batch column name under Remove Batch Effects.
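
The first three preprocessing steps can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions rather than ProteomeExpert's implementation: the function names are hypothetical, missing values are represented as None, "minimum" is taken to be the smallest observed value in the whole matrix, the order of the steps is assumed, and ties in quantile normalization are broken by row order.

```python
import math


def log2_transform(matrix):
    """Apply log2 to every observed value; None marks a missing value."""
    return [[None if v is None else math.log2(v) for v in row] for row in matrix]


def fill_missing(matrix, method="minimum"):
    """Replace None using one of the four substitution options."""
    lowest = min(v for row in matrix for v in row if v is not None)
    fill = {"0": 0.0, "1": 1.0,
            "minimum": lowest, "10% of minimum": 0.1 * lowest}[method]
    return [[fill if v is None else v for v in row] for row in matrix]


def quantile_normalize(matrix):
    """Give every sample (column) the same value distribution.

    Rows are proteins, columns are samples; the matrix must be complete.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    columns = [[matrix[r][c] for r in range(n_rows)] for c in range(n_cols)]
    sorted_cols = [sorted(col) for col in columns]
    # Mean of the values sharing the same rank across all samples
    rank_means = [sum(col[r] for col in sorted_cols) / n_cols
                  for r in range(n_rows)]
    result = [[0.0] * n_cols for _ in range(n_rows)]
    for c, col in enumerate(columns):
        order = sorted(range(n_rows), key=lambda r: col[r])
        for rank, r in enumerate(order):
            result[r][c] = rank_means[rank]
    return result


if __name__ == "__main__":
    raw = [[8.0, 16.0], [4.0, None], [2.0, 4.0]]  # 3 proteins x 2 samples
    print(quantile_normalize(fill_missing(log2_transform(raw), method="0")))
```

After quantile normalization every sample column contains the same set of values, only in a different protein order, which removes sample-wide intensity shifts.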

Tutorial

  1. Download the test data
    For this demo, we will use a protein matrix dataset comprising 3724 identified proteins from 24 samples. Download the test_prot.txt file from “Online Help - Test data files used for data console - The test protein matrix contains 24 DIA runs”.
  2. Upload the test file to “Data Upload - select your protein file”.
  3. Go to Data Preprocessing and select “uploadedProtMatrix”, set parameters e.g.:
    • Log Transform: Log2
    • Missing Value Substitution: 0
    • Normalization: Quantile
    • Batch effects correction: None
    • Technical Replicates: None
    • Biological Replicates: None
  4. Click on Submit and wait for the result, which is shown on the right side.
  5. Click on Download to get the PreProcessed.txt file.

image.png

ProteomeExpert-Quality control

Overview

Quality control (QC)

'QC' shows the quality of your protein matrix, which is measured by the missing value ratio and reproducibility (Pearson correlation). The protein matrix must be uploaded in the step 'Data Upload - select your protein file' before starting this step. Under Select matrix, select uploadedProMatrix as the matrix to be analyzed, and under Select your interesting column name, choose the attribute of interest, which is used to count proteins in tiers according to the non-missing ratio of each protein. Then choose MissingValueExplore and Reproducibility as the modules you want to run. The MissValueExplore module is designed to explore missing data distributions, focusing on numeric missing data. The report shows missing data distributions in different tissue/disease types as well as the distributions in each row (protein) and each column (sample). The Reproducibility module is designed to explore correlations between samples using the Pearson correlation.
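
The two quality measures can be illustrated with a short Python sketch. It assumes missing values are stored as None and computes the Pearson correlation on the proteins observed in both samples; the function names are hypothetical and the handling of missing values inside ProteomeExpert may differ.

```python
import math


def missing_ratio(values):
    """Fraction of missing (None) entries in one row or column."""
    return sum(v is None for v in values) / len(values)


def pearson(x, y):
    """Pearson correlation using only positions observed in both vectors."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    if len(pairs) < 2:
        raise ValueError("need at least two complete pairs")
    n = len(pairs)
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    var_x = sum((a - mean_x) ** 2 for a, _ in pairs)
    var_y = sum((b - mean_y) ** 2 for _, b in pairs)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant vector has no defined correlation")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Rows are proteins, columns are samples (made-up log2 intensities)
    matrix = [[10.0, 11.0, None],
              [None, None, 9.0],
              [12.0, 13.0, 12.5],
              [14.0, 15.5, 14.0]]
    print([missing_ratio(row) for row in matrix])       # per protein
    columns = list(zip(*matrix))
    print([missing_ratio(col) for col in columns])      # per sample
    print(pearson(columns[0], columns[1]))              # sample 1 vs sample 2
```

A high correlation between replicate runs of the same sample indicates good reproducibility, while a high missing ratio in one sample often points to a problematic run.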

Tutorial

  1. Select uploadedProMatrix as the matrix to be analyzed.
  2. Select MissValueExplore and Reproducibility as the modules you want to run, then click on Submit. The results are shown on the right side of the page.

image.png