ProteomeExpert-Overview
Rapid progress in high-throughput mass spectrometry (MS)-based proteomics, such as data-independent acquisition (DIA), and its adoption in clinical studies have generated an increasing number of proteomic data sets containing hundreds to thousands of samples. To analyze these large proteomic data sets more efficiently and conveniently, we present ProteomeExpert, a web server-based software tool implemented in Docker that offers various analysis tools for experimental design, data mining, interpretation, and visualization of large proteomic data sets. ProteomeExpert can be deployed on any operating system with Docker installed.
Availability: The Docker image of ProteomeExpert is freely available from https://hub.docker.com/r/lifeinfo/proteomeexpert. The source code of ProteomeExpert is also openly accessible at http://www.github.com/lifeinfo/ProteomeExpert/.
Modules include:
- Experimental Design: Power Analysis & Batch Design
- Data Upload
All modules rely on data uploading at this step, except experimental design and other tools
- Data Preprocessing
- Quality Control: Missing Value & Pearson Correlation
- Statistics
- Machine Learning: Feature selection, Unsupervised learning and Supervised learning
- Annotations
- Other tools: Peptide to protein
ProteomeExpert-Experimental design
Power analysis
Power analysis is a statistical strategy for determining the sample size required to detect a preset effect with a given test statistic, such as a Chi-square test or t-test. In particular, pay attention to the difference between the calculated sample size and the actual sample size in an experiment. As observed empirically, when the expression of a protein is low, say less than 17 after log2 transformation, missing data reduce the number of usable samples below the planned sample size, and the statistical power is reduced accordingly.
Power analysis, set parameters:
Number of Proteins (default=5000) is the estimated number of proteins identified in your dataset.
Mean abundance (default=13) is the average intensity (log2 transformed) of proteins in the experimental group.
Mean abundance 0 (default=13.5) is the average intensity (log2 transformed) of proteins in the control group.
Alpha (default=0.05) is the significance level of the test (the p-value threshold).
Beta (default=0.2) is the probability of accepting the null hypothesis even though it is false (i.e. a false negative) when the real difference equals the minimum effect size. Beta = 1 - power.
Standard deviation (default=0.75) has to be estimated for the measured variables; it is usually assumed to equal the standard deviation of the control group.
Click on Submit, and the estimated sample size and figure will be shown on the right side of the browser window.
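To make the role of these parameters concrete, here is a minimal sketch of a per-group sample size calculation for a two-sided, two-sample comparison using the standard normal approximation. It is not necessarily the formula ProteomeExpert uses internally; the function name and the optional Bonferroni adjustment for the number of proteins are assumptions made for illustration.

```python
import math
from statistics import NormalDist


def sample_size_per_group(mean_case, mean_control, sd, alpha=0.05, beta=0.2):
    """Per-group n for a two-sided, two-sample comparison (normal approximation)."""
    delta = abs(mean_case - mean_control)  # effect size on the log2 scale
    if delta == 0:
        raise ValueError("the two group means must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(1 - beta)        # quantile for power = 1 - beta
    n = 2 * ((z_alpha + z_beta) * sd / delta) ** 2
    return math.ceil(n)


# Defaults from the parameter list above
n_default = sample_size_per_group(13, 13.5, 0.75)

# Hypothetical Bonferroni adjustment when 5000 proteins are tested (an assumption)
n_bonferroni = sample_size_per_group(13, 13.5, 0.75, alpha=0.05 / 5000)

print(n_default, n_bonferroni)
```

A stricter per-protein alpha raises the required sample size, which is one reason the number of proteins is relevant to the design.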
Batch design
The main purpose of batch design is to allocate samples into balanced groups to minimize technical bias, such as instrument variation, in large cohort studies.
Upload your sample data matrix in .txt or .csv format with the Browse button. Choose the separator according to the file format: Comma for .csv files; Semicolon, Comma or Tab for .txt files.
Select columns for balanced batch design: choose the column names (attributes) of your sample dataset that need to be balanced.
Weights for columns: the relative weights of the selected attributes when they are balanced together. For example, submitting 1,1,2 for selected columns A, B, C gives column A a weight of 25%, column B 25% and column C 50%. Weights are normalized as w_i / (w_1 + ... + w_n), where n is the number of attributes.
Number of samples in each batch: how many samples you expect in each balanced group.
Select numeric columns for balanced batch design: attributes of numeric type, such as age or tumor, node, metastasis (TNM) stage, must be declared here.
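The weight normalization described above can be sketched in a few lines. The function names are hypothetical; only the 1,1,2 example and its 25%/25%/50% result come from the text.

```python
def parse_weights(text):
    """Parse a comma-separated weight string such as '1,1,2'."""
    return [float(part) for part in text.split(",")]


def normalize_weights(weights):
    """Divide each weight by the sum so that the weights add up to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return [w / total for w in weights]


normalized = normalize_weights(parse_weights("1,1,2"))
print(normalized)  # [0.25, 0.25, 0.5]
```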
Download the batch design test data
For this demo, we will be using a Delayed post-hypoxic leukoencephalopathy (not published) dataset comprising information on 168 samples, with the attributes “Sample ID”, “Type”, “Sex”, “Age”, “TNM” and “Fuhrman”. Download the batchdesign.csv file from “Online Help - Test data files used for batch design - Get”.
Click on Browse.. to upload the batch_design.csv file and choose Comma as the separator.
Select “Type”, “Sex”, “Age”, “TNM” and “Fuhrman” for balanced batch design.
Input weights “1,1,1,1,1” for “Type”, “Sex”, “Age”, “TNM” and “Fuhrman”.
Input “15” as the number of samples in each batch.
Select “Age” and “TNM” as numeric columns for balanced batch design.
Click on Submit and wait for the result, which will be shown on the right side. The batchId column gives the batch number assigned to each sample.
Click on Download to get the BatchDesignResult.txt file.
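The idea of spreading attribute combinations evenly over batches can be illustrated with a simple stratified round-robin. This is only a sketch under stated assumptions: the algorithm ProteomeExpert actually uses is not described here, column weights and numeric columns are ignored, and the toy sample records and the function name are made up for the example.

```python
import math
from collections import Counter


def assign_batches(samples, columns, batch_size, id_column="Sample ID"):
    """Return {sample_id: batch_id}; batch ids start at 1."""
    n_batches = math.ceil(len(samples) / batch_size)
    # Sort so that samples sharing the same attribute values are adjacent,
    # then deal them out in turn so each stratum is spread over all batches.
    ordered = sorted(samples, key=lambda s: tuple(str(s[c]) for c in columns))
    return {s[id_column]: i % n_batches + 1 for i, s in enumerate(ordered)}


# Toy data (hypothetical): 12 samples, 4 Type/Sex combinations of 3 samples each
samples = [
    {"Sample ID": f"S{i:02d}",
     "Type": "T" if i % 2 else "N",
     "Sex": "F" if i % 4 < 2 else "M"}
    for i in range(12)
]

batches = assign_batches(samples, ["Type", "Sex"], batch_size=4)
print(Counter(batches.values()))  # number of samples per batch
```

With the tutorial settings (168 samples, 15 per batch), the same rule would produce 12 batches of 14 samples each.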
ProteomeExpert-Data upload
Overview
Data Upload
Data Upload is the core data input interface for uploading your own data files. The module accepts your protein matrix and sample annotation files (the experiment run sample file and, if available, the individual file) as the input data for most modules. It also interactively merges the sample and individual information into one file, which is required by some modules such as statistics, data mining and data preprocessing. There are two ways to upload data: Two files format and Three files format. Which one to choose depends on whether the sample file already stores all the annotation you need.
Two files format
Upload your files (protein, sample) in .txt or .csv format with the Browse button. Choose the separator according to the file format: Comma for .csv files; Semicolon, Comma or Tab for .txt files.
Three files format
Upload your files (protein, sample, individual) in .txt or .csv format with the Browse button. Choose the separator according to the file format: Comma for .csv files; Semicolon, Comma or Tab for .txt files. Select the sample id column for further analysis (the protein file should use the same sample ids as the sample file), and select the individual id/name column in both the sample file and the individual file as the reference for merging. This tool merges the files into a single template for further analysis.
Tutorial
Two files format
Upload your protein matrix file and sample information file as follows:
Three files format
- Download the protein, sample and individual test data
Download the test_prot.txt, test_sample.csv and test_individual.csv files from “Online Help - Test data files used for batch design - Get”. test_individual.csv contains information on 21 individuals, and test_sample.csv contains individual and sample information. All of the test data are from the Delayed post-hypoxic leukoencephalopathy (DPHL) dataset.
- Select your protein file: click on Browse.. to upload the test_prot.txt file and choose Tab as the separator.
- Select your sample file: click on Browse.. to upload the sample_individual.csv file and choose Comma as the separator.
- Select your individual file: click on Browse.. to upload the individual_prot.txt file and choose Comma as the separator.
- Annotate sample columns: select SampleName as the sample id and Individual_ID as the individual id/name, then click on Submit.
- Annotate individual columns: select Individual_ID as the individual id/name, then click on Submit.
Click on Merge; the result will be shown at the bottom of the page. After merging, all uploaded data are saved on the server for further analysis.
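Conceptually, the merge step joins each sample row to its individual row through the shared individual id column. The sketch below shows this with Python's standard csv module. The SampleName and Individual_ID column names come from the tutorial; the Sex and Age columns, the toy rows and the function name are hypothetical, and the server-side implementation may differ.

```python
import csv
import io


def merge_sample_individual(sample_rows, individual_rows, key="Individual_ID"):
    """Attach the individual-level columns to every sample row via the shared key."""
    by_id = {row[key]: row for row in individual_rows}
    merged = []
    for sample in sample_rows:
        individual = by_id.get(sample[key], {})  # empty if the id is unknown
        merged.append({**individual, **sample})  # sample columns win on a name clash
    return merged


# Toy stand-ins for the sample and individual files (hypothetical content)
sample_csv = "SampleName,Individual_ID\nrun1,P1\nrun2,P1\nrun3,P2\n"
individual_csv = "Individual_ID,Sex,Age\nP1,F,54\nP2,M,61\n"

samples = list(csv.DictReader(io.StringIO(sample_csv)))
individuals = list(csv.DictReader(io.StringIO(individual_csv)))
merged = merge_sample_individual(samples, individuals)

for row in merged:
    print(row)
```

Several samples may point to the same individual, so the merged table has one row per sample, not one per individual.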
ProteomeExpert-Data preprocessing
Overview
Data Preprocessing
Data Preprocessing transforms the data in accordance with the modeling and experiment conditions configured in the project.
The protein matrix analyzed in Data Preprocessing should first be uploaded in “Data Upload - select your protein file”. Under select matrix, choose uploadedProtMatrix for the following analysis. This module includes methods for log transformation, missing value substitution, normalization, batch effect adjustment (using ComBat), and replicate treatment (using the values of replicates to fill in missing values).
- If you need a logarithmic transformation of the data, use Log Transform. Log2 and Log10 are available. Choose None to skip this step.
- We provide four options for Missing Value Substitution: “1”, “0”, “10% of minimum” or “minimum”. Choose None to skip this step.
- Three methods, i.e. “Quantile”, “Z-score” and “Max-Min”, can be used for data Normalization.
In addition, Technical Replicates and Biological Replicates can be replaced by their Mean or Median value. To correct unwanted batch effects, select the batch column name under Remove Batch Effects.
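The following sketch illustrates what the log transformation, missing value substitution and the “Z-score” and “Max-Min” normalizations do to a small protein-by-sample matrix. Several points are assumptions rather than documented behavior of ProteomeExpert: the order of the steps (log first, then substitution), taking the minimum over the whole matrix, normalizing per sample column, and all function names.

```python
import math
from statistics import mean, pstdev


def log_transform(matrix, base=2):
    """Log-transform positive values; missing entries (None) are kept as None."""
    log_fn = {2: math.log2, 10: math.log10}[base]
    return [[None if v is None else log_fn(v) for v in row] for row in matrix]


def substitute_missing(matrix, method="minimum"):
    """Replace None using one of the four options named in the text."""
    observed = [v for row in matrix for v in row if v is not None]
    fill = {"1": 1.0, "0": 0.0,
            "minimum": min(observed),
            "10% of minimum": 0.1 * min(observed)}[method]
    return [[fill if v is None else v for v in row] for row in matrix]


def normalize_columns(matrix, method="Z-score"):
    """Normalize each sample column; constant columns would divide by zero."""
    out_cols = []
    for col in zip(*matrix):
        if method == "Z-score":
            mu, sd = mean(col), pstdev(col)
            out_cols.append([(v - mu) / sd for v in col])
        elif method == "Max-Min":
            lo, hi = min(col), max(col)
            out_cols.append([(v - lo) / (hi - lo) for v in col])
        else:
            raise ValueError(f"unknown method: {method}")
    return [list(row) for row in zip(*out_cols)]


# Toy matrix: rows are proteins, columns are samples, None marks a missing value
raw = [[4.0, 8.0], [None, 16.0], [16.0, 32.0]]
logged = log_transform(raw, base=2)
filled = substitute_missing(logged, method="minimum")
scaled = normalize_columns(filled, method="Max-Min")
print(scaled)
```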
Tutorial
- Download the test data
For this demo, we will be using a protein matrix dataset comprising 3724 identified proteins from 24 samples. Download the test_prot.txt file from “Online Help - Test data files used for data console - The test protein matrix contains 24 DIA runs”.
- Upload the test file to “Data Upload - select your protein file”.
- Go to Data Preprocessing and select “uploadedProtMatrix”, set parameters e.g.:
- Log Transform: Log2
- Missing Value Substitution: 0
- Normalization: Quantile
- Batch effects correction: None
- Technical Replicates: None
- Biological Replicates: None
- Click on Submit and wait for the result, which will be shown on the right side.
- Click on Download to get the PreProcessed.txt file.
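The tutorial above selects Quantile normalization. As a rough illustration of that technique, the sketch below forces every sample column to share the same distribution, namely the mean of the sorted columns. It assumes a complete matrix (no missing values) and breaks ties by position instead of averaging them; the function name and toy numbers are made up, and the tool's own implementation may differ.

```python
def quantile_normalize(matrix):
    """Quantile-normalize the sample columns of a protein-by-sample matrix."""
    columns = [list(col) for col in zip(*matrix)]
    n_rows = len(matrix)
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean of the k-th smallest value across samples
    reference = [sum(col[k] for col in sorted_cols) / len(columns)
                 for k in range(n_rows)]
    normalized_cols = []
    for col in columns:
        order = sorted(range(n_rows), key=lambda i: col[i])  # row indices by rank
        new_col = [0.0] * n_rows
        for rank, i in enumerate(order):
            new_col[i] = reference[rank]
        normalized_cols.append(new_col)
    return [list(row) for row in zip(*normalized_cols)]


# Toy matrix: 4 proteins (rows) measured in 3 samples (columns)
matrix = [[5.0, 4.0, 3.0],
          [2.0, 1.0, 4.0],
          [3.0, 3.0, 6.0],
          [4.0, 2.0, 8.0]]
normalized = quantile_normalize(matrix)
for row in normalized:
    print(row)
```

After normalization every column contains the same set of values, only arranged according to the original ranks within that sample.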
ProteomeExpert-Quality control
Overview
Quality control (QC)
QC shows the quality of your protein matrix, measured by the missing value ratio and reproducibility (Pearson correlation). The protein matrix must be uploaded in 'Data Upload - select your protein file' before starting this step. Under Select matrix, choose uploadedProMatrix as the matrix to be analyzed, and under Select your interesting column name choose the attribute of interest, which is used to group the counts of proteins by the non-missing ratio of each protein. Then choose MissingValueExplore and Reproducibility as the modules you want to run.
The MissValueExplore module is designed to explore missing data distributions, focusing on numeric missing data. The report shows missing data distributions in different tissue/disease types, as well as the distribution in each row (protein) and each column (sample).
The Reproducibility module is designed to explore correlations between samples using Pearson correlation.
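As a small illustration of the row-wise and column-wise missing value summaries described above, the sketch below computes the missing ratio per protein and per sample. The function name and toy matrix are hypothetical, and the actual report contains more than these two summaries.

```python
def missing_ratios(matrix):
    """Return (per-protein, per-sample) fractions of missing (None) values."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    per_protein = [sum(v is None for v in row) / n_cols for row in matrix]
    per_sample = [sum(row[j] is None for row in matrix) / n_rows
                  for j in range(n_cols)]
    return per_protein, per_sample


# Toy matrix: rows are proteins, columns are samples, None marks a missing value
matrix = [[1.0, None, 3.0, 4.0],
          [None, None, 2.0, 2.5],
          [5.0, 6.0, 7.0, 8.0]]
per_protein, per_sample = missing_ratios(matrix)
print(per_protein)
print(per_sample)
```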
Tutorial
- Select uploadedProMatrix as the matrix to be analyzed.
- Select MissValueExplore and Reproducibility as the modules you want to run, then click on Submit. The results will be shown on the right side of the page.
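The reproducibility check rests on pairwise Pearson correlation between samples. A minimal sketch follows; dropping protein pairs in which either value is missing (pairwise-complete observations) is an assumption, as are the function names and the toy intensities.

```python
import math


def pearson(x, y):
    """Pearson correlation over positions where both values are present."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    var_x = sum((a - mean_x) ** 2 for a, _ in pairs)
    var_y = sum((b - mean_y) ** 2 for _, b in pairs)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(samples):
    """Pairwise correlations for a {sample_name: intensities} mapping."""
    names = list(samples)
    return {(a, b): pearson(samples[a], samples[b]) for a in names for b in names}


# Toy data (hypothetical): three runs with four protein intensities each
samples = {
    "run1": [1.0, 2.0, 3.0, 4.0],
    "run2": [2.0, 4.0, 6.0, 8.0],
    "run3": [4.0, 3.0, 2.0, 1.0],
}
corr = correlation_matrix(samples)
print(corr[("run1", "run2")], corr[("run1", "run3")])
```

Technical replicates of the same sample are expected to correlate close to 1, so low values in this matrix point to runs worth inspecting.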