omicspreprocessing package
Submodules
omicspreprocessing.core module
- omicspreprocessing.core.ROC_curve_analysis(labels, scores, curve_title, plot=True)
Compute and optionally plot the ROC curve for binary classification.
- Return type:
float
Parameters:
- labels : np.ndarray
Array of true binary labels (0 or 1), where 1 indicates the positive class.
- scores : np.ndarray
Array of predicted scores or probabilities for the positive class. Higher scores indicate a higher likelihood of the positive class.
- curve_title : str
Title or label for the ROC curve plot.
- plot : bool, optional
Whether to plot the ROC curve. Default is True.
Returns:
: float
Area Under the Curve (AUC) of the ROC curve.
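The returned AUC can be read as the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one. The following is a minimal pure-Python sketch of that interpretation, not the package's implementation; the helper name `rank_auc` is hypothetical.

```python
def rank_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = rank_auc(labels, scores)  # 3 of 4 pairs are ordered correctly
```

This pairwise form is quadratic in the number of samples, so it is only meant to illustrate what the returned number means.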
- omicspreprocessing.core.anova_test(inputDF, protein_peptide, metaDataColumn)
Perform one-way ANOVA tests for each protein/peptide across groups defined in metadata.
- Return type:
DataFrame
Parameters:
- inputDF : pd.DataFrame
DataFrame where rows are samples/patients and columns are proteins/peptides.
- protein_peptide : List[str]
List of protein/peptide column names to perform ANOVA on.
- metaDataColumn : str
Name of the metadata column in inputDF that contains group labels for ANOVA.
Returns:
: pd.DataFrame
DataFrame with columns:
  - ‘F_tests’: ANOVA F-statistics per protein/peptide
  - ‘p_values’: corresponding p-values
  - ‘means_pergroup’: list of group means per protein/peptide
  - ‘Majority protein IDs’: protein/peptide names
  - ‘fdr’: FDR-adjusted p-values using Benjamini-Hochberg correction
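The ‘fdr’ column holds Benjamini-Hochberg adjusted p-values. As a rough illustration of that correction only (the package may well delegate to a statistics library instead), here is a self-contained sketch; `benjamini_hochberg` is a hypothetical helper name.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
```

Each raw p-value is scaled by the number of tests divided by its rank, and the running minimum keeps the adjusted values from decreasing as the raw p-values grow.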
- omicspreprocessing.core.calculate_median_z_scores(df)
Calculate column-wise median-centered z-scores for a DataFrame.
For each column, the z-score is computed by subtracting the column median and dividing by the column standard deviation, ignoring NaNs.
Parameters:
- df : pd.DataFrame
Input DataFrame with numerical values. NaNs and infinite values will be handled.
Returns:
: pd.DataFrame
DataFrame of the same shape with median-centered z-scores computed column-wise.
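A minimal pandas sketch of the computation described above. It assumes that infinite values are treated as missing and that the sample standard deviation (ddof=1) is used; neither detail is confirmed by this documentation, and `median_z_scores` is a hypothetical name.

```python
import numpy as np
import pandas as pd


def median_z_scores(df):
    """Column-wise (x - median) / std, skipping NaNs."""
    # Assumption: +/-inf are treated like missing values.
    clean = df.replace([np.inf, -np.inf], np.nan)
    # median() and std() skip NaNs by default and broadcast over columns.
    return (clean - clean.median()) / clean.std()


df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, np.nan],
    "b": [10.0, 20.0, 60.0, np.inf],
})
z = median_z_scores(df)
```

Because the median rather than the mean is subtracted, the resulting columns are centered on zero at their median and do not necessarily have a mean of zero.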
- omicspreprocessing.core.check_path_exist(func)
- omicspreprocessing.core.compare_two_distributions(data1, data2)
Compare two empirical distributions using the two-sample Kolmogorov-Smirnov (KS) test.
This function tests whether the two input datasets are drawn from the same continuous distribution. It prints the KS test statistic, the p-value, and an interpretation based on a significance level of 0.05.
Parameters:
- data1 : array-like
The first sample distribution (e.g., list, NumPy array, or pandas Series).
- data2 : array-like
The second sample distribution.
Returns:
: None
This function prints the KS statistic, p-value, and an interpretation of the test result.
Notes:
Null hypothesis (H0): The two distributions are identical.
A p-value > 0.05 suggests that the two distributions are similar (fail to reject H0).
A p-value ≤ 0.05 indicates that the two distributions are statistically different (reject H0).
This test is non-parametric and sensitive to differences in both location and shape of the distributions.
- omicspreprocessing.core.custom_imputation(input)
Custom imputation for triplicate samples in proteomics data.
Rules:
  - If only one replicate out of three is expressed (non-NaN/non-zero), set all three replicates to zero.
  - If exactly two replicates are expressed and one is missing, impute the missing value with the median of the two expressed replicates.
Assumptions:
  - Input DataFrame has samples as the index, with replicate identifiers ‘_1’, ‘_2’, ‘_3’ suffixing sample names.
  - Columns correspond to proteins.
- Return type:
DataFrame
Parameters:
- df : pd.DataFrame
DataFrame with samples as index and proteins as columns. Sample names must include ‘_1’, ‘_2’, or ‘_3’ to identify replicates.
Returns:
: pd.DataFrame
DataFrame with the same shape as input but with missing replicate values imputed or zeroed as per the rules.
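The two rules can be illustrated on the three replicate values of a single protein. This sketch is an assumption-laden illustration rather than the package's code: it treats both NaN and zero as "not expressed", and `impute_triplicate` is a hypothetical name.

```python
import math
from statistics import median


def impute_triplicate(values):
    """Apply the two replicate rules to one protein's three values."""
    def expressed(v):
        # Assumption: NaN and zero both count as "not expressed".
        return not math.isnan(v) and v != 0

    present = [v for v in values if expressed(v)]
    if len(present) == 1:
        # Rule 1: a single expressed replicate is not trusted.
        return [0.0, 0.0, 0.0]
    if len(present) == 2:
        # Rule 2: fill the gap with the median of the two expressed values.
        fill = median(present)
        return [v if expressed(v) else fill for v in values]
    # Zero or three expressed replicates: nothing to change.
    return list(values)


nan = float("nan")
one_expressed = impute_triplicate([5.0, nan, nan])
two_expressed = impute_triplicate([4.0, nan, 6.0])
```

With two values, the median equals their mean, so the imputed replicate sits midway between the two observed ones.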
- omicspreprocessing.core.do_normalize_with_target_df(z_scored_value, average_target, std_target)
Convert a z-scored value to the scale of a target distribution.
This function takes a standardized (z-scored) value and transforms it to a value in a target distribution with a specified mean and standard deviation. It performs the inverse z-score transformation:
normalized_value = mean_target + z * std_target
Parameters:
- z_scored_value : float
The z-scored (standardized) value to be mapped to the target distribution.
- average_target : float
The mean (μ) of the target distribution.
- std_target : float
The standard deviation of the target distribution.
Returns:
: float
The normalized value on the scale of the target distribution.
Example:
>>> do_normalize_with_target_df(1.5, 100, 15)
122.5
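To show where the z-score typically comes from, the sketch below standardizes a value against a source sample and then applies the inverse transformation stated above. The function `to_target_scale` is a hypothetical stand-in that follows the documented formula; the source data are made up for illustration.

```python
from statistics import mean, stdev


def to_target_scale(z, mean_target, std_target):
    """Inverse z-score: mean_target + z * std_target."""
    return mean_target + z * std_target


# Standardize one value against a small source sample ...
source = [2.0, 4.0, 6.0]
z = (6.0 - mean(source)) / stdev(source)  # (6 - 4) / 2 = 1.0

# ... then express it on a target scale with mean 100 and std 15.
mapped = to_target_scale(z, 100, 15)
```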
- omicspreprocessing.core.do_shapiro_test(df, column_to_do_test)
Perform the Shapiro-Wilk test for normality on a specific column of a DataFrame.
This function applies the Shapiro-Wilk test to determine whether the values in the specified column are normally distributed. It is suitable for small sample sizes (n < 5000) and prints both the test statistic and the interpretation based on a significance level of 0.05.
Parameters:
- df : pd.DataFrame
The DataFrame containing the data to test.
- column_to_do_test : str
The name of the column in df on which to perform the Shapiro-Wilk test.
Returns:
: None
This function prints the results to the console but does not return anything.
Notes:
The null hypothesis (H0) of the Shapiro-Wilk test is that the data is normally distributed.
A p-value greater than 0.05 suggests normality (fail to reject H0).
A p-value less than or equal to 0.05 suggests deviation from normality (reject H0).
- omicspreprocessing.core.do_smirnov_test(df, column_to_do_test)
Perform the Kolmogorov-Smirnov test for normality on a specific column of a DataFrame.
This function applies the one-sample Kolmogorov-Smirnov (K-S) test to compare the empirical distribution of the specified column with a standard normal distribution.
Parameters:
- df : pd.DataFrame
The DataFrame containing the data to test.
- column_to_do_test : str
The name of the column in df to test for normality.
Returns:
: None
The function prints the test statistic, p-value, and an interpretation based on a significance level of 0.05.
Notes:
The null hypothesis (H0) is that the data comes from a standard normal distribution.
A p-value > 0.05 suggests the data may be normally distributed (fail to reject H0).
A p-value ≤ 0.05 indicates the data does not follow a normal distribution (reject H0).
This test assumes the data is already standardized; otherwise, results may be misleading.
- omicspreprocessing.core.get_cv_from_melted_df(melted_df, protein_col='Proteins_names', value_col='value')
Calculate the coefficient of variation (CV) for each protein from a long-format DataFrame.
The function groups the data by protein names, computes the standard deviation and mean of the specified values, and then calculates the CV as std/mean.
- Return type:
DataFrame
Parameters:
- melted_df : pd.DataFrame
A long-format DataFrame containing protein names and corresponding values.
- protein_col : str, optional
The column name containing protein identifiers. Default is ‘Proteins_names’.
- value_col : str, optional
The column name containing numeric values for which the CV is calculated. Default is ‘value’.
Returns:
: pd.DataFrame
A DataFrame with columns: ‘std’, ‘mean’, ‘cv’, and ‘Gene names’ (protein identifiers). The ‘Gene names’ column is copied from the index.
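The grouping and CV calculation described above can be sketched with a pandas `groupby`. This is an illustration under the documented column names, not the package's source; the toy data are invented.

```python
import pandas as pd

melted = pd.DataFrame({
    "Proteins_names": ["P1", "P1", "P1", "P2", "P2", "P2"],
    "value": [9.0, 10.0, 11.0, 5.0, 5.0, 5.0],
})

# Per-protein standard deviation and mean of the value column.
stats = melted.groupby("Proteins_names")["value"].agg(["std", "mean"])
# Coefficient of variation as a decimal fraction (std / mean).
stats["cv"] = stats["std"] / stats["mean"]
# Copy the protein identifiers from the index into a regular column.
stats["Gene names"] = stats.index
```

Here P1 has a standard deviation of 1 around a mean of 10, giving a CV of 0.1, while the constant P2 values give a CV of 0.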
- omicspreprocessing.core.get_replicate_number(df, column_name='Sample_name')
Add a ‘replicate’ column to the DataFrame by assigning replicate numbers within groups.
For each group defined by the unique values in the specified column_name, this function assigns sequential replicate numbers starting from 1. It adds a new column called ‘replicate’ to the returned DataFrame.
- Return type:
DataFrame
Parameters:
- df : pd.DataFrame
The input DataFrame containing the data to process.
- column_name : str, optional
The column to group by when determining replicates. Default is ‘Sample_name’.
Returns:
: pd.DataFrame
A new DataFrame with an additional ‘replicate’ column indicating the replicate number within each group.
Raises:
- ValueError
If the input is not a DataFrame or the specified column does not exist.
Notes:
Replicate numbers start at 1.
This function preserves the original DataFrame structure but returns a modified copy.
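One way to obtain this behavior in pandas is `groupby(...).cumcount()`. The sketch below mirrors the documented contract (copy, 1-based numbering, ValueError on bad input) but is not taken from the package; `add_replicate_column` is a hypothetical name.

```python
import pandas as pd


def add_replicate_column(df, column_name="Sample_name"):
    """Return a copy of df with a 1-based 'replicate' counter per group."""
    if not isinstance(df, pd.DataFrame):
        raise ValueError("df must be a pandas DataFrame")
    if column_name not in df.columns:
        raise ValueError(f"column {column_name!r} not found")
    out = df.copy()
    # cumcount() numbers rows within each group starting at 0.
    out["replicate"] = out.groupby(column_name).cumcount() + 1
    return out


samples = pd.DataFrame({"Sample_name": ["A", "B", "A", "B", "A"]})
numbered = add_replicate_column(samples)
```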
- omicspreprocessing.core.get_t_across_all_proteins(df, group_1_name, group_2_name, data_Set_name, Samplename2TMTchannel)
Perform t-tests across all proteins between two groups and return statistics with FDR correction.
- Return type:
DataFrame
Parameters:
- df : pd.DataFrame
DataFrame with proteins as index and samples as columns.
- group_1_name : str
Name of the first group to compare.
- group_2_name : str
Name of the second group to compare.
- data_Set_name : str
Dataset identifier used in value extraction.
- Samplename2TMTchannel : dict
Mapping of sample names to TMT channel lists.
Returns:
: pd.DataFrame
DataFrame with columns:
  - ‘proteins’: protein names
  - ‘p_values’: p-values from t-test
  - ‘t_statistics’: t statistics
  - ‘g1_mean’: mean of group 1
  - ‘g2_mean’: mean of group 2
  - ‘FDR_correction’: adjusted p-values (FDR)
  - ‘pP_VALUE’: transformed adjusted p-values (using _pfunclog)
  - ‘delta_mean’: difference of means (g1_mean - g2_mean)
  - ‘type’: ‘up’ if delta_mean > 0 else ‘down’
- omicspreprocessing.core.impute_normal_down_shift_distribution(unimputerd_df, column_wise=True, width=0.3, downshift=1.8, seed=100)
Perform missing value imputation by replacing NaNs with values drawn from a normal distribution shifted downward relative to the observed data distribution.
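The imputed values are drawn from a normal distribution whose mean is the observed mean shifted down by downshift standard deviations and whose standard deviation is the observed standard deviation scaled by width.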
- Return type:
DataFrame
Parameters:
- unimputerd_df : pd.DataFrame
DataFrame with missing values (NaNs) to be imputed.
- column_wise : bool, optional
If True, imputation is done separately for each column using that column’s statistics. If False, global mean and std across the entire DataFrame are used. Default is True.
- width : float, optional
Scale factor for the standard deviation of the imputed distribution relative to the sample std.
- downshift : float, optional
Number of standard deviations by which to downshift the mean of the imputed distribution.
- seed : int, optional
Random seed for reproducibility.
Returns:
: pd.DataFrame
DataFrame with imputed values replacing NaNs.
Reference:
Imputation method inspired by: https://rdrr.io/github/jdreyf/jdcbioinfo/man/impute_normal.html#google_vignette
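For a single column, the down-shifted imputation can be sketched with the standard library alone. This is an illustrative approximation of the idea, not the package's implementation: it uses Python's `random` generator and the sample standard deviation, and `impute_column` is a hypothetical name.

```python
import math
import random
from statistics import mean, stdev


def impute_column(values, width=0.3, downshift=1.8, seed=100):
    """Replace NaNs with draws from N(mu - downshift*sigma, width*sigma)."""
    observed = [v for v in values if not math.isnan(v)]
    mu, sigma = mean(observed), stdev(observed)
    rng = random.Random(seed)  # fixed seed makes the draws reproducible
    return [
        rng.gauss(mu - downshift * sigma, width * sigma) if math.isnan(v) else v
        for v in values
    ]


nan = float("nan")
column = [20.0, 22.0, nan, 24.0, nan, 26.0]
imputed = impute_column(column)
```

Observed values are left untouched; only the missing positions receive draws from the narrower, lower distribution, which models intensities that fell below the detection limit.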
- omicspreprocessing.core.intersection(lst1, lst2)
Return the intersection of two lists as a list of unique elements present in both.
Parameters:
- lst1 : list
The first list.
- lst2 : list
The second list.
Returns:
: list
A list containing the unique elements common to both lst1 and lst2.
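A small sketch of the described behavior. The ordering of the result is an assumption (here, the order of first appearance in the first list), since the documentation does not state it; `unique_intersection` is a hypothetical name.

```python
def unique_intersection(lst1, lst2):
    """Unique elements present in both lists, in lst1 order (assumed)."""
    in_second = set(lst2)
    seen = set()
    result = []
    for item in lst1:
        if item in in_second and item not in seen:
            seen.add(item)
            result.append(item)
    return result


common = unique_intersection(["P1", "P2", "P2", "P3"], ["P3", "P2", "P9"])
```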
- omicspreprocessing.core.log2_transform_intensities(df)
Apply log2 transformation to intensity values in the DataFrame.
Zeros are replaced with NaN before transformation to avoid -inf values.
- Return type:
DataFrame
Parameters:
- df : pd.DataFrame
DataFrame containing intensity values (numeric).
Returns:
: pd.DataFrame
DataFrame of the same shape with log2-transformed intensities. Zeros replaced by NaN.
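The zero handling described above can be sketched in two steps with pandas and NumPy. This is an illustration of the documented behavior, not the package's code; `log2_sketch` is a hypothetical name.

```python
import numpy as np
import pandas as pd


def log2_sketch(df):
    """log2-transform intensities, mapping zeros to NaN instead of -inf."""
    return np.log2(df.replace(0, np.nan))


intensities = pd.DataFrame({"s1": [8.0, 0.0], "s2": [1.0, 4.0]})
logged = log2_sketch(intensities)
```

Replacing zeros first means downstream statistics that skip NaNs are not distorted by negative infinities.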
- omicspreprocessing.core.log2fold_change_calculator(df)
Calculate the log2 fold change of intensities column-wise relative to the column mean.
For each value in the DataFrame, this function subtracts the mean of its column, resulting in log2 fold changes relative to the average intensity across all samples per column.
- Return type:
DataFrame
Parameters:
- df : pd.DataFrame
A DataFrame of log-transformed intensity values with samples as columns.
Returns:
: pd.DataFrame
A DataFrame of the same shape as df, containing log2 fold changes for each entry.
- omicspreprocessing.core.log2fold_change_calculator_LOO(df)
Calculate leave-one-out (LOO) log2 fold changes column-wise relative to the mean excluding the current sample.
For each sample (row), this function computes the mean of all other samples (excluding the current one) for each column, then calculates the difference between the sample’s value and this leave-one-out mean.
- Return type:
DataFrame
Parameters:
- df : pd.DataFrame
A DataFrame of log-transformed intensity values with samples as rows and features as columns.
Returns:
: pd.DataFrame
A DataFrame of the same shape as df, containing the LOO log2 fold changes.
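The leave-one-out mean does not need an explicit loop: for a column with n values and total S, the mean of all other samples for a value x is (S - x) / (n - 1). The sketch below uses that identity and assumes there are no missing values; it is not the package's implementation, and `loo_fold_change` is a hypothetical name.

```python
import pandas as pd


def loo_fold_change(df):
    """Value minus the mean of all other rows in the same column."""
    n = len(df)
    totals = df.sum()
    # (column total - own value) / (n - 1) is the leave-one-out mean.
    loo_mean = (-df + totals) / (n - 1)
    return df - loo_mean


log_df = pd.DataFrame({"f1": [1.0, 2.0, 3.0], "f2": [10.0, 20.0, 60.0]})
loo = loo_fold_change(log_df)
```

Compared with subtracting the ordinary column mean, this keeps each sample from pulling its own reference toward itself, which matters most when there are few samples.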
- omicspreprocessing.core.log_transform_intensities(df)
Apply log10 transformation to intensity values in the DataFrame.
Zeros are replaced with NaN before transformation to avoid -inf values.
- Return type:
DataFrame
Parameters:
- df : pd.DataFrame
DataFrame containing intensity values (must be numeric).
Returns:
: pd.DataFrame
DataFrame of the same shape with log10-transformed intensities. Zeros replaced by NaN.
- omicspreprocessing.core.make_cv_plot(df)
Create and display a cumulative distribution plot of the coefficient of variation (CV).
- Return type:
None
Parameters:
- df : pd.DataFrame
DataFrame containing a ‘cv’ column representing coefficient of variation values (expected as decimal fractions, e.g., 0.05 for 5%).
Raises:
- ValueError:
If the ‘cv’ column is not present in the DataFrame.
- omicspreprocessing.core.make_pair_combinations(items)
Generate all unique pairwise combinations from a list of items.
- Parameters:
items (list) – List of elements (e.g., strings, numbers) from which to create pairs.
- Returns:
A list containing all possible unique pairs, where each pair is represented as a two-element list. Order within each pair follows the original input order.
- Return type:
list of list
Examples
>>> make_pair_combinations(["A", "B", "C"])
[['A', 'B'], ['A', 'C'], ['B', 'C']]
Notes
Uses itertools.combinations, so no repeated elements and no reversed duplicates.
If items has fewer than 2 elements, the result will be an empty list.
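Based on the note about itertools.combinations, the behavior can be reproduced in a few lines. This is a sketch consistent with the documented output, with `pair_combinations` as a hypothetical stand-in name.

```python
from itertools import combinations


def pair_combinations(items):
    """All unique unordered pairs, each returned as a two-element list."""
    return [list(pair) for pair in combinations(items, 2)]


pairs = pair_combinations(["A", "B", "C"])
single = pair_combinations(["A"])
```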
- omicspreprocessing.core.median_centering(df)
Normalize samples by centering their distributions using median scaling.
Each sample (column) is multiplied by a correction factor defined as the ratio of the average median of reference channels to the sample median. This aligns sample medians around the same central value.
- Return type:
DataFrame
Parameters:
- df : pd.DataFrame
DataFrame of batch intensities with samples as columns and features (e.g., reporter channels) as rows. Zero values are treated as missing and ignored in median calculations.
Returns:
: pd.DataFrame
Median-centered DataFrame with the same shape as input.
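The correction factor can be sketched as follows. Because this documentation does not say which columns count as reference channels, the sketch assumes every column is a reference channel; that choice, and the name `median_center_sketch`, are assumptions rather than the package's behavior.

```python
import numpy as np
import pandas as pd


def median_center_sketch(df):
    """Scale each column so its non-zero median matches the average median."""
    # Zeros are treated as missing when computing medians.
    medians = df.replace(0, np.nan).median()
    # Assumption: all columns act as reference channels.
    factors = medians.mean() / medians
    # Multiplying by a Series aligns the factors with the columns.
    return df * factors


batch = pd.DataFrame({
    "s1": [10.0, 20.0, 30.0, 0.0],
    "s2": [20.0, 40.0, 60.0, 80.0],
})
centered = median_center_sketch(batch)
new_medians = centered.replace(0, np.nan).median()
```

In this toy batch the non-zero medians are 20 and 50, so both columns are rescaled to a common median of 35, and zeros stay zero.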
- omicspreprocessing.core.median_centering_ms1(merged_ms1_df)
Normalize MS1 intensity samples by median centering with batch-specific correction factors.
This function computes the median intensity per sample considering only peptides detected in more than 70% of samples to avoid bias from low-abundance peptides. Each sample’s intensities are then scaled by a correction factor so that all medians align to the global mean median.
- Return type:
DataFrame
Parameters:
- merged_ms1_df : pd.DataFrame
DataFrame with proteins/peptides as rows and samples as columns containing intensity values.
Returns:
: pd.DataFrame
Normalized DataFrame with the same shape as input, with each sample scaled by its correction factor.
- omicspreprocessing.core.one_vs_all_t_test(inputDF, protein_peptide, favoriteentity, metaDataColumn)
Perform a one-vs-all independent t-test for each protein/peptide comparing the favorite entity group against all other groups.
- Return type:
DataFrame
Parameters:
- inputDF : pd.DataFrame
DataFrame with patients/samples as rows and proteins/peptides as columns.
- protein_peptide : List[str]
List of protein/peptide column names on which to perform the t-tests.
- favoriteentity : str
The group of interest (e.g., ‘chordoma’) to compare against all others.
- metaDataColumn : str
Column name in inputDF that contains group labels for each sample.
Returns:
: pd.DataFrame
DataFrame containing t-statistics, p-values, group means, sample sizes, FDR-adjusted p-values, and direction (‘up’/‘down’) indicating whether the favoriteentity group is higher or lower than the other groups.
- omicspreprocessing.core.plot_cv_per_condition(df, condition, Samplename2TMTchannel, data_Set_name)
Plot cumulative histogram of coefficient of variation (CV) percentages for a given condition using Reporter intensity corrected channels.
- Return type:
None
Parameters:
- df : pd.DataFrame
The TMT evidence or proteinGroup DataFrame.
- condition : str
The name of the condition to analyze.
- Samplename2TMTchannel : dict
Dictionary mapping condition names to lists of TMT channel names.
- data_Set_name : str
The dataset name repeated after “Reporter intensity corrected” columns. Used to filter relevant intensity columns.
Returns:
: None
Displays the cumulative histogram plot of CV percentages.
- omicspreprocessing.core.post_hoc_ANOVA(ANOVA_df, protein_list, group_col='Brain region', p_adjust='fdr_bh')
Perform post-hoc pairwise t-tests after ANOVA for multiple proteins.
For each protein in protein_list, this function performs a post-hoc pairwise t-test using the specified grouping column in the ANOVA_df DataFrame, applying multiple testing correction as specified by p_adjust.
Parameters:
- ANOVA_df : pd.DataFrame
DataFrame where rows are samples and columns are protein measurements.
- protein_list : list
List of protein column names in ANOVA_df to perform the post-hoc tests on.
- group_col : str, optional
Name of the metadata column in ANOVA_df used to group samples for testing. Default is ‘Brain region’.
- p_adjust : str, optional
Method for p-value adjustment for multiple comparisons. Default is ‘fdr_bh’ (Benjamini-Hochberg false discovery rate).
Returns:
: list of dict
A list where each element is a dictionary with keys:
  - ‘protein’: the protein name
  - ‘post_hoc_res’: the DataFrame of adjusted p-values from the post-hoc test
Notes:
Requires sp.posthoc_ttest from the scikit-posthocs package.
- omicspreprocessing.core.protein_remover_by_sparcity(df, minimum_samples_inside=20)
Remove proteins from the DataFrame based on sparsity threshold.
Proteins (rows) with fewer than minimum_samples_inside non-NA values across samples (columns) are removed from the DataFrame.
- Return type:
DataFrame
Parameters:
- df : pd.DataFrame
Input DataFrame with proteins as the index and samples as columns.
- minimum_samples_inside : int, optional
Minimum number of samples (columns) in which a protein must have non-NA values to be retained. Proteins with fewer non-NA samples are removed. Default is 20.
Returns:
: pd.DataFrame
Filtered DataFrame containing only proteins meeting the sparsity criterion.
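The filter amounts to counting non-NA values per row and keeping rows that reach the threshold. A minimal pandas sketch of that rule follows; it is not the package's source, and `drop_sparse_rows` is a hypothetical name.

```python
import numpy as np
import pandas as pd


def drop_sparse_rows(df, minimum_samples_inside=20):
    """Keep rows with at least minimum_samples_inside non-NA values."""
    non_na_counts = df.notna().sum(axis=1)
    return df.loc[non_na_counts >= minimum_samples_inside]


proteins = pd.DataFrame(
    {"s1": [1.0, np.nan, 3.0], "s2": [2.0, np.nan, np.nan], "s3": [4.0, 5.0, 6.0]},
    index=["P1", "P2", "P3"],
)
kept = drop_sparse_rows(proteins, minimum_samples_inside=2)
```

With a threshold of 2, P2 (one observed value) is dropped while P1 and P3 are retained.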
- omicspreprocessing.core.raw_median_centering_normalization(df, general_median)
Perform column-wise median centering normalization on a DataFrame using an external reference median.
This function normalizes each column of the input DataFrame by scaling its values so that the column’s median matches the specified general_median. This is useful when adjusting datasets to a common scale based on a known or reference distribution.
Parameters:
- df : pd.DataFrame
A pandas DataFrame containing numerical data to normalize. It is assumed that the DataFrame is indexed and contains no non-numeric columns.
- general_median : float
The reference median value from another distribution that each column should be normalized to.
Returns:
: pd.DataFrame
A new DataFrame of the same shape as df with normalized values such that the median of each column is approximately equal to general_median.
Notes:
Columns containing NaN values will be normalized ignoring the NaNs in median computation.
The index and column names of the original DataFrame are preserved.
- omicspreprocessing.core.seaborn_volcano(df, fc_thresh=0.25, p_thresh=0.05, xaxis=1, draw_dashed_lines=True, p_value_col='p_values', foldchange_col='delta_mean', gene_col='Gene Names', title='Volcano Plot', where_to_save=None)
Create a volcano plot using Seaborn to visualize statistical significance (p-values) versus magnitude of change (fold change) for features such as genes or proteins.
- The plot highlights three categories of points:
“up”: Fold change above fc_thresh and p-value ≤ p_thresh
“down”: Fold change below -fc_thresh and p-value ≤ p_thresh
“ns”: Not significant
- Parameters:
df (pandas.DataFrame) – Input DataFrame containing at least the columns specified in p_value_col, foldchange_col, and gene_col.
fc_thresh (float, default=0.25) – Fold change threshold for significance classification. Values greater than this (in absolute value) are considered significant if p-value passes.
p_thresh (float, default=0.05) – P-value threshold for statistical significance.
xaxis (float, default=1) – X-axis range limit for plotting threshold lines.
draw_dashed_lines (bool, default=True) – Whether to draw dashed threshold lines for fold change and p-value cutoffs.
p_value_col (str, default="p_values") – Column name in df containing p-values.
foldchange_col (str, default="delta_mean") – Column name in df containing fold change values.
gene_col (str, default="Gene Names") – Column name in df containing gene/protein identifiers for labeling.
title (str, default="Volcano Plot") – Title for the plot.
where_to_save (str or None, default=None) – File path to save the plot. If None, the plot is displayed instead.
- Returns:
Displays or saves the volcano plot.
- Return type:
None
Notes
p-values are transformed to -log10(p) for the y-axis.
Only points meeting significance thresholds are labeled.
- Color scheme:
Red: “up” (significant positive fold change)
Blue: “down” (significant negative fold change)
Grey: “ns” (not significant)
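The three-way classification and the y-axis transform can be written out as a small rule. This sketch only restates the thresholds documented above; the function name `classify_point` is hypothetical and the plotting itself is omitted.

```python
import math


def classify_point(fold_change, p_value, fc_thresh=0.25, p_thresh=0.05):
    """Return 'up', 'down', or 'ns' following the documented thresholds."""
    if p_value <= p_thresh and fold_change > fc_thresh:
        return "up"
    if p_value <= p_thresh and fold_change < -fc_thresh:
        return "down"
    return "ns"


def y_value(p_value):
    """Volcano y-axis position: -log10(p)."""
    return -math.log10(p_value)


categories = [
    classify_point(0.5, 0.01),   # large positive change, significant
    classify_point(-0.5, 0.01),  # large negative change, significant
    classify_point(0.5, 0.20),   # large change, not significant
    classify_point(0.1, 0.01),   # significant, but change too small
]
```

A point must pass both the fold-change and the p-value threshold to be colored; passing only one leaves it in the ‘ns’ group.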
- omicspreprocessing.core.show_spent_time(func)
Decorator to measure and display the execution time of a function.
This decorator wraps any function and prints the time it took to execute, which can be helpful for performance monitoring or benchmarking.
Usage:
@show_spent_time
def some_function(…):
    …
Parameters:
- func : callable
The function whose execution time is to be measured.
Returns:
: callable
A wrapped version of the original function that prints the time spent during execution.
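A timing decorator of this kind is commonly written as shown below. This is a generic sketch, not the package's source; the name `show_spent_time_sketch` and the exact message format are assumptions.

```python
import functools
import time


def show_spent_time_sketch(func):
    """Print how long each call to func takes, then return its result."""
    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"{func.__name__} took {elapsed:.4f} s")
        return result
    return wrapper


@show_spent_time_sketch
def add(a, b):
    return a + b


total = add(2, 3)
```

The wrapper passes arguments and the return value through unchanged, so decorating a function does not alter how callers use it.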
- omicspreprocessing.core.t_test(x, y)
Perform an independent two-sample t-test between two numerical groups.
- Return type:
Tuple[float,float]
Parameters:
- x : tuple of floats
First group of numerical observations.
- y : tuple of floats
Second group of numerical observations.
Returns:
- t_stat : float
The computed t-statistic.
- p_value : float
The two-tailed p-value for the test.
- omicspreprocessing.core.univariate_ROC_analysis_by_CV_permutation(pre, favoriteentity, kFold=5, repeats=10, threshold=0.5, scores='scores', labels='labels')
Compute the stability percentage of ROC AUCs above a given threshold using repeated stratified k-fold cross-validation.
- Return type:
float
Parameters:
- pre : pd.DataFrame
DataFrame containing prediction scores and true labels. Must include columns named as specified by scores and labels.
- favoriteentity : str
The label considered as the positive class.
- kFold : int, optional
Number of folds for cross-validation. Default is 5.
- repeats : int, optional
Number of repeated cross-validation runs. Default is 10.
- threshold : float, optional
Threshold for considering AUC as stable. Default is 0.5.
- scores : str, optional
Column name in pre containing the prediction scores. Default is ‘scores’.
- labels : str, optional
Column name in pre containing the true labels. Default is ‘labels’.
Returns:
: float
Percentage of AUCs that are above threshold or below 1 - threshold across all CV splits.
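Once the per-split AUCs are available, the returned percentage follows from a simple count. The sketch below shows only that final step, with made-up AUC values; it assumes the result is on a 0 to 100 scale, which is not confirmed here, and `stability_percentage` is a hypothetical name.

```python
def stability_percentage(aucs, threshold=0.5):
    """Share of AUCs above threshold or below 1 - threshold, in percent."""
    stable = [a for a in aucs if a > threshold or a < 1 - threshold]
    return 100.0 * len(stable) / len(aucs)


# Hypothetical AUCs from five cross-validation splits.
split_aucs = [0.9, 0.75, 0.5, 0.2, 0.65]
stability = stability_percentage(split_aucs, threshold=0.7)
```

Counting values below 1 - threshold as well treats a consistently inverted classifier (AUC near 0) as equally informative as a consistently good one.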
- omicspreprocessing.core.unnest_proteingroups(df)
Split multi-protein group entries in the DataFrame index (separated by ‘;’) into separate rows, duplicating associated data for each protein group.
- Return type:
DataFrame
Parameters:
- df : pd.DataFrame
DataFrame with protein groups as the index, where some index entries may contain multiple protein names separated by semicolons (‘;’).
Returns:
: pd.DataFrame
Expanded DataFrame where each protein group has its own row, with data duplicated accordingly. The new index will be the individual protein group names.
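In pandas, this kind of expansion is often done by splitting the identifiers into lists and calling `explode`. The sketch below illustrates that approach on toy data; it is not the package's implementation, and both `unnest_index` and the temporary column name "protein" are hypothetical.

```python
import pandas as pd


def unnest_index(df, sep=";"):
    """Give every protein in a 'A;B' style index entry its own row."""
    # Move the index into a temporary column so it can be split.
    out = df.rename_axis("protein").reset_index()
    out["protein"] = out["protein"].str.split(sep)
    # explode() repeats the other columns for each list element.
    return out.explode("protein").set_index("protein")


groups = pd.DataFrame({"s1": [1.0, 2.0]}, index=["P1;P2", "P3"])
expanded = unnest_index(groups)
```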
- omicspreprocessing.core.volcanoplot(df, cutoff=None, save_path=None)
Create and optionally save an interactive volcano plot using Plotly.
- Return type:
None
Parameters:
- df : pd.DataFrame
DataFrame containing at least the following columns:
  - ‘delta_mean’: log2 fold change values (x-axis)
  - ‘pP_VALUE’: -log10 transformed p-values or Q-values (y-axis)
  - ‘type’: categorical variable for coloring points (e.g., ‘up’, ‘down’, ‘ns’)
  - ‘proteins’: names for hover labels
- cutoff : float, optional
Y-axis cutoff to add a horizontal reference line (e.g., significance threshold).
- save_path : str, optional
If provided, saves the interactive plot as an HTML file to this path.
Returns:
: None
Displays the interactive plot.
omicspreprocessing.parallel_computing module
- class omicspreprocessing.parallel_computing.ParallelComputing(func=None, list_to_proceed=None, num_cores=3)
Bases: object
- get_result()
Get the result list after parallel processing.
- Return type:
List
- run_in_parallel()
Run the function in parallel over the list of items.
- Return type:
None
- set_func(func)
Set the function to apply in parallel.
- Return type:
None
- set_list(newlist)
Set the list of items to process.
- Return type:
None
- set_num_cores(num_cores)
Set the number of parallel processes.
- Return type:
None