omicspreprocessing package

Submodules

omicspreprocessing.core module

omicspreprocessing.core.ROC_curve_analysis(labels, scores, curve_title, plot=True)

Compute and optionally plot the ROC curve for binary classification.

Return type:

float

Parameters:

labels : np.ndarray

Array of true binary labels (0 or 1), where 1 indicates the positive class.

scores : np.ndarray

Array of predicted scores or probabilities for the positive class. Higher scores indicate higher likelihood of positive class.

curve_title : str

Title or label for the ROC curve plot.

plot : bool, optional

Whether to plot the ROC curve. Default is True.

Returns:

: float

Area Under the Curve (AUC) of the ROC curve.
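
The returned AUC can be read as the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one. The sketch below illustrates that interpretation in plain Python; `pairwise_auc` is a hypothetical helper, not the package's implementation, which may compute the area differently.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    # Count concordant pairs; a tie contributes half a pair.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(labels, scores)  # 3 of 4 pairs are ranked correctly
```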

omicspreprocessing.core.anova_test(inputDF, protein_peptide, metaDataColumn)

Perform one-way ANOVA tests for each protein/peptide across groups defined in metadata.

Return type:

DataFrame

Parameters:

inputDF : pd.DataFrame

DataFrame where rows are samples/patients and columns are proteins/peptides.

protein_peptide : List[str]

List of protein/peptide column names to perform ANOVA on.

metaDataColumn : str

Name of the metadata column in inputDF that contains group labels for ANOVA.

Returns:

: pd.DataFrame

DataFrame with columns:

  • ‘F_tests’: ANOVA F-statistics per protein/peptide.

  • ‘p_values’: Corresponding p-values.

  • ‘means_pergroup’: List of group means per protein/peptide.

  • ‘Majority protein IDs’: Protein/peptide names.

  • ‘fdr’: FDR-adjusted p-values using Benjamini-Hochberg correction.
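
To make the ‘fdr’ column concrete, here is a minimal plain-Python sketch of the Benjamini-Hochberg adjustment. `benjamini_hochberg` is a hypothetical name used for illustration; the package may delegate this step to a statistics library instead.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p-value so that the adjusted
    # values stay monotone in the rank of the raw p-values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
```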

omicspreprocessing.core.calculate_median_z_scores(df)

Calculate column-wise median-centered z-scores for a DataFrame.

For each column, the z-score is computed by subtracting the column median and dividing by the column standard deviation, ignoring NaNs.

Parameters:

df : pd.DataFrame

Input DataFrame with numerical values. NaNs and infinite values will be handled.

Returns:

: pd.DataFrame

DataFrame of the same shape with median-centered z-scores computed column-wise.
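
A minimal sketch of the computation described above, assuming pandas defaults (NaNs skipped, sample standard deviation with `ddof=1`) and assuming infinite values are treated as missing. `median_z_scores` is a hypothetical stand-in, not the package function.

```python
import numpy as np
import pandas as pd


def median_z_scores(df):
    # Treat infinities as missing so they do not distort the statistics.
    clean = df.replace([np.inf, -np.inf], np.nan)
    # median() and std() skip NaNs by default and work column-wise.
    return (clean - clean.median()) / clean.std()


df = pd.DataFrame({"a": [1.0, 2.0, 3.0, np.nan], "b": [10.0, 20.0, 30.0, 40.0]})
z = median_z_scores(df)
```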

omicspreprocessing.core.check_path_exist(func)

omicspreprocessing.core.compare_two_distributions(data1, data2)

Compare two empirical distributions using the two-sample Kolmogorov-Smirnov (KS) test.

This function tests whether the two input datasets are drawn from the same continuous distribution. It prints the KS test statistic, the p-value, and an interpretation based on a significance level of 0.05.

Parameters:

data1 : array-like

The first sample distribution (e.g., list, NumPy array, or pandas Series).

data2 : array-like

The second sample distribution.

Returns:

: None

This function prints the KS statistic, p-value, and an interpretation of the test result.

Notes:

  • Null hypothesis (H0): The two distributions are identical.

  • A p-value > 0.05 means the test found no significant difference between the two distributions (fail to reject H0); it does not prove that they are identical.

  • A p-value ≤ 0.05 indicates that the two distributions are statistically different (reject H0).

  • This test is non-parametric and sensitive to differences in both location and shape of the distributions.

omicspreprocessing.core.custom_imputation(input)

Custom imputation for triplicate samples in proteomics data.

Rules:

  • If only one replicate out of three is expressed (non-NaN/non-zero), set all three replicates to zero.

  • If exactly two replicates are expressed and one is missing, impute the missing value with the median of the two expressed replicates.

Assumptions:

  • Input DataFrame has samples as the index, with replicate identifiers ‘_1’, ‘_2’, ‘_3’ suffixing sample names.

  • Columns correspond to proteins.

Return type:

DataFrame

Parameters:

input : pd.DataFrame

DataFrame with samples as index and proteins as columns. Sample names must include ‘_1’, ‘_2’, or ‘_3’ to identify replicates.

Returns:

: pd.DataFrame

DataFrame with the same shape as input but with missing replicate values imputed or zeroed as per the rules.
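
The two rules can be sketched as follows. This is an illustrative reimplementation under the stated assumptions (unique index labels ending in ‘_1’, ‘_2’, ‘_3’); `impute_triplicates` is a hypothetical name and the package's own code may differ in detail.

```python
import numpy as np
import pandas as pd


def impute_triplicates(df):
    source = df.astype(float)
    out = source.copy()
    # Group rows by sample name with the trailing replicate suffix removed.
    base_names = source.index.str.replace(r"_[123]$", "", regex=True)
    for _, rows in source.groupby(base_names):
        for protein in source.columns:
            values = rows[protein]
            expressed = values.notna() & (values != 0)
            n_expressed = int(expressed.sum())
            if n_expressed == 1:
                # Rule 1: a single expressed replicate is not trusted.
                out.loc[rows.index, protein] = 0.0
            elif n_expressed == 2:
                # Rule 2: fill the gap with the median of the other two.
                missing_rows = values.index[~expressed.to_numpy()]
                out.loc[missing_rows, protein] = values[expressed].median()
    return out


raw = pd.DataFrame(
    {"P1": [5.0, np.nan, np.nan, 4.0, 6.0, np.nan],
     "P2": [1.0, 2.0, 3.0, 0.0, 0.0, 7.0]},
    index=["A_1", "A_2", "A_3", "B_1", "B_2", "B_3"],
)
imputed = impute_triplicates(raw)
```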

omicspreprocessing.core.do_normalize_with_target_df(z_scored_value, average_target, std_target)

Convert a z-scored value to the scale of a target distribution.

This function takes a standardized (z-scored) value and transforms it to a value in a target distribution with a specified mean and standard deviation. It performs the inverse z-score transformation:

normalized_value = average_target + z_scored_value * std_target

Parameters:

z_scored_value : float

The z-scored (standardized) value to be mapped to the target distribution.

average_target : float

The mean (μ) of the target distribution.

std_target : float

The standard deviation of the target distribution.

Returns:

: float

The normalized value on the scale of the target distribution.

Example:

>>> do_normalize_with_target_df(1.5, 100, 15)
122.5

omicspreprocessing.core.do_shapiro_test(df, column_to_do_test)

Perform the Shapiro-Wilk test for normality on a specific column of a DataFrame.

This function applies the Shapiro-Wilk test to determine whether the values in the specified column are normally distributed. It is suitable for small sample sizes (n < 5000) and prints both the test statistic and the interpretation based on a significance level of 0.05.

Parameters:

df : pd.DataFrame

The DataFrame containing the data to test.

column_to_do_test : str

The name of the column in df on which to perform the Shapiro-Wilk test.

Returns:

: None

This function prints the results to the console but does not return anything.

Notes:

  • The null hypothesis (H0) of the Shapiro-Wilk test is that the data is normally distributed.

  • A p-value greater than 0.05 is consistent with normality (fail to reject H0).

  • A p-value less than or equal to 0.05 suggests deviation from normality (reject H0).

omicspreprocessing.core.do_smirnov_test(df, column_to_do_test)

Perform the Kolmogorov-Smirnov test for normality on a specific column of a DataFrame.

This function applies the one-sample Kolmogorov-Smirnov (K-S) test to compare the empirical distribution of the specified column with a standard normal distribution.

Parameters:

df : pd.DataFrame

The DataFrame containing the data to test.

column_to_do_test : str

The name of the column in df to test for normality.

Returns:

: None

The function prints the test statistic, p-value, and an interpretation based on a significance level of 0.05.

Notes:

  • The null hypothesis (H0) is that the data comes from a standard normal distribution.

  • A p-value > 0.05 suggests the data may be normally distributed (fail to reject H0).

  • A p-value ≤ 0.05 indicates the data does not follow a normal distribution (reject H0).

  • This test assumes the data is already standardized; otherwise, results may be misleading.

omicspreprocessing.core.get_cv_from_melted_df(melted_df, protein_col='Proteins_names', value_col='value')

Calculate the coefficient of variation (CV) for each protein from a long-format DataFrame.

The function groups the data by protein names, computes the standard deviation and mean of the specified values, and then calculates the CV as std/mean.

Return type:

DataFrame

Parameters:

melted_df : pd.DataFrame

A long-format DataFrame containing protein names and corresponding values.

protein_col : str, optional

The column name containing protein identifiers. Default is ‘Proteins_names’.

value_col : str, optional

The column name containing numeric values for which the CV is calculated. Default is ‘value’.

Returns:

: pd.DataFrame

A DataFrame with columns: ‘std’, ‘mean’, ‘cv’, and ‘Gene names’ (protein identifiers). The ‘Gene names’ column is copied from the index.
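
The grouping and the std/mean ratio can be sketched with a plain pandas `groupby`. The data below is made up for illustration, and the sample standard deviation (pandas default) is an assumption.

```python
import pandas as pd

melted = pd.DataFrame({
    "Proteins_names": ["P1", "P1", "P1", "P2", "P2", "P2"],
    "value": [9.0, 10.0, 11.0, 20.0, 20.0, 20.0],
})

# One row per protein with its standard deviation and mean.
stats = melted.groupby("Proteins_names")["value"].agg(["std", "mean"])
stats["cv"] = stats["std"] / stats["mean"]
# Copy the protein identifiers from the index into a regular column.
stats["Gene names"] = stats.index
```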

omicspreprocessing.core.get_replicate_number(df, column_name='Sample_name')

Add a ‘replicate’ column to the DataFrame by assigning replicate numbers within groups.

For each group defined by the unique values in the specified column_name, this function assigns sequential replicate numbers starting from 1. It adds a new column called ‘replicate’ to the returned DataFrame.

Return type:

DataFrame

Parameters:

df : pd.DataFrame

The input DataFrame containing the data to process.

column_name : str, optional

The column to group by when determining replicates. Default is ‘Sample_name’.

Returns:

: pd.DataFrame

A new DataFrame with an additional ‘replicate’ column indicating the replicate number within each group.

Raises:

ValueError

If the input is not a DataFrame or the specified column does not exist.

Notes:

  • Replicate numbers start at 1.

  • This function preserves the original DataFrame structure but returns a modified copy.
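
A minimal sketch of this behaviour using `groupby(...).cumcount()`. `add_replicate_numbers` is a hypothetical stand-in; the package may number replicates by a different mechanism.

```python
import pandas as pd


def add_replicate_numbers(df, column_name="Sample_name"):
    if not isinstance(df, pd.DataFrame) or column_name not in df.columns:
        raise ValueError(f"expected a DataFrame with a '{column_name}' column")
    out = df.copy()  # leave the caller's DataFrame untouched
    # cumcount() numbers rows within each group from 0, so add 1.
    out["replicate"] = out.groupby(column_name).cumcount() + 1
    return out


samples = pd.DataFrame({"Sample_name": ["A", "B", "A", "A", "B"]})
numbered = add_replicate_numbers(samples)
```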

omicspreprocessing.core.get_t_across_all_proteins(df, group_1_name, group_2_name, data_Set_name, Samplename2TMTchannel)

Perform t-tests across all proteins between two groups and return statistics with FDR correction.

Return type:

DataFrame

Parameters:

df : pd.DataFrame

DataFrame with proteins as index and samples as columns.

group_1_name : str

Name of the first group to compare.

group_2_name : str

Name of the second group to compare.

data_Set_name : str

Dataset identifier used in value extraction.

Samplename2TMTchannel : dict

Mapping of sample names to TMT channel lists.

Returns:

: pd.DataFrame

DataFrame with columns:

  • ‘proteins’: protein names

  • ‘p_values’: p-values from t-test

  • ‘t_statistics’: t statistics

  • ‘g1_mean’: mean of group 1

  • ‘g2_mean’: mean of group 2

  • ‘FDR_correction’: adjusted p-values (FDR)

  • ‘pP_VALUE’: transformed adjusted p-values (using _pfunclog)

  • ‘delta_mean’: difference of means (g1_mean - g2_mean)

  • ‘type’: ‘up’ if delta_mean > 0 else ‘down’

omicspreprocessing.core.impute_normal_down_shift_distribution(unimputerd_df, column_wise=True, width=0.3, downshift=1.8, seed=100)

Perform missing value imputation by replacing NaNs with values drawn from a normal distribution shifted downward relative to the observed data distribution.

The imputed distribution has its mean shifted down by downshift standard deviations of the observed data, and its standard deviation scaled by width.

Return type:

DataFrame

Parameters:

unimputerd_df : pd.DataFrame

DataFrame with missing values (NaNs) to be imputed.

column_wise : bool, optional

If True, imputation is done separately for each column using that column’s statistics. If False, global mean and std across the entire DataFrame are used. Default is True.

width : float, optional

Scale factor for the standard deviation of the imputed distribution relative to the sample std.

downshift : float, optional

Number of standard deviations by which to downshift the mean of the imputed distribution.

seed : int, optional

Random seed for reproducibility.

Returns:

: pd.DataFrame

DataFrame with imputed values replacing NaNs.

Reference:

Imputation method inspired by: https://rdrr.io/github/jdreyf/jdcbioinfo/man/impute_normal.html#google_vignette
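
The column-wise case can be sketched as below. This is an illustration only: `impute_downshifted` is a hypothetical name, and the use of `numpy.random.default_rng` for seeding is an assumption about how reproducibility might be achieved.

```python
import numpy as np
import pandas as pd


def impute_downshifted(df, width=0.3, downshift=1.8, seed=100):
    rng = np.random.default_rng(seed)
    out = df.copy()
    for col in out.columns:
        missing = out[col].isna()
        if not missing.any():
            continue
        mean, std = out[col].mean(), out[col].std()  # NaNs are skipped
        # Draw from a narrower normal distribution centred below the data.
        draws = rng.normal(
            loc=mean - downshift * std,
            scale=width * std,
            size=int(missing.sum()),
        )
        out.loc[missing, col] = draws
    return out


observed = pd.DataFrame({
    "s1": [20.0, 21.0, 22.0, np.nan, 23.0],
    "s2": [18.0, np.nan, 19.0, 20.0, np.nan],
})
filled = impute_downshifted(observed)
```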

omicspreprocessing.core.intersection(lst1, lst2)

Return the intersection of two lists as a list of unique elements present in both.

Parameters:

lst1 : list

The first list.

lst2 : list

The second list.

Returns:

: list

A list containing the unique elements common to both lst1 and lst2.
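
One way to obtain such a result is sketched below. The ordering of the returned elements is not specified above, so preserving the order of the first list is an assumption of this sketch; `ordered_intersection` is a hypothetical name.

```python
def ordered_intersection(lst1, lst2):
    lookup = set(lst2)  # constant-time membership checks
    result = []
    for item in lst1:
        # Keep each common element once, in the order it appears in lst1.
        if item in lookup and item not in result:
            result.append(item)
    return result


common = ordered_intersection(["a", "b", "b", "c"], ["c", "b", "d"])
```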

omicspreprocessing.core.log2_transform_intensities(df)

Apply log2 transformation to intensity values in the DataFrame.

Zeros are replaced with NaN before transformation to avoid -inf values.

Return type:

DataFrame

Parameters:

df : pd.DataFrame

DataFrame containing intensity values (numeric).

Returns:

: pd.DataFrame

DataFrame of the same shape with log2-transformed intensities. Zeros replaced by NaN.
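
The zero-to-NaN replacement followed by the log2 transform can be sketched in one line of pandas and NumPy; the intensity values here are made up.

```python
import numpy as np
import pandas as pd

intensities = pd.DataFrame({"s1": [1024.0, 0.0, 8.0], "s2": [2.0, 4.0, 0.0]})

# Zeros become NaN first, so log2 never produces -inf.
log2_df = np.log2(intensities.replace(0, np.nan))
```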

omicspreprocessing.core.log2fold_change_calculator(df)

Calculate the log2 fold change of intensities column-wise relative to the column mean.

For each value in the DataFrame, this function subtracts the mean of its column, resulting in log2 fold changes relative to the average intensity across all samples per column.

Return type:

DataFrame

Parameters:

df : pd.DataFrame

A DataFrame of log-transformed intensity values with samples as rows and features as columns.

Returns:

: pd.DataFrame

A DataFrame of the same shape as df, containing log2 fold changes for each entry.

omicspreprocessing.core.log2fold_change_calculator_LOO(df)

Calculate leave-one-out (LOO) log2 fold changes column-wise relative to the mean excluding the current sample.

For each sample (row), this function computes the mean of all other samples (excluding the current one) for each column, then calculates the difference between the sample’s value and this leave-one-out mean.

Return type:

DataFrame

Parameters:

df : pd.DataFrame

A DataFrame of log-transformed intensity values with samples as rows and features as columns.

Returns:

: pd.DataFrame

A DataFrame of the same shape as df, containing the LOO log2 fold changes.
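
The leave-one-out mean does not require a loop over samples: removing a value from its column's sum and count gives the mean of the remaining rows. The sketch below assumes NaNs are skipped; `loo_fold_change` is a hypothetical name.

```python
import pandas as pd


def loo_fold_change(df):
    col_sum = df.sum()      # per-column sum, NaNs skipped
    col_count = df.count()  # per-column number of non-NaN values
    # Mean of all other rows: drop the current value from sum and count.
    loo_mean = df.rsub(col_sum, axis="columns") / (col_count - 1)
    return df - loo_mean


log_values = pd.DataFrame({"P1": [1.0, 2.0, 3.0], "P2": [4.0, 4.0, 10.0]})
loo = loo_fold_change(log_values)
```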

omicspreprocessing.core.log_transform_intensities(df)

Apply log10 transformation to intensity values in the DataFrame.

Zeros are replaced with NaN before transformation to avoid -inf values.

Return type:

DataFrame

Parameters:

df : pd.DataFrame

DataFrame containing intensity values (must be numeric).

Returns:

: pd.DataFrame

DataFrame of the same shape with log10-transformed intensities. Zeros replaced by NaN.

omicspreprocessing.core.make_cv_plot(df)

Create and display a cumulative distribution plot of the coefficient of variation (CV).

Return type:

None

Parameters:

df : pd.DataFrame

DataFrame containing a ‘cv’ column representing coefficient of variation values (expected as decimal fractions, e.g., 0.05 for 5%).

Raises:

ValueError

If the ‘cv’ column is not present in the DataFrame.

omicspreprocessing.core.make_pair_combinations(items)

Generate all unique pairwise combinations from a list of items.

Parameters:

items (list) – List of elements (e.g., strings, numbers) from which to create pairs.

Returns:

A list containing all possible unique pairs, where each pair is represented as a two-element list. Order within each pair follows the original input order.

Return type:

list of list

Examples

>>> make_pair_combinations(["A", "B", "C"])
[['A', 'B'], ['A', 'C'], ['B', 'C']]

Notes

  • Uses itertools.combinations, so no repeated elements and no reversed duplicates.

  • If items has fewer than 2 elements, the result will be an empty list.

omicspreprocessing.core.median_centering(df)

Normalize samples by centering their distributions using median scaling.

Each sample (column) is multiplied by a correction factor defined as the ratio of the average median of reference channels to the sample median. This aligns sample medians around the same central value.

Return type:

DataFrame

Parameters:

df : pd.DataFrame

DataFrame of batch intensities with samples as columns and features (e.g., reporter channels) as rows. Zero values are treated as missing and ignored in median calculations.

Returns:

: pd.DataFrame

Median-centered DataFrame with the same shape as input.
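
The correction factor can be sketched as follows. Which columns count as reference channels is not stated above, so this sketch assumes every column contributes to the average median; `median_center` is a hypothetical name.

```python
import numpy as np
import pandas as pd


def median_center(df):
    # Zeros are treated as missing when computing medians.
    medians = df.replace(0, np.nan).median()
    # Factor > 1 lifts samples with a low median, < 1 lowers high ones.
    factors = medians.mean() / medians
    return df * factors


batch = pd.DataFrame({"s1": [1.0, 2.0, 3.0], "s2": [2.0, 4.0, 6.0]})
centered = median_center(batch)
```

After scaling, both sample medians in this example equal the average of the original medians (3.0).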

omicspreprocessing.core.median_centering_ms1(merged_ms1_df)

Normalize MS1 intensity samples by median centering with batch-specific correction factors.

This function computes the median intensity per sample considering only peptides detected in more than 70% of samples to avoid bias from low-abundance peptides. Each sample’s intensities are then scaled by a correction factor so that all medians align to the global mean median.

Return type:

DataFrame

Parameters:

merged_ms1_df : pd.DataFrame

DataFrame with proteins/peptides as rows and samples as columns containing intensity values.

Returns:

: pd.DataFrame

Normalized DataFrame with the same shape as input, with each sample scaled by its correction factor.

omicspreprocessing.core.one_vs_all_t_test(inputDF, protein_peptide, favoriteentity, metaDataColumn)

Perform a one-vs-all independent t-test for each protein/peptide comparing the favorite entity group against all other groups.

Return type:

DataFrame

Parameters:

inputDF : pd.DataFrame

DataFrame with patients/samples as rows and proteins/peptides as columns.

protein_peptide : List[str]

List of protein/peptide column names on which to perform the t-tests.

favoriteentity : str

The group of interest (e.g., ‘chordoma’) to compare against all others.

metaDataColumn : str

Column name in inputDF that contains group labels for each sample.

Returns:

: pd.DataFrame

DataFrame containing t-statistics, p-values, group means, sample sizes, FDR-adjusted p-values, and a direction (‘up’/‘down’) indicating whether the favoriteentity group is higher or lower than the other groups.

omicspreprocessing.core.plot_cv_per_condition(df, condition, Samplename2TMTchannel, data_Set_name)

Plot cumulative histogram of coefficient of variation (CV) percentages for a given condition using Reporter intensity corrected channels.

Return type:

None

Parameters:

df : pd.DataFrame

The TMT evidence or proteinGroup DataFrame.

condition : str

The name of the condition to analyze.

Samplename2TMTchannel : dict

Dictionary mapping condition names to lists of TMT channel names.

data_Set_name : str

The dataset name that follows “Reporter intensity corrected” in the column names. Used to select the relevant intensity columns.

Returns:

: None

Displays the cumulative histogram plot of CV percentages.

omicspreprocessing.core.post_hoc_ANOVA(ANOVA_df, protein_list, group_col='Brain region', p_adjust='fdr_bh')

Perform post-hoc pairwise t-tests after ANOVA for multiple proteins.

For each protein in protein_list, this function performs a post-hoc pairwise t-test using the specified grouping column in the ANOVA_df DataFrame, applying multiple testing correction as specified by p_adjust.

Parameters:

ANOVA_df : pd.DataFrame

DataFrame where rows are samples and columns are protein measurements.

protein_list : list

List of protein column names in ANOVA_df to perform the post-hoc tests on.

group_col : str, optional

Name of the metadata column in ANOVA_df used to group samples for testing. Default is ‘Brain region’.

p_adjust : str, optional

Method for p-value adjustment for multiple comparisons. Default is ‘fdr_bh’ (Benjamini-Hochberg false discovery rate).

Returns:

: list of dict

A list where each element is a dictionary with keys:

  • ‘protein’: the protein name.

  • ‘post_hoc_res’: the DataFrame of adjusted p-values from the post-hoc test.

Notes:

  • Requires sp.posthoc_ttest from the scikit-posthocs package.

omicspreprocessing.core.protein_remover_by_sparcity(df, minimum_samples_inside=20)

Remove proteins from the DataFrame based on sparsity threshold.

Proteins (rows) with fewer than minimum_samples_inside non-NA values across samples (columns) are removed from the DataFrame.

Return type:

DataFrame

Parameters:

df : pd.DataFrame

Input DataFrame with proteins as the index and samples as columns.

minimum_samples_inside : int, optional

Minimum number of samples (columns) in which a protein must have non-NA values to be retained. Proteins with fewer non-NA samples are removed. Default is 20.

Returns:

: pd.DataFrame

Filtered DataFrame containing only proteins meeting the sparsity criterion.
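
The filter amounts to counting non-NA values per row and keeping rows that reach the threshold. A minimal sketch, with `drop_sparse_proteins` as a hypothetical stand-in and a small threshold chosen for the toy data:

```python
import numpy as np
import pandas as pd


def drop_sparse_proteins(df, minimum_samples_inside=20):
    # count(axis=1) gives the number of non-NA samples for each protein.
    return df[df.count(axis=1) >= minimum_samples_inside]


proteins = pd.DataFrame(
    {"s1": [1.0, 1.0, 1.0], "s2": [np.nan, 2.0, 2.0], "s3": [np.nan, np.nan, 3.0]},
    index=["P1", "P2", "P3"],
)
filtered = drop_sparse_proteins(proteins, minimum_samples_inside=2)
```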

omicspreprocessing.core.raw_median_centering_normalization(df, general_median)

Perform column-wise median centering normalization on a DataFrame using an external reference median.

This function normalizes each column of the input DataFrame by scaling its values so that the column’s median matches the specified general_median. This is useful when adjusting datasets to a common scale based on a known or reference distribution.

Parameters:

df : pd.DataFrame

A pandas DataFrame containing numerical data to normalize. It is assumed that the DataFrame is indexed and contains no non-numeric columns.

general_median : float

The reference median value from another distribution that each column should be normalized to.

Returns:

: pd.DataFrame

A new DataFrame of the same shape as df with normalized values such that the median of each column is approximately equal to general_median.

Notes:

  • Columns containing NaN values will be normalized ignoring the NaNs in median computation.

  • The index and column names of the original DataFrame are preserved.

omicspreprocessing.core.seaborn_volcano(df, fc_thresh=0.25, p_thresh=0.05, xaxis=1, draw_dashed_lines=True, p_value_col='p_values', foldchange_col='delta_mean', gene_col='Gene Names', title='Volcano Plot', where_to_save=None)

Create a volcano plot using Seaborn to visualize statistical significance (p-values) versus magnitude of change (fold change) for features such as genes or proteins.

The plot highlights three categories of points:
  • “up”: Fold change above fc_thresh and p-value ≤ p_thresh

  • “down”: Fold change below -fc_thresh and p-value ≤ p_thresh

  • “ns”: Not significant

Parameters:
  • df (pandas.DataFrame) – Input DataFrame containing at least the columns specified in p_value_col, foldchange_col, and gene_col.

  • fc_thresh (float, default=0.25) – Fold change threshold for significance classification. Points whose absolute fold change exceeds this value are classified as significant, provided their p-value is ≤ p_thresh.

  • p_thresh (float, default=0.05) – P-value threshold for statistical significance.

  • xaxis (float, default=1) – X-axis range limit for plotting threshold lines.

  • draw_dashed_lines (bool, default=True) – Whether to draw dashed threshold lines for fold change and p-value cutoffs.

  • p_value_col (str, default="p_values") – Column name in df containing p-values.

  • foldchange_col (str, default="delta_mean") – Column name in df containing fold change values.

  • gene_col (str, default="Gene Names") – Column name in df containing gene/protein identifiers for labeling.

  • title (str, default="Volcano Plot") – Title for the plot.

  • where_to_save (str or None, default=None) – File path to save the plot. If None, the plot is displayed instead.

Returns:

Displays or saves the volcano plot.

Return type:

None

Notes

  • p-values are transformed to -log10(p) for the y-axis.

  • Only points meeting significance thresholds are labeled.

  • Color scheme:
    • Red: “up” (significant positive fold change)

    • Blue: “down” (significant negative fold change)

    • Grey: “ns” (not significant)
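
The classification into “up”, “down”, and “ns”, together with the -log10 transform, can be sketched as below. Column names follow the documented defaults; the data is made up, and the plotting itself is left out.

```python
import numpy as np
import pandas as pd

results = pd.DataFrame({
    "delta_mean": [0.8, -0.6, 0.1, 0.9],
    "p_values": [0.001, 0.01, 0.001, 0.2],
})
fc_thresh, p_thresh = 0.25, 0.05

significant = results["p_values"] <= p_thresh
results["type"] = np.select(
    [
        significant & (results["delta_mean"] > fc_thresh),   # "up"
        significant & (results["delta_mean"] < -fc_thresh),  # "down"
    ],
    ["up", "down"],
    default="ns",
)
# y-axis value of the volcano plot.
results["neg_log10_p"] = -np.log10(results["p_values"])
```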

omicspreprocessing.core.show_spent_time(func)

Decorator to measure and display the execution time of a function.

This decorator wraps any function and prints the time it took to execute, which can be helpful for performance monitoring or benchmarking.

Usage:

@show_spent_time
def some_function(…):

Parameters:

func : callable

The function whose execution time is to be measured.

Returns:

: callable

A wrapped version of the original function that prints the time spent during execution.
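
A timing decorator of this kind is commonly written as shown below. This is a generic sketch, not the package's code: `show_elapsed` is a hypothetical name and the printed message format is an assumption.

```python
import functools
import time


def show_elapsed(func):
    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"{func.__name__} took {elapsed:.4f} s")
        return result
    return wrapper


@show_elapsed
def add(a, b):
    return a + b


total = add(2, 3)  # prints the elapsed time, then returns the sum
```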

omicspreprocessing.core.t_test(x, y)

Perform an independent two-sample t-test between two numerical groups.

Return type:

Tuple[float, float]

Parameters:

x : tuple of floats

First group of numerical observations.

y : tuple of floats

Second group of numerical observations.

Returns:

t_stat : float

The computed t-statistic.

p_value : float

The two-tailed p-value for the test.

omicspreprocessing.core.univariate_ROC_analysis_by_CV_permutation(pre, favoriteentity, kFold=5, repeats=10, threshold=0.5, scores='scores', labels='labels')

Compute the stability percentage of ROC AUCs above a given threshold using repeated stratified k-fold cross-validation.

Return type:

float

Parameters:

pre : pd.DataFrame

DataFrame containing prediction scores and true labels. Must include columns named as specified by scores and labels.

favoriteentity : str

The label considered as the positive class.

kFold : int, optional

Number of folds for cross-validation. Default is 5.

repeats : int, optional

Number of repeated cross-validation runs. Default is 10.

threshold : float, optional

Threshold for considering AUC as stable. Default is 0.5.

scores : str, optional

Column name in pre containing the prediction scores. Default is ‘scores’.

labels : str, optional

Column name in pre containing the true labels. Default is ‘labels’.

Returns:

: float

Percentage of AUCs that are above threshold or below 1 - threshold across all CV splits.

omicspreprocessing.core.unnest_proteingroups(df)

Split multi-protein group entries in the DataFrame index (separated by ‘;’) into separate rows, duplicating associated data for each protein group.

Return type:

DataFrame

Parameters:

df : pd.DataFrame

DataFrame with protein groups as the index, where some index entries may contain multiple protein names separated by semicolons (‘;’).

Returns:

: pd.DataFrame

Expanded DataFrame where each protein group has its own row, with data duplicated accordingly. The new index will be the individual protein group names.
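
The row expansion can be sketched with `str.split` and `explode`. The temporary column name `protein` is introduced only for this illustration, and the package may implement the split differently.

```python
import pandas as pd

groups = pd.DataFrame({"x": [1.0, 2.0]}, index=["P1;P2", "P3"])

unnested = (
    groups.rename_axis("protein")
    .reset_index()
    # Turn "P1;P2" into ["P1", "P2"], then give each entry its own row.
    .assign(protein=lambda d: d["protein"].str.split(";"))
    .explode("protein")
    .set_index("protein")
)
```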

omicspreprocessing.core.volcanoplot(df, cutoff=None, save_path=None)

Create and optionally save an interactive volcano plot using Plotly.

Return type:

None

Parameters:

df : pd.DataFrame

DataFrame containing at least the following columns:

  • ‘delta_mean’: log2 fold change values (x-axis).

  • ‘pP_VALUE’: -log10 transformed p-values or Q-values (y-axis).

  • ‘type’: categorical variable for coloring points (e.g., ‘up’, ‘down’, ‘ns’).

  • ‘proteins’: names for hover labels.

cutoff : float, optional

Y-axis cutoff to add a horizontal reference line (e.g., significance threshold).

save_path : str, optional

If provided, saves the interactive plot as an HTML file to this path.

Returns:

: None

Displays the interactive plot.

omicspreprocessing.parallel_computing module

class omicspreprocessing.parallel_computing.ParallelComputing(func=None, list_to_proceed=None, num_cores=3)

Bases: object

get_result()

Get the result list after parallel processing.

Return type:

List

run_in_parallel()

Run the function in parallel over the list of items.

Return type:

None

set_func(func)

Set the function to apply in parallel.

Return type:

None

set_list(newlist)

Set the list of items to process.

Return type:

None

set_num_cores(num_cores)

Set the number of parallel processes.

Return type:

None

static split_df_to_list_by_group(df, group_name)

Split a DataFrame into a list of DataFrames grouped by a column.

Return type:

List[DataFrame]

Parameters:

df : pd.DataFrame

DataFrame to split.

group_name : str

Column name to group by.

Returns:

: List[pd.DataFrame]

List of grouped DataFrames.
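
Splitting by group can be sketched with a `groupby` comprehension. The column name and data below are made up, and the order of the returned list (sorted group keys, as pandas does by default) is an assumption.

```python
import pandas as pd

table = pd.DataFrame({
    "condition": ["ctrl", "treated", "ctrl"],
    "value": [1.0, 2.0, 3.0],
})

# One DataFrame per distinct value of the grouping column.
chunks = [group for _, group in table.groupby("condition")]
```

Each element of such a list could then be handed to a worker function as one unit of parallel work.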
