omicspreprocessing package

Submodules

omicspreprocessing.core module

omicspreprocessing.core.ROC_curve_analysis(labels, scores, curve_title, plot=True)

Compute and optionally plot the ROC curve for binary classification.

Return type: float

Parameters:

labels (np.ndarray) – Array of true binary labels (0 or 1), where 1 indicates the positive class.
scores (np.ndarray) – Array of predicted scores or probabilities for the positive class. Higher scores indicate a higher likelihood of the positive class.
curve_title (str) – Title or label for the ROC curve plot.
plot (bool, optional) – Whether to plot the ROC curve. Default is True.

Returns:

float – Area Under the Curve (AUC) of the ROC curve.
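
The AUC can be read as the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one. The sketch below computes it directly from that definition in plain Python. `pairwise_auc` is a hypothetical helper used only for illustration; it is not the package's implementation, which may rely on a library routine.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Three of the four positive/negative pairs are ordered correctly.
auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

This quadratic pairwise count is only meant to make the definition concrete; it is too slow for large inputs.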

omicspreprocessing.core.anova_test(inputDF, protein_peptide, metaDataColumn)

Perform one-way ANOVA tests for each protein/peptide across groups defined in metadata.

Return type: DataFrame

Parameters:

inputDF (pd.DataFrame) – DataFrame where rows are samples/patients and columns are proteins/peptides.
protein_peptide (List[str]) – List of protein/peptide column names to perform ANOVA on.
metaDataColumn (str) – Name of the metadata column in inputDF that contains group labels for ANOVA.

Returns:

pd.DataFrame – DataFrame with columns:
- ‘F_tests’: ANOVA F-statistics per protein/peptide
- ‘p_values’: Corresponding p-values
- ‘means_pergroup’: List of group means per protein/peptide
- ‘Majority protein IDs’: Protein/peptide names
- ‘fdr’: FDR-adjusted p-values using Benjamini-Hochberg correction
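
For readers unfamiliar with the Benjamini-Hochberg correction behind the ‘fdr’ column, here is a minimal plain-Python sketch of the standard procedure. `benjamini_hochberg` is a hypothetical name chosen for this example; the package itself may delegate the correction to a statistics library.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
```

Each raw p-value is multiplied by m / rank, and the running minimum ensures that a smaller raw p-value never ends up with a larger adjusted value.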

omicspreprocessing.core.calculate_median_z_scores(df)

Calculate column-wise median-centered z-scores for a DataFrame.

For each column, the z-score is computed by subtracting the column median and dividing by the column standard deviation, ignoring NaNs.

Parameters:

df (pd.DataFrame) – Input DataFrame with numerical values. NaNs and infinite values will be handled.

Returns:

pd.DataFrame – DataFrame of the same shape with median-centered z-scores computed column-wise.
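
The computation for a single column can be sketched in plain Python as follows. `median_z_scores` is a hypothetical helper, and the use of the sample standard deviation (ddof=1) is an assumption, since the text above does not say which estimator the package uses.

```python
import math
import statistics


def median_z_scores(values):
    """z = (x - median) / std over the non-NaN entries; NaNs stay NaN."""
    observed = [v for v in values if not math.isnan(v)]
    med = statistics.median(observed)
    std = statistics.stdev(observed)  # sample std (ddof=1): an assumption
    return [v if math.isnan(v) else (v - med) / std for v in values]


z = median_z_scores([1.0, 2.0, 3.0, float("nan"), 10.0])
```

Because the median is used as the center, the outlier 10.0 shifts the reference point less than it would with mean centering.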

omicspreprocessing.core.check_path_exist(func)

omicspreprocessing.core.compare_two_distributions(data1, data2)

Compare two empirical distributions using the two-sample Kolmogorov-Smirnov (KS) test.

This function tests whether the two input datasets are drawn from the same continuous distribution. It prints the KS test statistic, the p-value, and an interpretation based on a significance level of 0.05.

Parameters:

data1 (array-like) – The first sample distribution (e.g., list, NumPy array, or pandas Series).
data2 (array-like) – The second sample distribution.

Returns:

None – This function prints the KS statistic, p-value, and an interpretation of the test result.

Notes:

Null hypothesis (H0): The two distributions are identical.
A p-value > 0.05 suggests that the two distributions are similar (fail to reject H0).
A p-value ≤ 0.05 indicates that the two distributions are statistically different (reject H0).
This test is non-parametric and sensitive to differences in both location and shape of the distributions.

omicspreprocessing.core.custom_imputation(input)

Custom imputation for triplicate samples in proteomics data.

Rules:
- If only one replicate out of three is expressed (non-NaN/non-zero), set all three replicates to zero.
- If exactly two replicates are expressed and one is missing, impute the missing value with the median of the two expressed replicates.

Assumptions:
- Input DataFrame has samples as the index, with replicate identifiers ‘_1’, ‘_2’, ‘_3’ suffixing sample names.
- Columns correspond to proteins.

Return type: DataFrame

Parameters:

input (pd.DataFrame) – DataFrame with samples as index and proteins as columns. Sample names must include ‘_1’, ‘_2’, or ‘_3’ to identify replicates.

Returns:

pd.DataFrame – DataFrame with the same shape as input but with missing replicate values imputed or zeroed as per the rules.
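
The two rules can be illustrated on a single protein's three replicate values. This is a plain-Python sketch under stated assumptions: `impute_triplicate` is a hypothetical helper, zeros are treated as missing in the same way as NaN, and triplets with zero or three expressed replicates are returned unchanged.

```python
import math
import statistics


def is_expressed(value):
    """A replicate counts as expressed when it is neither NaN nor zero."""
    return not math.isnan(value) and value != 0


def impute_triplicate(values):
    """Apply the two replicate rules to one protein's three values."""
    expressed = [v for v in values if is_expressed(v)]
    if len(expressed) == 1:
        # Rule 1: a single expressed replicate is not trusted.
        return [0.0, 0.0, 0.0]
    if len(expressed) == 2:
        # Rule 2: fill the gap with the median of the two observed values.
        fill = statistics.median(expressed)
        return [v if is_expressed(v) else fill for v in values]
    return list(values)  # assumption: other cases are left untouched


nan = float("nan")
one_expressed = impute_triplicate([5.0, nan, nan])
two_expressed = impute_triplicate([4.0, nan, 6.0])
```

With only two values, the median equals their arithmetic mean, so the second call fills the gap with 5.0.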

omicspreprocessing.core.do_normalize_with_target_df(z_scored_value, average_target, std_target)

Convert a z-scored value to the scale of a target distribution.

This function takes a standardized (z-scored) value and transforms it to a value in a target distribution with a specified mean and standard deviation. It performs the inverse z-score transformation:

normalized_value = average_target + z_scored_value * std_target

Parameters:

z_scored_value (float) – The z-scored (standardized) value to be mapped to the target distribution.
average_target (float) – The mean (μ) of the target distribution.
std_target (float) – The standard deviation of the target distribution.

Returns:

float – The normalized value on the scale of the target distribution.

Example:

>>> do_normalize_with_target_df(1.5, 100, 15)
122.5

omicspreprocessing.core.do_shapiro_test(df, column_to_do_test)

Perform the Shapiro-Wilk test for normality on a specific column of a DataFrame.

This function applies the Shapiro-Wilk test to determine whether the values in the specified column are normally distributed. It is suitable for small sample sizes (n < 5000) and prints both the test statistic and the interpretation based on a significance level of 0.05.

Parameters:

df (pd.DataFrame) – The DataFrame containing the data to test.
column_to_do_test (str) – The name of the column in df on which to perform the Shapiro-Wilk test.

Returns:

None – This function prints the results to the console but does not return anything.

Notes:

The null hypothesis (H0) of the Shapiro-Wilk test is that the data is normally distributed.
A p-value greater than 0.05 suggests normality (fail to reject H0).
A p-value less than or equal to 0.05 suggests deviation from normality (reject H0).

omicspreprocessing.core.do_smirnov_test(df, column_to_do_test)

Perform the Kolmogorov-Smirnov test for normality on a specific column of a DataFrame.

This function applies the one-sample Kolmogorov-Smirnov (K-S) test to compare the empirical distribution of the specified column with a standard normal distribution.

Parameters:

df (pd.DataFrame) – The DataFrame containing the data to test.
column_to_do_test (str) – The name of the column in df to test for normality.

Returns:

None – The function prints the test statistic, p-value, and an interpretation based on a significance level of 0.05.

Notes:

The null hypothesis (H0) is that the data comes from a standard normal distribution.
A p-value > 0.05 suggests the data may be normally distributed (fail to reject H0).
A p-value ≤ 0.05 indicates the data does not follow a normal distribution (reject H0).
This test assumes the data is already standardized; otherwise, results may be misleading.

omicspreprocessing.core.get_cv_from_melted_df(melted_df, protein_col='Proteins_names', value_col='value')

Calculate the coefficient of variation (CV) for each protein from a long-format DataFrame.

The function groups the data by protein names, computes the standard deviation and mean of the specified values, and then calculates the CV as std/mean.

Return type: DataFrame

Parameters:

melted_df (pd.DataFrame) – A long-format DataFrame containing protein names and corresponding values.
protein_col (str, optional) – The column name containing protein identifiers. Default is ‘Proteins_names’.
value_col (str, optional) – The column name containing numeric values for which the CV is calculated. Default is ‘value’.

Returns:

pd.DataFrame – A DataFrame with columns: ‘std’, ‘mean’, ‘cv’, and ‘Gene names’ (protein identifiers). The ‘Gene names’ column is copied from the index.
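
The group-then-divide logic can be sketched without pandas as follows. `cv_per_protein` is a hypothetical helper operating on (protein, value) pairs, and the sample standard deviation (ddof=1) is an assumption about the estimator.

```python
import statistics
from collections import defaultdict


def cv_per_protein(rows):
    """rows: iterable of (protein, value) pairs in long format."""
    groups = defaultdict(list)
    for protein, value in rows:
        groups[protein].append(value)
    result = {}
    for protein, values in groups.items():
        std = statistics.stdev(values)  # sample std (ddof=1): an assumption
        mean = statistics.mean(values)
        result[protein] = {"std": std, "mean": mean, "cv": std / mean}
    return result


cv = cv_per_protein(
    [("P1", 9.0), ("P1", 10.0), ("P1", 11.0), ("P2", 4.0), ("P2", 6.0)]
)
```

Here P1 has mean 10 and standard deviation 1, giving a CV of 0.1 (10%).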

omicspreprocessing.core.get_replicate_number(df, column_name='Sample_name')

Add a ‘replicate’ column to the DataFrame by assigning replicate numbers within groups.

For each group defined by the unique values in the specified column_name, this function assigns sequential replicate numbers starting from 1. It adds a new column called ‘replicate’ to the returned DataFrame.

Return type: DataFrame

Parameters:

df (pd.DataFrame) – The input DataFrame containing the data to process.
column_name (str, optional) – The column to group by when determining replicates. Default is ‘Sample_name’.

Returns:

pd.DataFrame – A new DataFrame with an additional ‘replicate’ column indicating the replicate number within each group.

Raises:

ValueError: If the input is not a DataFrame or the specified column does not exist.

Notes:

Replicate numbers start at 1.
This function preserves the original DataFrame structure but returns a modified copy.
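
The numbering scheme amounts to a running count per group value, taken in row order. A minimal plain-Python sketch, with `replicate_numbers` as a hypothetical helper and the row-order numbering as an assumption:

```python
from collections import Counter


def replicate_numbers(sample_names):
    """Number each occurrence of a sample name sequentially, starting at 1."""
    seen = Counter()
    numbers = []
    for name in sample_names:
        seen[name] += 1
        numbers.append(seen[name])
    return numbers


reps = replicate_numbers(["A", "B", "A", "A", "B"])
```

The result lines up with the input row by row, so it could be attached to a table as the ‘replicate’ column.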

omicspreprocessing.core.get_t_across_all_proteins(df, group_1_name, group_2_name, data_Set_name, Samplename2TMTchannel)

Perform t-tests across all proteins between two groups and return statistics with FDR correction.

Return type: DataFrame

Parameters:

df (pd.DataFrame) – DataFrame with proteins as index and samples as columns.
group_1_name (str) – Name of the first group to compare.
group_2_name (str) – Name of the second group to compare.
data_Set_name (str) – Dataset identifier used in value extraction.
Samplename2TMTchannel (dict) – Mapping of sample names to TMT channel lists.

Returns:

pd.DataFrame – DataFrame with columns:
- ‘proteins’: protein names
- ‘p_values’: p-values from t-test
- ‘t_statistics’: t statistics
- ‘g1_mean’: mean of group 1
- ‘g2_mean’: mean of group 2
- ‘FDR_correction’: adjusted p-values (FDR)
- ‘pP_VALUE’: transformed adjusted p-values (using _pfunclog)
- ‘delta_mean’: difference of means (g1_mean - g2_mean)
- ‘type’: ‘up’ if delta_mean > 0 else ‘down’

omicspreprocessing.core.impute_normal_down_shift_distribution(unimputerd_df, column_wise=True, width=0.3, downshift=1.8, seed=100)

Perform missing value imputation by replacing NaNs with values drawn from a normal distribution shifted downward relative to the observed data distribution.

The imputed distribution has its mean shifted down by downshift standard deviations of the observed data, and its standard deviation is the observed standard deviation scaled by width.

Return type: DataFrame

Parameters:

unimputerd_df (pd.DataFrame) – DataFrame with missing values (NaNs) to be imputed.
column_wise (bool, optional) – If True, imputation is done separately for each column using that column’s statistics. If False, global mean and std across the entire DataFrame are used. Default is True.
width (float, optional) – Scale factor for the standard deviation of the imputed distribution relative to the sample std. Default is 0.3.
downshift (float, optional) – Number of standard deviations by which to downshift the mean of the imputed distribution. Default is 1.8.
seed (int, optional) – Random seed for reproducibility. Default is 100.

Returns:

pd.DataFrame – DataFrame with imputed values replacing NaNs.

Reference:

Imputation method inspired by: https://rdrr.io/github/jdreyf/jdcbioinfo/man/impute_normal.html#google_vignette
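
For a single column, the idea can be sketched in plain Python as follows. `impute_downshifted` is a hypothetical helper; the use of the sample standard deviation and of Python's `random` module are assumptions made for this example, so the drawn values will not match the package's output for the same seed.

```python
import math
import random
import statistics


def impute_downshifted(values, width=0.3, downshift=1.8, seed=100):
    """Replace NaNs with draws from N(mean - downshift*std, (width*std)^2)."""
    observed = [v for v in values if not math.isnan(v)]
    mean = statistics.mean(observed)
    std = statistics.stdev(observed)  # sample std: an assumption
    rng = random.Random(seed)  # local generator keeps the result reproducible
    return [
        rng.gauss(mean - downshift * std, width * std) if math.isnan(v) else v
        for v in values
    ]


nan = float("nan")
filled = impute_downshifted([10.0, 12.0, nan, 14.0])
```

Observed values pass through untouched; only the NaN position receives a draw centered 1.8 standard deviations below the observed mean.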

omicspreprocessing.core.intersection(lst1, lst2)

Return the intersection of two lists as a list of unique elements present in both.

Parameters:

lst1 (list) – The first list.
lst2 (list) – The second list.

Returns:

list – A list containing the unique elements common to both lst1 and lst2.

omicspreprocessing.core.log2_transform_intensities(df)

Apply log2 transformation to intensity values in the DataFrame.

Zeros are replaced with NaN before transformation to avoid -inf values.

Return type: DataFrame

Parameters:

df (pd.DataFrame) – DataFrame containing intensity values (numeric).

Returns:

pd.DataFrame – DataFrame of the same shape with log2-transformed intensities. Zeros are replaced by NaN.
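
The zero-handling rule is easiest to see on scalar values. The following plain-Python sketch uses a hypothetical helper, `log2_or_nan`, and is not the package's code:

```python
import math


def log2_or_nan(value):
    """Return log2(value), mapping zeros (and NaNs) to NaN instead of -inf."""
    if value == 0 or math.isnan(value):
        return float("nan")
    return math.log2(value)


logged = [log2_or_nan(v) for v in [8.0, 0.0, 1.0]]
```

Treating zeros as missing keeps later statistics, such as means and medians, from being dragged toward negative infinity.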

omicspreprocessing.core.log2fold_change_calculator(df)

Calculate the log2 fold change of intensities column-wise relative to the column mean.

For each value in the DataFrame, this function subtracts the mean of its column, resulting in log2 fold changes relative to the average intensity across all samples per column.

Return type: DataFrame

Parameters:

df (pd.DataFrame) – A DataFrame of log-transformed intensity values with samples as columns.

Returns:

pd.DataFrame – A DataFrame of the same shape as df, containing log2 fold changes for each entry.

omicspreprocessing.core.log2fold_change_calculator_LOO(df)

Calculate leave-one-out (LOO) log2 fold changes column-wise relative to the mean excluding the current sample.

For each sample (row), this function computes the mean of all other samples (excluding the current one) for each column, then calculates the difference between the sample’s value and this leave-one-out mean.

Return type: DataFrame

Parameters:

df (pd.DataFrame) – A DataFrame of log-transformed intensity values with samples as rows and features as columns.

Returns:

pd.DataFrame – A DataFrame of the same shape as df, containing the LOO log2 fold changes.
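
For one column, the leave-one-out mean does not have to be recomputed for each sample: it equals (column sum minus the sample's value) divided by (n - 1). A plain-Python sketch with a hypothetical helper, `loo_fold_changes`, assuming a column without missing values:

```python
def loo_fold_changes(column):
    """x_i minus the mean of all other values in the column."""
    n = len(column)
    if n < 2:
        raise ValueError("need at least two samples for a leave-one-out mean")
    total = sum(column)
    return [x - (total - x) / (n - 1) for x in column]


loo = loo_fold_changes([1.0, 2.0, 6.0])
```

For the value 6.0 the other two samples average 1.5, so its leave-one-out fold change is 4.5, larger than the 3.0 obtained against the full-column mean.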

omicspreprocessing.core.log_transform_intensities(df)

Apply log10 transformation to intensity values in the DataFrame.

Zeros are replaced with NaN before transformation to avoid -inf values.

Return type: DataFrame

Parameters:

df (pd.DataFrame) – DataFrame containing intensity values (must be numeric).

Returns:

pd.DataFrame – DataFrame of the same shape with log10-transformed intensities. Zeros are replaced by NaN.

omicspreprocessing.core.make_cv_plot(df)

Create and display a cumulative distribution plot of the coefficient of variation (CV).

Return type: None

Parameters:

df (pd.DataFrame) – DataFrame containing a ‘cv’ column representing coefficient of variation values (expected as decimal fractions, e.g., 0.05 for 5%).

Raises:

ValueError: If the ‘cv’ column is not present in the DataFrame.

omicspreprocessing.core.make_pair_combinations(items)

Generate all unique pairwise combinations from a list of items.

Parameters: items (list) – List of elements (e.g., strings, numbers) from which to create pairs.
Returns: A list containing all possible unique pairs, where each pair is represented as a two-element list. Order within each pair follows the original input order.
Return type: list of list

Examples

>>> make_pair_combinations(["A", "B", "C"])
[['A', 'B'], ['A', 'C'], ['B', 'C']]

Notes

Uses itertools.combinations, so no repeated elements and no reversed duplicates.
If items has fewer than 2 elements, the result will be an empty list.

omicspreprocessing.core.median_centering(df)

Normalize samples by centering their distributions using median scaling.

Each sample (column) is multiplied by a correction factor defined as the ratio of the average median of reference channels to the sample median. This aligns sample medians around the same central value.

Return type: DataFrame

Parameters:

df (pd.DataFrame) – DataFrame of batch intensities with samples as columns and features (e.g., reporter channels) as rows. Zero values are treated as missing and ignored in median calculations.

Returns:

pd.DataFrame – Median-centered DataFrame with the same shape as input.
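
The correction-factor arithmetic can be sketched in plain Python as follows. `median_center` is a hypothetical helper, and using every sample as a reference channel when averaging the medians is an assumption made for this example.

```python
import statistics


def median_center(samples):
    """samples: dict mapping sample name -> list of intensities."""
    # Zeros are treated as missing when computing each sample median.
    medians = {
        name: statistics.median([v for v in values if v != 0])
        for name, values in samples.items()
    }
    target = statistics.mean(list(medians.values()))
    # Each sample is multiplied by target / its own median.
    return {
        name: [v * target / medians[name] for v in values]
        for name, values in samples.items()
    }


centered = median_center({"s1": [1.0, 2.0, 3.0], "s2": [2.0, 4.0, 6.0, 0.0]})
```

The sample medians are 2 and 4, so the target is 3; after scaling by 1.5 and 0.75 respectively, both samples have a median of 3.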

omicspreprocessing.core.median_centering_ms1(merged_ms1_df)

Normalize MS1 intensity samples by median centering with batch-specific correction factors.

This function computes the median intensity per sample considering only peptides detected in more than 70% of samples to avoid bias from low-abundance peptides. Each sample’s intensities are then scaled by a correction factor so that all medians align to the global mean median.

Return type: DataFrame

Parameters:

merged_ms1_df (pd.DataFrame) – DataFrame with proteins/peptides as rows and samples as columns containing intensity values.

Returns:

pd.DataFrame – Normalized DataFrame with the same shape as input, with each sample scaled by its correction factor.

omicspreprocessing.core.one_vs_all_t_test(inputDF, protein_peptide, favoriteentity, metaDataColumn)

Perform a one-vs-all independent t-test for each protein/peptide comparing the favorite entity group against all other groups.

Return type: DataFrame

Parameters:

inputDF (pd.DataFrame) – DataFrame with patients/samples as rows and proteins/peptides as columns.
protein_peptide (List[str]) – List of protein/peptide column names on which to perform the t-tests.
favoriteentity (str) – The group of interest (e.g., ‘chordoma’) to compare against all others.
metaDataColumn (str) – Column name in inputDF that contains group labels for each sample.

Returns:

pd.DataFrame – DataFrame containing t-statistics, p-values, group means, sample sizes, FDR-adjusted p-values, and direction (‘up’/‘down’) indicating whether the favoriteentity group is higher or lower than the other groups.

omicspreprocessing.core.plot_cv_per_condition(df, condition, Samplename2TMTchannel, data_Set_name)

Plot cumulative histogram of coefficient of variation (CV) percentages for a given condition using Reporter intensity corrected channels.

Return type: None

Parameters:

df (pd.DataFrame) – The TMT evidence or proteinGroup DataFrame.
condition (str) – The name of the condition to analyze.
Samplename2TMTchannel (dict) – Dictionary mapping condition names to lists of TMT channel names.
data_Set_name (str) – The dataset name repeated after “Reporter intensity corrected” columns. Used to filter relevant intensity columns.

Returns:

None – Displays the cumulative histogram plot of CV percentages.

omicspreprocessing.core.post_hoc_ANOVA(ANOVA_df, protein_list, group_col='Brain region', p_adjust='fdr_bh')

Perform post-hoc pairwise t-tests after ANOVA for multiple proteins.

For each protein in protein_list, this function performs a post-hoc pairwise t-test using the specified grouping column in the ANOVA_df DataFrame, applying multiple testing correction as specified by p_adjust.

Parameters:

ANOVA_df (pd.DataFrame) – DataFrame where rows are samples and columns are protein measurements.
protein_list (list) – List of protein column names in ANOVA_df to perform the post-hoc tests on.
group_col (str, optional) – Name of the metadata column in ANOVA_df used to group samples for testing. Default is ‘Brain region’.
p_adjust (str, optional) – Method for p-value adjustment for multiple comparisons. Default is ‘fdr_bh’ (Benjamini-Hochberg false discovery rate).

Returns:

list of dict – A list where each element is a dictionary with keys:
- ‘protein’: the protein name
- ‘post_hoc_res’: the DataFrame of adjusted p-values from the post-hoc test

Notes:

Requires sp.posthoc_ttest from the scikit-posthocs package.

omicspreprocessing.core.protein_remover_by_sparcity(df, minimum_samples_inside=20)

Remove proteins from the DataFrame based on sparsity threshold.

Proteins (rows) with fewer than minimum_samples_inside non-NA values across samples (columns) are removed from the DataFrame.

Return type: DataFrame

Parameters:

df (pd.DataFrame) – Input DataFrame with proteins as the index and samples as columns.
minimum_samples_inside (int, optional) – Minimum number of samples (columns) in which a protein must have non-NA values to be retained. Proteins with fewer non-NA samples are removed. Default is 20.

Returns:

pd.DataFrame – Filtered DataFrame containing only proteins meeting the sparsity criterion.
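
The filtering rule can be sketched in plain Python on a dictionary of rows. `keep_dense_proteins` is a hypothetical helper written for this example, with NaN standing in for NA:

```python
import math


def keep_dense_proteins(table, minimum_samples_inside=20):
    """table: dict mapping protein -> list of per-sample values (NaN = missing)."""
    return {
        protein: values
        for protein, values in table.items()
        if sum(not math.isnan(v) for v in values) >= minimum_samples_inside
    }


nan = float("nan")
kept = keep_dense_proteins(
    {"P1": [1.0, 2.0, nan], "P2": [nan, nan, 3.0]},
    minimum_samples_inside=2,
)
```

P1 has two observed values and is retained; P2 has only one and is dropped.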

omicspreprocessing.core.raw_median_centering_normalization(df, general_median)

Perform column-wise median centering normalization on a DataFrame using an external reference median.

This function normalizes each column of the input DataFrame by scaling its values so that the column’s median matches the specified general_median. This is useful when adjusting datasets to a common scale based on a known or reference distribution.

Parameters:

df (pd.DataFrame) – A pandas DataFrame containing numerical data to normalize. It is assumed that the DataFrame is indexed and contains no non-numeric columns.
general_median (float) – The reference median value from another distribution that each column should be normalized to.

Returns:

pd.DataFrame – A new DataFrame of the same shape as df with normalized values such that the median of each column is approximately equal to general_median.

Notes:

Columns containing NaN values will be normalized ignoring the NaNs in median computation.
The index and column names of the original DataFrame are preserved.

omicspreprocessing.core.seaborn_volcano(df, fc_thresh=0.25, p_thresh=0.05, xaxis=1, draw_dashed_lines=True, p_value_col='p_values', foldchange_col='delta_mean', gene_col='Gene Names', title='Volcano Plot', where_to_save=None)

Create a volcano plot using Seaborn to visualize statistical significance (p-values) versus magnitude of change (fold change) for features such as genes or proteins.

The plot highlights three categories of points:

“up”: Fold change above fc_thresh and p-value ≤ p_thresh
“down”: Fold change below -fc_thresh and p-value ≤ p_thresh
“ns”: Not significant

Parameters:

df (pandas.DataFrame) – Input DataFrame containing at least the columns specified in p_value_col, foldchange_col, and gene_col.
fc_thresh (float, default=0.25) – Fold change threshold for significance classification. Values greater than this (in absolute value) are considered significant if p-value passes.
p_thresh (float, default=0.05) – P-value threshold for statistical significance.
xaxis (float, default=1) – X-axis range limit for plotting threshold lines.
draw_dashed_lines (bool, default=True) – Whether to draw dashed threshold lines for fold change and p-value cutoffs.
p_value_col (str, default="p_values") – Column name in df containing p-values.
foldchange_col (str, default="delta_mean") – Column name in df containing fold change values.
gene_col (str, default="Gene Names") – Column name in df containing gene/protein identifiers for labeling.
title (str, default="Volcano Plot") – Title for the plot.
where_to_save (str or None, default=None) – File path to save the plot. If None, the plot is displayed instead.

Returns:

Displays or saves the volcano plot.

Return type:

None

Notes

p-values are transformed to -log10(p) for the y-axis.
Only points meeting significance thresholds are labeled.
Color scheme:
- Red: “up” (significant positive fold change)
- Blue: “down” (significant negative fold change)
- Grey: “ns” (not significant)
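
The classification rule and the y-axis transform can be sketched for a single point as follows. `classify_point` is a hypothetical helper, not part of the package; it mirrors the thresholds described above.

```python
import math


def classify_point(fold_change, p_value, fc_thresh=0.25, p_thresh=0.05):
    """Return (category, y), where y = -log10(p) is the y-coordinate."""
    significant = p_value <= p_thresh
    if significant and fold_change > fc_thresh:
        category = "up"
    elif significant and fold_change < -fc_thresh:
        category = "down"
    else:
        category = "ns"
    return category, -math.log10(p_value)


points = [
    classify_point(0.8, 0.001),   # large positive change, significant
    classify_point(-0.5, 0.04),   # large negative change, significant
    classify_point(0.1, 0.001),   # significant but change too small
    classify_point(0.9, 0.3),     # large change but not significant
]
```

A point must pass both the fold-change and the p-value threshold to be coloured; failing either one leaves it in the “ns” category.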

omicspreprocessing.core.show_spent_time(func)

Decorator to measure and display the execution time of a function.

This decorator wraps any function and prints the time it took to execute, which can be helpful for performance monitoring or benchmarking.

Usage:

@show_spent_time
def some_function(…):
    …

Parameters:

func (callable) – The function whose execution time is to be measured.

Returns:

callable – A wrapped version of the original function that prints the time spent during execution.
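
A timing decorator of this kind is commonly written along the following lines. This is a generic sketch, not the package's source: `timed` is a hypothetical name, and the printed message format is an assumption.

```python
import functools
import time


def timed(func):
    """Print how long each call to func takes, then return its result."""
    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"{func.__name__} took {elapsed:.4f} s")
        return result
    return wrapper


@timed
def add(a, b):
    return a + b


total = add(2, 3)
```

The wrapper passes the return value through unchanged, so decorating a function does not alter what its callers receive.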

omicspreprocessing.core.t_test(x, y)

Perform an independent two-sample t-test between two numerical groups.

Return type: Tuple[float, float]

Parameters:

x (tuple of floats) – First group of numerical observations.
y (tuple of floats) – Second group of numerical observations.

Returns:

t_stat (float) – The computed t-statistic.
p_value (float) – The two-tailed p-value for the test.

omicspreprocessing.core.univariate_ROC_analysis_by_CV_permutation(pre, favoriteentity, kFold=5, repeats=10, threshold=0.5, scores='scores', labels='labels')

Compute the stability percentage of ROC AUCs above a given threshold using repeated stratified k-fold cross-validation.

Return type: float

Parameters:

pre (pd.DataFrame) – DataFrame containing prediction scores and true labels. Must include columns named as specified by scores and labels.
favoriteentity (str) – The label considered as the positive class.
kFold (int, optional) – Number of folds for cross-validation. Default is 5.
repeats (int, optional) – Number of repeated cross-validation runs. Default is 10.
threshold (float, optional) – Threshold for considering AUC as stable. Default is 0.5.
scores (str, optional) – Column name in pre containing the prediction scores. Default is ‘scores’.
labels (str, optional) – Column name in pre containing the true labels. Default is ‘labels’.

Returns:

float – Percentage of AUCs that are above threshold or below 1 - threshold across all CV splits.

omicspreprocessing.core.unnest_proteingroups(df)

Split multi-protein group entries in the DataFrame index (separated by ‘;’) into separate rows, duplicating associated data for each protein group.

Return type: DataFrame

Parameters:

df (pd.DataFrame) – DataFrame with protein groups as the index, where some index entries may contain multiple protein names separated by semicolons (‘;’).

Returns:

pd.DataFrame – Expanded DataFrame where each protein group has its own row, with data duplicated accordingly. The new index will be the individual protein group names.
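
The split-and-duplicate step can be sketched in plain Python on a dictionary keyed by protein group. `unnest` is a hypothetical helper written for this example; how the package handles a protein that appears in more than one group is not stated, so here a later entry simply overwrites an earlier one.

```python
def unnest(table):
    """table: dict mapping 'P1;P2' style keys -> list of row values."""
    expanded = {}
    for group, row in table.items():
        for protein in group.split(";"):
            # Each member of the group receives its own copy of the row.
            expanded[protein] = list(row)
    return expanded


rows = unnest({"P1;P2": [1.0, 2.0], "P3": [3.0, 4.0]})
```

The two-member group yields two rows with identical data, while the single-protein entry passes through unchanged.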

omicspreprocessing.core.volcanoplot(df, cutoff=None, save_path=None)

Create and optionally save an interactive volcano plot using Plotly.

Return type: None

Parameters:

df (pd.DataFrame) – DataFrame containing at least the following columns:
- ‘delta_mean’: log2 fold change values (x-axis)
- ‘pP_VALUE’: -log10 transformed p-values or Q-values (y-axis)
- ‘type’: categorical variable for coloring points (e.g., ‘up’, ‘down’, ‘ns’)
- ‘proteins’: names for hover labels
cutoff (float, optional) – Y-axis cutoff to add a horizontal reference line (e.g., significance threshold).
save_path (str, optional) – If provided, saves the interactive plot as an HTML file to this path.

Returns:

None – Displays the interactive plot.

omicspreprocessing.parallel_computing module

class omicspreprocessing.parallel_computing.ParallelComputing(func=None, list_to_proceed=None, num_cores=3)

Bases: object

get_result()

Get the result list after parallel processing.

Return type: List

run_in_parallel()

Run the function in parallel over the list of items.

Return type: None

set_func(func)

Set the function to apply in parallel.

Return type: None

set_list(newlist)

Set the list of items to process.

Return type: None

set_num_cores(num_cores)

Set the number of parallel processes.

Return type: None

static split_df_to_list_by_group(df, group_name)

Split a DataFrame into a list of DataFrames grouped by a column.

Return type: List[DataFrame]

Parameters:

df (pd.DataFrame) – DataFrame to split.
group_name (str) – Column name to group by.

Returns:

List[pd.DataFrame] – List of grouped DataFrames.

Module contents