XeroGraph is a Python package developed for researchers and data scientists to analyze and visualize missing data in datasets. It incorporates Little's MCAR test, among other statistical tools, to help users understand the mechanisms behind missing data. This package is particularly optimized for small to medium-sized datasets and offers extensive visualization options to elucidate data characteristics and integrity.
- Little's MCAR Test: Determines if the missing data in a dataset is missing completely at random.
- Statistical Tests: Perform normality checks and Kolmogorov-Smirnov tests to evaluate the distribution of data.
- Advanced Visualization: Generate histograms, density plots, box plots, Q-Q plots, and more to visualize data distributions and missing data patterns.
- Missing Data Analysis: Tools to visualize and quantify the extent and patterns of missing data within your dataset.
- Missing Value Imputation: Several options to perform missing value imputation.
- Compare Missing Value Imputation Methods: Tools to compare different imputation methods.
- Compare Distribution of Imputed Data: Tools to compare distribution of imputed data with original data.
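As a rough illustration of what quantifying the extent and patterns of missing data can look like, here is a minimal sketch in plain pandas. It does not use XeroGraph and is not its implementation; the toy frame `df` and its column names are made up for the example.

```python
import pandas as pd

# Toy data; the column names are illustrative only.
df = pd.DataFrame({
    "a": [1.0, None, 3.0, 4.0],
    "b": [1.0, 2.0, None, 4.0],
    "c": [1.0, None, 3.0, 4.0],
})

# Extent: percentage of missing values per column.
missing_pct = df.isna().mean() * 100

# Patterns: how many rows share each combination of missing columns.
pattern_counts = df.isna().value_counts()

print(missing_pct)
print(pattern_counts)
```

Each row of `pattern_counts` is one missingness pattern (True marks a missing column), so columns that tend to be missing together show up as a shared pattern.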
Ensure you have Python 3.9 or later installed. XeroGraph depends on the following Python libraries:
- pandas
- numpy
- matplotlib
- statsmodels
- scikit-learn
- xgboost
- seaborn
- torch
- nimfa
- optuna
- tqdm
- ipywidgets
These dependencies will be automatically installed during XeroGraph's installation process.
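If you want to confirm which of these dependencies are present in the active environment, the following sketch uses only the standard library. The helper name `installed_versions` is hypothetical and not part of XeroGraph.

```python
from importlib import metadata

# Distribution names as listed above (PyPI names, not import names).
REQUIRED = [
    "pandas", "numpy", "matplotlib", "statsmodels", "scikit-learn",
    "xgboost", "seaborn", "torch", "nimfa", "optuna", "tqdm", "ipywidgets",
]

def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

report = installed_versions(REQUIRED)
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```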
It is recommended to install XeroGraph within a virtual environment to manage dependencies effectively:
# Create the virtual environment
python -m venv xeroenv
# Activate it on macOS/Linux
source xeroenv/bin/activate
# Activate it on Windows
xeroenv\Scripts\activate
pip install XeroGraph
Alternatively, if you have access to the source code, navigate to the root directory of the source code and run:
pip install .
Here's a quick example to get you started with Little's MCAR test, data visualization, and imputation. We use the XeroAnalyzer application provided in XeroGraph.
# XeroAnalyzer can be imported as XA, xa, xeroanalyzer, xero_analyzer or XeroAnalyzer
from XeroGraph import xa
import pandas as pd
data = pd.DataFrame({
'feature1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 3, 4, None, 6, 4, 5, 1, 2, 3, 4, 5, 6, 7, 8, 9, 3, 4, 1, 6, 4, 5],
'feature2': [4, 6, 2, 4, 5, 6, 7, 8, 9, 2, 4, 3, 2, 2, 6, 4, 6, 2, 4, 5, 6, 7, 8, 9, 2, 4, 3, 2, 2, 6],
'feature3': [1, 2, 4, 3, 6, 2, 6, 6, None, 1, 5, 0, 3, 2, 1, 1, 2, 4, 3, None, 2, 6, 6, 1, 1, 5, 0, 3, 2, 1],
'feature4': [4, 3, 1, 2, 4, 5, 6, 7, 8, 9, 2, None, 3, 2, 1, 4, 3, 1, 2, 4, 5, 6, 7, 8, 9, 2, 1, 3, 2, 1],
'feature5': [4, 3, 4, 2, None, 6, 2, 4, 5, 6, 7, 8, 9, 2, 4, 4, 3, 4, 2, 1, 6, 2, 4, 5, None, 7, 8, 9, 2, 4]
})
print(data.shape)
# Optional arguments:
# To save plots: save_files=True, save_path='save path'
xg_test = xa(data, save_files=False, save_path="")
xg_test.normality()
xg_test.ks()
xg_test.histograms()
xg_test.density_plots()
xg_test.box_plots()
xg_test.qq_plots()
xg_test.missing_data()
xg_test.missing_percentage()
mcar_result = xg_test.mcar()
print(f"MCAR Test Result: {mcar_result}")
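To build some intuition for what "missing completely at random" means, the sketch below runs an informal group comparison on synthetic data. It is not Little's test and not what `mcar()` does internally; the frames `mcar` and `not_mcar` and the helper `mean_gap` are hypothetical names invented for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = rng.normal(size=n)

# Case 1: delete y completely at random (about 20% of rows).
mcar = pd.DataFrame({"x": x, "y": y})
mcar.loc[rng.random(n) < 0.2, "y"] = np.nan

# Case 2: delete y only where x is large, so missingness depends on x.
not_mcar = pd.DataFrame({"x": x, "y": y})
not_mcar.loc[not_mcar["x"] > 0.8, "y"] = np.nan

def mean_gap(df, observed_col, missing_col):
    """Mean of observed_col where missing_col is missing, minus its mean where present."""
    is_missing = df[missing_col].isna()
    return df.loc[is_missing, observed_col].mean() - df.loc[~is_missing, observed_col].mean()

print(mean_gap(mcar, "x", "y"))
print(mean_gap(not_mcar, "x", "y"))
```

When data are MCAR, the observed variable should look similar in both groups, so the gap stays close to zero. A large gap suggests the missingness depends on observed values. A formal test such as Little's evaluates this across all variables and missingness patterns at once.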
Some of the following tools can also be used to impute categorical data, but we mainly focus on continuous data.
imp_data_mean = xg_test.mean_imputation()
imp_data_median = xg_test.median_imputation()
imp_data_most_frequent = xg_test.most_frequent_imputation()
imp_data_knn = xg_test.knn_imputation()
imp_data_ii = xg_test.iterative_imputation(plot_convergence=False) # Optional: plot_convergence=True
imp_data_rf = xg_test.random_forest_imputation()
imp_data_lc = xg_test.lasso_cv_imputation()
imp_data_xb = xg_test.xgboost_imputation()
imp_data_xp = xg_test.xputer_imputation()
imp_data_mice = xg_test.mice_imp()
xg_test.check_plausibility(imp_data_rf)
xg_test.compare_with_ttest_and_plot(imp_data_ii)
xg_test.feature_combinations()
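To see why it is worth comparing the imputed distribution with the original, here is a small sketch in plain pandas. It does not call XeroGraph; the single-column frame `df` is a made-up example with one large value.

```python
import pandas as pd

# Toy column with two gaps and one large value.
df = pd.DataFrame({"value": [1.0, 2.0, 3.0, None, 100.0, None]})

mean_filled = df.fillna(df.mean())
median_filled = df.fillna(df.median())

# Compare summary statistics before and after imputation.
summary = pd.DataFrame({
    "original": df["value"].describe(),
    "mean_imputed": mean_filled["value"].describe(),
    "median_imputed": median_filled["value"].describe(),
})
print(summary.loc[["count", "mean", "std"]])
```

Mean imputation keeps the column mean unchanged but shrinks the spread, while median imputation shifts the mean toward the median when the data are skewed. Effects like these are what a plausibility check or a distribution comparison is meant to surface.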
We use the XeroCompare application provided in XeroGraph to compare different imputation methods. For this analysis, provide a dataset with as few missing values as possible, because XeroCompare removes rows with missing values.
# MICE imputation is slow; to include it, pass run_mice=True.
summary = xg_test.compare_imputers(run_mice=False)
print(summary)
# XeroCompare can be imported as XC, xc, xerocompare, xero_compare or XeroCompare
from XeroGraph import xc
# MICE imputation is slow; to include it, pass run_mice=True.
compare_imp = xc(data, run_mice=False)
summary = compare_imp.compare()
print(summary)
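One common way to compare imputers on complete data is to hide some known values, impute them, and score the result against the truth. The sketch below shows that idea with plain pandas and NumPy. Whether XeroCompare works exactly this way is an assumption, and the names `complete`, `masked`, and `rmse_on_hidden` are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Complete toy data (no missing values); column names are illustrative.
complete = pd.DataFrame(
    rng.normal(loc=5.0, scale=2.0, size=(100, 3)),
    columns=["f1", "f2", "f3"],
)

# Hide roughly 10% of the entries at random to create a known ground truth.
mask = rng.random(complete.shape) < 0.1
masked = complete.mask(mask)

def rmse_on_hidden(imputed):
    """Root mean squared error over the deliberately hidden entries only."""
    diff = (imputed.to_numpy() - complete.to_numpy())[mask]
    return float(np.sqrt(np.mean(diff ** 2)))

scores = {
    "mean": rmse_on_hidden(masked.fillna(masked.mean())),
    "median": rmse_on_hidden(masked.fillna(masked.median())),
}
print(scores)
```

A lower score means the imputer recovered the hidden values more closely. This kind of benchmark needs complete rows as ground truth, which is why rows that already contain missing values are of no use to it.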
For more detailed information on all features and usage instructions, refer to the full documentation on Read the Docs (https://xerograph.readthedocs.io).
Contributions to XeroGraph are welcome! Please refer to the CONTRIBUTING.md file for guidelines on how to make a contribution, including bug fixes, adding new features, and improving the documentation.
XeroGraph is released under the Apache License 2.0. For more details, see the LICENSE file included with the source code.
For help and support, please open an issue in the GitHub repository or contact the development team at XeroGraph@kazilab.se.