Welcome to DataLib’s documentation!
datalib_ha package
Submodules
datalib_ha.advanced_analysis module
- class datalib_ha.advanced_analysis.AdvancedAnalysis
Bases: object
A class providing advanced data analysis methods for DataLib.
Includes regression, classification, clustering, and dimensionality reduction techniques.
- static decision_tree_classification(X: DataFrame, y: Series, max_depth: int | None = None, test_size: float = 0.2) → dict
Perform Decision Tree classification.
- Parameters:
X (pd.DataFrame) – Input features.
y (pd.Series) – Target variable.
max_depth (int, optional) – Maximum tree depth. Defaults to None.
test_size (float, optional) – Proportion of data for testing. Defaults to 0.2.
- Returns:
Decision Tree classification results.
- Return type:
dict
- static kmeans_clustering(X: DataFrame, n_clusters: int = 3, random_state: int = 42) → dict
Perform K-means clustering.
- Parameters:
X (pd.DataFrame) – Input features.
n_clusters (int, optional) – Number of clusters. Defaults to 3.
random_state (int, optional) – Random seed for reproducibility. Defaults to 42.
- Returns:
K-means clustering results.
- Return type:
dict
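To show what n_clusters and random_state control, here is a minimal sketch of K-means (Lloyd's algorithm) in plain Python. It is not DataLib's implementation: the helper names `squared_distance` and `kmeans`, the `n_iter` parameter, and the keys of the returned dictionary are all assumptions made for illustration.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, n_clusters=3, random_state=42, n_iter=20):
    """Plain Lloyd's algorithm on tuples of floats (hypothetical helper)."""
    rng = random.Random(random_state)
    # Seeded initialisation, so repeated runs give the same result.
    centers = rng.sample(points, n_clusters)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest center.
        labels = [
            min(range(n_clusters), key=lambda k: squared_distance(point, centers[k]))
            for point in points
        ]
        # Update step: move each center to the mean of its members.
        for k in range(n_clusters):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                centers[k] = tuple(sum(axis) / len(members) for axis in zip(*members))
    # Assumption: the result keys are illustrative, not DataLib's documented keys.
    return {"labels": labels, "centers": centers}


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
result = kmeans(points, n_clusters=2)
print(result["labels"])
```

Because the initial centers are drawn from a seeded generator, calling the function twice with the same random_state yields identical labels.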
- static knn_classification(X: DataFrame, y: Series, n_neighbors: int = 5, test_size: float = 0.2) → dict
Perform K-Nearest Neighbors classification.
- Parameters:
X (pd.DataFrame) – Input features.
y (pd.Series) – Target variable.
n_neighbors (int, optional) – Number of neighbors. Defaults to 5.
test_size (float, optional) – Proportion of data for testing. Defaults to 0.2.
- Returns:
KNN classification results.
- Return type:
dict
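The role of n_neighbors can be illustrated with a small majority-vote classifier in plain Python. This is a sketch of the general technique only; the helper `knn_predict` and its arguments are hypothetical and do not describe how DataLib implements this method.

```python
from collections import Counter


def knn_predict(train_X, train_y, query, n_neighbors=5):
    """Predict a label by majority vote among the nearest training rows."""
    # Pair each training row's squared distance to the query with its label.
    distances = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, query)), label)
        for row, label in zip(train_X, train_y)
    )
    # Count the labels of the n_neighbors closest rows.
    votes = Counter(label for _, label in distances[:n_neighbors])
    return votes.most_common(1)[0][0]


train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_y = ["a", "a", "a", "b", "b", "b"]
near_a = knn_predict(train_X, train_y, (1.1, 1.0), n_neighbors=3)
near_b = knn_predict(train_X, train_y, (5.1, 5.0), n_neighbors=3)
print(near_a, near_b)
```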
- static linear_regression(X: DataFrame, y: Series, test_size: float = 0.2) → dict
Perform linear regression analysis.
- Parameters:
X (pd.DataFrame) – Input features.
y (pd.Series) – Target variable.
test_size (float, optional) – Proportion of data for testing. Defaults to 0.2.
- Returns:
Regression analysis results including model, coefficients, and performance metrics.
- Return type:
dict
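The following sketch shows, for a single feature and in plain Python, how a test_size hold-out split can be combined with an ordinary least squares fit. It is an illustration under stated assumptions rather than DataLib's code: the function `linear_regression_1d`, its `seed` parameter, and the result keys are hypothetical.

```python
import random


def linear_regression_1d(xs, ys, test_size=0.2, seed=0):
    """Fit y = slope * x + intercept on a training split, score on the rest."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    # Hold out a test_size fraction of the rows (at least one row).
    n_test = max(1, round(len(xs) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    # Ordinary least squares on the training rows.
    mean_x = sum(xs[i] for i in train_idx) / len(train_idx)
    mean_y = sum(ys[i] for i in train_idx) / len(train_idx)
    sxx = sum((xs[i] - mean_x) ** 2 for i in train_idx)
    sxy = sum((xs[i] - mean_x) * (ys[i] - mean_y) for i in train_idx)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Mean squared error on the held-out rows.
    mse = sum((ys[i] - (slope * xs[i] + intercept)) ** 2 for i in test_idx) / len(test_idx)
    # Assumption: these keys are illustrative, not DataLib's documented keys.
    return {"slope": slope, "intercept": intercept, "mse": mse, "n_test": n_test}


xs = [float(i) for i in range(10)]
ys = [2.0 * x + 1.0 for x in xs]
result = linear_regression_1d(xs, ys)
print(result)
```

With ten rows and test_size=0.2, two rows are held out and the remaining eight are used to fit the line.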
- static polynomial_regression(X: DataFrame, y: Series, degree: int = 2, test_size: float = 0.2) → dict
Perform polynomial regression analysis.
- Parameters:
X (pd.DataFrame) – Input features.
y (pd.Series) – Target variable.
degree (int, optional) – Polynomial degree. Defaults to 2.
test_size (float, optional) – Proportion of data for testing. Defaults to 0.2.
- Returns:
Polynomial regression analysis results.
- Return type:
dict
- static principal_component_analysis(X: DataFrame, n_components: int | None = None) → dict
Perform Principal Component Analysis (PCA).
- Parameters:
X (pd.DataFrame) – Input features.
n_components (int, optional) – Number of components to keep. Defaults to None, which keeps min(n_samples, n_features) components.
- Returns:
PCA analysis results.
- Return type:
dict
datalib_ha.data_manipulation module
- class datalib_ha.data_manipulation.DataManipulation
Bases: object
A class for handling data manipulation tasks in DataLib.
This class provides methods for loading, processing, and transforming data, with a focus on CSV files and general data cleaning operations.
- static filter_data(dataframe: DataFrame, conditions: dict | None = None) → DataFrame
Filter a DataFrame based on specified conditions.
- Parameters:
dataframe (pd.DataFrame) – Input DataFrame to filter.
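conditions (dict, optional) – Dictionary mapping column names to filtering conditions. As the example for this method shows, a value may be a literal to match or a callable predicate.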
- Returns:
Filtered DataFrame.
- Return type:
pd.DataFrame
Example
filter_data(df, {'age': lambda x: x > 25, 'city': 'Paris'})
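One way a conditions dictionary like this could be interpreted is sketched below in plain Python, over a list of row dictionaries rather than a pandas DataFrame. The helper name `filter_rows` and the rule that callables act as predicates while any other value is an equality match are assumptions for illustration, not DataLib's confirmed behaviour.

```python
def filter_rows(rows, conditions=None):
    """Keep rows that satisfy every column condition (hypothetical helper)."""
    if not conditions:
        return list(rows)

    def matches(row):
        for column, condition in conditions.items():
            value = row[column]
            # Assumption: a callable is a predicate, anything else is compared with ==.
            ok = condition(value) if callable(condition) else value == condition
            if not ok:
                return False
        return True

    return [row for row in rows if matches(row)]


rows = [
    {"name": "Ana", "age": 31, "city": "Paris"},
    {"name": "Ben", "age": 22, "city": "Paris"},
    {"name": "Cleo", "age": 40, "city": "Lyon"},
]
selected = filter_rows(rows, {"age": lambda x: x > 25, "city": "Paris"})
print([row["name"] for row in selected])
```

Conditions are combined with a logical AND here, so a row must pass every entry in the dictionary to be kept.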
- static handle_missing_values(dataframe: DataFrame, method: str = 'drop', fill_value: int | float | str | None = None) → DataFrame
Handle missing values in a DataFrame.
- Parameters:
dataframe (pd.DataFrame) – Input DataFrame.
method (str, optional) – Method to handle missing values. Defaults to ‘drop’. Other options: ‘fill’.
fill_value (int | float | str, optional) – Value used to fill missing data when method is ‘fill’. Defaults to None.
- Returns:
DataFrame with missing values handled.
- Return type:
pd.DataFrame
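The difference between the two documented methods can be sketched in plain Python, with None standing in for a missing value. The helper `handle_missing` is hypothetical, and raising ValueError for an unknown method is an assumption; the source does not say how DataLib reacts to other values.

```python
def handle_missing(rows, method="drop", fill_value=None):
    """Drop rows containing None, or replace None with fill_value."""
    if method == "drop":
        # Keep only rows in which every field is present.
        return [row for row in rows if all(v is not None for v in row.values())]
    if method == "fill":
        # Replace each missing field and leave the other fields untouched.
        return [
            {key: (fill_value if value is None else value) for key, value in row.items()}
            for row in rows
        ]
    # Assumption: unknown methods are rejected.
    raise ValueError(f"unknown method: {method!r}")


rows = [{"a": 1, "b": None}, {"a": 2, "b": 5}]
dropped = handle_missing(rows)
filled = handle_missing(rows, method="fill", fill_value=0)
print(dropped, filled)
```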
- static load_csv(filepath: str, delimiter: str = ',', encoding: str = 'utf-8') → DataFrame
Load a CSV file into a pandas DataFrame.
- Parameters:
filepath (str) – Path to the CSV file to be loaded.
delimiter (str, optional) – Delimiter used in the CSV file. Defaults to ‘,’.
encoding (str, optional) – File encoding. Defaults to ‘utf-8’.
- Returns:
Loaded data as a pandas DataFrame.
- Return type:
pd.DataFrame
- Raises:
FileNotFoundError – If the specified file cannot be found.
pd.errors.EmptyDataError – If the CSV file is empty.
- static normalize_data(dataframe: DataFrame, columns: List[str] | None = None) → DataFrame
Normalize numerical columns using min-max scaling.
- Parameters:
dataframe (pd.DataFrame) – Input DataFrame.
columns (list, optional) – Columns to normalize. If None, normalizes all numeric columns.
- Returns:
Normalized DataFrame.
- Return type:
pd.DataFrame
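Min-max scaling maps each value x in a column to (x - min) / (max - min), so the smallest value becomes 0 and the largest becomes 1. A minimal plain-Python sketch for one column follows; the helper `min_max_scale` is hypothetical, and mapping a constant column to 0.0 is an assumption, since the source does not say how DataLib handles that case.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column maps to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


scaled = min_max_scale([10, 20, 40])
print(scaled)
```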
- static save_csv(dataframe: DataFrame, filepath: str, delimiter: str = ',', encoding: str = 'utf-8') → None
Save a pandas DataFrame to a CSV file.
- Parameters:
dataframe (pd.DataFrame) – DataFrame to be saved.
filepath (str) – Destination path for the CSV file.
delimiter (str, optional) – Delimiter to use. Defaults to ‘,’.
encoding (str, optional) – File encoding. Defaults to ‘utf-8’.
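The delimiter and encoding parameters matter because a file must be read with the same settings it was written with. The round trip below uses only Python's standard csv module to show this; it does not call DataLib, and the file name and sample rows are made up for the illustration.

```python
import csv
import os
import tempfile

rows = [
    {"city": "Paris", "population": "2100000"},
    {"city": "Lyon", "population": "520000"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cities.csv")
    # Write with a non-default delimiter and an explicit encoding.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["city", "population"], delimiter=";")
        writer.writeheader()
        writer.writerows(rows)
    # Read back with the same delimiter and encoding to recover the columns.
    with open(path, newline="", encoding="utf-8") as handle:
        loaded = [dict(row) for row in csv.DictReader(handle, delimiter=";")]

print(loaded == rows)
```

The csv module returns every field as a string, which is why the population values are written as strings here.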
datalib_ha.statistics module
- class datalib_ha.statistics.StatisticalAnalysis
Bases: object
A class providing statistical analysis methods for DataLib.
Offers methods for calculating basic and advanced statistical measures, including descriptive statistics and hypothesis testing.
- static chi_square_test(observed: ndarray) → dict
Perform a chi-square goodness-of-fit test.
- Parameters:
observed (np.ndarray) – Observed frequencies.
- Returns:
Chi-square test results.
- Return type:
dict
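Since only observed frequencies are passed, the sketch below assumes the expected frequencies are uniform across categories; the source does not state this, so treat it as an assumption. It computes the test statistic, the sum of (observed - expected)^2 / expected, and the degrees of freedom in plain Python. The helper `chi_square_statistic` and its result keys are hypothetical, and the p-value is omitted because it needs the chi-square distribution.

```python
def chi_square_statistic(observed):
    """Chi-square goodness-of-fit statistic against uniform expectations."""
    total = sum(observed)
    # Assumption: every category is expected to be equally likely.
    expected = total / len(observed)
    statistic = sum((o - expected) ** 2 / expected for o in observed)
    dof = len(observed) - 1
    return {"statistic": statistic, "dof": dof}


result = chi_square_statistic([18, 22, 20, 20])
print(result)
```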
- static correlation(df: DataFrame, method: str = 'pearson') → DataFrame
Calculate correlation matrix between numeric columns.
- Parameters:
df (pd.DataFrame) – Input DataFrame.
method (str, optional) – Correlation method. Defaults to ‘pearson’. Other options: ‘spearman’, ‘kendall’.
- Returns:
Correlation matrix.
- Return type:
pd.DataFrame
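For the default ‘pearson’ method, each matrix entry is the covariance of two columns divided by the product of their standard deviations. The plain-Python sketch below builds such a matrix as nested dictionaries; the helper `pearson` and the sample columns are made up for illustration and are not part of DataLib.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


columns = {
    "height": [150, 160, 170, 180],
    "weight": [50, 58, 66, 74],   # rises linearly with height
    "score": [9, 7, 5, 3],        # falls linearly with height
}
# Pairwise coefficients for every pair of columns, including the diagonal.
matrix = {a: {b: pearson(columns[a], columns[b]) for b in columns} for a in columns}
print(matrix["height"])
```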
- static descriptive_stats(data: Series | List[float] | ndarray) → dict
Calculate comprehensive descriptive statistics for a dataset.
- Parameters:
data (Union[pd.Series, List[float], np.ndarray]) – Input data.
- Returns:
Dictionary containing descriptive statistics.
- Return type:
dict
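The source does not list which statistics the returned dictionary contains. As a rough idea of what such a summary might hold, the sketch below uses Python's standard statistics module; the helper `describe` and every key it returns are assumptions, not DataLib's documented output.

```python
import statistics


def describe(data):
    """Summarise a sequence of numbers (hypothetical helper and keys)."""
    values = list(data)
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "std": statistics.stdev(values),  # sample standard deviation (n - 1)
        "min": min(values),
        "max": max(values),
    }


summary = describe([2, 4, 4, 4, 5, 5, 7, 9])
print(summary)
```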
- static t_test(group1: Series | List[float], group2: Series | List[float], equal_var: bool = True) → dict
Perform an independent two-sample t-test between two groups.
- Parameters:
group1 (Union[pd.Series, List[float]]) – First group of data.
group2 (Union[pd.Series, List[float]]) – Second group of data.
equal_var (bool, optional) – Assume equal variances. Defaults to True.
- Returns:
T-test results including t-statistic and p-value.
- Return type:
dict
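The equal_var flag conventionally selects between Student's t-test, which pools the two sample variances, and Welch's t-test, which does not. The plain-Python sketch below computes only the t-statistic for both cases. It assumes DataLib follows that convention, which the source does not confirm, and the helper `t_statistic` is hypothetical.

```python
import math
import statistics


def t_statistic(group1, group2, equal_var=True):
    """Two-sample t-statistic, pooled (Student) or unpooled (Welch)."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = statistics.mean(group1), statistics.mean(group2)
    v1, v2 = statistics.variance(group1), statistics.variance(group2)
    if equal_var:
        # Student: one pooled variance estimate shared by both groups.
        pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
        se = math.sqrt(pooled * (1 / n1 + 1 / n2))
    else:
        # Welch: each group contributes its own variance.
        se = math.sqrt(v1 / n1 + v2 / n2)
    return (m1 - m2) / se


a = [5.1, 4.9, 5.6, 5.8, 6.0]
b = [4.1, 4.5, 4.4, 4.9, 4.0]
t_student = t_statistic(a, b, equal_var=True)
t_welch = t_statistic(a, b, equal_var=False)
print(t_student, t_welch)
```

When both groups have the same size, as here, the two statistics coincide; the tests then differ only in the degrees of freedom used to obtain the p-value.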
datalib_ha.visualization module
- class datalib_ha.visualization.DataVisualization
Bases: object
A class for creating data visualizations in DataLib.
Provides methods for generating various types of plots and charts to help users understand and explore their data.
- static bar_plot(df: DataFrame, x_column: str, y_column: str, title: str | None = None, output_path: str | None = None) → Figure
Create a bar plot from a DataFrame.
- Parameters:
df (pd.DataFrame) – Input DataFrame.
x_column (str) – Column to use for x-axis.
y_column (str) – Column to use for y-axis.
title (str, optional) – Plot title.
output_path (str, optional) – File path to save the plot.
- Returns:
Matplotlib figure object.
- Return type:
plt.Figure
- static correlation_heatmap(correlation_matrix: DataFrame, title: str | None = None, output_path: str | None = None) → Figure
Create a correlation heatmap from a correlation matrix.
- Parameters:
correlation_matrix (pd.DataFrame) – Correlation matrix.
title (str, optional) – Plot title.
output_path (str, optional) – File path to save the plot.
- Returns:
Matplotlib figure object.
- Return type:
plt.Figure
- static histogram(data: Series | List[float], bins: int = 10, title: str | None = None, output_path: str | None = None) → Figure
Create a histogram of the data.
- Parameters:
data (Union[pd.Series, List[float]]) – Input data.
bins (int, optional) – Number of histogram bins. Defaults to 10.
title (str, optional) – Plot title.
output_path (str, optional) – File path to save the plot.
- Returns:
Matplotlib figure object.
- Return type:
plt.Figure
- static scatter_plot(df: DataFrame, x_column: str, y_column: str, hue: str | None = None, title: str | None = None, output_path: str | None = None) → Figure
Create a scatter plot from a DataFrame.
- Parameters:
df (pd.DataFrame) – Input DataFrame.
x_column (str) – Column to use for x-axis.
y_column (str) – Column to use for y-axis.
hue (str, optional) – Column to use for color differentiation.
title (str, optional) – Plot title.
output_path (str, optional) – File path to save the plot.
- Returns:
Matplotlib figure object.
- Return type:
plt.Figure