Welcome to DataLib’s documentation!

datalib_ha package

Submodules

datalib_ha.advanced_analysis module

class datalib_ha.advanced_analysis.AdvancedAnalysis[source]

Bases: object

A class providing advanced data analysis methods for DataLib.

Includes regression, classification, clustering, and dimensionality reduction techniques.

static decision_tree_classification(X: DataFrame, y: Series, max_depth: int | None = None, test_size: float = 0.2) dict[source]

Perform Decision Tree classification.

Parameters:
  • X (pd.DataFrame) – Input features.

  • y (pd.Series) – Target variable.

  • max_depth (int, optional) – Maximum tree depth. Defaults to None.

  • test_size (float, optional) – Proportion of data for testing. Defaults to 0.2.

Returns:

Decision Tree classification results.

Return type:

dict
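The train/test workflow described above can be sketched with scikit-learn directly. This is an illustration under assumptions, not DataLib's implementation: the toy data, the `random_state` values, and the keys of `results` are hypothetical, since the contents of the returned dict are not documented here.

```python
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical, cleanly separable data: small sizes are class 0, large sizes class 1.
X = pd.DataFrame({"size": list(range(10)) + list(range(20, 30))})
y = pd.Series([0] * 10 + [1] * 10, name="label")

# Hold out 20% of the rows for evaluation (test_size=0.2).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# max_depth=None lets the tree grow until every leaf is pure.
model = DecisionTreeClassifier(max_depth=None, random_state=0)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
results = {"model": model, "accuracy": accuracy}  # hypothetical result keys
```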

static kmeans_clustering(X: DataFrame, n_clusters: int = 3, random_state: int = 42) dict[source]

Perform K-means clustering.

Parameters:
  • X (pd.DataFrame) – Input features.

  • n_clusters (int, optional) – Number of clusters. Defaults to 3.

  • random_state (int, optional) – Random seed for reproducibility. Defaults to 42.

Returns:

K-means clustering results.

Return type:

dict
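For readers unfamiliar with the technique, a minimal NumPy sketch of the K-means iteration (Lloyd's algorithm) follows. It is not DataLib's implementation: the toy blobs are hypothetical, and a deterministic farthest-point initialisation is swapped in for the seeded random initialisation that `random_state` suggests.

```python
import numpy as np

# Hypothetical data: three tight, well-separated blobs of 20 points each.
rng = np.random.default_rng(42)
blob_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.3, size=(20, 2)) for c in blob_centers])

n_clusters = 3

# Farthest-point initialisation: start from the first row, then repeatedly
# add the point farthest from all centroids chosen so far.
centroids = [X[0]]
for _ in range(n_clusters - 1):
    nearest = np.min([np.linalg.norm(X - c, axis=1) for c in centroids], axis=0)
    centroids.append(X[np.argmax(nearest)])
centroids = np.array(centroids)

for _ in range(100):
    # Assignment step: label each point with its closest centroid.
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = np.argmin(distances, axis=1)
    # Update step: move each centroid to the mean of its assigned points.
    # (Empty clusters are not handled; they cannot occur with this toy data.)
    updated = np.array([X[labels == k].mean(axis=0) for k in range(n_clusters)])
    if np.allclose(updated, centroids):
        break
    centroids = updated

# Inertia: total squared distance from each point to its centroid.
inertia = float(np.sum((X - centroids[labels]) ** 2))
```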

static knn_classification(X: DataFrame, y: Series, n_neighbors: int = 5, test_size: float = 0.2) dict[source]

Perform K-Nearest Neighbors classification.

Parameters:
  • X (pd.DataFrame) – Input features.

  • y (pd.Series) – Target variable.

  • n_neighbors (int, optional) – Number of neighbors. Defaults to 5.

  • test_size (float, optional) – Proportion of data for testing. Defaults to 0.2.

Returns:

KNN classification results.

Return type:

dict
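The idea behind K-Nearest Neighbors can be sketched in a few lines of NumPy. This is an illustration of the technique only; `knn_predict` and the toy data are hypothetical and are not part of DataLib.

```python
from collections import Counter

import numpy as np

# Hypothetical training data: two well-separated groups.
X_train = np.array(
    [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]]
)
y_train = np.array(["low", "low", "low", "high", "high", "high"])


def knn_predict(point, k=3):
    """Return the majority label among the k training points closest to `point`."""
    distances = np.linalg.norm(X_train - point, axis=1)  # Euclidean distance
    nearest = np.argsort(distances)[:k]                  # indices of the k closest rows
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


predictions = [knn_predict(np.array(p)) for p in ([0.3, 0.3], [4.8, 5.0])]
```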

static linear_regression(X: DataFrame, y: Series, test_size: float = 0.2) dict[source]

Perform linear regression analysis.

Parameters:
  • X (pd.DataFrame) – Input features.

  • y (pd.Series) – Target variable.

  • test_size (float, optional) – Proportion of data for testing. Defaults to 0.2.

Returns:

Regression analysis results including model, coefficients, and performance metrics.

Return type:

dict
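As a rough sketch of what such an analysis involves, the following fits ordinary least squares with NumPy on a simple hold-out split. It is not DataLib's implementation: the data are hypothetical and noise-free, the split takes the last 20% of rows rather than a shuffled sample, and the keys of `results` are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical data following y = 3*x1 - 2*x2 + 1 exactly.
rng = np.random.default_rng(0)
X = pd.DataFrame({"x1": rng.normal(size=50), "x2": rng.normal(size=50)})
y = 3.0 * X["x1"] - 2.0 * X["x2"] + 1.0

# Hold out the last 20% of rows (test_size=0.2).
n_test = int(len(X) * 0.2)
X_train, X_test = X.iloc[:-n_test], X.iloc[-n_test:]
y_train, y_test = y.iloc[:-n_test], y.iloc[-n_test:]

# Least-squares fit with an explicit column of ones for the intercept.
A_train = np.column_stack([np.ones(len(X_train)), X_train.to_numpy()])
beta, *_ = np.linalg.lstsq(A_train, y_train.to_numpy(), rcond=None)

# Predict on the held-out rows and compute R^2.
A_test = np.column_stack([np.ones(len(X_test)), X_test.to_numpy()])
y_pred = A_test @ beta
ss_res = float(np.sum((y_test.to_numpy() - y_pred) ** 2))
ss_tot = float(np.sum((y_test.to_numpy() - y_test.mean()) ** 2))

results = {  # hypothetical result keys
    "intercept": float(beta[0]),
    "coefficients": beta[1:],
    "r2": 1.0 - ss_res / ss_tot,
}
```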

static polynomial_regression(X: DataFrame, y: Series, degree: int = 2, test_size: float = 0.2) dict[source]

Perform polynomial regression analysis.

Parameters:
  • X (pd.DataFrame) – Input features.

  • y (pd.Series) – Target variable.

  • degree (int, optional) – Polynomial degree. Defaults to 2.

  • test_size (float, optional) – Proportion of data for testing. Defaults to 0.2.

Returns:

Polynomial regression analysis results.

Return type:

dict
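Polynomial regression is linear regression on powers of the input. A minimal single-feature sketch with NumPy, using hypothetical data and not reflecting DataLib's internals:

```python
import numpy as np

# Hypothetical single-feature data following y = 2x^2 - x + 0.5 exactly.
x = np.linspace(-2.0, 2.0, 21)
y = 2.0 * x**2 - x + 0.5

degree = 2

# Expand x into the columns [1, x, x^2]: the polynomial feature step.
features = np.vander(x, N=degree + 1, increasing=True)

# A linear least-squares fit on the expanded features recovers the coefficients
# in increasing order of power.
coeffs, *_ = np.linalg.lstsq(features, y, rcond=None)
```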

static principal_component_analysis(X: DataFrame, n_components: int | None = None) dict[source]

Perform Principal Component Analysis (PCA).

Parameters:
  • X (pd.DataFrame) – Input features.

  • n_components (int, optional) – Number of components to keep. Defaults to None, in which case min(n_samples, n_features) components are kept.

Returns:

PCA analysis results.

Return type:

dict
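The computation behind PCA can be sketched with a singular value decomposition of the centered data. This is an illustration with hypothetical data; it is not DataLib's implementation, and the variable names are assumptions rather than keys of the returned dict.

```python
import numpy as np
import pandas as pd

# Hypothetical data: columns "a" and "b" are perfectly collinear.
X = pd.DataFrame(
    {
        "a": [2.0, 4.0, 6.0, 8.0],
        "b": [1.0, 2.0, 3.0, 4.0],
        "c": [0.5, -0.5, 0.5, -0.5],
    }
)

# PCA operates on mean-centered data.
centered = (X - X.mean()).to_numpy()

# Rows of Vt are the principal axes, ordered by decreasing singular value.
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

n_components = 2
components = Vt[:n_components]
scores = centered @ components.T  # data projected onto the kept axes

# Share of total variance carried by each axis.
explained_variance_ratio = S**2 / np.sum(S**2)
```

Because two of the three columns are collinear, the third ratio is numerically zero, so two components retain essentially all of the variance.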

datalib_ha.data_manipulation module

class datalib_ha.data_manipulation.DataManipulation[source]

Bases: object

A class for handling data manipulation tasks in DataLib.

This class provides methods for loading, processing, and transforming data, with a focus on CSV files and general data cleaning operations.

static filter_data(dataframe: DataFrame, conditions: dict | None = None) DataFrame[source]

Filter a DataFrame based on specified conditions.

Parameters:
  • dataframe (pd.DataFrame) – Input DataFrame to filter.

  • conditions (dict, optional) – Dictionary mapping column names to filtering conditions. As in the example, a condition may be a literal value or a callable applied to each value.

Returns:

Filtered DataFrame.

Return type:

pd.DataFrame

Example

filter_data(df, {'age': lambda x: x > 25, 'city': 'Paris'})
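One plausible reading of the example is that callables are applied per value and anything else is compared for equality. The sketch below implements that reading in plain pandas; `filter_sketch` is a hypothetical stand-in, and the actual behavior of `filter_data` may differ.

```python
import pandas as pd


def filter_sketch(dataframe, conditions=None):
    """Hypothetical stand-in for filter_data; the real implementation may differ."""
    if not conditions:
        return dataframe
    mask = pd.Series(True, index=dataframe.index)
    for column, condition in conditions.items():
        if callable(condition):
            # Callable condition: keep rows where it returns a truthy value.
            mask &= dataframe[column].apply(condition).astype(bool)
        else:
            # Literal condition: keep rows equal to the given value.
            mask &= dataframe[column] == condition
    return dataframe[mask]


df = pd.DataFrame(
    {
        "name": ["Ana", "Ben", "Chloé", "Dev"],
        "age": [31, 22, 40, 28],
        "city": ["Paris", "Paris", "Lyon", "Paris"],
    }
)
over_25_in_paris = filter_sketch(df, {"age": lambda x: x > 25, "city": "Paris"})
```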

static handle_missing_values(dataframe: DataFrame, method: str = 'drop', fill_value: int | float | str | None = None) DataFrame[source]

Handle missing values in a DataFrame.

Parameters:
  • dataframe (pd.DataFrame) – Input DataFrame.

  • method (str, optional) – Method used to handle missing values: 'drop' (the default) or 'fill'.

  • fill_value (int, float or str, optional) – Value to use for filling missing data. Defaults to None.

Returns:

DataFrame with missing values handled.

Return type:

pd.DataFrame
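The two strategies correspond to standard pandas operations. A small sketch with hypothetical data, assuming that 'drop' removes every row containing a missing value and that 'fill' substitutes `fill_value`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"score": [1.0, np.nan, 3.0], "weight": [np.nan, 2.0, 5.0]}
)

# 'drop' strategy: discard rows with at least one missing value.
dropped = df.dropna()

# 'fill' strategy: replace every missing value with a constant.
filled = df.fillna(0)
```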

static load_csv(filepath: str, delimiter: str = ',', encoding: str = 'utf-8') DataFrame[source]

Load a CSV file into a pandas DataFrame.

Parameters:
  • filepath (str) – Path to the CSV file to be loaded.

  • delimiter (str, optional) – Delimiter used in the CSV file. Defaults to ‘,’.

  • encoding (str, optional) – File encoding. Defaults to ‘utf-8’.

Returns:

Loaded data as a pandas DataFrame.

Return type:

pd.DataFrame

Raises:
  • FileNotFoundError – If the specified file cannot be found.

  • pd.errors.EmptyDataError – If the CSV file is empty.
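The parameters and exceptions above match those of `pandas.read_csv`, so the sketch below uses it directly as an assumed equivalent. File names and contents are hypothetical.

```python
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Write a small semicolon-delimited file, then load it.
    path = os.path.join(tmp, "people.csv")
    with open(path, "w", encoding="utf-8") as f:
        f.write("name;age\nAna;31\nBen;22\n")
    df = pd.read_csv(path, delimiter=";", encoding="utf-8")

    # An empty file and a missing file trigger the two documented errors.
    empty_path = os.path.join(tmp, "empty.csv")
    open(empty_path, "w").close()

    errors = []
    for candidate in (os.path.join(tmp, "missing.csv"), empty_path):
        try:
            pd.read_csv(candidate)
        except FileNotFoundError:
            errors.append("not found")
        except pd.errors.EmptyDataError:
            errors.append("empty")
```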

static normalize_data(dataframe: DataFrame, columns: List[str] | None = None) DataFrame[source]

Normalize numerical columns using min-max scaling.

Parameters:
  • dataframe (pd.DataFrame) – Input DataFrame.

  • columns (list, optional) – Columns to normalize. If None, normalizes all numeric columns.

Returns:

Normalized DataFrame.

Return type:

pd.DataFrame
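Min-max scaling maps each column to the range [0, 1] via (x - min) / (max - min). A minimal pandas sketch with hypothetical data, covering the case where `columns` is None and all numeric columns are scaled:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "height": [150.0, 160.0, 170.0],
        "weight": [50.0, 70.0, 90.0],
        "name": ["a", "b", "c"],  # non-numeric, left untouched
    }
)

numeric_cols = df.select_dtypes(include="number").columns
col_min = df[numeric_cols].min()
col_max = df[numeric_cols].max()

normalized = df.copy()
# Note: a constant column has max == min and would produce NaN here.
normalized[numeric_cols] = (df[numeric_cols] - col_min) / (col_max - col_min)
```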

static save_csv(dataframe: DataFrame, filepath: str, delimiter: str = ',', encoding: str = 'utf-8') None[source]

Save a pandas DataFrame to a CSV file.

Parameters:
  • dataframe (pd.DataFrame) – DataFrame to be saved.

  • filepath (str) – Destination path for the CSV file.

  • delimiter (str, optional) – Delimiter to use. Defaults to ‘,’.

  • encoding (str, optional) – File encoding. Defaults to ‘utf-8’.

datalib_ha.statistics module

class datalib_ha.statistics.StatisticalAnalysis[source]

Bases: object

A class providing statistical analysis methods for DataLib.

Offers methods for calculating basic and advanced statistical measures, including descriptive statistics and hypothesis testing.

static chi_square_test(observed: ndarray) dict[source]

Perform a chi-square goodness-of-fit test.

Parameters:

observed (np.ndarray) – Observed frequencies.

Returns:

Chi-square test results.

Return type:

dict
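Since the method takes only observed frequencies, the sketch below assumes the expected frequencies are uniform across categories. That is an assumption, not something stated above. The statistic is the sum of (observed - expected)^2 / expected.

```python
import numpy as np

observed = np.array([18, 22, 20, 40])  # hypothetical counts

# Assumption: every category is expected to be equally frequent.
expected = np.full(observed.shape, observed.sum() / observed.size)

chi2 = float(np.sum((observed - expected) ** 2 / expected))
degrees_of_freedom = observed.size - 1
```

Here the expected count is 25 per category, giving a statistic of 12.32 with 3 degrees of freedom, which exceeds the 5% critical value of 7.815.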

static correlation(df: DataFrame, method: str = 'pearson') DataFrame[source]

Calculate correlation matrix between numeric columns.

Parameters:
  • df (pd.DataFrame) – Input DataFrame.

  • method (str, optional) – Correlation method. Defaults to ‘pearson’. Other options: ‘spearman’, ‘kendall’.

Returns:

Correlation matrix.

Return type:

pd.DataFrame
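The method names match those accepted by `pandas.DataFrame.corr`, which the sketch below uses as an assumed equivalent on hypothetical data.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "x": [1, 2, 3, 4, 5],
        "y": [2, 4, 6, 8, 10],  # exact linear function of x
        "z": [5, 3, 4, 1, 2],
    }
)

pearson = df.corr(method="pearson")    # linear association
spearman = df.corr(method="spearman")  # rank-based association
```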

static descriptive_stats(data: Series | List[float] | ndarray) dict[source]

Calculate comprehensive descriptive statistics for a dataset.

Parameters:

data (Union[pd.Series, List[float], np.ndarray]) – Input data.

Returns:

Dictionary containing descriptive statistics.

Return type:

dict
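The keys of the returned dictionary are not listed above. The sketch below computes a typical set of measures with NumPy; the key names and the choice of sample standard deviation are assumptions.

```python
import numpy as np

data = np.asarray([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # hypothetical sample

stats = {  # hypothetical key names
    "count": int(data.size),
    "mean": float(np.mean(data)),
    "median": float(np.median(data)),
    "std": float(np.std(data, ddof=1)),  # sample standard deviation
    "min": float(np.min(data)),
    "max": float(np.max(data)),
}
```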

static t_test(group1: Series | List[float], group2: Series | List[float], equal_var: bool = True) dict[source]

Perform an independent two-sample t-test between two groups.

Parameters:
  • group1 (Union[pd.Series, List[float]]) – First group of data.

  • group2 (Union[pd.Series, List[float]]) – Second group of data.

  • equal_var (bool, optional) – Assume equal variances. Defaults to True.

Returns:

T-test results including t-statistic and p-value.

Return type:

dict
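The effect of `equal_var` can be seen by computing the t-statistic by hand. The sketch below uses hypothetical data and covers only the statistic. The p-value additionally requires the t-distribution and is omitted; a function such as `scipy.stats.ttest_ind` provides both.

```python
import numpy as np

group1 = np.array([5.1, 4.9, 5.6, 5.8, 6.0, 5.5])
group2 = np.array([4.1, 4.4, 3.9, 4.6])

n1, n2 = group1.size, group2.size
mean_diff = group1.mean() - group2.mean()
var1, var2 = group1.var(ddof=1), group2.var(ddof=1)

# equal_var=True: Student's t-test with a pooled variance estimate.
pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
t_pooled = mean_diff / np.sqrt(pooled_var * (1 / n1 + 1 / n2))

# equal_var=False: Welch's t-test, each group keeps its own variance.
t_welch = mean_diff / np.sqrt(var1 / n1 + var2 / n2)
```

With equal group sizes the two statistics coincide; they differ here because both the sizes and the variances are unequal.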

datalib_ha.visualization module

class datalib_ha.visualization.DataVisualization[source]

Bases: object

A class for creating data visualizations in DataLib.

Provides methods for generating various types of plots and charts to help users understand and explore their data.

static bar_plot(df: DataFrame, x_column: str, y_column: str, title: str | None = None, output_path: str | None = None) Figure[source]

Create a bar plot from a DataFrame.

Parameters:
  • df (pd.DataFrame) – Input DataFrame.

  • x_column (str) – Column to use for x-axis.

  • y_column (str) – Column to use for y-axis.

  • title (str, optional) – Plot title.

  • output_path (str, optional) – File path to save the plot.

Returns:

Matplotlib figure object.

Return type:

plt.Figure
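A plain Matplotlib sketch of the same kind of plot is shown below. It is not DataLib's implementation; the data are hypothetical, and saving only when `output_path` is given is an assumption about how that parameter behaves.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is needed

import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"city": ["Paris", "Lyon", "Nice"], "sales": [120, 80, 45]})

fig, ax = plt.subplots()
ax.bar(df["city"], df["sales"])
ax.set_xlabel("city")
ax.set_ylabel("sales")
ax.set_title("Sales by city")

# Assumption: the figure is written to disk only when a path is supplied.
output_path = None
if output_path is not None:
    fig.savefig(output_path)
```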

static correlation_heatmap(correlation_matrix: DataFrame, title: str | None = None, output_path: str | None = None) Figure[source]

Create a correlation heatmap from a correlation matrix.

Parameters:
  • correlation_matrix (pd.DataFrame) – Correlation matrix.

  • title (str, optional) – Plot title.

  • output_path (str, optional) – File path to save the plot.

Returns:

Matplotlib figure object.

Return type:

plt.Figure

static histogram(data: Series | List[float], bins: int = 10, title: str | None = None, output_path: str | None = None) Figure[source]

Create a histogram of the data.

Parameters:
  • data (Union[pd.Series, List[float]]) – Input data.

  • bins (int, optional) – Number of histogram bins. Defaults to 10.

  • title (str, optional) – Plot title.

  • output_path (str, optional) – File path to save the plot.

Returns:

Matplotlib figure object.

Return type:

plt.Figure

static scatter_plot(df: DataFrame, x_column: str, y_column: str, hue: str | None = None, title: str | None = None, output_path: str | None = None) Figure[source]

Create a scatter plot from a DataFrame.

Parameters:
  • df (pd.DataFrame) – Input DataFrame.

  • x_column (str) – Column to use for x-axis.

  • y_column (str) – Column to use for y-axis.

  • hue (str, optional) – Column to use for color differentiation.

  • title (str, optional) – Plot title.

  • output_path (str, optional) – File path to save the plot.

Returns:

Matplotlib figure object.

Return type:

plt.Figure

Module contents