mathiasdallapalma/BNFS-QA

Bayesian Network Feature Selection (BNFS)

A method for feature selection applied to classification and regression problems using Bayesian networks.

Final Application Project at the University of Trento

Table of Contents

Overview
Installation
Usage

Overview

BNFS implements a feature selection strategy based on the reconstruction of Bayesian networks, following the method proposed in arXiv:2204.03526. By modeling probabilistic dependencies between variables, the algorithm identifies a Markov blanket around the target variable to select the most informative features. The approach has been tested by training machine learning classifiers and compared against state-of-the-art selection methods, showing promising results.

The pipeline is structured as follows:

  1. Data Preprocessing: The data is cleaned and normalized to prepare it for Bayesian network training. This step includes discretizing continuous features where necessary.
  2. Bayesian Network Structure Learning: The network structure is learned from the data by identifying the dependencies between variables. Several strategies are available for this step, including quantum annealing (QA).
  3. Markov Blanket Calculation: The Markov blanket of the target variable (its parents, its children, and the other parents of its children) is extracted from the learned network.
  4. Feature Selection: The features in the Markov blanket are selected. Given these features, the target is conditionally independent of all remaining variables in the network, so the blanket is the most informative subset.
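
Steps 3 and 4 can be illustrated with a minimal sketch. This is not the BNFS implementation: the function name, the convention that adj[i][j] == 1 means an edge from node i to node j, and the toy network are all assumptions made for illustration. The full flag is likewise an assumption that a reduced blanket contains parents and children only.

```python
def markov_blanket(adj, target, full=True):
    """Return the Markov blanket of `target` in a DAG.

    Assumed convention: adj[i][j] == 1 means a directed edge i -> j.
    With full=False only parents and children are returned (an assumption
    about what a "reduced" blanket would contain).
    """
    n = len(adj)
    parents = {i for i in range(n) if adj[i][target]}
    children = {j for j in range(n) if adj[target][j]}
    blanket = parents | children
    if full:
        # Add the other parents of each child (often called spouses).
        for child in children:
            blanket |= {i for i in range(n) if adj[i][child]}
    blanket.discard(target)
    return sorted(blanket)


# Toy DAG over 5 nodes; node 4 plays the role of the target.
# Edges: 0 -> 4, 4 -> 2, 1 -> 2, 3 -> 0
adj = [
    [0, 0, 0, 0, 1],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
]

print(markov_blanket(adj, 4))              # parents, children, spouses
print(markov_blanket(adj, 4, full=False))  # parents and children only
```

In this toy network node 3 influences the target only through node 0, so it is left out of the blanket.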

Installation

Prerequisites:

Ensure that the following prerequisites are installed on your development machine:

  • Python 3.6 or later
  • pip3 (Python package installer)

BNFS Installation:

To install BNFS, use the following pip command:

pip3 install bnfs

Usage

Step 1: Data Preparation

Prepare a CSV file containing the dataset, with the target variable being the last column. The features can be of any type (integer, float, string), but the target must be labeled appropriately. For example:

Example

Feature 1   Feature 2   Feature 3   TARGET
17.27       3           ETVDA       True
44.59       105         FBAER       False
...         ...         ...         ...
26.89       19          DDFBDF      False
15.56       298         CSDSD       True
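
As a quick sanity check of the expected layout, the sketch below parses a small CSV and separates the features from the target in the last column. The comma delimiter and the presence of a header row are assumptions; the four data rows are the ones shown in the example.

```python
import csv
import io

# In-memory stand-in for the dataset file (comma-separated, with a header
# row; both are assumptions about the expected format).
sample = io.StringIO(
    "Feature 1,Feature 2,Feature 3,TARGET\n"
    "17.27,3,ETVDA,True\n"
    "44.59,105,FBAER,False\n"
    "26.89,19,DDFBDF,False\n"
    "15.56,298,CSDSD,True\n"
)

reader = csv.reader(sample)
header = next(reader)
rows = list(reader)

# The target is the last column; everything before it is a feature.
feature_names, target_name = header[:-1], header[-1]
X = [row[:-1] for row in rows]
y = [row[-1] for row in rows]

print(feature_names, target_name)
print(X[0], y[0])
```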

Step 2: Configuration File

Create a JSON configuration file to specify customizable parameters for the feature selection algorithm.

Details

Available Parameters:

  • data_path: Path to the CSV file containing the dataset.
  • output_dir: Directory for output files. If it doesn’t exist, it will be created.
  • random_state: Set a random seed for reproducibility.
  • verbose: If set to true, print information at each step (discretization, BN structure learning, Markov blanket calculation).
  • full_Markov_blanket: If true, the selected features include the union of parents, children, and the children’s parents of the target variable.

Discretization Parameters:

  • discretize: If set to false, skips discretization.
  • labels: List of indexes for categorical features needing label encoding.
  • n_bins: Number of bins for discretization.
  • discretizer_strategy: Discretization strategy (e.g., uniform, quantile, kmeans).
  • keep_file: If true, generates a CSV file with the discretized dataset.
  • divide_et_impera: If true, applies the divide et impera (divide-and-conquer) approach.
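
To make the roles of labels and n_bins concrete, here is a minimal sketch of label encoding and equal-width binning in plain Python. It is not the discretizer used by BNFS: the helper names are hypothetical, and reading the uniform strategy as equal-width bins is an assumption based on its name.

```python
def label_encode(values):
    """Map each distinct value to an integer code, in order of first appearance."""
    codes = {}
    return [codes.setdefault(v, len(codes)) for v in values]


def uniform_bins(values, n_bins):
    """Assign each value to one of n_bins equal-width bins (0 .. n_bins - 1)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; put everything in bin 0.
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


feature_1 = [17.27, 44.59, 26.89, 15.56]          # continuous column
feature_3 = ["ETVDA", "FBAER", "DDFBDF", "CSDSD"]  # categorical column

print(uniform_bins(feature_1, n_bins=3))
print(label_encode(feature_3))
```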

Bayesian Network Structure Learning:

  • dei_n: Number of splits for the divide et impera (divide-and-conquer) approach.
  • bnsl_data_path: Path to the discretized data (if discretization is skipped).
  • bnsl_strategy: Strategy for learning the Bayesian network structure (e.g., QA, SA, bnlearn).

QA and bnlearn Parameters:

  • reads: Number of reads for the quantum annealing method.
  • annealing_time: Time (in microseconds) allocated for quantum annealing per read.
  • metric: The scoring function used to evaluate network fit (e.g., k2, bic, bdeu).
  • search_algorithm: Search algorithm for optimizing the DAG structure (e.g., ex, hc, cl, tan, cs, naivebayes).
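
A configuration using the parameters above could be generated as in the sketch below. The key names come from the lists above, but the flat (non-nested) layout, every value shown, and the zero-based reading of the labels indexes are assumptions. Check them against the package before relying on them.

```python
import json

# Illustrative values only; paths and numbers are placeholders.
config = {
    "data_path": "data/dataset.csv",
    "output_dir": "results",
    "random_state": 42,
    "verbose": True,
    "full_Markov_blanket": True,
    # Discretization
    "discretize": True,
    "labels": [2],  # assumes zero-based column indexes
    "n_bins": 5,
    "discretizer_strategy": "uniform",
    "keep_file": False,
    "divide_et_impera": False,
    # Bayesian network structure learning
    "dei_n": 2,
    "bnsl_strategy": "bnlearn",
    # bnlearn parameters
    "metric": "bic",
    "search_algorithm": "hc",
}

config_text = json.dumps(config, indent=2)
print(config_text)
```

The printed text can be saved to a file such as config.json and passed to the command in Step 3.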

Step 3: Running the Pipeline

Once the data and configuration file are ready, execute the algorithm using the following command:

bnfs -c <config_file>

This runs the feature selection pipeline and generates the following output file:

  • res.txt: Contains the adjacency matrix of the learned Bayesian network structure and a list of features selected through the Markov blanket method.
