A method for feature selection applied to classification and regression problems using Bayesian networks.
Final Application Project at Trento University
Introduction
Installation
Usage
BNFS implements a feature selection strategy based on the reconstruction of Bayesian networks, following the method proposed in arXiv:2204.03526. By modeling the probabilistic dependencies between variables, the algorithm identifies the Markov blanket of the target variable and uses it to select the most informative features. The approach has been evaluated by training machine learning classifiers on the selected features and comparing the results against state-of-the-art selection methods, with promising results.
The pipeline is structured as follows:
- Data Preprocessing: The data is cleaned and normalized to prepare it for Bayesian network training. This step includes discretizing continuous features where necessary.
- Bayesian Network Structure Learning: The network structure is learned by identifying the relevant variables and the relationships between them. Several strategies are available for this step, including one based on quantum annealing.
- Markov Blanket Calculation: The Markov blanket of the target variable is determined to identify the set of features that directly influence the target.
- Feature Selection: The features that belong to the Markov blanket are selected. Given its full Markov blanket, the target is conditionally independent of every other variable in the network, so these features are the most informative ones for predicting it.
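The Markov blanket step can be illustrated with a minimal sketch. It assumes the learned network is available as a square adjacency matrix in which `adj[i][j] == 1` means a directed edge from node `i` to node `j`; this orientation, the function name `markov_blanket`, and the toy graph are assumptions made for illustration and are not taken from the BNFS code.

```python
def markov_blanket(adj, target):
    """Return the parents, children, spouses and full Markov blanket of `target`.

    Assumption: adj[i][j] == 1 means there is a directed edge i -> j.
    """
    n = len(adj)
    parents = {i for i in range(n) if adj[i][target]}
    children = {j for j in range(n) if adj[target][j]}
    # Spouses are the other parents of the target's children.
    spouses = {
        i
        for child in children
        for i in range(n)
        if adj[i][child] and i != target
    }
    return {
        "parents": parents,
        "children": children,
        "spouses": spouses,
        "blanket": parents | children | spouses,
    }


# Toy DAG with 5 nodes; node 4 plays the role of the target.
# Edges: 0 -> 4, 4 -> 1, 2 -> 1, 3 -> 0
adj = [[0] * 5 for _ in range(5)]
for src, dst in [(0, 4), (4, 1), (2, 1), (3, 0)]:
    adj[src][dst] = 1

result = markov_blanket(adj, target=4)
print(sorted(result["blanket"]))  # [0, 1, 2]
```

Node 3 is an ancestor of the target but not part of its blanket, so it would not be selected in this toy example.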
Ensure that the following prerequisites are installed on your development machine:
- Python 3.6 or later
- pip3 (Python package installer)
To install BNFS, use the following pip command:
`pip3 install bnfs`
Prepare a CSV file containing the dataset, with the target variable as the last column. The features can be of any type (integer, float, string), but the target must be labeled appropriately. For example:
| Feature 1 | Feature 2 | Feature 3 | TARGET |
|---|---|---|---|
| 17.27 | 3 | ETVDA | True |
| 44.59 | 105 | FBAER | False |
| ... | ... | ... | ... |
| 26.89 | 19 | DDFBDF | False |
| 15.56 | 298 | CSDSD | True |
Mixed data types are supported (int, float, string).
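A dataset in this layout could be produced with Python's standard `csv` module, as in the sketch below. The file name `dataset.csv` is hypothetical, and writing a header row is an assumption, since the expected header handling is not specified here.

```python
import csv

# Column names and rows mirror the example table; the target is the last column.
header = ["Feature 1", "Feature 2", "Feature 3", "TARGET"]
rows = [
    [17.27, 3, "ETVDA", True],
    [44.59, 105, "FBAER", False],
    [26.89, 19, "DDFBDF", False],
    [15.56, 298, "CSDSD", True],
]


def write_dataset(path, header, rows):
    """Write the header and the rows to a CSV file at `path`."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


write_dataset("dataset.csv", header, rows)  # hypothetical file name
```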
Create a JSON configuration file to specify customizable parameters for the feature selection algorithm.
Available Parameters:
- `data_path`: Path to the CSV file containing the dataset.
- `output_dir`: Directory for output files. If it doesn't exist, it will be created.
- `random_state`: Random seed for reproducibility.
- `verbose`: If set to `true`, prints information at each step (discretization, BN structure learning, Markov blanket calculation).
- `full_Markov_blanket`: If `true`, the selected features include the union of the target variable's parents, its children, and its children's parents.
Discretization Parameters:
- `discretize`: If set to `false`, skips discretization.
- `labels`: List of indexes of the categorical features that need label encoding.
- `n_bins`: Number of bins for discretization.
- `discretizer_strategy`: Discretization strategy (e.g., `uniform`, `quantile`, `kmeans`).
- `keep_file`: If `true`, generates a CSV file with the discretized dataset.
- `divide_et_impera`: If `true`, applies the divide et impera approach.
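To make `n_bins` and the `uniform` strategy concrete, the sketch below shows equal-width binning, which is what "uniform" conventionally means for discretizers. It is an illustration only: the helper `uniform_bins` is hypothetical and BNFS's own discretizer may differ in its details.

```python
def uniform_bins(values, n_bins):
    """Map each value to an equal-width bin index in the range [0, n_bins - 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature collapses into a single bin.
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum lies on the upper edge, so it is clamped into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


feature = [17.27, 44.59, 26.89, 15.56]
print(uniform_bins(feature, n_bins=3))  # [0, 2, 1, 0]
```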
Bayesian Network Structure Learning:
- `dei_n`: Number of splits for the divide et impera approach.
- `bnsl_data_path`: Path to the discretized data (if discretization is skipped).
- `bnsl_strategy`: Strategy for learning the Bayesian network structure (e.g., `QA`, `SA`, `bnlearn`).
QA and bnlearn Parameters:
- `reads`: Number of reads for the quantum annealing method.
- `annealing_time`: Time (in microseconds) allocated for quantum annealing per read.
- `metric`: The scoring function used to evaluate network fit (e.g., `k2`, `bic`, `bdeu`).
- `search_algorithm`: Search algorithm for optimizing the DAG structure (e.g., `ex`, `hc`, `cl`, `tan`, `cs`, `naivebayes`).
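As a rough sketch, a configuration file using the parameters listed above could be generated as follows. The parameter names come from the lists above; everything else is an assumption made for illustration: the flat key layout, the concrete values, the file names `dataset.csv`, `results`, and `config.json`, the zero-based index in `labels`, and the omission of the parameters that apply to other modes (`dei_n`, `bnsl_data_path`, `reads`, `annealing_time`).

```python
import json

# Illustrative values only; the key layout (flat vs. nested) is an assumption.
config = {
    "data_path": "dataset.csv",        # hypothetical input file
    "output_dir": "results",           # hypothetical output directory
    "random_state": 42,
    "verbose": True,
    "full_Markov_blanket": True,
    "discretize": True,
    "labels": [2],                      # assumed zero-based index of a string feature
    "n_bins": 5,
    "discretizer_strategy": "uniform",
    "keep_file": False,
    "divide_et_impera": False,
    "bnsl_strategy": "bnlearn",
    "metric": "bic",
    "search_algorithm": "hc",
}

with open("config.json", "w") as f:
    json.dump(config, f, indent=2)
```

Python's `True` and `False` are serialized as the JSON literals `true` and `false` used in the parameter descriptions.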
Once the data and configuration file are ready, execute the algorithm using the following command:
`bnfs -c <config_file>`
This runs the feature selection pipeline and generates the following output files:
- `res.txt`: Contains the adjacency matrix of the learned Bayesian network structure and the list of features selected through the Markov blanket method.
