Data Profiling is a tool that helps data analysts explore and understand a dataset by summarizing it in an informative report.
It is based on Pandas Profiling 1.4.1.
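To give a rough idea of what "summarizing a dataset" means, the sketch below computes a few per-column statistics with plain pandas. It is only an illustration and not how Data Profiling is implemented; the `summarize` helper, the sample data, and the choice of statistics are all hypothetical.

```python
import io

import pandas as pd

# Hypothetical sample data, used only for illustration.
SAMPLE_CSV = """sepal_length,species
5.1,setosa
4.9,setosa
,versicolor
6.3,virginica
"""


def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with a few basic profiling statistics."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),   # inferred type of each column
        "missing": df.isna().sum(),       # number of missing values
        "distinct": df.nunique(),         # distinct values, ignoring missing ones
    })


df = pd.read_csv(io.StringIO(SAMPLE_CSV))
summary = summarize(df)
print(summary)
```

A full profiling report goes well beyond this, adding histograms, correlations, and warnings, but the per-column table is a reasonable mental model of its starting point.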
Data Profiling can be installed by running `pip install https://github.com/DAVINTLAB/pandas-profiling/archive/master.zip`.
Data Profiling returns its report as an HTML page.
The use of Jupyter Notebook is recommended as it can make the experience more interactive. The first step is to import the necessary libraries.

    import pandas as pd
    from pandas_profiling import ProfileReport

A pandas DataFrame serves as the dataset from which the report is generated. This example uses the Iris dataset.

    # iris.data has no header row, so the column names are supplied explicitly
    columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]
    df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", encoding='UTF-8', header=None, names=columns)

In Jupyter Notebook, simply calling the report will display it.

    ProfileReport(df)

For more visualizations related specifically to categorical data (the Bi-Dimensional Chord Diagram and the Data Table), use:

    ProfileReport(df, extended_report=True)

Here is a comparison between Pandas Profiling 1.4.1 (a) and Data Profiling (b).
Python 3 is required in order to run Data Profiling. Also, the following Python libraries are used:
| Library | Version |
|---|---|
| pandas | 1.3.3 |
| numpy | 1.15.4 |
| matplotlib | 3.5.0 |
| jinja | 2.10.1 |
| missingno | 0.5.0 |
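
To compare an environment against the table above, a small check along these lines may help. It is a minimal sketch using only the standard library (`importlib.metadata`, Python 3.8+); the `installed_versions` helper is hypothetical, and using `Jinja2` as the distribution name for the `jinja` entry is an assumption.

```python
from importlib import metadata

# Versions copied from the table above. "Jinja2" as the distribution
# name for the "jinja" entry is an assumption.
LISTED_VERSIONS = {
    "pandas": "1.3.3",
    "numpy": "1.15.4",
    "matplotlib": "3.5.0",
    "Jinja2": "2.10.1",
    "missingno": "0.5.0",
}


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


for name, installed in installed_versions(LISTED_VERSIONS).items():
    print(f"{name}: listed {LISTED_VERSIONS[name]}, installed {installed}")
```

The script only reports what is installed next to what is listed; it does not decide whether a different version is compatible.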
Internet access is necessary to load the JavaScript libraries. The following JavaScript libraries are used:
| Library | Version |
|---|---|
| d3 | 5.9.7 |
| jquery | 3.4.1 |
| bootstrap | 3.3.6 |
Please refer to this tool by citing the works indicated below.
For the tool in general:
A. M. P. Milani, “Preprocessing profiling model for visual analytics”, Master’s thesis, School of Technology, PUCRS, Porto Alegre, 2019. [Online]. Available: http://tede2.pucrs.br/tede2/handle/tede/9007
For the Bi-Dimensional Chord Diagram and the Data Table, specifically:
L. Ciocari, “Uso de visualização de dados para auxiliar no pré-processamento de dados categóricos”, Undergraduate thesis, School of Technology, PUCRS, Porto Alegre, 2019.
We are members of the Data Visualization and Interaction Lab (DaVInt) at PUCRS:
- Isabel H. Manssour -- Professor Coordinator of DaVInt -- 2017-current.
- Alessandra M. P. Milani -- Master Student in Computer Science -- 2017-2019.
- Lucas B. Ciocari -- Graduate Student in Computer Science -- 2020-current.
- Lucas A. Loges -- Undergraduate Student in Computer Science -- 2019-current.
More information can be found here.
