Discrepancy Detection in Client Profiles

Team Name:

Team Members: Noah Stäuble, Mikael Makonnen, Michał Mikuta, Elias Mbarek

Introduction

Client onboarding often involves collecting the same information across multiple documents: ID documents, client profiles, application forms, and free-text descriptions. Inconsistent entries across these documents (e.g. mismatched phone numbers, conflicting text descriptions) are common and hard to detect manually.

Our solution automates discrepancy detection across structured and free-form data using a layered ensemble approach, ensuring more reliable onboarding decisions in real-world banking pipelines.

Explainability

A key strength of our solution is its explainability. Each rejection decision made by the ensemble can be fully traced back to its contributing components. Whether it's a rule-based mismatch, an LLM-detected inconsistency, or an ML classifier signal, the ensemble provides a clear breakdown of the evidence leading to the final decision. This transparency ensures trust and accountability in real-world applications.
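As an illustration of what such a traceable decision could look like, here is a minimal sketch. It is not the repository's actual implementation: the `Evidence` and `Decision` classes, the weighted-average aggregation rule, the weights, and the threshold are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    source: str   # hypothetical component label: "rule", "llm" or "ml"
    score: float  # suspicion in [0, 1]; 1.0 means "certainly inconsistent"
    reason: str   # human-readable explanation of the signal


@dataclass
class Decision:
    reject: bool
    total: float
    evidence: list = field(default_factory=list)

    def explain(self) -> str:
        """Render the verdict followed by each contributing signal, strongest first."""
        verdict = "REJECT" if self.reject else "ACCEPT"
        lines = [f"{verdict} (combined score {self.total:.2f})"]
        for ev in sorted(self.evidence, key=lambda e: e.score, reverse=True):
            lines.append(f"  [{ev.source}] {ev.score:.2f}: {ev.reason}")
        return "\n".join(lines)


def decide(evidence, weights, threshold=0.5):
    """Combine component scores with a weighted average (assumed aggregation rule)."""
    total_weight = sum(weights[ev.source] for ev in evidence)
    if total_weight == 0:
        # No usable evidence: accept by default in this sketch.
        return Decision(reject=False, total=0.0, evidence=list(evidence))
    total = sum(weights[ev.source] * ev.score for ev in evidence) / total_weight
    return Decision(reject=total >= threshold, total=total, evidence=list(evidence))


# Invented example signals, for illustration only.
evidence = [
    Evidence("rule", 1.0, "phone number differs between profile and application form"),
    Evidence("llm", 0.6, "description mentions a different country of residence"),
    Evidence("ml", 0.2, "classifier probability of inconsistency"),
]
weights = {"rule": 2.0, "llm": 1.0, "ml": 1.0}
decision = decide(evidence, weights)
print(decision.explain())
```

Keeping the evidence list attached to the decision means the explanation is produced from the same data that drove the verdict, rather than reconstructed afterwards.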

Approach Overview

We designed a modular pipeline that captures inconsistencies through multiple complementary layers:

Rule-Based Matching: Symbolic comparison of structured fields (e.g. phone number, nationality, address) across forms
LLM-Based Detection: Uses a language model to detect semantic and textual inconsistencies in free-form client descriptions
ML Classifiers: Supervised models trained to detect subtle data patterns and learn from intermediate signals
Ensemble Aggregator: Final decision is made by combining all sources of evidence in a robust way
Embedding-Based Knowledge: Text knowledge is incorporated via embeddings, reduced in dimensionality with PCA
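The rule-based layer listed above can be sketched roughly as follows. This is an assumption-laden illustration, not the code in the rules folder: the form names, field names, and normalization choices (digits-only phone numbers, case-insensitive text) are hypothetical.

```python
import re


def normalize_phone(value: str) -> str:
    """Keep digits only and drop a leading '00' international prefix."""
    digits = re.sub(r"\D", "", value)
    return digits[2:] if digits.startswith("00") else digits


def normalize_text(value: str) -> str:
    """Case-insensitive comparison with whitespace collapsed."""
    return " ".join(value.casefold().split())


# Field-specific normalizers; every other field falls back to normalize_text.
NORMALIZERS = {"phone_number": normalize_phone}


def find_mismatches(forms: dict, fields: list) -> list:
    """Return one message per field whose normalized values differ across forms."""
    mismatches = []
    for name in fields:
        normalize = NORMALIZERS.get(name, normalize_text)
        seen = {}
        for form_name, form in forms.items():
            if form.get(name) is not None:
                seen[form_name] = normalize(str(form[name]))
        if len(set(seen.values())) > 1:
            details = ", ".join(f"{k}={v!r}" for k, v in seen.items())
            mismatches.append(f"{name}: {details}")
    return mismatches


# Invented client data, for illustration only.
forms = {
    "passport": {"nationality": "Swiss", "address": "Bahnhofstrasse 1, Zurich"},
    "client_profile": {
        "nationality": "swiss ",
        "phone_number": "+41 79 123 45 67",
        "address": "Bahnhofstrasse 1,  Zurich",
    },
    "account_form": {
        "phone_number": "0041 79 123 45 68",
        "address": "Bahnhofstrasse 1, Zurich",
    },
}
mismatches = find_mismatches(forms, ["phone_number", "nationality", "address"])
print(mismatches)
```

Normalizing before comparing keeps purely cosmetic differences (spacing, letter case, `+41` versus `0041`) from being flagged, so only genuine conflicts such as the differing last digit of the phone number are reported.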

This approach offers:

High precision on hard symbolic logic
Robust recall via ML
Generalization to novel inconsistencies via LLMs
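For readers less familiar with the terms, precision and recall for the "inconsistent" class can be computed as in the sketch below. The labels and predictions are invented toy data chosen to show the trade-off; they are not results from this project, and `rule_pred` and `ml_pred` are hypothetical names.

```python
def precision_recall(y_true, y_pred):
    """Precision = TP / (TP + FP), recall = TP / (TP + FN) for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data: 1 = inconsistent client profile, 0 = consistent.
y_true = [1, 1, 1, 0, 0, 1]
rule_pred = [1, 0, 0, 0, 0, 1]  # strict rules: no false alarms, but misses cases
ml_pred = [1, 1, 1, 1, 0, 1]    # classifier: catches everything, one false alarm

print("rules:", precision_recall(y_true, rule_pred))
print("ml:   ", precision_recall(y_true, ml_pred))
```

In this toy setup the rules reach perfect precision with lower recall while the classifier does the opposite, which is the kind of complementarity an ensemble is meant to exploit.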

Setup Instructions

Set up the environment using Conda:

conda create -n datathon-env python=3.10
conda activate datathon-env
pip install -r requirements.txt


crocodilefigcucumber/datathon2025-emptystring
