Conversation
Pull request overview
Implements the deanonymization homework functions for linking anonymized records to an auxiliary dataset and computing the re-identification rate, alongside updates to the Module 02 bot predictor notebook/model and some repo housekeeping.
Changes:
- Implement `link_records` and `deanonymization_rate` in `mod06_deanonymize.py`.
- Update the bot predictor notebook to support probability thresholding and add a threshold search; retune GBM hyperparameters.
- Update `.gitignore` to ignore `.venv/`, and (currently) add compiled `__pycache__/*.pyc` artifacts to the repo.
Reviewed changes
Copilot reviewed 3 out of 7 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `mod06_deanonymize.py` | Implements record linkage via quasi-identifiers and computes the deanonymization rate. |
| `mod02_test_bot_predictor.ipynb` | Adds threshold-based predictions and threshold tuning logic for evaluation. |
| `mod02_build_bot_predictor.py` | Adjusts GradientBoostingClassifier hyperparameters (and enables early stopping behavior). |
| `__pycache__/mod06_deanonymize.cpython-313.pyc` | Adds compiled bytecode artifact (should not be committed). |
| `__pycache__/mod02_build_bot_predictor.cpython-313.pyc` | Adds compiled bytecode artifact (should not be committed). |
| `.gitignore` | Adds `.venv/` to ignores (but is missing ignores for `__pycache__/` and `*.pyc`). |
```diff
     """
-    raise NotImplementedError
+    n_identified = len(matches_df)
+    n_total = len(anon_df)
```
`deanonymization_rate` will raise `ZeroDivisionError` when `anon_df` is empty (`n_total == 0`). Consider returning `0.0` (or raising a clearer exception) when there are no anonymized records to evaluate.
Suggested change:

```diff
 n_total = len(anon_df)
+if n_total == 0:
+    return 0.0
```
```python
def predict_bot(df, model=None, threshold=None):
    """
    Predict whether each account is a bot (1) or human (0).
    If threshold is set (e.g. 0.65), uses predicted probabilities and that cutoff
    instead of the default 0.5; can lower misclassification rate on test.
    """
    if model is None:
        model = train_model()
```
`predict_bot` falls back to `train_model()` when `model` is None, but `train_model` requires `X` and `y` arguments. This path will throw a `TypeError` if callers omit `model`; either make `model` required, pass in the training data, or remove the auto-train behavior.
Suggested change:

```diff
-def predict_bot(df, model=None, threshold=None):
+def predict_bot(df, model, threshold=None):
     """
     Predict whether each account is a bot (1) or human (0).
     If threshold is set (e.g. 0.65), uses predicted probabilities and that cutoff
     instead of the default 0.5; can lower misclassification rate on test.
     """
-    if model is None:
-        model = train_model()
```
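One way to apply the suggestion, sketched under the assumption that `model` follows scikit-learn's `predict`/`predict_proba` API (the notebook's feature pipeline is not shown here):

```python
import numpy as np

def predict_bot(df, model, threshold=None):
    """
    Predict whether each account is a bot (1) or human (0).

    `model` is required: the old auto-train fallback called train_model()
    without the X and y it needs, raising TypeError. If threshold is set
    (e.g. 0.65), classify using predicted probabilities and that cutoff
    instead of the default 0.5.
    """
    if threshold is None:
        return model.predict(df)
    proba = model.predict_proba(df)[:, 1]  # P(bot) for each row
    return (proba >= threshold).astype(int)
```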
```python
    for yt, yp in zip(y_true, y_pred):
        if yt == 0 and yp == 0:
```
`confusion_matrix_and_metrics` uses `zip(y_true, y_pred)`, which will silently truncate if the inputs differ in length and produce incorrect metrics. Add a length check (or otherwise validate shapes) before iterating.
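A sketch of the validation, using a simplified stand-in for the notebook's `confusion_matrix_and_metrics` (the name `confusion_counts` and the return shape here are illustrative):

```python
def confusion_counts(y_true, y_pred):
    """Tally (tn, fp, fn, tp), rejecting mismatched input lengths."""
    if len(y_true) != len(y_pred):
        # zip() would silently truncate to the shorter input and skew metrics.
        raise ValueError(
            f"length mismatch: {len(y_true)} labels vs {len(y_pred)} predictions"
        )
    tn = fp = fn = tp = 0
    for yt, yp in zip(y_true, y_pred):
        if yt == 0 and yp == 0:
            tn += 1
        elif yt == 0:
            fp += 1
        elif yp == 0:
            fn += 1
        else:
            tp += 1
    return tn, fp, fn, tp
```

On Python 3.10+, `zip(y_true, y_pred, strict=True)` gives the same protection with a built-in error.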
```python
y_pred_train = predict_bot(X_train, model)
y_pred_test = predict_bot(X_test, model, threshold=0.57)
```
This cell prints a recommended `best_t` threshold (currently 0.56) but later uses `threshold=0.57` for `y_pred_test`, which makes the results harder to interpret and reproduce. Consider using `best_t` directly (or explain why a different threshold is chosen).
Suggested change:

```diff
-y_pred_test = predict_bot(X_test, model, threshold=0.57)
+y_pred_test = predict_bot(X_test, model, threshold=best_t)
```
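A sketch of how `best_t` might be searched and then reused, assuming a simple grid over candidate cutoffs; the helper name and grid bounds are illustrative, not the notebook's actual code:

```python
def find_best_threshold(y_true, proba, candidates=None):
    """Return the probability cutoff that minimizes misclassification rate."""
    if candidates is None:
        # Grid of candidate cutoffs from 0.30 to 0.70; step is illustrative.
        candidates = [round(0.30 + 0.01 * i, 2) for i in range(41)]
    best_t, best_err = 0.5, float("inf")
    for t in candidates:
        errors = sum(int(p >= t) != yt for yt, p in zip(y_true, proba))
        err = errors / len(y_true)
        if err < best_err:
            best_t, best_err = t, err
    return best_t

# Reuse the searched value instead of hard-coding a nearby constant:
# y_pred_test = predict_bot(X_test, model, threshold=best_t)
```

Note that tuning the threshold on the test set leaks test information into the model; a held-out validation split is the safer place to search.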
```diff
 def train_model(X, y, seed=seed):
     """
     Build a GBM on given data
     """
     model = GradientBoostingClassifier(
-        learning_rate=0.1,
-        n_estimators=100,
-        max_depth=8,
-        subsample=1,
-        min_samples_leaf=1,
-        random_state=seed
+        n_estimators=800,
+        max_depth=4,
+        learning_rate=0.03,
+        subsample=0.85,
+        min_samples_leaf=15,
+        min_samples_split=30,
+        max_features="sqrt",
+        validation_fraction=0.15,
+        n_iter_no_change=25,
+        random_state=seed,
     )
```
PR title indicates this is focused on deanonymization, but this file changes the bot predictor model hyperparameters as well. Consider splitting these unrelated changes into separate PRs to keep review scope focused and reduce merge risk.
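For context on the retuned model: with `validation_fraction` and `n_iter_no_change` set, scikit-learn's `GradientBoostingClassifier` holds out part of the training data internally and stops adding trees once the validation score fails to improve for that many rounds, so `n_estimators=800` is an upper bound rather than a fixed count. A minimal sketch on synthetic data (the dataset here is illustrative, not the homework's):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

model = GradientBoostingClassifier(
    n_estimators=800,          # upper bound; early stopping usually ends sooner
    max_depth=4,
    learning_rate=0.03,
    subsample=0.85,
    min_samples_leaf=15,
    min_samples_split=30,
    max_features="sqrt",
    validation_fraction=0.15,  # fraction held out internally for early stopping
    n_iter_no_change=25,       # stop after 25 rounds without improvement
    random_state=0,
)
model.fit(X, y)
print(model.n_estimators_)    # number of trees actually fitted
```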
Added insights on re-identification risks and factors.