Deanonymize hw #21

Open
Kingk1342 wants to merge 8 commits into batmandoescalc:main from Kingk1342:deanonymize_hw

@Kingk1342

No description provided.

Copilot AI review requested due to automatic review settings March 11, 2026 16:39
@Kingk1342 Kingk1342 closed this Mar 11, 2026
@Kingk1342 Kingk1342 reopened this Mar 11, 2026

Copilot AI left a comment

Pull request overview

Implements the deanonymization homework functions for linking anonymized records to an auxiliary dataset and computing the re-identification rate, alongside updates to the Module 02 bot predictor notebook/model and some repo housekeeping.

Changes:

  • Implement link_records and deanonymization_rate in mod06_deanonymize.py.
  • Update the bot predictor notebook to support probability-thresholding and add a threshold search; retune GBM hyperparameters.
  • Update .gitignore to ignore .venv/; the PR also (currently) adds compiled __pycache__/*.pyc artifacts to the repo.

Reviewed changes

Copilot reviewed 3 out of 7 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| mod06_deanonymize.py | Implements record linkage via quasi-identifiers and computes deanonymization rate. |
| mod02_test_bot_predictor.ipynb | Adds threshold-based predictions and threshold tuning logic for evaluation. |
| mod02_build_bot_predictor.py | Adjusts GradientBoostingClassifier hyperparameters (and enables early stopping behavior). |
| __pycache__/mod06_deanonymize.cpython-313.pyc | Adds compiled bytecode artifact (should not be committed). |
| __pycache__/mod02_build_bot_predictor.cpython-313.pyc | Adds compiled bytecode artifact (should not be committed). |
| .gitignore | Adds .venv/ to ignores (but missing ignores for __pycache__/ and *.pyc). |
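
For readers unfamiliar with record linkage via quasi-identifiers, the idea can be sketched as below. This is not the code from mod06_deanonymize.py: it uses plain lists of dicts instead of DataFrames, and the column names in `QUASI_IDENTIFIERS` as well as the "unique match only" rule are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical quasi-identifier columns; the actual homework may use others.
QUASI_IDENTIFIERS = ("zip", "birth_year", "gender")


def link_records(anon_rows, aux_rows, keys=QUASI_IDENTIFIERS):
    """Return (anon_row, aux_row) pairs whose quasi-identifiers match uniquely."""
    # Index auxiliary rows by their quasi-identifier tuple.
    aux_index = defaultdict(list)
    for row in aux_rows:
        aux_index[tuple(row[k] for k in keys)].append(row)

    # Count how often each tuple occurs on the anonymized side as well.
    anon_counts = defaultdict(int)
    for row in anon_rows:
        anon_counts[tuple(row[k] for k in keys)] += 1

    matches = []
    for row in anon_rows:
        key = tuple(row[k] for k in keys)
        candidates = aux_index.get(key, [])
        # Assumption: only one-to-one matches count as a re-identification.
        if anon_counts[key] == 1 and len(candidates) == 1:
            matches.append((row, candidates[0]))
    return matches
```

Ambiguous tuples (several people sharing the same quasi-identifiers) are dropped here, since they do not single out one individual.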

"""
raise NotImplementedError
n_identified = len(matches_df)
n_total = len(anon_df)

Copilot AI Mar 11, 2026

deanonymization_rate will raise ZeroDivisionError when anon_df is empty (n_total == 0). Consider returning 0.0 (or raising a clearer exception) when there are no anonymized records to evaluate.

Suggested change
-    n_total = len(anon_df)
+    n_total = len(anon_df)
+    if n_total == 0:
+        return 0.0
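
Put together, the guarded function might look like the sketch below. The argument order and the final division are assumptions, since only the two `len(...)` lines are visible in the diff; returning 0.0 for an empty input is one of the two options mentioned above.

```python
def deanonymization_rate(matches_df, anon_df):
    """Fraction of anonymized records that were re-identified."""
    n_identified = len(matches_df)
    n_total = len(anon_df)
    if n_total == 0:
        # Assumption: an empty dataset is reported as a 0.0 rate, not an error.
        return 0.0
    return n_identified / n_total
```

Because the function only calls `len`, it behaves the same for DataFrames and plain lists.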

Copilot uses AI. Check for mistakes.
Comment on lines +30 to +38
"def predict_bot(df, model=None, threshold=None):\n",
" \"\"\"\n",
" Predict whether each account is a bot (1) or human (0).\n",
" If threshold is set (e.g. 0.65), uses predicted probabilities and that cutoff\n",
" instead of the default 0.5; can lower misclassification rate on test.\n",
" \"\"\"\n",
" if model is None:\n",
" model = train_model()\n",
"\n",

Copilot AI Mar 11, 2026

predict_bot falls back to train_model() when model is None, but train_model requires X and y arguments. This path will throw a TypeError if callers omit model; either make model required, or pass in the training data (or remove the auto-train behavior).

Suggested change
- "def predict_bot(df, model=None, threshold=None):\n",
- " \"\"\"\n",
- " Predict whether each account is a bot (1) or human (0).\n",
- " If threshold is set (e.g. 0.65), uses predicted probabilities and that cutoff\n",
- " instead of the default 0.5; can lower misclassification rate on test.\n",
- " \"\"\"\n",
- " if model is None:\n",
- " model = train_model()\n",
- "\n",
+ "def predict_bot(df, model, threshold=None):\n",
+ " \"\"\"\n",
+ " Predict whether each account is a bot (1) or human (0).\n",
+ " If threshold is set (e.g. 0.65), uses predicted probabilities and that cutoff\n",
+ " instead of the default 0.5; can lower misclassification rate on test.\n",
+ " \"\"\"\n",

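With `model` made a required argument, the thresholding described in the docstring could be sketched roughly as follows. The body is an assumption (the notebook's actual implementation is not shown in the diff), and `StubModel` is a hypothetical stand-in for a fitted classifier that exposes `predict` and `predict_proba`.

```python
def predict_bot(df, model, threshold=None):
    """Return 0/1 labels; apply a probability cutoff when threshold is given."""
    if threshold is None:
        return list(model.predict(df))
    # Assumption: column 1 of predict_proba holds P(bot), i.e. classes are [0, 1].
    return [int(row[1] >= threshold) for row in model.predict_proba(df)]


class StubModel:
    """Hypothetical classifier whose 'features' are already bot probabilities."""

    def predict_proba(self, X):
        return [[1 - p, p] for p in X]

    def predict(self, X):
        return [int(p >= 0.5) for p in X]


model = StubModel()
default_labels = predict_bot([0.2, 0.6, 0.9], model)
strict_labels = predict_bot([0.2, 0.6, 0.9], model, threshold=0.65)
```

Raising the cutoff from the default to 0.65 flips the middle account from bot to human, which is the trade-off the threshold parameter controls.
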
Comment on lines +79 to +80
" for yt, yp in zip(y_true, y_pred):\n",
" if yt == 0 and yp == 0:\n",

Copilot AI Mar 11, 2026

confusion_matrix_and_metrics uses zip(y_true, y_pred), which will silently truncate if the inputs differ in length and produce incorrect metrics. Add a length check (or otherwise validate shapes) before iterating.
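
One way to add that validation is sketched below. The function name `confusion_counts` and the return shape are hypothetical; the notebook's `confusion_matrix_and_metrics` may compute more than the four counts, and labels are assumed to be binary 0/1.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) after checking that both inputs align."""
    y_true, y_pred = list(y_true), list(y_pred)
    if len(y_true) != len(y_pred):
        raise ValueError(
            f"length mismatch: {len(y_true)} labels vs {len(y_pred)} predictions"
        )
    tn = fp = fn = tp = 0
    for yt, yp in zip(y_true, y_pred):
        if yt == 0 and yp == 0:
            tn += 1
        elif yt == 0 and yp == 1:
            fp += 1
        elif yt == 1 and yp == 0:
            fn += 1
        elif yt == 1 and yp == 1:
            tp += 1
    return tn, fp, fn, tp
```

On Python 3.10 or newer, `zip(y_true, y_pred, strict=True)` raises `ValueError` on a mismatch as well, though only once the shorter input is exhausted.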

"outputs": [],
"source": [
"y_pred_train = predict_bot(X_train, model)\n",
"y_pred_test = predict_bot(X_test, model, threshold=0.57)"

Copilot AI Mar 11, 2026

This cell prints a recommended best_t threshold (currently 0.56) but later uses threshold=0.57 for y_pred_test, which makes the results harder to interpret/reproduce. Consider using best_t directly (or explain why a different threshold is chosen).

Suggested change
- "y_pred_test = predict_bot(X_test, model, threshold=0.57)"
+ "y_pred_test = predict_bot(X_test, model, threshold=best_t)"

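To illustrate why reusing `best_t` keeps the results reproducible, here is a rough sketch of a threshold search that returns the value to be passed on. The helper name `find_best_threshold`, the candidate grid, and the toy data are all hypothetical; the notebook's search may differ.

```python
def find_best_threshold(y_true, probs, candidates):
    """Return (cutoff, error) for the candidate with the lowest error rate."""
    best_t, best_err = None, float("inf")
    for t in candidates:
        preds = [int(p >= t) for p in probs]
        err = sum(yt != yp for yt, yp in zip(y_true, preds)) / len(y_true)
        # Strict '<' keeps the first candidate when several tie.
        if err < best_err:
            best_t, best_err = t, err
    return best_t, best_err


# Toy labels and predicted bot probabilities, for illustration only.
y_true = [0, 0, 1, 1]
probs = [0.1, 0.55, 0.6, 0.9]
best_t, best_err = find_best_threshold(y_true, probs, [0.5, 0.56, 0.7])
```

Passing the returned `best_t` straight into `predict_bot` avoids a hand-copied constant drifting away from what the search printed. It is generally safer to run such a search on held-out validation data rather than on the test set.
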
Comment on lines 8 to 24
 def train_model(X, y, seed=seed):
     """
     Build a GBM on given data

     """
     model = GradientBoostingClassifier(
-        learning_rate=0.1,
-        n_estimators=100,
-        max_depth=8,
-        subsample=1,
-        min_samples_leaf=1,
-        random_state=seed
+        n_estimators=800,
+        max_depth=4,
+        learning_rate=0.03,
+        subsample=0.85,
+        min_samples_leaf=15,
+        min_samples_split=30,
+        max_features="sqrt",
+        validation_fraction=0.15,
+        n_iter_no_change=25,
+        random_state=seed,
     )

Copilot AI Mar 11, 2026

PR title indicates this is focused on deanonymization, but this file changes the bot predictor model hyperparameters as well. Consider splitting these unrelated changes into separate PRs to keep review scope focused and reduce merge risk.

2 participants