Conversation
Pull request overview
Implements the deanonymization homework functions for linking anonymized records to an auxiliary dataset and computing the re-identification rate, alongside updates to the Module 02 bot predictor notebook/model and some repo housekeeping.
Changes:
- Implement `link_records` and `deanonymization_rate` in `mod06_deanonymize.py`.
- Update the bot predictor notebook to support probability thresholding and add a threshold search; retune GBM hyperparameters.
- Update `.gitignore` to ignore `.venv/`, and (currently) add compiled `__pycache__/*.pyc` artifacts to the repo.
Reviewed changes
Copilot reviewed 3 out of 7 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `mod06_deanonymize.py` | Implements record linkage via quasi-identifiers and computes the deanonymization rate. |
| `mod02_test_bot_predictor.ipynb` | Adds threshold-based predictions and threshold tuning logic for evaluation. |
| `mod02_build_bot_predictor.py` | Adjusts GradientBoostingClassifier hyperparameters (and enables early stopping behavior). |
| `__pycache__/mod06_deanonymize.cpython-313.pyc` | Adds compiled bytecode artifact (should not be committed). |
| `__pycache__/mod02_build_bot_predictor.cpython-313.pyc` | Adds compiled bytecode artifact (should not be committed). |
| `.gitignore` | Adds `.venv/` to ignores (but is missing ignores for `__pycache__/` and `*.pyc`). |
```diff
     """
-    raise NotImplementedError
+    n_identified = len(matches_df)
+    n_total = len(anon_df)
```
`deanonymization_rate` will raise `ZeroDivisionError` when `anon_df` is empty (`n_total == 0`). Consider returning `0.0` (or raising a clearer exception) when there are no anonymized records to evaluate.
Suggested change:

```diff
 n_total = len(anon_df)
+if n_total == 0:
+    return 0.0
```
```python
def predict_bot(df, model=None, threshold=None):
    """
    Predict whether each account is a bot (1) or human (0).
    If threshold is set (e.g. 0.65), uses predicted probabilities and that cutoff
    instead of the default 0.5; can lower misclassification rate on test.
    """
    if model is None:
        model = train_model()
```
`predict_bot` falls back to `train_model()` when `model` is None, but `train_model` requires `X` and `y` arguments. This path will throw a `TypeError` if callers omit `model`; either make `model` required, pass in the training data, or remove the auto-train behavior.
Suggested change:

```diff
-def predict_bot(df, model=None, threshold=None):
+def predict_bot(df, model, threshold=None):
     """
     Predict whether each account is a bot (1) or human (0).
     If threshold is set (e.g. 0.65), uses predicted probabilities and that cutoff
     instead of the default 0.5; can lower misclassification rate on test.
     """
-    if model is None:
-        model = train_model()
```
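One way to apply the suggestion, sketched under the assumption that `model` follows scikit-learn's `predict`/`predict_proba` API (the notebook's feature pipeline is not shown here):

```python
import numpy as np

def predict_bot(df, model, threshold=None):
    """
    Predict whether each account is a bot (1) or human (0).

    `model` is required: the old auto-train fallback called train_model()
    without the X and y it needs, raising TypeError. If threshold is set
    (e.g. 0.65), classify using predicted probabilities and that cutoff
    instead of the default 0.5.
    """
    if threshold is None:
        return model.predict(df)
    proba = model.predict_proba(df)[:, 1]  # P(bot) for each row
    return (proba >= threshold).astype(int)
```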
```python
    for yt, yp in zip(y_true, y_pred):
        if yt == 0 and yp == 0:
```
`confusion_matrix_and_metrics` uses `zip(y_true, y_pred)`, which will silently truncate if the inputs differ in length and produce incorrect metrics. Add a length check (or otherwise validate shapes) before iterating.
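A sketch of the validation, using a simplified stand-in for the notebook's `confusion_matrix_and_metrics` (the name `confusion_counts` and the return shape here are illustrative):

```python
def confusion_counts(y_true, y_pred):
    """Tally (tn, fp, fn, tp), rejecting mismatched input lengths."""
    if len(y_true) != len(y_pred):
        # zip() would silently truncate to the shorter input and skew metrics.
        raise ValueError(
            f"length mismatch: {len(y_true)} labels vs {len(y_pred)} predictions"
        )
    tn = fp = fn = tp = 0
    for yt, yp in zip(y_true, y_pred):
        if yt == 0 and yp == 0:
            tn += 1
        elif yt == 0:
            fp += 1
        elif yp == 0:
            fn += 1
        else:
            tp += 1
    return tn, fp, fn, tp
```

On Python 3.10+, `zip(y_true, y_pred, strict=True)` gives the same protection with a built-in error.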
```python
y_pred_train = predict_bot(X_train, model)
y_pred_test = predict_bot(X_test, model, threshold=0.57)
```
This cell prints a recommended `best_t` threshold (currently 0.56) but later uses `threshold=0.57` for `y_pred_test`, which makes the results harder to interpret and reproduce. Consider using `best_t` directly (or explain why a different threshold is chosen).
Suggested change:

```diff
-y_pred_test = predict_bot(X_test, model, threshold=0.57)
+y_pred_test = predict_bot(X_test, model, threshold=best_t)
```
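A sketch of how `best_t` might be searched and then reused, assuming a simple grid over candidate cutoffs; the helper name and grid bounds are illustrative, not the notebook's actual code:

```python
def find_best_threshold(y_true, proba, candidates=None):
    """Return the probability cutoff that minimizes misclassification rate."""
    if candidates is None:
        # Grid of candidate cutoffs from 0.30 to 0.70; step is illustrative.
        candidates = [round(0.30 + 0.01 * i, 2) for i in range(41)]
    best_t, best_err = 0.5, float("inf")
    for t in candidates:
        errors = sum(int(p >= t) != yt for yt, p in zip(y_true, proba))
        err = errors / len(y_true)
        if err < best_err:
            best_t, best_err = t, err
    return best_t

# Reuse the searched value instead of hard-coding a nearby constant:
# y_pred_test = predict_bot(X_test, model, threshold=best_t)
```

Note that tuning the threshold on the test set leaks test information into the model; a held-out validation split is the safer place to search.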
```diff
 def train_model(X, y, seed=seed):
     """
     Build a GBM on given data
     """
     model = GradientBoostingClassifier(
-        learning_rate=0.1,
-        n_estimators=100,
-        max_depth=8,
-        subsample=1,
-        min_samples_leaf=1,
-        random_state=seed
+        n_estimators=800,
+        max_depth=4,
+        learning_rate=0.03,
+        subsample=0.85,
+        min_samples_leaf=15,
+        min_samples_split=30,
+        max_features="sqrt",
+        validation_fraction=0.15,
+        n_iter_no_change=25,
+        random_state=seed,
     )
```
PR title indicates this is focused on deanonymization, but this file changes the bot predictor model hyperparameters as well. Consider splitting these unrelated changes into separate PRs to keep review scope focused and reduce merge risk.
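For context on the retuned model: with `validation_fraction` and `n_iter_no_change` set, scikit-learn's `GradientBoostingClassifier` holds out part of the training data internally and stops adding trees once the validation score fails to improve for that many rounds, so `n_estimators=800` is an upper bound rather than a fixed count. A minimal sketch on synthetic data (the dataset here is illustrative, not the homework's):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

model = GradientBoostingClassifier(
    n_estimators=800,          # upper bound; early stopping usually ends sooner
    max_depth=4,
    learning_rate=0.03,
    subsample=0.85,
    min_samples_leaf=15,
    min_samples_split=30,
    max_features="sqrt",
    validation_fraction=0.15,  # fraction held out internally for early stopping
    n_iter_no_change=25,       # stop after 25 rounds without improvement
    random_state=0,
)
model.fit(X, y)
print(model.n_estimators_)    # number of trees actually fitted
```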
Added insights on re-identification risks and factors.