Added compute script, removed unused scripts, cleaned up script forma… by lintyfresh · Pull Request #10 · gsbdarc/LLM_benchmarks

lintyfresh · 2026-02-04T18:58:59Z

…tting, updated README

… parameter

natalya-patrikeeva

Now that the scripts are correct, I made a few code suggestions to make them more elegant and performant (avoid loops, use vectorized ops, document your functions), but it looks great!

natalya-patrikeeva · 2026-02-10T15:53:42Z

scripts/compute_metrics.py

+
+# what name should the investigate_df be saved under? change each time you
+# run this script to avoid overriding
+investigate_json = "investigate_2"


A better way might be to check whether the file exists, then remove it and save under the same name. We could probably also use one log file and append to it. Or are you saying you want to keep several copies, like investigate_1 and investigate_2?
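A minimal sketch of the single-log-file idea, assuming a JSON Lines file is acceptable for this output. The names append_run, read_log, and run_id are hypothetical and not from the PR; each run appends its records with a timestamp, so nothing is overwritten and the file name never has to change.

```python
from __future__ import annotations

import json
from datetime import datetime, timezone
from pathlib import Path


def append_run(log_path: Path, records: list[dict]) -> None:
    """Append one run's records to a JSON Lines log, tagged with a run timestamp."""
    run_id = datetime.now(timezone.utc).isoformat()  # hypothetical run identifier
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as f:  # "a" appends instead of overwriting
        for record in records:
            f.write(json.dumps({"run_id": run_id, **record}) + "\n")


def read_log(log_path: Path) -> list[dict]:
    """Read every record from the log, one JSON object per non-empty line."""
    with log_path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Filtering on run_id afterwards would recover any single run, which covers the "several copies" use case without creating investigate_1, investigate_2, and so on.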

natalya-patrikeeva · 2026-02-10T15:54:12Z

scripts/compute_metrics.py

+# run this script to avoid overriding
+investigate_json = "investigate_2"
+
+# what name should the investigate_df be saved under? change each time you


Not sure about this comment and the metrics_2 file below. Is there a metrics_1?

natalya-patrikeeva · 2026-02-10T15:55:02Z

scripts/compute_metrics.py

+
+# what name should the investigate_df be saved under? change each time you
+# run this script to avoid overriding
+metrics_json = "metrics_2"


Same idea as above: let's not create new files unless we can't avoid it.

natalya-patrikeeva · 2026-02-10T15:55:54Z

scripts/compute.py

+
+
+def sort_outputs(s):
+    # opens all JSON files within a file path


If we want to keep all results in one JSON, we could add an error key and filter on it later instead of having two files.
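Roughly what I have in mind, as a sketch only. The record fields (image, output, error) and the sample values are made up for illustration; the point is that successes and failures live in one table and are split by a filter.

```python
import pandas as pd

# Hypothetical records: every result carries an "error" key, None when the call succeeded.
results = [
    {"image": "img_001.png", "output": "42", "error": None},
    {"image": "img_002.png", "output": None, "error": "timeout"},
    {"image": "img_003.png", "output": "17", "error": None},
]

df = pd.DataFrame(results)

# One table on disk, two views in memory.
ok_df = df[df["error"].isna()]        # rows to score
failed_df = df[df["error"].notna()]   # rows to investigate
```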

natalya-patrikeeva · 2026-02-10T15:56:57Z

scripts/compute_metrics.py

+# Functions
+
+
+def sort_outputs(s):


I would pick a more meaningful argument name than "s".

Let's also add docstrings to all functions.

Type hints can also be helpful!
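For illustration, here is how the signature could look with a descriptive argument, a docstring, and type hints. This is a sketch, not the function from the PR: the argument name output_dir, the one-JSON-object-per-file layout, and the source_file column are all assumptions.

```python
from __future__ import annotations

import json
from pathlib import Path

import pandas as pd


def sort_outputs(output_dir: str | Path) -> pd.DataFrame:
    """Load every JSON file in ``output_dir`` into one DataFrame.

    Args:
        output_dir: Directory holding one JSON object per file (assumed layout).

    Returns:
        A DataFrame with one row per file and a ``source_file`` column
        recording which file each row came from.
    """
    records = []
    for path in sorted(Path(output_dir).glob("*.json")):
        with path.open(encoding="utf-8") as f:
            record = json.load(f)
        record["source_file"] = path.name
        records.append(record)
    return pd.DataFrame(records)
```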

natalya-patrikeeva · 2026-02-10T16:07:58Z

scripts/compute.py

+
+
+def lookup_truth(row):
+    # looks up ground truth for a given row based on the benchmark name and


Docstring here, please.

natalya-patrikeeva · 2026-02-10T16:08:17Z

scripts/compute.py

+
+
+def compute_accuracy(row):
+    # compares accuracy of llm output to ground_truth based on benchmark_name


Add a docstring.

I don't love the if/elses and the repeated return 1 and return 0. This can be rewritten with less code using helper functions.
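One possible shape, sketched under assumptions: the benchmark names ("exact_text", "numeric"), the llm_output column, and the comparison rules below are placeholders, not what the script actually uses. Each comparator returns a bool, and a single int(...) at the end replaces the repeated return 1 / return 0 branches.

```python
from __future__ import annotations

from typing import Callable, Mapping


def _exact_match(output: str, truth: str) -> bool:
    """Case-insensitive comparison after trimming whitespace."""
    return output.strip().lower() == truth.strip().lower()


def _numeric_match(output: str, truth: str, tol: float = 1e-6) -> bool:
    """Compare as floats; anything that does not parse counts as a miss."""
    try:
        return abs(float(output) - float(truth)) <= tol
    except ValueError:
        return False


# Hypothetical benchmark names mapped to their comparison rule.
COMPARATORS: dict[str, Callable[[str, str], bool]] = {
    "exact_text": _exact_match,
    "numeric": _numeric_match,
}


def compute_accuracy(row: Mapping) -> int:
    """Return 1 if the LLM output matches ground truth for the row's benchmark, else 0."""
    compare = COMPARATORS[row["benchmark_name"]]
    return int(compare(str(row["llm_output"]), str(row["ground_truth"])))
```

Adding a benchmark then means adding one small function and one dictionary entry, with no new branch in compute_accuracy.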

natalya-patrikeeva · 2026-02-10T16:15:28Z

scripts/compute_metrics.py

+with open("/zfs/projects/students/ltdarc-usf-intern-2025/LLM_benchmarks/inputs/ground_truth.json", "r") as f:
+    truth = json.load(f)
+    image_index = list(truth.keys())
+    for image in image_index:


I think we can load all images at once and avoid the loop.
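Something along these lines, assuming the JSON maps each image name to a dict of fields. I have not checked the real structure of ground_truth.json, so the sample data and column names here are invented.

```python
import pandas as pd

# Stand-in for json.load(f); assumed shape: {image_name: {field: value, ...}, ...}
truth = {
    "img_001.png": {"benchmark_a": "42", "benchmark_b": "cat"},
    "img_002.png": {"benchmark_a": "17", "benchmark_b": "dog"},
}

# Build the whole table in one call: each top-level key becomes a row.
truth_df = (
    pd.DataFrame.from_dict(truth, orient="index")
    .rename_axis("image")   # name the index so it becomes an "image" column
    .reset_index()
)
```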

natalya-patrikeeva · 2026-02-10T16:16:38Z

scripts/compute.py

+    return full_investigate_df
+
+
+def lookup_truth(row):


I wonder if this function should take in the whole df rather than a row, and run on the whole df instead of row by row.
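A sketch of the DataFrame-level version, assuming the ground truth can be put in long format with one row per (image, benchmark_name) pair. The column names image and truth_long are my assumptions; benchmark_name and ground_truth are taken from the comments in the script.

```python
import pandas as pd


def lookup_truth(df: pd.DataFrame, truth_long: pd.DataFrame) -> pd.DataFrame:
    """Attach a ``ground_truth`` column to every row with a single left merge.

    Rows with no matching (image, benchmark_name) pair get NaN in ``ground_truth``.
    """
    return df.merge(truth_long, on=["image", "benchmark_name"], how="left")


# Hypothetical inputs for illustration.
outputs_df = pd.DataFrame(
    {
        "image": ["img_001.png", "img_002.png", "img_003.png"],
        "benchmark_name": ["benchmark_a", "benchmark_b", "benchmark_a"],
        "llm_output": ["42", "cat", "5"],
    }
)
truth_long = pd.DataFrame(
    {
        "image": ["img_001.png", "img_002.png"],
        "benchmark_name": ["benchmark_a", "benchmark_b"],
        "ground_truth": ["42", "dog"],
    }
)

with_truth = lookup_truth(outputs_df, truth_long)
```

The merge replaces a df.apply(lookup_truth, axis=1) call, which runs Python code once per row.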

scripts/main.py

Jeffotter · 2026-02-10T17:18:07Z

scripts/compute_metrics.py

+
+full_investigate_df.to_json(
+    f"/zfs/projects/students/ltdarc-usf-intern-2025/LLM_benchmarks/outputs/metrics/{investigate_json}.json",
+    orient='records')


Does this overwrite each run?

Added compute script, removed unused scripts, cleaned up script forma…

03d63db

…tting, updated README

lintyfresh assigned natalya-patrikeeva and Jeffotter Feb 4, 2026

updated main.py to try calling api multiple times, increased timeout…

a797564

… parameter

lintyfresh unassigned natalya-patrikeeva and Jeffotter Feb 5, 2026

lintyfresh requested review from Jeffotter and natalya-patrikeeva February 5, 2026 00:20

natalya-patrikeeva approved these changes Feb 10, 2026

Jeffotter reviewed Feb 10, 2026

Jeffotter approved these changes Feb 10, 2026

Lynn Tong added 3 commits February 10, 2026 14:11

Added pdf_to_png.py script and updated README

7084938

renamed compute script, addressed some git pull request comments, upd…

8fd7259

…ated README for Sherlock.

updated how ground_truth.csv is loaded

a40e3f9


