Commit 159b75c

files management and cleanup and created README.txt
1 parent 4ea9603 commit 159b75c

16 files changed: 1512 additions, 2377 deletions

conversion2025/README.md

Lines changed: 28 additions & 15 deletions
@@ -1,33 +1,46 @@
 # README
 
 ## Overview
-This Jupyter Notebook (`file_name.ipynb`) is designed for processing scientific documents, extracting mathematical expressions, and formatting them in Markdown. It leverages Mathpix and OpenAI's LLM capabilities for text transformation.
+This Jupyter Notebook (`converter.ipynb`) is designed for processing question sets and tutorials: it extracts mathematical expressions, formats them in Markdown, and then converts the result into Lambda Feedback compatible JSON. It leverages Mathpix, OpenAI's LLM capabilities, and post-processing for text transformation.
 
 ## Requirements
 Ensure you have the following installed:
 - Python 3.8+
 - `pip install -r requirements.txt`
 
 ## Setup
-1. Create a `.env` file in the root directory and add your OpenAI API keys:
+1. Create a `.env` file in the root directory and add your OpenAI and Mathpix API keys:
 ```env
 OPENAI_API_KEY=<your-openai-api-key>
-OPENAI_MODEL=<your-openai-model>
 MATHPIX_API_KEY=<your-mathpix-key>
 MATHPIX_APP_ID=<your-mathpix-id>
 ```
-4. Open `file_name.ipynb` and execute the cells to process your documents.
+2. Open `converter.ipynb` and execute the cells to process your documents.
 
-## Notes
-- Ensure your API key and endpoint are correct, as they are required for LLM functionality.
-- The notebook is designed for scientific documents, but can be extended to other text formats.
+#### Notes
+- Ensure your API keys are correct, as they are required for Mathpix and LLM functionality.
 
 ## How to use
-Place a pdf of your choice into the folder, `/conversion_content`. Name the pdf file as `example.pdf`.
-Run the converter in Jupyter. A folder with all the conversion content will be produced.
-for `mathpix_to_llm_to_in2lambda_to_JSON.ipynb`, it will produce a folder called `/mathpix_to_llm_to_in2lambda_to_JSON_out`.
-This will contain all the output of the converter.
-
-There is a markdown file called `example.md` inside `/mathpix_to_llm_to_in2lambda_to_JSON_out`, this is the markdown version of the pdf.
-As Mathpix rather reliably generates a consistent markdown version of the pdf, the converter will simply start from `example.md`.
-Meaning that if you wish to convert a different pdf, you must delete `example.md` first.
+1. Place a pdf of your choice (the set of questions) into the folder `/conversion_content/input`.
+
+2. Ensure there is only 1 pdf in the `/conversion_content/input` folder, as the converter otherwise picks one pdf from the folder in an undefined order (likely alphabetical).
+
+3. Run the converter in Jupyter. The folder `/conversion_content/converter` will be created if it does not exist yet.
+
+#### Conversion process
+1. Within `/conversion_content/converter`, the converter creates another folder, `/conversion_content/converter/media`, which holds all the images that Mathpix extracted.
+
+2. A file called `example.md` is created within `/conversion_content/converter` if it does not exist yet. This is the markdown file produced by Mathpix after scanning the pdf.
+
+3. Note that the current program keeps using the same `example.md` unless it is deleted. This reduces Mathpix usage, as Mathpix almost always produces an identical markdown file for the same pdf.
+This means that to convert a different pdf file, you must also delete `example.md`.
+
+#### Notes
+Please read `assumptions.txt` for the things that the converter assumes. If these assumptions are not met, the converter may struggle and produce odd results.
+
+## Evaluation
+I believe this converter should be able to be integrated with the platform's API.
+
+The converter does have its flaws and there are definitely areas where it can still improve, such as reliably handling extremely messy inputs or producing more than just questions and solutions (answer box).
+
+Within the boundary of its assumptions, however, it works reliably and well.

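The "How to use" and "Conversion process" steps in the README above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the notebook's actual code: the helper names `pick_single_pdf` and `cached_markdown` are hypothetical, and the strict "exactly one pdf" check is a suggested guard, not something the commit implements.

```python
from pathlib import Path


def pick_single_pdf(input_dir):
    """Return the only PDF in input_dir, or raise if there is not exactly one.

    Hypothetical helper: sorting makes the choice deterministic, and the
    length check enforces the README's "only 1 pdf" rule explicitly.
    """
    pdfs = sorted(Path(input_dir).glob("*.pdf"))
    if len(pdfs) != 1:
        raise ValueError(
            f"expected exactly 1 PDF in {input_dir}, found {len(pdfs)}"
        )
    return pdfs[0]


def cached_markdown(output_dir, name="example.md"):
    """Return the path of a previously generated markdown file, or None.

    Mirrors the README's caching rule: if example.md already exists it is
    reused, so it must be deleted before converting a different pdf.
    """
    md_path = Path(output_dir) / name
    return md_path if md_path.is_file() else None
```

With helpers like these, a run would call Mathpix only when `cached_markdown("conversion_content/converter")` returns `None`.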
conversion2025/assumptions.txt

Lines changed: 7 additions & 9 deletions
@@ -1,10 +1,8 @@
 assumptions:
-- only 1 set of questions
-- the set of questions contains the questions AND solutions
-- parts are only 1 level deep (i.e. no Q1, part a), i)
-- individual questions and solutions are separable by using just lines
-- all parts are explicitly enumerated
-- Chunky Independent Maths are separated (otherwise Mathpix will not be able to separate them)
-
-
-parts need to be ordered
+- there is only 1 set of questions in the pdf.
+- the set of questions contains the questions AND solutions.
+- parts are only 1 level deep (i.e. no subquestions within a subquestion)
+- individual questions and solutions are separable using just lines (the chunk of text holding a question/solution contains all of it and nothing else, including its images)
+- all parts are explicitly enumerated (the converter may struggle with implied subquestions)
+- independent math equations are separated (if equations belonging to different questions are clustered together, Mathpix may treat them as one and put them under the same math delimiter; this rarely happens)
+- dollar signs are used as math delimiters only.

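The last assumption (dollar signs are used as math delimiters only) lends itself to a rough pre-flight check on the Mathpix markdown. The sketch below is an assumption-laden illustration and not part of the commit: the function names are hypothetical, and an even count of unescaped `$` characters is a necessary condition for balanced delimiters, not a sufficient one.

```python
import re


def unescaped_dollar_count(md_text):
    """Count `$` characters that are not preceded by a backslash."""
    return len(re.findall(r"(?<!\\)\$", md_text))


def dollars_look_balanced(md_text):
    """Heuristic: `$...$` and `$$...$$` always contribute an even number of
    unescaped dollar signs, so an odd count suggests a stray currency sign
    or an unterminated math span."""
    return unescaped_dollar_count(md_text) % 2 == 0
```

A document that fails this check probably violates the assumption and is worth inspecting by hand before running the converter.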
conversion2025/mathpix_to_llm_with_lines_to_api.ipynb renamed to conversion2025/converter.ipynb

Lines changed: 9 additions & 5 deletions
@@ -38,7 +38,7 @@
 "from langchain_openai import ChatOpenAI\n",
 "from langchain.output_parsers import PydanticOutputParser\n",
 "\n",
-"from in2lambda.api.module import Module\n",
+"from in2lambda.api.set import Set\n",
 "from in2lambda.api.question import Question\n",
 "from in2lambda.api.part import Part\n",
 "\n",
@@ -126,7 +126,7 @@
 "# location of the output folder and media folder.\n",
 "folder_path = \"conversion_content\"\n",
 "input_path = f\"{folder_path}/input\"\n",
-"output_path = f\"{folder_path}/mathpix_to_llm_with_lines_to_api\"\n",
+"output_path = f\"{folder_path}/converter\"\n",
 "media_path = f\"{output_path}/media\"\n",
 "\n",
 "# Create output and media directories if they do not exist.\n",
@@ -469,7 +469,9 @@
 "    classes = []\n",
 "\n",
 "    while index < len(md_content):\n",
+"\n",
 "        # no need to check index range since there is always at least 2 characters left\n",
+"        # display math indicator\n",
 "        if md_content[index: index+2] == \"$$\":\n",
 "            display_math = []\n",
 "            index += 2\n",
@@ -479,6 +481,7 @@
 "            classes.append(DisplayMath(\"\".join(display_math)))\n",
 "            index += 2\n",
 "        \n",
+"        # inline math indicator\n",
 "        elif md_content[index] == \"$\":\n",
 "            inline_math = []\n",
 "            index += 1\n",
@@ -488,6 +491,7 @@
 "            classes.append(InlineMath(\"\".join(inline_math)))\n",
 "            index += 1\n",
 "\n",
+"        # otherwise just regular text\n",
 "        else:\n",
 "            regular_text = []\n",
 "            while index < len(md_content) and md_content[index] != \"$\":\n",
@@ -1616,7 +1620,7 @@
 "source": [
 "questions = full_json_question_set[\"questions\"]\n",
 "\n",
-"in2lambda_questions = []\n",
+"in2lambda_set = []\n",
 "\n",
 "# Loop over all questions and question_answers and use in2lambda API to create a JSON.\n",
 "for question_idx, question_dict in enumerate(questions, start=1):\n",
@@ -1647,10 +1651,10 @@
 "        parts=parts,\n",
 "        images=image_paths\n",
 "    )\n",
-"    in2lambda_questions.append(question)\n",
+"    in2lambda_set.append(question)\n",
 "\n",
 "try:\n",
-"    Module(in2lambda_questions).to_json(f\"{output_path}/out\")\n",
+"    Set(questions=in2lambda_set).to_json(f\"{output_path}/out\")\n",
 "    print(\"JSON output successfully created.\")\n",
 "except Exception as e:\n",
 "    print(f\"Error creating JSON output: {e}\")"

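The math-splitting loop touched in the notebook hunks above is only partly visible in the diff, because the inner loops are elided. A self-contained sketch of the same idea might look like the following. The names `DisplayMath` and `InlineMath` come from the notebook, but their definitions here, the `RegularText` class, and the `split_math` function are assumptions made for illustration, and `str.find` stands in for the notebook's character-by-character loops.

```python
from dataclasses import dataclass


@dataclass
class DisplayMath:  # assumed shape; the notebook's own class is not shown
    content: str


@dataclass
class InlineMath:  # assumed shape; the notebook's own class is not shown
    content: str


@dataclass
class RegularText:  # hypothetical name for the plain-text segments
    content: str


def split_math(md_content):
    """Split markdown into display math, inline math and regular text."""
    segments = []
    index = 0
    n = len(md_content)
    while index < n:
        # display math indicator
        if md_content.startswith("$$", index):
            end = md_content.find("$$", index + 2)
            if end == -1:  # unterminated span: consume the rest
                end = n
            segments.append(DisplayMath(md_content[index + 2:end]))
            index = end + 2
        # inline math indicator
        elif md_content[index] == "$":
            end = md_content.find("$", index + 1)
            if end == -1:
                end = n
            segments.append(InlineMath(md_content[index + 1:end]))
            index = end + 1
        # otherwise just regular text, up to the next dollar sign
        else:
            end = md_content.find("$", index)
            if end == -1:
                end = n
            segments.append(RegularText(md_content[index:end]))
            index = end
    return segments
```

Checking for `$$` before `$` matters: testing for a single dollar sign first would misread every display-math opener as an empty inline span.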