Commit ccd738f

committed
removed duplicated text when a question has a single part, and added an example for separating question parts
1 parent fb86111 commit ccd738f

1 file changed

+88
-14
lines changed

conversion2025/mathpix_to_llm_with_lines_to_api.ipynb

Lines changed: 88 additions & 14 deletions
Original file line number | Diff line number | Diff line change
@@ -404,7 +404,8 @@
404404
" questions: list[QuestionModelLines] = Field(..., description=\"A list of questions.\")\n",
405405
"\n",
406406
"llm_task_seperate_questions = \"\"\"\n",
407-
" Your task is to extract the line numbers for the start and end of each question and solution from the markdown file, then format it as a JSON object.\n",
407+
"    Your task is to extract the line numbers for the start and end of all the questions and solutions from the markdown file, then format it as a JSON object.\n",
408+
" Note that the questions and solutions may not be around the same area in the markdown file.\n",
408409
" These line numbers will be used later to extract the content of the questions and solutions procedurally.\n",
409410
" \n",
410411
" 1. **Content Extraction:**\n",
@@ -413,7 +414,7 @@
413414
" - Begin by identifying all the questions in the markdown file, and for each question:\n",
414415
" - Identify the start and end line numbers of the full question content, and place them in `question_content_start` and `question_content_end`.\n",
415416
" - Identify the start and end line numbers of the full relevant solution content, and place them in `solution_content_start` and `solution_content_end`.\n",
416-
" - Be careful to ensure that everything related to the question and solution is included, including any math delimiters and LaTeX formatting.\n",
417+
"            - Be careful to ensure that everything related to the question and solution is included, including any math delimiters ($, $$) and LaTeX formatting.\n",
417418
" - Do not forget to include any images or figures that are part of the question or solution.\n",
418419
" \n",
419420
" 2. **Output Format:**\n",
@@ -568,7 +569,7 @@
568569
" \"\"\"\n",
569570
" Initialize the Set_Question_With_Solution with a question and its solution.\n",
570571
" \n",
571-
" Args:\n",
572+
" Args: \n",
572573
" question (Set_Question): The question object.\n",
573574
" solution (Set_Solution): The solution object.\n",
574575
" \"\"\"\n",
@@ -625,12 +626,12 @@
625626
"llm_task_seperate_parts_question = r\"\"\"\n",
626627
" 1. **Content Extraction:**\n",
627628
" - You may choose the `title` for the question.\n",
628-
" - From the input `Full Question Content`, identify the start line and end line for the main introductory text (the stem), place them in `content_start` and `content_end`. \n",
629+
" - From the input `Full Question Content`, identify the start line and end line for the main introductory text (the stem), place them in `content_start` and `content_end`.\n",
629630
"        - From the input `Full Question Content`, identify and separate all the `parts` (sub-questions). They may be explicit (e.g. \"(a)\", \"(b)\", \"i.\", \"ii.\", etc.) or implied. For each identified sub-question:\n",
630631
" - Place the start line going into `part_start` and the end line going into `part_end`.\n",
631632
" - If the question has no sub-questions, leave `part_start` as 0 and `part_end` as -1.\n",
632633
" - You may use the `Full Solution Content` to help with identifying the parts.\n",
633-
" - Be careful to ensure that everything related to the question stem/parts is included, including any math delimiters and LaTeX formatting.\n",
634+
"        - Be careful to ensure that everything related to the question stem/parts is included, including any math delimiters ($, $$) and LaTeX formatting.\n",
634635
" - Do not forget to include any images or figures that are part of the question stem, parts or solution.\n",
635636
" - Ensure no solution content is included in the `content` or `parts` fields.\n",
636637
" \n",
@@ -639,13 +640,37 @@
639640
" - Do NOT include any explanations, comments, or markdown code blocks (like ```json).\n",
640641
" \"\"\"\n",
641642
"\n",
643+
"example_seperate_parts_question = r\"\"\"\n",
644+
" example:\n",
645+
" [(0, \"Q1. find value of $x$ in the following equation:\"),\n",
646+
" (1, \"i. $x + 1 = 2$\"),\n",
647+
" (2, \"ii. $x - 1 = 5$\")]\n",
648+
"\n",
649+
" should be converted to:\n",
650+
" {\n",
651+
" \"title\": \"suitable title\",\n",
652+
" \"content_start\": 0,\n",
653+
" \"content_end\": 0,\n",
654+
" \"parts\": [\n",
655+
" {\n",
656+
" \"part_start\": 1,\n",
657+
" \"part_end\": 1\n",
658+
" },\n",
659+
" {\n",
660+
" \"part_start\": 2,\n",
661+
" \"part_end\": 2\n",
662+
" }\n",
663+
" ]\n",
664+
" }\n",
665+
" \"\"\"\n",
666+
"\n",
642667
"llm_task_seperate_parts_solution = r\"\"\"\n",
643668
" 1. **Content Extraction:**\n",
644669
" - From the input `full solution content`, identify the specific solution part that corresponds to the `target question part`, and place the start line and end line into `part_solution_start` and `part_solution_end`.\n",
645670
" - If the `target question part` is empty, identify the specific solution part that corresponds to the `full question stem`.\n",
646671
" - Use the `full question stem` and `full question parts` to help identify the specific solution part.\n",
647672
" - Ensure that the `target question part` is used to extract the specific solution part.\n",
648-
" - Be careful to ensure that everything related to the solution part is included, including any math delimiters and LaTeX formatting.\n",
673+
"        - Be careful to ensure that everything related to the solution part is included, including any math delimiters ($, $$) and LaTeX formatting.\n",
649674
" - Do not forget to include any images or figures that are part of the solution.\n",
650675
"\n",
651676
" 2. **Output Format:**\n",
@@ -672,6 +697,8 @@
672697
"\n",
673698
" {llm_task_seperate_parts_question}\n",
674699
"\n",
700+
" {example_seperate_parts_question}\n",
701+
"\n",
675702
" Full Solution Content:\n",
676703
" {solution_input}\n",
677704
"\n",
@@ -782,10 +809,57 @@
782809
" ).model_dump()"
783810
]
784811
},
812+
{
813+
"cell_type": "markdown",
814+
"id": "24",
815+
"metadata": {},
816+
"source": [
817+
"# remove the duplicated text for single part questions"
818+
]
819+
},
785820
{
786821
"cell_type": "code",
787822
"execution_count": null,
788-
"id": "24",
823+
"id": "25",
824+
"metadata": {},
825+
"outputs": [],
826+
"source": [
827+
"# class NoPartsQuestionModel(BaseModel):\n",
828+
"# \"\"\"\n",
829+
"# Represents a question without parts.\n",
830+
"# \"\"\"\n",
831+
"# hasParts: bool = Field(False, description=\"Indicates if the question has parts.\")\n",
832+
"\n",
833+
"# llm_task_remove_dupe = \"\"\"\n",
834+
"# 1. **Task:**\n",
835+
"# - Check if the single part that the question has is the same as the full question content.\n",
836+
"# - If it is not, then remove the part and set `hasParts` to `False`.\n",
837+
"# - If it is, then set `hasParts` to `True`.\n",
838+
" \n",
839+
"# 2. **Output Format:**\n",
840+
"# - You MUST output ONLY a single, raw, valid JSON string that matches the provided schema.\n",
841+
"# - Do NOT include any explanations, comments, or markdown code blocks (like ```json).\n",
842+
"# \"\"\"\n",
843+
"# def llm_remove_dupe_part(content: str, part: str) -> bool:\n",
844+
"# return content == part\n",
845+
"\n",
846+
"def dupe_text_reduce(questions_dict: dict) -> dict:\n",
847+
" \"\"\"\n",
848+
" Reduces duplicate text in the questions content and its parts.\n",
849+
" \"\"\"\n",
850+
" for question in questions_dict[\"questions\"]:\n",
851+
" parts = question[\"parts\"]\n",
852+
" if len(parts) == 1 and parts[0] == question[\"content\"]:\n",
853+
" # If the only part is the same as the content, remove the part and set hasParts to False.\n",
854+
" question[\"parts\"][0] = \"\"\n",
855+
" \n",
856+
" return questions_dict"
857+
]
858+
},
859+
{
860+
"cell_type": "code",
861+
"execution_count": null,
862+
"id": "26",
789863
"metadata": {},
790864
"outputs": [],
791865
"source": [
@@ -815,16 +889,16 @@
815889
"\n",
816890
" extracted_dict = extract_parts_question(questions_dict)\n",
817891
"    print(\"successfully extracted the parts from the questions.\")\n",
818-
" print(json.dumps(extracted_dict, indent=2))\n",
892+
" print(json.dumps(extracted_dict))\n",
819893
" print(\"Now validating the content...\")\n",
820894
"\n",
821-
" return extracted_dict"
895+
" return dupe_text_reduce(extracted_dict)"
822896
]
823897
},
824898
{
825899
"cell_type": "code",
826900
"execution_count": null,
827-
"id": "25",
901+
"id": "27",
828902
"metadata": {},
829903
"outputs": [],
830904
"source": [
@@ -833,7 +907,7 @@
833907
},
834908
{
835909
"cell_type": "markdown",
836-
"id": "26",
910+
"id": "28",
837911
"metadata": {},
838912
"source": [
839913
"# Displaying questions"
@@ -842,7 +916,7 @@
842916
{
843917
"cell_type": "code",
844918
"execution_count": null,
845-
"id": "27",
919+
"id": "29",
846920
"metadata": {},
847921
"outputs": [],
848922
"source": [
@@ -869,7 +943,7 @@
869943
},
870944
{
871945
"cell_type": "markdown",
872-
"id": "28",
946+
"id": "30",
873947
"metadata": {},
874948
"source": [
875949
"# in2lambda to JSON"
@@ -878,7 +952,7 @@
878952
{
879953
"cell_type": "code",
880954
"execution_count": null,
881-
"id": "29",
955+
"id": "31",
882956
"metadata": {},
883957
"outputs": [],
884958
"source": [

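The prompts in this commit ask the model to return line numbers rather than text, and `example_seperate_parts_question` shows the expected shape of that output. Below is a minimal sketch of how such spans might be resolved back into text afterwards. It assumes the input is a list of `(index, text)` tuples as in that example and that the ranges are inclusive; `slice_lines` and `resolve_question` are hypothetical names for illustration, not functions from the notebook.

```python
def slice_lines(numbered_lines, start, end):
    """Join the text of every line whose index lies in [start, end], inclusive."""
    # The prompt uses part_start=0, part_end=-1 to signal "no sub-questions",
    # so an end before the start is treated as an empty span.
    if end < start:
        return ""
    return "\n".join(text for idx, text in numbered_lines if start <= idx <= end)


def resolve_question(numbered_lines, spans):
    """Turn the model's line spans into the actual stem and part text."""
    return {
        "title": spans["title"],
        "content": slice_lines(
            numbered_lines, spans["content_start"], spans["content_end"]
        ),
        "parts": [
            slice_lines(numbered_lines, part["part_start"], part["part_end"])
            for part in spans["parts"]
        ],
    }


# Input and expected model output, copied from the example added in this commit.
numbered = [
    (0, "Q1. find value of $x$ in the following equation:"),
    (1, "i. $x + 1 = 2$"),
    (2, "ii. $x - 1 = 5$"),
]
spans = {
    "title": "suitable title",
    "content_start": 0,
    "content_end": 0,
    "parts": [
        {"part_start": 1, "part_end": 1},
        {"part_start": 2, "part_end": 2},
    ],
}

resolved = resolve_question(numbered, spans)
print(resolved["content"])
print(resolved["parts"])
```

Slicing by index rather than asking the model to copy text keeps math delimiters and LaTeX intact, which is presumably why the prompts stress including them in the span.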
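A quick check of `dupe_text_reduce` as added in this commit, run on made-up data. This assumes that `content` and each entry of `parts` are plain strings by the time the function is called, which the diff does not show; the sample dictionary is hypothetical.

```python
def dupe_text_reduce(questions_dict: dict) -> dict:
    """Blank out a lone part that merely repeats the question content."""
    for question in questions_dict["questions"]:
        parts = question["parts"]
        if len(parts) == 1 and parts[0] == question["content"]:
            # The duplicate is replaced by an empty string, not removed.
            question["parts"][0] = ""
    return questions_dict


# Hypothetical sample: the first question has one part that duplicates its
# content, the second has two genuine parts.
sample = {
    "questions": [
        {"content": "Solve $x + 1 = 2$.", "parts": ["Solve $x + 1 = 2$."]},
        {
            "content": "Find $x$ in each equation:",
            "parts": ["(a) $x + 1 = 2$", "(b) $x - 1 = 5$"],
        },
    ]
}

result = dupe_text_reduce(sample)
print(result["questions"][0]["parts"])
print(result["questions"][1]["parts"])
```

The function mutates its argument and returns the same object, and the duplicated part is replaced by an empty string rather than deleted, so `parts` still has length 1 afterwards. The comparison is exact, so a part that differs from the content only in whitespace would not be caught.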