Adding the libre textbooks #149

Open

hssn-20 wants to merge 23 commits into OpenBioML:main from hssn-20:add-libre-textbooks

hssn-20 commented Apr 2, 2023

This script imports the uploaded libre chemistry textbooks dataset from Hugging Face, cleans the data by removing hyperlinks, licenses, and chapter headers, and then removes specific lines chosen by manual selection. The cleaned data is then saved, and a metadata YAML file is generated from a template. Here's a Colab notebook that implements the process.
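The cleaning steps described above could look roughly like the following. This is a minimal sketch, not the code in transform.py: the regular expressions, the function name `clean_page`, and the `lines_to_remove` argument are assumptions made for illustration, since the PR description does not show the actual patterns.

```python
import re

# Hypothetical patterns; the real transform.py may use different ones.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
LICENSE_RE = re.compile(r"\b(licensed under|creative commons)\b", re.IGNORECASE)
# Matches lines such as "Chapter 3 Bonding" or "1.1: Atoms".
CHAPTER_RE = re.compile(r"^\s*(chapter\s+\d+\b.*|\d+(\.\d+)+:\s+.*)$", re.IGNORECASE)


def clean_page(text: str, lines_to_remove: frozenset = frozenset()) -> str:
    """Drop license lines, chapter headers, and manually selected lines,
    then strip hyperlinks from the lines that remain."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped in lines_to_remove:
            continue
        if LICENSE_RE.search(stripped) or CHAPTER_RE.match(stripped):
            continue
        # Remove URLs and collapse the whitespace they leave behind.
        stripped = re.sub(r"\s+", " ", URL_RE.sub("", stripped)).strip()
        if stripped:
            kept.append(stripped)
    return "\n".join(kept)
```

Calling `clean_page(page_text, frozenset({"Back to top"}))` would then return the page with the listed boilerplate line, license notices, chapter headers, and URLs removed.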


          initial commit

70a9dd1

MicPie assigned hssn-20

MicPie requested review from MicPie and kjappelbaum

April 12, 2023 14:12

MicPie added dataset needs-review and removed needs-review labels

hssn-20 added 6 commits

April 13, 2023 03:32


          Merge branch 'OpenBioML:main' into add-libre-textbooks

c4b61fd


          Add files via upload


          Update transform.py

26b4488


          Update transform.py

5de34a7


          Libre-textbook web crawler

c87307a


          version 1

0bd8b97

hssn-20 changed the title ~~Draft: Adding the libre textbooks~~ Adding the libre textbooks

Author

hssn-20 commented Apr 13, 2023

Hopefully this PR is OK for the first version of this dataset. In the next version, I'd like to remove exercises along with their solutions from the dataset and encode chemicals in a consistent format. Ps.

Author

hssn-20 commented Apr 13, 2023

pre-commit.ci autofix

pre-commit-ci bot and others added 9 commits

April 13, 2023 18:35


          [pre-commit.ci] auto fixes from pre-commit.com hooks

579cc79

for more information, see https://pre-commit.ci


          fixing linting issues

41fd816


          Update transform.py

41b5dfc


          Merge branch 'OpenBioML:main' into add-libre-textbooks

9c68bf9


          Add files via upload

bfcdb4c


          Add files via upload

8050d7e


          Delete lines_to_remove.jsonl

f17e326


          Delete top_sentances.ods

85f161e


          Update transform.py

706442d

Contributor

MicPie commented Apr 17, 2023

Hey @hssn-20, thank you very much for the PR! 🙏
I just had a look and triggered the pre-commit checks on GitHub; see the results here: https://results.pre-commit.ci/run/github/601226793/1681519715.6rdNlKF6QWaniPzvuAnS1g (the link is also at the end below).
It is best to merge the latest main again, make sure that the latest pre-commit hooks are installed properly with pre-commit install, and then run black . (both in the main directory) to auto-format the code.
Then you can rerun the yaml creation with python transform.py and add those changes in a new commit to the PR.
Just let me know if you can add those changes, if not, I can also have a look. 😃

MicPie added the Awaiting author contribution label

MicPie requested changes

data/libre_textbooks/transform.py

		import yaml


		LINES_TO_REMOVE = "/workspaces/chemnlp/data/libre_textbooks/lines_to_remove.jsonl"

Contributor

MicPie Apr 17, 2023

This is not used below. Are those lines already removed on the HF dataset upload?
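If the manual removal were applied inside the script rather than before the upload, it might look like the sketch below. The JSONL layout (one JSON object per line with a `text` key) and the helper names are assumptions; the PR does not show the contents of lines_to_remove.jsonl.

```python
import json
from typing import Iterable, Set


def load_lines_to_remove(rows: Iterable[str], key: str = "text") -> Set[str]:
    """Collect the lines to drop from JSONL rows.

    Assumes each non-empty row is a JSON object whose `key` field holds
    one line of text (a hypothetical layout).
    """
    lines = set()
    for raw in rows:
        raw = raw.strip()
        if raw:
            lines.add(json.loads(raw)[key].strip())
    return lines


def drop_lines(text: str, lines_to_remove: Set[str]) -> str:
    """Remove every line of `text` whose stripped form is in the set."""
    return "\n".join(
        line for line in text.splitlines() if line.strip() not in lines_to_remove
    )
```

With the constant from the diff, usage would be along the lines of `with open(LINES_TO_REMOVE, encoding="utf-8") as handle: to_remove = load_lines_to_remove(handle)`, followed by `drop_lines(page_text, to_remove)` for each page.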

data/libre_textbooks/transform.py Outdated

+                  "identifiers": [
+                      {
+                          "id": "url ",  # column name
+                          "type": "OTHER",  # can be "SMILES", "SELFIES", "IUPAC", "OTHER"

Contributor

MicPie Apr 17, 2023

Did the commit hooks run through with "OTHER" (capital letters)?
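A quick local check of an identifier entry before running the hooks could be sketched as follows. The function `check_identifier` is hypothetical, and the allowed values must be supplied by the caller; the repository's actual validation may accept a different set or different casing.

```python
from typing import Dict, List, Set


def check_identifier(identifier: Dict[str, str], allowed_types: Set[str]) -> List[str]:
    """Return a list of human-readable problems found in one identifier entry."""
    problems = []
    column = identifier.get("id", "")
    if column != column.strip():
        problems.append(f"id {column!r} has leading or trailing whitespace")
    # The comparison is case-sensitive, so "other" and "OTHER" are distinct.
    if identifier.get("type") not in allowed_types:
        problems.append(
            f"type {identifier.get('type')!r} is not one of {sorted(allowed_types)}"
        )
    return problems
```

For example, with the values listed in the inline comment of the diff as the allowed set, `check_identifier({"id": "url ", "type": "OTHER"}, {"SMILES", "SELFIES", "IUPAC", "OTHER"})` reports only the trailing space in the column name.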

hssn-20 added 2 commits

April 18, 2023 15:18


          Merge branch 'OpenBioML:main' into add-libre-textbooks

af2982c


          Merge branch 'OpenBioML:main' into add-libre-textbooks

78ff8f7

kjappelbaum reviewed

data/libre_textbooks/transform.py Outdated


          Update data/libre_textbooks/transform.py

acd40c3

kjappelbaum reviewed

data/libre_textbooks/transform.py

+                          "id": "html",  # name of the column in a tabular dataset
+                          "description": "A scraped page from libre textbooks",
+                          "units": None,  # units of the values in this column (leave empty if unitless)
+                          "type": "string",  # can be "categorical", "ordinal", "continuous", "string"

Collaborator

kjappelbaum May 5, 2023 (edited by MicPie)

Suggested change

-                          "type": "string",  # can be "categorical", "ordinal", "continuous", "string"
+                          "type": "text",  # can be "categorical", "ordinal", "continuous", "text"

kjappelbaum reviewed

data/libre_textbooks/meta.yaml Outdated


          Update data/libre_textbooks/meta.yaml

12f854f

kjappelbaum reviewed

data/libre_textbooks/transform.py Outdated


          Update data/libre_textbooks/transform.py

2e0e0fd

kjappelbaum reviewed

data/libre_textbooks/transform.py Outdated


          Update data/libre_textbooks/transform.py

2ba93b0

kjappelbaum reviewed

data/libre_textbooks/meta.yaml

Comment on lines +17 to +19

+                  - id: text_length
+                    type: int
+                    description: text character count

Collaborator

kjappelbaum May 5, 2023

Suggested change

                - id: text_length
                  type: int
                  description: text character count

kjappelbaum reviewed

data/libre_textbooks/meta.yaml Outdated


          Update data/libre_textbooks/meta.yaml

b34267c

kjappelbaum requested a review from MicPie

May 5, 2023 11:34

Collaborator

kjappelbaum commented May 5, 2023

@MicPie this requires that we add the text type, which is also used for #188

Labels

Awaiting author contribution, dataset