Hopefully this PR should be ok for our first version of this dataset. In our next version, I'd like to remove exercises along with their solutions from the dataset + encode chemicals in a consistent format. Ps.
pre-commit.ci autofix
for more information, see https://pre-commit.ci
Hey @hssn-20, thank you very much for the PR! 🙏
import yaml

LINES_TO_REMOVE = "/workspaces/chemnlp/data/libre_textbooks/lines_to_remove.jsonl"
This is not used below. Are those lines already removed on the HF dataset upload?
data/libre_textbooks/transform.py
Outdated
"identifiers": [
    {
        "id": "url ",  # column name
        "type": "OTHER",  # can be "SMILES", "SELFIES", "IUPAC", "OTHER"
Did the commit hooks run through with "OTHER" (capital letters)?
"id": "html",  # name of the column in a tabular dataset
"description": "A scraped page from libre textbooks",
"units": None,  # units of the values in this column (leave empty if unitless)
"type": "string",  # can be "categorical", "ordinal", "continuous", "string"
Suggested change:
- "type": "string",  # can be "categorical", "ordinal", "continuous", "string"
+ "type": "text",  # can be "categorical", "ordinal", "continuous", "text"
- id: text_length
  type: int
  description: text character count
Suggested change (remove these lines):
- - id: text_length
-   type: int
-   description: text character count
This script imports the libre chemistry textbooks dataset uploaded to Hugging Face, cleans the data by removing hyperlinks, licenses, and chapter headers, and then removes specific manually selected lines. The cleaned data is then saved, and a metadata YAML file is generated from a template. Here's a Colab notebook that implements the process.
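
For readers without access to the notebook, here is a minimal sketch of what the cleaning step described above might look like. It is not the code in `transform.py`: the regular expressions, the function names `load_lines_to_remove` and `clean_page`, and the assumption that each record in `lines_to_remove.jsonl` has a `"text"` field are all hypothetical and only illustrate the idea.

```python
import json
import re

# Hypothetical patterns; the PR's actual cleaning rules are not shown in this thread.
URL_PATTERN = re.compile(r"https?://\S+")
LICENSE_PATTERN = re.compile(r"\b(CC BY|licensed under|Creative Commons)\b", re.IGNORECASE)
# Matches headers such as "1.2: Atomic Structure" (assumed header style).
CHAPTER_HEADER_PATTERN = re.compile(r"^\d+(\.\d+)*:\s+\S")


def load_lines_to_remove(jsonl_text):
    """Parse JSONL content; assumes every record has a "text" field."""
    return {
        json.loads(line)["text"]
        for line in jsonl_text.splitlines()
        if line.strip()
    }


def clean_page(text, lines_to_remove=frozenset()):
    """Drop manually selected lines, license lines, and chapter headers; strip URLs."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped in lines_to_remove:
            continue
        if LICENSE_PATTERN.search(stripped) or CHAPTER_HEADER_PATTERN.match(stripped):
            continue
        # Remove hyperlinks, then collapse the whitespace they leave behind.
        stripped = " ".join(URL_PATTERN.sub("", stripped).split())
        if stripped:
            kept.append(stripped)
    return "\n".join(kept)


# Example with made-up page content.
page = (
    "1.2: Atomic Structure\n"
    "Atoms contain protons; see https://example.org/atoms for details.\n"
    "This page is licensed under CC BY-NC-SA 4.0.\n"
    "Click here to edit\n"
    "Electrons occupy orbitals."
)
manual = load_lines_to_remove('{"text": "Click here to edit"}\n')
cleaned = clean_page(page, manual)
```

If the manually selected lines are stored in `LINES_TO_REMOVE`, reading that file and passing its contents to something like `load_lines_to_remove` would make the constant used, which is the point raised in the review comment above.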