Add Papyrus 3 Million data point pchembl for 7k protein by phalem · Pull Request #340 · OpenBioML/chemnlp

phalem · 2023-06-29T13:26:10Z

Adds all Papyrus data points that have a pchembl value, from https://doi.org/10.4121/16896406.v3. For a detailed explanation of the columns, see the README at the reference link. The data was cleaned and uploaded to Hugging Face because the original data is difficult to handle on a normal computer. Link: https://huggingface.co/datasets/phalem/awesome_chem_clean_data/resolve/main/pchembl_papyrus.csv.gz (size: 105 MB)
Please note: I filled the NA values of every field other than pchembl with "unknown". Please look at the other fields if possible and revise the columns as well.
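For reviewers, the NA handling described above could be sketched roughly as follows. This is only an illustration, not the code in this PR: the column names (`SMILES`, `activity_type`, `pchembl`) and the helper `fill_unknown_except` are hypothetical, and the demo frame stands in for the real file.

```python
import pandas as pd


def fill_unknown_except(df, keep_col="pchembl", placeholder="unknown"):
    """Fill missing values with a placeholder in every column except keep_col.

    keep_col and placeholder are assumptions for illustration only.
    """
    out = df.copy()
    for col in out.columns:
        if col == keep_col:
            continue  # leave the numeric target column untouched
        # Cast to object so a string placeholder can sit in numeric columns too.
        as_object = out[col].astype(object)
        out[col] = as_object.where(as_object.notna(), placeholder)
    return out


# Tiny hypothetical frame standing in for the real dataset.
demo = pd.DataFrame(
    {
        "SMILES": ["CCO", None],
        "activity_type": [None, "Ki"],
        "pchembl": [6.5, None],
    }
)
cleaned = fill_unknown_except(demo)
print(cleaned)
```

Leaving the pchembl column as NaN rather than a string keeps it numeric, which seems preferable if it is later used as a regression target.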
Examples include:
What is the this mention at?
what is the of the or on ?
what <activity_type> of the reported on ? Ka for example.

Please, if possible, this needs some enhancement.
@MicPie, can you help me with this?
The data is large, and Hugging Face raises a problem when loading it with load_dataset.

For the 60 million data points, we will need to check whether each compound is active or not, as I found that compounds without a pchembl value are inactive. However, I have not checked this across all of the data or against other data. I will look for a way to do that.

Thank you.


MicPie · 2023-07-26T13:19:18Z

Hi @phalem thank you for looking into the Papyrus data, this looks very interesting!

For this dataset you used the data from https://data.4tu.nl/file/ca10bf7d-f508-4d54-9c9a-5a9e9c1adef9/36feebfc-4703-4290-90f2-f3e41261f0c4 right?
If so, we don't have to go over the HF Hub route at all, or maybe I'm missing something?

PS: I just merged with the latest main and applied the pre-commit hooks.

MicPie · 2023-07-26T15:35:47Z

Ok, I'm currently trying to get the data from the direct source, but the data is very big and the transform.py script needs a lot of RAM. Let's see how this works out. Depending on that, we can discuss how best to approach it.
But this seems to be a great and big dataset! :-)
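One possible way to keep memory use down, sketched here under assumptions rather than taken from the actual transform.py, is to stream the file in chunks with pandas and keep only the rows that have a pchembl value. The column name `pchembl`, the helper `filter_in_chunks`, and the in-memory demo CSV are all hypothetical; the real file may use a different separator and column names.

```python
import io

import pandas as pd


def filter_in_chunks(source, value_col="pchembl", chunksize=100_000, **read_kwargs):
    """Read a large CSV piece by piece, keeping rows where value_col is present.

    Only one chunk plus the filtered rows are held in memory at a time.
    value_col and chunksize are illustrative assumptions.
    """
    kept = []
    for chunk in pd.read_csv(source, chunksize=chunksize, **read_kwargs):
        kept.append(chunk.dropna(subset=[value_col]))
    return pd.concat(kept, ignore_index=True)


# Small in-memory CSV standing in for the real download.
demo_csv = io.StringIO("SMILES,pchembl\nCCO,6.5\nCCN,\nCCC,7.1\n")
subset = filter_in_chunks(demo_csv, chunksize=2)
print(subset)
```

For the real file, extra arguments such as `sep` or `compression` can be passed through `read_kwargs`; whether this is enough depends on how large the filtered subset itself turns out to be.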

phalem mentioned this pull request Jun 30, 2023

Dataset TODO list #75

Open

Merge branch 'main' into add_papyrus_pchembl

7f9f0de

MicPie self-requested a review July 26, 2023 13:13

MicPie assigned phalem Jul 26, 2023

phalem wants to merge 2 commits into OpenBioML:main from phalem:add_papyrus_pchembl
