New regression and classification datasets for ontology pre-training #130
sfluegel05 merged 61 commits into ChEB-AI:dev
Conversation
…lubility regression
add loading from checkpoint pretrained model fix
I added some comments. It would be great if you could have a look at them. Also, you have added quite a number of config files. Some seem to be very specific (e.g. an ELECTRA config with a different learning rate for a specific experiment). My suggestion would be to either remove those configs (and publish them in a paper-specific Zenodo archive, or mention the parameters in the paper) or group them so that new users don't get overwhelmed (e.g. all moleculenet dataset configs could go in one folder).
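For illustration, a grouped layout might look like the sketch below. All folder and file names here are made up for the example, not taken from this PR:

```
configs/
├── data/
│   ├── chebi/ ...
│   └── moleculenet/        # all moleculenet dataset configs in one place
├── model/ ...
└── training/ ...
```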
chebai/loss/semantic.py
Outdated
```python
use_sigmoidal_implication: bool = False,
weight_epoch_dependent: Union[bool | tuple[int, int]] = False,
weight_epoch_dependent: Union[bool, Tuple[int, int]] = False,
weight_epoch_dependent: Union[bool, Tuple[int, int]] = False,
```
why does weight_epoch_dependent appear twice here?
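For context, my reading of the annotation is that the parameter either toggles epoch-dependent weighting on or off (bool) or gives a (start, end) epoch range over which the weight is ramped up. The following is a minimal sketch of that interpretation; it is an assumption on my part, not necessarily what chebai/loss/semantic.py actually does:

```python
from typing import Tuple, Union


def epoch_weight(
    base_weight: float,
    current_epoch: int,
    weight_epoch_dependent: Union[bool, Tuple[int, int]] = False,
) -> float:
    """Hypothetical helper: scale a loss weight depending on the current epoch.

    False    -> use the base weight as it is.
    True     -> ramp linearly over the first 100 epochs (placeholder choice).
    (lo, hi) -> ramp the weight from 0 to base_weight between epochs lo and hi.
    """
    if weight_epoch_dependent is False:
        return base_weight
    if weight_epoch_dependent is True:
        return base_weight * min(1.0, current_epoch / 100)
    start, end = weight_epoch_dependent
    if current_epoch <= start:
        return 0.0
    if current_epoch >= end:
        return base_weight
    return base_weight * (current_epoch - start) / (end - start)
```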
chebai/models/base.py
Outdated
```python
if self.pass_loss_kwargs:
    loss_kwargs = loss_kwargs_candidates
    loss_kwargs["current_epoch"] = self.trainer.current_epoch
    # loss_kwargs["current_epoch"] = self.trainer.current_epoch
```
why is this commented out? Afaik we don't have any loss function at the moment that needs this (this was added for some experimental semantic loss features that didn't perform well). Does this break anything?
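For reference, the mechanism under discussion as I understand it: when `pass_loss_kwargs` is set, the model forwards extra keyword arguments to the loss, and a loss that wants epoch-dependent behaviour reads `current_epoch` from them. A rough sketch under that assumption follows; the class name and `ramp_epochs` parameter are illustrative, not from the code base:

```python
import torch


class EpochWeightedLoss(torch.nn.Module):
    """Illustrative loss that reads an optional current_epoch keyword argument."""

    def __init__(self, ramp_epochs: int = 10):
        super().__init__()
        self.ramp_epochs = ramp_epochs
        self.base_loss = torch.nn.BCEWithLogitsLoss()

    def forward(self, input: torch.Tensor, target: torch.Tensor, **kwargs) -> torch.Tensor:
        loss = self.base_loss(input, target)
        # If the model forwarded the trainer's current epoch, ramp the weight up.
        current_epoch = kwargs.get("current_epoch")
        if current_epoch is not None:
            loss = loss * min(1.0, current_epoch / self.ramp_epochs)
        return loss
```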
chebai/models/electra.py
Outdated
```python
from chebai.loss.semantic import DisjointLoss as ElectraChEBIDisjointLoss  # noqa
# TODO: put back in before pull request
# from chebai.loss.semantic import DisjointLoss as ElectraChEBIDisjointLoss  # noqa
```
I guess you wanted to uncomment this :)
This will be a problem for merging. I have added new SMILES tokens on a different branch (from PubChem), so the new PubChem-pretrained model (and all models based on that) will depend on those tokens.
Are the tokens you added here actually used by a model or are those just artifacts?
I have removed the part in question and will open an issue and look into what is going on with this.
Is there a reason for deleting this file?
restructuring of config files, fixing small issues from merging
addressed all comments
chebai/preprocessing/reader.py
Outdated
```python
def _get_token_index(self, token: str) -> int:
    """Returns a unique number for each token, automatically adds new tokens."""
    print(str(token))
```
I assume this is a leftover from debugging?
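Judging from the docstring, the method maps each token to a stable integer index and appends unseen tokens to the vocabulary; the stability of those indices is exactly why the token-file merge mentioned above is delicate. A self-contained sketch of that idea (my reconstruction, not the actual reader implementation):

```python
from typing import Dict, List


class TokenIndexer:
    """Hypothetical stand-in for the reader's token-index bookkeeping."""

    def __init__(self, tokens: List[str]):
        # Existing vocabulary, e.g. loaded from a tokens.txt file.
        self.cache: List[str] = list(tokens)
        self._index: Dict[str, int] = {t: i for i, t in enumerate(self.cache)}

    def _get_token_index(self, token: str) -> int:
        """Returns a unique number for each token, automatically adds new tokens."""
        if token not in self._index:
            # Appending keeps existing indices stable, which matters for models
            # pretrained against an earlier version of the vocabulary.
            self._index[token] = len(self.cache)
            self.cache.append(token)
        return self._index[token]
```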
Lint issues fixed. Unit tests most likely need to be adjusted; missing labels might cause issues in some places.
The unit tests can be fixed by adjusting the mock data for the Tox21Challenge dataset. You just need to add the missing labels to the mock data. Below are the fixed functions:

```python
@staticmethod
def data_in_dict_format() -> List[Dict]:
    data_list = [
        {"labels": [None, None, None, None, None, None, None, None, None, 0, None, None], "ident": "25848"},
        {"labels": [0, None, None, 1, None, None, None, None, None, None, None, None], "ident": "2384"},
        {"labels": [0, None, 0, None, None, None, None, None, None, None, None, None], "ident": "27102"},
        {"labels": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], "ident": "26792"},
        {"labels": [None, None, None, None, None, None, None, 1, None, 1, None, None], "ident": "26401"},
        {"labels": [None, None, None, None, None, None, None, None, None, None, None, None], "ident": "25973"},
    ]
    for dict_ in data_list:
        dict_["features"] = Tox21ChallengeMockData.FEATURE_OF_SMILES
        dict_["group"] = None
        # missing labels get added here
        if any(label is None for label in dict_["labels"]):
            dict_["missing_labels"] = [
                True if label is None else False for label in dict_["labels"]
            ]
    return data_list


@staticmethod
def get_setup_processed_output_data() -> List[Dict]:
    """
    Returns mock processed data used for testing the `setup_processed` method.

    The data contains molecule identifiers and their corresponding toxicity labels for multiple endpoints.
    Each dictionary in the list represents a molecule with its associated labels, features, and group information.

    Returns:
        List[Dict]: A list of dictionaries where each dictionary contains:
            - "features": The SMILES features of the molecule.
            - "labels": A list of toxicity endpoint labels (0, 1, or None).
            - "ident": The sample identifier.
            - "group": None (default value for the group key).
    """
    # Endpoint order: "NR-AR", "NR-AR-LBD", "NR-AhR", "NR-Aromatase", "NR-ER", "NR-ER-LBD",
    # "NR-PPAR-gamma", "SR-ARE", "SR-ATAD5", "SR-HSE", "SR-MMP", "SR-p53"
    data_list = [
        {"labels": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], "ident": "NCGC00260869-01"},
        {"labels": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], "ident": "NCGC00261776-01"},
        {"labels": [None, None, None, None, None, None, None, None, None, None, None, None], "ident": "NCGC00261380-01"},
        {"labels": [0, 0, 0, None, 0, 0, 0, 0, 0, 0, None, 1], "ident": "NCGC00261842-01"},
        {"labels": [0, 0, 1, None, 1, 1, 1, None, 1, 1, None, 1], "ident": "NCGC00261662-01"},
        {"labels": [0, 0, None, None, 1, 0, 0, 1, 0, 0, 1, 1], "ident": "NCGC00261190-01"},
    ]
    complete_list = []
    for dict_ in data_list:
        complete_list.append(
            {
                "features": Tox21ChallengeMockData.FEATURE_OF_SMILES,
                **dict_,
                "group": None,
            }
        )
        # add missing labels
        if any(label is None for label in dict_["labels"]):
            complete_list[-1]["missing_labels"] = [
                True if label is None else False for label in dict_["labels"]
            ]
    return complete_list
```
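As a side note, a mask like `missing_labels` is typically used to drop the unknown endpoints from the loss or metrics. A small sketch of that idea; it is not taken from the chebai code base, and the function name and signature are hypothetical:

```python
import torch


def masked_bce(logits: torch.Tensor, labels: torch.Tensor, missing: torch.Tensor) -> torch.Tensor:
    """BCE averaged over the endpoints that actually have a label.

    logits, labels: shape (batch, num_endpoints); missing labels can be filled
    with 0 in `labels`, they are ignored via the mask.
    missing: boolean tensor of the same shape, True where the label is unknown.
    """
    per_label = torch.nn.functional.binary_cross_entropy_with_logits(
        logits, labels.float(), reduction="none"
    )
    keep = (~missing).float()
    return (per_label * keep).sum() / keep.sum().clamp(min=1.0)
```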
Unit test adjustments done!