So far, SMILES tokenisation works by fetching all atom subsections and other characters with a SMILES parser and keeping an updated list of the tokens seen.
Problem
The list keeps growing. Models trained with an older list can't be used on newer datasets that contain unknown tokens. If diverging branches both add tokens, the resulting lists are difficult to reconcile.
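A minimal sketch of how the problem shows up, assuming a simplified regex in place of the actual SMILES parser. `TOKEN_RE`, `tokenise` and `build_vocab` are hypothetical names for illustration, not the project's code:

```python
import re

# Simplified stand-in for the SMILES parser (an assumption): bracket atoms,
# two-letter halogens, two-digit ring markers, then any single character.
TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")

def tokenise(smiles):
    """Split a SMILES string into tokens."""
    return TOKEN_RE.findall(smiles)

def build_vocab(dataset):
    """Collect every token seen in the dataset; the list grows with the data."""
    vocab = {}
    for smiles in dataset:
        for tok in tokenise(smiles):
            vocab.setdefault(tok, len(vocab))
    return vocab

train_vocab = build_vocab(["CCO", "c1ccccc1", "[NH4+]"])

# A molecule from a newer dataset introduces tokens the model has never seen.
new_tokens = [t for t in tokenise("[13CH3]Cl") if t not in train_vocab]
```

Every distinct bracket atom (each combination of isotope, hydrogen count, charge and so on) becomes its own vocabulary entry, which is why the list has no natural upper bound.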
Solution
Fixed-length tokenisation: encodes
- atom type
- isotope number
- charge
- stereoconfiguration
as one-hot vectors for each atom. This assumes a predetermined list of possible values for each property.
Bonds get a bond-type encoding; bond markers are encoded by their number; other symbols remain as-is (assuming there is a fixed number of them).
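The atom encoding described above could look roughly like this. It is a sketch only: the value lists (`ATOM_TYPES`, `ISOTOPES`, `CHARGES`, `STEREO`) and the function names are assumptions chosen for illustration, and the real lists would need to be agreed on up front.

```python
# Hypothetical predetermined value lists (assumptions, not the project's lists).
ATOM_TYPES = ["C", "N", "O", "S", "P", "F", "Cl", "Br", "I"]
ISOTOPES = [0, 2, 13, 14, 15, 18]  # 0 stands for "no isotope specified"
CHARGES = [-2, -1, 0, 1, 2]
STEREO = ["none", "@", "@@"]

def one_hot(value, choices):
    """Return a one-hot list for value; raises ValueError if value is not listed."""
    vec = [0] * len(choices)
    vec[choices.index(value)] = 1
    return vec

def encode_atom(atom_type, isotope=0, charge=0, stereo="none"):
    """Concatenate the per-property one-hot vectors into one fixed-length vector."""
    return (
        one_hot(atom_type, ATOM_TYPES)
        + one_hot(isotope, ISOTOPES)
        + one_hot(charge, CHARGES)
        + one_hot(stereo, STEREO)
    )

# The vector length depends only on the value lists, never on the dataset.
ATOM_VECTOR_LENGTH = len(ATOM_TYPES) + len(ISOTOPES) + len(CHARGES) + len(STEREO)

vec = encode_atom("C", isotope=13)
```

With this layout, a previously unseen combination such as a 13C atom with a charge needs no new vocabulary entry, as long as each individual property value is already in its list. A value outside the lists still fails, here with a `ValueError`, so the lists have to cover the intended chemistry.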