New tokenisation #166

So far, SMILES tokenisation works by fetching all atom subsections and other characters with a SMILES parser and keeping an updated list of the tokens found.

## Problem
This list keeps growing as new data is added. Models can't be used on newer datasets because those contain tokens the model has never seen. If diverging branches both add tokens, their lists are difficult to reconcile.
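A minimal sketch of how the problem shows up, assuming a simplified regex as a stand-in for the actual SMILES parser (the pattern and the names `tokenise` and `build_vocab` are hypothetical, not the project's code):

```python
import re

# Simplified stand-in for the parser: bracket atoms, two-letter halogens,
# two-digit markers, then any single character.
TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenise(smiles):
    """Split a SMILES string into tokens."""
    return TOKEN_RE.findall(smiles)


def build_vocab(dataset):
    """Assign an index to every distinct token seen in the dataset."""
    vocab = {}
    for smiles in dataset:
        for token in tokenise(smiles):
            vocab.setdefault(token, len(vocab))
    return vocab


# Vocabulary learned from the "old" data.
train = ["CCO", "c1ccccc1", "C[C@H](N)C(=O)O"]
vocab = build_vocab(train)

# A molecule from a newer dataset introduces a token the vocabulary lacks.
new_molecule = "[13CH4]"
unknown = [t for t in tokenise(new_molecule) if t not in vocab]
print(unknown)
```

Every distinct bracket atom becomes its own vocabulary entry here, so each new combination of isotope, charge and stereo marker adds one more token to the list.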

## Solution
Fixed-length tokenisation: encode
- atom type
- isotope number
- charge
- stereoconfiguration

as one-hot vectors for each atom. This assumes a predetermined list of possible values for each property.
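The per-atom encoding could be sketched as follows. The value lists, the regex for bracket atoms and the name `encode_atom` are all assumptions for illustration; the real lists would have to be agreed on, and the hydrogen count is parsed but not encoded because it is not among the properties listed above.

```python
import re

# Hypothetical predetermined value lists; the real ones would be project-wide.
ELEMENTS = ["C", "N", "O", "S", "Cl", "Br"]
ISOTOPES = [0, 2, 13, 14, 15, 18]  # 0 means "no isotope given"
CHARGES = [-2, -1, 0, 1, 2]
STEREO = ["", "@", "@@"]  # "" means "no stereo marker"

# Simplified bracket-atom pattern: [isotope symbol stereo Hcount charge]
BRACKET_RE = re.compile(
    r"\[(?P<iso>\d+)?(?P<sym>[A-Za-z][a-z]?)(?P<stereo>@{1,2})?"
    r"(?:H\d*)?(?P<charge>[+-]\d*)?\]"
)


def one_hot(value, choices):
    """Return a one-hot list; raises ValueError if value is not in choices."""
    index = choices.index(value)
    return [1 if i == index else 0 for i in range(len(choices))]


def parse_charge(text):
    """Turn '', '+', '-', '+2', '-2' into an integer charge."""
    if not text:
        return 0
    sign = 1 if text[0] == "+" else -1
    return sign * int(text[1:] or 1)


def encode_atom(token):
    """Encode one atom token as a fixed-length vector of four one-hot parts."""
    if token.startswith("["):
        match = BRACKET_RE.fullmatch(token)
        if match is None:
            raise ValueError(f"unsupported atom token: {token}")
        symbol = match["sym"]
        isotope = int(match["iso"] or 0)
        stereo = match["stereo"] or ""
        charge = parse_charge(match["charge"])
    else:
        # Bare atoms such as "C" or "Cl" carry default properties.
        symbol, isotope, stereo, charge = token, 0, "", 0
    return (
        one_hot(symbol, ELEMENTS)
        + one_hot(isotope, ISOTOPES)
        + one_hot(charge, CHARGES)
        + one_hot(stereo, STEREO)
    )


vector = encode_atom("[13CH4]")
print(len(vector), vector)
```

With these lists every atom maps to a vector of the same length (6 + 6 + 5 + 3 = 20), and a previously unseen combination such as `[13CH4]` needs no new vocabulary entry. A value outside the lists still fails, which is the assumption stated above made explicit.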

Bonds get a bond-type encoding, bond markers are encoded by their number, and other symbols remain as-is (assuming there is a fixed number of them).
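A rough sketch of the non-atom tokens, reading "bond markers" as the numeric labels (`1`, `2`, `%12`) that pair up two atoms; that reading, the bond list and the name `encode_non_atom` are assumptions:

```python
# Hypothetical fixed lists for everything that is not an atom.
BOND_TYPES = {"-": 0, "=": 1, "#": 2, ":": 3}
OTHER_SYMBOLS = ("(", ")", ".")


def encode_non_atom(token):
    """Map a non-atom token to a (kind, value) pair."""
    if token in BOND_TYPES:
        return ("bond", BOND_TYPES[token])
    if token.isdigit() or (token.startswith("%") and token[1:].isdigit()):
        # Markers are encoded by their number, so "%12" and "1" share one scheme.
        return ("marker", int(token.lstrip("%")))
    if token in OTHER_SYMBOLS:
        # Remaining symbols are kept as-is; the list is assumed to be fixed.
        return ("symbol", token)
    raise ValueError(f"unsupported token: {token}")


for token in ["=", "1", "%12", "("]:
    print(token, encode_non_atom(token))
```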


