Describe the bug
Calling the to_dict method on some of the ac_bioc dataclasses gives a different result from serialising the same object with the BioCJSON class (whose output is determined by the BioCJSONEncoder).
To Reproduce
For example, serialising the BioCPassage dataclass returns a very different result:
>>> from autocorpus.ac_bioc import BioCJSON, BioCPassage
>>> p = BioCPassage()
>>> p
BioCPassage(text='', offset=0, infons={}, sentences=[], annotations=[], relations=[])
>>> p.to_dict() # Does not include "annotations" or "relations"
{'text': '', 'offset': 0, 'infons': {}, 'sentences': []}
>>> print(BioCJSON.dumps(p)) # Does not include "sentences"
{"offset": 0, "infons": {}, "text": "", "annotations": [], "relations": []}
Expected behavior
These two approaches to get a dictionary should yield the same result.
Suggested solution
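I suggest changing the default method of the BioCJSONEncoder so that it simply calls the to_dict method of each object, and then adjusting the to_dict methods to match the desired behaviour. That way there is a single definition of what each serialised object contains.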
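A rough sketch of what that delegation could look like is below. The Passage and Sentence classes and the ToDictEncoder name are hypothetical stand-ins written for this example, not the actual ac_bioc code, and the real BioCJSONEncoder may need extra handling.

```python
import json
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Sentence:  # hypothetical stand-in for BioCSentence
    text: str = ""
    offset: int = 0

    def to_dict(self) -> dict[str, Any]:
        return {"text": self.text, "offset": self.offset}


@dataclass
class Passage:  # hypothetical stand-in for BioCPassage
    text: str = ""
    offset: int = 0
    sentences: list[Sentence] = field(default_factory=list)

    def to_dict(self) -> dict[str, Any]:
        return {
            "text": self.text,
            "offset": self.offset,
            "sentences": [s.to_dict() for s in self.sentences],
        }


class ToDictEncoder(json.JSONEncoder):  # hypothetical encoder name
    def default(self, o: Any) -> Any:
        # Delegate to the object's own to_dict so both routes share one definition.
        to_dict = getattr(o, "to_dict", None)
        if callable(to_dict):
            return to_dict()
        # Fall back to the standard behaviour (raises TypeError for unknown types).
        return super().default(o)


passage = Passage(sentences=[Sentence("hello", 2)])
encoded = json.dumps(passage, cls=ToDictEncoder)

# Both routes now produce the same dictionary.
assert json.loads(encoded) == passage.to_dict()
```

With this arrangement, any field added to or removed from to_dict automatically shows up in the JSON output as well.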
If you would rather include every field automatically and never have to update the to_dict methods, I suggest using the asdict function from the dataclasses module. It recursively converts nested dataclasses, including those inside lists and dicts. For example:
>>> from autocorpus.ac_bioc import BioCPassage, BioCSentence
>>> p = BioCPassage(sentences=[BioCSentence("hello", 2)])
>>> p
BioCPassage(text='', offset=0, infons={}, sentences=[BioCSentence(text='hello', offset=2, infons={}, annotations=[], relations=[])], annotations=[], relations=[])
>>> p.to_dict() # Missing fields from both dataclasses
{'text': '', 'offset': 0, 'infons': {}, 'sentences': [{'text': 'hello', 'offset': 2, 'infons': {}, 'annotations': []}]}
>>> from dataclasses import asdict
>>> asdict(p) # All fields present and converts the nested "sentences" to a dict
{'text': '', 'offset': 0, 'infons': {}, 'sentences': [{'text': 'hello', 'offset': 2, 'infons': {}, 'annotations': [], 'relations': []}], 'annotations': [], 'relations': []}
Adding this to a dataclass is as simple as:
from dataclasses import dataclass, asdict

@dataclass
class MyClass:
    field1: int

    # asdict becomes a method, so instance.to_dict() calls asdict(instance)
    to_dict = asdict
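As a quick check that this pattern also handles nesting, here is a small sketch with two hypothetical dataclasses (Inner and Outer are made up for this example and are not part of ac_bioc). It assumes every field holds JSON-compatible values.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Inner:  # hypothetical nested dataclass
    value: int = 0


@dataclass
class Outer:  # hypothetical parent dataclass
    name: str = ""
    items: list[Inner] = field(default_factory=list)

    # No annotation, so this is a method and not a dataclass field.
    to_dict = asdict


outer = Outer("a", [Inner(1), Inner(2)])
as_dict = outer.to_dict()

# Nested dataclasses inside the list are converted to plain dicts.
assert as_dict == {"name": "a", "items": [{"value": 1}, {"value": 2}]}

# The result is plain data, so it round-trips through json unchanged.
assert json.loads(json.dumps(as_dict)) == as_dict
```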
Context
Please, complete the following to better understand the system you are using to run Auto-CORPus.
- Operating system (eg. Windows 10): MacOS 14.7.6
- Auto-CORPus version (eg. 1.0.0): Current main branch
- Installation method (eg. pipx, pip, development mode): dev mode with poetry
- Python version (you can get this running python --version): 3.13.1