Initial tskit conversion code#346
Conversation
jeromekelleher
left a comment
There was a problem hiding this comment.
Looks great! Big validation of your general VCZ code.
bio2zarr/ts.py
Outdated
| self.num_records = self.ts.num_sites | ||
| self.positions = self.ts.sites_position | ||
|
|
||
| def _make_sample_mapping(self, ploidy): |
There was a problem hiding this comment.
This feels like it belongs in tskit. Let's file an issue there to support once we have something working committed here.
bio2zarr/ts.py
Outdated
|
|
||
| def iter_field(self, field_name, shape, start, stop): | ||
| if field_name == "position": | ||
| for pos in self.ts.tables.sites.position[start:stop]: |
There was a problem hiding this comment.
self.ts.sites_position - you're creating a copy of the tables every time here!
bio2zarr/ts.py
Outdated
|
|
||
| self.samples = [vcz.Sample(id=f"tsk_{j}") for j in range(self.num_samples)] | ||
|
|
||
| def iter_alleles(self, start, stop, num_alleles): |
There was a problem hiding this comment.
Ah yes, it's a bit unfortunate we have to do this. I can see ways to work around though.
868bc9d to
0c6fd5e
Compare
jeromekelleher
left a comment
There was a problem hiding this comment.
Looks good, I think we just need an update and a bit more round-trip testing for the initial version. We can make the interface more flexible later, but this is the core of it.
| @@ -0,0 +1,287 @@ | |||
| import logging | |||
There was a problem hiding this comment.
Thinking about the long-term API, maybe this file should be called tskit.py? Then, clients can do stuff like
bio2zarr.tskit.convert()
I don't think the name-clash of the submodule is a big enough problem to make it worth muddying the interface for.
|
I've added a load more testing - found some issues around mixed ploidy and sample nodes that aren't at the start of the nodes table. Would appreciate a check on those bits. |
|
Looks plausible to me, but I'd have to really dig in to make sure it's watertight. I think a good way to go might be to identify the API that we want from tskit to do this murky data-model migration stuff, and then make it tskit's responsibility to implement it? So for the initial version we just raise errors for mixed ploidies and stuff, keeping this code reasonably simple, and then try to fix a long-term stable API where tskit becomes responsible for mapping stuff into the VCF-Zarr model. Ultimately, we just want tskit to give us a |
|
I think we can make the API here something like ind_nodes = ts.individual_nodes() # (n, ploidy) array, padded with -1s if mixed ploidy
sample_ids = ["x{j}" for j in range(len(ind_nodes))] # Or could look up metadata if you please
bio2zarr.tskit.convert(ts, ind_nodes, sample_ids=sample_ids)The default in bio2zarr could be to call For now, we can do the |
jeromekelleher
left a comment
There was a problem hiding this comment.
Tthis is going in the right direction, but we need to tackle the efficiency issue up front. We need to
- Generate a flattened list of nodes that corresponds to the non-missing elements of the
individual_nodesarray - A mapping from this array back into the output genotypes so that we can do something like
for variant in ts.variants(samples=node_map_flattened):
for ploid in range(max_ploidy):
genotypes[non_missing[ploid], ploid] = variant.genotypes[non_missing_map[ploid]]
bio2zarr/tskit.py
Outdated
| self.node_ids_array = individual_nodes(ts) | ||
| self._num_samples = ts.num_individuals | ||
| self.max_ploidy = self.node_ids_array.shape[1] | ||
|
|
There was a problem hiding this comment.
Scope of this is too wide, you're just covering the individual_nodes call
There was a problem hiding this comment.
Also, might as well make this general straight away:
self._num_samples = self.node_ids.shape[0]
bio2zarr/tskit.py
Outdated
| gt[i, j] = genotypes[sample_index + j] | ||
| sample_index += ploidy | ||
| # For each individual, get genotypes for their nodes | ||
| for i in range(self.num_samples): |
There was a problem hiding this comment.
There must be some numpy-way of doings this without iterating over samples - otherwise this is going to be super slow
bio2zarr/tskit.py
Outdated
| phased = np.ones(shape[:-1], dtype=bool) | ||
|
|
||
| for variant in self.ts.variants( | ||
| samples=self.sample_ids, |
There was a problem hiding this comment.
Have to pass samples somehow, otherwise we can't get data for non-sample nodes out.
|
Sorry I pushed a bit early as I was moving |
|
Not sure if this one slipped the net - it's ready for another look over. |
jeromekelleher
left a comment
There was a problem hiding this comment.
LGTM. Spotted a few small things
bio2zarr/tskit.py
Outdated
|
|
||
| # Determine max number of alleles | ||
| max_alleles = 0 | ||
| for variant in self.ts.variants(): |
There was a problem hiding this comment.
This feels like overkill - how about we count the maximum number of distinct states? Like
max_alleles = 0
for site in ts.sites():
states = {site.ancestral_state}
for mut in site.mutations:
states.add(mut.derived_state)
max_alleles = max(len(states), max_alleles)
Could probably be done quicker with numpy, but this'll be fine.
It's a safe maximum, and it'll be correct in the vast majority of cases right?
There was a problem hiding this comment.
Fixed in 4d5a9b4
This seems right - it can higher if there are back mutations, but I can't see how it could be smaller.
bio2zarr/tskit.py
Outdated
| vcz.ZarrArraySpec( | ||
| source="position", | ||
| name="variant_position", | ||
| dtype="i4", |
There was a problem hiding this comment.
We could potentially have position values that won't fit into int32, we should check for this.
There was a problem hiding this comment.
Fixed in 304a85d
I've made it use i8 if larger positions are found.
bio2zarr/tskit.py
Outdated
| vcz.ZarrArraySpec( | ||
| source=None, | ||
| name="call_genotype", | ||
| dtype="i1", |
There was a problem hiding this comment.
Should check that max alleles will fit into this type. There's a function like min_dtype or something for this somewhere in this repo
bio2zarr/tskit.py
Outdated
| def __init__( | ||
| self, | ||
| ts_path, | ||
| individual_nodes, |
There was a problem hiding this comment.
Best make this individuals_nodes for consistency
bio2zarr/tskit.py
Outdated
| max_position = 0 | ||
| if self.ts.num_sites > 0: | ||
| max_position = np.max(self.ts.sites_position) | ||
| position_dtype = "i4" if max_position <= np.iinfo(np.int32).max else "i8" |
There was a problem hiding this comment.
Why not use min_int_dtype here also?
jeromekelleher
left a comment
There was a problem hiding this comment.
OK. looks like ready to squash and merge?
|
Squashed! Will merge when CI happy. |
|
On that note, shall I implement the github auto merge here? Seems to work well over at tskit. |
yes please! Can you do vcztools also? |
|
Added to the auto queue, which you can see here: https://github.com/sgkit-dev/bio2zarr/queue/main |
No description provided.