Generator implementation, refactor, documentation by Lswhiteh · Pull Request #3 · kr-colab/popvae

Lswhiteh · 2020-08-28T22:14:36Z

Hey all, finished with the implementation of the generators. I went ahead and refactored the entire codebase so I could work with it a little easier, if you like the changes that's great, if you just want the generators it should be pretty simple to throw in without functionalizing everything.

Couple notes:

I'm currently running on about half a million SNPs, ramping up testing as I go with the default grid search values. I have it limited to one GPU, and allowed GPU growth along with clearing the GPU env (which is what was already implemented), together those should do the trick but I'll continue to test. If it doesn't solve the tensor problem, not sure there's much more I can do about that, but this at least frees up more VRAM than there was before since the entire dataset isn't in memory.
The current implementation of the generators relies on writing the train/test partition split to separate HDF5 files so I can keep the filestream open during the entire generation process. It's a little clunky since the stream close is outside of the scope it's called, but it works and I can't think of a better way off the top of my head.
Along with the above, the major memory bottleneck here will still be converting VCF into arrays in the beginning. I've messed around with converting those to HDF5 files right off the bat using allel's vcf_to_hdf5() function, but the parameterization of that is super weird and I'm not used to working with vcf files. If this could be implemented, you could theoretically run this on a super low-memory system, since this is now the non-vRAM bottleneck.
I updated the grid search to be more extensible, now it uses a dict-style structure that I've had good luck with in the past. Pretty simple change but I noticed you wanted to give it more than 2 parameters to search through, should be super simple to throw some more in now by just adding them to the dict and updating the model building function.
I haven't tested the PCA or plotting yet, but I'm going to assume they're good to go since I didn't change any of the output methodology, just the way to get there.

As an aside, do you have some testing data you can run this on to see if it fits your expectations? I ran it on some small testing vcfs and some 1000Genomes data I had to make sure it was consistent with the master fork results on the same data, and it seems to be the same but you're going to know best on this one.

Sorry for the wall of text, feel free to keep/drop whatever portions of the change you like/don't of course. If you have any questions I'm happy to talk more about it!

…d then can start generator and grid search implementation

Refactor done!

… 1000 Genome files

cjbattey · 2020-08-28T23:43:55Z

Wow this looks like great work -- thanks for doing the deep dive here! That grid search was as I'm sure you noticed an... inelegant implementation so I think any refactor would be a step forward. I'm moving this weekend so will need a couple weeks to go through it in detail. Will update when we're somewhat settled in SF.

Lswhiteh · 2020-08-28T23:46:09Z

No worries at all, I'm squashing some bugs I missed, stuff like making sure validation data is always supplied even if the split percentage is smaller than the batch size. If I end up finding any more I'll shoot you another PR, no rush obviously.

Good luck with the move!

cjbattey · 2020-08-28T23:48:59Z

also, here's a link to the set of 100,000 chr1 HGDP SNPs we used in the preprint: https://www.dropbox.com/sh/amqujkodw4ccqjb/AACyOCVpxPEUQLQcPaq9KAgFa?dl=0

andrewkern · 2020-08-29T01:54:34Z

wow awesome @Lswhiteh!

@cjbattey it would be great if we could shove this SNP set into the repo as a test

Logan Whitehouse added 13 commits August 19, 2020 18:42

Functionalizing the entire codebase, adding data generator

f61a1cf

Docstring generation

d237b9d

Still refactoring

d00ea5b

Refactoring should be done. Will move on to testing in the morning an…

a103034

…d then can start generator and grid search implementation

Documentation finished, testing next.

4b1c5d7

Simplified user_args object passing

122f787

Testing, this might be too much work...

a7eb9c5

Refactor done!

e7a33f6

Merge pull request #1 from Lswhiteh/partial-refactor

d578f3a

Refactor done!

Updated docstrings

66fed79

Moving on to testing

5f92b87

Initial testing done, going to move into large-gridsearch testing and…

1f57eb3

… 1000 Genome files

Completed mid-size testing

6e24c27

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Generator implementation, refactor, documentation#3

Generator implementation, refactor, documentation#3
Lswhiteh wants to merge 13 commits into
kr-colab:masterfrom
Lswhiteh:generators

Lswhiteh commented Aug 28, 2020

Uh oh!

cjbattey commented Aug 28, 2020

Uh oh!

Lswhiteh commented Aug 28, 2020

Uh oh!

cjbattey commented Aug 28, 2020

Uh oh!

andrewkern commented Aug 29, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

Lswhiteh commented Aug 28, 2020

Uh oh!

cjbattey commented Aug 28, 2020

Uh oh!

Lswhiteh commented Aug 28, 2020

Uh oh!

cjbattey commented Aug 28, 2020

Uh oh!

andrewkern commented Aug 29, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants