Generator implementation, refactor, documentation#3
Open
Lswhiteh wants to merge 13 commits into
Open
Conversation
added 13 commits
August 19, 2020 18:42
…d then can start generator and grid search implementation
Refactor done!
… 1000 Genome files
Collaborator
|
Wow this looks like great work -- thanks for doing the deep dive here! That grid search was as I'm sure you noticed an... inelegant implementation so I think any refactor would be a step forward. I'm moving this weekend so will need a couple weeks to go through it in detail. Will update when we're somewhat settled in SF. |
Author
|
No worries at all, I'm squashing some bugs I missed, stuff like making sure validation data is always supplied even if the split percentage is smaller than the batch size. If I end up finding any more I'll shoot you another PR, no rush obviously. Good luck with the move! |
Collaborator
|
also, here's a link to the set of 100,000 chr1 HGDP SNPs we used in the preprint: https://www.dropbox.com/sh/amqujkodw4ccqjb/AACyOCVpxPEUQLQcPaq9KAgFa?dl=0 |
Member
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Hey all, finished with the implementation of the generators. I went ahead and refactored the entire codebase so I could work with it a little easier, if you like the changes that's great, if you just want the generators it should be pretty simple to throw in without functionalizing everything.
Couple notes:
I'm currently running on about half a million SNPs, ramping up testing as I go with the default grid search values. I have it limited to one GPU, and allowed GPU growth along with clearing the GPU env (which is what was already implemented), together those should do the trick but I'll continue to test. If it doesn't solve the tensor problem, not sure there's much more I can do about that, but this at least frees up more VRAM than there was before since the entire dataset isn't in memory.
The current implementation of the generators relies on writing the train/test partition split to separate HDF5 files so I can keep the filestream open during the entire generation process. It's a little clunky since the stream close is outside of the scope it's called, but it works and I can't think of a better way off the top of my head.
Along with the above, the major memory bottleneck here will still be converting VCF into arrays in the beginning. I've messed around with converting those to HDF5 files right off the bat using allel's vcf_to_hdf5() function, but the parameterization of that is super weird and I'm not used to working with vcf files. If this could be implemented, you could theoretically run this on a super low-memory system, since this is now the non-vRAM bottleneck.
I updated the grid search to be more extensible, now it uses a dict-style structure that I've had good luck with in the past. Pretty simple change but I noticed you wanted to give it more than 2 parameters to search through, should be super simple to throw some more in now by just adding them to the dict and updating the model building function.
I haven't tested the PCA or plotting yet, but I'm going to assume they're good to go since I didn't change any of the output methodology, just the way to get there.
As an aside, do you have some testing data you can run this on to see if it fits your expectations? I ran it on some small testing vcfs and some 1000Genomes data I had to make sure it was consistent with the master fork results on the same data, and it seems to be the same but you're going to know best on this one.
Sorry for the wall of text, feel free to keep/drop whatever portions of the change you like/don't of course. If you have any questions I'm happy to talk more about it!