Conversation
allofplos/update.py
Outdated
| @@ -0,0 +1,4 @@ | |||
| from .corpus.plos_corpus import main | |||
There was a problem hiding this comment.
If we're going to do this, I'd prefer that the meat of this functionality actually be moved into this file and rely on the corpus module to provide the actual APIs it uses internally. This kind of redirection is getting to be a bit gratuitous. But if this would include a true refactor (making python -m allofplos.corpus.plos_corpus invalid, then I'd be for this.
There was a problem hiding this comment.
I'll note that I've been trying to push us to have a deeper package structure, and so introducing another top level module isn't something I'm jumping to do.
However, I think it's important to separate the applications from the APIs, and this is effectively an application style command, so that's why I'm in favour of adding this at the cost of a flatter package structure to gain the isolation of purpose.
allofplos/transformations.py
Outdated
| """ | ||
| filename = os.path.basename(filename) | ||
| if correction in filename and validate_filename(filename): | ||
| fn = os.path.basename(filename) |
There was a problem hiding this comment.
couldn't you change the filenameparameter to filepath and then keep this as filename?
I always feel conflicted about fn since it can also be shorthand for function, especially in callback based code.
|
|
||
| if __name__ == "__main__": | ||
| warnings.simplefilter('always', DeprecationWarning) | ||
| warnings.warn("This update method is deprecated. use 'python -m allofplos.update'", |
| print("Pubdate error in {}".format(doi)) | ||
|
|
||
|
|
||
| def download_xml(doi, tempdir=newarticledir): |
There was a problem hiding this comment.
That is really clever! I hadn't thought of using an article object as a downloader… why don't we just bake this into the Article class itself?
Edit: I actually had thought about it in the context of the async-await stuff (#46) but I think I chose not to implement it because I thought I'd get pushback and that it wasn't how we were planning on using the Article class. No matter what, this is super clever in a good, straightforward code way!
| doi = filename | ||
| else: | ||
| doi = '' | ||
| return doi |
There was a problem hiding this comment.
You should validate the doi before returning it (raising an error if that fails).
There was a problem hiding this comment.
Do we have a custom doi validation error? If not… we should.
There was a problem hiding this comment.
BTW, you have the same problem with doi_to_path
if directory is None:
directory = get_corpus_dir()
if doi.startswith(ANNOTATION_DOI) and validate_doi(doi):
article_file = os.path.join(directory, "plos.correction." + doi.split('/')[-1] + SUFFIX_LOWER)
elif validate_doi(doi):
article_file = os.path.join(directory, doi.lstrip(PREFIX) + SUFFIX_LOWER)
# NOTE: The following check is weird, a DOI should never validate as a file name.
elif validate_filename(doi):
article_file = doi
return article_file
it should have an else case.
| vor_updated_article_list = [] | ||
| for article in tqdm(vor_updates_available, disable=None): | ||
| updated = download_updated_xml(article, vor_check=True) | ||
| for doi in tqdm(vor_updates_available, disable=None): |
There was a problem hiding this comment.
Does vor_updates_available return a list of dois?
deprecates allofplos.plos_corpus, updates README with new command
`filename_to_doi` no longer allows passing through a DOI. new `download_xml` function is based on DOI and not file, and is passed into `download_updated_xml`.
point .plos_corpus.py at update.py delete `main()` from corpus.plos_corpus.py
| ........ | ||
| ---------------------------------------------------------------------- | ||
| Ran 6 tests in 3.327s | ||
| Ran 8 tests in 0.257s |
There was a problem hiding this comment.
This is no longer accurate after #89, could you update it?
| @@ -0,0 +1,52 @@ | |||
| import os | |||
|
|
|||
| from . import get_corpus_dir, newarticledir, uncorrected_proofs_text_list | |||
There was a problem hiding this comment.
Why is uncorrect_proofs_text_list in __init__.py? Isn't it specific to the update function? Does anything else use it?
There was a problem hiding this comment.
because the .txt file is stored in the top-level directory (which made sense when plos_corpus.py was there).
There was a problem hiding this comment.
It will be wiped out when you reinstall allofplos
There was a problem hiding this comment.
yes, just like people who have installed the corpus directory in the default location. let's make obsolete in future PR
|
There is now an issue with our entrypoints… I actually didn't know we had a console_script entrypoint. but you'll want to change that too… or remove it, because I don't know if it works right now. |
allofplos/transformations.py
Outdated
| elif validate_doi(filename): | ||
| doi = filename | ||
| else: | ||
| raise Exception("Invalid format for PLOS filename: {}".format(filename)) |
There was a problem hiding this comment.
could you add the same validate_filename check at the beginning of the of this functiont hat you did below for the doi_to_url check? that way you could remove all of the other calls to validate filename later on and simplify the logic.
| # NOTE: The following check is weird, a DOI should never validate as a file name. | ||
| elif validate_filename(doi): | ||
| article_file = doi | ||
| else: |
There was a problem hiding this comment.
I really like the way you simplified the check with validate doi in filename_to_doi, could you use that method here as well? Checking for validation first that way you don't need to keep adding it as part of your conditions?
also include string that isn't validating
|
LGTM, merging |
This closes #85.
allofplos.plos_corpusin favor ofallofplos.updatefilename_to_doino longer allows passing through a DOI (it was broken by use os.path.basename() in filename_to_doi #87, oops)download_xmlfunction is based on DOI and not file, and is passed intodownload_updated_xml.