Change update functions & readme by eseiver · Pull Request #88 · PLOS/allofplos

eseiver · 2018-02-28T01:54:39Z

This closes #85.

deprecates allofplos.plos_corpus in favor of allofplos.update
updates README with new command (as well as other readme cleanup)
filename_to_doi no longer allows passing through a DOI (that was broken by #87, "use os.path.basename() in filename_to_doi", oops)
new download_xml function is based on DOI and not file, and is passed into download_updated_xml.

mpacer · 2018-02-28T02:05:47Z

allofplos/update.py

@@ -0,0 +1,4 @@
+from .corpus.plos_corpus import main


If we're going to do this, I'd prefer that the meat of this functionality actually be moved into this file and rely on the corpus module to provide the actual APIs it uses internally. This kind of redirection is getting to be a bit gratuitous. But if this would include a true refactor (making python -m allofplos.corpus.plos_corpus invalid), then I'd be for this.

I'll note that I've been trying to push us to have a deeper package structure, and so introducing another top level module isn't something I'm jumping to do.

However, I think it's important to separate the applications from the APIs, and this is effectively an application style command, so that's why I'm in favour of adding this at the cost of a flatter package structure to gain the isolation of purpose.

mpacer · 2018-02-28T02:11:24Z

allofplos/transformations.py

    """
-    filename = os.path.basename(filename)
-    if correction in filename and validate_filename(filename):
+    fn = os.path.basename(filename)


couldn't you change the filename parameter to filepath and then keep this as filename?

I always feel conflicted about fn since it can also be shorthand for function, especially in callback based code.

mpacer · 2018-02-28T02:11:37Z

allofplos/plos_corpus.py


 if __name__ == "__main__":
+    warnings.simplefilter('always', DeprecationWarning)
+    warnings.warn("This update method is deprecated. use 'python -m allofplos.update'",


Yay deprecation warnings!
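For readers unfamiliar with the pattern, here is a minimal sketch of a deprecated entry point that always shows its warning. The helper name `warn_deprecated_entry_point` and the exact message wording are assumptions for illustration, not the code in this PR. The `simplefilter` call matters because Python's default filters can hide `DeprecationWarning`.

```python
import warnings


def warn_deprecated_entry_point(old="allofplos.plos_corpus", new="allofplos.update"):
    """Emit a DeprecationWarning pointing users from the old command to the new one.

    Hypothetical helper; the module names are taken from the PR description.
    """
    # Force the warning to be shown even where the default filters would hide it.
    warnings.simplefilter("always", DeprecationWarning)
    warnings.warn(
        "'python -m {}' is deprecated; use 'python -m {}' instead.".format(old, new),
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller, not this helper
    )
```

Calling such a helper under `if __name__ == "__main__":` in the old module keeps the old command working while steering users to the new one.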

mpacer · 2018-02-28T02:30:10Z

allofplos/corpus/plos_corpus.py

        print("Pubdate error in {}".format(doi))


+def download_xml(doi, tempdir=newarticledir):


That is really clever! ~~I hadn't thought of using an article object as a downloader…~~ why don't we just bake this into the Article class itself?

Edit: I actually had thought about it in the context of the async-await stuff (#46) but I think I chose not to implement it because I thought I'd get pushback and that it wasn't how we were planning on using the Article class. No matter what, this is super clever in a good, straightforward code way!

mpacer · 2018-03-02T19:58:38Z

allofplos/transformations.py

-        doi = filename
+    else:
+        doi = ''
    return doi


You should validate the doi before returning it (raising an error if that fails).

Do we have a custom doi validation error? If not… we should.

BTW, you have the same problem with doi_to_path

https://github.com/PLOS/allofplos/pull/88/files?diff=unified#diff-ad065d41199b64627a6f15ef79720843R181

    if directory is None:
        directory = get_corpus_dir()
    if doi.startswith(ANNOTATION_DOI) and validate_doi(doi):
        article_file = os.path.join(directory, "plos.correction." + doi.split('/')[-1] + SUFFIX_LOWER)
    elif validate_doi(doi):
        article_file = os.path.join(directory, doi.lstrip(PREFIX) + SUFFIX_LOWER)
    # NOTE: The following check is weird, a DOI should never validate as a file name.
    elif validate_filename(doi):
        article_file = doi
    return article_file

it should have an else case.
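One way to read the two suggestions above (a custom validation error and an explicit else case) is the sketch below. `InvalidDOIError`, the regular expressions, and the constant values are assumptions made for illustration; the real `validate_doi` and constants in allofplos may differ.

```python
import os
import re

# Assumed values; the real constants live in allofplos and may differ.
PREFIX = "10.1371/"
ANNOTATION_DOI = "10.1371/annotation"
SUFFIX_LOWER = ".xml"

# Simplified stand-in patterns, not the library's actual regexes.
_ARTICLE_DOI = re.compile(r"^10\.1371/journal\.p[a-z]{3}\.\d{7}$")
_ANNOTATION_DOI = re.compile(r"^10\.1371/annotation/[0-9a-f-]{36}$")


class InvalidDOIError(ValueError):
    """Hypothetical error raised when a string is not a recognised PLOS DOI."""


def validate_doi(doi):
    """Return True if `doi` matches one of the simplified patterns above."""
    return bool(_ARTICLE_DOI.match(doi) or _ANNOTATION_DOI.match(doi))


def doi_to_path(doi, directory="."):
    """Map a DOI to a local XML path, raising instead of falling through."""
    if doi.startswith(ANNOTATION_DOI) and validate_doi(doi):
        name = "plos.correction." + doi.split("/")[-1] + SUFFIX_LOWER
    elif validate_doi(doi):
        # Slice off the prefix; str.lstrip would strip a character set instead.
        name = doi[len(PREFIX):] + SUFFIX_LOWER
    else:
        # The explicit else case: never return an unbound or empty value.
        raise InvalidDOIError("Invalid format for PLOS DOI: {}".format(doi))
    return os.path.join(directory, name)
```

Subclassing `ValueError` is a design choice in this sketch: callers that already catch `ValueError` keep working, while new code can catch the narrower type.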

mpacer · 2018-03-02T19:59:54Z

allofplos/corpus/plos_corpus.py

    vor_updated_article_list = []
-    for article in tqdm(vor_updates_available, disable=None):
-        updated = download_updated_xml(article, vor_check=True)
+    for doi in tqdm(vor_updates_available, disable=None):


Does vor_updates_available return a list of dois?

indeed it does

mpacer · 2018-03-05T23:58:38Z

README.rst

+      ........
      ----------------------------------------------------------------------
-      Ran 6 tests in 3.327s
+      Ran 8 tests in 0.257s


This is no longer accurate after #89, could you update it?

mpacer · 2018-03-06T00:04:33Z

allofplos/update.py

@@ -0,0 +1,52 @@
+import os
+
+from . import get_corpus_dir, newarticledir, uncorrected_proofs_text_list


Why is uncorrected_proofs_text_list in __init__.py? Isn't it specific to the update function? Does anything else use it?

because the .txt file is stored in the top-level directory (which made sense when plos_corpus.py was there).

It will be wiped out when you reinstall allofplos

yes, just like for people who have installed the corpus directory in the default location. Let's make it obsolete in a future PR.

mpacer · 2018-03-06T00:09:12Z

There is now an issue with our entrypoints… I actually didn't know we had a console_script entrypoint, but you'll want to change that too… or remove it, because I don't know if it works right now.
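For context, a `console_scripts` entry point in setuptools maps a command name to a `module:callable` target, so when `main()` moves to another module the target string has to move with it. The fragment below is a sketch only: the command name `plos_corpus` is a guess for illustration, and it assumes `allofplos.update` exposes a `main` function; it is not the project's actual `setup.py`.

```python
# Hypothetical setup.py fragment; the command name "plos_corpus" is assumed.
ENTRY_POINTS = {
    "console_scripts": [
        # Format: "<command> = <package.module>:<callable>"
        "plos_corpus = allofplos.update:main",
    ],
}

# In setup.py this dict would be passed as: setup(..., entry_points=ENTRY_POINTS)
```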

mpacer · 2018-03-06T01:05:31Z

allofplos/transformations.py

-    elif validate_doi(filename):
-        doi = filename
+    else:
+        raise Exception("Invalid format for PLOS filename: {}".format(filename))


could you add the same validate_filename check at the beginning of this function that you did below for the doi_to_url check? that way you could remove all of the other calls to validate filename later on and simplify the logic.
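The guard-clause idea can be sketched as follows: validate once at the top, raise on failure, and let the remaining branches assume a well-formed name. The pattern inside `validate_filename` and the constant values are simplified assumptions, not the library's real implementation. The `filepath`/`filename` split follows the naming suggested earlier in this review, and the exception mirrors the one in the diff above.

```python
import os
import re

# Assumed values for illustration only; allofplos defines its own constants.
PREFIX = "10.1371/"
SUFFIX_LOWER = ".xml"
CORRECTION = "plos.correction."

# Simplified stand-in for the library's filename pattern.
_FILENAME = re.compile(
    r"^(journal\.p[a-z]{3}\.\d{7}|plos\.correction\.[0-9a-f-]{36})\.xml$"
)


def validate_filename(filename):
    """Return True if `filename` looks like a PLOS article or correction file."""
    return bool(_FILENAME.match(filename))


def filename_to_doi(filepath):
    """Convert a path to a PLOS XML file into its DOI."""
    filename = os.path.basename(filepath)
    # Guard clause: reject bad input once, up front.
    if not validate_filename(filename):
        raise Exception("Invalid format for PLOS filename: {}".format(filename))
    # From here on the name is known to be valid, so no branch re-validates.
    stem = filename[:-len(SUFFIX_LOWER)]
    if stem.startswith(CORRECTION):
        return PREFIX + "annotation/" + stem[len(CORRECTION):]
    return PREFIX + stem
```

With the check hoisted, a DOI passed in by mistake fails immediately instead of being passed through.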

mpacer · 2018-03-06T01:46:02Z

allofplos/transformations.py

-    # NOTE: The following check is weird, a DOI should never validate as a file name.
-    elif validate_filename(doi):
-        article_file = doi
+    else:


I really like the way you simplified the check with validate doi in filename_to_doi, could you use that method here as well? Checking for validation first means you don't need to keep adding it as part of your conditions.

mpacer · 2018-03-06T02:12:50Z

LGTM, merging

eseiver requested review from mpacer and sbassi February 28, 2018 01:54

mpacer reviewed Feb 28, 2018

eseiver force-pushed the update branch from 38af9f4 to 5b389c6 Compare February 28, 2018 18:30

mpacer reviewed Mar 2, 2018

eseiver force-pushed the update branch from 5b389c6 to 7963d2b Compare March 3, 2018 22:14

eseiver added 2 commits March 5, 2018 13:16

change update function to allofplos.update

a0502c9

deprecates allofplos.plos_corpus, updates README with new command

fix filename_to_doi and download functions

0b4d942

`filename_to_doi` no longer allows passing through a DOI. new `download_xml` function is based on DOI and not file, and is passed into `download_updated_xml`.

eseiver force-pushed the update branch from 7963d2b to 0b4d942 Compare March 5, 2018 21:16

move main() to update.py

2dc637c

point .plos_corpus.py at update.py delete `main()` from corpus.plos_corpus.py

mpacer reviewed Mar 5, 2018

mpacer reviewed Mar 6, 2018

eseiver force-pushed the update branch from 5472117 to 9ea4d79 Compare March 6, 2018 01:28

mpacer reviewed Mar 6, 2018

more exceptions for invalid formats

a68b06c

also include string that isn't validating

eseiver force-pushed the update branch from 9ea4d79 to a68b06c Compare March 6, 2018 01:55

mpacer merged commit a6d6376 into PLOS:master Mar 6, 2018

eseiver added this to the 0.11.0 milestone Mar 6, 2018

eseiver deleted the update branch March 6, 2018 18:16
