Add 'Lost and Found' Road Hazard Dataset#950

Merged
tfds-copybara merged 12 commits into tensorflow:master from hermannsblum:lostandfound2
Nov 5, 2019

Conversation

@hermannsblum
Contributor

This PR adds the Lost and Found dataset.

I mostly followed the code from PR #225 as the datasets follow the same structure, but added a script to generate fake-data.

The gist of the dataset.info files is here.

This is my first contribution, so please excuse if I missed some step, I tried to follow the contributing guidelines closely.

So far I have not gotten any information about a CLA between ETH Zurich and Google. I am sure there is one, but my university's contact person will be on holiday for at least another week. If you can find out in the meantime whether there is an agreement, and how I can add my email to it, that would be great!

@googlebot

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with "@googlebot I signed it!" and we'll verify it.


What to do if you already signed the CLA

Individual signers
Corporate signers

ℹ️ Googlers: Go here for more info.

@googlebot googlebot added the cla: no Author has not signed CLA label Aug 27, 2019
@googlebot

CLAs look good, thanks!

ℹ️ Googlers: Go here for more info.

@googlebot googlebot added cla: yes Author has signed CLA and removed cla: no Author has not signed CLA labels Aug 27, 2019
@hermannsblum
Contributor Author

As a suggestion, it was not initially clear that I could sign individually even if my university/company did not sign a CLA, as long as I have general rights to open-source my work. The CLA behind the link made that clear, so I could sign now 👍

yield image_id, features

# Helper functions

Contributor


Please use pylint. For TF code style, check this link.

Contributor Author


I assume you would like to reduce the 2 blank lines to 1?

I checked all files with pylint using the TF code style (following the provided script in oss_scripts/lint.sh); however, there seems to be no linter (neither pylint nor pycodestyle) that actually checks for that. Fixed it anyway.

tf.compat.v1.enable_eager_execution()

# create fake files
example_dir = ('tensorflow_datasets/testing/test_data/fake_examples/'
Contributor


tfds checks this by default, so you can delete this.

# create fake files
example_dir = ('tensorflow_datasets/testing/test_data/fake_examples/'
'lost_and_found')
testing.test_utils.remake_dir(example_dir)
Contributor


You shouldn't create the fake examples on every test run. You should create them once and add the fake examples to the repository.

Contributor Author


@us I did as suggested and added the data to the repository in 06f7bca. However, this adds 120 MiB to the repository. Maybe there is another solution that I am missing? E.g., creating the data as part of the package installation, or only when tests are invoked and the data is not already there?

Contributor


@hermannsblum - Thanks for the PR.

The 120 MB is too much. One way to bring it down: create fake images that are not noise but (for example) a simple pattern (even one-color).

That way such files compress as PNG a lot better.

Contributor Author


@cyfra Totally agree. I adjusted the random PNG generation in fake_data_utils.py to always generate PNGs from a 4x4 random pattern. However, this function is potentially also used by other scripts, so should I add an argument to switch this on and off for backwards compatibility? I don't think huge PNG files are the wanted behaviour in any case, but I don't know how you usually handle this here.
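The size saving can be sketched as follows. This is an illustration with hypothetical names (`make_fake_png` is not the actual `fake_data_utils.py` helper): a PNG built from a small repeated tile compresses far better than per-pixel noise of the same dimensions.

```python
import numpy as np
from PIL import Image


def make_fake_png(path, height=1024, width=2048, channels=3, tile=4):
    """Write a fake PNG built from a repeated `tile` x `tile` random pattern.

    Repeating rows compress extremely well under PNG's deflate coding,
    unlike full-resolution random noise.
    """
    rng = np.random.default_rng(0)
    pattern = rng.integers(0, 256, size=(tile, tile, channels), dtype=np.uint8)
    # Tile the small pattern past the target size, then crop to exact shape.
    reps = (height // tile + 1, width // tile + 1, 1)
    image = np.tile(pattern, reps)[:height, :width, :]
    Image.fromarray(image).save(path)
```

Whether the small-pattern behaviour becomes the default or sits behind a flag is exactly the backwards-compatibility question raised above.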

Contributor


Yes, it'd be great to have such an argument (probably as the default?).
But please do it in a separate pull request.

download_urls['gt'] = base_url.format('gtCoarse')
if 'disparity_map' in self.builder_config.features:
download_urls['disparity_map'] = base_url.format('disparity')
# split into two steps to save space for testing data
Contributor


I didn't understand this part. Can you explain why you decided not to use download_and_extract?

Contributor Author


If I run download_and_extract, the test fails because it apparently skips the whole command during testing instead of extracting the zip files specified in DL_EXTRACT_RESULTS in lost_and_found_test.py.

Splitting the two steps correctly replaces the results of the download function with DL_EXTRACT_RESULTS and subsequently extracts the archives in the unit test, which means the test data can be stored in compressed zip archives, which saves space.

I will update the comment to explain this better.
Or is there another workaround for this issue?
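The split described here can be sketched with a stub download manager (class and path names are illustrative, not the PR's or tfds's actual code): download() is the step a test harness would mock out, while extract() still really unzips, so the fixtures can stay compressed on disk.

```python
import os
import tempfile
import zipfile


class StubDownloadManager:
    """Minimal stand-in illustrating why download() and extract() are split:
    tests substitute the download() results (as DL_EXTRACT_RESULTS does),
    while extract() still runs for real against zipped fixtures."""

    def __init__(self, fake_download_results):
        # Maps key -> path of a local fixture archive (mock download results).
        self._fake = fake_download_results

    def download(self, urls):
        # In a unit test, real downloads are skipped and replaced with
        # the pre-baked fixture archives.
        return {key: self._fake[key] for key in urls}

    def extract(self, archives):
        # Extraction is NOT mocked, so fixtures can stay zipped on disk.
        extracted = {}
        for key, archive in archives.items():
            dest = tempfile.mkdtemp()
            with zipfile.ZipFile(archive) as zf:
                zf.extractall(dest)
            extracted[key] = dest
        return extracted
```

Had download_and_extract been mocked as a single step, the archives would never be opened; splitting the calls keeps the extraction path under test while the fixtures stay compressed.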

self.builder_config.right_image_string)
if 'segmentation_label' in self.builder_config.features \
or 'instance_id' in self.builder_config.features:
download_urls['gt'] = base_url.format('gtCoarse')
Contributor


AFAIK you can pass the same URL multiple times to download_and_extract (and it will do the "smart" thing).

This might make your code easier (as you can pass download_urls['segmentation_label'] and download_urls['instance_id']).
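The deduplication this suggestion relies on can be modelled with a toy downloader that caches by URL (a hypothetical class, not the tfds implementation): two feature keys pointing at the same archive cost only one download.

```python
class DedupDownloader:
    """Toy model of a download manager that caches by URL, so passing the
    same URL under several keys triggers only one download."""

    def __init__(self):
        self.download_count = 0
        self._cache = {}

    def download_and_extract(self, urls):
        results = {}
        for key, url in urls.items():
            if url not in self._cache:
                self.download_count += 1  # a real manager would fetch + extract here
                self._cache[url] = '/extracted/' + url.rsplit('/', 1)[-1]
            results[key] = self._cache[url]
        return results
```

Both keys then resolve to the same extracted path, so _generate_examples can read segmentation labels and instance ids from one archive without special-casing.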

Contributor Author


I will update this; your suggestion is much more elegant and also more readable, thanks!

@hermannsblum hermannsblum requested a review from cyfra September 12, 2019 08:51
@Conchylicultor Conchylicultor added the dataset request Request for a new dataset to be added label Oct 15, 2019
@cyfra cyfra added the kokoro:run Run Kokoro tests label Oct 21, 2019
@kokoro-team kokoro-team removed the kokoro:run Run Kokoro tests label Oct 21, 2019
@hermannsblum
Contributor Author

@cyfra I resolved the Python 2 issue. The Python 3 issue appears to be unrelated to this code, so I merged the latest master in. Please rerun CI.

@Conchylicultor Conchylicultor added the kokoro:run Run Kokoro tests label Oct 31, 2019
@kokoro-team kokoro-team removed the kokoro:run Run Kokoro tests label Oct 31, 2019
tfds-copybara pushed a commit that referenced this pull request Nov 5, 2019
@tfds-copybara tfds-copybara merged commit 599c94f into tensorflow:master Nov 5, 2019