Add 'Lost and Found' Road Hazard Dataset#950

Merged
tfds-copybara merged 12 commits into tensorflow:master from hermannsblum:lostandfound2
Nov 5, 2019

Conversation

@hermannsblum
Contributor

This PR adds the Lost and Found dataset.

I mostly followed the code from PR #225 as the datasets follow the same structure, but added a script to generate fake-data.

The gist of the dataset.info files is here.

This is my first contribution, so please excuse if I missed some step, I tried to follow the contributing guidelines closely.

So far I have not gotten any information about a CLA between ETH Zurich and Google. I am sure there is one, but my university's contact person will be on holiday for at least another week. If you can find out in the meantime whether there is an agreement, and how I can add my email to it, that would be great!

@googlebot

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with "@googlebot I signed it!" and we'll verify it.


What to do if you already signed the CLA

Individual signers
Corporate signers

ℹ️ Googlers: Go here for more info.

@googlebot googlebot added the cla: no Author has not signed CLA label Aug 27, 2019
@googlebot

CLAs look good, thanks!

ℹ️ Googlers: Go here for more info.

@googlebot googlebot added cla: yes Author has signed CLA and removed cla: no Author has not signed CLA labels Aug 27, 2019
@hermannsblum
Contributor Author

As a suggestion, it was not initially clear that I could sign individually even if my university/company did not sign a CLA, as long as I have general rights to open-source my work. The CLA behind the link made that clear, so I could sign now 👍

yield image_id, features

# Helper functions

Contributor


Please use pylint. For TF code style, check this link.

Contributor Author


I assume you would like to reduce the 2 blank lines to 1?

I checked all files with pylint using the TF code style (following the provided script in oss_scripts/lint.sh); however, there seems to be no linter (neither pylint nor pycodestyle) that actually checks for that. Fixed it anyway.

tf.compat.v1.enable_eager_execution()

# create fake files
example_dir = ('tensorflow_datasets/testing/test_data/fake_examples/'
Contributor


tfds checks this by default, so you can delete this.

# create fake files
example_dir = ('tensorflow_datasets/testing/test_data/fake_examples/'
'lost_and_found')
testing.test_utils.remake_dir(example_dir)
Contributor


You shouldn't create the fake examples on every test run. You should create them once and add the fake examples to the repository.

Contributor Author


@us I did as suggested and added the data to the repository in 06f7bca. However, this adds 120 MiB to the repository. Maybe there is another solution that I am missing? E.g., creating the data as part of the package installation, or only when tests are invoked and the data is not already there?

Contributor


@hermannsblum - Thanks for the PR.

The 120 MB is too much. One way to bring it down: create fake images that are not noise but (for example) a simple pattern (even one-color).

That way such files compress as PNG a lot better.

Contributor Author


@cyfra Totally agree. I adjusted the random PNG generation in fake_data_utils.py to always generate PNGs from a 4x4 random pattern. However, this function is potentially also used by other scripts, so should I add an argument to switch this on and off for backwards compatibility? I don't think huge PNG files are the wanted behaviour in any case, but I don't know how you usually handle this here.
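The size saving can be sketched as follows. This is an illustration with hypothetical names (`make_fake_png` is not the actual `fake_data_utils.py` helper): a PNG built from a small repeated tile compresses far better than per-pixel noise of the same dimensions.

```python
import numpy as np
from PIL import Image


def make_fake_png(path, height=1024, width=2048, channels=3, tile=4):
    """Write a fake PNG built from a repeated `tile` x `tile` random pattern.

    Repeating rows compress extremely well under PNG's deflate coding,
    unlike full-resolution random noise.
    """
    rng = np.random.default_rng(0)
    pattern = rng.integers(0, 256, size=(tile, tile, channels), dtype=np.uint8)
    # Tile the small pattern past the target size, then crop to exact shape.
    reps = (height // tile + 1, width // tile + 1, 1)
    image = np.tile(pattern, reps)[:height, :width, :]
    Image.fromarray(image).save(path)
```

Whether the small-pattern behaviour becomes the default or sits behind a flag is exactly the backwards-compatibility question raised above.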

Contributor


Yes, it'd be great to have such an argument (probably as the default?).
But please do it in a separate pull request.

download_urls['gt'] = base_url.format('gtCoarse')
if 'disparity_map' in self.builder_config.features:
download_urls['disparity_map'] = base_url.format('disparity')
# split into two steps to save space for testing data
Contributor


I didn't understand this part. Can you explain why you decided not to use download_and_extract?

Contributor Author


If I run download_and_extract, the test fails because it apparently skips the whole command during testing instead of extracting the zip files specified in DL_EXTRACT_RESULTS in lost_and_found_test.py.

Splitting the two steps correctly replaces the results of the download function with DL_EXTRACT_RESULTS and subsequently extracts the archives in the unit test, which means the test data can be stored in compressed zip archives, which saves space.

I will update the comment to explain this better.
Or is there another workaround for this issue?
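The split described here can be sketched with a stub download manager (class and path names are illustrative, not the PR's or tfds's actual code): download() is the step a test harness would mock out, while extract() still really unzips, so the fixtures can stay compressed on disk.

```python
import os
import tempfile
import zipfile


class StubDownloadManager:
    """Minimal stand-in illustrating why download() and extract() are split:
    tests substitute the download() results (as DL_EXTRACT_RESULTS does),
    while extract() still runs for real against zipped fixtures."""

    def __init__(self, fake_download_results):
        # Maps key -> path of a local fixture archive (mock download results).
        self._fake = fake_download_results

    def download(self, urls):
        # In a unit test, real downloads are skipped and replaced with
        # the pre-baked fixture archives.
        return {key: self._fake[key] for key in urls}

    def extract(self, archives):
        # Extraction is NOT mocked, so fixtures can stay zipped on disk.
        extracted = {}
        for key, archive in archives.items():
            dest = tempfile.mkdtemp()
            with zipfile.ZipFile(archive) as zf:
                zf.extractall(dest)
            extracted[key] = dest
        return extracted
```

Had download_and_extract been mocked as a single step, the archives would never be opened; splitting the calls keeps the extraction path under test while the fixtures stay compressed.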

self.builder_config.right_image_string)
if 'segmentation_label' in self.builder_config.features \
or 'instance_id' in self.builder_config.features:
download_urls['gt'] = base_url.format('gtCoarse')
Contributor


AFAIK you can pass the same URL multiple times to download_and_extract (and it will do the "smart" thing).

This might make your code easier (as you can pass download_urls['segmentation_label'] and download_urls['instance_id']).
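The deduplication this suggestion relies on can be modelled with a toy downloader that caches by URL (a hypothetical class, not the tfds implementation): two feature keys pointing at the same archive cost only one download.

```python
class DedupDownloader:
    """Toy model of a download manager that caches by URL, so passing the
    same URL under several keys triggers only one download."""

    def __init__(self):
        self.download_count = 0
        self._cache = {}

    def download_and_extract(self, urls):
        results = {}
        for key, url in urls.items():
            if url not in self._cache:
                self.download_count += 1  # a real manager would fetch + extract here
                self._cache[url] = '/extracted/' + url.rsplit('/', 1)[-1]
            results[key] = self._cache[url]
        return results
```

Both keys then resolve to the same extracted path, so _generate_examples can read segmentation labels and instance ids from one archive without special-casing.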

Contributor Author


I will update this; your suggestion is much more elegant and also more readable, thanks!

@hermannsblum hermannsblum requested a review from cyfra September 12, 2019 08:51
@Conchylicultor Conchylicultor added the dataset request Request for a new dataset to be added label Oct 15, 2019
@cyfra cyfra added the kokoro:run Run Kokoro tests label Oct 21, 2019
@kokoro-team kokoro-team removed the kokoro:run Run Kokoro tests label Oct 21, 2019
@hermannsblum
Contributor Author

@cyfra I resolved the Python 2 issue. The Python 3 issue appears to be unrelated to this code, so I merged the latest master in. Please rerun CI.

@Conchylicultor Conchylicultor added the kokoro:run Run Kokoro tests label Oct 31, 2019
@kokoro-team kokoro-team removed the kokoro:run Run Kokoro tests label Oct 31, 2019
tfds-copybara pushed a commit that referenced this pull request Nov 5, 2019
@tfds-copybara tfds-copybara merged commit 599c94f into tensorflow:master Nov 5, 2019