
[ISSUE GUIDE] dataset path mismatch, hidden files causing empty task names, and corrupted output HDF5s #27

@MrZoyo


Issue 1 — Dataset directory layout mismatch (preprocess finds no data / weird downstream errors)

Symptom

preprocess_libero.py fails early or behaves unexpectedly because it does not find expected LIBERO suite files in the directory it scans.

Root cause

The download/output directory conventions are inconsistent: the preprocessing script expects suites under a data/libero/<suite> layout, but the dataset may be placed directly under data/<suite> (e.g., data/libero_10, data/libero_90, …).

Fix

Move suite folders under data/libero/ so the expected structure exists:

mkdir -p data/libero
mv data/libero_10 data/libero/
mv data/libero_90 data/libero/
mv data/libero_goal data/libero/
mv data/libero_object data/libero/
mv data/libero_spatial data/libero/

After this, data/libero/ contains:

  • libero_10, libero_90, libero_goal, libero_object, libero_spatial
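To confirm the layout before launching a long preprocessing run, the five suite folders can be checked programmatically. This is only a minimal sketch: the helper name missing_suites is hypothetical (it is not part of the repository), and it assumes the suite names listed above and the data/libero root.

```python
from pathlib import Path

# Suite folder names taken from the list above.
EXPECTED_SUITES = (
    "libero_10",
    "libero_90",
    "libero_goal",
    "libero_object",
    "libero_spatial",
)


def missing_suites(libero_root="data/libero"):
    """Return the expected suite names that are not directories under libero_root."""
    root = Path(libero_root)
    return [name for name in EXPECTED_SUITES if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_suites()
    if missing:
        print("Missing suite folders under data/libero:", ", ".join(missing))
    else:
        print("All expected suite folders are present.")
```

An empty result means the structure matches what the preprocessing script is described as expecting here.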

Issue 2 — IndexError: string index out of range in get_task_name_from_file_name

Symptom

Preprocessing crashes with:

IndexError: string index out of range
... in get_task_name_from_file_name
if name[0].isupper():

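In our case this sometimes happened after many demos had already been processed (e.g., after completing 50/50 demos for a task), which is consistent with the crash occurring only once the loop reaches the offending directory entry.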

Root cause

The script iterates over directory entries using something like os.listdir(...) and derives task names via split('.'). For any entry whose name starts with a dot (e.g., .DS_Store or .ipynb_checkpoints), split('.')[0] is an empty string (""), so name[0] raises the IndexError. Other non-.hdf5 entries yield a non-empty name, so they pass this check but are still not valid inputs.
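The failure can be reproduced in isolation. The snippet below is a stand-in, not the repository's code: leading_name and starts_with_upper are hypothetical helpers that mimic only the split('.') step and the name[0].isupper() line quoted in the traceback.

```python
def leading_name(entry: str) -> str:
    # Assumed stand-in for how a name is derived from a directory entry.
    return entry.split(".")[0]


def starts_with_upper(entry: str) -> bool:
    name = leading_name(entry)
    # Raises IndexError when name is the empty string.
    return name[0].isupper()


if __name__ == "__main__":
    for entry in [".DS_Store", ".ipynb_checkpoints", "example_demo.hdf5"]:
        try:
            result = starts_with_upper(entry)
            print(repr(entry), "->", repr(leading_name(entry)), result)
        except IndexError as exc:
            print(repr(entry), "->", repr(leading_name(entry)), "IndexError:", exc)
```

The file name example_demo.hdf5 is made up for illustration; any name that does not start with a dot behaves the same way.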

Fix (recommended)

Ensure the input suite directory contains only .hdf5 files (remove hidden files / junk entries), e.g.:

find data/libero/libero_spatial -maxdepth 1 -name ".DS_Store" -delete
find data/libero/libero_spatial -maxdepth 1 -name ".ipynb_checkpoints" -exec rm -rf {} +

Fix (more robust, code-level)

Change the outer traversal to only iterate over .hdf5 files (e.g., glob("*.hdf5")) instead of os.listdir. This avoids crashes even if hidden files exist.
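A minimal sketch of that traversal, assuming the suite directory holds the .hdf5 files at its top level. The helper name iter_hdf5_files is hypothetical. Note that pathlib's glob("*.hdf5") also matches dot-prefixed names (for example macOS "._*" sidecar files), so the sketch filters those out explicitly.

```python
from pathlib import Path


def iter_hdf5_files(suite_dir):
    """Return the regular, non-hidden *.hdf5 files directly under suite_dir, sorted by name."""
    return sorted(
        p
        for p in Path(suite_dir).glob("*.hdf5")
        if p.is_file() and not p.name.startswith(".")
    )


if __name__ == "__main__":
    # The path is an example; point it at the suite being processed.
    for path in iter_hdf5_files("data/libero/libero_spatial"):
        print(path.name)
```

Sorting is optional, but it makes the processing order reproducible across runs and machines, which os.listdir does not guarantee.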


Issue 3 — OSError: Unable to open file (bad object header version number) while preprocessing

Symptom

Preprocessing crashes with:

OSError: Unable to open file (bad object header version number)
... in inital_save_h5
with h5py.File(path, 'r') as f:

Notably:

  • It can happen very early (0%) or after completing some demos.
  • A scan of the input dataset .hdf5 files shows they are all readable (bad=0), yet the error persists.

Root cause

This is not caused by the input dataset .hdf5 files. It is caused by a corrupted output .hdf5 file under the preprocessing output directory:

  • Output root: data/atm_libero/<suite>/.../demo_k.hdf5

If a previous run was interrupted (killed job, disconnect, disk full, etc.), a partially written demo_*.hdf5 can remain. When rerunning with --skip_exist 1, the script tries to open existing output files in read mode to decide whether to skip; opening a corrupted output file triggers the HDF5 “bad object header” error.

Fix (per-file)

Delete the specific corrupted output file and rerun with --skip_exist 1.
Example:

rm -f data/atm_libero/libero_goal/<task_name>/demo_1.hdf5
python -m scripts.preprocess_libero --suite libero_goal --skip_exist 1

Fix (suite-level, fastest if nothing valuable was produced)

If the suite fails at 0% or the existing output is not needed, remove the entire output suite folder and rerun:

rm -rf data/atm_libero/libero_90
python -m scripts.preprocess_libero --suite libero_90 --skip_exist 1

Debug tip

If the stack trace does not show which file is corrupted, add a debug print before opening the file in inital_save_h5():

print("[DEBUG] opening existing h5:", path, flush=True)

Then rerun once to get the exact path to delete.
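As an alternative to editing the script, the output tree can be scanned up front for files that h5py cannot open. This is a sketch under assumptions: find_unreadable_h5 is a hypothetical helper, the output root follows the data/atm_libero/<suite> layout described above, and the opener is passed in as a parameter only so the scan logic can be exercised without h5py installed.

```python
from pathlib import Path


def _open_with_h5py(path):
    # Imported lazily so the helper can be loaded even where h5py is absent.
    import h5py

    with h5py.File(str(path), "r"):
        pass


def find_unreadable_h5(root, try_open=_open_with_h5py):
    """Return all *.hdf5 files under root (recursively) whose open attempt raises OSError."""
    bad = []
    for path in sorted(Path(root).rglob("*.hdf5")):
        try:
            try_open(path)
        except OSError:
            bad.append(path)
    return bad


if __name__ == "__main__":
    # Example suite; adjust to the one being preprocessed.
    for path in find_unreadable_h5("data/atm_libero/libero_goal"):
        print("unreadable:", path)
```

The same function can be pointed at data/libero/<suite> to repeat the input scan mentioned above. A file that opens successfully may still be incomplete, so this only catches files that fail at open time.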

Preventive improvement (code-level)

Wrap the h5py.File(path, 'r') open with try/except OSError and, on failure, delete and regenerate the corrupted output file (so long runs don’t get stuck on a single bad artifact).
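One possible shape for this guard is sketched below. It is not the repository's implementation: usable_existing_output is a hypothetical helper, and how inital_save_h5() uses the opened file is not shown in this issue, so the sketch only decides whether an existing output can be kept. The opener is a parameter so the logic can be tried without h5py.

```python
from pathlib import Path


def _open_with_h5py(path):
    # Imported lazily so the helper can be loaded even where h5py is absent.
    import h5py

    with h5py.File(str(path), "r"):
        pass


def usable_existing_output(path, try_open=_open_with_h5py):
    """Return True if path exists and opens cleanly.

    If the file exists but opening it raises OSError, it is deleted and
    False is returned so the caller regenerates it instead of crashing.
    """
    path = Path(path)
    if not path.exists():
        return False
    try:
        try_open(path)
    except OSError:
        print("[WARN] removing unreadable output:", path, flush=True)
        path.unlink()
        return False
    return True
```

A caller honoring --skip_exist 1 could skip a demo only when this returns True and otherwise fall through to the normal generation path. Deleting automatically is a design choice; logging the path and leaving the file in place is the more conservative option.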
