Skip to content

Random lock-ups when running tests/test_api_base.py::TestCreateCopy::test #45

@josch

Description

@josch

Hi,

I have sunk a few days of my time into this and was unable to come up with a solution. I'm documenting it here in case anybody else wants to pick this up where I left of.

The problem is, that on the Debian buildds, on the Debian gitlab CI as well as on tests.reproducible-builds.org, the bmaptool test suite runs into a dead-lock around ~50% of the time it is run. I am unable to reproduce this problem on my own laptop, so debugging this was not trivial. I hacked in a periodic call to pstree -l -a to have a look at what the process tree looks like when the deadlock happens:

  • bzip2 stalls:
  rules -f debian/rules override_dh_auto_test
    `-sh -c sh -c 'while sleep 10; do pstree -l -a 1641 >&2; done' & pid=$!; \\\012ret=0; \\\012timeout 15m faketime '2022-06-26 17:10:50' dh_auto_test  -- --test-args=-vvv || ret=$?; \\\012kill $pid; \\\012wait $pid || : ; \\\012exit $ret
        |-sh -c while sleep 10; do pstree -l -a 1641 >&2; done
        |   `-pstree -l -a 1641
        `-timeout 15m faketime 2022-06-26 17:10:50 dh_auto_test -- --test-args=-vvv
            `-faketime 2022-06-26 17:10:50 dh_auto_test -- --test-args=-vvv
                `-dh_auto_test /usr/bin/dh_auto_test -- --test-args=-vvv
                    `-pybuild /usr/bin/pybuild --test --test-pytest -i python{version} -p 3.13 --test-args=-vvv
                        `-sh -c cd /tmp/reprotest.GU9bw5/const_build_path/const_build_path/.pybuild/cpython3_3.13/build; python3.13 -m pytest -vvv
                            `-python3.13 -m pytest -vvv
                                `-sh -c bzip2 -c -k  /tmp/reprotest.GU9bw5/const_build_path/const_build_path/.pybuild/cpython3_3.13/build/4Khole_minus_1_vq64huef.img
  • xz stalls:
  rules -f debian/rules override_dh_auto_test
    `-sh -c sh -c 'while sleep 10; do pstree -l -a 13086 >&2; done' & pid=$!; \\\012ret=0; \\\012timeout 15m faketime '2022-06-26 17:10:50' dh_auto_test  -- --test-args=-vvv || ret=$?; \\\012kill $pid; \\\012wait $pid || : ; \\\012exit $ret
        |-sh -c while sleep 10; do pstree -l -a 13086 >&2; done
        |   `-pstree -l -a 13086
        `-timeout 15m faketime 2022-06-26 17:10:50 dh_auto_test -- --test-args=-vvv
            `-faketime 2022-06-26 17:10:50 dh_auto_test -- --test-args=-vvv
                `-dh_auto_test /usr/bin/dh_auto_test -- --test-args=-vvv
                    `-pybuild /usr/bin/pybuild --test --test-pytest -i python{version} -p 3.13 --test-args=-vvv
                        `-sh -c cd /tmp/reprotest.jfoiN4/const_build_path/const_build_path/.pybuild/cpython3_3.13/build; python3.13 -m pytest -vvv
                            `-python3.13 -m pytest -vvv
                                `-sh -c xz -c -k  /tmp/reprotest.jfoiN4/const_build_path/const_build_path/.pybuild/cpython3_3.13/build/4Khole_plus_1_kqm3eugf.img
  • tar stalls:
  rules -f debian/rules override_dh_auto_test
    `-sh -c sh -c 'while sleep 10; do pstree -l -a 13126 >&2; done' & pid=$!; \\\012ret=0; \\\012timeout 15m faketime '2022-06-26 17:10:50' dh_auto_test  -- --test-args=-vvv || ret=$?; \\\012kill $pid; \\\012wait $pid || : ; \\\012exit $ret
        |-sh -c while sleep 10; do pstree -l -a 13126 >&2; done
        |   `-pstree -l -a 13126
        `-timeout 15m faketime 2022-06-26 17:10:50 dh_auto_test -- --test-args=-vvv
            `-faketime 2022-06-26 17:10:50 dh_auto_test -- --test-args=-vvv
                `-dh_auto_test /usr/bin/dh_auto_test -- --test-args=-vvv
                    `-pybuild /usr/bin/pybuild --test --test-pytest -i python{version} -p 3.13 --test-args=-vvv
                        `-sh -c cd /tmp/reprotest.gP4xrW/const_build_path/const_build_path/.pybuild/cpython3_3.13/build; python3.13 -m pytest -vvv
                            `-python3.13 -m pytest -vvv
                                `-sh -c tar -c -j -O -P -C /  /tmp/reprotest.gP4xrW/const_build_path/const_build_path/.pybuild/cpython3_3.13/build/4Khole_minus_1_gn6_77ax.img
  • gzip stalls:
  rules -f debian/rules override_dh_auto_test
    `-sh -c sh -c 'while sleep 10; do pstree -l -a 1640 >&2; done' & pid=$!; \\\012timeout 15m faketime '2022-06-26 17:10:50' dh_auto_test -- -vvv;  \\\012kill $pid; \\\012wait $pid || :
        |-sh -c while sleep 10; do pstree -l -a 1640 >&2; done
        |   `-pstree -l -a 1640
        `-timeout 15m faketime 2022-06-26 17:10:50 dh_auto_test -- -vvv
            `-faketime 2022-06-26 17:10:50 dh_auto_test -- -vvv
                `-dh_auto_test /usr/bin/dh_auto_test -- -vvv
                    `-pybuild /usr/bin/pybuild --test --test-pytest -i python{version} -p 3.13 -vvv
                        `-sh -c cd /tmp/reprotest.M8T3OF/const_build_path/const_build_path/.pybuild/cpython3_3.13/build; python3.13 -m pytest tests
                            `-python3.13 -m pytest tests
                                `-sh -c gzip -c /tmp/reprotest.M8T3OF/const_build_path/const_build_path/.pybuild/cpython3_3.13/build/4Khole_minus_1_4up2jzoj.img

All of these tools are invoked by _generate_compressed_files() in tests/test_api_base.py so my guess was that the way that the tools are called using subprocess.Popen(), followed by Popen.wait() somehow causes the kind of deadlock that is described in the documentation of Popen.wait(). So I replaced those subprocess calls with calls to subprocess.check_output() and slurped the whole standard output into a Python string (it's only ~2 MB of data). But this did not change the behaviour at all.

Searching online for deadlocks involving the subprocess module revealed that in Python 2.7 there was a problem of deadlocks when the subprocess module was called from Python threads. The fix was supposedly to add close_fds=True to the Popen call. I tried that too but it resulted no change in behaviour.

Then I stumbled across a slightly different deadlock:

  rules -f debian/rules override_dh_auto_test
    `-sh -c sh -c 'while sleep 10; do pstree -l -a 1641 >&2; done' & pid=$!; \\\012ret=0; \\\012timeout 15m faketime '2022-06-26 17:10:50' dh_auto_test  -- --test-args=-vvv || ret=$?; \\\012kill $pid; \\\012wait $pid || : ; \\\012exit $ret
        |-sh -c while sleep 10; do pstree -l -a 1641 >&2; done
        |   `-pstree -l -a 1641
        `-timeout 15m faketime 2022-06-26 17:10:50 dh_auto_test -- --test-args=-vvv
            `-faketime 2022-06-26 17:10:50 dh_auto_test -- --test-args=-vvv
                `-dh_auto_test /usr/bin/dh_auto_test -- --test-args=-vvv
                    `-pybuild /usr/bin/pybuild --test --test-pytest -i python{version} -p 3.13 --test-args=-vvv
                        `-sh -c cd /tmp/reprotest.dL6CBt/const_build_path/const_build_path/.pybuild/cpython3_3.13/build; python3.13 -m pytest -vvv
                            `-python3.13 -m pytest -vvv
                                `-df -PT -- /tmp/reprotest.dL6CBt/const_build_path/const_build_path/.pybuild/cpython3_3.13/build/4Khole_plus_1_3wg1y132.img

The invocation of "df -PT" is from a completely different part of the code. It comes from get_file_system_type() in bmaptool/BmapHelpers.py which just runs:

       proc = subprocess.Popen(["df", "-PT", "--", abspath], stdout=PIPE, stderr=PIPE)
       stdout, stderr = proc.communicate()

But this should be completely harmless, no???

So maybe the problem is not with tests/test_api_base.py but the problem is a more fundamental one. I looked into other parts of the codebase and _open_compressed_file() in bmaptool/TransRead.py does an aweful lot of threading in combination with opening decompressors via the subprocess module.

I have no solution to the problem yet but I do have a workaround:

--- a/tests/helpers.py
+++ b/tests/helpers.py
@@ -284,29 +284,19 @@ def calculate_chksum(file_path):
     return hash_obj.hexdigest()
█
█
+import subprocess
+
+
 def copy_and_verify_image(image, dest, bmap, image_chksum, image_size):
     """
     Copy image 'image' using bmap file 'bmap' to the destination file 'dest'
     and verify the resulting image checksum.
     """
█
-    f_image = TransRead.TransRead(image)
-    f_dest = open(dest, "w+b")
-    if bmap:
-        f_bmap = open(bmap, "r")
-    else:
-        f_bmap = None
-
-    writer = BmapCopy.BmapCopy(f_image, f_dest, f_bmap, image_size)
-    # Randomly decide whether we want the progress bar or not
-    if bool(random.getrandbits(1)) and sys.stdout.isatty():
-        writer.set_progress_indicator(sys.stdout, None)
-    writer.copy(bool(random.getrandbits(1)), bool(random.getrandbits(1)))
+    subprocess.check_call(
+        ["python3", "-m", "bmaptool", "copy"]
+        + (["--bmap", bmap] if bmap else ["--nobmap"])
+        + [image, dest]
+    )-    # Compare the original file and the copy are identical
     assert calculate_chksum(dest) == image_chksum
-
-    if f_bmap:
-        f_bmap.close()
-    f_dest.close()
-    f_image.close()

Instead of calling BmapCopy.BmapCopy().copy(), spawn a new process running the
full bmaptool utility with the given options. This is slower than the existing
code because bmaptool gets spawned a lot but it will effectively test the same
thing. Except that with this change we do not anymore test using BmapCopy as a
library. But I do not believe that this test is useful for Debian because we do
not ship this as a library but we ship it as a program that is to be called
from the terminal. And when run like that, the deadlock problem never occurs.

I also documented this in the associated Debian bug: https://bugs.debian.org/1081336

Maybe somebody else feels motivated to look into this. It would probably help being able to reproduce this issue locally instead of a remote machine without ssh access. :)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions