
Parsing and unparsing have super-linear runtime #8

@thurstond

Description

Parsing and unparsing have super-linear runtime, largely because the code relies on OCaml list append (acc @ [c]) and string concatenation (^), both of which copy their left operand, so repeated use becomes quadratic.

Parsing: parse_tilde quadratic (/cubic?) behavior

for n in 1000 2000 4000 8000 16000 32000 64000; do perl -e "print '~'; print 'a' x $n;" > /tmp/t; time ./parse_to_json.native /tmp/t > /dev/null; done

user    0m0.015s
user    0m0.034s
user    0m0.153s
user    0m0.830s
user    0m4.540s
user    0m27.661s
user    3m4.591s

(real/sys runtime omitted for brevity)

Notes:

  • This is likely because of ast.ml: | c::s' -> parse_tilde (acc @ [c]) s'. Each acc @ [c] copies the whole accumulator, so n appends cost O(n^2) in total. It's not obvious to me why the measured growth looks cubic (rather than quadratic), though.
    • An alternative would be to prepend to the list, and then reverse the result once at the end.
  • The parse_to_json.native test application is from https://github.com/andromeda/pash/tree/main/compiler/parser
  • To avoid conflating the libdash runtime with the JSON serialization, I disabled the JSON serialization code (print_ast) in the test app
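A minimal sketch of the prepend-then-reverse alternative. parse_chars_linear is a hypothetical stand-in for the parse_tilde loop, not the actual ast.ml code; it shows the accumulator pattern only:

```ocaml
(* Appending with `acc @ [c]` copies the whole accumulator on every step,
   O(n^2) total. Prepending is O(1) per character, with one O(n) reversal
   at the end, so the whole traversal is O(n). *)
let parse_chars_linear (s : char list) : char list =
  let rec go acc = function
    | [] -> List.rev acc          (* reverse once at the end *)
    | c :: s' -> go (c :: acc) s' (* O(1) cons instead of acc @ [c] *)
  in
  go [] s
```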

Parsing: to_assign cubic (?) behavior

for n in 1000 2000 4000 8000 16000 32000 64000; do perl -e "print 'f' x $n; print '=pay_respects;'" > /tmp/t1; time ./parse_to_json.native /tmp/t1 > /dev/null; done

user    0m0.011s
user    0m0.021s
user    0m0.098s
user    0m0.571s
user    0m3.749s
user    0m25.097s
user    2m52.969s

This also uses the expensive list-append pattern, in ast.ml: | C c :: a -> to_assign (v @ [c]) a.
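The same prepend-then-reverse fix applies here. split_assign below is a hypothetical sketch (not the real to_assign, which operates on libdash's arg type): it carries the name characters reversed and reverses once when the '=' is reached:

```ocaml
(* Split "name=value" given as a char list, accumulating the name with O(1)
   cons and a single O(n) reversal, instead of v @ [c] on every character. *)
let split_assign (chars : char list) : char list * char list =
  let rec go v = function
    | '=' :: rest -> (List.rev v, rest)  (* reverse once at the '=' *)
    | c :: rest -> go (c :: v) rest      (* O(1) cons *)
    | [] -> (List.rev v, [])             (* no '=' found *)
  in
  go [] chars
```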

Unparsing: ^ string concatenation considered harmful

OCaml's "^" string operator is not optimized; concatenating n strings one at a time copies everything built so far at each step, which can take O(n^2) runtime (https://discuss.ocaml.org/t/whats-the-fastest-way-to-concatenate-strings/2616/7). This is arguably a compiler/runtime issue: CPython, for example, optimizes the common in-place concatenation case (https://mail.python.org/pipermail/python-dev/2013-February/124031.html).

ast.ml's unparsing is essentially a chain of "^" operations, so its worst-case runtime is super-linear across the board.
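The standard fix is to accumulate into a Buffer, which grows amortized and makes n appends cost O(total length) rather than O(n^2). concat_with_buffer is an illustrative sketch, not libdash's API:

```ocaml
(* Build one string from many fragments via Buffer instead of repeated "^".
   Buffer.add_string appends in amortized O(length of fragment), so the
   whole loop is linear in the output size. *)
let concat_with_buffer (parts : string list) : string =
  let buf = Buffer.create 256 in
  List.iter (Buffer.add_string buf) parts;
  Buffer.contents buf
```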

Here's an example of a long pipeline, showing quadratic runtime for json_to_shell:

for n in 2000 4000 8000 16000 32000 64000 128000; do (echo -n "echo 'Hello '"; for f in `seq 1 $n`; do echo -n "| cat "; done; echo) > /tmp/s1; cat /tmp/s1 | ./parse_to_json.native > /tmp/j1; time ./json_to_shell.native /tmp/j1 | md5sum; done

user    0m0.035s
user    0m0.112s
user    0m0.508s
user    0m1.920s
user    0m13.909s
user    0m52.966s
user    4m26.274s

I also tried removing the Ast.to_string operation in json_to_shell.ml, to show that the JSON deserialization by itself is fast (linear); thus, the quadratic runtime is due to the libdash core.

Unparsing: fresh_marker for heredocs is slow on adversarial inputs

fresh_marker tries increasingly long variants of the terminator (EOF, EOFF, EOFFF, ...) until it finds one that does not occur in the heredoc body. Each candidate is checked against the whole body, so adversarial inputs that deliberately contain all of those markers force super-linear work:

cat <<CTHULHU
EOF
EOFF
EOFFF
EOFFFF
EOFFFFF
CTHULHU
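A linear alternative is possible because a heredoc terminator is only recognized when it appears on a line by itself: one pass over the body's lines can find the longest line of the form EOF followed by F's, and emit a marker one 'F' longer. This is a sketch under that assumption; the real fresh_marker's signature and marker scheme may differ:

```ocaml
(* Pick a heredoc marker of the form "EOF" ^ 'F'* that cannot appear as a
   whole line in [body], in a single O(length of body) pass. *)
let fresh_marker_linear (body : string) : string =
  let lines = String.split_on_char '\n' body in
  (* Length of the 'F' run if the line is "EOF" ^ 'F'*, otherwise -1. *)
  let f_run line =
    let n = String.length line in
    if n >= 3 && String.sub line 0 3 = "EOF" then begin
      let ok = ref true in
      for i = 3 to n - 1 do
        if line.[i] <> 'F' then ok := false
      done;
      if !ok then n - 3 else -1
    end else -1
  in
  let longest = List.fold_left (fun m l -> max m (f_run l)) (-1) lines in
  "EOF" ^ String.make (longest + 1) 'F'
```

On the adversarial input above (lines EOF through EOFFFFF), this returns EOFFFFFF after a single scan instead of re-searching the body for each candidate.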

    Labels: enhancement (New feature or request)