Skip to content

[tools]Import Wizard for Installation Setup#599

Merged
wenjin272 merged 35 commits into
apache:mainfrom
JinkunLiu:feature/import-wizard
May 22, 2026
Merged

[tools]Import Wizard for Installation Setup#599
wenjin272 merged 35 commits into
apache:mainfrom
JinkunLiu:feature/import-wizard

Conversation

@JinkunLiu
Copy link
Copy Markdown
Contributor

@JinkunLiu JinkunLiu commented Apr 1, 2026

Linked issue: #586

Purpose of change

Import Wizard for Installation Setup

Install Flink & Install PyFlink Environment

Tests

I have test shell on Ubuntu 22.0 and macOS 15.0

API

not

Documentation

  • doc-needed
  • doc-not-needed
  • doc-included

@github-actions github-actions Bot added doc-needed Your PR changes impact docs. fixVersion/0.3.0 The feature or bug should be implemented/fixed in the 0.3.0 version. priority/major Default priority of the PR or issue. and removed doc-needed Your PR changes impact docs. labels Apr 1, 2026
@JinkunLiu JinkunLiu marked this pull request as draft April 1, 2026 05:09
@JinkunLiu JinkunLiu force-pushed the feature/import-wizard branch from 4345b93 to 47ba91e Compare April 6, 2026 12:57
@JinkunLiu JinkunLiu marked this pull request as ready for review April 6, 2026 12:58
@github-actions github-actions Bot added doc-needed Your PR changes impact docs. and removed doc-needed Your PR changes impact docs. labels Apr 6, 2026
Copy link
Copy Markdown
Collaborator

@wenjin272 wenjin272 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your work @JinkunLiu. I tried using this script and found some issues, which I've listed in the comments.

From a user experience perspective, I feel there is a significant issue: users are required to understand too many cli flag and environment variables, such as --install-flink, --enable-pyflink, INSTALL_FLINK, FLINK_VERSION, and FLINK_AGENTS_VERSION. Even the value of INSTALL_FLINK can be set to ask.

I'm wondering if we could retain only a single CLI flag, --default, to indicate that all operations should follow the default path, while using an interactive mode for user selection in all other cases. Users can interactively choose whether to install Flink, select the Flink version (1.20.3, 2.0.1, 2.1.1, 2.2.0), and specify the flink-agents version (0.1.1, 0.2.1, 0.3). Additionally, users can provide interactive responses regarding the location of an existing Flink installation and the target installation directory.

We can retain these environment variables; when they are detected as set, we will print a log and skip the prompt.

Comment thread tools/install.sh
Comment thread tools/install.sh Outdated
Comment thread tools/install.sh Outdated
Comment thread tools/install.sh
Comment thread tools/install.sh Outdated
Comment thread tools/install.sh Outdated
@JinkunLiu
Copy link
Copy Markdown
Contributor Author

JinkunLiu commented Apr 27, 2026

Hello @wenjin272 thanks for you review

users are required to understand too many cli flag and environment variables.

CLI flags are reserved for non-interactive installation mode

I'm wondering if we could retain only a single CLI flag, --default, to indicate that all operations should follow the default path

Actually, the code already does this — you can try running bash install.sh directly and the interactive mode for user selection will appear.

Users can interactively choose whether to install Flink, select the Flink version (1.20.3, 2.0.1, 2.1.1, 2.2.0), and specify the flink-agents version (0.1.1, 0.2.1, 0.3).

Now the install shell provides the latest version as default. Users can use all the features we provide in the latest version.If they need special version about Flink or Flink-agents they can use cli flag.

@wenjin272
Copy link
Copy Markdown
Collaborator

Hi, @JinkunLiu, thanks for your update.

Since flink-agents provides multi-version support, and considering that some developers' clusters may not yet be compatible with the latest Flink version, I believe it might be too inflexible to always download the latest Flink version.

Now the install shell provides the latest version as default. Users can use all the features we provide in the latest version.If they need special version about Flink or Flink-agents they can use cli flag.

This is what I mean when I say that users need to understand many CLI flags. Why can't users be allowed to select a specific Flink version, rather than being required to specify it via a CLI flag? We provide a GUI-like installation wizard, but users are still required to manually specify certain cli flag to achieve specific goals. I believe this creates a fragmented user experience.

I think for some less frequent requirements, such as dry-run and verbose, we can simply only provide CLI flags. However, for some high-frequency requirements, I believe it would be a more user-friendly experience to allow users to select options during the installation guide, rather than requiring them to read and understand the information returned by help and setting CLI flags or environment variables.

I think these requirements contains:

  1. intall dir: we can provide two options: default vs. custom, and if user choose custom, they need input the custom dir.
  2. install flink: yes vs. no, and check the FLINK_HOME when user choose no.
  3. flink-version: 4 options, 1.20.3, 2.0.1, 2.1.1, 2.2.0(recommend)
  4. install pyflink: yes vs. no

After the user makes the above selections, we will then output the Install Plan.

Additionally, when running the script on my Mac, I encountered the error die: command not found. You may need to replace it with a more universally available command.

Copy link
Copy Markdown
Collaborator

@wenjin272 wenjin272 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @JinkunLiu. I have verified the install.sh script locally and successfully submitted the example to the Flink standalone cluster. LGTM. Please take a look at your convenience @xintongsong.

The only issue I encountered was being unable to complete the download of Flink within the timeout period. I'm not sure whether this is a personal network problem or a common occurrence.

Additionally, now that an install script has been provided, we need to refactor the Installation section in the documentation. I think this can be done either in this PR or in a separate one.

Comment thread tools/install.sh Outdated
detect_downloader
fi
if [[ "$DOWNLOADER" == "curl" ]]; then
curl -fsSL --proto '=https' --tlsv1.2 --connect-timeout 5 --max-time 30 --retry 2 --retry-delay 1 --retry-connrefused -o "$output" "$url"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The flink-2.2.0-bin-scala_2.12.tgz file is 544 MB in size, and setting the max-time to 30 seconds imposes high demands on network performance.
Based on my actual tests, it takes approximately 500 seconds in my office network. I recommend setting it to 600 seconds.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. This is a mistake which introduced when i fix another issue. Fixed it.

@JinkunLiu JinkunLiu force-pushed the feature/import-wizard branch from a2ab0bd to 852c603 Compare April 29, 2026 14:29
Copy link
Copy Markdown
Contributor

@xintongsong xintongsong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this, @JinkunLiu and @wenjin272 .

I did not closely check all the codes, but have tried the script out instead. Here are a few issues that I run into, which might be improved.

  1. An extra -- in line 68 causes gum downloading failure. However, the script did fail and continued to execute without gum, leading to a poor interaction experience. I think we should make gum required, and fail the script if it's not available.

  2. There's no downloading progress. It takes a couple of minutes to download the Flink binary release, and during that time the script acts like dead. Might be better to show some progress, letting users know that it's actually downloading.

  3. Setting up the python environment produces too many logs, which I believe is from pip install. Can we make the pip install output rolling, like it always shows the latest N lines of outputs, taking at most N lines of the screen space?

  4. The plan shows FLINK_HOME .//flink-2.2.0 on my screen. There's an extra /.

  5. It's like a one-way road. There's no way to go back.

  • There's no way to go back to the previous step and change the settings.
  • At the stage of confirming the installation plan, if you choose No, the script exits, without giving you a chance to modify the plan.
  • When I input a wrong path of exiting flink installation, the script exits, without giving me a second chance to re-input it.
  1. I tried ctrl+c to interrupt the script. It didn't exit the script, but instead skipped the configuration and directly started the installation.

In addition to the above experience issues, I think for this over 1k loc script, we need test cases to verify it. The -- bug, introduced by fixing the download time limit, is a strong indicator that the script lacks protection that any simple maintenance update may break things.

Comment thread tools/install.sh Outdated
Comment thread tools/install.sh Outdated
@JinkunLiu
Copy link
Copy Markdown
Contributor Author

Thanks for the review! @xintongsong

  1. Fixed the extra -- issue. On whether to make gum required — I referenced openclaw's design, which treats it as an optional dependency rather than a hard requirement. Since gum is just a UI enhancement and not something the script needs to work, I think keeping it optional makes more sense.
  2. Added a download progress indicator for the Flink binary. Done.
  3. I aware the pip install output is noisy, but haven't found a elegant solution yet. Open to suggestions if you have ideas.
  4. I couldn't reproduce the double-slash issue in FLINK_HOME. Could you share more details about how it was triggered?
  5. Improved path input validation so users get a second chance on wrong Flink path input. As for going back after reviewing the plan — since the number of configuration fields is still relatively small, I considering leaving that out for now. WDYT?
  6. The Ctrl+C issue has been fixed.

@wenjin272
Copy link
Copy Markdown
Collaborator

Hi, @JinkunLiu, thanks for your update. These updates look to me.

As you mentioned, there are three remaining issues.

  1. make the pip install output rolling
  2. second chance to modify install plan
  3. add comprehensive tests.

Given that the code freeze (5.31) for Flink-Agents 0.3 is approaching and communication via natural language may be inefficient, I tried using Claude Code to address the three issues mentioned above and generated some code. I verified this code locally, and it seems to be working fine. Would you mind if I directly push the above commits to this branch? Then, you can accept or reject these commits, or make modifications based on them.

@JinkunLiu
Copy link
Copy Markdown
Contributor Author

Hi, @wenjin272
Thanks for you help, please go ahead and push the commits directly to this branch.

@wenjin272 wenjin272 force-pushed the feature/import-wizard branch 2 times, most recently from f21bf9b to b41ac14 Compare May 19, 2026 02:48
wenjin272 and others added 3 commits May 19, 2026 11:14
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@wenjin272 wenjin272 force-pushed the feature/import-wizard branch from b41ac14 to 54ccd1b Compare May 19, 2026 03:14
@wenjin272
Copy link
Copy Markdown
Collaborator

Hi, @JinkunLiu @xintongsong , I pushed 3 commits to address the remaining review comments. PTAL.

  • e8f9c0e — Adds an 85-test bats-core suite under tools/test/ (unit + PATH-shim integration, including an exact-argv pin on download_file as a regression guard for the -- bug) plus a CI job running on Ubuntu and macOS.
  • 525adadpip_install_quiet helper replaces noisy pip install output with a single rolling status line on tty (full log surfaced on failure), gum spin when gum is available, and --quiet in CI.
  • 54ccd1b — Confirm prompt is now Proceed / Edit a setting / Cancel; the Edit submenu re-prompts any currently-applicable field (Flink Yes/No, version, install dir, FLINK_HOME, PyFlink Yes/No, venv dir) and re-displays the plan.

wenjin272 and others added 9 commits May 20, 2026 21:17
…de paths

Reviewer feedback (#7b): user-supplied INSTALL_DIR / FLINK_HOME / VENV_DIR
with trailing slashes, leading "~", "." or consecutive "/" produced ugly
or broken paths like "/x//flink-2.2.0".

Add a small normalize_path() that:
  - expands "~" against $HOME,
  - anchors relative paths at $PWD,
  - folds "/./" and trailing "/.",
  - collapses runs of "/",
  - strips trailing "/" (except root).

Route every path input through it: prompt_path_input, the default and
custom branches of prompt_path_choice_interactive, the explicit / env-var
paths in plan_flink (Yes + No), plan_pyflink (replacing the relative→
absolute case statement), and the install_dir / venv_dir actions in
edit_plan_interactive.

Tests:
  - 9 unit tests covering ~, relative, "." , trailing "/", consecutive "/",
    "/./", combined.
  - 2 integration tests on the dry-run plan: trailing slash, "//" in
    INSTALL_DIR no longer leak into FLINK_HOME.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Reviewer feedback (apache#2): when users chose "I already have Flink" and entered
a FLINK_HOME, the installer kept the default FLINK_VERSION=2.2.0 internally
and later failed looking for opt/flink-python-2.2.0.jar.

Add detect_flink_version_from_home():
  - fast path: parse the version out of lib/flink-dist-<ver>.jar — instant.
  - slow path: run "${FLINK_HOME}/bin/flink --version" — authoritative but
    spins up a JVM (3-10s). Announce progress before invoking so the user
    doesn't think the installer hung.

plan_flink (No branch) now calls it after path validation, prints
"FLINK_HOME accepted" + "Detected Flink version: X.Y.Z" so the user sees
the result, and falls back to a warning (keep current FLINK_VERSION) when
the layout is non-standard.

Tests cover the fast/slow paths, stderr noise, no-source-available, and
that the progress line only appears on the slow path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Reviewer feedback (#1) and follow-up: stop extracting the JAR out of the
Python wheel; download the official artifact straight from the ASF mirror
exactly like the Flink tarball, and verify it with the SHA512 sidecar
published next to the JAR on downloads.apache.org.

Effect: Java-only users no longer need any Python at all. PyFlink users
still get their venv from pip, but the JAR in FLINK_HOME/lib comes from
the same authoritative source for both audiences.

Implementation:
  - flink_agents_jar_relpath() builds
    flink-agents-<ver>/flink-agents-dist-flink-<flink_maj.min>-<ver>.jar
  - install_flink_agents_jar() downloads to FLINK_HOME/lib (skips if
    already present), then verifies.
  - verify_flink_agents_jar_sha512() best-effort: missing sidecar /
    missing sha512 tool / malformed hash all degrade to a warning so
    a transient mirror lag never blocks an install; an actual hash
    mismatch is a hard die().
  - Replaces copy_flink_agents_jars() (wheel extraction) and removes the
    short-lived install_agent_jars_without_pyflink() temp-venv shim.
  - verify_installation now checks the single flink-agents-dist-flink JAR
    (no separate common / -thin JARs after the wheel was retired).
  - print_usage documents FLINK_AGENTS_BASE_URL /
    FLINK_AGENTS_CHECKSUM_BASE_URL.

Five integration tests cover relpath construction, mirror+sidecar URLs,
JAR reuse, empty-download failure, and downloader failure.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Reviewer feedback (apache#3, apache#5, alignment): the install plan mashed
detected-environment facts (OS, Java, JAVA_HOME) together with editable
configuration in one flat list, and the JAVA_HOME row implied it was
something the user could pick. It also leaked editable-looking
"Install directory" / "Flink version" rows when the user had said
"don't install Flink".

Rewrite show_install_plan():

  Environment (read-only)
    OS:            macos / linux
    Java:          <parsed `java -version` output> or "not found"
    JAVA_HOME:     /path  or  <not set, Flink will auto-detect>
    Python:        <bin> (X.Y.Z)  or  <none>

  Installation plan
    Flink Agents version:  …
    Install Flink:         Yes / No
      ├─ Flink version:    …          # Yes only
      ├─ Install directory: …         # Yes only
      └─ FLINK_HOME:       /p (vX.Y)  # No only — detected version inline
    Enable PyFlink:        Yes / No
      └─ Venv directory:   …          # Yes only

Also widen the key column: ui_kv now takes a UI_KV_KEY_WIDTH=24 and
appends the colon itself in both gum and fallback branches, so
"Flink Agents version" no longer collides with its value (the previous
20-char width packed key + value into "Flink Agents version0.2.1").

Five unit tests cover the Environment section being present, version+dir
on Yes, FLINK_HOME-with-version on No, the JAVA_HOME unset hint, and
that the new aligned format is rendered.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Reviewer feedback: "I didn't see where to choose which flink-agents
version to install." Only FLINK_VERSION had an interactive picker;
FLINK_AGENTS_VERSION was hardcoded.

Add a parallel picker that mirrors
https://flink.apache.org/downloads/#apache-flink-agents:
  FLINK_AGENTS_SUPPORTED_VERSIONS=("0.2.1" "0.2.0" "0.1.1" "0.1.0")
  FLINK_AGENTS_RECOMMENDED_VERSION="0.2.1"

- prompt_flink_agents_version_interactive() — same UX as the Flink
  version picker (gum choose + numbered-menu fallback, "(recommended)"
  marker).
- plan_flink_agents() runs the picker if the version wasn't supplied
  explicitly; called from main right after plan_flink so users see it as
  part of "Planning Flink installation".
- mark_explicit FLINK_AGENTS_VERSION + --flink-agents-version flag
  (also adds --flink-version which was missing too) so non-interactive
  invocations and env vars opt out of the picker.
- Updated print_usage with both flags and reset_install_sh_state with
  the supported-versions array so tests stay deterministic.

Integration test (dry-run plan) confirms "Flink Agents version: 0.2.1"
appears in the plan output.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Reviewer feedback (apache#6): path entry had no autocomplete in either branch —
gum input is a plain text box and the fallback used plain \`read -r\`, so
users typing FLINK_HOME / INSTALL_DIR / VENV_DIR couldn't Tab-complete.

Replace both with \`read -e -r -p\` so readline's filename completion is
available everywhere. The gum input box is dropped (it offered no value
beyond the visual style and blocked completion); gum choose is still used
for menu selections. The previous "placeholder" argument is woven into
the prompt label as "(example: …)" since readline doesn't surface
placeholders the same way.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…s block

Reviewer feedback (#7a, venv hint visibility): the closing "export
FLINK_HOME=..." line was opaque to first-time users — they couldn't tell
why they needed to export it, and the venv activate hint was an
afterthought tucked below it.

Replace it with a "Next steps" section, numbered, where the venv hint is
a peer of the FLINK_HOME export instead of being buried. Each step uses
ui_success for the rationale and ui_info for the literal command so the
copy/paste line is visually separable from the prose. The "make it
permanent" guidance only mentions one or two rc lines depending on
whether the venv is actually installed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… nested menu

Reviewer feedback (apache#4, apache#8, apache#9): toggling Install Flink / Enable PyFlink in
the edit menu didn't ask the follow-up questions; ESC at the edit menu
killed the whole installer; Ctrl+C during "Proceed with installation?"
silently looped instead of cancelling.

Cascade follow-ups:
  - Flipping install_flink from No → Yes immediately walks the user
    through Flink version + install dir.
  - Flipping install_flink from Yes → No prompts for FLINK_HOME and
    runs detect_flink_version_from_home.
  - Flipping enable_pyflink from No → Yes prompts for the venv dir
    right after resolving Python.

ESC = back, at any nesting depth:
  - Run the entire edit body inside a `()` subshell.
    A die_cancelled / ESC from any nested prompt only kills the subshell.
  - On successful completion the subshell dumps mutated state via
    edit_plan_quote / edit_plan_dump_state; the parent sources it back.
  - On any non-zero subshell exit, the parent returns 0 ("back to
    confirm"); the state file is left untouched, so the user's previous
    plan is preserved.
  - The "← Back" menu item is gone — ESC IS back.
  - Critical detail: the subshell is wrapped with `|| rc=$?` so set -e
    doesn't propagate the subshell's `exit 130` and kill the parent.

Ctrl+C handling:
  - Register `trap die_cancelled INT` once at script load so any read /
    gum that swallowed SIGINT still surfaces as a clean exit.
  - die_cancelled now writes its message to stderr, so a Ctrl+C inside
    `$(confirm_plan_action_interactive)` doesn't get captured into a
    variable and disappear.
  - confirm_install_plan checks the subshell exit code:
      action="$(...)" || exit 130
    so when Ctrl+C lands inside the proceed/edit/cancel picker, the
    installer actually exits instead of looping forever on an empty
    selection.

Tests:
  - 6 unit tests for edit_plan_quote / edit_plan_dump_state round-tripping
    (spaces, single quotes, $expansion, command substitution).
  - 3 unit tests for the subshell-with-set-e pattern: exit 130, exit 1,
    and exit 0 paths.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Reviewer feedback: rerunning the installer always printed "Installing gum
v0.17.0, please wait..." because the bootstrap path downloaded gum into a
mktemp dir, used it, then EXIT-cleanup removed it. Every rerun paid the
download cost again.

Cache the verified gum binary under a stable, versioned path:
  GUM_CACHE_ROOT="${FLINK_AGENTS_GUM_CACHE_DIR:-${XDG_CACHE_HOME:-$HOME/.cache}/flink-agents/gum}"
  cache_dir="$GUM_CACHE_ROOT/$GUM_VERSION"
  cached_gum="$cache_dir/gum"

Lookup order in bootstrap_gum_temp:
  1. system gum on PATH                     -> GUM_STATUS=found
  2. cached binary at $cached_gum           -> GUM_STATUS=cached (new)
  3. download + verify + promote into cache -> GUM_STATUS=installed

A new status "cached" surfaces a friendly "gum loaded from cache" line in
print_gum_status. If promoting into the cache fails (read-only HOME, etc.)
the bootstrap falls back to using the freshly extracted binary for the
current run and warns once via GUM_REASON — the existing TMPFILES cleanup
still handles it, costing only a re-download next run.

Tests:
  - reset_install_sh_state defaults GUM_CACHE_ROOT to BATS_TEST_TMPDIR so
    we never write to a developer's real $HOME/.cache during local runs.
  - new "cached binary on disk is reused" test confirms no curl/wget call
    happens when the cache is pre-seeded.
  - new "successful install promotes binary into the cache" test confirms
    the extracted binary lands at the stable cache path and GUM points at it.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@wenjin272
Copy link
Copy Markdown
Collaborator

Hi, @JinkunLiu @xintongsong, I have made some additional updates to this PR, which address the following issues. PTAL

  • flink-agents JAR missing when PyFlink skipped — download JAR directly from ASF mirror (SHA512-verified); Python no longer required for Java users.
  • Existing Flink version not detected — parse lib/flink-dist-*.jar, fall back to bin/flink --version with a progress hint.
  • "Install dir" / version shown even when not installing — plan split into read-only Environment + editable Plan; install-only rows hidden on INSTALL_FLINK=No.
  • Toggling Install Flink didn't re-promptNo → Yes now cascades into version + dir prompts.
  • JAVA_HOME looked editable — moved to read-only Environment section.
  • No Tab-completion on pathsread -e -r -p replaces gum input.
  • export FLINK_HOME=… line unclear — numbered Next steps block with rationale.
  • Double slashes / ~ / trailing / in pathsnormalize_path helper wired into every path entry.
  • Ctrl+C swallowed — global trap die_cancelled INT; stderr cancel msg; confirm honors subshell exit code.
  • ESC quit installer / no way back — edit menu wrapped in ( ) subshell, ESC backs out at any nesting depth.

wenjin272 and others added 6 commits May 21, 2026 14:03
…patibility

Reviewer reported:
    ./install.sh: line 1328: ${expected,,}: bad substitution

Root cause: the SHA512 case-insensitive compare used ${var,,}, which is
bash 4+ syntax. macOS still ships /bin/bash 3.2 (GPLv3 hold-out), and
when the shebang resolves to /bin/bash, the script reaches stage 4
("Installing Flink Agents") before the bad-substitution surfaces — so
the user has already waited through the Flink + Python downloads by
the time it dies.

Fix: case-fold via `tr '[:upper:]' '[:lower:]'` instead — POSIX, works
on bash 3.2. A grep for other bash 4+ constructs (${var^^}, mapfile /
readarray, declare -A, local -n, [[ -v ]], globstar, coproc) came up
clean, so the installer now runs end-to-end on macOS's /bin/bash 3.2
without anyone needing to `brew install bash` first. Verified manually
with `/bin/bash install.sh --help` and both --dry-run paths.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ection

Reviewer pointed at a locally compiled
  /Users/.../flink-dist/target/flink-2.2-SNAPSHOT-bin/flink-2.2-SNAPSHOT
and the installer reported
  Could not auto-detect Flink version from <path>

Two problems:

  1. detect_flink_version_from_home() only matched
     `flink-dist-X.Y.Z.jar`, so a source-build artifact called
     `flink-dist-2.2-SNAPSHOT.jar` fell through both paths.

  2. Every place that derived FLINK_MAJOR_MINOR did
     `${FLINK_VERSION%.*}`, which silently produced "2" for
     "2.2-SNAPSHOT" (it strips the last dot-segment, which here is the
     entire "2-SNAPSHOT").

Fix:
  - Use a single shared regex `[0-9]+\.[0-9]+(\.[0-9]+)?(-[A-Za-z0-9._]+)?`
    in detect_flink_version_from_home() for both the JAR-filename fast
    path and the `bin/flink --version` slow path. Now accepts 2.2-SNAPSHOT,
    2.1.0-SNAPSHOT, 2.0.0-rc1, etc.
  - Introduce flink_major_minor() that pulls the leading X.Y out of any
    such version string and replace every `${FLINK_VERSION%.*}` site
    with it. So FLINK_MAJOR_MINOR for "2.2-SNAPSHOT" is "2.2", which is
    what the ASF mirror keys agents-JAR releases on.

Tests cover X.Y-SNAPSHOT (filename path), X.Y.Z-SNAPSHOT, and X.Y.Z-rc1
(CLI path), plus a small table for flink_major_minor including a
"garbage in, empty out" case. Manually verified: a local snapshot
FLINK_HOME now reports `(v2.2-SNAPSHOT)` in the plan and would resolve
the correct mirror URL for the agents JAR.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Reproduced on macOS /bin/bash 3.2.57 with PWD=/tmp/test-wizard:
    normalize_path("./") → "/tmp/test-wizard\"

Root cause: the "/./" → "/" fold used
    p="${p//\/.\//\/}"
Bash 3.2 doesn't strip the backslash from an escaped slash in the
*replacement* half of a `${var//pat/repl}` expansion, so the literal
`\/` ends up in the result. Bash 4+ unescapes it, hiding the bug —
which is why we missed it locally until a reviewer (and then us)
actually ran the installer through macOS's default /bin/bash.

Fix: rewrite the four folding steps as a single sed invocation:
    sed -E 's|/+|/|g; s|/\./|/|g; s|/\.$||; s|/+$||'
The "~" expansion and relative→absolute anchor still happen in bash,
but the slash arithmetic is now handled by sed and avoids the
parameter-expansion quirk entirely.

Verified on both /bin/bash 3.2.57 and /opt/homebrew/bin/bash 5+: same
output for "./", "", "/", "~/foo", "bar", "/x//y/", "/x/./y/",
"/x/./y/.", "//x//y//.//".

Drive-by: bump curl/wget --max-time 600 → 900, since slow ASF mirror
hops were brushing up against the 10-minute ceiling on the Flink
tarball. download_file.bats expected-argv assertions updated to match.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Reviewer reported: typing "./" at the Choose Python venv directory
prompt, when the cwd held unrelated files, was silently accepted —
plan announced "Reusing existing virtual environment", and the script
later either died trying to `source $VENV_DIR/bin/activate` (after
already downloading Flink), or worse, scribbled venv files alongside
the user's project.

Add validate_venv_dir() which classifies a candidate path as one of
new / empty / venv / nonempty / file. Only the first three are safe.
The venv marker is strict: pyvenv.cfg must be present (a stray
`bin/activate` doesn't make a directory a venv).

Apply the classifier in two places:

  - Interactive (plan_pyflink + the venv_dir / enable_pyflink edits in
    edit_plan_interactive) re-prompts on nonempty/file with an
    explanatory warning. Hoisted the prompt+validate loop into a
    shared prompt_and_validate_venv_dir() helper.

  - Non-interactive (--non-interactive / NO_PROMPT / explicit
    VENV_DIR= env var) dies with a clear message BEFORE stage 3
    downloads Flink — the integration test guards that no curl/wget
    call happens on the failure path.

Reordered plan_pyflink to validate VENV_DIR before resolve_python:
bad user input should fail fast, not after the user has just gone
to install a Python interpreter.

Tests: 8 unit cases covering each classifier branch + 2 integration
cases (foreign dir aborts before download, real venv with pyvenv.cfg
is accepted). 133 → 143 BATS, all green.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Reviewer ran the installer against a source-built Flink 2.1-SNAPSHOT
and pip blew up at stage 4:
    ERROR: No matching distribution found for apache-flink==2.1-SNAPSHOT

PyPI only ships finished release wheels for apache-flink — there's no
"2.1-SNAPSHOT" upload — so an `==` pin against any pre-release Flink
version is doomed.

Switch to PEP 440's compatible-release operator (~=) when the Flink
version carries a pre-release suffix:
    FLINK_VERSION=2.1-SNAPSHOT → apache-flink~=2.1.0
    FLINK_VERSION=2.0.0-rc1    → apache-flink~=2.0.0
    FLINK_VERSION=2.2.0        → apache-flink==2.2.0  (unchanged)
    FLINK_VERSION=1.20.3       → apache-flink==1.20.3 (unchanged)

A new is_snapshot_version() helper decides the branch (any "-suffix"
is treated as a pre-release; pure X.Y.Z stays on the exact pin). When
we take the fallback we warn the user that PyFlink will come from the
nearest released minor on PyPI — their Java JARs are still the source
build, but the Python bridge may not match exactly. That tradeoff is
better than a silent install failure after Flink has already been
downloaded.

Tests: 4 new unit cases for is_snapshot_version (SNAPSHOT, rc/beta/dev,
plain releases, empty); 143 → 147 BATS, all green.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Reviewer reproductions (Flink download 404, pip install for SNAPSHOT,
tar exit non-zero) all ended with the script silently dying after a
wall of pip/curl/tar output — no marker for "the installer failed",
no hint of which stage we were in, no line number to grep for.

Install an ERR trap that fires whenever set -e tears the script down
on an unhandled non-zero command. The banner prints:

  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  ✗ Installation failed at stage 3/5 (Installing Apache Flink).

    Command:   curl -fL --progress-bar ... -o "$output" "$url"
    Source:    install.sh:75
    Exit code: 56

    Re-run with --verbose for full output, or report at:
      https://github.com/apache/flink-agents/issues
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Implementation notes:
  - Enable `set -E` (errtrace) so the trap survives into functions and
    command substitutions — the actual fail sites are all nested.
  - ui_stage() now remembers INSTALL_STAGE_TITLE so the banner can
    name the stage in addition to its number.
  - The banner writes to stderr and self-disables (trap - ERR) before
    its own echo lines so a failure inside the handler doesn't recurse.
  - die() / die_cancelled() use plain `exit`, which doesn't trip ERR,
    so their existing single-line messages stay unchanged — banner is
    reserved for the genuinely unexpected failures.
  - Script exits with the failing command's original rc (curl 56,
    pip 1, tar 2, etc.) so report-an-issue threads carry a real signal.

Tests: 2 integration cases — set -e tear-down lights up the banner
with the right stage / line / exit code, and a die() path stays a
single-line warning. 147 → 149 BATS, all green.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@xintongsong xintongsong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this, @JinkunLiu and @wenjin272 . I tried the wizard locally, and it works well. Much better experience. LGTM.

@wenjin272 wenjin272 merged commit 962d587 into apache:main May 22, 2026
24 of 25 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

doc-needed Your PR changes impact docs. fixVersion/0.3.0 The feature or bug should be implemented/fixed in the 0.3.0 version. priority/major Default priority of the PR or issue.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants