implemented NISAR adapation and provide template for future project adaptation#1
implemented NISAR adapation and provide template for future project adaptation#1pymonger wants to merge 23 commits into
Conversation
- Move json, logging, requests, sys, getpass, datetime, and urllib imports to top - Remove duplicate imports from main block - Enables functions to be imported and used by other scripts
- Move NISAR-specific scripts to dedicated nisar/ directory - Create comprehensive project organization with hierarchical breakdown - Add job execution time extractor with wall_time analysis - Create PROJECT_TEMPLATE.md for future project adaptations - Add QUICK_START.md for easy reference - Clean up obsolete debugging and testing scripts - Update all documentation to reflect new structure
- Add sys.path.append to correctly import from parent metrics_extractor directory - Fixes ModuleNotFoundError when running scripts from nisar/ directory - Both hysds_metrics_es_extractor_enhanced.py and job_execution_time_extractor.py updated
- CSV files are created in CWD when scripts are run - Empty generated_csv_files directory was not needed - Simplifies directory structure
…ctory - CSV files are created in CWD when scripts run - Remove unnecessary directory references from README.md - Clean up directory structure documentation
- Ignore macOS .DS_Store files - Ignore Python __pycache__ directories and compiled files - Ignore CSV files generated by scripts - Ignore common IDE, backup, and temporary files - Ignore Python virtual environments and build artifacts - Comprehensive coverage of commonly ignored files/directories
- Remove trailing underscore from regex pattern to handle job IDs that continue after beam name - Pattern now matches: _full_individual_L_20_QP_05_QP_0_state-config-... - Previously only matched: _full_individual_L_20_QP_05_QP_ - Fixes issue where INSAR jobs were not being processed due to strict regex - Tested with example INSAR job ID: SCIFLO_INSAR__pcm_r4.0.7_pge_r4.1.0-network_pair_006_040_012_full_individual_L_20_QP_05_QP_0_state-config-20251010T201535.245443Z
| ```bash | ||
| cp /Users/gmanipon/dev/metrics_extractor/nisar/hysds_metrics_es_extractor_enhanced.py /Users/gmanipon/dev/metrics_extractor/your_project/ | ||
| cp /Users/gmanipon/dev/metrics_extractor/nisar/job_execution_time_extractor.py /Users/gmanipon/dev/metrics_extractor/your_project/ | ||
| ``` |
There was a problem hiding this comment.
Bug: Template Contains Hardcoded Developer Paths
The PROJECT_TEMPLATE.md includes hardcoded personal development paths, like /Users/gmanipon/dev/metrics_extractor/. These paths make the template non-portable and expose specific developer environment details, which could hinder its general usability for other developers.
| patterns = { | ||
| "beam_name": r"_(?P<coverage>full|partial)_(?P<acquisition_mode>individual|mixed)_(?P<beam_name>L_\d{2}_\w{2}_\d{2}_\w{2})_", # Primary: beam_name | ||
| "coverage": r"_(?P<coverage>full|partial)_(?P<acquisition_mode>individual|mixed)_(?P<beam_name>L_\d{2}_\w{2}_\d{2}_\w{2})_", # Secondary: coverage | ||
| "acquisition_mode": r"_(?P<coverage>full|partial)_(?P<acquisition_mode>individual|mixed)_(?P<beam_name>L_\d{2}_\w{2}_\d{2}_\w{2})_", # Tertiary: acquisition_mode |
There was a problem hiding this comment.
Bug: Regex excludes S-band jobs, only matches L-band
The beam_name regex pattern uses L_\d{2}_\w{2}_\d{2}_\w{2} which only matches L-band modes (e.g., L_20_DH_05_DH), but the NISAR mission includes both L-band and S-band modes. The documentation in README_ENHANCED.md correctly states the regex should be [LS]_\d{2}_\w{2}_\d{2}_\w{2} to match both bands, and the load_nisar_config function also correctly uses this pattern. However, the parse_job_id_patterns function only matches L-band, causing all S-band job metrics (e.g., S_37_QP_00_NA) to be silently skipped. This affects both extractor scripts.
Additional Locations (1)
| .Spotlight-V100 | ||
| .Trashes | ||
| ehthumbs.db | ||
| Thumbs.db |
There was a problem hiding this comment.
Duplicate entries in gitignore file
Low Severity
The .gitignore file contains duplicate entries. Lines 2-8 (.DS_Store, .DS_Store?, ._*, .Spotlight-V100, .Trashes, ehthumbs.db, Thumbs.db) are repeated verbatim at lines 126-132. Line 116 (*~) is also duplicated at line 123. These duplicates clutter the file without adding value and could cause confusion during maintenance.
| __version__ = "1.2.3" | ||
| __url__ = "https://github.com/hysds/metrics_extractor" | ||
| __description__ = "HySDS Metrics Extractor Tool" No newline at end of file | ||
| __description__ = "HySDS Metrics Extractor Tool" |
There was a problem hiding this comment.
No-op modification in module init file
Low Severity
The __init__.py file shows a deletion and re-addition of the same __description__ line with identical content. The diff indicates line 8 was deleted and line 9 added, but both contain __description__ = "HySDS Metrics Extractor Tool". This is a whitespace-only or no-op change that adds no value and may have been committed accidentally.
| continue | ||
|
|
||
| pge_time = min(wall_times[0], wall_times[1]) / (1e9 * 60) # nanoseconds to minutes | ||
| pcm_time = max(wall_times[0], wall_times[1]) / (1e9 * 60) |
There was a problem hiding this comment.
Unguarded array access in pge_execution_time_extractor
Medium Severity
The code checks len(wall_times) < 2 and continues if true, but then directly accesses wall_times[0] and wall_times[1] without considering that wall_times might have more than 2 elements. If there are 3 or more wall_time values, only the first two are used without documenting this behavior or warning about extra values, which could hide data quality issues.
| continue | ||
|
|
||
| pge_time = min(wall_times[0], wall_times[1]) / (1e9 * 60) # nanoseconds to minutes | ||
| pcm_time = max(wall_times[0], wall_times[1]) / (1e9 * 60) |
There was a problem hiding this comment.
Missing None value validation for wall_time
High Severity
The list comprehension filters for the presence of the wall_time key but doesn't verify the value isn't None. If wall_time exists but has a None value, it gets added to the wall_times list, then the arithmetic operations at lines 235-236 will crash with TypeError: unsupported operand type(s) for /: 'NoneType' and 'float' when attempting to divide None by 1e9 * 60.
| match_id = row[1] | ||
| instance_type = row[3] | ||
| pge_time = float(row[4]) | ||
| pcm_time = float(row[5]) |
There was a problem hiding this comment.
Missing bounds check for CSV row parsing
Medium Severity
The CSV parsing code accesses row indices 0 through 5 without checking if the row has at least 6 elements. If the CSV format is slightly malformed or has missing columns, this will raise an IndexError and crash the script. Only line 70 has a length check for a different row type, but the job details parsing lacks this protection.
| subprocess.check_call(['pip', 'install', 'openpyxl', '-q']) | ||
| import openpyxl | ||
| from openpyxl.styles import Font, PatternFill, Border, Side | ||
| from openpyxl.chart import BarChart, Reference |
There was a problem hiding this comment.
Automatic package installation without user consent
Medium Severity
The script automatically installs the openpyxl package using subprocess.check_call(['pip', 'install', 'openpyxl', '-q']) when the import fails, without asking for user permission. This modifies the user's Python environment without consent, which is a security and usability concern, especially in production or restricted environments where pip install might fail or be disallowed.
Six queries across 3 NISAR scripts used size:10000 with no truncation detection, risking silently incomplete results for wide time ranges. Replace with search_after + Point In Time pagination via a new shared es_pagination module. Also clean up redundant Kibana boilerplate (match_all, empty should/must_not) from queries. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
| if breakdown_value not in breakdown_groups: | ||
| breakdown_groups[breakdown_value] = [] | ||
| breakdown_groups[breakdown_value].append(job_id) | ||
|
|
There was a problem hiding this comment.
Breakdown extraction uses wrong capture group
Medium Severity
get_job_id_breakdown_aggregation() sets breakdown_value = match.group(1) even though the configured patterns rely on named groups (coverage, acquisition_mode, beam_name). With the current regex, group(1) corresponds to coverage, so breakdowns for other fields can be mis-grouped even when the match succeeds.
L0B execution time reports now include per-job RCID (radar config ID), diagnostic mode flag (0/1/2/cal), beam mode, and PCM/PGE version. Stats are grouped by (version, instance_type, RCID, diagnostic_mode_flag) for L0B; other PGE types are unchanged (new CSV columns present but empty). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Document the --modes_config argument, L0B-specific extraction examples, and the new CSV columns (rcid, diagnostic_mode_flag, beam_mode, version). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
es_pagination.py: Auto-detect backend by trying the Elasticsearch PIT
endpoint first (POST /{index}/_pit), then falling back to the OpenSearch
endpoint (POST /_search/point_in_time) with the corresponding close API.
pge_execution_time_extractor.py: Replace deprecated datetime.utcnow()
with datetime.now(timezone.utc).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
If both ES and OpenSearch PIT endpoints fail (e.g. older OpenSearch clusters), fall back to the universally supported scroll API for pagination beyond 10K hits. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Include the raw product ID in the detail section so we can see the actual filenames for jobs where RCID extraction fails (dtid jobs). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
CRSD products have a different filename format without the RCID field. Only process RRSD products for L0B metrics. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add RCID, diagnostic mode flag, and version breakdown to L0B metrics
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 4 potential issues.
Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.
| match_id = row[1] | ||
| instance_type = row[3] | ||
| pge_time = float(row[4]) | ||
| pcm_time = float(row[5]) |
There was a problem hiding this comment.
CSV reader uses wrong column indices for time values
High Severity
The read_csv function uses hardcoded indices row[4] and row[5] for pge_time and pcm_time, but according to the documented CSV format in nisar/README.md, the Individual Job Details header is job_id,match_id,data_day,instance_type,version,rcid,diagnostic_mode_flag,beam_mode,pge_execution_time_min,pcm_container_time_min. In that format, row[4] is the version string and row[5] is the rcid — the actual time columns are at indices 8 and 9. Calling float() on a version string like "pcm_r05.00.1_pge_r05.00.5.1" would raise a ValueError, crashing the tool. The header variable is captured but never used for column lookups.
Additional Locations (1)
| ### 2. Copy Base Scripts | ||
| ```bash | ||
| cp /Users/gmanipon/dev/metrics_extractor/nisar/hysds_metrics_es_extractor_enhanced.py /Users/gmanipon/dev/metrics_extractor/your_project/ | ||
| cp /Users/gmanipon/dev/metrics_extractor/nisar/job_execution_time_extractor.py /Users/gmanipon/dev/metrics_extractor/your_project/ |
There was a problem hiding this comment.
Hardcoded local paths in project template documentation
Medium Severity
The project template contains hardcoded local filesystem paths (/Users/gmanipon/dev/metrics_extractor/) in several code examples and reference sections. These developer-specific paths won't work for anyone else and leak the author's local directory structure. The template is intended to be generic guidance for future projects but has absolute paths baked in.
Additional Locations (1)
| logging.error(f"missing argument breakdown_job") | ||
| raise Exception(f"missing argument breakdown_job") | ||
|
|
||
| nisar_config = args["nisar_config"] |
There was a problem hiding this comment.
CLI argument --nisar_config is accepted but never used
Medium Severity
The --nisar_config CLI argument is accepted and passed to get_job_breakdown_metrics as nisar_config_path, but that function completely ignores it (the docstring even says "not used in this version"). The load_nisar_config function is defined but never called anywhere. This means the NISAR config file is never loaded, and users following the documented usage in QUICK_START.md and README_ENHANCED.md (which instruct passing --nisar_config) would have no idea their config has no effect on the analysis.
Additional Locations (1)
| for pattern_name, extractions in sample_extractions.items(): | ||
| logging.info(f" {pattern_name}: {extractions}") | ||
|
|
||
| return patterns, sample_extractions |
There was a problem hiding this comment.
Duplicated functions across two NISAR scripts
Low Severity
get_sample_job_ids and parse_job_id_patterns are fully duplicated between job_execution_time_extractor.py and hysds_metrics_es_extractor_enhanced.py. The implementations are character-for-character identical (except the enhanced version has an unused nisar_modes parameter on parse_job_id_patterns). Both files already import from shared modules via sys.path manipulation, so these functions could live in a shared location. A bug fix in one copy would need to be manually replicated to the other.
Additional Locations (1)
Extends the three-level breakdown to all PGE types and adds an
orchestrator that produces a complete set of deliverables (per-PGE CSVs,
unified aggregated CSV, multi-sheet Excel workbook) for any CRID.
Changes:
- hysds_metrics_es_extractor_enhanced.py: accept GSLC, GCOV, and L3_SM
in addition to RSLC and INSAR. Their job_ids carry the same
_<coverage>_<acquisition_mode>_<beam_name>_ pattern, so the existing
regex applies without modification.
- l0b_three_level_breakdown.py (new): L0B-specific three-level
breakdown (rcid -> beam_mode -> diagnostic_mode). Needed because L0B
runs pre-focus and its job_id lacks beam info; breakdown is derived
from the L0B product filename with rcid -> beam_mode mapped via the
NISAR_MIXED_MODES_CONFIG.
- combine_breakdowns.py (new): merges per-PGE breakdown CSVs into a
single aggregated CSV with a unified schema. The two breakdown
conventions (standard vs L0B) are projected onto generic level_1/
level_2/level_3 columns while preserving the native columns.
- build_crid_workbook.py (new): produces a multi-sheet Excel workbook
with a Summary sheet (per-PGE totals + bar chart), "All PGEs
(unified)" sheet, and one sheet per PGE type with its native schema.
- run_crid_report.sh (new): orchestrator that takes --crid, --cluster,
--es_url, --netrc_os, --version, and --days_back, then chains the
extractor, L0B script, combiner, and workbook builder. Reads
OpenSearch credentials from the netrc-os file.
- README.md: add Quickstart, new sections for L0B breakdown, combiner,
workbook builder, and CRID report orchestrator. Update directory
listing and section 3 to reflect broader PGE support.
Usage example:
./nisar/run_crid_report.sh --crid X05013 --cluster POP1 --es_url "https://localhost:9202/logstash-*/_search" --netrc_os /path/to/netrc-os-<cluster> --version r05.01.3 --days_back 200 --output_dir ./output
…adata
In r05.01.4 the SCIFLO_{RSLC,GSLC,GCOV,INSAR,L3_SM} job_id changed from
embedding the long-form product name (e.g. `..._full_individual_L_22_DH_20_DH_...`)
to encoding the state-config id
(`SCIFLO_RSLC__release-r05.01.4-track_frame_<cycle>_<track>_<frame>_<v>_state-config-<ts>-<ts>`).
The existing `_(?P<coverage>full|partial)_(?P<acquisition_mode>individual|mixed)_(?P<beam_name>L_\d{2}_\w{2}_\d{2}_\w{2})_`
regex no longer matches, so pattern analysis returned `[]` for every
dimension and the three-level breakdown produced no rows.
Add a `--grq_url` arg. When set and the job_type is a GRQ-product PGE,
take `job.job_info.metrics.products_staged[*].id` from each job_info doc,
bulk-look-up `metadata.BeamName`, `metadata.FrameCoverage`, and
`metadata.isJointObservation` in GRQ via an `ids` query, and group on
those. `isJointObservation` true/false maps to mixed/individual.
The legacy regex path is unchanged when `--grq_url` is omitted, so
older releases and non-product job types (L0B etc.) keep working.
Verified on POP1 X05014 over 30 days: 657/657 staged RSLC products
resolved, 22 (beam x coverage x acquisition_mode) groups written.
Forgot to thread the new flag through the orchestrator after adding it to hysds_metrics_es_extractor_enhanced.py in 7d869a7. Without it the RSLC/GSLC/GCOV/INSAR/L3_SM PGEs still fall back to the legacy regex path and produce zero breakdown rows on r05.01.4+. When --grq_url is omitted, the extractor is invoked exactly as before, so pre-r05.01.4 runs are unchanged.
On modern NISAR clusters ES_CLUSTER_MODE=true and es-mozart/es-grq/ es-metrics all front the same 3-node OS cluster, so the metrics-ES endpoint already serves grq_* indices. Requiring users to set up a second tunnel and pass --grq_url for the common case was friction with no benefit. When --grq_url is omitted and the job_type is a GRQ-product PGE, derive the GRQ base URL from the --es_url scheme+netloc and log the choice. For non-clustered deployments where GRQ is a separate ES instance, --grq_url remains available as an explicit override. Verified on POP1: run with --es_url=https://es-metrics:9200/logstash-*/_search and no --grq_url resolves 702/702 RSLC products through the auto-derived https://es-metrics:9200 base.


🚀 Pull Request: NISAR Project Adaptation and Enhanced Metrics Extraction
📋 Overview
This PR introduces a comprehensive NISAR-specific adaptation of the HySDS metrics extractor, providing enhanced hierarchical job breakdown capabilities and specialized execution time analysis. The changes organize the codebase for scalability while maintaining backward compatibility with the original functionality.
🎯 Key Features
1. NISAR-Specific Enhancements
L_40_DH_05_DH,L_20_QP_05_QP)full,partial)individual,mixed)2. Project Organization
3. Code Quality Improvements
📁 File Changes
.gitignorePROJECT_TEMPLATE.mdQUICK_START.mdnisar/nisar/hysds_metrics_es_extractor_enhanced.pynisar/job_execution_time_extractor.pynisar/NISAR_MIXED_MODES_CONFIG_20200101T000000_01.jsonnisar/README.mdnisar/README_ENHANCED.mdmetrics_extractor/hysds_metrics_es_extractor.py🔧 Technical Details
Enhanced Metrics Extractor
beam_name→coverage→acquisition_mode_(?P<coverage>full|partial)_(?P<acquisition_mode>individual|mixed)_(?P<beam_name>L_\d{2}_\w{2}_\d{2}_\w{2})_Execution Time Extractor
wall_timefromjob.job_info.metrics.usage_statsexecution_time_minutes: Lesser of two wall_time valuespcm_container_runtime_m: Larger of two wall_time valuesSecurity & Best Practices
🚀 Usage Examples
NISAR Enhanced Metrics
NISAR Execution Time Analysis
📊 Output Files
job_three_level_breakdown_job_SCIFLO_RSLC_*.csvjob_execution_times_job_SCIFLO_RSLC_*.csv🔮 Future Adaptations
The project template (
PROJECT_TEMPLATE.md) provides a clear pattern for adapting this structure to other projects:your_project/✅ Testing
nisar/directory📈 Impact
🎯 Ready for Review
This PR provides a solid foundation for NISAR-specific analysis while establishing patterns for future project adaptations. All security concerns have been addressed, documentation is comprehensive, and the codebase maintains backward compatibility.
Branch:
feature/nisar-adaptationTarget:
mainAuthor: pymonger pymonger@gmail.com
Note
Medium Risk
Moderate risk due to substantial new querying/aggregation code paths (including PIT/scroll pagination) and a script that auto-installs
openpyxlat runtime, which could affect reproducibility and operational safety; existing extractor behavior appears mostly unchanged aside from refactoring/formatting.Overview
Adds a new
nisar/toolkit that extends the metrics extractor with NISAR-focused analyses: a three-level hierarchical breakdown ofSCIFLO_RSLC/SCIFLO_INSARjobs by parsingjob_id(beam/mode → coverage → acquisition mode), plus execution-time extraction based onusage_stats.wall_timeand a PGE-by-data-day execution-time extractor.Introduces shared Elasticsearch pagination (
search_after+PIT with scroll fallback) to reliably fetch results beyond the 10k hit limit, and adds an Excel-based version comparison script that matches jobs viamatch_idand outputs a multi-sheet report.Improves repo usability with new
.gitignore, quick-start and project adaptation template docs, and minor refactoring/formatting inmetrics_extractor/hysds_metrics_es_extractor.pyto make imports reusable from the new scripts.Written by Cursor Bugbot for commit 3225654. This will update automatically on new commits. Configure here.