
Conversation

@tino097 (Member) commented Nov 17, 2025

With this PR we are reorganizing the project structure into separate directories for better separation of concerns.

Each pipeline stage has a single, well-defined responsibility (a minimal sketch of how the stages compose follows the list):

  • DownloadStage: Resource fetching, hash checking, ZIP extraction
  • FormatConverterStage: Spreadsheet/spatial/CSV conversion & normalization
  • ValidationStage: RFC4180 validation, sortcheck, deduplication
  • AnalysisStage: Type inference, statistics, frequency tables, PII screening
  • DatabaseStage: PostgreSQL COPY operations
  • IndexingStage: Automatic index creation based on cardinality
  • FormulaStage: DRUF Jinja2 formula processing (optional, requires scheming)
  • MetadataStage: Auto-aliasing, summary stats, resource metadata updates
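
A minimal sketch of how these stages might be wired together, assuming each stage is a callable that accepts and returns a ProcessingContext. The stage names and ProcessingContext come from this PR; the import paths and the run_pipeline helper below are illustrative assumptions, not the actual pipeline.py code:

```python
# Illustrative sketch only: import paths and run_pipeline are assumptions.
from ckanext.datapusher_plus.jobs.context import ProcessingContext
from ckanext.datapusher_plus.jobs.stages import (
    DownloadStage, FormatConverterStage, ValidationStage, AnalysisStage,
    DatabaseStage, IndexingStage, FormulaStage, MetadataStage,
)

STAGES = [
    DownloadStage(), FormatConverterStage(), ValidationStage(), AnalysisStage(),
    DatabaseStage(), IndexingStage(), FormulaStage(), MetadataStage(),
]

def run_pipeline(context: ProcessingContext):
    """Run each stage in order; a stage returning None stops the pipeline."""
    for stage in STAGES:
        logger = context.logger   # keep a reference in case the stage returns None
        context = stage(context)  # each stage returns the (possibly updated) context
        if context is None:
            logger.info("Pipeline stopped after stage: %s", stage.name)
            return None
    return context
```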

Copilot AI left a comment

Pull Request Overview

This PR reorganizes the DataPusher Plus codebase from a monolithic jobs.py file (1624 lines) into a modular pipeline architecture with distinct stages. The refactoring separates concerns into individual stage classes (Download, FormatConverter, Validation, Analysis, Database, Indexing, Formula, Metadata), each with a single well-defined responsibility. The original monolithic implementation is preserved in jobs_legacy.py for reference, while jobs.py now serves as a backward compatibility wrapper.
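
As a rough illustration of the wrapper pattern (the module path and re-exported name below are assumptions, not taken from the diff), the compatibility layer can be little more than a re-export:

```python
# Illustrative only: backward-compatibility wrapper re-exporting the refactored
# entry point so callers importing the old jobs module keep working.
# The function name push_to_datastore is an assumption, not confirmed here.
from ckanext.datapusher_plus.jobs.pipeline import push_to_datastore  # noqa: F401

__all__ = ["push_to_datastore"]
```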

Key Changes:

  • Introduced a pipeline architecture with 8 distinct processing stages
  • Created a ProcessingContext dataclass to manage shared state between stages (see the sketch below)
  • Split monolithic code into modular, testable components
  • Maintained backward compatibility through wrapper functions
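
A ProcessingContext along these lines could carry the shared state between stages. Field names here are assumptions; the actual definition lives in jobs/context.py:

```python
# Illustrative sketch of a shared-state dataclass; field names are assumptions.
from dataclasses import dataclass, field
from logging import Logger
from typing import Any, Optional


@dataclass
class ProcessingContext:
    resource_id: str
    logger: Logger
    # scratch space that stages read from and write to as the resource
    # moves through download, conversion, validation, analysis, etc.
    data: dict[str, Any] = field(default_factory=dict)
    # path of the working CSV produced/updated by earlier stages
    working_csv: Optional[str] = None
```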

Reviewed Changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 17 comments.

Summary per file:

  • ckanext/datapusher_plus/jobs_legacy.py: Preserved copy of the original monolithic implementation, kept for reference
  • ckanext/datapusher_plus/jobs.py: Backward-compatibility wrapper that re-exports the refactored functions
  • ckanext/datapusher_plus/jobs/__init__.py: Module initialization with exports
  • ckanext/datapusher_plus/jobs/pipeline.py: Main pipeline orchestration and execution logic
  • ckanext/datapusher_plus/jobs/context.py: ProcessingContext dataclass for shared state management
  • ckanext/datapusher_plus/jobs/stages/base.py: Abstract base class for all pipeline stages (sketched after this table)
  • ckanext/datapusher_plus/jobs/stages/download.py: Download stage implementation
  • ckanext/datapusher_plus/jobs/stages/format_converter.py: Format conversion stage implementation
  • ckanext/datapusher_plus/jobs/stages/validation.py: CSV validation and deduplication stage
  • ckanext/datapusher_plus/jobs/stages/analysis.py: Type inference and statistics stage
  • ckanext/datapusher_plus/jobs/stages/database.py: PostgreSQL COPY operations stage
  • ckanext/datapusher_plus/jobs/stages/indexing.py: Automatic index creation stage
  • ckanext/datapusher_plus/jobs/stages/formula.py: DRUF Jinja2 formula processing stage
  • ckanext/datapusher_plus/jobs/stages/metadata.py: Resource metadata updates and aliasing stage
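
A plausible shape for the abstract base class in stages/base.py, inferred from how stages are invoked elsewhere in the PR (stage(context) and stage.name); the method names below are assumptions:

```python
# Hypothetical base class: each concrete stage sets a name and implements run().
from abc import ABC, abstractmethod


class PipelineStage(ABC):
    name: str = "base"

    @abstractmethod
    def run(self, context):
        """Do this stage's work; return the (possibly updated) context,
        or None to stop the pipeline."""

    def __call__(self, context):
        context.logger.info("Starting stage: %s", self.name)
        return self.run(context)
```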


    )
    sortcheck_json = json.loads(str(stdout_content))
except (json.JSONDecodeError, AttributeError) as e:
    raise utils.JobError(f"Failed to parse sortcheck JSONoutput: {e}")

Copilot AI Nov 17, 2025

Corrected spelling of 'JSONoutput' to 'JSON output' (should have a space).

Suggested change
raise utils.JobError(f"Failed to parse sortcheck JSONoutput: {e}")
raise utils.JobError(f"Failed to parse sortcheck JSON output: {e}")

Comment on lines +173 to +174
f'"{len(unsafe_headers)} unsafe" header names found '
f"({unsafe_headers}). Sanitizing...\""

Copilot AI Nov 17, 2025

Mismatched quotation marks: the log message wraps '{len(unsafe_headers)} unsafe' in double quotes but also ends with a stray escaped double quote. Should be consistent: f'"{len(unsafe_headers)} unsafe" header names found ({unsafe_headers}). Sanitizing...'

Suggested change
f'"{len(unsafe_headers)} unsafe" header names found '
f"({unsafe_headers}). Sanitizing...\""
f"{len(unsafe_headers)} unsafe header names found "
f"({unsafe_headers}). Sanitizing..."

Comment on lines +287 to +291
context = stage(context)

# If stage returns None, stop pipeline
if context is None:
    context.logger.info(f"Pipeline stopped after stage: {stage.name}")

Copilot AI Nov 17, 2025

Accessing context.logger after context has been reassigned to None raises an AttributeError: once a stage returns None, the context can no longer be dereferenced. Capture the logger before calling the stage, or store it separately.

Suggested change
context = stage(context)
# If stage returns None, stop pipeline
if context is None:
    context.logger.info(f"Pipeline stopped after stage: {stage.name}")
logger = context.logger
context = stage(context)
# If stage returns None, stop pipeline
if context is None:
    logger.info(f"Pipeline stopped after stage: {stage.name}")

)

# Process resource suggestion formulae
package = self._process_resource_suggestions(

Copilot AI Nov 17, 2025

Variable package is not used.

Suggested change
package = self._process_resource_suggestions(
self._process_resource_suggestions(

else:
    qsv_input_utf_8_encoded_csv = tmp
try:
    qsv.input(tmp, trim_headers=True, output_file=qsv_input_csv)

Copilot AI Nov 17, 2025

Variable qsv_input_utf_8_encoded_csv is not used.

Suggested change
qsv.input(tmp, trim_headers=True, output_file=qsv_input_csv)
qsv.input(qsv_input_utf_8_encoded_csv, trim_headers=True, output_file=qsv_input_csv)

f"Resource too large to download: {DataSize(int(cl)):.2MB} "
f"> max ({DataSize(int(max_content_length)):.2MB})."
)
except ValueError:

Copilot AI Nov 17, 2025

'except' clause does nothing but pass and there is no explanatory comment.

Suggested change
except ValueError:
except ValueError:
    # If content-length header is missing or malformed, skip size check and proceed.

raise utils.JobError(
f"Resource too large to download: {DataSize(int(cl)):.2MB} > max ({DataSize(int(max_content_length)):.2MB})."
)
except ValueError:

Copilot AI Nov 17, 2025

'except' clause does nothing but pass and there is no explanatory comment.

Suggested change
except ValueError:
except ValueError:
    # If content-length header is missing or malformed, ignore and proceed as file size is unknown.

# to truncate overly long strings from causing issues with
# Python's CSV reader and Postgres's limits with the COPY command
if spatial_format_flag:
    env = os.environ.copy()

Copilot AI Nov 17, 2025

This statement is unreachable.

Comment on lines +598 to +600
f = open(qsv_input_utf_8_encoded_csv, "wb")
f.write(cmd.stdout)
f.close()

Copilot AI Nov 17, 2025

File may not be closed if this operation raises an exception.

Suggested change
f = open(qsv_input_utf_8_encoded_csv, "wb")
f.write(cmd.stdout)
f.close()
with open(qsv_input_utf_8_encoded_csv, "wb") as f:
    f.write(cmd.stdout)

@jqnatividad mentioned this pull request Nov 17, 2025
tino097 and others added 2 commits November 27, 2025 14:46