Thanks for helping expand the Data Engineering Practice Problems collection. This project thrives on a steady flow of realistic scenarios, clean reference implementations, and thoughtful documentation. Please use the guidance below to keep the repository approachable and consistent.
- New practice problems – add interview-style prompts, sample datasets, and reference solutions.
- Enhance existing solutions – improve performance, readability, resilience, or validation coverage.
- Improve documentation – clarify instructions, add diagrams or walkthroughs, fix typos.
- Quality assurance – add tests, linting, or automation that strengthens the developer experience.
- Discuss (optional but encouraged): Open a draft issue describing the idea, especially for larger additions (new problem families, major refactors).
- Fork & branch: Use feature branches named `feature/<summary>` or `fix/<summary>`.
- Develop locally: Target Python 3.10+, prefer type hints, and keep third-party dependencies minimal. Include lightweight tests (`pytest`, `doctest`, or script-based checks) when feasible.
- Keep commits focused: Each commit should represent a cohesive change with a clear message.
- Open a pull request: Summarise what changed, why it matters, and how you validated it. Link related issues if applicable.
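
As an illustration of the "lightweight tests" mentioned in the workflow above, here is a minimal doctest sketch. The function `dedupe_preserving_order` is a hypothetical helper invented for this example and is not part of the repository.

```python
"""Hypothetical helper used only to illustrate a doctest-style check."""
import doctest


def dedupe_preserving_order(values: list[str]) -> list[str]:
    """Return values with duplicates removed, keeping first occurrences.

    >>> dedupe_preserving_order(["a", "b", "a", "c", "b"])
    ['a', 'b', 'c']
    >>> dedupe_preserving_order([])
    []
    """
    seen: set[str] = set()
    result: list[str] = []
    for value in values:
        if value not in seen:
            seen.add(value)
            result.append(value)
    return result


if __name__ == "__main__":
    # Run the examples embedded in the docstrings; any failures are printed.
    doctest.testmod()
```

Running `python -m doctest solution.py -v` exercises the same examples without the `__main__` block, which may be enough validation for small problems.
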
- Each problem lives in its own folder at the repository root, e.g. `Problem X: <Descriptive Title>/`.
- Inside each problem directory:
  - `question.md` – prompt or requirements written in Markdown; include context, inputs, outputs, and any bonus objectives.
  - `solution.py` – reference implementation using the standard library where possible; include a short module docstring explaining the approach.
  - Optional subdirectories (`data/`, `tests/`) if the problem demands bespoke assets. Keep paths relative (e.g., `../data/` for shared sample data).
- Shared datasets belong in the top-level `data/` directory. Provide a `README` snippet or inline comments describing their provenance if they are newly added.
- Aim for realistic data engineering themes: streaming ingestion, batch transformation, data quality, orchestration, storage optimisation, monitoring, or reliability.
- Calibrate difficulty with clear expectations. Outline assumptions, edge cases, and success criteria in `question.md`.
- When new problem types emerge (e.g., Airflow DAG design, dbt modelling), note the category at the top of the prompt for quick scanning.
- Prefer self-contained assets. If external services or APIs are referenced, provide mocks or clear instructions for local simulation.
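
One way to keep a problem self-contained when it refers to an external API is to inject the network call and replace it with a mock in the reference solution's checks. The sketch below assumes a hypothetical `fetch_exchange_rate` function, a made-up `example.invalid` URL, and an invented payload shape; none of these come from the repository.

```python
from typing import Callable
from unittest.mock import Mock


def fetch_exchange_rate(currency: str, http_get: Callable[[str], dict]) -> float:
    """Look up a rate through an injected HTTP getter (hypothetical API shape)."""
    payload = http_get(f"https://example.invalid/rates/{currency}")
    return float(payload["rate"])


# Stand-in for the real network call, so the example runs offline.
fake_get = Mock(return_value={"rate": "1.25"})
rate = fetch_exchange_rate("EUR", fake_get)

# Confirm the solution requested the URL we expected.
fake_get.assert_called_once_with("https://example.invalid/rates/EUR")
```

Passing the getter as a parameter keeps the solution runnable without credentials or network access, at the cost of a slightly wider function signature.
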
- Use descriptive naming, type hints, and concise comments where clarity is needed.
- For scripts, include a guarded `main()` entry point and friendly CLI output.
- Handle large-file scenarios with streaming reads/writes; avoid pulling entire datasets into memory unless the problem explicitly targets small inputs.
- Validate inputs defensively and surface actionable errors.
- If a contribution introduces dependencies, update any relevant setup instructions and pin versions in `requirements.txt` (create one if needed for the specific problem).
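
The coding guidelines above could be combined in a `solution.py` along the following lines. This is a sketch under stated assumptions, not a required template: the CSV input, the `amount` column, and the names `iter_amounts` and `main` are illustrative choices made for this example.

```python
"""Sum a numeric CSV column by streaming rows instead of loading the whole file."""
from __future__ import annotations

import argparse
import csv
import sys
from pathlib import Path
from typing import Iterator


def iter_amounts(path: Path, column: str) -> Iterator[float]:
    """Yield one parsed value per row so memory use stays flat on large files."""
    if not path.is_file():
        raise FileNotFoundError(f"Input file not found: {path}")
    with path.open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        # Validate the header up front so the error names the real problem.
        if reader.fieldnames is None or column not in reader.fieldnames:
            raise ValueError(f"Column {column!r} missing; found {reader.fieldnames}")
        for row in reader:
            try:
                yield float(row[column])
            except (TypeError, ValueError) as exc:
                raise ValueError(
                    f"Line {reader.line_num}: invalid {column!r} value {row[column]!r}"
                ) from exc


def main(argv: list[str] | None = None) -> int:
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("csv_path", type=Path, nargs="?", help="CSV file to read")
    parser.add_argument("--column", default="amount", help="numeric column to sum")
    args = parser.parse_args(argv)
    if args.csv_path is None:
        # No input given: show usage rather than failing with a traceback.
        parser.print_usage()
        return 0
    try:
        total = sum(iter_amounts(args.csv_path, args.column))
    except (FileNotFoundError, ValueError) as exc:
        print(f"error: {exc}", file=sys.stderr)
        return 1
    print(f"Total {args.column}: {total:.2f}")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

Because `iter_amounts` is a generator, `sum` consumes one row at a time, and `main` returns an exit code instead of calling `sys.exit` itself, which makes it straightforward to call from a test.
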
Before opening a pull request, confirm:
- `question.md` reads cleanly with correct Markdown formatting.
- `solution.py` runs locally using the provided instructions.
- README references (if touched) remain accurate.
Thank you for sharing your expertise and helping fellow engineers level up. Star the repository to stay notified about new challenges and updates!