Reinforcement Learning for Multi-Product Inventory Management
This project implements and evaluates Deep Reinforcement Learning agents (DQN and PPO) for managing replenishment policies in a two-product warehouse system, comparing their performance against classical (s,S) baseline policies.
The system manages inventory for two products with different demand patterns and lead times. At the beginning of each day, the agent must decide:
- Whether to place a replenishment order for each product
- How many units to order (if any)
Objective: Minimize total operational costs (ordering, holding, and shortage costs).
- Products: 2 independent items with distinct suppliers
- Demand: Exponential inter-arrival times (λ=0.1) with discrete quantity distributions
- Lead Times: Stochastic and unobservable (POMDP setting)
- Product 1: U(0.5, 1.0) months
- Product 2: U(0.2, 0.7) months
- Cost Structure:
- Setup cost (K): $10 per order
- Incremental cost (i): $3 per unit
- Holding cost (h): $1 per unit-day
- Shortage cost (π): $7 per backlogged unit-day
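As a minimal sketch of how these parameters combine, the daily cost for one product can be written as follows (the function and its signature are illustrative, not the project's actual implementation; parameter defaults come from the cost structure above):

```python
def daily_cost(order_qty: int, on_hand: int, backlog: int,
               K: float = 10.0, i: float = 3.0,
               h: float = 1.0, pi: float = 7.0) -> float:
    """One day's cost for a single product.

    K  - setup cost per order placed
    i  - incremental cost per unit ordered
    h  - holding cost per on-hand unit-day
    pi - shortage cost per backlogged unit-day
    """
    ordering = (K + i * order_qty) if order_qty > 0 else 0.0
    holding = h * on_hand
    shortage = pi * backlog
    return ordering + holding + shortage

# Example: order 5 units with 12 on hand and no backlog
print(daily_cost(order_qty=5, on_hand=12, backlog=0))  # 10 + 15 + 12 = 37.0
```

Note that the setup cost K is incurred only when an order is actually placed, which is what makes the "order or not" part of the action non-trivial.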
For the complete problem formulation, see docs/assigment.md and docs/mdp.md.
Built a custom inventory simulation using SimPy that models:
- Customer demand arrival processes
- Supplier lead time delays
- Inventory dynamics (on-hand, backorders, in-transit)
- Daily cost accumulation
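The demand arrival process can be sketched as below. The project uses SimPy; this standalone sketch uses only the standard library, and the discrete quantity distribution is a placeholder, not the project's actual one:

```python
import random

def generate_demands(rate: float = 0.1, horizon: float = 365.0, seed: int = 42):
    """Yield (arrival_time, quantity) pairs over a finite horizon.

    Inter-arrival times are Exponential(rate), matching the lambda = 0.1
    setting above. The quantity distribution here is a stand-in.
    """
    rng = random.Random(seed)
    t = 0.0
    while True:
        t += rng.expovariate(rate)  # exponential gap to the next customer
        if t > horizon:
            return
        qty = rng.choices([1, 2, 3, 4], weights=[4, 3, 2, 1])[0]
        yield t, qty

demands = list(generate_demands())
print(len(demands), "demand events over the horizon")
```

In the SimPy version, the same loop would live inside a process that calls `yield env.timeout(...)` between arrivals instead of accumulating `t` directly.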
Addressed the POMDP challenge using frame stacking to approximate the Markov property.
State: [Inventory_Level, Outstanding_Orders] for each product, stacked over k+1 time steps.
Action: Discrete order quantities [q₁, q₂] for each product.
Reward: Negative total cost (ordering + holding + shortage).
See docs/mdp.md for mathematical details.
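Frame stacking over k+1 steps can be sketched with a `deque` (a minimal illustration, assuming the flat observation layout `[inv1, out1, inv2, out2]` from the state description above; `k` is a free parameter):

```python
from collections import deque

class FrameStacker:
    """Stack the last k+1 observations to approximate the Markov property."""

    def __init__(self, k: int = 3, obs_dim: int = 4):
        self.k = k
        self.obs_dim = obs_dim
        self.frames = deque(maxlen=k + 1)

    def reset(self, obs):
        # Pad the whole history with the initial observation
        for _ in range(self.k + 1):
            self.frames.append(list(obs))
        return self.stacked()

    def push(self, obs):
        # Append the newest observation; the oldest frame falls off
        self.frames.append(list(obs))
        return self.stacked()

    def stacked(self):
        # Flat vector, oldest frame first
        return [x for frame in self.frames for x in frame]

# obs = [inv1, out1, inv2, out2] per the state definition above
fs = FrameStacker(k=2, obs_dim=4)
s = fs.reset([10, 0, 5, 2])
print(len(s))  # (k+1) * obs_dim = 12
```

In practice the same effect can be had with Gymnasium's built-in `FrameStackObservation` wrapper; the class above just makes the mechanics explicit.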
- DQN (Deep Q-Network): Value-based method with experience replay
- PPO (Proximal Policy Optimization): Policy gradient method with clipped objective
Both implemented using Stable-Baselines3 with custom Gymnasium environment wrappers.
Classical (s,S) policy: Order up to S when inventory falls below s
- Tuned empirically through grid search on steady-state costs
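The (s,S) baseline reduces to a one-line decision rule. A minimal sketch, assuming the decision is driven by the inventory position (on hand minus backlog plus in transit) and using placeholder parameter values rather than the tuned ones:

```python
def s_S_order(inventory_position: int, s: int, S: int) -> int:
    """Order up to S when the inventory position drops below s; else order nothing."""
    return S - inventory_position if inventory_position < s else 0

print(s_S_order(inventory_position=3, s=5, S=20))  # orders 17
print(s_S_order(inventory_position=8, s=5, S=20))  # orders 0
```

The grid search then simply evaluates this rule over a grid of (s, S) pairs and keeps the pair with the lowest steady-state cost.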
Performance was evaluated using Welch's procedure over 1000 independent replications to identify the warmup period and compare steady-state costs.
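The core of Welch's procedure can be sketched in a few lines (an illustrative stand-in for the notebook's implementation, not a copy of it): average the cost series across replications period by period, then smooth with a centered moving average and pick the warmup cutoff where the smoothed curve flattens.

```python
def welch_averages(replications, window: int = 5):
    """Steps 1-2 of Welch's procedure.

    replications: list of per-period cost series, one per replication.
    Returns (cross-replication averages, centered moving averages).
    """
    n_periods = min(len(r) for r in replications)
    # Step 1: average across replications at each period
    avg = [sum(r[t] for r in replications) / len(replications)
           for t in range(n_periods)]
    # Step 2: centered moving average with an odd window
    half = window // 2
    smoothed = [sum(avg[t - half:t + half + 1]) / window
                for t in range(half, n_periods - half)]
    return avg, smoothed

# Toy data: costs settle to 6 after a short transient
reps = [[12, 9, 7, 6, 6, 6], [14, 10, 8, 6, 6, 6], [13, 8, 6, 6, 6, 6]]
avg, smoothed = welch_averages(reps, window=3)
print(smoothed)
```

Periods before the flattening point are then discarded as warmup when comparing policies, which is what the "warmup period detection" above refers to.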
- ✅ Both RL agents successfully learned non-trivial inventory policies
- ✅ Policies account for lead time uncertainty through observation history
- ✅ Warmup period detection applied to exclude transient behavior
- 📈 Performance varies based on hyperparameters (Q_max, learning rate, network architecture)
Note: Run notebooks/welch_procedure.ipynb to generate detailed performance comparison and statistical analysis.
See notebooks/ for complete experimental results and visualizations.
- Python 3.12+
- uv (recommended for fast dependency management)
Clone and set up the environment:
```bash
git clone https://github.com/MarinCervinschi/rl-inventorysystem.git
cd rl-inventorysystem
uv sync
```

That's it! `uv sync` creates a virtual environment and installs everything you need.
- Simulation: SimPy (Discrete Event Simulation)
- RL Framework: Stable-Baselines3 + Gymnasium
- Algorithms: DQN, PPO
- Analysis: NumPy, Pandas, Matplotlib, Seaborn
- Assignment Specification - Original problem statement
- MDP Formulation - Complete mathematical formulation
- Implementation Tips - Development guidelines
Explore the experimental workflow:
- MDP Exploration - Understanding the state/action space
- Simulation Basics - Testing the SimPy engine
- Baseline Tuning - Optimizing (s,S) parameters
- DQN Training - Hyperparameter tuning & results
- PPO Training - Policy gradient experiments
- Welch Analysis - Steady-state cost comparison
Supply Chain Management - Master's Degree Program
University Project - January 2026
Academic project for educational purposes.