- Project Overview
- Dataset Explanation
- Task Breakdown
- Getting Started
- Detailed Explanations
- Results & Evaluation
This project implements a Long Short-Term Memory (LSTM) neural network to predict the Remaining Useful Life (RUL) of aircraft turbofan engines before they fail.
Real-world Application:
- Airlines operate thousands of engines across their fleet
- Unexpected engine failures → expensive maintenance + flight delays
- Predictive maintenance → schedule repairs proactively
- Your model predicts: "In how many operational cycles will this engine fail?"
| Scenario | Cost | Impact |
|---|---|---|
| Unplanned Failure | $500K-$2M per incident | Safety hazard, customer distrust |
| Predictive Schedule | $100K-$300K (planned) | Safety, revenue protection |
| Savings | 50-70% cost reduction | Better fleet utilization |
NASA provides 4 increasingly complex datasets (FD001, FD002, FD003, FD004):
| Aspect | FD001 | FD002 | FD003 | FD004 |
|---|---|---|---|---|
| Train Engines | 100 | 260 | 100 | 248 |
| Test Engines | 100 | 259 | 100 | 249 |
| Op. Conditions | 1 (Sea Level) | 6 (Variable) | 1 (Sea Level) | 6 (Variable) |
| Fault Modes | 1 (HPC) | 1 (HPC) | 2 (HPC + Fan) | 2 (HPC + Fan) |
| Difficulty | 🟢 Easy | 🟡 Medium | 🟡 Medium | 🔴 Hard |
Each row represents one engine snapshot during one operational cycle:
Column | Description
--------|--------------------------------------------------
1 | Unit/Engine ID (1-100 for FD001, 1-260 for FD002)
2 | Time in cycles (how many cycles this engine has run)
3-5 | Operational Settings (altitude, Mach number, throttle resolver angle)
6-26 | Sensor Measurements (21 different sensor readings)
Engine 1, Cycle 1:
1 1 -0.0007 -0.0004 0.0000 100.00 518.67 641.82 ... [18 more sensor values]
Engine 1, Cycle 2:
1 2 -0.0007 -0.0004 0.0000 100.00 518.67 642.15 ...
... (cycles 3 to 192 for Engine 1)
Engine 1, Cycle 192 (FAILURE):
1 192 0.0411 0.0440 0.0000 100.00 518.67 2388.04 ...
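The column layout above maps directly onto a whitespace-separated parse. The following is a minimal sketch, assuming pandas is available. The column names (`unit_id`, `cycle`, `setting_1`..`setting_3`, `s_1`..`s_21`) and the helper `load_cmapss` are illustrative choices, not names confirmed by the notebook, and the sample rows are synthetic.

```python
import io

import pandas as pd

# Assumed column names, chosen to match the 26-column layout described above.
COLUMNS = (
    ["unit_id", "cycle"]
    + [f"setting_{i}" for i in range(1, 4)]
    + [f"s_{i}" for i in range(1, 22)]
)


def load_cmapss(path_or_buffer):
    """Read a whitespace-separated C-MAPSS file into a DataFrame with named columns."""
    return pd.read_csv(path_or_buffer, sep=r"\s+", header=None, names=COLUMNS)


# Synthetic stand-in for a file such as train_FD001.txt (all values are made up).
sample = "\n".join(
    " ".join(["1", str(cycle)] + ["0.0"] * 24) for cycle in (1, 2, 3)
)
df = load_cmapss(io.StringIO(sample))
print(df.shape)
```

For a real file, the same helper would be called with a path such as `CMaps/train_FD001.txt`.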
- Each engine generates a sequence of sensor readings
- Length varies: Engine A fails after 100 cycles, Engine B after 300 cycles
- Challenge: Model must handle variable-length sequences
Each cycle has 21 sensor readings:
s_1, s_2, ..., s_21
This is NOT a univariate problem (single time series).
It's a multivariate problem (multiple interconnected time series).
Cycle: 1 50 100 150 192 (failure)
Sensor: 100.0 100.5 101.2 102.1 103.8
Pattern: Slow drift over time, accelerates near failure
- FD001: Only sea-level (simple)
- FD002: 6 different altitude + throttle combinations (complex)
Example:
Setting 1 = -0.0007 → High altitude operation
Setting 2 = -0.0004 → Different throttle position
Same engine at different conditions = different degradation rates
RUL = Number of cycles remaining before engine failure
Training Data (we know when each engine fails):
Engine 1: Fails at cycle 192
At cycle 1: RUL = 192 - 1 = 191 cycles remaining
At cycle 100: RUL = 192 - 100 = 92 cycles remaining
At cycle 192: RUL = 192 - 192 = 0 cycles remaining (FAILURE)

General Formula:
RUL(t) = max_cycle - current_cycle
Why cap RUL at 130 cycles?
Problem: Some engines live very long (250+ cycles)
Without capping, model learns: "RUL = 250"
This becomes the "easy" answer for healthy engines
Solution: Cap RUL at 130
- Engine with 200 cycles remaining → label as 130
- Engine with 100 cycles remaining → label as 100
- Prevents extreme values, improves generalization
Effect: ~40% of training samples hit the cap in both domains
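The labeling rule, including the cap, can be sketched in a few lines. This is a minimal illustration assuming a pandas DataFrame with `unit_id` and `cycle` columns; those names, the `rul` column and the `add_rul` helper are assumptions made for this example.

```python
import pandas as pd

RUL_CAP = 130  # cap value taken from the text above


def add_rul(df, cap=RUL_CAP):
    """Append a capped RUL column: min(max_cycle - cycle, cap), computed per engine."""
    max_cycle = df.groupby("unit_id")["cycle"].transform("max")
    out = df.copy()
    out["rul"] = (max_cycle - out["cycle"]).clip(upper=cap)
    return out


# One synthetic engine that fails at cycle 192, as in the worked example.
demo = pd.DataFrame({"unit_id": [1] * 192, "cycle": range(1, 193)})
labeled = add_rul(demo)
rul_by_cycle = labeled.set_index("cycle")["rul"]
print(rul_by_cycle.loc[[1, 100, 192]])
```

With the cap applied, the first 62 cycles of this engine all carry the label 130, and the label only starts decreasing once fewer than 130 cycles remain.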
Engine Lifecycle:
|--- Healthy ---|--- Degrading ---|--- Near Failure ---|
RUL: 130 -> 130 -> 92 -> 42 -> 12 -> 0
Cycle: 1 -> 50 -> 100 -> 150 -> 180 -> 192
Deliverables:
- ✅ Load CMAPSS data (FD001, FD002)
- ✅ Exploratory Data Analysis with 5+ visualizations
- ✅ Sensor trend analysis (degradation curves)
- ✅ RUL label generation and visualization
- ✅ Domain shift analysis (FD001 vs FD002)
- ✅ Summary of key insights
Key Questions to Answer:
- How many cycles does a typical engine run?
- Which sensors show the clearest degradation patterns?
- What's the difference between FD001 and FD002?
- Why is RUL capping necessary?
What to do:
- Design LSTM architecture with:
- Embedding or input layer
- 2 LSTM layers with dropout
- Dense output layer (regression head)
- Justify each design choice
- Explain why LSTM > Feedforward > Linear Regression
Example Architecture:
Input (batch, 30, 24)
↓
LSTM Layer 1 (96 units, dropout=0.3)
↓
LSTM Layer 2 (96 units, dropout=0.3)
↓
Dense Layer (1 unit) → RUL prediction
Steps:
- Imputation: Handle missing values (forward-fill, backfill)
- Scaling: StandardScaler (zero mean, unit variance)
- Sequence Windowing: Create fixed-size time windows
- Train/Validation Split: Temporal split (respect causality)
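One detail of the scaling step is worth making explicit: the mean and standard deviation should come from the training rows only and then be reused for validation and test data. The sketch below mirrors what `StandardScaler` does using plain NumPy. The helper names and the synthetic arrays are assumptions for illustration.

```python
import numpy as np


def fit_standardizer(train_features):
    """Return per-feature mean and std computed on training rows only."""
    mean = train_features.mean(axis=0)
    std = train_features.std(axis=0)
    # Constant features have zero std; use 1.0 so they scale to 0 instead of NaN.
    std = np.where(std < 1e-12, 1.0, std)
    return mean, std


def standardize(features, mean, std):
    """Apply previously fitted statistics to any split."""
    return (features - mean) / std


rng = np.random.default_rng(0)
train = rng.normal(loc=500.0, scale=20.0, size=(1000, 24))
train[:, 5] = 100.0  # simulate a sensor that never changes
val = rng.normal(loc=500.0, scale=20.0, size=(200, 24))

mean, std = fit_standardizer(train)
train_scaled = standardize(train, mean, std)
val_scaled = standardize(val, mean, std)  # reuses training statistics
print(train_scaled.mean(axis=0).round(3)[:3], train_scaled.std(axis=0).round(3)[:3])
```

Fitting on the training split alone keeps validation statistics out of the model inputs, which would otherwise make validation loss look better than it is.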
Example Windowing:
Original sequence: [s1, s2, s3, s4, s5, ..., s192]
Window size = 30, Stride = 1:
Window 1: [s1:s30] → RUL at s30
Window 2: [s2:s31] → RUL at s31
Window 3: [s3:s32] → RUL at s32
...
Window 163: [s163:s192] → RUL at s192 (failure)
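The windowing scheme above can be written as a short NumPy helper. This is a sketch under the stated settings (window 30, stride 1). The function name `make_windows` and the zero-filled feature array are placeholders for illustration.

```python
import numpy as np


def make_windows(features, rul, window=30, stride=1):
    """Slice one engine's (cycles, n_features) array into overlapping windows.

    Each window is labeled with the RUL at its last cycle.
    Engines with fewer than `window` cycles are not handled here.
    """
    starts = range(0, len(features) - window + 1, stride)
    X = np.stack([features[s:s + window] for s in starts])
    y = np.array([rul[s + window - 1] for s in starts])
    return X, y


n_cycles, n_features = 192, 24
features = np.zeros((n_cycles, n_features))             # placeholder sensor data
rul = np.minimum(np.arange(n_cycles - 1, -1, -1), 130)  # capped RUL per cycle
X, y = make_windows(features, rul)
print(X.shape, y.shape)
```

A 192-cycle engine yields 192 - 30 + 1 = 163 windows, matching the count in the example.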
Training:
- Loss function: Mean Squared Error (MSE)
- Optimizer: Adam
- Early stopping (patience=10)
- Learning rate scheduling
Evaluation Metrics:
RMSE = sqrt(mean((predicted_RUL - actual_RUL)^2))
MAE = mean(|predicted_RUL - actual_RUL|)
Good RMSE: < 15 cycles
Great RMSE: < 10 cycles
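Both metrics are simple to compute directly. The sketch below uses NumPy with three made-up predictions to show how RMSE penalizes a single large miss more than MAE does.

```python
import numpy as np


def rmse(pred, actual):
    """Root mean squared error, in cycles."""
    pred, actual = np.asarray(pred, dtype=float), np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean((pred - actual) ** 2)))


def mae(pred, actual):
    """Mean absolute error, in cycles."""
    pred, actual = np.asarray(pred, dtype=float), np.asarray(actual, dtype=float)
    return float(np.mean(np.abs(pred - actual)))


predicted = [100, 50, 10]  # illustrative values only
actual = [90, 50, 30]
print(round(rmse(predicted, actual), 2), mae(predicted, actual))
```

Here the errors are 10, 0 and -20 cycles, so MAE is 10 while RMSE is about 12.91, pulled up by the 20-cycle miss.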
Visualizations:
- Loss curves (training vs validation)
- Prediction vs actual scatter plots
- Residual analysis
- Sensor importance (permutation feature importance)
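Permutation feature importance, the last item above, measures how much RMSE rises when one feature is shuffled across samples. The sketch below is model-agnostic: `predict` is any callable that maps windows to predictions. The toy predictor and random data are assumptions for illustration, not the project's model.

```python
import numpy as np


def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean RMSE increase when each feature is shuffled across samples.

    predict: callable mapping (n, seq_len, n_features) -> (n,) predictions.
    """
    rng = np.random.default_rng(seed)

    def rmse(pred):
        return np.sqrt(np.mean((pred - y) ** 2))

    baseline = rmse(predict(X))
    scores = np.zeros(X.shape[2])
    for j in range(X.shape[2]):
        increases = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            perm = rng.permutation(len(X))
            # Swap feature j between samples; time order inside each window is kept.
            X_perm[:, :, j] = X[perm, :, j]
            increases.append(rmse(predict(X_perm)) - baseline)
        scores[j] = np.mean(increases)
    return scores


# Toy predictor that depends only on feature 0 at the last time step.
def toy_predict(windows):
    return 3.0 * windows[:, -1, 0]


rng = np.random.default_rng(1)
X = rng.normal(size=(200, 30, 4))
y = toy_predict(X)
scores = permutation_importance(toy_predict, X, y)
print(scores.round(3))
```

Only feature 0 receives a non-zero score in this toy setup, because it is the only input the predictor reads.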
Real-world aspects:
- How to use the model in production?
- Edge computing considerations
- Maintenance scheduling decisions
- Cost-benefit analysis
# Activate the conda environment
conda activate pythonenv
# Install additional packages if needed
pip install numpy pandas scikit-learn torch matplotlib seaborn

# Navigate to project directory
cd d:\Piyush\College\AI\EST_Project
# Open Jupyter Notebook
jupyter notebook Predictive_Maintenance_RUL_LSTM.ipynb

EST_Project/
├── Predictive_Maintenance_RUL_LSTM.ipynb (Main notebook, fully executed)
├── dashboard.html (Interactive visualization dashboard)
├── README.md (This file)
├── CMaps/ (NASA C-MAPSS dataset)
│   ├── train_FD001.txt / test_FD001.txt / RUL_FD001.txt
│   ├── train_FD002.txt / test_FD002.txt / RUL_FD002.txt
│   ├── train_FD003.txt / test_FD003.txt / RUL_FD003.txt
│   ├── train_FD004.txt / test_FD004.txt / RUL_FD004.txt
│   └── readme.txt / Damage Propagation Modeling.pdf
├── api/
│   ├── rul_service.py (FastAPI REST API for real-time RUL prediction)
│   └── simulator.py (Sensor data streaming simulator)
├── artifacts/deployment/
│   ├── rul_lstm_fd001.pt (Trained LSTM model weights)
│   ├── feature_scaler.pkl (StandardScaler for feature normalization)
│   ├── deployment_metadata.json (Model metadata for production)
│   ├── maintenance_plan_top20.csv (Top 20 maintenance priority engines)
│   ├── optimization_eval_metrics.csv (Pruning/quantization evaluation)
│   └── optimization_eval_comparison.png (Optimization comparison chart)
└── tools/
    └── evaluate_optimizations.py (Model optimization evaluation script)

Traditional ML (Random Forest, SVM):
Input: [s1, s2, ..., s21] (single cycle)
Problem: Ignores time order
LSTM:
Input: [Cycle1=[s1...], Cycle2=[s1...], ..., Cycle30=[s1...]]
Advantage: Captures "how sensors are trending"
Engine A: 192 cycles
Engine B: 245 cycles
Engine C: 156 cycles
LSTM: Handles with padding/windowing ✅
Feedforward: Requires fixed input size ❌
Is the engine about to fail?
Answer depends on:
- Recent sensor trends (last 10 cycles)
- Overall degradation pattern (all 192 cycles)
- Rate of change (comparing cycle 50 vs cycle 190)
LSTM remembers long-range context via hidden state
# Before standardization:
sensor_s2 = [100.5, 101.2, 102.1, ..., 650.3]
# Range: 100 to 650
# After standardization:
sensor_s2 = [-1.5, -1.3, -1.1, ..., 2.8]
# Mean: 0, Std: 1
Why?
1. All sensors on same scale
2. Prevents numerical instability
3. Improves gradient flow during backprop

Engine 1 runs 192 cycles with 24 features per cycle
Raw shape: (192, 24)
Window size = 30, Stride = 1:
Sample 1: X[0:30], y[RUL at cycle 30]
Sample 2: X[1:31], y[RUL at cycle 31]
...
Sample 163: X[162:192], y[RUL at cycle 192]
Result: 163 training samples from 1 engine
100 engines → roughly 16,300 training samples (if every engine ran 192 cycles; actual lifetimes vary)

Input: (batch_size=256, seq_len=30, n_features=24)
LSTM Cell:
- Cell State (memory): captures long-term dependencies
- Hidden State: short-term memory
- Gates:
* Input gate: what to add to memory?
* Forget gate: what to forget from memory?
* Output gate: what to output?
Layer 1 LSTM (96 units):
  Input: (256, 30, 24) → Output: (256, 30, 96)
Dropout (0.3):
  Randomly zeroes 30% of outputs
  Prevents overfitting
Layer 2 LSTM (96 units):
  Input: (256, 30, 96) → Output: (256, 30, 96)
Global Average Pooling:
  (256, 30, 96) → (256, 96)
  Summarize the whole sequence
Dense Layer (regression head):
  (256, 96) → (256, 1)
  Final RUL prediction
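The shape walkthrough above corresponds to a compact PyTorch module. This is a sketch rather than the notebook's exact class: the name `RULLSTM` is hypothetical, and it relies on `nn.LSTM` applying its `dropout` argument between stacked layers, which matches the Layer 1 → Dropout → Layer 2 order shown.

```python
import torch
from torch import nn


class RULLSTM(nn.Module):
    """Two stacked LSTM layers, mean pooling over time, linear regression head."""

    def __init__(self, n_features=24, hidden=96, dropout=0.3):
        super().__init__()
        # With num_layers=2, dropout is applied to layer 1 outputs before layer 2.
        self.lstm = nn.LSTM(
            n_features, hidden, num_layers=2, batch_first=True, dropout=dropout
        )
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):
        out, _ = self.lstm(x)      # (batch, seq_len, hidden)
        pooled = out.mean(dim=1)   # global average pooling -> (batch, hidden)
        return self.head(pooled)   # (batch, 1)


model = RULLSTM()
x = torch.randn(256, 30, 24)
y = model(x)
print(tuple(y.shape))
```

Averaging over all 30 time steps is one design choice; taking only the last hidden state is a common alternative that weights the most recent cycles more heavily.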
best_val_loss = float("inf")
patience_counter = 0
for epoch in range(45):
    # Training phase
    model.train()
    for X, y in train_loader:
        # Forward pass
        pred = model(X)            # Shape: (256, 1)
        loss = criterion(pred, y)  # MSE loss; y must also have shape (256, 1)
        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # Validation phase: average the loss over all validation batches
    model.eval()
    val_loss, n_val = 0.0, 0
    with torch.no_grad():
        for X_val, y_val in val_loader:
            pred_val = model(X_val)
            val_loss += criterion(pred_val, y_val).item() * len(X_val)
            n_val += len(X_val)
    val_loss /= n_val
    # Early stopping: stop if validation loss has not improved for 10 epochs
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        patience_counter = 0
    else:
        patience_counter += 1
        if patience_counter >= 10:
            break

| Dataset | RMSE (cycles) | MAE (cycles) | Test Engines |
|---|---|---|---|
| FD001 (In-Domain) | 15.87 | 11.46 | 100 |
| FD002 (Cross-Domain) | 45.37 | 39.02 | 259 |
| Model | RMSE (FD001) | MAE (FD001) | Notes |
|---|---|---|---|
| Linear Baseline | ~35 | ~28 | No temporal modeling |
| Random Forest | ~18 | ~12 | No sequence awareness |
| Our LSTM | 15.87 | 11.46 | Sequence modeling + dropout |
| State-of-the-art (literature) | ~10 | ~7 | Complex architectures + attention |
Note: Our RMSE of 15.87 is competitive for a 2-layer LSTM with early stopping at epoch 12. State-of-the-art results (~10 RMSE) typically require deeper architectures, attention mechanisms, or ensemble methods.
RMSE = 15.87 means:
- A typical prediction error is about ±16 cycles (RMSE weights large errors more heavily than MAE, which is 11.46)
- If actual RUL = 50, prediction ≈ 34–66 (acceptable for scheduling)
- If actual RUL = 10, prediction ≈ 0–26 (critical: use MC Dropout confidence)
This is why uncertainty estimation (MC Dropout) matters!
MC Dropout RMSE: 15.84 (slightly better due to ensemble effect)
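MC Dropout keeps dropout active at inference time and treats repeated forward passes as samples from a predictive distribution. The sketch below uses 50 passes, as in the notebook outline. The helper name and the small stand-in network are assumptions (the stand-in is not the project's LSTM), and the 1.96 multiplier assumes roughly Gaussian spread.

```python
import torch
from torch import nn


def mc_dropout_predict(model, x, n_passes=50):
    """Return the mean and std of predictions over stochastic forward passes."""
    was_training = model.training
    model.train()  # keeps dropout active; safe here because there is no batch norm
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_passes)])  # (n_passes, batch, 1)
    model.train(was_training)  # restore the original mode
    return preds.mean(dim=0), preds.std(dim=0)


torch.manual_seed(0)
toy = nn.Sequential(nn.Linear(24, 32), nn.ReLU(), nn.Dropout(0.3), nn.Linear(32, 1))
toy.eval()
x = torch.randn(8, 24)

mean, std = mc_dropout_predict(toy, x)
lower, upper = mean - 1.96 * std, mean + 1.96 * std  # approximate 95% interval
print(tuple(mean.shape), tuple(std.shape))
```

Averaging the passes also acts as a small ensemble, which is consistent with the slightly lower RMSE reported above.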
Model trained on FD001 (single operating condition):
- FD001 → FD001 test RMSE: 15.87 (in-domain: good)
- FD001 → FD002 test RMSE: 45.37 (cross-domain: significant drop!)
- FD001 → FD002 (mean-std adaptation) RMSE: 51.48 (worse than no adaptation, so simple alignment is insufficient)
Reason: FD002 has 6 different operating conditions vs FD001's 1
Key insight: Domain shift causes ~3x RMSE increase
Solution needed: Fine-tuning on FD002 data or advanced domain adaptation
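One plausible reading of "mean-std adaptation" is to rescale each target-domain feature so that its mean and standard deviation match the source domain. The sketch below shows that idea with NumPy. The function name and the synthetic arrays are assumptions, and the notebook's exact procedure may differ.

```python
import numpy as np


def mean_std_align(target, source_mean, source_std, eps=1e-8):
    """Rescale target features so each column matches the source mean and std."""
    target_mean = target.mean(axis=0)
    target_std = target.std(axis=0)
    return (target - target_mean) / (target_std + eps) * source_std + source_mean


rng = np.random.default_rng(0)
source = rng.normal(loc=100.5, scale=2.0, size=(500, 24))  # stand-in for FD001 features
target = rng.normal(loc=98.2, scale=5.0, size=(800, 24))   # stand-in for FD002 features

aligned = mean_std_align(target, source.mean(axis=0), source.std(axis=0))
print(aligned.mean(axis=0)[:3].round(2), aligned.std(axis=0)[:3].round(2))
```

A single global mean and standard deviation cannot represent six distinct operating regimes, which may be part of why this kind of alignment did not help here. Normalizing per operating condition is a commonly used alternative.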
FD001 (Easy):
Same engine, same altitude, same temperature
Degradation pattern: Consistent, predictable
Model task: Learn one degradation curve pattern
FD002 (Hard):
Same engine, 6 different operating conditions
Degradation pattern: Varies by condition
Model task: Learn 6 different degradation patterns simultaneously
Healthy phase: Constant sensor values
Degradation phase: Slow drift, then acceleration
Failure phase: Sudden changes, system shutdown
LSTM captures all three phases differently
High Information Gain Sensors:
- s_2, s_11, s_12, s_15, s_20, s_21
- Show clear degradation trends
- High variance across engine lifetime
Low Information Gain Sensors:
- s_4, s_10, s_16, s_18, s_19
- Remain mostly constant
- Don't correlate with failure
Feature selection could improve model, but all 21 sensors are included for completeness.
Operating Settings (Altitude × Throttle):
FD001:
Setting 1, 2, 3 are constant
Engines always operate at same conditions
FD002:
Six different condition combinations:
- Sea level, 60% throttle
- Sea level, 80% throttle
- Sea level, 100% throttle
- 35K ft, 60% throttle
- 35K ft, 80% throttle
- 35K ft, 100% throttle
Same engine at different conditions = different sensor readings!
Sensor s_2:
FD001 mean: 100.5
FD002 mean: 98.2
Difference: 2.3 units (2.3% shift)
This shift applies to all sensors!
Model trained on FD001 sees values like 100.5
Model tested on FD002 sees values like 98.2
Mismatch → Prediction errors
Solution: Domain adaptation techniques (TBD in advanced tasks)
1. Setup & Imports (device, seeds, libraries)
2. Configuration (hyperparameters, window size, RUL cap)
3. Data Loading & Domain Setup (FD001 + FD002)
4. ============= TASK 1: EDA =============
- Data loading & schema validation
- Sensor trend analysis (6 key sensors)
- Distribution analysis & correlation heatmap
- Degradation trajectories (multi-engine overlay)
- RUL label generation & visualization
- Domain comparison (FD001 vs FD002 shift analysis)
- EDA key insights summary
5. ============= TASK 2: MODEL DESIGN =============
- LSTM architecture justification (why LSTM > FF > LR)
- Design decisions (layers, hidden size, dropout)
- Model verification & forward pass test
- Data flow visualization
6. ============= TASK 3: PREPROCESSING =============
- Feature scaling (StandardScaler)
- Sequence windowing (window=30, stride=1)
- Train/validation split (80/20), with the test files held out
- Sanity checks
7. ============= TASK 4: TRAINING =============
- Training loop (MSE loss, Adam optimizer)
- Early stopping (patience=10) & LR scheduling
- Loss curves (training vs validation)
- Evaluation: RMSE & MAE on FD001 + FD002
- Predicted vs Actual scatter plots
8. ============= TASK 5: INTERPRETABILITY =============
- Permutation feature importance
- Residual analysis & error distribution
9. ============= TASK 6: DEPLOYMENT =============
- Deployment plan (edge/cloud/API)
- FastAPI REST service demo
- Maintenance scheduling (top 20 priority engines)
- Ethical considerations
10. ============= BONUS: SHAP XAI =============
- SHAP KernelExplainer (global + local explanations)
11. ============= BONUS: MC DROPOUT =============
- Uncertainty estimation (50 forward passes)
- 95% confidence intervals
12. ============= BONUS: DOMAIN ADAPTATION =============
- FD001 -> FD002 transfer analysis
- Mean-std feature alignment
13. ============= BONUS: DASHBOARD =============
- Interactive Plotly visualization
14. ============= BONUS: OPTIMIZATION =============
- Pruning (L1 unstructured)
- Dynamic quantization (int8)
- Optimization evaluation harness
15. Final Conclusions
After completing this project, you'll understand:
- ✅ Predictive maintenance use cases and real-world impact
- ✅ CMAPSS dataset structure and characteristics
- ✅ RUL concept and label generation
- ✅ LSTM architecture and why it's suitable for time-series
- ✅ Data preprocessing for sequence modeling
- ✅ Training, validation, and early stopping
- ✅ Model evaluation with domain shift considerations
- ✅ Interpretability and feature importance
- ✅ Deployment readiness assessment
- Dataset Paper: Saxena, A., Goebel, K., Simon, D., & Eklund, N. (2008). "Damage Propagation Modeling for Aircraft Engine Run-to-Failure Simulation"
- CMAPSS Homepage: https://www.nasa.gov/intelligent-systems-division/
- LSTM Paper: Hochreiter & Schmidhuber (1997). "Long Short-Term Memory"
- Start with visualization: Before coding, understand your data visually
- Check data quality: Missing values, outliers, scaling issues
- Validate domain differences: FD001 and FD002 are NOT interchangeable
- Monitor metrics carefully: RMSE alone isn't enough; plot predictions
- Document assumptions: Why window size = 30? Why RUL_CAP = 130?
- Test gradually: Build and test each component independently
The project includes a self-contained interactive dashboard (dashboard.html) built with HTML, CSS, and Chart.js. Simply open it in any browser; no server is required.
| Tab | Description |
|---|---|
| Overview | KPI cards (RMSE, MAE, parameters), scatter plot, error distribution, lifecycle histogram, model config |
| Predictions | Per-engine bar charts comparing predicted vs actual RUL for FD001 (100 engines) and FD002 (60 engines) |
| Uncertainty | MC Dropout confidence intervals (95% CI) for all test engines with actual values overlay |
| Feature Analysis | Horizontal bar chart of permutation feature importance, showing the RMSE increase when each feature is shuffled |
| Sensor Trends | Multi-line degradation curves for 6 key sensors across the longest-lived engine's full lifecycle |
| Maintenance Plan | Priority queue table of the 20 highest-risk engines with color-coded risk badges (CRITICAL/HIGH/MEDIUM/LOW) |
| Domain Adaptation | Side-by-side RMSE/MAE comparison between FD001 (in-domain) and FD002 (cross-domain) performance |
# Simply open in any browser
start dashboard.html

Q: Why are engine lifespans different?
A: Manufacturing variations, initial wear, operating history, and maintenance history all affect how long an engine lasts.

Q: Can we predict the exact failure cycle?
A: No, degradation is stochastic. We estimate RUL with uncertainty bounds.

Q: Why not use all 26 columns as features?
A: Unit_id and Cycle are indices, not features. We use columns 3-26 (settings + sensors).

Q: Is domain adaptation necessary?
A: For production deployment from FD001 → FD002, yes. Without it, performance degrades significantly.

Q: What's a good RMSE for RUL prediction?
A: On FD001, < 15 cycles is decent and < 10 is excellent; published state-of-the-art results are around 10.
This project now includes a runnable API demo for real-time integration.
- `api/rul_service.py`: FastAPI service that loads model + scaler from `artifacts/deployment/`
- `api/simulator.py`: Streams CMAPSS cycles as pseudo live sensor packets

- `GET /health`: service status
- `GET /schema/features`: expected feature list and payload shape
- `POST /ingest`: ingest one sensor packet and (if window ready) return prediction
- `GET /predict/latest?engine_id=<id>`: latest prediction for an engine
- `POST /predict/batch`: ingest/predict for multiple packets

D:\Downloads\miniconda3\envs\pythonenv\python.exe -m uvicorn api.rul_service:app --host 127.0.0.1 --port 8000

D:\Downloads\miniconda3\envs\pythonenv\python.exe api\simulator.py --fd FD001 --engine-id 1 --max-cycles 40 --sleep-sec 0.01

{
"engine_id": 34,
"timestamp": "2026-04-29T12:30:00Z",
"data": {
"setting_1": -0.0012,
"setting_2": 0.0003,
"setting_3": 100.0,
"s_1": 518.67,
"s_2": 641.90,
"s_3": 1585.4,
"s_4": 1400.2,
"s_5": 14.62,
"s_6": 21.61,
"s_7": 553.2,
"s_8": 2388.1,
"s_9": 9047.1,
"s_10": 1.3,
"s_11": 47.3,
"s_12": 521.0,
"s_13": 2388.2,
"s_14": 8125.2,
"s_15": 8.40,
"s_16": 0.03,
"s_17": 392,
"s_18": 2388,
"s_19": 100.0,
"s_20": 38.98,
"s_21": 23.35
}
}

- The API keeps a rolling 30-cycle buffer per engine.
- Predictions start after the buffer is full (`status=warming_up` before that).
- Output includes uncertainty (`uncertainty_std`) using MC Dropout and a recommended maintenance action.
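The rolling-buffer behavior described above can be illustrated with a `deque` per engine. This is a sketch of the idea only, not the code in `api/rul_service.py`. The class name, the `cycles_buffered` field and the `ready` status are hypothetical; only `warming_up` and the 30-cycle window come from the text.

```python
from collections import defaultdict, deque

WINDOW = 30  # buffer length from the text above


class RollingBuffers:
    """Keep the most recent WINDOW feature vectors for each engine."""

    def __init__(self, window=WINDOW):
        self.window = window
        # maxlen makes the deque drop the oldest packet automatically.
        self._buffers = defaultdict(lambda: deque(maxlen=window))

    def ingest(self, engine_id, features):
        buf = self._buffers[engine_id]
        buf.append(list(features))
        if len(buf) < self.window:
            return {"status": "warming_up", "cycles_buffered": len(buf)}
        # A real service would scale this window and run the model here.
        return {"status": "ready", "window": list(buf)}


buffers = RollingBuffers()
statuses = [
    buffers.ingest(34, [float(cycle)] * 24)["status"] for cycle in range(1, 32)
]
latest = buffers.ingest(34, [32.0] * 24)
print(statuses[28], statuses[29], len(latest["window"]))
```

The first 29 packets return `warming_up`. From the 30th packet on, every call has a full window available, and older cycles fall out of the buffer as new ones arrive.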
Last Updated: April 29, 2026
Author: Piyush
Status: All Tasks Complete (Tasks 1–6 + All Bonus Tasks) ✅
This project includes lightweight helpers and guidance for compressing the trained LSTM model used for RUL inference. Use these to reduce disk size and improve CPU inference latency.
- Pruning (unstructured L1)
  - What: Remove a fraction of smallest-magnitude weights from `nn.Linear`/`nn.Conv2d` layers.
  - When to use: Quick parameter reduction when you need smaller checkpoints. Expect to fine-tune afterwards for recovery.
  - How to run (notebook cells): open `Predictive_Maintenance_RUL_LSTM.ipynb` and run the "Pruning helper" cell. Example usage is commented in the cell.
- Dynamic Quantization
  - What: Convert selected layers (LSTM, Linear) to use int8 weights at inference using PyTorch `quantize_dynamic`.
  - Benefit: Smaller on-disk model and faster CPU inference with minimal accuracy loss in many cases.
  - How to run: run the "Dynamic quantization helper" cell in the notebook. Example usage is commented in the cell.
- Smoke Tests
  - A smoke-test cell tries to `torch.load` `artifacts/deployment/rul_lstm_fd001.pt` and run a dummy forward pass if the file contains a full `nn.Module`. If the file is a `state_dict`, load it into your model class instead before testing.
- Recommended workflow
  - Save a copy of the original model: `rul_lstm_fd001_orig.pt`
  - Run pruning with `amount=0.1..0.3`, then evaluate validation RMSE/MAE.
  - If pruning reduces accuracy beyond acceptable limits, fine-tune the pruned model for a few epochs.
  - Apply dynamic quantization to the (fine-tuned) model and re-evaluate.
  - Keep both `*_pruned.pt` and `*_quantized.pt` for comparison.
- Commands (examples)
# Prune (run in python REPL or notebook cell)
# from the notebook: uncomment example_prune(...) call and run the cell
# Quantize (run in python REPL or notebook cell)
# from the notebook: uncomment example_quantize(...) call and run the cell
# Evaluate baseline vs optimized models (FD001 + FD002 combined RMSE/MAE comparison)
D:\Downloads\miniconda3\envs\pythonenv\python.exe tools\evaluate_optimizations.py --subsets FD001 FD002 --baseline artifacts\deployment\rul_lstm_fd001.pt --pruned artifacts\deployment\rul_lstm_fd001_pruned.pt --quantized artifacts\deployment\rul_lstm_fd001_quantized.pt

Evaluation output:
- Console table with RMSE/MAE and deltas versus baseline
- CSV file (long format): `artifacts/deployment/optimization_eval_metrics.csv`
- CSV file (combined summary): `artifacts/deployment/optimization_eval_summary.csv`
- Comparison chart (before vs after): `artifacts/deployment/optimization_eval_comparison.png`
Notes:
- If pruned/quantized files do not exist yet, omit those flags; baseline-only evaluation still works.
- The evaluator supports both saved `state_dict` checkpoints and full `nn.Module` objects.
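
For readers who want to see the pruning step outside the notebook, the following sketch applies L1 unstructured pruning to `nn.Linear` layers with `torch.nn.utils.prune`. It is not the notebook's "Pruning helper" cell. The function names and the small stand-in network are assumptions for illustration.

```python
import torch
from torch import nn
from torch.nn.utils import prune


def prune_linear_layers(model, amount=0.2):
    """Zero the smallest-magnitude weights in every nn.Linear layer, in place."""
    for module in model.modules():
        if isinstance(module, nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=amount)
            prune.remove(module, "weight")  # bake the mask into the weight tensor
    return model


def linear_sparsity(model):
    """Fraction of nn.Linear weights that are exactly zero."""
    weights = [m.weight for m in model.modules() if isinstance(m, nn.Linear)]
    zeros = sum(int((w == 0).sum()) for w in weights)
    total = sum(w.numel() for w in weights)
    return zeros / total


torch.manual_seed(0)
toy = nn.Sequential(nn.Linear(96, 32), nn.ReLU(), nn.Linear(32, 1))  # stand-in model
before = linear_sparsity(toy)
prune_linear_layers(toy, amount=0.2)
after = linear_sparsity(toy)
print(f"{before:.2f} -> {after:.2f}")
```

After pruning, validation RMSE/MAE should be re-checked as described in the recommended workflow, since removing weights can reduce accuracy until the model is fine-tuned.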