A Python toolkit for processing Himawari-8/9 AHI (Advanced Himawari Imager) Level-1b satellite data in HSD (Himawari Standard Data) format. Designed for machine learning and deep learning applications.
This project processes compressed .DAT.bz2 files from JMA's Himawari satellite, extracting:
- PNG images - 16-bit grayscale images suitable for ML training
- JSON metadata - Complete observation metadata for each image
- Dataset index - Combined index file for easy dataset loading
- ✅ Uses satpy library for proper HSD format parsing (with fallback raw reader)
- ✅ Extracts calibrated brightness temperature (IR) / reflectance (VIS/NIR)
- ✅ Preserves 16-bit dynamic range for scientific applications
- ✅ Generates comprehensive metadata for each observation
- ✅ PyTorch Dataset loader included for immediate ML use
- ✅ Supports temporal sequence datasets for RNN/LSTM/Transformer models
Files follow the naming convention:
```text
HS_H08_YYYYMMDD_HHMM_Bnn_Rxxx_Rnn_Snnnn.DAT.bz2
```
| Component | Description |
|---|---|
| HS | Himawari Standard |
| H08/H09 | Himawari-8 or Himawari-9 |
| YYYYMMDD | Observation date |
| HHMM | Observation time (UTC) |
| Bnn | Band number (01-16) |
| Rxxx | Region (R301-R304 = Target area) |
| Rnn | Resolution (R05=0.5km, R10=1km, R20=2km) |
| Snnnn | Segment number |
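For programmatic use, these fields can be pulled out of a filename with a small parser. The following is a minimal sketch using Python's standard `re` module against the exact pattern above; `HSD_PATTERN` and `parse_hsd_filename` are illustrative helpers, not part of the toolkit:

```python
import re
from pathlib import Path

# Matches the HSD naming convention described above.
HSD_PATTERN = re.compile(
    r"HS_(?P<satellite>H0[89])_(?P<date>\d{8})_(?P<time>\d{4})"
    r"_B(?P<band>\d{2})_(?P<region>R\d{3})_R(?P<resolution>\d{2})"
    r"_S(?P<segment>\d{4})\.DAT(\.bz2)?$"
)

def parse_hsd_filename(path):
    """Return the naming-convention fields of an HSD file as a dict."""
    match = HSD_PATTERN.match(Path(path).name)
    if match is None:
        raise ValueError(f"Not an HSD filename: {path}")
    return match.groupdict()

# parse_hsd_filename("HS_H08_20251012_0000_B13_R301_R20_S0101.DAT.bz2")
# -> {'satellite': 'H08', 'date': '20251012', 'time': '0000',
#     'band': '13', 'region': 'R301', 'resolution': '20', 'segment': '0101'}
```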
The 16 AHI spectral bands:

| Band | Wavelength (μm) | Type | Name | Resolution |
|---|---|---|---|---|
| 1 | 0.47 | VIS | Blue | 1 km |
| 2 | 0.51 | VIS | Green | 1 km |
| 3 | 0.64 | VIS | Red | 0.5 km |
| 4 | 0.86 | NIR | Vegetation | 1 km |
| 5 | 1.6 | NIR | Snow/Ice | 2 km |
| 6 | 2.3 | NIR | Cloud Particle | 2 km |
| 7 | 3.9 | IR | Shortwave IR | 2 km |
| 8 | 6.2 | IR | Upper Water Vapor | 2 km |
| 9 | 6.9 | IR | Mid Water Vapor | 2 km |
| 10 | 7.3 | IR | Lower Water Vapor | 2 km |
| 11 | 8.6 | IR | Cloud Top Phase | 2 km |
| 12 | 9.6 | IR | Ozone | 2 km |
| 13 | 10.4 | IR | Clean IR | 2 km |
| 14 | 11.2 | IR | IR | 2 km |
| 15 | 12.4 | IR | Dirty IR | 2 km |
| 16 | 13.3 | IR | CO2 | 2 km |
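When filtering bands in code, the table above can be condensed into a simple grouping. This is an illustrative sketch, not an API provided by the toolkit:

```python
# Illustrative grouping of the AHI bands listed above.
VIS_BANDS = [1, 2, 3]            # reflectance, 0.5-1 km
NIR_BANDS = [4, 5, 6]            # reflectance, 1-2 km
IR_BANDS = list(range(7, 17))    # brightness temperature (K), 2 km

def is_ir_band(band: int) -> bool:
    """True for bands calibrated to brightness temperature."""
    return band in IR_BANDS
```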
Set up a Python environment and install the dependencies:

```bash
cd /path/to/chromosome
python -m venv .venv
source .venv/bin/activate   # Linux/Mac
# or
.venv\Scripts\activate      # Windows

pip install -r requirements.txt
```

The main packages pulled in by requirements.txt:

```text
satpy>=0.47.0        # Satellite data processing (HSD reader)
pyresample>=1.28.0   # Geographic resampling
numpy>=1.24.0        # Numerical computing
Pillow>=10.0.0       # Image processing
tqdm>=4.65.0         # Progress bars
h5py>=3.9.0          # HDF5 support
netCDF4>=1.6.0       # NetCDF support
dask>=2023.1.0       # Parallel computing
xarray>=2023.1.0     # Labeled arrays
trollimage>=1.20.0   # Satellite image handling
```
Process all data files:

```bash
python process_all.py

# Process with 8-bit output (smaller files)
python process_all.py --bit-depth 8

# Process only 10 random files (for testing)
python process_all.py --sample 10

# Custom input/output directories
python process_all.py -i ./Target_20251012 -o ./processed_output

# Skip dataset index creation
python process_all.py --no-index
```

Or call the processing functions directly from Python:

```python
from himawari_processor import process_single_file, process_directory

# Process a single file
image_path, metadata_path = process_single_file(
    "Target_20251012/0000/HS_H08_20251012_0000_B13_R301_R20_S0101.DAT.bz2",
    output_image_dir="./output/images",
    output_metadata_dir="./output/metadata",
    bit_depth=16
)

# Process entire directory
results = process_directory(
    input_dir="./Target_20251012",
    output_image_dir="./output/images",
    output_metadata_dir="./output/metadata",
    recursive=True
)
```

Output structure:

```text
output/
├── images/
│   ├── HS_H08_20251012_0000_B13_R301_R20_S0101.png
│   ├── HS_H08_20251012_0000_B13_R302_R20_S0101.png
│   └── ...
└── metadata/
    ├── HS_H08_20251012_0000_B13_R301_R20_S0101.json
    ├── HS_H08_20251012_0000_B13_R302_R20_S0101.json
    ├── dataset_index.json
    └── ...
```
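Once processing has run, an image/metadata pair can be loaded directly. A minimal sketch, assuming the matching file stems shown above (Pillow may return the 16-bit PNG as `uint16` or `int32` depending on version):

```python
import json
from pathlib import Path

import numpy as np
from PIL import Image

out = Path("./output")
stem = "HS_H08_20251012_0000_B13_R301_R20_S0101"

# 16-bit grayscale PNG -> NumPy array
img = np.array(Image.open(out / "images" / f"{stem}.png"))

# Matching per-image metadata record
meta = json.loads((out / "metadata" / f"{stem}.json").read_text())

print(img.shape, img.dtype)
print(meta["band_number"], meta["calibration_type"], meta["data_unit"])
```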
Each processed image gets a JSON metadata file like this:

```json
{
  "filename": "HS_H08_20251012_0000_B13_R301_R20_S0101.DAT.bz2",
  "satellite": "Himawari-8",
  "observation_time": "2025-10-12 00:00:00",
  "observation_time_utc": "2025-10-12T00:00:00Z",
  "timeline": "0000",
  "band_number": 13,
  "band_wavelength_um": 10.4,
  "band_type": "IR",
  "segment_number": 1,
  "total_segments": 1,
  "resolution_km": 2.0,
  "image_width": 500,
  "image_height": 500,
  "calibration_type": "brightness_temperature",
  "data_unit": "K",
  "valid_pixel_count": 245000,
  "min_value": 210.5,
  "max_value": 305.2,
  "mean_value": 275.3,
  "std_value": 15.7,
  "min_lat": 20.5,
  "max_lat": 45.2,
  "min_lon": 120.0,
  "max_lon": 150.5
}
```

Loading the processed data in PyTorch:

```python
from ml_dataset import HimawariDataset, create_train_val_split
from torch.utils.data import DataLoader

# Load dataset
dataset = HimawariDataset(
    data_dir="./output",
    bands=[13, 14, 15],                          # Filter specific bands
    timelines=["0000", "0600", "1200", "1800"]   # Filter times
)

# Create train/val split (time-based to avoid leakage)
train_dataset, val_dataset = create_train_val_split(dataset, val_ratio=0.2)

# Create dataloaders
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)

# Training loop
for images, metadata in train_loader:
    # images: (batch, 1, H, W) tensor
    # metadata: list of dicts with observation info
    predictions = model(images)
    ...
```

For time-series models (LSTM, Transformer, etc.):
```python
from ml_dataset import HimawariSequenceDataset

# Create sequences of 6 timesteps
dataset = HimawariSequenceDataset(
    data_dir="./output",
    sequence_length=6,
    stride=1,
    band=13
)

for sequence, metadata_list in DataLoader(dataset, batch_size=8):
    # sequence: (batch, seq_len, 1, H, W) tensor
    # metadata_list: list of metadata for each timestep
    predictions = temporal_model(sequence)
    ...
```

Standard torchvision transforms can be passed via the `transform` argument:

```python
import torchvision.transforms as T

transform = T.Compose([
    T.Resize((256, 256)),
    T.RandomHorizontalFlip(),
    T.Normalize(mean=[0.5], std=[0.5])
])

dataset = HimawariDataset(
    data_dir="./output",
    transform=transform
)
```

Project layout:

```text
chromosome/
├── himawari_processor.py      # Core HSD processing module
├── process_all.py             # Main batch processing script
├── ml_dataset.py              # PyTorch dataset loaders
├── requirements.txt           # Python dependencies
├── README.md                  # This file
├── Target_20251012/           # Input data (HSD files)
│   ├── 0000/
│   ├── 0010/
│   └── ...
└── output/                    # Processed output
    ├── images/
    └── metadata/
```
The Himawari Standard Data (HSD) format is JMA's proprietary binary format with:
- 11 header blocks containing satellite, calibration, and navigation data
- 16-bit unsigned integer image data
- bzip2 compression
This toolkit uses satpy for robust parsing, with a fallback raw reader for edge cases.
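For reference, reading a single segment with satpy directly looks roughly like the sketch below. This is a simplified illustration, not the toolkit's exact code path; it decompresses the segment manually before handing it to satpy's `ahi_hsd` reader:

```python
# Simplified illustration: decompress one .DAT.bz2 segment, then read it
# with satpy's "ahi_hsd" reader.
import bz2
from pathlib import Path

import numpy as np
from satpy import Scene

src = Path("Target_20251012/0000/HS_H08_20251012_0000_B13_R301_R20_S0101.DAT.bz2")
dat = src.with_suffix("")                    # "<...>.DAT.bz2" -> "<...>.DAT"
dat.write_bytes(bz2.decompress(src.read_bytes()))

scn = Scene(filenames=[str(dat)], reader="ahi_hsd")
scn.load(["B13"])                            # AHI bands are named B01..B16 in satpy
bt = scn["B13"].values                       # calibrated brightness temperature (K) for IR bands
print(bt.shape, np.nanmin(bt), np.nanmax(bt))
```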
- IR bands (7-16): Converted to brightness temperature in Kelvin
- VIS/NIR bands (1-6): Converted to reflectance percentage
- Images are normalized using 1st-99th percentile scaling (see the sketch after this list)
- IR images are inverted so clouds appear bright
- 16-bit output preserves full dynamic range for scientific applications
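A minimal sketch of that scaling step, assuming a calibrated float array as input; `to_png_array` is an illustrative helper, and the toolkit's own implementation may differ in detail:

```python
import numpy as np

def to_png_array(data: np.ndarray, band: int, bit_depth: int = 16) -> np.ndarray:
    """Illustrative 1st-99th percentile scaling with IR inversion (clouds bright)."""
    lo, hi = np.nanpercentile(data, [1, 99])
    scaled = np.clip((data - lo) / max(hi - lo, 1e-6), 0.0, 1.0)
    if 7 <= band <= 16:              # IR bands: cold (cloudy) pixels -> bright
        scaled = 1.0 - scaled
    max_val = (1 << bit_depth) - 1   # 65535 for 16-bit, 255 for 8-bit
    dtype = np.uint16 if bit_depth == 16 else np.uint8
    return np.nan_to_num(scaled * max_val).astype(dtype)
```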
License: see the LICENSE file.