Himawari AHI-L1b Target Data Processor

A Python toolkit for processing Himawari-8/9 AHI (Advanced Himawari Imager) Level-1b satellite data in HSD (Himawari Standard Data) format. Designed for machine learning and deep learning applications.

Overview

This project processes compressed .DAT.bz2 files from JMA's Himawari-8/9 satellites and produces:

  • PNG images - 16-bit grayscale images suitable for ML training
  • JSON metadata - Complete observation metadata for each image
  • Dataset index - Combined index file for easy dataset loading

Features

  • ✅ Uses satpy library for proper HSD format parsing (with fallback raw reader)
  • ✅ Extracts calibrated brightness temperature (IR) / reflectance (VIS/NIR)
  • ✅ Preserves 16-bit dynamic range for scientific applications
  • ✅ Generates comprehensive metadata for each observation
  • ✅ PyTorch Dataset loader included for immediate ML use
  • ✅ Supports temporal sequence datasets for RNN/LSTM/Transformer models

Data Format

Input: Himawari HSD Format

Files follow the naming convention:

HS_Hnn_YYYYMMDD_HHMM_Bnn_Rxxx_Rnn_Skkll.DAT.bz2

Component   Description
HS          Himawari Standard Data
Hnn         Satellite (H08 = Himawari-8, H09 = Himawari-9)
YYYYMMDD    Observation date (UTC)
HHMM        Observation time (UTC)
Bnn         Band number (B01-B16)
Rxxx        Region (R301-R304 = Target area)
Rnn         Resolution (R05 = 0.5 km, R10 = 1 km, R20 = 2 km)
Skkll       Segment number kk out of ll total segments (S0101 = segment 1 of 1)
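
As an illustration of the naming convention, the fields can be pulled out of a filename with a regular expression. This is a minimal sketch, not the parser used by himawari_processor.py: the function name `parse_hsd_filename` and the returned keys are assumptions (the keys loosely mirror the metadata JSON shown later), and the four digits after `S` are read as segment number followed by total segments, as in JMA's HSD naming scheme.

```python
import re
from datetime import datetime, timezone

# Matches names such as HS_H08_20251012_0000_B13_R301_R20_S0101.DAT.bz2
HSD_NAME = re.compile(
    r"HS_H(?P<sat>\d{2})_(?P<date>\d{8})_(?P<time>\d{4})"
    r"_B(?P<band>\d{2})_(?P<region>[A-Z0-9]{4})_R(?P<res>\d{2})"
    r"_S(?P<seg>\d{2})(?P<total>\d{2})\.DAT(?:\.bz2)?$"
)


def parse_hsd_filename(name):
    """Split an HSD filename (or path) into its components (hypothetical helper)."""
    m = HSD_NAME.search(name)
    if m is None:
        raise ValueError(f"not an HSD filename: {name!r}")
    when = datetime.strptime(m["date"] + m["time"], "%Y%m%d%H%M")
    return {
        "satellite": f"Himawari-{int(m['sat'])}",
        "observation_time_utc": when.replace(tzinfo=timezone.utc),
        "band_number": int(m["band"]),
        "region": m["region"],
        "resolution_km": int(m["res"]) / 10,  # R05 -> 0.5, R20 -> 2.0
        "segment_number": int(m["seg"]),
        "total_segments": int(m["total"]),
    }
```

For example, `parse_hsd_filename("HS_H08_20251012_0000_B13_R301_R20_S0101.DAT.bz2")["band_number"]` gives 13.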

AHI Band Specifications

Band  Wavelength (μm)  Type  Name               Resolution
1     0.47             VIS   Blue               1 km
2     0.51             VIS   Green              1 km
3     0.64             VIS   Red                0.5 km
4     0.86             NIR   Vegetation         1 km
5     1.6              NIR   Snow/Ice           2 km
6     2.3              NIR   Cloud Particle     2 km
7     3.9              IR    Shortwave IR       2 km
8     6.2              IR    Upper Water Vapor  2 km
9     6.9              IR    Mid Water Vapor    2 km
10    7.3              IR    Lower Water Vapor  2 km
11    8.6              IR    Cloud Top Phase    2 km
12    9.6              IR    Ozone              2 km
13    10.4             IR    Clean IR           2 km
14    11.2             IR    IR                 2 km
15    12.4             IR    Dirty IR           2 km
16    13.3             IR    CO2                2 km
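
The resolution column ties back to the Rnn field of the filename. The sketch below copies the resolutions from the table and derives the code a file for a given band would be expected to carry. `BAND_RESOLUTION_KM` and `resolution_code` are illustrative names, not part of the toolkit.

```python
# Native resolution in km for each AHI band, copied from the table above.
BAND_RESOLUTION_KM = {1: 1.0, 2: 1.0, 3: 0.5, 4: 1.0}
BAND_RESOLUTION_KM.update({band: 2.0 for band in range(5, 17)})


def resolution_code(band):
    """Return the Rnn filename code for a band, e.g. 'R20' for 2 km."""
    km = BAND_RESOLUTION_KM[band]
    return f"R{int(km * 10):02d}"
```

A check like `resolution_code(13) == "R20"` can flag files whose band and resolution fields disagree.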

Installation

1. Create Virtual Environment

cd /path/to/chromosome
python -m venv .venv
source .venv/bin/activate  # Linux/Mac
# or
.venv\Scripts\activate  # Windows

2. Install Dependencies

pip install -r requirements.txt

Dependencies

satpy>=0.47.0        # Satellite data processing (HSD reader)
pyresample>=1.28.0   # Geographic resampling
numpy>=1.24.0        # Numerical computing
Pillow>=10.0.0       # Image processing
tqdm>=4.65.0         # Progress bars
h5py>=3.9.0          # HDF5 support
netCDF4>=1.6.0       # NetCDF support
dask>=2023.1.0       # Parallel computing
xarray>=2023.1.0     # Labeled arrays
trollimage>=1.20.0   # Satellite image handling

Usage

Quick Start

Process all data files:

python process_all.py

Command Line Options

# Process with 8-bit output (smaller files)
python process_all.py --bit-depth 8

# Process only 10 random files (for testing)
python process_all.py --sample 10

# Custom input/output directories
python process_all.py -i ./Target_20251012 -o ./processed_output

# Skip dataset index creation
python process_all.py --no-index
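
For readers who want to wrap or extend the script, the options above could be declared with argparse roughly as follows. This is a sketch only: the flags are the ones listed above, but the `dest` names, defaults, and help strings are assumptions and may differ from what process_all.py actually does.

```python
import argparse


def build_parser():
    """Declare the documented flags (defaults and dest names are assumed)."""
    parser = argparse.ArgumentParser(description="Batch-process Himawari HSD files")
    parser.add_argument("-i", dest="input_dir", default="./Target_20251012",
                        help="directory containing .DAT.bz2 files")
    parser.add_argument("-o", dest="output_dir", default="./output",
                        help="directory for PNG and JSON output")
    parser.add_argument("--bit-depth", type=int, choices=(8, 16), default=16,
                        help="bit depth of the output PNG")
    parser.add_argument("--sample", type=int, default=None,
                        help="process only this many randomly chosen files")
    parser.add_argument("--no-index", action="store_true",
                        help="skip dataset index creation")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["--bit-depth", "8", "--sample", "10"])
```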

Direct Module Usage

from himawari_processor import process_single_file, process_directory

# Process a single file
image_path, metadata_path = process_single_file(
    "Target_20251012/0000/HS_H08_20251012_0000_B13_R301_R20_S0101.DAT.bz2",
    output_image_dir="./output/images",
    output_metadata_dir="./output/metadata",
    bit_depth=16
)

# Process entire directory
results = process_directory(
    input_dir="./Target_20251012",
    output_image_dir="./output/images",
    output_metadata_dir="./output/metadata",
    recursive=True
)

Output Structure

output/
├── images/
│   ├── HS_H08_20251012_0000_B13_R301_R20_S0101.png
│   ├── HS_H08_20251012_0000_B13_R302_R20_S0101.png
│   └── ...
└── metadata/
    ├── HS_H08_20251012_0000_B13_R301_R20_S0101.json
    ├── HS_H08_20251012_0000_B13_R302_R20_S0101.json
    ├── dataset_index.json
    └── ...
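
Each input file maps to one PNG and one JSON file that share its base name. A small sketch of that mapping, assuming the `.DAT.bz2` suffix is simply stripped (the helper name `output_paths` is hypothetical):

```python
from pathlib import Path


def output_paths(input_file, image_dir, metadata_dir):
    """Return (png_path, json_path) for an HSD input file."""
    name = Path(input_file).name
    # Strip ".bz2" first, then ".DAT", leaving the bare observation name.
    for suffix in (".bz2", ".DAT"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
    return Path(image_dir) / f"{name}.png", Path(metadata_dir) / f"{name}.json"
```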

Metadata JSON Format

{
  "filename": "HS_H08_20251012_0000_B13_R301_R20_S0101.DAT.bz2",
  "satellite": "Himawari-8",
  "observation_time": "2025-10-12 00:00:00",
  "observation_time_utc": "2025-10-12T00:00:00Z",
  "timeline": "0000",
  "band_number": 13,
  "band_wavelength_um": 10.4,
  "band_type": "IR",
  "segment_number": 1,
  "total_segments": 1,
  "resolution_km": 2.0,
  "image_width": 500,
  "image_height": 500,
  "calibration_type": "brightness_temperature",
  "data_unit": "K",
  "valid_pixel_count": 245000,
  "min_value": 210.5,
  "max_value": 305.2,
  "mean_value": 275.3,
  "std_value": 15.7,
  "min_lat": 20.5,
  "max_lat": 45.2,
  "min_lon": 120.0,
  "max_lon": 150.5
}
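
The metadata fields are enough for simple quality filtering before training. The sketch below computes the fraction of valid pixels from a record shaped like the example above. The function names and the 0.9 threshold are illustrative assumptions, not toolkit behaviour.

```python
import json


def valid_fraction(meta):
    """Share of pixels in the image that carry valid data."""
    total = meta["image_width"] * meta["image_height"]
    return meta["valid_pixel_count"] / total


def is_usable(meta, min_valid=0.9):
    """Keep observations whose valid-pixel share reaches the threshold."""
    return valid_fraction(meta) >= min_valid


# A trimmed-down record using values from the example above.
record = json.loads(
    '{"image_width": 500, "image_height": 500, "valid_pixel_count": 245000}'
)
```

With the example values, 245000 of 250000 pixels are valid, so the record passes a 0.9 threshold.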

ML/DL Integration

PyTorch Dataset

from ml_dataset import HimawariDataset, create_train_val_split
from torch.utils.data import DataLoader

# Load dataset
dataset = HimawariDataset(
    data_dir="./output",
    bands=[13, 14, 15],  # Filter specific bands
    timelines=["0000", "0600", "1200", "1800"]  # Filter times
)

# Create train/val split (time-based to avoid leakage)
train_dataset, val_dataset = create_train_val_split(dataset, val_ratio=0.2)

# Create dataloaders
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)

# Training loop (`model` is your own torch.nn.Module)
for images, metadata in train_loader:
    # images: (batch, 1, H, W) tensor
    # metadata: list of dicts with observation info
    predictions = model(images)
    ...
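
The comment about a time-based split deserves a concrete picture: if frames from the same observation time land in both training and validation sets, near-duplicate images leak across the split. The internals of `create_train_val_split` are not shown here, so the following is only a sketch of one way to do it, holding out the latest observation times. `time_based_split` and its arguments are hypothetical.

```python
import math


def time_based_split(records, val_ratio=0.2, key="observation_time_utc"):
    """Return (train_indices, val_indices) with no observation time shared."""
    # ISO 8601 UTC strings sort chronologically, so plain sorting suffices.
    times = sorted({r[key] for r in records})
    n_val = max(1, math.ceil(len(times) * val_ratio)) if times else 0
    val_times = set(times[len(times) - n_val:])
    train = [i for i, r in enumerate(records) if r[key] not in val_times]
    val = [i for i, r in enumerate(records) if r[key] in val_times]
    return train, val
```

The returned index lists can be passed to `torch.utils.data.Subset` to build the two datasets.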

Temporal Sequence Dataset

For time-series models (LSTM, Transformer, etc.):

from torch.utils.data import DataLoader
from ml_dataset import HimawariSequenceDataset

# Create sequences of 6 timesteps
dataset = HimawariSequenceDataset(
    data_dir="./output",
    sequence_length=6,
    stride=1,
    band=13
)

for sequence, metadata_list in DataLoader(dataset, batch_size=8):
    # sequence: (batch, seq_len, 1, H, W) tensor
    # metadata_list: list of metadata for each timestep
    predictions = temporal_model(sequence)
    ...
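
To make `sequence_length` and `stride` concrete, the sketch below shows how sliding windows over a time-ordered list of frames might be indexed. It assumes the usual sliding-window semantics; the real `HimawariSequenceDataset` may differ, for instance in how it handles gaps between timelines.

```python
def sequence_windows(n_frames, sequence_length=6, stride=1):
    """Return lists of frame indices, one list per sequence."""
    if sequence_length <= 0 or stride <= 0:
        raise ValueError("sequence_length and stride must be positive")
    last_start = n_frames - sequence_length
    return [
        list(range(start, start + sequence_length))
        for start in range(0, last_start + 1, stride)
    ]
```

With 8 frames, a length of 6, and a stride of 1, this yields three overlapping sequences starting at frames 0, 1, and 2.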

Custom Transforms

import torchvision.transforms as T
from ml_dataset import HimawariDataset

transform = T.Compose([
    T.Resize((256, 256)),
    T.RandomHorizontalFlip(),
    T.Normalize(mean=[0.5], std=[0.5])
])

dataset = HimawariDataset(
    data_dir="./output",
    transform=transform
)
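
`T.Normalize(mean=[0.5], std=[0.5])` computes `(x - 0.5) / 0.5`, which maps inputs in [0, 1] to [-1, 1]. The NumPy sketch below spells out that arithmetic for 16-bit pixel values. Whether `HimawariDataset` rescales pixels to [0, 1] before applying the transform is an assumption here, and both helper names are hypothetical.

```python
import numpy as np


def to_unit_float(pixels):
    """Scale integer pixel values to floats in [0, 1] based on the dtype."""
    max_value = np.iinfo(pixels.dtype).max
    return pixels.astype(np.float32) / max_value


def normalize(x, mean=0.5, std=0.5):
    """Same arithmetic as a single-channel Normalize transform."""
    return (x - mean) / std


pixels = np.array([[0, 65535]], dtype=np.uint16)
scaled = normalize(to_unit_float(pixels))
```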

Project Structure

chromosome/
├── himawari_processor.py  # Core HSD processing module
├── process_all.py         # Main batch processing script
├── ml_dataset.py          # PyTorch dataset loaders
├── requirements.txt       # Python dependencies
├── README.md              # This file
├── Target_20251012/       # Input data (HSD files)
│   ├── 0000/
│   ├── 0010/
│   └── ...
└── output/                # Processed output
    ├── images/
    └── metadata/

Technical Notes

HSD Format

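The Himawari Standard Data (HSD) format is JMA's binary format for AHI observations. Each file has:

  • 11 header blocks containing satellite, calibration, and navigation data, followed by a data block
  • 16-bit unsigned integer image data (raw sensor counts)
  • bzip2 compression applied to the distributed file (.DAT.bz2)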

This toolkit uses satpy for robust parsing, with a fallback raw reader for edge cases.
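
The two steps, decompressing and then parsing, can be sketched as follows. `decompress_hsd` uses only the standard library. `load_band` uses satpy's `ahi_hsd` reader, whose datasets are named `B01` to `B16`, and imports satpy lazily so the first helper works without it. Both function names are illustrative, and this is not necessarily how himawari_processor.py calls satpy.

```python
import bz2
import shutil
from pathlib import Path


def decompress_hsd(src, out_dir):
    """Write the decompressed .DAT file into out_dir and return its path."""
    src = Path(src)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    dst = out_dir / src.with_suffix("").name  # drops the trailing ".bz2"
    with bz2.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst


def load_band(path, band):
    """Load one calibrated band as an xarray.DataArray (requires satpy)."""
    from satpy import Scene  # imported lazily; only needed for this helper

    name = f"B{band:02d}"
    scene = Scene(filenames=[str(path)], reader="ahi_hsd")
    scene.load([name])
    return scene[name]
```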

Calibration

  • IR bands (7-16): Converted to brightness temperature in Kelvin
  • VIS/NIR bands (1-6): Converted to reflectance percentage
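
The band-to-calibration rule above is a simple range check. In the sketch below, the `brightness_temperature` and `K` strings match the metadata example, while the `reflectance` and `%` strings for bands 1-6 are assumed by analogy. `calibration_for_band` is a hypothetical helper.

```python
def calibration_for_band(band):
    """Return the calibration type and unit expected for an AHI band."""
    if not 1 <= band <= 16:
        raise ValueError(f"AHI has bands 1-16, got {band}")
    if band >= 7:
        # Infrared bands are expressed as brightness temperature.
        return {"calibration_type": "brightness_temperature", "data_unit": "K"}
    # Visible and near-infrared bands are expressed as reflectance.
    return {"calibration_type": "reflectance", "data_unit": "%"}
```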

Image Processing

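  • Images are normalized using 1st-99th percentile scaling; values outside that range saturate at the ends of the output range
  • IR images are inverted so that cold cloud tops appear bright
  • 16-bit output keeps finer gradations than 8-bit within the scaled range, which matters for scientific applications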
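
The three steps above can be sketched with NumPy as follows. This is an illustration under stated assumptions rather than the toolkit's code: the function name `to_image`, its parameters, and the choice to map missing (non-finite) pixels to 0 are all hypothetical.

```python
import numpy as np


def to_image(data, is_ir, bit_depth=16, low=1.0, high=99.0):
    """Percentile-scale calibrated values to an unsigned integer image."""
    dtype = np.uint16 if bit_depth == 16 else np.uint8
    valid = np.isfinite(data)
    if not valid.any():
        return np.zeros(data.shape, dtype=dtype)
    lo, hi = np.percentile(data[valid], [low, high])
    # Scale to [0, 1]; values beyond the percentile bounds saturate.
    scaled = np.clip((data - lo) / max(hi - lo, 1e-12), 0.0, 1.0)
    if is_ir:
        scaled = 1.0 - scaled  # cold (low temperature) pixels become bright
    scaled = np.where(valid, scaled, 0.0)  # assumed fill for missing pixels
    max_out = 2 ** bit_depth - 1
    return np.round(scaled * max_out).astype(dtype)
```

For an IR band, the coldest pixels end up at the top of the output range and the warmest at 0.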

License

See LICENSE file.
