MobileNetV2 (torchvision) Β· YOLOv8n Β· TinyChangeUNet (MobileNetV3 encoder)
Pipeline for indoor-space imagery: scene classification → space item detection → change detection.
- Folder structure
- A) Scene Classification
- B) Space Item Detection
- C) Change Detection (before/after)
- Reproducibility
A) Scene Classification (ImageFolder)

```
space_cls/
  train/<class>/*.jpg|png
  val/<class>/*.jpg|png
  test/<class>/*.jpg|png
```
B) Space Item Detection (YOLO)

```
space_data/
  images/{train,val,test}/*.jpg|png
  labels/{train,val,test}/*.txt   # YOLO: cls cx cy w h
  space.yaml
```
C) Change Detection (synthetic before/after/GT)

```
pairs_out_cd/
  train/{before_images,after_images,labels}
  val/{before_images,after_images,labels}
  test/{before_images,after_images,labels}
  meta/pairs_{train,val}.json
```
Model: MobileNetV2 (torchvision, ImageNet-pretrained → fine-tuned)
Input: 224×224
Target classes (5): creative_studio, dance_studio, music_rehearsal_room, small_theater_gallery, study_room
Data: ~50 images per class (first-pass crawling → second-pass manual cleaning), then split train:val = 8:2 in ImageFolder format
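The crawl → clean → 8:2 split step above can be sketched as follows; `split_imagefolder` and the demo directories are hypothetical illustrations, not the project's actual tooling:

```python
import random
import shutil
import tempfile
from pathlib import Path

def split_imagefolder(src: Path, dst: Path, val_ratio: float = 0.2, seed: int = 42) -> None:
    """Copy src/<class>/* into dst/{train,val}/<class>/ with an 8:2 split (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed, matching the project's seed=42 convention
    for class_dir in sorted(p for p in src.iterdir() if p.is_dir()):
        images = sorted(class_dir.glob("*.jpg")) + sorted(class_dir.glob("*.png"))
        rng.shuffle(images)
        n_val = int(len(images) * val_ratio)
        for split, subset in (("val", images[:n_val]), ("train", images[n_val:])):
            out = dst / split / class_dir.name
            out.mkdir(parents=True, exist_ok=True)
            for img in subset:
                shutil.copy2(img, out / img.name)

# Demo on dummy data (two of the five classes, 10 files each):
base = Path(tempfile.mkdtemp())
src, dst = base / "raw", base / "space_cls"
for cls in ("study_room", "dance_studio"):
    (src / cls).mkdir(parents=True)
    for i in range(10):
        (src / cls / f"{i:02d}.jpg").write_bytes(b"")
split_imagefolder(src, dst)
```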
```shell
# Training script
python train_mobilenet.py
# Outputs: mobilenetv2.pth, class_names.txt
```

Test performance (per-class accuracy)
| Class | Correct / Total | Acc. |
|---|---|---|
| creative_studio | 10 / 10 | 100.0% |
| dance_studio | 8 / 10 | 80.0% |
| music_rehearsal_room | 7 / 10 | 70.0% |
| small_theater_gallery | 8 / 10 | 80.0% |
| study_room | 10 / 10 | 100.0% |
Overall Acc: 43/50 = 86.0% Β· Macro Acc: 86.0%
Main confusions: dance_studio → small_theater_gallery (2 cases), music_rehearsal_room → study_room (2 cases), etc.
Model: YOLOv8n (Ultralytics, COCO-pretrained → custom fine-tuned)
Purpose: detect 13 classes of space items in indoor photos.
- Classes (13): air_conditioner, chair, desk, drum, microphone, mirror, monitor, piano, projector, speaker, spotlight, stage, whiteboard
- Build: draft boxes generated by an automatic box-labeling pipeline over the scene data → manually corrected
- Format: YOLO (images/, labels/*.txt; each txt line: cls cx cy w h)
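A minimal helper for reading that label format (normalized `cls cx cy w h` → pixel corners); `yolo_to_xyxy` is an illustrative name, not part of the project:

```python
def yolo_to_xyxy(line: str, img_w: int, img_h: int):
    """Convert one 'cls cx cy w h' label line (normalized) to pixel (cls, x1, y1, x2, y2)."""
    cls, cx, cy, w, h = line.split()
    cx, cy, w, h = (float(v) for v in (cx, cy, w, h))
    x1 = (cx - w / 2) * img_w
    y1 = (cy - h / 2) * img_h
    x2 = (cx + w / 2) * img_w
    y2 = (cy + h / 2) * img_h
    return int(cls), x1, y1, x2, y2

# e.g. a desk (class 2) box centred in a 640×480 image:
box = yolo_to_xyxy("2 0.5 0.5 0.25 0.5", 640, 480)
print(box)  # → (2, 240.0, 120.0, 400.0, 360.0)
```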
Label statistics (instances per class)
| ID | Class | train+val | test |
|---|---|---|---|
| 0 | air_conditioner | 1026 | 7 |
| 1 | chair | 2058 | 94 |
| 2 | desk | 4199 | 69 |
| 3 | drum | 3444 | 13 |
| 4 | microphone | 1126 | 15 |
| 5 | mirror | 1185 | 44 |
| 6 | monitor | 1391 | 14 |
| 7 | piano | 1265 | 22 |
| 8 | projector | 1026 | 10 |
| 9 | speaker | 1900 | 22 |
| 10 | spotlight | 6136 | 142 |
| 11 | stage | 1012 | 8 |
| 12 | whiteboard | 1504 | 13 |
YOLO data config example (space.yaml)
```yaml
path: ./space_data
train: images/train
val: images/val
test: images/test
names:
  0: air_conditioner
  1: chair
  2: desk
  3: drum
  4: microphone
  5: mirror
  6: monitor
  7: piano
  8: projector
  9: speaker
  10: spotlight
  11: stage
  12: whiteboard
```

Training / evaluation / inference
```python
from ultralytics import YOLO

model = YOLO("yolov8n.pt")
model.train(data="space_data/space.yaml", imgsz=640, epochs=80, batch=16, seed=42)
model.val(data="space_data/space.yaml", imgsz=640, split="val")
model.predict(source="space_data/images/test", imgsz=640, conf=0.25, save=True)
```

Performance summary
- 80 epochs completed; best/last weights saved at runs/detect/space_no_leak/weights/best.pt
- Val: mAP50=0.981, mAP50-95=0.912
- Test: mAP50=0.979, mAP50-95=0.910
| Split | mAP@0.5 | mAP@0.5:0.95 | ImgSize | Model |
|---|---|---|---|---|
| Val | 0.981 | 0.912 | 640 | YOLOv8n |
| Test | 0.979 | 0.910 | 640 | YOLOv8n |
Speed (reference, T4): ~0.3 ms preprocess, 5.0 ms inference, 3.2 ms postprocess per image
Performance evaluation visualizations
Model: TinyChangeUNet (MobileNetV3 Small encoder + TinyDecoder)
Input: before (3) + after (3) + diff (1) = 7 channels (diff = mean(|before - after|))
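A numpy sketch of the 7-channel input assembly described above (the training code presumably does the same in torch):

```python
import numpy as np

def build_input(before: np.ndarray, after: np.ndarray) -> np.ndarray:
    """Stack before(3) + after(3) + diff(1) into a 7-channel CHW array,
    with diff = mean(|before - after|) over the color channels."""
    diff = np.abs(before - after).mean(axis=0, keepdims=True)  # (1, H, W)
    return np.concatenate([before, after, diff], axis=0)       # (7, H, W)

before = np.zeros((3, 256, 256), dtype=np.float32)
after = before.copy()
after[:, 100:120, 100:120] = 1.0  # one changed patch
x = build_input(before, after)
```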
Goals
- Automatically apply varied, realistic changes (occlusion / blur / pixelation / inpainting / moves) to generate (before, after, mask) triplets in one pass
- Mask convention: 0 = background, 255 = changed region
- Uses: change detection, before/after comparison, segmentation training/benchmarking
Generation logic summary
- Select a region from a YOLO label box, then apply one of: black / rect (noise) / blur / pixelate / inpaint / move (region move)
- Box jitter and partial occlusion diversify difficulty
- Output: before_images/ (originals), after_images/ (modified), labels/ (0/255 PNG masks)
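A minimal numpy sketch of this generation logic for the `black` variant, assuming pixel-space boxes; `make_change_pair` is hypothetical, and the real pipeline also implements the blur/pixelate/inpaint/move variants:

```python
import numpy as np

def make_change_pair(before: np.ndarray, box, rng: np.random.Generator):
    """Apply one synthetic change ('black' variant, with box jitter) and return (after, mask).

    before: HWC uint8 image; box: (x1, y1, x2, y2) pixel box from a YOLO label.
    Mask convention as above: 0 = background, 255 = changed region.
    """
    h, w = before.shape[:2]
    x1, y1, x2, y2 = box
    jitter = rng.integers(-5, 6, size=4)  # small box jitter for difficulty
    x1 = int(np.clip(x1 + jitter[0], 0, w - 1)); y1 = int(np.clip(y1 + jitter[1], 0, h - 1))
    x2 = int(np.clip(x2 + jitter[2], x1 + 1, w)); y2 = int(np.clip(y2 + jitter[3], y1 + 1, h))
    after = before.copy()
    after[y1:y2, x1:x2] = 0                # 'black' change; blur/pixelate/etc. are analogous
    mask = np.zeros((h, w), dtype=np.uint8)
    mask[y1:y2, x1:x2] = 255
    return after, mask

rng = np.random.default_rng(42)
before = np.full((128, 128, 3), 200, dtype=np.uint8)
after, mask = make_change_pair(before, (30, 30, 90, 90), rng)
```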
λͺ¨λΈ ꡬ쑰 μμ½
concat([before, after, diff]) β 1Γ1 convλ‘ 7ch β 3ch μΆμ- Encoder:
MobileNetV3 Small (timm, features_only)β μ±λ μ κ·ν(24/40/64/96) - Decoder(TinyDecoder):
ConvTranspose2dμ μν + μ€ν΅ +DWConvBlock - Head:
1Γ1 conv β logit(1ch)β bilinear μ μν(μν΄μλ)
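One decoder stage can be sketched in torch as below; the internals of `DWConvBlock` are an assumption (a standard depthwise-separable convolution), since the README only names the block:

```python
import torch
import torch.nn as nn

class DWConvBlock(nn.Module):
    """Depthwise-separable conv block (assumed structure for the doc's DWConvBlock)."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False),  # depthwise
            nn.Conv2d(in_ch, out_ch, 1, bias=False),                          # pointwise
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class DecoderStage(nn.Module):
    """One TinyDecoder stage: ConvTranspose2d upsample, concat skip, DWConvBlock."""
    def __init__(self, in_ch: int, skip_ch: int, out_ch: int):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
        self.fuse = DWConvBlock(out_ch + skip_ch, out_ch)

    def forward(self, x, skip):
        x = self.up(x)                  # 2× spatial upsample
        x = torch.cat([x, skip], dim=1) # skip connection from the encoder
        return self.fuse(x)

stage = DecoderStage(in_ch=96, skip_ch=64, out_ch=64)  # channels from the 24/40/64/96 ladder
y = stage(torch.randn(1, 96, 8, 8), torch.randn(1, 64, 16, 16))
```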
Training / evaluation (change_detection.ipynb)
- Defaults: IMG_SIZE=256, BATCH=8, EPOCHS=40, LR=3e-4
- Training loop: AMP (FP16), cosine schedule + 2-epoch warmup, EMA (0.99), gradient clipping
- Loss: BCEWithLogits(pos_weight) + Tversky (α=0.7, β=0.3)
- Validation threshold sweep: choose th ∈ [0.02, 0.40] that maximizes F1
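The Tversky term and the threshold sweep can be illustrated in numpy as follows. The FP/FN weighting shown follows the standard Salehi et al. convention, which is an assumption about this project, and the real notebook presumably uses torch's BCEWithLogitsLoss alongside it:

```python
import numpy as np

def tversky_loss(probs, target, alpha=0.7, beta=0.3, eps=1e-7):
    """1 - Tversky index; alpha weights false positives, beta false negatives
    (standard convention; the project's exact assignment is an assumption)."""
    tp = (probs * target).sum()
    fp = (probs * (1 - target)).sum()
    fn = ((1 - probs) * target).sum()
    return 1.0 - (tp + eps) / (tp + alpha * fp + beta * fn + eps)

def sweep_threshold(probs, target, ths=np.arange(0.02, 0.401, 0.02)):
    """Pick the threshold in [0.02, 0.40] that maximizes pixel F1, as described above."""
    best_th, best_f1 = ths[0], -1.0
    for th in ths:
        pred = (probs >= th).astype(np.float32)
        tp = (pred * target).sum()
        fp = (pred * (1 - target)).sum()
        fn = ((1 - pred) * target).sum()
        f1 = 2 * tp / (2 * tp + fp + fn + 1e-7)
        if f1 > best_f1:
            best_th, best_f1 = th, f1
    return best_th, best_f1

target = np.zeros((64, 64), dtype=np.float32)
target[20:40, 20:40] = 1.0
probs = target * 0.35 + 0.01  # weak but correctly-located predictions
th, f1 = sweep_threshold(probs, target)
```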
Performance summary (from validation logs)
- Early stopping on F1; 39 epochs total
- Best F1 (EMA): 0.510 @ th=0.36
- Final epoch (38): train loss=0.4530, val loss=0.6248, mIoU=0.413, F1=0.513
| Split | Best-th | mIoU | F1 | AMP | EMA |
|---|---|---|---|---|---|
| Val | 0.36 | 0.413 | 0.513 | ✓ | ✓ |
- Common seed: 42 (prevents split leakage; pins logs/checkpoints)
- Validation and checkpointing use the EMA weights