Lec15 Transformer-based Detectors

DETR

End-to-End Object Detection with Transformers

Set Prediction

The set of box predictions produced by the model is matched against the objects actually present in the image to find the optimal assignment, and the loss is then computed on that assignment.

Bipartite Matching Loss

Step1 Hungarian matching (optimal assignment)

: Find the best permutation of the box predictions

$\hat \sigma = \argmin_{\sigma\in\mathfrak{S}_N} \sum_{i=1}^{N} \mathfrak L_{\text{match}}(y_i, \hat y_{\sigma(i)})$
- $\mathfrak S_N$: the set of all permutations of $N$ elements, $\sigma$: one such permutation
- $\mathfrak L_{\text{match}}(\cdot)$: a pair-wise matching cost, $y_i$: Ground Truth, $\hat y _{\sigma(i)}$: Prediction with index σ(i)
- Worked example:
  
  Assume $\mathfrak L_{\text{match}}(y_i, \hat y_{\sigma(i)})= (y_i-\hat y_{\sigma(i)})^2$
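The permutation search in Step1 can be sketched in a few lines. This is a minimal illustration, not DETR's implementation: it uses the toy squared-difference cost assumed above, scalar targets and predictions with made-up values, and a brute-force search over all permutations (the names `match_cost` and `best_permutation` are hypothetical).

```python
from itertools import permutations

def match_cost(y, y_hat):
    # Toy pair-wise cost from the example above: squared difference.
    return (y - y_hat) ** 2

def best_permutation(targets, predictions):
    """Return (sigma, total_cost) minimizing the summed matching cost.

    sigma[i] is the index of the prediction assigned to target i.
    """
    n = len(targets)
    best_sigma, best_total = None, float("inf")
    for sigma in permutations(range(n)):  # every element of S_N
        total = sum(match_cost(targets[i], predictions[sigma[i]])
                    for i in range(n))
        if total < best_total:
            best_sigma, best_total = sigma, total
    return best_sigma, best_total

# Hypothetical values: three ground truths, three predictions in arbitrary order.
targets = [1.0, 5.0, 9.0]
predictions = [8.0, 2.0, 5.0]

sigma, total = best_permutation(targets, predictions)
print(sigma, total)  # (1, 2, 0) 2.0
```

Brute force costs $O(N!)$ and is only usable for tiny $N$. The Hungarian algorithm finds the same optimal assignment in polynomial time; in Python, `scipy.optimize.linear_sum_assignment` solves this assignment problem on a cost matrix.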

The pair-wise matching cost used by DETR is:

$\mathfrak L_{\text{match}}(y_i, \hat y_{\sigma(i)})=-\mathbb I_{\{c_i \neq \emptyset\}}\hat p_{\sigma(i)}(c_i)+\mathbb I_{\{c_i \neq \emptyset\}} \mathfrak L_{\text{box}}(b_i, \hat b_{\sigma(i)})$

$\mathfrak L_{\text{box}}(b_i, \hat b_{\sigma(i)}) = \lambda_{\text{iou}} \mathfrak L_{\text{iou}}(b_i, \hat b_{\sigma(i)})+\lambda_{L1}||b_i-\hat b_{\sigma(i)}||_1$
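- $\mathbb I$: indicator function, $c_i \neq \emptyset$: the ground-truth class is not "no object"
- $\hat p _{\sigma(i)}(c_i)$: predicted probability (confidence score) of class $c_i$ for prediction $\sigma(i)$, $c_i$: ground-truth class label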
- Example:
  
  → $\hat p(c)= 0.6$
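To make the two terms of the matching cost concrete, here is a small sketch for a single ground-truth/prediction pair. Several things are assumptions for illustration only: boxes are given as corner coordinates `(x_min, y_min, x_max, y_max)`, $\mathfrak L_{\text{iou}}$ is taken as plain `1 - IoU`, the weights `lambda_iou` and `lambda_l1` are placeholder values, and `None` stands for the "no object" class $\emptyset$.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def box_loss(b, b_hat, lambda_iou=2.0, lambda_l1=5.0):
    # L_box = lambda_iou * L_iou + lambda_L1 * ||b - b_hat||_1
    l_iou = 1.0 - iou(b, b_hat)  # assumption: plain IoU loss
    l1 = sum(abs(u - v) for u, v in zip(b, b_hat))
    return lambda_iou * l_iou + lambda_l1 * l1

def match_cost(gt_class, gt_box, pred_probs, pred_box):
    # The indicator is 0 for "no object" targets, so they cost nothing.
    if gt_class is None:
        return 0.0
    return -pred_probs[gt_class] + box_loss(gt_box, pred_box)

# Hypothetical prediction with p_hat(cat) = 0.6, as in the example above.
pred_probs = {"cat": 0.6, "dog": 0.3, None: 0.1}
gt_box = (0.2, 0.2, 0.6, 0.6)

cost_exact = match_cost("cat", gt_box, pred_probs, (0.2, 0.2, 0.6, 0.6))
cost_shifted = match_cost("cat", gt_box, pred_probs, (0.2, 0.2, 0.6, 0.8))
print(cost_exact, cost_shifted)
```

With a perfectly overlapping box the cost reduces to $-\hat p(c) = -0.6$; a worse box raises the cost, so that pairing is less likely to be chosen by the matcher.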
Step2 Hungarian Loss for training a NN

$\mathfrak L_{\text{Hungarian}}(y, \hat y)=\sum_{i=1}^N[-\log \hat p_{\hat \sigma(i)}(c_i)+ \mathbb I_{\{c_i \neq \emptyset\}} \mathfrak L_{\text{box}}(b_i, \hat b_{\hat\sigma(i)})]$
- $-\log \hat p_{\hat \sigma(i)}(c_i)$: cross-entropy term (for classification)
Step1 → find the optimal matching

Step2 → minimize the loss over the matched pairs
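Once $\hat\sigma$ is fixed, the Hungarian loss is a plain sum over the matched pairs. The sketch below evaluates it for a given matching; as a simplifying assumption, $\mathfrak L_{\text{box}}$ is reduced to an unweighted L1 distance, `None` stands for $\emptyset$, and all values and names (`hungarian_loss`, `l1_box_loss`) are hypothetical.

```python
import math

NO_OBJECT = None  # stands for the "no object" class

def l1_box_loss(b, b_hat):
    # Assumption: L_box simplified to an unweighted L1 distance.
    return sum(abs(u - v) for u, v in zip(b, b_hat))

def hungarian_loss(targets, predictions, sigma_hat, box_loss=l1_box_loss):
    """Sum of -log p(c_i) plus box loss for real objects, under sigma_hat.

    targets:     list of (class, box) pairs, padded with (NO_OBJECT, None)
    predictions: list of (class_probabilities, box) pairs
    sigma_hat:   sigma_hat[i] is the prediction index matched to target i
    """
    total = 0.0
    for i, (c_i, b_i) in enumerate(targets):
        probs, b_hat = predictions[sigma_hat[i]]
        total += -math.log(probs[c_i])      # classification term
        if c_i is not NO_OBJECT:            # indicator for the box term
            total += box_loss(b_i, b_hat)
    return total

# One real object plus one "no object" padding slot (N = 2).
targets = [("cat", (0.5, 0.5, 0.2, 0.2)), (NO_OBJECT, None)]
predictions = [
    ({"cat": 0.1, NO_OBJECT: 0.9}, (0.1, 0.1, 0.1, 0.1)),
    ({"cat": 0.6, NO_OBJECT: 0.4}, (0.5, 0.5, 0.2, 0.3)),
]
sigma_hat = (1, 0)  # assumed result of Step1

loss = hungarian_loss(targets, predictions, sigma_hat)
print(loss)
```

The matching itself is not differentiated: Step1 only selects which prediction is paired with which target, and gradients flow through the loss terms of those selected pairs.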

DETR Architecture

CNN backbone + positional encoding (PE)
Transformer
- Decoder limitation: it can detect at most as many objects as the predefined number of object queries
Feed-forward
- Query outputs are fed to feed-forward networks (FFNs) for the final prediction
- Classification + BBox regression
- Takes the decoder outputs as input and predicts the class and bbox values for each query
  
  The class output is a probability for each class; the bbox output is the center coordinates, height, and width (x,y,h,w)
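As a rough illustration of what these two outputs look like for a single query, the sketch below turns raw head outputs into class probabilities and a pixel-space box. It covers only the post-processing, not the FFN itself, and rests on assumptions: class scores are normalized with a softmax whose last entry is the "no object" class, box values are squashed to $[0, 1]$ with a sigmoid and interpreted as normalized `(cx, cy, h, w)`, and `decode_query` is a hypothetical helper name.

```python
import math

def softmax(logits):
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def decode_query(class_logits, box_logits, img_w, img_h):
    """Convert one query's raw outputs to (probabilities, corner box in pixels)."""
    probs = softmax(class_logits)
    # Assumption: box head outputs normalized (cx, cy, h, w) after a sigmoid.
    cx, cy, h, w = (sigmoid(z) for z in box_logits)
    x_min = (cx - w / 2) * img_w
    y_min = (cy - h / 2) * img_h
    x_max = (cx + w / 2) * img_w
    y_max = (cy + h / 2) * img_h
    return probs, (x_min, y_min, x_max, y_max)

# Hypothetical outputs for one query: two object classes + "no object".
probs, box = decode_query([2.0, 0.5, 0.0], [0.0, 0.0, 0.0, 0.0],
                          img_w=640, img_h=480)
print(probs)
print(box)  # (160.0, 120.0, 480.0, 360.0)
```

Because every query always emits one class distribution and one box, queries that do not correspond to an object are expected to put most of their probability on the "no object" class.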

Is NMS unnecessary?

DETR gives good results without NMS, although adding it can still improve them somewhat.

HOTR

(The slides for this part contain only figures.)

HO Pointers

Deformable DETR

Cross-Path Consistency

The goal is to arrive at the correct answer no matter which path is taken.