[11] Visual Instruction Tuning (LLaVA: Large Language and Vision Assistant)

728x90

[paper] https://arxiv.org/pdf/2304.08485

[Github] https://github.com/haotian-liu/LLaVA

GitHub - haotian-liu/LLaVA: [NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyo

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond. - haotian-liu/LLaVA

github.com

Abstract

기존 LLM의 문제점: 이미지를 입력 받지 못해 vision 정보를 처리하는데 어려움 ⇒ multi-modal 연구 부족
GPT-4를 사용해 multi-modal language-image instruction-following 데이터를 생성하는 방법 최초 제시
- LLaVA: Large Language and Vision Assistant
LLaVA: end- to end 학습, vision 인코더와 LLM 모델을 연결
이를 이용한 벤치마크 데이터셋 구축

Introduction

기존의 multi-modal vision-and-language instructions
- 기존에도 classification, detection, segmentation, captioning 등에 language를 이용하긴 했지만, 단순히 이미지를 설명하는데 그침.
  - 매우 제한적
  - 아래 예시처럼, user의 instruction은 입력 받을 수 없기 때문에, 이미지에 대해 대화하는 건 불가능

main contribution
- Multi-modal instruction-following data
  - image-text pair 데이터의 부족 문제 해결을 위해 ChatGPT와 GPT4를 이용해 instruction following 형식으로 변환하는 파이프라인 제안
- Large multi-modal models
  - CLIP의 visual encoder와 LLaMA를 연결하여 생성한 vision-language 데이터를 end-to-end로 fine tuning 하는 LLM 제안.
- Multimodal instruction-following benchmark
  - LLaVA-Bench 제안

GPT-assisted Visual Instruction Data Generation

기존 multi-modal 분야에서 쓰던 데이터인 CC, LAION은 instruction-following 형식이 아님
- 따라서, 이 데이터들을 chatGPT와 GPT4를 이용해 instruction-following 형식으로 만듦
특정 prompt 를 chatGPT와 GPT4의 input으로 사용.

1. 이미지는 input으로 사용하지 않고 이미지 caption과 bbox(COCO dataset에 라벨링되어 있는 값 사용)값들만 이용해 질문 및 대화 셋 생성 (symbolic representation) ⇒ LLM이 인식 가능한 시퀀스로 인코딩 가능

⇒ 이 정보들은 2의 context 자리에 들어가게 됨

2. COCO 이미지를 아래 prompt를 이용해 3가지 유형의 instruction-following 데이터 설계

3. conversation, detailed description, complex reasoning 총 3가지 유형의 데이터 생성

conversation ⇒ assistant-human이 대화하는 형태, 이미지만 보고 알 수 있는 것에 대한 QA 포함, 객체 종류/개수/위치/동작/상대적 위치 등의 시각적 요소 자체에 대한 질문들
detailed description ⇒ 이미지에 대한 상세한 설명을 생성
complex reasoning ⇒ 심층적인 추론을 하는 QA를 생성, 엄격한 설명이 포함된 응답을 구체적인 이유를 포함해 생성하도록 요구

4. 총 15만개의 데이터셋 생성

Visual Instruction Tuning

[Architecture]

LLM: Vicuna 사용 (당시 모델 중 instructio-following 분야에서 가장 성능이 좋았던 모델)

vision encoder: pretrained CLIP visual encoder인 ViT-L/14 사용(이미지를 visual feature화 하는데 사용)

*추가 된 부분: image feature를 word embedding space로 연결하는 linear layer

⇒ 굉장히 lightweight 하고 데이터 중심의 실험을 빠르게 반복 할 수 있는 cost-effective한 scheme

[Training]

각 이미지에 대해 conversation data 생성

각 answer를 assistant의 답변으로 간주
t 번째 instruction은 아래와 같이 설정

일관된 형태로 형성 가능

2. instruction tuning 수행

auto regressive 사용
앞에 나온 단어를 보고 다음 단어를 맞추는 방식
- 기존 모델들과 달리 image feature를 함꼐 사용한다는 점에서 차이가 있음

⇒ 시퀀스의 길이가 L 일때 정답 Xa에 대한 확률 ( X_instruct<i : 현재 예측 토큰인 Xi 이전 모든 경우에 대한 instruction tokens

X_a<i: 현재 예측 토큰인 Xi 이전 모든 경우에 대한 answer token)

3. loss 계산

초록색 부분의 token을 예측
loss도 이 부분만 계산

4. Fine tuning end-to-end

visual encoder 는 frozen, LLM과 projection layer만 학습

Experiments

pretrained parameter
- GPU: A100*8
- Batch size: 8
- epoch: 1
- lr: 2e-3
fine tuning parameter
- GPU: A100*8
- batch size: 32
- epoch: 3
- lr: 2e-5

⇒ LLaVA만이 움직이는 차안에서 다림질 하는 것이 이상함을 인지하고 답변함

⇒ 치킨으로 세계지도를 만들었음을 이해함

⇒ 유머러스하게 표현된 명작도 알아보고 설명가능

정량평가
- COCO dataset에서 랜덤하게 30장 뽑음
- GPT4의 답변을 GT로 설정, GPT4로 부터 종합적인 결과를 점수로 제공받음 (평균+- 표준편차형식)
  - 각 질문에 대해 사람 평가자가 점수 또는 순위 기반으로 평가
  - 모든 질문에서의 **평균(mean)과 표준편차(std)**를 산출
  - → 이걸 모델별로 평균 내어 수치로 표기함
  - 다양한 질문에 대한 성능의 일관성을 보여주기 좋은 지표
  - 표준편차가 작다는 건 모델 응답의 퀄리티가 안정적이라는 의미
  - 학습되지 않은 데이터셋에서도 우수한 성능을 보임
  - complex reasoning 에서 매우 우수한 성능

728x90

저작자표시 (새창열림)

'Paper Review > etc' 카테고리의 다른 글

[10] CrossViT: Cross-Attention Multi-Scale Vision Transformer for ImageClassification (2)	2023.12.26
[9] Supervised Contrastive Learning (1)	2023.11.23
[8] MOBILEVIT: LIGHT-WEIGHT, GENERAL-PURPOSE,AND MOBILE-FRIENDLY VISION TRANSFORMER (2)	2023.10.15
[7] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (0)	2023.09.12
[6] MobileOne: An Improved One millisecond Mobile Backbone (0)	2023.08.06

'Paper Review > etc' 카테고리의 다른 글

티스토리툴바