[4] LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding


Paper link: https://arxiv.org/pdf/2104.08836.pdf

github: https://github.com/microsoft/unilm


hugging face: https://huggingface.co/docs/transformers/model_doc/layoutxlm


Introduction

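LayoutXLM extends LayoutLMv2 to multilingual documents so that the model can be used regardless of language.
It is pre-trained on documents in 53 languages in total, including Korean.

The authors also built XFUND, a document dataset covering multiple languages.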

Architecture

LayoutXLM

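Same structure as LayoutLMv2
- Uses text, image, and layout information
- Each of the three is embedded separately and then combined into the input embedding (as in LayoutLMv2, the visual and text token embeddings are concatenated into one sequence and the layout embedding is added to each token)
- The input embedding is fed to a multi-modal transformer and goes through self-attention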
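
The embedding step described above can be sketched in plain Python. This is a toy illustration under stated assumptions, not the model's implementation: the embedding tables are random stand-ins for learned weights, the layout embedding is simplified to a sum of coordinate lookups, and the 1D position and segment embeddings are left out. All names (`normalize_box`, `make_table`, `layout_embedding`, `build_input`) are hypothetical. Boxes are scaled to the 0-1000 range, which is the convention the Hugging Face LayoutLMv2/LayoutXLM inputs use.

```python
import random

HIDDEN = 8  # toy hidden size, chosen only for this sketch


def normalize_box(box, width, height):
    """Scale a pixel-space (x0, y0, x1, y1) box to the 0-1000 range."""
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / width),
        int(1000 * y0 / height),
        int(1000 * x1 / width),
        int(1000 * y1 / height),
    ]


def make_table(num_rows, dim, rng):
    """Stand-in for a learned embedding table (random vectors here)."""
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_rows)]


def layout_embedding(box, x_table, y_table):
    """Toy layout embedding: sum of the lookups for x0, y0, x1, y1."""
    x0, y0, x1, y1 = box
    rows = [x_table[x0], y_table[y0], x_table[x1], y_table[y1]]
    return [sum(values) for values in zip(*rows)]


def build_input(text_ids, text_boxes, visual_feats, visual_boxes, tables):
    """Concatenate visual and text tokens, then add each token's layout embedding."""
    token_table, x_table, y_table = tables
    content = list(visual_feats) + [token_table[i] for i in text_ids]
    boxes = list(visual_boxes) + list(text_boxes)
    return [
        [c + l for c, l in zip(vec, layout_embedding(box, x_table, y_table))]
        for vec, box in zip(content, boxes)
    ]


rng = random.Random(0)
tables = (
    make_table(100, HIDDEN, rng),   # token embeddings for a 100-word toy vocabulary
    make_table(1001, HIDDEN, rng),  # x coordinates 0..1000
    make_table(1001, HIDDEN, rng),  # y coordinates 0..1000
)

# Three words on an 800x600 page, with pixel boxes from a hypothetical OCR step.
text_ids = [5, 17, 42]
pixel_boxes = [(80, 60, 160, 90), (200, 60, 320, 90), (80, 120, 400, 150)]
text_boxes = [normalize_box(b, 800, 600) for b in pixel_boxes]

# Four image regions (a 2x2 grid), each with a stand-in feature vector.
visual_feats = make_table(4, HIDDEN, rng)
visual_boxes = [[0, 0, 500, 500], [500, 0, 1000, 500],
                [0, 500, 500, 1000], [500, 500, 1000, 1000]]

inputs = build_input(text_ids, text_boxes, visual_feats, visual_boxes, tables)
print(len(inputs), len(inputs[0]))  # 7 tokens, each of size HIDDEN
```

The resulting sequence (4 visual tokens followed by 3 text tokens) is what a transformer encoder would then process with self-attention, so every text token can attend to every image region and vice versa.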

Experiment

Results after fine-tuning on several tasks

Result Image

red: header / green: key / blue: value
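
The legend above could be applied to model predictions roughly as follows. This is a minimal sketch, not code from the paper or the repository: `color_for` is a hypothetical helper, and the alias table assumes FUNSD/XFUND-style label names in which "question" plays the role of key and "answer" the role of value.

```python
# Legend from the post: red = header, green = key, blue = value.
LABEL_COLORS = {"header": "red", "key": "green", "value": "blue"}

# Assumption: FUNSD/XFUND-style names, where "question" corresponds to key
# and "answer" corresponds to value.
ALIASES = {"question": "key", "answer": "value"}


def color_for(tag, default="gray"):
    """Map a predicted tag such as 'B-QUESTION' or 'header' to a box color."""
    name = tag.split("-", 1)[-1].lower()  # drop a BIO prefix if present
    name = ALIASES.get(name, name)
    return LABEL_COLORS.get(name, default)


for tag in ["B-HEADER", "I-QUESTION", "B-ANSWER", "O"]:
    print(tag, "->", color_for(tag))
```

Tags outside the legend (for example "O" for other) fall back to the default color, so unlabeled text can still be drawn without being confused with a header, key, or value.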

Conclusion

Proposes LayoutXLM, a multi-modal pre-trained model for understanding documents written in various languages
Builds the XFUND dataset


