ResearchHub | Open Science Community

Ming Yan

Author with expertise in Statistical Machine Translation and Natural Language Processing

Key Stats

Upvotes received: 0

Publications: 3 (33% Open Access)

Cited by: 1

h-index: 18 / i10-index: 22

Reputation

Biology: < 1%

Chemistry: < 1%

Economics: < 1%

Publications

mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding

Anwen Hu et al., Jan 1, 2024

Artificial Intelligence

Computer Vision And Pattern Recognition

0

Paper

MIBench: Evaluating Multimodal Large Language Models over Multiple Images

Haowei Liu et al., Jan 1, 2024

Artificial Intelligence

Computer Vision And Pattern Recognition

0

Paper

mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding

Anwen Hu et al., Sep 5, 2024

Multimodal Large Language Models (MLLMs) have achieved promising OCR-free document understanding performance by increasing the supported resolution of document images. However, this comes at the cost of generating thousands of visual tokens for a single document image, leading to excessive GPU memory usage and slower inference, particularly in multi-page document comprehension. To address these challenges, we propose a High-resolution DocCompressor module that compresses each high-resolution document image into 324 tokens, guided by low-resolution global visual features. With this compression module, to strengthen multi-page document comprehension and balance token efficiency against question-answering performance, we develop DocOwl2 under a three-stage training framework: Single-image Pretraining, Multi-image Continue-pretraining, and Multi-task Finetuning. DocOwl2 sets a new state of the art across multi-page document understanding benchmarks and reduces first-token latency by more than 50%, demonstrating advanced capabilities in multi-page question answering, explanation with evidence pages, and cross-page structure understanding. Additionally, compared to single-image MLLMs trained on similar data, DocOwl2 achieves comparable single-page understanding performance with less than 20% of the visual tokens. Our code, models, and data are publicly available at https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/DocOwl2.
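The token savings described in the abstract can be illustrated with a little arithmetic. The sketch below is not taken from the DocOwl2 code base: the function names are hypothetical, and the baseline of 2,000 visual tokens per page is an assumed figure chosen only to stand in for "thousands of visual tokens". The only number taken from the abstract is the 324 tokens per compressed page.

```python
# Tokens per page after compression, as stated in the abstract.
COMPRESSED_TOKENS_PER_PAGE = 324

# Assumed baseline for an uncompressed high-resolution page. This value is
# hypothetical and is NOT reported in the abstract.
ASSUMED_BASELINE_TOKENS_PER_PAGE = 2000


def visual_token_budget(num_pages: int, tokens_per_page: int) -> int:
    """Total visual tokens a model must process for a multi-page document."""
    if num_pages < 0 or tokens_per_page < 0:
        raise ValueError("counts must be non-negative")
    return num_pages * tokens_per_page


def compression_ratio(baseline_tokens_per_page: int) -> float:
    """Fraction of the baseline tokens kept after compressing to 324 per page."""
    if baseline_tokens_per_page <= 0:
        raise ValueError("baseline must be positive")
    return COMPRESSED_TOKENS_PER_PAGE / baseline_tokens_per_page


if __name__ == "__main__":
    pages = 10
    print(visual_token_budget(pages, ASSUMED_BASELINE_TOKENS_PER_PAGE))  # 20000
    print(visual_token_budget(pages, COMPRESSED_TOKENS_PER_PAGE))        # 3240
    print(round(compression_ratio(ASSUMED_BASELINE_TOKENS_PER_PAGE), 3)) # 0.162
```

Under this assumed baseline, a 10-page document drops from 20,000 to 3,240 visual tokens, a ratio of about 16%, which is in line with the "less than 20%" figure the abstract reports for single-page comparisons.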

Artificial Intelligence

Information Systems

2

Paper
