2.5 Years in Class:

A Multimodal Textbook for Vision-Language Pretraining

Zhejiang University
DAMO Academy, Alibaba Group


Abstract

Compared to image-text pair data, interleaved corpora enable Vision-Language Models (VLMs) to understand the world more naturally, as humans do. However, existing datasets of this kind are crawled from webpages and face challenges such as low knowledge density, loose image-text relations, and poor logical coherence between images. Meanwhile, the internet hosts vast numbers of instructional videos (e.g., online geometry courses) that humans widely use to learn foundational subjects, yet these valuable resources remain underexplored in VLM training. In this paper, we introduce a high-quality multimodal textbook corpus with richer foundational knowledge for VLM pretraining. It collects over 2.5 years of instructional videos, totaling 22,000 class hours. We first use an LLM-proposed taxonomy to systematically gather instructional videos. We then progressively extract and refine visual (keyframes), audio (ASR), and textual (OCR) knowledge from the videos, and organize it into an image-text interleaved corpus in temporal order. Compared to its counterparts, our video-centric textbook offers more coherent context, richer knowledge, and better image-text alignment. Experiments demonstrate its superb pretraining performance, particularly on knowledge- and reasoning-intensive tasks such as ScienceQA and MathVista. Moreover, VLMs pre-trained on our textbook exhibit outstanding interleaved context awareness, leveraging visual and textual cues in their few-shot context for task solving.
Multimodal Textbook
Previous interleaved datasets, e.g., MMC4 and OBELICS, suffer from limitations such as weak text-image relations, low knowledge density, and incoherent image sequences. Our multimodal textbook, sourced from massive tutorial videos, employs coarse-to-fine knowledge extraction and multi-level filtering to create a high-quality, textbook-level dataset. It interleaves video keyframes with tutorial texts (extracted via ASR and OCR), enabling VLMs to acquire rich knowledge through tightly coupled image-text pairs and more coherent logic.

Curation of Multimodal Textbook

In this paper, we introduce a multimodal textbook: a high-quality pre-training corpus that encompasses a wealth of foundational knowledge. Our textbook is constructed from 2.5 years of instructional videos, amounting to 22,000 class hours and covering six fundamental subjects, including mathematics, physics, and chemistry. The whole corpus is presented in an image-text interleaved format, where text and images are more closely aligned and the logical relations between images are more coherent.
An illustration of constructing a multimodal textbook from instructional videos. We first instruct LLMs to construct a knowledge taxonomy, then retrieve and filter videos at the metadata level, collecting 159K instructional videos. A video-to-textbook pipeline is then designed for multi-level knowledge extraction: we filter out non-instructional videos using ASR transcripts, retaining 75K high-quality videos; we use ASR timestamps to segment long videos into short clips, discarding those whose visuals are misaligned with the ASR; and we detect keyframes in each clip and extract text and symbols via OCR. The pipeline produces 6.5M keyframes, 258M ASR tokens, and 500M OCR tokens, organized into an image-text interleaved textbook.
  • An LLM-powered Pipeline for Automatically Collecting Instructional Videos: We first prompt LLMs to construct a knowledge taxonomy covering six subjects and 3,915 knowledge points. Based on this taxonomy, we then gather relevant instructional videos.
  • A Video-to-Textbook Pipeline: We design a multi-level, coarse-to-fine knowledge extraction and data filtering pipeline for the collected videos (a minimal sketch follows this list):
    • From a visual perspective, we extract keyframes and recognize text, symbols, and formulas via OCR.
    • From an auditory perspective, we perform automatic speech recognition (ASR) on the instructors' verbal explanations and refine the transcripts' quality.
    • The keyframes and tutorial text are organized into an interleaved format, sequenced chronologically.
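The sketch below illustrates how a single clip might be turned into one interleaved segment: frames are sampled, a frame is kept as a keyframe when it differs substantially from the previously kept one, OCR is run on each keyframe, and the result is paired with the clip's ASR text in temporal order. The library choices (OpenCV, pytesseract), the difference threshold, and all function and field names are illustrative assumptions, not the paper's exact implementation.

```python
# Hypothetical sketch of the clip -> interleaved-segment step (not the authors' exact code).
# Assumes: opencv-python (cv2), numpy, pytesseract; thresholds are illustrative.
import cv2
import numpy as np
import pytesseract

def clip_to_interleaved_segment(clip_path: str, asr_text: str,
                                sample_every_s: float = 1.0,
                                diff_threshold: float = 12.0):
    """Detect keyframes by mean absolute pixel difference, OCR them,
    and pair the results with the clip's ASR transcript."""
    cap = cv2.VideoCapture(clip_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(int(fps * sample_every_s), 1)

    keyframes, prev_gray, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            # Keep the frame only if it differs enough from the last kept keyframe.
            if prev_gray is None or np.mean(cv2.absdiff(gray, prev_gray)) > diff_threshold:
                ocr_text = pytesseract.image_to_string(gray)  # text/symbols on the slide or board
                keyframes.append({"time_s": idx / fps, "image": frame, "ocr": ocr_text.strip()})
                prev_gray = gray
        idx += 1
    cap.release()

    # Keyframes (with their OCR text) in temporal order, plus the refined ASR text of the clip.
    return {"images": keyframes, "asr": asr_text}
```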
Our textbook is an openly accessible pre-training dataset comprising 6.5 million high-quality images interleaved with 0.75 billion text tokens. It is drawn from 75,000 instructional videos totaling over 22,000 class hours and covering multiple core subjects such as mathematics, physics, and chemistry. Our textbook (the first example) presents three keyframes interleaved with four tutorial texts to dynamically illustrate the geometric concept of complementary angles. This more coherent interleaved context and better-aligned image-text sequence enable VLMs to better grasp foundational knowledge during pretraining.
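For concreteness, a single interleaved sample can be pictured as a chronological list of images and text segments. The field names, example content, and the "<image>" placeholder below are assumptions about the on-disk format rather than the released schema; the helper simply flattens such a sample into the placeholder-plus-image-list form that LLaVA-style models typically consume.

```python
# Illustrative sketch of an interleaved sample and how it might be flattened
# for a LLaVA-style model. Field names and the "<image>" token are assumptions.
from typing import List, Tuple

sample = {
    "subject": "mathematics",
    "content": [
        {"type": "image", "path": "keyframe_0001.jpg"},
        {"type": "text",  "value": "Two angles are complementary if their measures sum to 90 degrees."},
        {"type": "image", "path": "keyframe_0002.jpg"},
        {"type": "text",  "value": "Here angle A is 35 degrees, so its complement B must be 55 degrees."},
    ],
}

def flatten(sample: dict, image_token: str = "<image>") -> Tuple[str, List[str]]:
    """Turn an interleaved sample into (prompt_text, image_paths),
    keeping the original chronological order of images and text."""
    pieces, images = [], []
    for item in sample["content"]:
        if item["type"] == "image":
            pieces.append(image_token)
            images.append(item["path"])
        else:
            pieces.append(item["value"])
    return "\n".join(pieces), images

text, images = flatten(sample)
print(text)    # interleaved text with <image> placeholders
print(images)  # keyframe paths, in order
```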

Pretraining with Multimodal Textbook

We first employ LLaVA-1.5-7B as the base model to study pretraining performance on our dataset and on reference datasets (MMC4, OBELICS). For LLaVA-1.5-7B, we apply continual pretraining to its pre-trained model (aligned using 558K paired samples). To investigate our dataset more comprehensively, we also pre-train the Idefics2-8B model on our dataset; it is an advanced VLM that already supports multi-image, interleaved input. For Idefics2-8B, we design two pretraining settings: (1) training from scratch using the Idefics2-8B architecture (i.e., Idefics2-8B with a randomly initialized projector), and (2) continual pretraining from Idefics2-8B-base, which is already pre-trained on OBELICS. For a fair comparison, we sample an equivalent number of samples (610K) from MMC4 and OBELICS and apply the same training parameters across all datasets.
Each cell reports accuracy under 0 / 1 / 2 / 4 shots.

Datasets | ScienceQA-IMG | OKVQA | TextVQA | TextVQA-OCR
MMC4 | - / 1.6 / 3.9 / 11.6 | 8.6 / 23.6 / 21.5 / 28.7 | 12.1 / 16.2 / 16.8 / 20.9 | 14.5 / 23.9 / 29.9 / 34.7
MMC4-Core-ff | - / 2.1 / 10.1 / 10.2 | 11.8 / 21.2 / 25.3 / 30.4 | 13.6 / 18.7 / 18.8 / 22.1 | 16.1 / 26.6 / 28.7 / 33.1
OBELICS | - / 2.8 / 3.0 / 16.4 | 13.0 / 31.7 / 35.7 / 37.5 | 9.2 / 26.5 / 30.2 / 32.2 | 11.0 / 30.7 / 36.3 / 41.0
Textbook-6.5M | 26.3 / 29.4 / 25.1 / 37.3 | 10.2 / 31.2 / 36.8 / 39.9 | 11.8 / 26.7 / 32.1 / 33.5 | 14.1 / 33.1 / 36.4 / 42.8

Datasets | MathVista | MathVision | MathVerse | Avg.
MMC4 | 20.4 / 30.0 / 27.9 / 26.0 | 12.2 / 21.3 / 15.5 / 16.1 | 8.6 / 19.4 / 21.2 / 15.9 | 10.9 / 19.4 / 19.5 / 21.9
MMC4-Core-ff | 22.5 / 33.0 / 29.2 / 27.8 | 13.7 / 23.4 / 16.3 / 17.7 | 8.6 / 19.9 / 21.8 / 15.2 | 12.3 / 20.7 / 21.4 / 22.3
OBELICS | 21.6 / 28.5 / 31.1 / 27.6 | 13.4 / 20.1 / 16.8 / 14.9 | 6.9 / 19.4 / 20.7 / 14.0 | 10.7 / 22.8 / 24.8 / 26.2
Textbook-6.5M | 24.3 / 43.4 / 33.2 / 29.2 | 14.5 / 25.6 / 18.2 / 18.1 | 7.7 / 28.5 / 19.8 / 14.6 | 15.5 / 31.1 / 28.8 / 30.8

We continually pre-train the LLaVA-1.5-7B base model on different interleaved datasets. The results are evaluated on four common VQA benchmarks and three math-related benchmarks under few-shot settings (0, 1, 2, and 4 shots).
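The sketch below shows one plausible way a k-shot interleaved prompt could be assembled for such an evaluation: k demonstrations, each an image with its question and answer, followed by the test image and question. The "<image>" placeholder and the exact prompt wording are assumptions, not the evaluation harness used in the paper.

```python
# Hypothetical sketch of k-shot interleaved VQA prompt construction for a multi-image VLM.
from typing import List, Tuple

def build_few_shot_prompt(exemplars: List[Tuple[str, str, str]],
                          test_image: str, test_question: str,
                          image_token: str = "<image>") -> Tuple[str, List[str]]:
    """exemplars: list of (image_path, question, answer) demonstrations.
    Returns (prompt_text, ordered_image_paths)."""
    parts, images = [], []
    for image_path, question, answer in exemplars:
        parts.append(f"{image_token}\nQuestion: {question}\nAnswer: {answer}")
        images.append(image_path)
    # The test sample comes last, with its answer left for the model to complete.
    parts.append(f"{image_token}\nQuestion: {test_question}\nAnswer:")
    images.append(test_image)
    return "\n\n".join(parts), images
```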

Each cell lists OKVQA / TextVQA / MathVista / MathVision / MathVerse.

Dataset | Continual pre-training from Idefics2-8B-base | Pre-training Idefics2-8B from scratch
MMC4-cf | 54.1 / 57.7 / 27.8 / 14.0 / 17.3 | 9.4 / 25.1 / 24.0 / 13.3 / 18.3
OBELICS | 54.6 / 57.5 / 27.6 / 14.3 / 17.5 | 10.5 / 25.7 / 24.2 / 13.6 / 17.7
Textbook-6.5M | 55.1 / 58.2 / 29.7 / 16.2 / 19.4 | 10.1 / 26.8 / 26.1 / 14.4 / 19.8

Besides LLaVA, we also pre-train an advanced VLM with multi-image ability (Idefics2), either via continual pretraining from Idefics2-8B-base or via pre-training from scratch. The evaluations are extended to an 8-shot setting using randomly selected examples, following previous works.

Exploring Multimodal Textbook

We synthesize a knowledge taxonomy with 3,915 knowledge points across 6 subjects, which enables us to automatically collect 159K English instructional videos. Following our video-to-textbook pipeline, we filter out 53% of the videos as low-quality or repetitive and retain 75K videos (22,697 class hours) with an average duration of 18 minutes. We then extract 6.5M keyframes and 0.75B text (ASR + OCR) tokens from these videos, producing a total of 610K interleaved samples. Each sample contains an average of 10.7 keyframes and 1,297 text tokens. The detailed statistics for each subject are as follows:

Subject | #Video | Duration (h) | #Topic | #Video Clip | #Keyframe | #ASR Token | #OCR Token | #Sample
Mathematics | 21.7k | 4,423 | 725 | 809k | 1.67M | 72.5M | 145M | 123k
Physics | 11k | 3,511 | 530 | 822k | 0.95M | 36.7M | 73.4M | 119k
Chemistry | 4.5k | 2,643 | 410 | 234k | 0.49M | 15M | 30M | 32k
Earth Science | 12k | 3,670 | 520 | 640k | 1.03M | 40M | 80M | 88k
Engineering | 13k | 4,096 | 810 | 713k | 1.15M | 43.3M | 86.6M | 98k
Computer Science | 12.8k | 4,354 | 820 | 782k | 1.21M | 42.8M | 85.5M | 150k
All | 75k | 22,697 | 3,915 | 4M | 6.58M | 258M | 500M | 610k

The statistics of our multimodal textbook.