2.5 Years in Class:

A Multimodal Textbook for Vision-Language Pretraining

Zhejiang University
DAMO Academy, Alibaba Group


Abstract

Compared to image-text pair data, interleaved corpora enable Vision-Language Models (VLMs) to understand the world more naturally, as humans do. However, existing datasets of this kind are crawled from webpages and face challenges such as low knowledge density, loose image-text relations, and poor logical coherence between images. Meanwhile, the internet hosts vast numbers of instructional videos (e.g., online geometry courses) that humans widely use to learn foundational subjects, yet these valuable resources remain underexplored in VLM training. In this paper, we introduce a high-quality multimodal textbook corpus with richer foundational knowledge for VLM pretraining. It collects over 2.5 years of instructional videos, totaling 22,000 class hours. We first use an LLM-proposed taxonomy to systematically gather instructional videos. We then progressively extract and refine visual (keyframes), audio (ASR), and textual (OCR) knowledge from the videos, and organize it into an image-text interleaved corpus in temporal order. Compared to its counterparts, our video-centric textbook offers more coherent context, richer knowledge, and better image-text alignment. Experiments demonstrate its superb pretraining performance, particularly on knowledge- and reasoning-intensive tasks such as ScienceQA and MathVista. Moreover, VLMs pre-trained on our textbook exhibit outstanding interleaved context awareness, leveraging visual and textual cues in their few-shot context for task solving.
Multimodal Textbook
Previous interleaved datasets, e.g., MMC4 and OBELICS, suffer from limitations such as weak text-image relations, low knowledge density, and incoherent image sequences. Our multimodal textbook, sourced from massive tutorial videos, employs coarse-to-fine knowledge extraction and multi-level filtering to create a high-quality, textbook-level dataset. It interleaves video keyframes with tutorial texts (extracted via ASR and OCR), enabling VLMs to acquire rich knowledge through tightly coupled image-text pairs and more coherent logic.

Curation of Multimodal Textbook

In this paper, we introduce a multimodal textbook: a high-quality pre-training corpus that encompasses a wealth of foundational knowledge. Our textbook is constructed from 2.5 years of instructional videos, amounting to 22,000 class hours and covering six fundamental subjects, including mathematics, physics, and chemistry. The whole corpus is presented in an image-text interleaved format, where text and images are more closely aligned and the logical relations between images are more coherent.
An illustration of constructing a multimodal textbook from instructional videos. We first instruct LLMs to construct a knowledge taxonomy, then retrieve and filter videos at the metadata level, collecting 159K instructional videos. A video-to-textbook pipeline is then designed for multi-level knowledge extraction: we filter out non-instructional videos using ASR transcripts, retaining 75K high-quality videos; we use ASR timestamps to segment long videos into short clips, discarding those whose visuals are misaligned with the ASR; and we detect keyframes in each clip and extract text and symbols via OCR. The pipeline produces 6.5M keyframes, 258M ASR tokens, and 500M OCR tokens, organized into an image-text interleaved textbook.
  • An LLM-powered Pipeline for Automatically Collecting Instructional Videos: We first prompt LLMs to construct a knowledge taxonomy covering six subjects and 3,915 knowledge points. Based on this taxonomy, we then gather relevant instructional videos.
  • A Video-to-Textbook Pipeline: We design a multi-level, coarse-to-fine knowledge extraction and data filtering pipeline for the collected videos (a minimal sketch follows this list):
    • From a visual perspective, we extract keyframes and recognize text, symbols, and formulas via OCR.
    • From an auditory perspective, we perform automatic speech recognition (ASR) on the instructors' verbal explanations and refine the transcripts' quality.
    • The keyframes and tutorial text are organized into an interleaved format, sequenced chronologically.
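The sketch below illustrates how a single clip might be turned into one interleaved segment: frames are sampled, a frame is kept as a keyframe when it differs substantially from the previously kept one, OCR is run on each keyframe, and the result is paired with the clip's ASR text in temporal order. The library choices (OpenCV, pytesseract), the difference threshold, and all function and field names are illustrative assumptions, not the paper's exact implementation.

```python
# Hypothetical sketch of the clip -> interleaved-segment step (not the authors' exact code).
# Assumes: opencv-python (cv2), numpy, pytesseract; thresholds are illustrative.
import cv2
import numpy as np
import pytesseract

def clip_to_interleaved_segment(clip_path: str, asr_text: str,
                                sample_every_s: float = 1.0,
                                diff_threshold: float = 12.0):
    """Detect keyframes by mean absolute pixel difference, OCR them,
    and pair the results with the clip's ASR transcript."""
    cap = cv2.VideoCapture(clip_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(int(fps * sample_every_s), 1)

    keyframes, prev_gray, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            # Keep the frame only if it differs enough from the last kept keyframe.
            if prev_gray is None or np.mean(cv2.absdiff(gray, prev_gray)) > diff_threshold:
                ocr_text = pytesseract.image_to_string(gray)  # text/symbols on the slide or board
                keyframes.append({"time_s": idx / fps, "image": frame, "ocr": ocr_text.strip()})
                prev_gray = gray
        idx += 1
    cap.release()

    # Keyframes (with their OCR text) in temporal order, plus the refined ASR text of the clip.
    return {"images": keyframes, "asr": asr_text}
```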
Our textbook is an openly accessible pre-training dataset comprising 6.5 million high-quality images interleaved with 0.75 billion text tokens. It is drawn from 75,000 instructional videos totaling over 22,000 class hours and covering multiple core subjects such as mathematics, physics, and chemistry. Our textbook (the first example) presents three keyframes interleaved with four tutorial texts to dynamically illustrate the geometric concept of complementary angles. This more coherent interleaved context and better-aligned image-text sequence enable VLMs to better grasp foundational knowledge during pretraining.
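For concreteness, a single interleaved sample can be pictured as a chronological list of images and text segments. The field names, example content, and the "<image>" placeholder below are assumptions about the on-disk format rather than the released schema; the helper simply flattens such a sample into the placeholder-plus-image-list form that LLaVA-style models typically consume.

```python
# Illustrative sketch of an interleaved sample and how it might be flattened
# for a LLaVA-style model. Field names and the "<image>" token are assumptions.
from typing import List, Tuple

sample = {
    "subject": "mathematics",
    "content": [
        {"type": "image", "path": "keyframe_0001.jpg"},
        {"type": "text",  "value": "Two angles are complementary if their measures sum to 90 degrees."},
        {"type": "image", "path": "keyframe_0002.jpg"},
        {"type": "text",  "value": "Here angle A is 35 degrees, so its complement B must be 55 degrees."},
    ],
}

def flatten(sample: dict, image_token: str = "<image>") -> Tuple[str, List[str]]:
    """Turn an interleaved sample into (prompt_text, image_paths),
    keeping the original chronological order of images and text."""
    pieces, images = [], []
    for item in sample["content"]:
        if item["type"] == "image":
            pieces.append(image_token)
            images.append(item["path"])
        else:
            pieces.append(item["value"])
    return "\n".join(pieces), images

text, images = flatten(sample)
print(text)    # interleaved text with <image> placeholders
print(images)  # keyframe paths, in order
```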

Pretraining with Multimodal Textbook

We first employ LLaVA-1.5-7B as the base model to study pretraining performance on our dataset and on reference datasets (MMC4, OBELICS). For LLaVA-1.5-7B, we apply continual pretraining to its pre-trained model (aligned using 558K paired samples). To investigate our dataset more comprehensively, we also pre-train the Idefics2-8B model on our dataset; it is an advanced VLM that already supports multi-image, interleaved input. For Idefics2-8B, we design two pretraining settings: (1) training from scratch using the Idefics2-8B architecture (i.e., Idefics2-8B with a randomly initialized projector), and (2) continual pretraining from Idefics2-8B-base, which is already pre-trained on OBELICS. For a fair comparison, we sample an equivalent number of samples (610K) from MMC4 and OBELICS and apply the same training parameters across all datasets.
Each cell reports accuracy under 0 / 1 / 2 / 4 shots.

Datasets | ScienceQA-IMG | OKVQA | TextVQA | TextVQA-OCR
MMC4 | - / 1.6 / 3.9 / 11.6 | 8.6 / 23.6 / 21.5 / 28.7 | 12.1 / 16.2 / 16.8 / 20.9 | 14.5 / 23.9 / 29.9 / 34.7
MMC4-Core-ff | - / 2.1 / 10.1 / 10.2 | 11.8 / 21.2 / 25.3 / 30.4 | 13.6 / 18.7 / 18.8 / 22.1 | 16.1 / 26.6 / 28.7 / 33.1
OBELICS | - / 2.8 / 3.0 / 16.4 | 13.0 / 31.7 / 35.7 / 37.5 | 9.2 / 26.5 / 30.2 / 32.2 | 11.0 / 30.7 / 36.3 / 41.0
Textbook-6.5M | 26.3 / 29.4 / 25.1 / 37.3 | 10.2 / 31.2 / 36.8 / 39.9 | 11.8 / 26.7 / 32.1 / 33.5 | 14.1 / 33.1 / 36.4 / 42.8

Datasets | MathVista | MathVision | MathVerse | Avg.
MMC4 | 20.4 / 30.0 / 27.9 / 26.0 | 12.2 / 21.3 / 15.5 / 16.1 | 8.6 / 19.4 / 21.2 / 15.9 | 10.9 / 19.4 / 19.5 / 21.9
MMC4-Core-ff | 22.5 / 33.0 / 29.2 / 27.8 | 13.7 / 23.4 / 16.3 / 17.7 | 8.6 / 19.9 / 21.8 / 15.2 | 12.3 / 20.7 / 21.4 / 22.3
OBELICS | 21.6 / 28.5 / 31.1 / 27.6 | 13.4 / 20.1 / 16.8 / 14.9 | 6.9 / 19.4 / 20.7 / 14.0 | 10.7 / 22.8 / 24.8 / 26.2
Textbook-6.5M | 24.3 / 43.4 / 33.2 / 29.2 | 14.5 / 25.6 / 18.2 / 18.1 | 7.7 / 28.5 / 19.8 / 14.6 | 15.5 / 31.1 / 28.8 / 30.8

We continually pre-train the LLaVA-1.5-7B base model on different interleaved datasets. The results are evaluated on four common VQA benchmarks and three math-related benchmarks under few-shot settings (0, 1, 2, and 4 shots).
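The sketch below shows one plausible way a k-shot interleaved prompt could be assembled for such an evaluation: k demonstrations, each an image with its question and answer, followed by the test image and question. The "<image>" placeholder and the exact prompt wording are assumptions, not the evaluation harness used in the paper.

```python
# Hypothetical sketch of k-shot interleaved VQA prompt construction for a multi-image VLM.
from typing import List, Tuple

def build_few_shot_prompt(exemplars: List[Tuple[str, str, str]],
                          test_image: str, test_question: str,
                          image_token: str = "<image>") -> Tuple[str, List[str]]:
    """exemplars: list of (image_path, question, answer) demonstrations.
    Returns (prompt_text, ordered_image_paths)."""
    parts, images = [], []
    for image_path, question, answer in exemplars:
        parts.append(f"{image_token}\nQuestion: {question}\nAnswer: {answer}")
        images.append(image_path)
    # The test sample comes last, with its answer left for the model to complete.
    parts.append(f"{image_token}\nQuestion: {test_question}\nAnswer:")
    images.append(test_image)
    return "\n\n".join(parts), images
```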

Each cell lists OKVQA / TextVQA / MathVista / MathVision / MathVerse.

Dataset | Continual pre-training from Idefics2-8B-base | Pre-training Idefics2-8B from scratch
MMC4-cf | 54.1 / 57.7 / 27.8 / 14.0 / 17.3 | 9.4 / 25.1 / 24.0 / 13.3 / 18.3
OBELICS | 54.6 / 57.5 / 27.6 / 14.3 / 17.5 | 10.5 / 25.7 / 24.2 / 13.6 / 17.7
Textbook-6.5M | 55.1 / 58.2 / 29.7 / 16.2 / 19.4 | 10.1 / 26.8 / 26.1 / 14.4 / 19.8

Besides LLaVA, we also pre-train an advanced VLM with multi-image ability (Idefics2), either via continual pretraining from Idefics2-8B-base or via pre-training from scratch. The evaluations are extended to an 8-shot setting using randomly selected examples, following previous works.

Exploring Multimodal Textbook

We synthesize a knowledge taxonomy with 3,915 knowledge points across 6 subjects, which enables us to automatically collect 159K English instructional videos. Following our video-to-textbook pipeline, we filter out 53% of the videos as low-quality or repetitive and retain 75K videos (22,697 class hours) with an average duration of 18 minutes. We then extract 6.5M keyframes and 0.75B text (ASR + OCR) tokens from these videos, producing a total of 610K interleaved samples. Each sample contains an average of 10.7 keyframes and 1,297 text tokens. The detailed statistics for each subject are as follows:

Subject | #Video | Duration (h) | #Topic | #Video Clip | #Keyframe | #ASR Token | #OCR Token | #Sample
Mathematics | 21.7k | 4,423 | 725 | 809k | 1.67M | 72.5M | 145M | 123k
Physics | 11k | 3,511 | 530 | 822k | 0.95M | 36.7M | 73.4M | 119k
Chemistry | 4.5k | 2,643 | 410 | 234k | 0.49M | 15M | 30M | 32k
Earth Science | 12k | 3,670 | 520 | 640k | 1.03M | 40M | 80M | 88k
Engineering | 13k | 4,096 | 810 | 713k | 1.15M | 43.3M | 86.6M | 98k
Computer Science | 12.8k | 4,354 | 820 | 782k | 1.21M | 42.8M | 85.5M | 150k
All | 75k | 22,697 | 3,915 | 4M | 6.58M | 258M | 500M | 610k

The statistics of our multimodal textbook.