How to prepare PDF datasets from Wikipedia?

July 14, 2025

How to prepare PDF datasets from Wikipedia for research purposes?

For research on LLM memory, a high-quality multi-document dataset is critical. This post explains how to download Wikipedia pages as PDFs for research purposes.

0. Prerequisites: Access to Wikipedia

Run the following command to check that your server can reach Wikipedia:
```
curl https://www.wikipedia.org/
```
If the command fails, another blog post might help you set up Clash on your server.
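
The same check can also be run from Python, for example at the start of a long download job. This is a minimal sketch using only the standard library; the function name `can_reach`, the User-Agent string, and the 10-second timeout are illustrative assumptions, not part of any existing pipeline.

```python
import urllib.error
import urllib.request


def can_reach(url: str = "https://www.wikipedia.org/", timeout: float = 10.0) -> bool:
    """Return True if an HTTP GET to `url` succeeds within `timeout` seconds."""
    # Send an explicit User-Agent; the value here is only a placeholder.
    request = urllib.request.Request(url, headers={"User-Agent": "wiki-pdf-check/0.1"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, TimeoutError, OSError):
        # Covers DNS failures, refused connections, timeouts, and HTTP errors.
        return False
```

Calling `can_reach()` with no arguments checks the Wikipedia front page; if it returns `False`, fix the connection before starting the downloads.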

1. Dataset Creation Pipeline

See the M3DocVQA README for the detailed implementation.

Troubleshooting: if the downloaded PDF is incomplete or corrupted, try the following:

Add browser-like request headers (header disguise)

Wait for the page to finish loading: page.wait_for_load_state("networkidle")

Set up Clash for a more reliable connection
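
The three tips can be combined in one download function. The sketch below assumes the Playwright Python package with Chromium installed; it is not the M3DocVQA implementation. The helper names (`wiki_url`, `looks_like_complete_pdf`, `save_page_as_pdf`), the User-Agent string, and the A4 page format are all hypothetical choices made for illustration.

```python
from pathlib import Path
from typing import Optional
from urllib.parse import quote

# Browser-like identity for the "header disguise"; the exact strings are assumptions.
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)
EXTRA_HEADERS = {"Accept-Language": "en-US,en;q=0.9"}


def wiki_url(title: str) -> str:
    """Build an English Wikipedia URL from a page title (hypothetical helper)."""
    return "https://en.wikipedia.org/wiki/" + quote(title.replace(" ", "_"))


def looks_like_complete_pdf(data: bytes) -> bool:
    """Cheap sanity check: PDF header at the start, %%EOF marker near the end."""
    return data.startswith(b"%PDF-") and b"%%EOF" in data[-1024:]


def save_page_as_pdf(title: str, out_dir: Path, proxy: Optional[str] = None) -> Path:
    """Render one Wikipedia page to PDF with headless Chromium."""
    # Imported here so the helpers above work without Playwright installed.
    from playwright.sync_api import sync_playwright

    out_dir.mkdir(parents=True, exist_ok=True)
    safe_name = title.replace(" ", "_").replace("/", "_")
    out_path = out_dir / f"{safe_name}.pdf"

    with sync_playwright() as p:
        # Tip 3: route traffic through a proxy (such as Clash) when one is given.
        browser = p.chromium.launch(
            headless=True,
            proxy={"server": proxy} if proxy else None,
        )
        # Tip 1: send browser-like headers.
        context = browser.new_context(
            user_agent=USER_AGENT,
            extra_http_headers=EXTRA_HEADERS,
        )
        page = context.new_page()
        page.goto(wiki_url(title))
        # Tip 2: wait until network activity has settled before printing.
        page.wait_for_load_state("networkidle")
        page.pdf(path=str(out_path), format="A4")
        browser.close()

    if not looks_like_complete_pdf(out_path.read_bytes()):
        raise RuntimeError(f"PDF for {title!r} looks incomplete: {out_path}")
    return out_path
```

A call such as `save_page_as_pdf("Alan Turing", Path("pdfs"))` would write `pdfs/Alan_Turing.pdf`. To go through Clash, pass its local HTTP proxy address as `proxy`; the address depends on your Clash configuration. The final check only catches truncated files, so a PDF that passes may still be missing lazily loaded content.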

Now it is done!

Adam Chen
