How to prepare PDF datasets from Wikipedia?
How to prepare PDF datasets from Wikipedia for research purposes?
For research on LLM memory, a high-quality multi-document dataset is critical. This post explains how to download Wikipedia pages as PDFs for research purposes.
0. Prerequisites: Access to Wikipedia
- Run the following command to check that your server can reach Wikipedia:
curl https://www.wikipedia.org/
- If the command fails, another blog post might help you set up Clash (a proxy client) on your server.
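The same check can also be run from Python, which is handy if the download script itself should stop early when Wikipedia is unreachable. This is only a minimal sketch using the standard library; the function name `can_reach`, the `opener` parameter, and the `USER_AGENT` string are illustrative assumptions, not part of any existing tool.

```python
import urllib.request

# Hypothetical identifier; replace it with one that describes your own project.
USER_AGENT = "pdf-dataset-check/0.1 (research script)"


def can_reach(url="https://www.wikipedia.org/", timeout=10.0,
              opener=urllib.request.urlopen):
    """Return True if `url` answers with an HTTP 2xx status within `timeout` seconds."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    try:
        with opener(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # URLError, HTTPError and timeouts are all subclasses of OSError.
        return False
```

Calling `can_reach()` with no arguments performs the real request; the `opener` argument exists only so the function can be exercised without a network connection.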
1. Dataset Creation Pipeline
- See the M3DocVQA README for implementation details.
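To give a rough idea of what such a pipeline looks like, here is a minimal sketch. It is not the M3DocVQA implementation. It assumes Playwright for Python with headless Chromium (installed with `pip install playwright` and `playwright install chromium`), and the helper names `wiki_url`, `pdf_path`, and `download_pdfs` as well as the `wiki_pdfs` output folder are made up for illustration.

```python
from pathlib import Path
from urllib.parse import quote


def wiki_url(title, lang="en"):
    """Build the article URL for a page title (spaces become underscores)."""
    return f"https://{lang}.wikipedia.org/wiki/{quote(title.replace(' ', '_'))}"


def pdf_path(title, out_dir):
    """Choose a file name for the article inside `out_dir`."""
    safe = title.replace(" ", "_").replace("/", "_")
    return Path(out_dir) / f"{safe}.pdf"


def download_pdfs(titles, out_dir="wiki_pdfs"):
    """Render each article in headless Chromium and save it as a PDF."""
    # Imported here so the helpers above work without Playwright installed.
    from playwright.sync_api import sync_playwright

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    saved = []
    with sync_playwright() as p:
        browser = p.chromium.launch()  # headless by default
        page = browser.new_page()
        for title in titles:
            page.goto(wiki_url(title))
            target = pdf_path(title, out_dir)
            page.pdf(path=str(target), format="A4")  # PDF export is Chromium-only
            saved.append(target)
        browser.close()
    return saved
```

For example, `download_pdfs(["Alan Turing", "Large language model"])` would write one PDF per title into `wiki_pdfs/`.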
Troubleshooting
If the downloaded PDF is incomplete or corrupted, try the following:
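- Send browser-like request headers (for example, a realistic User-Agent) so that the request looks like it comes from a regular browser
- Wait for the page to finish loading before saving it: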
page.wait_for_load_state("networkidle")
- Set up Clash for a more stable connection
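The three tips above can be combined in one place. The sketch below assumes Playwright for Python with Chromium; the header values, the helper names `is_complete_pdf` and `save_page_as_pdf`, and the `proxy_server` parameter are illustrative assumptions. The PDF check is only a cheap heuristic (it looks at the file signature and the end-of-file marker) and does not prove that every page rendered correctly.

```python
from pathlib import Path

# Example values only; any realistic browser User-Agent would do.
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)
EXTRA_HEADERS = {"Accept-Language": "en-US,en;q=0.9"}


def is_complete_pdf(data):
    """Heuristic: the bytes start with %PDF- and contain %%EOF near the end."""
    return data.startswith(b"%PDF-") and b"%%EOF" in data[-1024:]


def save_page_as_pdf(url, target, proxy_server=None):
    """Load `url` with browser-like headers, wait for the network to go idle, save a PDF."""
    from playwright.sync_api import sync_playwright  # lazy import

    # proxy_server is optional, e.g. "http://127.0.0.1:7890" if a local Clash
    # instance listens there (assumed port; use whatever your proxy is set to).
    proxy = {"server": proxy_server} if proxy_server else None
    with sync_playwright() as p:
        browser = p.chromium.launch(proxy=proxy)
        context = browser.new_context(
            user_agent=USER_AGENT, extra_http_headers=EXTRA_HEADERS
        )
        page = context.new_page()
        page.goto(url)
        page.wait_for_load_state("networkidle")
        page.pdf(path=str(target))
        browser.close()
    return is_complete_pdf(Path(target).read_bytes())
```

If `save_page_as_pdf` returns `False`, the file is worth downloading again before it goes into the dataset.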
And that's it!