How to prepare PDF datasets from Wikipedia for research purposes

For research on LLM memory, a high-quality multi-document dataset is critical. This post explains how to download Wikipedia pages as PDFs for research purposes.

0. Prerequisites: Access to Wikipedia

  • Run the following command to verify that your server can reach Wikipedia:
    curl https://www.wikipedia.org/
    
  • If the command fails, another blog post might help you set up Clash on your server.
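
If you would rather run the same check from inside a Python script, the sketch below mirrors the curl command using only the standard library. The function name `can_reach`, the timeout, and the User-Agent string are illustrative assumptions, not part of any existing tool.

```python
import urllib.request


def can_reach(url: str = "https://www.wikipedia.org/", timeout: float = 10.0) -> bool:
    """Return True if `url` answers with a success or redirect status."""
    # Assumption: a descriptive User-Agent; the exact value is hypothetical.
    request = urllib.request.Request(
        url, headers={"User-Agent": "wiki-pdf-dataset-check/0.1"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except OSError:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return False


if __name__ == "__main__":
    print("Wikipedia reachable:", can_reach())
```

If this prints `False`, the server most likely needs a proxy before the download step will work.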

1. Dataset Creation Pipeline

Troubleshooting

If the downloaded PDF is incomplete or corrupted, try the following:

  • Send browser-like request headers (for example, a realistic User-Agent)
  • Wait for the page to finish loading: page.wait_for_load_state("networkidle")
  • Set up Clash for a more stable connection
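
The three tips above can be combined in one download function. This is a minimal sketch, assuming Playwright for Python with headless Chromium (the only engine where `page.pdf()` is supported). The helper names (`wiki_url`, `looks_complete`, `save_page_as_pdf`), the User-Agent string, the English Wikipedia URL pattern, and the output directory are assumptions made for illustration, not the original pipeline.

```python
from pathlib import Path
from typing import Optional
from urllib.parse import quote

# Assumption: any realistic desktop browser User-Agent should do.
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)


def wiki_url(title: str) -> str:
    """Build an English Wikipedia URL from an article title."""
    return "https://en.wikipedia.org/wiki/" + quote(title.replace(" ", "_"))


def looks_complete(pdf_bytes: bytes) -> bool:
    """Cheap sanity check: PDF header at the start, %%EOF marker near the end."""
    return pdf_bytes.startswith(b"%PDF-") and b"%%EOF" in pdf_bytes[-1024:]


def save_page_as_pdf(
    title: str, out_dir: str = "pdfs", proxy: Optional[str] = None
) -> Path:
    """Render one article to PDF, applying the troubleshooting tips."""
    # Imported lazily so the helpers above work without Playwright installed.
    from playwright.sync_api import sync_playwright

    out_path = Path(out_dir) / (title.replace(" ", "_").replace("/", "_") + ".pdf")
    out_path.parent.mkdir(parents=True, exist_ok=True)

    launch_kwargs = {"headless": True}
    if proxy:
        # Tip 3: route traffic through a local proxy such as Clash.
        launch_kwargs["proxy"] = {"server": proxy}

    with sync_playwright() as p:
        browser = p.chromium.launch(**launch_kwargs)
        # Tip 1: present browser-like headers.
        context = browser.new_context(
            user_agent=USER_AGENT,
            extra_http_headers={"Accept-Language": "en-US,en;q=0.9"},
        )
        page = context.new_page()
        page.goto(wiki_url(title))
        # Tip 2: wait until network activity has settled before printing.
        page.wait_for_load_state("networkidle")
        pdf_bytes = page.pdf(path=str(out_path), format="A4")
        browser.close()

    if not looks_complete(pdf_bytes):
        raise RuntimeError(f"PDF for {title!r} looks truncated or corrupted")
    return out_path
```

A call might look like `save_page_as_pdf("Large language model", proxy="http://127.0.0.1:7890")`. The port 7890 is an assumption here; use whatever port your Clash instance listens on, or omit `proxy` if the server reaches Wikipedia directly.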

Now it is done!