Common Crawl

Common Crawl provides petabytes of open web crawl data, collected regularly for more than a decade. This free resource is widely used to train and develop Large Language Models (LLMs), letting researchers work with web-scale text without bearing the cost of crawling it themselves.

beginner | 30 min | 4 steps

The play

Understand Common Crawl's Value
Recognize Common Crawl as a key open dataset for AI, particularly for training and fine-tuning Large Language Models. Its scale and free availability lower the barrier to LLM development.
Explore Available Crawl Data
Visit the Common Crawl website at `https://commoncrawl.org/the-data/` to browse the crawl archives and their metadata, and to learn the three file types: WARC (raw HTTP requests and responses), WAT (metadata extracted from the WARC records), and WET (extracted plain text).
Access Data via AWS S3
Use the AWS Command Line Interface (CLI) or another S3-compatible tool to download specific segments or files. The data is hosted in the `commoncrawl` bucket on AWS S3, so you can fetch individual files directly instead of pulling down an entire crawl.
Process Web Archive Files
Use tools and libraries built for WARC, WAT, and WET files (e.g., `warcio` in Python, or `cc-pyspark` for large-scale processing) to extract, filter, and prepare the data for your AI or LLM training tasks.
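To make the last step concrete, here is a minimal sketch of what a WARC-format reader does, written with only the Python standard library. It is not how `warcio` is implemented and is no substitute for it; the function names `iter_warc_records` and `iter_wet_text` are hypothetical helpers introduced for illustration. It assumes well-formed records, each with a `Content-Length` header, and a WET file whose text records have `WARC-Type: conversion`.

```python
import gzip
from typing import BinaryIO, Dict, Iterator, Tuple


def iter_warc_records(stream: BinaryIO) -> Iterator[Tuple[Dict[str, str], bytes]]:
    """Yield (headers, payload) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # skip the blank lines that separate records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")

        # Read "Name: value" header lines until the blank line that ends them.
        headers: Dict[str, str] = {}
        while True:
            line = stream.readline()
            if not line.strip():
                break
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()

        # The payload is exactly Content-Length bytes long.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


def iter_wet_text(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (url, text) for each extracted-text record in a gzipped WET file."""
    # gzip.open reads concatenated gzip members as one continuous stream.
    with gzip.open(path, "rb") as stream:
        for headers, payload in iter_warc_records(stream):
            if headers.get("WARC-Type") == "conversion":
                url = headers.get("WARC-Target-URI", "")
                yield url, payload.decode("utf-8", errors="replace")
```

For real workloads, a maintained library such as `warcio` is the safer choice, since it also handles HTTP headers inside response records, digests, and malformed input that this sketch ignores.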

Starter code

aws s3 cp s3://commoncrawl/crawl-data/CC-MAIN-2023-50/segments/1701830605178.69/warc/CC-MAIN-20231206104321-20231206134321-00000.warc.gz .
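If you prefer not to use the AWS CLI, the same files are also served over HTTPS from `https://data.commoncrawl.org/`. The sketch below maps an `s3://commoncrawl/...` URI to that HTTPS form and derives the matching WAT or WET path from a WARC path, using Common Crawl's naming convention (`/warc/` becomes `/wat/` or `/wet/`, and `.warc.gz` becomes `.warc.wat.gz` or `.warc.wet.gz`). The helper names `s3_to_https`, `sibling_path`, and `download` are hypothetical and introduced only for illustration. The `download` function is a plain standard-library fetch with no retries or rate limiting, so treat it as a starting point.

```python
import shutil
import urllib.request

S3_PREFIX = "s3://commoncrawl/"
HTTPS_PREFIX = "https://data.commoncrawl.org/"


def s3_to_https(s3_uri: str) -> str:
    """Map an s3://commoncrawl/... URI to its HTTPS equivalent."""
    if not s3_uri.startswith(S3_PREFIX):
        raise ValueError(f"not a Common Crawl S3 URI: {s3_uri!r}")
    return HTTPS_PREFIX + s3_uri[len(S3_PREFIX):]


def sibling_path(warc_path: str, kind: str) -> str:
    """Derive the WAT or WET path that corresponds to a WARC path."""
    if kind not in ("wat", "wet"):
        raise ValueError("kind must be 'wat' or 'wet'")
    return (
        warc_path
        .replace("/warc/", f"/{kind}/")
        .replace(".warc.gz", f".warc.{kind}.gz")
    )


def download(url: str, dest: str) -> None:
    """Stream a remote file to disk without loading it all into memory."""
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)


# Example usage (performs a large network download, so it is left commented out):
# warc = "s3://commoncrawl/crawl-data/CC-MAIN-2023-50/segments/1701830605178.69/warc/CC-MAIN-20231206104321-20231206134321-00000.warc.gz"
# download(s3_to_https(sibling_path(warc, "wet")), "sample.warc.wet.gz")
```

WET files are much smaller than the corresponding WARC files, so they are usually the gentler place to start when you only need page text.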

Source

Article: commoncrawl.org