Article · commoncrawl.org
llm, machine-learning, data-pipelines, open-source, research, infrastructure
Common Crawl
Common Crawl provides petabytes of open web crawl data, collected regularly for more than a decade. This free resource is widely used for training and developing Large Language Models (LLMs), letting researchers build AI systems without bearing the cost of crawling the web themselves.
beginner · 30 min · 4 steps
The play
- Understand Common Crawl's Value: Recognize Common Crawl as a critical open dataset for AI, particularly for training and fine-tuning Large Language Models. Its scale and free availability lower the barrier to LLM development.
- Explore Available Crawl Data: Visit the Common Crawl website at `https://commoncrawl.org/the-data/` to browse the crawl archives and their metadata, and to learn the available file types: WARC (raw HTTP responses), WAT (extracted metadata), and WET (extracted plain text).
- Access Data via AWS S3: Common Crawl data is hosted on AWS S3. Use the AWS Command Line Interface (CLI) or an S3-compatible tool to download specific segments or individual files rather than an entire crawl.
- Process Web Archive Files: Use tools and libraries built for WARC, WAT, and WET files (e.g., `warcio` in Python, `cc-pyspark` for large-scale processing) to extract, filter, and prepare the data for your AI or LLM training tasks.
Starter code
aws s3 cp s3://commoncrawl/crawl-data/CC-MAIN-2023-50/segments/1701830605178.69/warc/CC-MAIN-20231206104321-20231206134321-00000.warc.gz .
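The command above copies one file by its full key. As a hedged sketch of how such keys might be enumerated: each crawl publishes gzip-compressed path listings (for example `warc.paths.gz`) with one object key per line. The helper names below are hypothetical, and the HTTPS base URL `https://data.commoncrawl.org/` is assumed to mirror the S3 bucket layout.

```python
import gzip
import urllib.request

# Assumption: public HTTPS endpoint that mirrors the s3://commoncrawl/ layout.
BASE_URL = "https://data.commoncrawl.org/"
S3_PREFIX = "s3://commoncrawl/"


def parse_path_listing(gz_bytes: bytes) -> list[str]:
    """Decompress a *.paths.gz listing and return one object key per non-empty line."""
    text = gzip.decompress(gz_bytes).decode("utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def to_locations(keys, limit=None):
    """Pair each key with its S3 URI and HTTPS URL, keeping at most `limit` keys."""
    selected = keys if limit is None else keys[:limit]
    return [(S3_PREFIX + key, BASE_URL + key) for key in selected]


def fetch_listing(crawl_id: str, kind: str = "warc") -> list[str]:
    """Download and parse the path listing for one crawl (makes a network request)."""
    url = f"{BASE_URL}crawl-data/{crawl_id}/{kind}.paths.gz"
    with urllib.request.urlopen(url) as resp:
        return parse_path_listing(resp.read())
```

For instance, `to_locations(fetch_listing("CC-MAIN-2023-50"), limit=3)` would return the first three WARC files of that crawl as `(s3_uri, https_url)` pairs, each of which can be passed to `aws s3 cp` or fetched over HTTPS. Passing `kind="wet"` or `kind="wat"` selects the other file types.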