Article·blog.cloudflare.com
devops · machine-learning · data-pipelines · infrastructure · open-source · mcp · evaluation · deployment · git · git-lfs · dvc

Artifacts: Versioned storage that speaks Git

Implement Git-native versioning for AI models, datasets, and configurations. Versioning large non-code assets alongside the code that uses them makes results reproducible and traceable, and works around Git's poor handling of large binary files.

intermediate · 30 min · 7 steps
The play
  1. Identify Large Artifacts
    Pinpoint large AI assets (models, datasets, experiment configurations) that require version control beyond standard Git capabilities.
  2. Select a Versioning Tool
    Choose a Git-integrated tool specifically designed for large files and data, such as DVC (Data Version Control) or Git LFS (Large File Storage).
  3. Initialize Repository
    Set up your chosen tool within your existing Git repository. For DVC, navigate to your project root and run `dvc init`.
  4. Track Artifacts
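    Use the tool's commands to start tracking your large AI files or directories. This creates small metadata files that Git *will* track, while the large files themselves are excluded from Git. For DVC, run `dvc add` on each file or directory (for example, `dvc add data/my_dataset.csv`).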
  5. Commit Metadata to Git
    Add and commit the generated metadata files (e.g., `.dvc` files, plus the `.gitignore` entries DVC writes) to your standard Git repository. Each `.dvc` file records a content hash that links the commit to the actual large artifact.
  6. Push Artifacts and Git History
    Push your Git repository changes as usual, and then push the actual large artifacts to their configured remote storage (e.g., S3, GCS, Azure Blob). With DVC this is a separate `dvc push`; Git LFS uploads tracked files as part of `git push`.
  7. Reproduce Specific Versions
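    To retrieve a specific version of your artifacts, check out the corresponding Git commit. Git LFS fetches the matching large files during checkout. With DVC, follow the checkout with `dvc pull` (or `dvc checkout` if the files are already in the local cache) to sync the workspace, unless you have installed DVC's Git hooks with `dvc install`.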
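Step 1 can be approximated with a quick size scan. The sketch below is one possible approach and is not part of the original play: `find_large_files` is a hypothetical helper, and the 50 MB default threshold is an arbitrary assumption to adjust for your repository. It relies on the `k`/`M` size suffixes accepted by GNU and BSD `find`.

```shell
# find_large_files DIR [MIN_SIZE]
# List files under DIR larger than MIN_SIZE (a find(1) size such as 50M or 1k),
# skipping Git's and DVC's internal directories. Hypothetical helper.
find_large_files() {
  dir="${1:-.}"
  min_size="${2:-50M}"
  find "$dir" \( -name .git -o -name .dvc \) -prune -o \
    -type f -size "+$min_size" -print
}

# Example: list candidates for DVC or Git LFS tracking in the current project.
# find_large_files . 50M
```

Files that show up here and change between experiments (model weights, datasets) are the usual candidates for step 4.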
Starter code
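# Initialize DVC inside an existing Git repository
dvc init
# Track the dataset; DVC writes data/my_dataset.csv.dvc and updates data/.gitignore
dvc add data/my_dataset.csv
git add data/my_dataset.csv.dvc data/.gitignore .dvcignore
git commit -m "Added initial dataset with DVC"
# Configure a default DVC remote before pushing (e.g., for S3); dvc push fails without one
# dvc remote add -d myremote s3://your-bucket/your-project
dvc push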
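The starter code ends at `dvc push`. Step 7 might then look like the sketch below, which assumes the revision exists in Git and that a DVC remote is already configured. `restore_version`, `run`, and the `DRY_RUN` switch are hypothetical helpers introduced here so the commands can be previewed before they touch the workspace; they are not part of DVC or Git.

```shell
# run CMD...: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# restore_version REV: check out a Git revision, then sync DVC-tracked files to it.
restore_version() {
  rev="$1"
  # Git restores the small .dvc metadata files recorded at that revision.
  run git checkout "$rev" || return 1
  # DVC downloads any missing artifacts from the remote and places them in the workspace.
  run dvc pull
}

# Example: preview the commands for a hypothetical tag named v1.0.
# DRY_RUN=1 restore_version v1.0
```

If the artifacts are already in the local DVC cache, `dvc checkout` in place of `dvc pull` restores them without network access.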
Source
Artifacts: Versioned storage that speaks Git — Action Pack