embeddings, batch-processing, data-pipeline, openai-api, rag, semantic-search, python-script, automation

Generate Embeddings at Scale with Batch Processing

The Batch Embedding Generation script creates embeddings for large text datasets. It handles batching, rate limits, and errors, and supports providers such as OpenAI, which automates the embedding step of a RAG or semantic search pipeline.

intermediate · 30 min · 5 steps
The play
  1. Setup and Installation
    Download the script from its source URL. Then, install the necessary Python libraries for the provider you plan to use (e.g., OpenAI) and for data handling. This prepares your environment to run the tool.
  2. Prepare Your Input Data
    Create a CSV file with a header row. The Batch Embedding Generation script needs a column containing the text you want to embed. A unique ID column is also recommended. For this example, create a file named `sample_data.csv`, the name the starter code uses.
  3. Run a Basic Embedding Job
    Set your provider's API key as an environment variable (`OPENAI_API_KEY` for OpenAI), then execute the script from your terminal. You must provide the input file (`--input-file`), the output file path (`--output-file`), and the name of the column containing the text (`--text-column`).
  4. Use Checkpointing for Reliability
    Jobs over large datasets can be interrupted. Use the `--checkpoint-file` flag to save progress. If the script stops, the Batch Embedding Generation tool will resume after the last completed batch on the next run, so batches that were already embedded are not sent to the provider again. This saves time and money.
  5. Inspect the Output File
    The script produces a JSON Lines (.jsonl) file. Each line is a JSON object containing the data from the original row plus a new 'embedding' key with the vector array. This format is ready to be loaded into a vector database.
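As a quick sanity check for step 5, the sketch below parses JSON Lines text and confirms that every row carries an 'embedding' list of the same length. It is a minimal sketch, not part of the script: the function names are hypothetical, and the sample rows use made-up three-element vectors with keys mirroring the sample CSV columns (`id`, `text_content`).

```python
import json


def parse_embedding_lines(lines):
    """Parse JSON Lines text into row dicts, checking each has an 'embedding' list."""
    rows = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        row = json.loads(line)
        if not isinstance(row.get("embedding"), list):
            raise ValueError(f"line {line_number}: no 'embedding' list found")
        rows.append(row)
    return rows


def embedding_dimensions(rows):
    """Return the set of distinct vector lengths; a healthy file has exactly one."""
    return {len(row["embedding"]) for row in rows}


# Made-up sample lines with tiny vectors, for illustration only.
sample_lines = [
    '{"id": "doc-001", "text_content": "Generative AI is transforming industries.", "embedding": [0.1, 0.2, 0.3]}',
    '{"id": "doc-002", "text_content": "Embeddings represent text as numerical vectors.", "embedding": [0.4, 0.5, 0.6]}',
]

rows = parse_embedding_lines(sample_lines)
print(len(rows), "rows, dimensions:", embedding_dimensions(rows))
```

To check a real output file, pass an open file handle instead of the sample list, for example `with open("sample_embeddings.jsonl", encoding="utf-8") as f: rows = parse_embedding_lines(f)`. More than one value in the dimension set would suggest rows embedded with different models.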
Starter code
import os
import subprocess
import textwrap

# 1. Assume you have downloaded the script as 'generate_embeddings.py'
# This starter script will not run without it.

# 2. Create a sample CSV data file
csv_data = textwrap.dedent("""
    id,text_content
    doc-001,"Generative AI is transforming industries."
    doc-002,"Embeddings represent text as numerical vectors."
    doc-003,"Batch processing is efficient for large datasets."
    doc-004,"Checkpointing ensures long-running jobs can be resumed."
""")

input_filename = "sample_data.csv"
with open(input_filename, "w", encoding="utf-8") as f:
    f.write(csv_data.strip() + "\n")

print(f"Created '{input_filename}' for processing.")

# 3. Check that your API key is set as an environment variable.
# The child process started below inherits this environment.
api_key = os.environ.get("OPENAI_API_KEY")
if not api_key:
    print("\nWARNING: OPENAI_API_KEY environment variable not set.")
    print("The script will fail. Please set it before running.")
    # Set it in your shell first, for example:
    # export OPENAI_API_KEY="sk-YOUR_KEY_HERE"

# 4. Construct and run the command for the Batch Embedding Generation script
output_filename = "sample_embeddings.jsonl"
checkpoint_filename = "sample.chk"

command = [
    "python",
    "generate_embeddings.py",
    "--input-file", input_filename,
    "--output-file", output_filename,
    "--text-column", "text_content",
    "--provider", "openai",
    "--model", "text-embedding-3-small",
    "--checkpoint-file", checkpoint_filename,
    "--batch-size", "2"  # Use a small batch size for demonstration
]

print("\nRunning command:")
print(' '.join(command))

# Execute the command. This requires 'generate_embeddings.py' to be in the same directory.
try:
    # We capture output to show it, but the script writes directly to the file.
    result = subprocess.run(command, check=True, capture_output=True, text=True)
    print("\nScript output (stdout):")
    print(result.stdout)
    print("\nSuccessfully generated embeddings.")
    print(f"Check the output file: '{output_filename}'")
    print(f"Check the checkpoint file: '{checkpoint_filename}'")
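except FileNotFoundError:
    # Raised when the 'python' executable cannot be launched. A missing
    # 'generate_embeddings.py' surfaces as CalledProcessError instead.
    print("\nERROR: 'python' interpreter not found on PATH.")
    print("Also confirm 'generate_embeddings.py' is in the current directory.")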
except subprocess.CalledProcessError as e:
    print("\nERROR: Script execution failed.")
    print("Return code:", e.returncode)
    print("Stdout:", e.stdout)
    print("Stderr:", e.stderr)
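To make the `--checkpoint-file` and `--batch-size` flags more concrete, here is a minimal sketch of how batch-level checkpointing could work. It is not the script's actual implementation: the checkpoint format (a JSON file with a `completed_batches` counter), the function names, and the `flaky_embed` stand-in for a provider call are all hypothetical. The demo simulates an interruption on the second batch and then resumes.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the number of batches already completed (0 if no checkpoint exists)."""
    if not os.path.exists(path):
        return 0
    with open(path, encoding="utf-8") as f:
        return json.load(f)["completed_batches"]


def save_checkpoint(path, completed_batches):
    """Write to a temp file, then rename, so a crash cannot leave a partial checkpoint."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump({"completed_batches": completed_batches}, f)
    os.replace(tmp_path, path)


def run_job(texts, batch_size, checkpoint_path, embed_batch, output_path):
    """Embed texts in batches, skipping batches the checkpoint marks as done."""
    done = load_checkpoint(checkpoint_path)
    batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]
    with open(output_path, "a", encoding="utf-8") as out:
        for index, batch in enumerate(batches):
            if index < done:
                continue  # embedded in an earlier run
            vectors = embed_batch(batch)
            for text, vector in zip(batch, vectors):
                out.write(json.dumps({"text": text, "embedding": vector}) + "\n")
            out.flush()
            # Record progress only after the batch's rows are written.
            save_checkpoint(checkpoint_path, index + 1)


# Demo: the hypothetical embedder fails on its second call, then the job resumes.
texts = ["a", "bb", "ccc", "dddd"]
with tempfile.TemporaryDirectory() as tmp:
    chk = os.path.join(tmp, "sample.chk")
    out_path = os.path.join(tmp, "out.jsonl")
    state = {"calls": 0}

    def flaky_embed(batch):
        """Stand-in for a provider call; returns one made-up vector per text."""
        state["calls"] += 1
        if state["calls"] == 2:
            raise RuntimeError("simulated interruption")
        return [[float(len(text))] for text in batch]

    try:
        run_job(texts, 2, chk, flaky_embed, out_path)
    except RuntimeError:
        pass  # first run stops after one completed batch
    first_run_done = load_checkpoint(chk)

    run_job(texts, 2, chk, flaky_embed, out_path)  # resumes at the second batch
    with open(out_path, encoding="utf-8") as f:
        written = [json.loads(line) for line in f]

print("batches done after first run:", first_run_done)
print("rows written in total:", len(written))
```

Saving the checkpoint only after a batch's rows are flushed means an interruption can at worst repeat one batch. It never skips one.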