Article·huggingface.co
embedding · multilingual · open-source · hybrid-retrieval · baai · transformers · pytorch

BGE-M3

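BGE-M3 is an open-source embedding model from BAAI built around three properties: it is multilingual (100+ languages), multi-functional (dense, sparse, and ColBERT-style multi-vector retrieval), and multi-granular (inputs ranging from short sentences to long documents of up to 8192 tokens). This combination makes it a good fit for semantic search and hybrid-retrieval pipelines that span several languages.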

beginner · 15-20 minutes · 7 steps
The play
  1. Install Necessary Libraries
    Install the required libraries, `transformers` and `torch`, for example with `pip install transformers torch`.
  2. Load the BGE-M3 Model
    Load the BGE-M3 model using the `AutoModel` and `AutoTokenizer` classes from the `transformers` library.
  3. Define Input Text
    Define the input text you want to embed. This can be a single sentence or a longer document.
  4. Tokenize the Input
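    Tokenize the input text using the loaded tokenizer. Set `truncation=True` so that inputs longer than the model's maximum length are cut off rather than raising an error, and `return_tensors='pt'` so the result is returned as PyTorch tensors. When embedding several texts in one call, also set `padding=True` so they are padded to a common length.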
  5. Generate Embeddings
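    Pass the tokenized input to the model, ideally inside `torch.no_grad()` since no gradients are needed for inference. The model returns one vector per token in `last_hidden_state`, with shape (batch, sequence length, hidden size).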
  6. Process Embeddings (Optional)
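    The raw output holds one vector per token, so most use cases need a single vector per text. For BGE-M3's dense embedding, take the vector of the first (`[CLS]`) token and L2-normalize it. Note that `AutoModel` loads only the dense backbone; the sparse and ColBERT-style outputs rely on additional heads, which BAAI's `FlagEmbedding` library provides.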
  7. Use the Embeddings
    Use the generated embeddings for downstream tasks like semantic search, clustering, or classification.
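Step 6 can be tried out without downloading the model. The sketch below runs on random stand-in tensors shaped like `last_hidden_state` and an attention mask. The helper names `cls_pool` and `mean_pool` and the tensor sizes are illustrative assumptions, not part of any library. First-token pooling followed by L2 normalization matches how BGE-M3's dense embedding is described; mean pooling is included only for comparison with other embedding models.

```python
import torch
import torch.nn.functional as F


def cls_pool(last_hidden_state: torch.Tensor) -> torch.Tensor:
    """Take the first-token vector: (batch, seq, dim) -> (batch, dim)."""
    return last_hidden_state[:, 0]


def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token vectors while ignoring padded positions."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)  # avoid division by zero
    return summed / counts


# Stand-in tensors (illustrative sizes): 2 sequences, 4 positions, 8 dimensions.
torch.manual_seed(0)
hidden = torch.randn(2, 4, 8)
mask = torch.tensor([[1, 1, 1, 1], [1, 1, 0, 0]])  # second sequence ends in 2 pad tokens

dense = F.normalize(cls_pool(hidden), p=2, dim=-1)
mean = F.normalize(mean_pool(hidden, mask), p=2, dim=-1)

print(dense.shape)          # one vector per sequence
print(dense.norm(dim=-1))   # each vector has unit length after normalization
```

With real model output, `hidden` would be `outputs.last_hidden_state` and `mask` would be `inputs['attention_mask']`.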
Starter code
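from transformers import AutoModel, AutoTokenizer
import torch

model_name = 'BAAI/bge-m3'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()  # inference mode: disables dropout

text = "This is an example sentence in English."
# Truncate over-long inputs and return PyTorch tensors
inputs = tokenizer(text, truncation=True, return_tensors='pt')

# No gradients are needed for inference
with torch.no_grad():
    outputs = model(**inputs)

# Per-token vectors, shape (batch, sequence_length, hidden_size)
token_embeddings = outputs.last_hidden_state
# Dense sentence embedding: first-token ([CLS]) vector, L2-normalized
embeddings = torch.nn.functional.normalize(token_embeddings[:, 0], p=2, dim=-1)

print(embeddings.shape)  # (batch, hidden_size)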
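For step 7, a minimal semantic-search sketch: rank documents by cosine similarity to a query. It assumes you already have one vector per text (for example, the first-token vector of `last_hidden_state`). The function name `rank_by_cosine` and the toy 3-dimensional vectors are hypothetical stand-ins so the example runs without the model.

```python
import torch
import torch.nn.functional as F


def rank_by_cosine(query_vec: torch.Tensor, doc_vecs: torch.Tensor, top_k: int = 3):
    """Return (indices, scores) of the top_k documents most similar to the query."""
    # Normalizing both sides makes the dot product equal to cosine similarity.
    query_vec = F.normalize(query_vec, p=2, dim=-1)
    doc_vecs = F.normalize(doc_vecs, p=2, dim=-1)
    scores = doc_vecs @ query_vec                 # shape: (num_docs,)
    k = min(top_k, doc_vecs.shape[0])
    top = torch.topk(scores, k=k)
    return top.indices.tolist(), top.values.tolist()


# Toy vectors standing in for real document and query embeddings.
docs = torch.tensor([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
])
query = torch.tensor([0.9, 0.1, 0.0])

indices, scores = rank_by_cosine(query, docs, top_k=2)
print(indices)  # document positions, best match first
print(scores)   # matching cosine similarities
```

For large collections, the same scores are usually computed by a vector index rather than a full matrix product, but the ranking logic stays the same.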
Source
BGE-M3 — Action Pack