BGE-M3

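BGE-M3 is an open-source embedding model from BAAI built around three properties: it is multilingual (100+ languages), multi-functional (dense, sparse, and ColBERT-style multi-vector retrieval from one model), and multi-granularity (inputs ranging from short sentences to documents of up to 8192 tokens). That combination makes it a good fit for semantic search, clustering, and classification across languages.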

Beginner · 15-20 minutes · 7 steps

The play

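1. Install the Necessary Libraries
Install the required libraries, `transformers` and `torch`, for example with `pip install transformers torch`.
2. Load the BGE-M3 Model
Load the model and its tokenizer with the `AutoModel` and `AutoTokenizer` classes from the `transformers` library, using the model ID `BAAI/bge-m3`. Loaded this way, the model exposes its dense token representations; the sparse and ColBERT-style outputs rely on extra heads that `AutoModel` does not load.
3. Define the Input Text
Define the input text you want to embed. This can be a single sentence or a longer document.
4. Tokenize the Input
Tokenize the input text with the loaded tokenizer. Set `truncation=True` so that inputs longer than the model's maximum length are cut off, and `return_tensors='pt'` so that the result is returned as PyTorch tensors. When tokenizing several texts at once, also set `padding=True`.
5. Generate Embeddings
Pass the tokenized input to the model. The output's `last_hidden_state` holds one vector per token, with shape (batch size, sequence length, hidden size).
6. Process the Embeddings (Optional)
Most downstream tasks need a single vector per text rather than one per token. Pool the token vectors (for example, take the first token's vector or average over all tokens), then L2-normalize the result so that a dot product between two embeddings equals their cosine similarity.
7. Use the Embeddings
Use the resulting embeddings for downstream tasks such as semantic search, clustering, or classification.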
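
The pooling and normalization step can be sketched as follows. This is a minimal illustration, not an official recipe: it assumes first-token ([CLS]) pooling, which is my understanding of how BGE-M3's dense embedding is usually taken, and the helper name `cls_pool_and_normalize` is hypothetical. A random tensor stands in for the model's `last_hidden_state` so the snippet runs without downloading the model.

```python
import torch
import torch.nn.functional as F


def cls_pool_and_normalize(last_hidden_state: torch.Tensor) -> torch.Tensor:
    """Reduce (batch, seq_len, hidden) token states to one unit-length vector per input."""
    # Take the hidden state of the first token of every sequence
    # (assumption: first-token pooling is the right choice for this model).
    cls_vectors = last_hidden_state[:, 0]
    # L2-normalize so a dot product between two vectors equals cosine similarity.
    return F.normalize(cls_vectors, p=2, dim=-1)


# Stand-in for the model output: 2 inputs, 6 tokens each, hidden size 1024.
torch.manual_seed(0)
fake_hidden = torch.randn(2, 6, 1024)

sentence_embeddings = cls_pool_and_normalize(fake_hidden)
print(sentence_embeddings.shape)  # torch.Size([2, 1024])
```

With the real model, pass `outputs.last_hidden_state` in place of `fake_hidden`. If you prefer mean pooling, average over the token axis while masking out padding positions with the tokenizer's `attention_mask`.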

Starter code

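from transformers import AutoModel, AutoTokenizer
import torch

# Load the model weights and tokenizer from the Hugging Face Hub.
model_name = 'BAAI/bge-m3'
model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

text = "This is an example sentence in English."
inputs = tokenizer(text, truncation=True, return_tensors='pt')

# Gradients are not needed for inference; skipping them saves memory and time.
with torch.no_grad():
    outputs = model(**inputs)

# One vector per token: shape (batch_size, sequence_length, hidden_size).
embeddings = outputs.last_hidden_state

print(embeddings.shape)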
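
For the semantic search use case, a rough sketch of ranking documents against a query might look like the following. It assumes each text has already been reduced to a single vector (for example by pooling `last_hidden_state`); the function name `rank_by_similarity` and the toy 3-dimensional vectors are purely illustrative and stand in for real embeddings.

```python
import torch
import torch.nn.functional as F


def rank_by_similarity(query_vec: torch.Tensor, doc_vecs: torch.Tensor) -> list[tuple[int, float]]:
    """Return (document index, cosine similarity) pairs, best match first."""
    # Normalize defensively so the dot product is a cosine similarity.
    q = F.normalize(query_vec, dim=-1)
    d = F.normalize(doc_vecs, dim=-1)
    scores = d @ q  # shape: (num_docs,)
    order = torch.argsort(scores, descending=True)
    return [(int(i), float(scores[i])) for i in order]


# Toy vectors standing in for real sentence embeddings.
query = torch.tensor([1.0, 0.0, 0.0])
docs = torch.tensor([
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 0.0],  # same direction as the query
    [1.0, 1.0, 0.0],  # 45 degrees from the query
])

ranking = rank_by_similarity(query, docs)
print([idx for idx, _ in ranking])  # [1, 2, 0]
```

For larger collections, a brute-force matrix product like this is usually replaced by a vector index, but the scoring idea stays the same.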

Source

Article: huggingface.co