Article·huggingface.co
embedding · multilingual · open-source · hybrid-retrieval · baai · transformers · pytorch
BGE-M3
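BGE-M3 is an open-source embedding model from BAAI built around three properties: multi-functionality, multi-linguality, and multi-granularity. A single model supports dense, sparse (lexical), and ColBERT-style multi-vector retrieval across more than 100 languages, and it accepts inputs ranging from short sentences to long documents of up to 8192 tokens. This makes it a good fit for multilingual search and hybrid retrieval pipelines.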
beginner · 15-20 minutes · 7 steps
The play
- Install Necessary Libraries: Install the required libraries, `transformers` and `torch` (for example, `pip install transformers torch`).
- Load the BGE-M3 Model: Load the model and its tokenizer with the `AutoModel` and `AutoTokenizer` classes from the `transformers` library.
- Define Input Text: Define the text you want to embed. This can be a single sentence or a longer document.
- Tokenize the Input: Tokenize the text with the loaded tokenizer. Set `truncation=True` so that inputs longer than the model's maximum length are cut off, and `return_tensors='pt'` so that the tokenizer returns PyTorch tensors.
- Generate Embeddings: Pass the tokenized input to the model. The output's `last_hidden_state` holds one vector per token.
- Process Embeddings (Optional): Depending on your use case, reduce the per-token vectors to a single vector per text (pooling) and scale it to unit length (normalization).
- Use the Embeddings: Use the resulting embeddings for downstream tasks such as semantic search, clustering, or classification.
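
The last step can be illustrated without loading the model at all. The sketch below ranks documents against a query by cosine similarity, which is one common way to do semantic search over dense embeddings. The helper names (`l2_normalize`, `cosine_similarity`, `rank_documents`) and the toy 3-dimensional vectors are hypothetical stand-ins for illustration; real BGE-M3 vectors would come from the model.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]

def cosine_similarity(a, b):
    """Cosine similarity: the dot product of the two unit-length vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))

def rank_documents(query_vec, doc_vecs):
    """Return (doc_index, score) pairs, most similar document first."""
    scores = [(i, cosine_similarity(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Toy vectors standing in for real embeddings (assumption: illustration only)
query = [1.0, 0.0, 0.0]
docs = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 0.0],  # same direction as the query, different length
    [1.0, 1.0, 0.0],  # partially aligned with the query
]

ranking = rank_documents(query, docs)
print([index for index, _ in ranking])
```

Because cosine similarity ignores vector length, the second document ranks first even though its magnitude differs from the query's. If embeddings are normalized once up front, the similarity reduces to a plain dot product.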
Starter code

    from transformers import AutoModel, AutoTokenizer
    import torch

    model_name = 'BAAI/bge-m3'
    model = AutoModel.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model.eval()  # inference mode: disables dropout

    text = "This is an example sentence in English."
    inputs = tokenizer(text, truncation=True, return_tensors='pt')

    # Gradients are not needed for inference
    with torch.no_grad():
        outputs = model(**inputs)

    # Per-token hidden states, shape (batch, seq_len, hidden_size)
    embeddings = outputs.last_hidden_state
    print(embeddings.shape)

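
The starter code stops at `last_hidden_state`, which holds one vector per token rather than one vector per text. A minimal sketch of the optional processing step follows, assuming CLS pooling (taking the first token's vector) followed by L2 normalization. That is a common convention for BGE-style dense embeddings, but treat it as an assumption and check the model card for your use case. The function name `cls_pool_and_normalize` is hypothetical, and a random tensor stands in for the real model output so that the example runs without downloading weights.

```python
import torch
import torch.nn.functional as F

def cls_pool_and_normalize(last_hidden_state: torch.Tensor) -> torch.Tensor:
    """Reduce (batch, seq_len, hidden) token states to (batch, hidden) unit vectors."""
    # CLS pooling: keep only the first token's hidden state for each sequence
    cls_vectors = last_hidden_state[:, 0]
    # L2-normalize so that a dot product equals cosine similarity
    return F.normalize(cls_vectors, p=2, dim=-1)

torch.manual_seed(0)
# Stand-in for outputs.last_hidden_state: 2 texts, 7 tokens each, hidden size 1024
# (the sizes are illustrative assumptions)
dummy_hidden = torch.randn(2, 7, 1024)

sentence_embeddings = cls_pool_and_normalize(dummy_hidden)
print(sentence_embeddings.shape)  # torch.Size([2, 1024])

# Pairwise cosine similarity between the pooled, normalized vectors
similarity = sentence_embeddings @ sentence_embeddings.T
print(similarity.shape)  # torch.Size([2, 2])
```

With the real model, `dummy_hidden` would be replaced by `outputs.last_hidden_state`. Mean pooling over non-padding tokens is another common choice; which one suits a given model depends on how it was trained.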