Optimizing Korean-Centric LLMs via Token Pruning

Adapt a multilingual LLM to Korean by pruning its vocabulary. The technique removes tokens that belong to other languages, together with their embedding parameters, which shrinks the model's parameter count and memory footprint. The goal is a specialized, efficient model for a single-language market: lower inference cost, with Korean task performance that matches or improves on the original once the pruned model is fine-tuned.

advanced · 2 hours · 5 steps

The play

Identify Target Language and Model
Choose a pre-trained multilingual Large Language Model (LLM) and define your specific target language (e.g., Korean). This technique is most effective when adapting a broad model to a narrow linguistic scope.
Extract Vocabulary and Embeddings
Access the LLM's tokenizer vocabulary and its embedding layer. These components contain tokens and their numerical representations for all languages the model was trained on.
Determine Irrelevant Tokens
Analyze the vocabulary to identify tokens that are not relevant to your target language. Options include comparing against a comprehensive target-language lexicon, running a language identification tool, or checking character ranges for scripts such as Hangul. Keep the special tokens, and consider keeping digits, punctuation, and whitespace tokens, since Korean text still relies on them.
Prune Model Components
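Create a new, pruned vocabulary that excludes the irrelevant tokens, and remove the corresponding rows from the embedding layer. Kept tokens receive new, contiguous IDs, so the tokenizer must be updated to match, and any output layer that projects onto the vocabulary needs the same rows removed. This directly reduces the model's parameter count and memory footprint.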
Validate and Fine-Tune
Integrate the pruned vocabulary and embeddings back into the LLM architecture. Fine-tune the pruned model on Korean-centric datasets, then benchmark it against the original model to check whether speed, size, and task accuracy actually improve.
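
The token-selection and ID-remapping steps above can be sketched in plain Python. This is a minimal illustration, not the paper's method: the helper names (`is_hangul_char`, `keep_token`, `build_id_map`) are hypothetical, the rule "keep any token without alphabetic characters" is an assumption about which script-neutral tokens are worth keeping, and the sketch assumes vocabulary entries are readable strings rather than byte-level encodings.

```python
# Standard Unicode blocks that contain Hangul characters.
HANGUL_RANGES = (
    (0x1100, 0x11FF),  # Hangul Jamo
    (0x3130, 0x318F),  # Hangul Compatibility Jamo
    (0xAC00, 0xD7A3),  # Hangul Syllables
)


def is_hangul_char(ch):
    """True if the character falls inside one of the Hangul blocks above."""
    code = ord(ch)
    return any(lo <= code <= hi for lo, hi in HANGUL_RANGES)


def keep_token(token, special_tokens):
    """Keep special tokens, tokens containing Hangul, and script-neutral tokens.

    "Script-neutral" here means the token has no alphabetic characters at all
    (digits, punctuation, whitespace). This rule is an assumption for the sketch.
    """
    if token in special_tokens:
        return True
    if any(is_hangul_char(ch) for ch in token):
        return True
    return not any(ch.isalpha() for ch in token)


def build_id_map(vocab, special_tokens):
    """Return (kept_old_ids, old_to_new) for the tokens that survive pruning."""
    kept_old_ids = [i for i, tok in enumerate(vocab) if keep_token(tok, special_tokens)]
    old_to_new = {old: new for new, old in enumerate(kept_old_ids)}
    return kept_old_ids, old_to_new


# Toy vocabulary standing in for a real tokenizer's vocabulary.
vocab = ["<unk>", "hello", "안녕하세요", "ㅋㅋ", "123", ".", "bonjour"]
kept, old_to_new = build_id_map(vocab, special_tokens={"<unk>"})
print([vocab[i] for i in kept])
print(old_to_new)
```

On this toy vocabulary the sketch keeps the special token, both Hangul tokens, the digits, and the period, and drops "hello" and "bonjour". Whether to drop English as well is a design choice: many Korean-centric deployments still see Latin-script input, in which case the filter would need an extra allowance for it.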

Starter code

import torch

# Simulate a tokenizer's vocabulary and embeddings for a multilingual LLM
# In a real scenario, these would come from a pre-trained model (e.g., Hugging Face Transformers)
multilingual_vocab = ["<unk>", "hello", "world", "안녕하세요", "세계", "bonjour", "monde"]
multilingual_embeddings = torch.randn(len(multilingual_vocab), 768) # Example embedding dimension

# Define a hypothetical function to check if a token is Korean
# In a real scenario, this would be more sophisticated (e.g., regex, char range check)
def is_korean(token):
    # Simple check for common Korean characters (Hangul)
    for char_code in map(ord, token):
        if 0xAC00 <= char_code <= 0xD7A3:  # Hangul Syllables
            return True
    return False

# Step 1: Identify target language (Korean) - implicit here
# Step 2: Extract vocabulary (done)

# Step 3: Determine and filter relevant tokens for Korean
korean_tokens = []
korean_token_indices = []
for i, token in enumerate(multilingual_vocab):
    if is_korean(token) or token in ["<unk>"]: # Keep special tokens and Korean tokens
        korean_tokens.append(token)
        korean_token_indices.append(i)

# Step 4: Prune embeddings
pruned_embeddings = multilingual_embeddings[korean_token_indices]

print(f"Original vocabulary size: {len(multilingual_vocab)}")
print(f"Pruned Korean vocabulary size: {len(korean_tokens)}")
print(f"Original embedding shape: {multilingual_embeddings.shape}")
print(f"Pruned embedding shape: {pruned_embeddings.shape}")

# In a real application, you would then build a new tokenizer from korean_tokens (with remapped token IDs)
# and replace the model's embedding layer with pruned_embeddings.
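
As a follow-up to the starter code, the sketch below shows one way the pruned rows might be installed into PyTorch modules and how old token IDs could be translated to the new ones. The function names `prune_embedding` and `remap_ids` are hypothetical, the toy `nn.Embedding` and `nn.Linear` stand in for a real model's layers, and tying the output projection to the input embedding is an assumption about the architecture rather than something the source specifies.

```python
import torch
import torch.nn as nn


def prune_embedding(embedding, kept_old_ids):
    """Return a new nn.Embedding holding only the rows listed in kept_old_ids."""
    index = torch.tensor(kept_old_ids, dtype=torch.long)
    pruned_weight = embedding.weight.detach()[index].clone()
    # freeze=False keeps the pruned rows trainable for later fine-tuning.
    return nn.Embedding.from_pretrained(pruned_weight, freeze=False)


def remap_ids(token_ids, old_to_new, unk_new_id):
    """Translate old token IDs to new ones; pruned IDs fall back to <unk>."""
    return [old_to_new.get(i, unk_new_id) for i in token_ids]


# Toy stand-ins for a real model's input embedding.
old_vocab_size, hidden = 7, 16
embedding = nn.Embedding(old_vocab_size, hidden)

# Indices kept by the starter code: "<unk>", "안녕하세요", "세계".
kept_old_ids = [0, 3, 4]
old_to_new = {old: new for new, old in enumerate(kept_old_ids)}

new_embedding = prune_embedding(embedding, kept_old_ids)

# An output projection over the vocabulary needs the same rows removed.
# Here it shares weights with the input embedding (assumed weight tying).
lm_head = nn.Linear(hidden, len(kept_old_ids), bias=False)
lm_head.weight = new_embedding.weight

saved = (old_vocab_size - len(kept_old_ids)) * hidden
print(f"Embedding rows: {old_vocab_size} -> {new_embedding.num_embeddings}")
print(f"Parameters removed from the embedding matrix: {saved}")
print(remap_ids([3, 1, 4], old_to_new, unk_new_id=0))
```

The parameter saving is simply the number of removed rows times the hidden size, so the benefit grows with both the share of the vocabulary that is pruned and the model's embedding width. If a model does not tie its input and output embeddings, the output matrix would have to be pruned separately with the same index list.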

Source

Paper: arxiv.org