Generate Graph Embeddings with PyTorch Geometric

Use PyTorch Geometric (PyG) to generate node embeddings from a graph. This guide shows how to train a Node2Vec model on a sample dataset and use the resulting embeddings to predict new links in the graph, a common knowledge graph task.

Intermediate | 30 min | 5 steps

The play

Install PyTorch and PyG
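Set up your environment by installing PyTorch and the core PyTorch Geometric library. PyG includes the Node2Vec model and common graph datasets. Depending on your PyG version, Node2Vec's random-walk sampler also needs one of PyG's optional extension packages (pyg-lib or torch-cluster), and the starter code uses scikit-learn to compute AUC.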
Load a Graph Dataset
PyG provides easy access to benchmark datasets. We'll load the 'Cora' citation network, a standard dataset for graph learning tasks. The 'data' object will contain the graph's structure (edges) and node features.
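PyG stores a graph's edges in an 'edge_index' tensor of shape [2, num_edges], where an undirected edge appears once per direction. The sketch below mimics that layout with plain Python lists on a made-up four-node graph, so it runs without downloading Cora. The helper name to_edge_index is hypothetical and not part of PyG.

```python
def to_edge_index(undirected_edges):
    """Return [sources, targets], storing each undirected edge in both directions."""
    sources, targets = [], []
    for u, v in undirected_edges:
        sources += [u, v]
        targets += [v, u]
    return [sources, targets]

# Toy path graph 0-1-2-3 (made-up data, not Cora).
edge_index = to_edge_index([(0, 1), (1, 2), (2, 3)])

# Number of directed entries: twice the number of undirected edges.
num_edges = len(edge_index[0])
# Node count inferred from the largest node id that appears in an edge.
num_nodes = max(max(edge_index[0]), max(edge_index[1])) + 1

print(edge_index)
print(num_nodes, num_edges)
```

Inferring the node count from the largest id only works when that node has at least one edge. This matters later, when the model is trained on a subset of the edges.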
Configure and Train a Node2Vec Model
Instantiate a Node2Vec model from the 'Graph Embedding Generator' toolkit (PyG). We'll configure its parameters, such as embedding dimension and random walk length, then train it on the graph's edge index to learn representations for each node.
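To make the walk parameters concrete, here is a minimal plain-Python sketch of a node2vec-style biased random walk. It is an illustration under simplifying assumptions (an unweighted graph stored as an adjacency dict), not PyG's implementation, and the names biased_walk and adj are hypothetical. A small p makes the walk more likely to step back to the node it just left. A small q pushes it toward nodes farther from the previous one.

```python
import random

def biased_walk(adj, start, walk_length, p=1.0, q=1.0, rng=None):
    """Return a walk of `walk_length` nodes (start included) over adjacency dict `adj`."""
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < walk_length:
        cur = walk[-1]
        neighbors = adj[cur]
        if not neighbors:
            break  # dead end: stop early
        if len(walk) == 1:
            # The first step has no previous node, so choose uniformly.
            walk.append(rng.choice(neighbors))
            continue
        prev = walk[-2]
        weights = []
        for nxt in neighbors:
            if nxt == prev:
                weights.append(1.0 / p)   # return to the previous node
            elif nxt in adj[prev]:
                weights.append(1.0)       # stay at distance 1 from the previous node
            else:
                weights.append(1.0 / q)   # move outward, to distance 2
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk

# Toy undirected graph (made-up data).
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
walk = biased_walk(adj, start=0, walk_length=6, p=1.0, q=0.5, rng=random.Random(0))
print(walk)
```

With p=1 and q=1, as in the starter code, all three weights are equal and the walk reduces to a uniform random walk.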
Extract Node Embeddings
Once the model is trained, call it with no arguments, as in 'model()', to get the embeddings for every node. The resulting tensor 'z' has shape [num_nodes, embedding_dim], and row i is the vector representation of node i.
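Because each row of 'z' is one node's vector, a quick sanity check is to look up a node's nearest neighbors in embedding space. The sketch below uses a small hand-written tensor as a stand-in for the trained embeddings. The helper most_similar is hypothetical, and cosine similarity is just one reasonable choice of measure.

```python
import torch
import torch.nn.functional as F

def most_similar(z, node_id, k=3):
    """Return indices of the k rows of z closest to row `node_id` by cosine similarity."""
    sims = F.cosine_similarity(z[node_id].unsqueeze(0), z, dim=1)
    sims[node_id] = float('-inf')  # exclude the query node itself
    return sims.topk(k).indices.tolist()

# Stand-in for the [num_nodes, embedding_dim] matrix (made-up values).
z = torch.tensor([[1.0, 0.0],
                  [0.9, 0.1],
                  [0.0, 1.0],
                  [-1.0, 0.0]])

print(z.shape)
print(most_similar(z, node_id=0, k=2))
```

With a trained model, you would pass 'model().detach()' in place of the stand-in tensor.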
Evaluate Embeddings for Link Prediction
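A key use of embeddings is link prediction. We can test the embeddings by checking how well they separate held-out edges from sampled non-edges. The 'test' function defined in the starter code scores each candidate link with the dot product of its two node embeddings and computes AUC with scikit-learn's roc_auc_score. This is separate from PyG's built-in Node2Vec 'test' method, which evaluates node classification rather than link prediction.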
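The AUC reported here can be read as the probability that a randomly chosen true edge gets a higher score than a randomly chosen non-edge. The plain-Python sketch below computes it that way on made-up two-dimensional embeddings. The names dot and link_auc are hypothetical, and the pairwise loop is for illustration only (roc_auc_score gets the same value more efficiently).

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def link_auc(z, pos_edges, neg_edges):
    """Fraction of (positive, negative) pairs where the positive edge scores higher.

    Ties count as half a win.
    """
    pos_scores = [dot(z[u], z[v]) for u, v in pos_edges]
    neg_scores = [dot(z[u], z[v]) for u, v in neg_edges]
    wins = 0.0
    for ps in pos_scores:
        for ns in neg_scores:
            if ps > ns:
                wins += 1.0
            elif ps == ns:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Made-up embeddings for four nodes.
z = {0: [1.0, 0.0], 1: [0.9, 0.1], 2: [0.0, 1.0], 3: [0.1, 0.9]}
pos_edges = [(0, 1), (2, 3), (1, 2)]   # held-out true edges
neg_edges = [(0, 2), (1, 3)]           # sampled non-edges

auc = link_auc(z, pos_edges, neg_edges)
print(f'AUC: {auc:.4f}')
```

An AUC of 0.5 means the scores are no better than chance at ranking true edges above non-edges. An AUC of 1.0 means every true edge outranks every non-edge.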

Starter code

import torch
from torch_geometric.datasets import Planetoid
from torch_geometric.nn import Node2Vec
from torch_geometric.utils import train_test_split_edges
from sklearn.metrics import roc_auc_score

# 1. Load Dataset and prepare splits
dataset = Planetoid(root='/tmp/Cora', name='Cora')
data = dataset[0]

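# Drop the node-classification attributes, then hold out 5% of edges for
# validation and 10% for testing. Negative edges are sampled for both sets.
# Note: recent PyG releases deprecate train_test_split_edges in favor of
# torch_geometric.transforms.RandomLinkSplit; it still works but may warn.
data.train_mask = data.val_mask = data.test_mask = data.y = None
data = train_test_split_edges(data, val_ratio=0.05, test_ratio=0.1)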

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# 2. Configure and Train the Graph Embedding Generator (Node2Vec)
model = Node2Vec(
    data.train_pos_edge_index,
    embedding_dim=128,
    walk_length=20,
    context_size=10,
    walks_per_node=10,
    num_negative_samples=1,
    p=1,
    q=1,
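    sparse=True,
    # Size the embedding table for every node, including any node that
    # happens to have no edge left in the training split.
    num_nodes=data.num_nodes,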
).to(device)

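# num_workers=0 keeps walk sampling in the main process, which runs on every
# platform; raise it on Linux to sample walks in parallel.
loader = model.loader(batch_size=128, shuffle=True, num_workers=0)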
optimizer = torch.optim.SparseAdam(list(model.parameters()), lr=0.01)

def train():
    model.train()
    total_loss = 0
    for pos_rw, neg_rw in loader:
        optimizer.zero_grad()
        loss = model.loss(pos_rw.to(device), neg_rw.to(device))
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    return total_loss / len(loader)

print("Training Node2Vec model...")
for epoch in range(1, 101):
    loss = train()
    if epoch % 20 == 0:
        print(f'Epoch: {epoch:03d}, Loss: {loss:.4f}')

# 3. Evaluate Embeddings on Link Prediction Task
@torch.no_grad()
def test(pos_edge_index, neg_edge_index):
    model.eval()
    z = model()
    
    # Positive edges
    pos_y = z.new_ones(pos_edge_index.size(1))
    pos_pred = (z[pos_edge_index[0]] * z[pos_edge_index[1]]).sum(dim=1)

    # Negative edges
    neg_y = z.new_zeros(neg_edge_index.size(1))
    neg_pred = (z[neg_edge_index[0]] * z[neg_edge_index[1]]).sum(dim=1)

    pred = torch.cat([pos_pred, neg_pred], dim=0)
    y = torch.cat([pos_y, neg_y], dim=0)
    
    return roc_auc_score(y.cpu().numpy(), pred.cpu().numpy())

val_auc = test(data.val_pos_edge_index, data.val_neg_edge_index)
test_auc = test(data.test_pos_edge_index, data.test_neg_edge_index)

print(f'\nValidation AUC: {val_auc:.4f}')
print(f'Test AUC: {test_auc:.4f}')

# 4. Get embeddings for a specific node
node_id = 10
# detach() drops the autograd graph so the result is a plain tensor
node_embedding = model()[node_id].detach().cpu()
print(f'\nEmbedding for node {node_id}:\n{node_embedding}')