financial-nlp, sec-filings, finbert, table-extraction, sentiment-analysis, python, pandas, transformers

Parse Financial Reports with FinBERT and Python

Use the FinBERT model to run sentiment analysis on financial text. This pack also shows how to build a simple financial report parser that extracts tables from public SEC filings (such as 10-Ks) and converts them into structured pandas DataFrames for analysis.

intermediate · 30 min · 4 steps

The play

Install Dependencies
Set up your Python environment. You'll need the `transformers` library to run the FinBERT model, `torch` as its backend, `pandas` for data manipulation, and `requests` to fetch online documents. pandas' `read_html` also needs an HTML parser such as `lxml` installed.
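Before running anything heavier, it can help to confirm that the packages are importable. The following is a minimal sketch, not part of the starter code: the helper name `missing_packages` is hypothetical, and `lxml` is included on the assumption that it serves as the HTML parser for `pandas.read_html`.

```python
import importlib.util

# Packages named in this step, plus lxml (assumed here as the HTML parser
# that pandas.read_html relies on).
REQUIRED = ["transformers", "torch", "pandas", "requests", "lxml"]

def missing_packages(names):
    """Return the subset of `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]

if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages, install with: pip install " + " ".join(missing))
    else:
        print("All dependencies are importable.")
```

The check only looks up each package without importing it, so it stays fast even though `torch` and `transformers` are slow to load.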
Perform Sentiment Analysis
Load the pre-trained FinBERT model and tokenizer. Use them to classify a passage of financial text as positive, negative, or neutral. This is useful for gauging the tone of earnings reports or management discussions.
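The model returns one raw logit per class, which a softmax turns into probabilities. The sketch below shows that step in plain Python so it runs without downloading the model. The logits and the label mapping are illustrative assumptions; in practice the mapping should be read from `model.config.id2label`, which is a dict keyed by integer class index, so labels are looked up by index.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def label_scores(logits, id2label):
    """Pair each probability with its label name, looked up by class index."""
    probs = softmax(logits)
    return {id2label[i]: p for i, p in enumerate(probs)}

# Illustrative values only: hypothetical logits and an assumed label mapping.
example_id2label = {0: "positive", 1: "negative", 2: "neutral"}
example_logits = [-1.2, 2.3, 0.1]

scores = label_scores(example_logits, example_id2label)
top_label = max(scores, key=scores.get)  # label with the highest probability
print(top_label, round(scores[top_label], 4))
```

Reporting all three probabilities, rather than only the top label, makes it easier to spot sentences the model is unsure about.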
Find a Public SEC Filing
Locate a public financial filing to parse. The SEC's EDGAR database is the primary source. Find a company's 10-K (annual report) and copy the link to its main HTML document (a `/Archives/edgar/data/...` URL ending in `.htm`), which `requests` and pandas can fetch and parse directly.
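Document links in the EDGAR archive appear to follow a regular layout: the company's CIK, the accession number with its dashes removed, then the document name. The sketch below assembles such a URL. The helper name `edgar_document_url` is hypothetical, the layout is inferred from the example URL in the starter code, and the sample values are taken from that same URL.

```python
def edgar_document_url(cik, accession_number, document_name):
    """Build an EDGAR Archives URL for one document in a filing.

    Assumes the layout
    /Archives/edgar/data/<CIK>/<accession number without dashes>/<document>.
    """
    cik_part = str(int(cik))  # drop any leading zeros from the CIK
    accession_part = accession_number.replace("-", "")
    return (
        "https://www.sec.gov/Archives/edgar/data/"
        f"{cik_part}/{accession_part}/{document_name}"
    )

# Values taken from the example URL used in the starter code.
url = edgar_document_url("0000789019", "0001564590-23-009653", "msft-10k_20230630.htm")
print(url)
```

The document name varies from filing to filing, so it still has to be read off the filing's index page on EDGAR.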
Extract Financial Tables
Use pandas' `read_html` function, the core of the parser, to extract every table from the filing's HTML as a list of pandas DataFrames. Expect some cleanup afterwards: filings often use tables for layout, so many of the extracted DataFrames contain empty rows and columns.
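Extracted cells usually arrive as strings such as `$ 1,234` or `(567)`, where parentheses mark a negative amount. Here is a minimal sketch of a converter for such cells, assuming those conventions hold for the filing at hand. The helper name `parse_financial_number` is hypothetical and not part of pandas.

```python
def parse_financial_number(cell):
    """Convert a cell such as '$ 1,234' or '(567)' to a float.

    Returns None for empty, missing, or non-numeric cells.
    """
    if cell is None:
        return None
    # Remove currency signs, thousands separators, and percent signs.
    cleaned = str(cell).replace("$", "").replace(",", "").replace("%", "").strip()
    if not cleaned:
        return None
    # Accounting convention: a value in parentheses is negative.
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    if negative:
        cleaned = cleaned[1:-1].strip()
    try:
        value = float(cleaned)
    except ValueError:
        return None  # text labels, dashes, footnote markers, etc.
    if value != value:  # NaN is the only value not equal to itself
        return None
    return -value if negative else value

print(parse_financial_number("$ 1,234"))  # 1234.0
print(parse_financial_number("(567)"))    # -567.0
```

The function can be applied to one column at a time with `Series.map`, for example `df[col].map(parse_financial_number)`, keeping the label column untouched.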

Starter code

from io import StringIO

import pandas as pd
import requests
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

def analyze_sentiment(text, model, tokenizer):
    """Performs sentiment analysis on a given text using a FinBERT model."""
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    scores = torch.nn.functional.softmax(outputs.logits, dim=-1)[0]
    # id2label maps class index -> label name, so look each label up by index
    # (iterating over the dict directly would yield the integer keys).
    labels = model.config.id2label
    return {labels[i]: score.item() for i, score in enumerate(scores)}

def build_financial_report_parser(url):
    """Extracts all tables from a given SEC filing URL."""
    try:
        # The SEC asks automated clients to identify themselves in the User-Agent.
        # Placeholder value: replace it with your own name and contact email.
        headers = {'User-Agent': 'Your Name your.email@example.com'}
        response = requests.get(url, headers=headers, timeout=30)
        response.raise_for_status() # Raise an exception for bad status codes
        # Wrap the HTML in StringIO: recent pandas versions deprecate passing
        # a literal HTML string to read_html.
        tables = pd.read_html(StringIO(response.text))
        print(f"Successfully extracted {len(tables)} tables from the URL.")
        return tables
    except (requests.RequestException, ValueError, ImportError) as e:
        # ValueError: no tables found; ImportError: no HTML parser installed
        print(f"Failed to parse tables: {e}")
        return []

if __name__ == "__main__":
    # --- Part 1: Sentiment Analysis ---
    print("--- Loading FinBERT for Sentiment Analysis ---")
    model_name = "ProsusAI/finbert"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)

    sample_text = "Guidance for the next quarter is projected to be lower due to supply chain constraints."
    sentiment = analyze_sentiment(sample_text, model, tokenizer)

    print(f"\nAnalysis for: '{sample_text}'")
    for label, score in sentiment.items():
        print(f"{label.capitalize()}: {score:.4f}")

    # --- Part 2: Table Extraction from SEC Filing ---
    print("\n--- Building Financial Report Parser for SEC Filing ---")
    # Using Microsoft's 2023 10-K as an example
    sec_filing_url = "https://www.sec.gov/Archives/edgar/data/789019/000156459023009653/msft-10k_20230630.htm"

    financial_tables = build_financial_report_parser(sec_filing_url)

    if financial_tables:
        # Find and print a meaningful table, e.g., the Consolidated Statements of Income
        # Note: The exact table to look for varies by filing.
        print("\n--- Searching for 'Consolidated Statements of Income' ---")
        found = False
        for i, df in enumerate(financial_tables):
            # A simple heuristic to find the income statement
            if any('revenue' in str(s).lower() for s in df.values.flatten()):
                print(f"Found a potential income statement table (Table #{i}):")
                # Clean up the table for better display
                df_cleaned = df.dropna(how='all', axis=0).dropna(how='all', axis=1).reset_index(drop=True)
                print(df_cleaned.to_string())
                found = True
                break
        if not found:
            print("Could not automatically identify the income statement. Printing first non-empty table instead.")
            for df in financial_tables:
                if not df.empty:
                    print(df.head())
                    break