Article
financial-nlp, sec-filings, finbert, table-extraction, sentiment-analysis, python, pandas, transformers
Parse Financial Reports with FinBERT and Python
Use the FinBERT model to perform sentiment analysis on financial text. This pack also shows how to build a Financial Report Parser to extract tables from public SEC filings (like 10-Ks) and convert them into structured pandas DataFrames for analysis.
intermediate · 30 min · 4 steps
The play
- Install Dependencies: Set up your Python environment. You'll need the `transformers` library to run the FinBERT model, `torch` as its backend, `pandas` for data manipulation, `lxml` as the HTML parser behind `pandas.read_html`, and `requests` to fetch online documents.
- Perform Sentiment Analysis: Load the pre-trained FinBERT model and tokenizer, then use them to score a financial statement as positive, negative, or neutral. This is useful for gauging the tone of earnings reports or management discussions.
- Find a Public SEC Filing: Locate a public financial filing to parse. The SEC's EDGAR database is the primary source. Find a company's 10-K (annual report) and copy the link to its main HTML document (the `.htm` file under `/Archives/edgar/data/`), which is the easiest format to parse.
- Extract Financial Tables: Use pandas' `read_html` function, the core of the Financial Report Parser, to extract every `<table>` element from the filing's HTML. Each table comes back as a pandas DataFrame in a list; expect to drop empty rows and columns before analysis.
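For step 3, one way to locate a filing programmatically is EDGAR's JSON submissions endpoint (`https://data.sec.gov/submissions/CIK##########.json`). The sketch below assumes the response keeps a `filings.recent` layout of parallel lists (`form`, `accessionNumber`, `primaryDocument`) ordered newest first. The helper names `submissions_url` and `latest_filing_url` are illustrative, not part of any library, and the sample payload is made up. The code only builds URLs from an already-fetched payload, so the download itself is left to the caller.

```python
from typing import Optional

ARCHIVES_BASE = "https://www.sec.gov/Archives/edgar/data"


def submissions_url(cik: int) -> str:
    """URL of EDGAR's JSON filing index for a company (CIK zero-padded to 10 digits)."""
    return f"https://data.sec.gov/submissions/CIK{cik:010d}.json"


def latest_filing_url(submissions: dict, cik: int, form: str = "10-K") -> Optional[str]:
    """Return the primary-document URL of the first matching filing, or None."""
    recent = submissions["filings"]["recent"]
    rows = zip(recent["form"], recent["accessionNumber"], recent["primaryDocument"])
    for filed_form, accession, document in rows:
        if filed_form == form:
            # Archive folders use the accession number without dashes.
            return f"{ARCHIVES_BASE}/{cik}/{accession.replace('-', '')}/{document}"
    return None


# Made-up payload for illustration only; real data comes from submissions_url(cik).
sample = {"filings": {"recent": {
    "form": ["8-K", "10-K"],
    "accessionNumber": ["0000000000-24-000002", "0000000000-24-000001"],
    "primaryDocument": ["event.htm", "annual.htm"],
}}}

print(submissions_url(789019))
print(latest_filing_url(sample, cik=789019))
```

Keeping the URL construction separate from the network call makes the helper easy to check against a saved response before pointing it at the live endpoint.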
Starter code
import io

import pandas as pd
import requests
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification


def analyze_sentiment(text, model, tokenizer):
    """Performs sentiment analysis on a given text using a FinBERT model."""
    # Inputs longer than 512 tokens are truncated, so long passages are only partly scored.
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    scores = torch.nn.functional.softmax(outputs.logits, dim=-1)[0]
    # id2label maps class index -> label name, so look each label up by index.
    labels = model.config.id2label
    return {labels[i]: score.item() for i, score in enumerate(scores)}


def build_financial_report_parser(url):
    """Extracts all tables from a given SEC filing URL."""
    try:
        # The SEC asks automated clients to identify themselves.
        # Replace this placeholder with your own name and contact email.
        headers = {'User-Agent': 'Sample Company Name admin@example.com'}
        response = requests.get(url, headers=headers, timeout=30)
        response.raise_for_status()  # Raise an exception for bad status codes
        # Wrap the HTML in StringIO; passing a raw HTML string is deprecated in recent pandas.
        tables = pd.read_html(io.StringIO(response.text))
        print(f"Successfully extracted {len(tables)} tables from the URL.")
        return tables
    except Exception as e:
        print(f"Failed to parse tables: {e}")
        return []


if __name__ == "__main__":
    # --- Part 1: Sentiment Analysis ---
    print("--- Loading FinBERT for Sentiment Analysis ---")
    model_name = "ProsusAI/finbert"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    sample_text = "Guidance for the next quarter is projected to be lower due to supply chain constraints."
    sentiment = analyze_sentiment(sample_text, model, tokenizer)
    print(f"\nAnalysis for: '{sample_text}'")
    for label, score in sentiment.items():
        print(f"{label.capitalize()}: {score:.4f}")

    # --- Part 2: Table Extraction from SEC Filing ---
    print("\n--- Building Financial Report Parser for SEC Filing ---")
    # Using Microsoft's 2023 10-K as an example
    sec_filing_url = "https://www.sec.gov/Archives/edgar/data/789019/000156459023009653/msft-10k_20230630.htm"
    financial_tables = build_financial_report_parser(sec_filing_url)
    if financial_tables:
        # Find and print a meaningful table, e.g., the Consolidated Statements of Income
        # Note: The exact table to look for varies by filing.
        print("\n--- Searching for 'Consolidated Statements of Income' ---")
        found = False
        for i, df in enumerate(financial_tables):
            # A simple heuristic to find the income statement
            if any('revenue' in str(s).lower() for s in df.values.flatten()):
                print(f"Found a potential income statement table (Table #{i}):")
                # Clean up the table for better display
                df_cleaned = df.dropna(how='all', axis=0).dropna(how='all', axis=1).reset_index(drop=True)
                print(df_cleaned.to_string())
                found = True
                break
        if not found:
            print("Could not automatically identify the income statement. Printing first non-empty table instead.")
            for df in financial_tables:
                if not df.empty:
                    print(df.head())
                    break
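
Tables extracted from filings usually hold numbers as display strings, such as `$ 211,915` or `(1,234)`, so a conversion step is typically needed before any arithmetic. The following is a minimal sketch of one such step, assuming the common accounting convention that parentheses mark negative amounts. The function name `parse_financial_number` is hypothetical and the sample strings are invented for illustration.

```python
import math
from typing import Optional


def parse_financial_number(cell) -> Optional[float]:
    """Convert strings such as '$ 1,234' or '(56.7)' to floats; return None if not numeric."""
    text = str(cell).strip()
    # Accounting convention (assumed here): parentheses denote a negative amount.
    negative = text.startswith("(") and text.endswith(")")
    # Drop currency symbols, thousands separators, percent signs, and parentheses.
    for junk in ("$", ",", "%", "(", ")"):
        text = text.replace(junk, "")
    try:
        value = float(text.strip())
    except ValueError:
        return None  # labels, dashes, empty cells, and other non-numeric content
    if not math.isfinite(value):
        return None  # 'nan' or 'inf', e.g. from empty pandas cells
    # Note: percentages keep their face value ('42.5%' becomes 42.5, not 0.425).
    return -value if negative else value


# "\u2014" is the long dash that filings often use for a zero or empty amount.
for raw in ["$ 211,915", "(1,234)", "42.5%", "Revenue", "\u2014"]:
    print(repr(raw), "->", parse_financial_number(raw))
```

With pandas, the function can be applied to one column at a time through `Series.map`, for example `df_cleaned[col].map(parse_financial_number)`, leaving label columns to come back as `None`.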