How to parse Form 4 XML
The SEC publishes every Form 4 as raw XML. Before you can do anything useful with it, you need to fetch the filing index, find the right document, handle namespace quirks, and extract transactions from two separate tables. This guide shows exactly how painful that is — then shows the EdgarKit JSON equivalent.
The problem
Every Form 4 on EDGAR is stored as XML. The schema is well-documented and consistent, but getting from "I want insider trades for NVDA" to a clean list of transactions requires more steps than you'd expect:
- Look up the CIK for NVDA (EDGAR uses CIKs, not tickers)
- Fetch the filing index for that CIK
- Find the .xml document within the filing bundle (there may be 3-5 files)
- Parse the XML, handling optional elements and multi-transaction filings
- Do this twice: once for the non-derivative table (common stock), once for the derivative table (options, RSUs)
- Map codes back to human-readable meanings — see Form 4 transaction codes
The result is a brittle 100-line script before you've written a single line of business logic. Here's what that looks like, followed by the EdgarKit alternative.
The approach
We'll walk through the raw XML path first so you understand what's actually in these filings, then contrast it with a single EdgarKit API call that returns the same data already parsed. The goal isn't to tell you the raw approach is wrong; it's to show you the real cost of building it yourself.
Step 1: Fetch the raw XML from SEC EDGAR
Every Form 4 filing is accessible at a URL built from the CIK and accession number. EDGAR's submissions API returns the accession numbers for a company:
# Fetch the filing history for NVIDIA (CIK 1045810); Form 4s still have to be filtered out
curl "https://data.sec.gov/submissions/CIK0001045810.json" \
  -H "User-Agent: your-app-name your-email@example.com"
That returns a massive JSON blob. The filings.recent key contains parallel arrays (no nested objects) for accession numbers, filing dates, form types, and primary documents. You have to zip them yourself to find Form 4s.
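A minimal sketch of that zipping step, assuming the response has already been loaded into a dict. The field names (`accessionNumber`, `filingDate`, `form`, `primaryDocument`) are the ones the submissions endpoint uses at the time of writing; the helper name `recent_form4s` and the sample payload are made up for illustration.

```python
def recent_form4s(submissions: dict, limit: int = 10) -> list[dict]:
    """Zip the parallel arrays in filings.recent and keep only Form 4 rows."""
    recent = submissions["filings"]["recent"]
    rows = zip(
        recent["accessionNumber"],
        recent["filingDate"],
        recent["form"],
        recent["primaryDocument"],
    )
    form4s = [
        {"accession_number": acc, "filing_date": date, "primary_document": doc}
        for acc, date, form, doc in rows
        if form == "4"  # "4/A" amendments are deliberately left out here
    ]
    return form4s[:limit]


# Hand-made payload in the same shape as the real response (values are fake)
sample = {
    "filings": {
        "recent": {
            "accessionNumber": [
                "0001234567-26-000123",
                "0001234567-26-000124",
                "0001234567-26-000125",
            ],
            "filingDate": ["2026-06-15", "2026-06-10", "2026-06-01"],
            "form": ["4", "8-K", "4/A"],
            "primaryDocument": ["wf-form4.xml", "report.htm", "wf-form4a.xml"],
        }
    }
}

form4_rows = recent_form4s(sample)
print(form4_rows)
```

Filtering on an exact `"4"` match is a design choice: amendments arrive as `"4/A"` and usually need separate handling, so this sketch leaves them out rather than mixing them in silently.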
Once you have an accession number like 0001234567-26-000123, the index URL is:
https://www.sec.gov/Archives/edgar/data/1045810/000123456726000123/0001234567-26-000123-index.htm
The XML document is typically named something like wf-form4_168xxx.xml. There's no standard filename — you have to fetch the index and find the file with type 4.
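The URL construction can be sketched as two small helpers. The function names are hypothetical. The `basename` step rests on an assumption: that the `primaryDocument` value sometimes carries a rendering prefix (something like `xslF345X05/`) in front of the raw XML filename. Verify that against real filings before relying on it.

```python
from posixpath import basename

ARCHIVES = "https://www.sec.gov/Archives/edgar/data"


def filing_index_url(cik: str, accession_number: str) -> str:
    """Build the index page URL for one filing."""
    acc_nodash = accession_number.replace("-", "")
    # int() drops any zero padding, since the archive path uses the bare CIK
    return f"{ARCHIVES}/{int(cik)}/{acc_nodash}/{accession_number}-index.htm"


def filing_document_url(cik: str, accession_number: str, filename: str) -> str:
    """Build the URL of one document inside the filing folder."""
    acc_nodash = accession_number.replace("-", "")
    # Assumption: keep only the last path segment of the filename
    return f"{ARCHIVES}/{int(cik)}/{acc_nodash}/{basename(filename)}"


index_url = filing_index_url("0001045810", "0001234567-26-000123")
doc_url = filing_document_url(
    "1045810", "0001234567-26-000123", "xslF345X05/wf-form4.xml"
)
print(index_url)
print(doc_url)
```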
Step 2: Parse the XML with Python
Here's what parsing the non-derivative transaction table looks like with xml.etree.ElementTree:
import urllib.request
import xml.etree.ElementTree as ET


def fetch_form4_xml(cik: str, accession_number: str) -> str:
    """Fetch the raw Form 4 XML from EDGAR."""
    acc_nodash = accession_number.replace("-", "")
    # The XML filename isn't standardized; you'd normally fetch the index first.
    # This assumes you already found the filename from the index.
    xml_filename = "wf-form4.xml"  # placeholder; real code must resolve this
    url = (
        f"https://www.sec.gov/Archives/edgar/data/{cik}"
        f"/{acc_nodash}/{xml_filename}"
    )
    req = urllib.request.Request(url, headers={"User-Agent": "your-app your@email.com"})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")


def parse_transactions(xml_text: str) -> list[dict]:
    root = ET.fromstring(xml_text)

    # Issuer info
    issuer = root.find("issuer")
    ticker = issuer.findtext("issuerTradingSymbol") if issuer is not None else None

    # Reporting owner (find() returns only the first one; joint filers are ignored)
    owner = root.find("reportingOwner")
    owner_name = None
    owner_role = None
    if owner is not None:
        owner_id = owner.find("reportingOwnerId")
        if owner_id is not None:
            owner_name = owner_id.findtext("rptOwnerName")
        owner_rel = owner.find("reportingOwnerRelationship")
        if owner_rel is not None:
            # Role is spread across multiple boolean flags
            is_officer = owner_rel.findtext("isOfficer") == "1"
            is_director = owner_rel.findtext("isDirector") == "1"
            title = owner_rel.findtext("officerTitle") or ""
            if is_officer:
                owner_role = title or "Officer"
            elif is_director:
                owner_role = "Director"

    results = []

    # Non-derivative table only; the derivative table is a separate element
    nd_table = root.find("nonDerivativeTable")
    if nd_table is not None:
        for txn in nd_table.findall("nonDerivativeTransaction"):
            code_el = txn.find("transactionCoding")
            amounts = txn.find("transactionAmounts")

            code = None
            if code_el is not None:
                code = code_el.findtext("transactionCode")

            shares = None
            price = None
            total_value = None
            if amounts is not None:
                shares_text = amounts.findtext("transactionShares/value")
                price_text = amounts.findtext("transactionPricePerShare/value")
                # Parse each field on its own so a missing price (gifts, grants)
                # doesn't also discard the share count
                if shares_text:
                    shares = float(shares_text)
                if price_text:
                    price = float(price_text)
                if shares is not None and price is not None:
                    total_value = shares * price

            results.append({
                "ticker": ticker,
                "filer_name": owner_name,
                "filer_role": owner_role,
                "transaction_date": txn.findtext("transactionDate/value"),
                "transaction_code": code,
                "shares": shares,
                "price_per_share": price,
                "total_value": total_value,
            })

    return results


# Usage
xml_text = fetch_form4_xml("1045810", "0001234567-26-000123")
transactions = parse_transactions(xml_text)
for t in transactions:
    print(t)
That's roughly 70 lines to extract one filing. It doesn't handle amendments, it skips the derivative table entirely, it doesn't resolve the XML filename from the index, and it doesn't map tickers to CIKs for you: you passed the CIK directly.
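The missing ticker-to-CIK step could be sketched as follows. It assumes the SEC's published `company_tickers.json` file, which at the time of writing is a JSON object keyed by row number with `cik_str`, `ticker`, and `title` fields. The helper names are made up for illustration, and the sample at the bottom runs offline.

```python
import json
import urllib.request

TICKER_MAP_URL = "https://www.sec.gov/files/company_tickers.json"


def build_ticker_index(payload: dict) -> dict[str, str]:
    """Map upper-case ticker -> 10-digit zero-padded CIK string."""
    return {
        row["ticker"].upper(): str(row["cik_str"]).zfill(10)
        for row in payload.values()
    }


def fetch_ticker_index(user_agent: str) -> dict[str, str]:
    """Download the ticker map; the SEC expects a descriptive User-Agent."""
    req = urllib.request.Request(TICKER_MAP_URL, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return build_ticker_index(json.load(resp))


# Offline example using the assumed shape of the published file
sample = {"0": {"cik_str": 1045810, "ticker": "NVDA", "title": "NVIDIA CORP"}}
ticker_index = build_ticker_index(sample)
print(ticker_index["NVDA"])  # 0001045810
```

The zero-padded form matches what the submissions URL expects (`CIK0001045810.json`); the archive paths use the same number without padding.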
Step 3: What you still have to handle
Even with the parser above, you're not done:
- Derivative table: options, RSUs, warrants live in <derivativeTable>, with a different element structure. You need a second parser loop.
- Multi-owner filings: a single Form 4 can have multiple <reportingOwner> blocks (joint filers). The code above only grabs the first one.
- Amendments: Form 4/A files have the same XML shape but need to be reconciled against the original. The <documentType> element tells you if it's an amendment.
- Rate limiting: EDGAR allows 10 requests per second per IP. If you're processing hundreds of filings, you need a throttle.
- Ticker lookup: EDGAR doesn't guarantee the <issuerTradingSymbol> field is populated. You sometimes need to resolve via a separate CIK-to-ticker map.
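Three of those gaps (the derivative table, joint filers, and amendment detection) can be sketched in one pass. This is an illustration under assumptions, not a complete parser: the sample XML is hand-written and heavily trimmed, the function name `parse_extras` and the output keys are hypothetical, and only a few derivative fields are read.

```python
import xml.etree.ElementTree as ET

# Hand-written, trimmed sample; real filings carry many more elements
SAMPLE = """<ownershipDocument>
  <documentType>4/A</documentType>
  <reportingOwner>
    <reportingOwnerId><rptOwnerName>Example Holder One</rptOwnerName></reportingOwnerId>
  </reportingOwner>
  <reportingOwner>
    <reportingOwnerId><rptOwnerName>Example Holder Two</rptOwnerName></reportingOwnerId>
  </reportingOwner>
  <derivativeTable>
    <derivativeTransaction>
      <securityTitle><value>Stock Option (Right to Buy)</value></securityTitle>
      <conversionOrExercisePrice><value>25.50</value></conversionOrExercisePrice>
      <transactionDate><value>2026-06-15</value></transactionDate>
      <transactionCoding><transactionCode>M</transactionCode></transactionCoding>
      <transactionAmounts>
        <transactionShares><value>1000</value></transactionShares>
      </transactionAmounts>
      <underlyingSecurity>
        <underlyingSecurityTitle><value>Common Stock</value></underlyingSecurityTitle>
      </underlyingSecurity>
    </derivativeTransaction>
  </derivativeTable>
</ownershipDocument>"""


def _num(text):
    """Convert optional element text to float, keeping missing values as None."""
    return float(text) if text and text.strip() else None


def parse_extras(xml_text: str) -> dict:
    root = ET.fromstring(xml_text)

    # Amendments are filed with documentType "4/A"
    doc_type = (root.findtext("documentType") or "").strip()

    # findall() returns every reportingOwner block, not just the first
    owners = [
        o.findtext("reportingOwnerId/rptOwnerName")
        for o in root.findall("reportingOwner")
    ]

    derivatives = []
    for txn in root.findall("derivativeTable/derivativeTransaction"):
        derivatives.append({
            "security_title": txn.findtext("securityTitle/value"),
            "exercise_price": _num(txn.findtext("conversionOrExercisePrice/value")),
            "transaction_date": txn.findtext("transactionDate/value"),
            "transaction_code": txn.findtext("transactionCoding/transactionCode"),
            "shares": _num(txn.findtext("transactionAmounts/transactionShares/value")),
            "underlying_title": txn.findtext(
                "underlyingSecurity/underlyingSecurityTitle/value"
            ),
        })

    return {
        "is_amendment": doc_type.endswith("/A"),
        "owners": owners,
        "derivative_transactions": derivatives,
    }


extras = parse_extras(SAMPLE)
print(extras)
```

Reconciling an amendment against the original filing is a separate problem that this sketch does not attempt; it only flags that the document is one.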
The EdgarKit alternative
The same data, in one call:
curl "https://api.edgarkit.com/v1/filings?form_type=4&ticker=NVDA&limit=10" \
-H "Authorization: Bearer YOUR_API_KEY"
{
  "filings": [
    {
      "id": "f4_0001234567_20260615",
      "form_type": "4",
      "issuer_ticker": "NVDA",
      "issuer_name": "NVIDIA Corporation",
      "filer_name": "Jensen Huang",
      "filer_role": "CEO",
      "transaction_date": "2026-06-15",
      "transaction_code": "P",
      "shares": 5000,
      "price_per_share": 118.42,
      "total_value": 592100,
      "post_transaction_shares": 872500,
      "is_derivative": false,
      "amended": false,
      "filed_at": "2026-06-15T21:04:33Z"
    }
  ]
}
Derivative and non-derivative transactions are unified in one array (use is_derivative to separate them). Amendments are flagged. Tickers are already resolved. You get here with zero lines of parsing code.
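Separating the two kinds again is a one-liner each. A small sketch, assuming the response shape shown in the sample JSON above; `split_by_kind` is a made-up helper name and the rows are fake.

```python
def split_by_kind(filings: list[dict]) -> tuple[list[dict], list[dict]]:
    """Return (non_derivative, derivative) rows based on the is_derivative flag."""
    non_derivative = [f for f in filings if not f["is_derivative"]]
    derivative = [f for f in filings if f["is_derivative"]]
    return non_derivative, derivative


# Fake rows carrying only the fields this example needs
rows = [
    {"id": "a", "transaction_code": "P", "is_derivative": False},
    {"id": "b", "transaction_code": "M", "is_derivative": True},
    {"id": "c", "transaction_code": "S", "is_derivative": False},
]

stock_rows, derivative_rows = split_by_kind(rows)
print([r["id"] for r in stock_rows], [r["id"] for r in derivative_rows])
```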
Putting it together
Here's a Python script that uses the EdgarKit API to pull the last week of P-coded NVDA filings — the clean version of everything the XML parser above was trying to do:
import os
import requests
from datetime import datetime, timedelta, timezone

API_KEY = os.environ["EDGARKIT_API_KEY"]


def get_insider_purchases(ticker: str, days: int = 7) -> list[dict]:
    since = (datetime.now(timezone.utc) - timedelta(days=days)).strftime("%Y-%m-%d")
    resp = requests.get(
        "https://api.edgarkit.com/v1/filings",
        params={
            "form_type": "4",
            "ticker": ticker,
            "transaction_code": "P",
            "since": since,
            "limit": 100,
        },
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["filings"]


if __name__ == "__main__":
    filings = get_insider_purchases("NVDA")
    for f in filings:
        print(
            f"{f['transaction_date']} {f['filer_name']} ({f['filer_role']}) "
            f"{f['shares']:,} shares @ ${f['price_per_share']} "
            f"= ${f['total_value']:,.0f}"
        )
Twelve lines of business logic, no XML parsing, no CIK lookup, no rate-limit juggling.
FAQ
Is the Form 4 XML schema stable?
Mostly. The SEC has maintained the core schema since electronic filing of Form 4 became mandatory in 2003, following the Sarbanes-Oxley Act of 2002. The element names and structure are consistent. The edge cases (missing price, multi-owner filings, derivative-only forms) are the parts that require defensive parsing.
Does EdgarKit return the raw XML too?
Yes. Every filing object includes a filing_index_url pointing to the SEC index page where you can get the original XML. If you need something not in the parsed JSON, you can fall back to the raw document.
What's the difference between Table I and Table II on a Form 4?
Table I covers non-derivative securities (common stock, preferred stock). Table II covers derivative securities (options, warrants, RSUs, convertible notes). The XML separates them into nonDerivativeTable and derivativeTable elements. EdgarKit merges them into one array with an is_derivative flag.
How do I handle a filing with no price (e.g. a gift or grant)?
The value inside the transactionPricePerShare element is optional and is often missing for gifts (code G) and grants (code A); the element may be absent or may carry only a footnote reference. Your parser needs a null check. EdgarKit returns null for price_per_share and total_value in those cases rather than omitting the fields.
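A null-safe version of the amount extraction might look like this. It is a sketch with a hand-written gift transaction as input; `opt_float` and `read_amounts` are illustrative names, not part of any library.

```python
import xml.etree.ElementTree as ET

# Hand-written gift transaction: the price carries a footnote but no value
GIFT = """<nonDerivativeTransaction>
  <transactionCoding><transactionCode>G</transactionCode></transactionCoding>
  <transactionAmounts>
    <transactionShares><value>2500</value></transactionShares>
    <transactionPricePerShare><footnoteId id="F1"/></transactionPricePerShare>
  </transactionAmounts>
</nonDerivativeTransaction>"""


def opt_float(text):
    """Return a float for non-empty text, otherwise None."""
    return float(text) if text and text.strip() else None


def read_amounts(txn: ET.Element) -> dict:
    """Read shares and price independently so one missing field doesn't hide the other."""
    shares = opt_float(txn.findtext("transactionAmounts/transactionShares/value"))
    price = opt_float(txn.findtext("transactionAmounts/transactionPricePerShare/value"))
    total = shares * price if shares is not None and price is not None else None
    return {"shares": shares, "price_per_share": price, "total_value": total}


gift_amounts = read_amounts(ET.fromstring(GIFT))
print(gift_amounts)
```

Keeping `None` distinct from `0.0` matters here: a price of zero and an unreported price mean different things, and collapsing them makes downstream totals misleading.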
What if I need historical Form 4 data going back several years?
The EdgarKit API supports since and until date parameters and handles pagination with cursor tokens. For bulk historical pulls, use the /v1/bulk endpoint, which returns gzipped NDJSON and is much faster than paginating through individual filings.