GitHub - onami/SEC-Parsers

SEC Parsers

Parses non-standardized SEC filings into structured XML. Use cases include LLMs, NLP, and textual analysis.

Supported filing types are 10-K, 10-Q, 8-K, S-1, and 20-F. More will be added soon, or you can write your own; see How to write a Custom Parser in 5 minutes.

Update: This package is no longer maintained. Parsed SEC bulk data will soon be available for free on datamule, including a dataset of every 10-K from the past four years and an MD&A subset. For additional data, contact me through that website or LinkedIn.

If you're looking to browse SEC filings and extract data like tables, check out the tool I'm developing, SEC Filings Viewer.

If you need to bulk download SEC data, check out Downloading SEC Filings.

Installation

pip install sec-parsers # base package
pip install "sec-parsers[all]" # installs all extras
pip install "sec-parsers[downloaders]" # installs downloaders extras
pip install "sec-parsers[visualizers]" # installs visualizers extras

Links: SEC Downloaders, SEC Visualizers

Quickstart

Load package

from sec_parsers import Filing

Downloading html file (new)

from sec_downloaders import SEC_Downloader

downloader = SEC_Downloader()
downloader.set_headers("John Doe", "johndoe@example.com")
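url = 'https://www.sec.gov/Archives/edgar/data/1318605/000162828024002390/tsla-20231231.htm' # any filing url
download = downloader.download(url)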
filing = Filing(download)
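
The `set_headers` call above supplies a name and email because SEC EDGAR asks automated clients to identify themselves in the `User-Agent` header. For readers without the downloaders extra, here is a minimal standard-library sketch of the same idea. `build_sec_request` and `fetch_html` are hypothetical helpers, not part of sec-parsers or SEC Downloaders, and the exact header format those packages send is not shown in this README.

```python
from urllib.request import Request, urlopen


def build_sec_request(url: str, name: str, email: str) -> Request:
    """Build a request that identifies the caller via the User-Agent header."""
    # Assumption: a "Name email" User-Agent string is sufficient identification.
    return Request(url, headers={"User-Agent": f"{name} {email}"})


def fetch_html(url: str, name: str, email: str, timeout: float = 30.0) -> str:
    """Download a filing and return its HTML as text (performs a network call)."""
    request = build_sec_request(url, name, email)
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

The string returned by `fetch_html` could then presumably be passed to `Filing(...)`, in the same way as the output of `download_sec_filing` in the older example.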

Downloading html file (old)

from sec_parsers import Filing, download_sec_filing
html = download_sec_filing('https://www.sec.gov/Archives/edgar/data/1318605/000162828024002390/tsla-20231231.htm')
filing = Filing(html)

Parsing

filing.parse() # parses the filing
filing.visualize() # opens the filing in a web browser with section headers highlighted
filing.find_sections_from_title(title) # finds sections by title, e.g. 'item 1a'
filing.find_sections_from_text(text) # finds sections that contain your text
filing.get_tree(node) # returns the full xml tree if no argument is given; otherwise returns that node's tree
filing.get_title_tree() # returns an xml tree using titles instead of tags; more descriptive than get_tree
filing.get_subsections_from_section() # gets the children of a section
filing.get_nested_subsections_from_section() # gets the descendants of a section
filing.set_filing_type(type) # e.g. 'S-1'; use when automatic detection fails
filing.save_xml(file_name, encoding='utf-8')
filing.save_csv(file_name, encoding='ascii')
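
To make the section queries above concrete, here is a toy sketch of title lookup and descendant listing on a small XML tree, using only the standard library. The element name `section`, the `title` attribute, and both helper functions are hypothetical. The schema that `save_xml` actually writes is not documented in this README and may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout: nested <section> elements carrying a title attribute.
SAMPLE_XML = """
<filing>
  <section title="PART I">
    <section title="Item 1. Business">
      <section title="Overview">We design and sell products.</section>
    </section>
    <section title="Item 1A. Risk Factors">Risks are described here.</section>
  </section>
</filing>
"""


def find_sections_by_title(root: ET.Element, title: str) -> list[ET.Element]:
    """Return sections whose title contains `title`, ignoring case."""
    needle = title.lower()
    return [
        node for node in root.iter("section")
        if needle in node.get("title", "").lower()
    ]


def nested_subsection_titles(section: ET.Element) -> list[str]:
    """Return the titles of all descendants of `section`, in document order."""
    return [
        node.get("title", "")
        for node in section.iter("section")
        if node is not section
    ]


root = ET.fromstring(SAMPLE_XML)
matches = find_sections_by_title(root, "item 1a")
part_one = find_sections_by_title(root, "part i")[0]
```

Here `matches` holds the single risk-factors section, and `nested_subsection_titles(part_one)` lists every section nested under PART I, which is the children-versus-descendants distinction drawn by the two subsection methods.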

Additional Resources:

quickstart
How to write a Custom Parser in 5 minutes
Archive of Parsed XMLs / CSVs - Last updated 7/24/24.
example parsed filing
example parsed filing exported to csv.

Features:

lots of filing types
export to xml, csv, with option to convert to ASCII
visualization
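
The ASCII export option matters because filings often contain curly quotes, non-breaking spaces, and dashes that trip up older tools. Below is a rough sketch of one way to do such a conversion with the standard library. `to_ascii` and its replacement table are illustrative assumptions, not the routine sec-parsers uses.

```python
import unicodedata

# Assumed replacements for punctuation that NFKD normalization does not map to ASCII.
PUNCTUATION = {
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en and em dashes
}


def to_ascii(text: str) -> str:
    """Best-effort conversion of text to plain ASCII."""
    for char, replacement in PUNCTUATION.items():
        text = text.replace(char, replacement)
    # NFKD splits accented letters into base letter + combining mark and
    # maps compatibility characters such as the non-breaking space to a space.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop whatever still has no ASCII equivalent.
    return decomposed.encode("ascii", errors="ignore").decode("ascii")
```

Characters with no mapping are silently dropped in this sketch, so it suits text analysis better than archival use.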

Feature Requests:

Request a Feature

company metadata (sharif) - will add to downloader
filing metadata (sharif) - waiting for SEC Downloaders first release
Export to dta (Denis)
DEF 14A, DEFM14A (Denis)
Export to markdown (Astarag)
Better parsing_string handling. Opened an issue. (sharif)

Other packages useful for SEC filings

https://github.com/dgunning/edgartools

Updates

Towards Version 1:

Note: The next major update will happen in August. It will improve parsing quality and dramatically increase speed. Planned changes: streaming, combined detectors (e.g. all caps / emphasis cap, with handling for unique cases), one-use detectors, adding parse_id, and merging clean parse and xml tree construction into one function.

Most/All SEC textual filings supported

Might be done along the way:

Faster parsing, probably using streaming approach, and combining modules together.
Introduction section parsing
Signatures section parsing
Better visualization interface (e.g. like pdfviewer for sections)

Beyond Version 1:

To improve the package beyond V1 it looks like I need compute and storage. Not sure how to get that. Working on it.

Metadata

Clustering similar section titles using ML (e.g. seasonality headers)
Adding tags to individual sections using small LLMs (e.g. tags for mentions of supply chains, energy, etc.)

Other

Table parsing
Image OCR
Parsing non-html filings

Current Priority list:

look at code duplication w.r.t. style detectors, e.g. all caps and emphasis. may want to combine into one detector

yep this is a priority. have to handle e.g. Introduction and Segment Overview as same rule. Bit difficult. Will think over.

better function names - need to decide terminology soon.
consider adding table of contents, forward looking information, etc

forward looking information, DOCUMENTS INCORPORATED BY REFERENCE, TABLE OF CONTENTS - go with a bunch,

fix layering issue - e.g. top div hides sections
make trees nicer
add more filing types
fix all caps and emphasis issue
clean text
Better historical conversion: handle if PART I appears multiple times as header, e.g. logic here item 1 continued.
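
As one way to picture the combined style detector mentioned in the first priority item, the sketch below treats a text run as a header candidate when it is either all caps or wrapped in an emphasis tag, using a single pass of the standard-library HTML parser. The class name, the tag set, the sample HTML, and the rule itself are assumptions for illustration. They are not the detectors sec-parsers ships.

```python
from html.parser import HTMLParser

EMPHASIS_TAGS = {"b", "strong", "i", "em", "u"}  # assumed emphasis markers


class HeaderCandidateFinder(HTMLParser):
    """Collect text runs that are all caps or inside an emphasis tag."""

    def __init__(self) -> None:
        super().__init__()
        self.emphasis_depth = 0
        self.candidates: list[tuple[str, str]] = []

    def handle_starttag(self, tag, attrs):
        if tag in EMPHASIS_TAGS:
            self.emphasis_depth += 1

    def handle_endtag(self, tag):
        if tag in EMPHASIS_TAGS and self.emphasis_depth > 0:
            self.emphasis_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        # A single detector checks both styles instead of two separate passes.
        if text.isupper():
            self.candidates.append((text, "all_caps"))
        elif self.emphasis_depth > 0:
            self.candidates.append((text, "emphasis"))


finder = HeaderCandidateFinder()
finder.feed("<p><b>Introduction</b></p><p>SEGMENT OVERVIEW</p><p>Revenue grew.</p>")
```

After the call to `feed`, `finder.candidates` holds the bold run and the all-caps run, each labeled with the style that matched, while the plain sentence is skipped. A real detector would also need to inspect inline CSS such as `font-weight`, which this sketch ignores.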

