This project constructs a knowledge graph from the data in the CVE Project's cvelistV5 repository. It parses the raw CVE JSON records into a Neo4j graph database that can then be queried and explored.
Note: when constructing the complete graph for the first time, expect to leave the program running for a very long time. During testing the overall rate achieved was about 4.5 CVEs per second, resulting in an overall execution time of ~17.7 hours. To avoid this lengthy process, you can simply load the most recent neo4j.dump file into Neo4j (instructions below).
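As a rough sanity check on those figures, the sketch below converts between ingestion rate, record count, and run time. The function names are hypothetical and not part of cvegraph.py; the default rate of 4.5 CVEs per second is the test figure quoted above, and your own throughput may differ.

```python
SECONDS_PER_HOUR = 3600


def estimate_hours(cve_count: int, cves_per_second: float = 4.5) -> float:
    """Estimate wall-clock hours needed to ingest cve_count records."""
    if cves_per_second <= 0:
        raise ValueError("cves_per_second must be positive")
    return cve_count / cves_per_second / SECONDS_PER_HOUR


def implied_cve_count(hours: float, cves_per_second: float = 4.5) -> int:
    """Number of records processed in the given time at a steady rate."""
    return round(hours * SECONDS_PER_HOUR * cves_per_second)
```

At the quoted rate, 17.7 hours corresponds to roughly 287,000 records. That number follows from the arithmetic alone and is not a measured count.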
For a detailed description of the project, see the paper.
There are two usage scenarios:
- Viewing the existing database dump: this does not require you to run the script or set up the environment. You will just load the database dump and get going.
- Constructing the graph: either start from scratch or update an existing graph. This involves running the script, and can take a long time (17+ hours) if starting from scratch.
The database dump file can be found in the /dump directory of the repository.
If you want to run Neo4j as a standalone Docker container, you can execute the following commands:
- Load the dump into a database:
docker run --interactive --tty --rm \
-v ./dump:/dump -v ./neo4j-data:/data \
neo4j/neo4j-admin:latest \
neo4j-admin database load neo4j --from-path=/dump
- Then, launch a container that uses the loaded database:
docker run -d \
-p 7474:7474 -p 7687:7687 \
-e NEO4J_AUTH=neo4j/password \
-v ./neo4j-data:/data \
--name=cvegraph-neo4j \
neo4j:latest
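Once the container is up, one way to check that the dump loaded is to count the nodes over Bolt. This is a minimal sketch that assumes the official neo4j Python driver is installed (it is not confirmed here as a project dependency) and reuses the port and credentials from the docker run command above. count_nodes and check_container are hypothetical helper names.

```python
def count_nodes(session) -> int:
    """Return the total number of nodes visible through a Neo4j session."""
    record = session.run("MATCH (n) RETURN count(n) AS nodes").single()
    return record["nodes"]


def check_container(uri: str = "bolt://localhost:7687",
                    user: str = "neo4j",
                    password: str = "password") -> int:
    """Connect to the container started above and count its nodes."""
    # Imported lazily so count_nodes can be used without the driver installed.
    from neo4j import GraphDatabase

    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        with driver.session(database="neo4j") as session:
            return count_nodes(session)
```

A non-zero result from check_container() suggests the load worked; a result of zero suggests the database is empty.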
To use Neo4j Desktop instead of Docker, start by downloading it. Once it is running, create a new project. Do not add a DBMS yet. Click the "Add" button and select "File". Select the file in /dump.
Then, open the dropdown menu for the file and click "Create new DBMS from dump".
You will now be able to use all Neo4j Desktop features with the newly created DBMS. You can verify that the import was successful by connecting to the DBMS, selecting the neo4j database, and looking at the details that pop up on the right.
- Install dependencies
- Set up Python environment:
uv sync
- Clone submodule:
git submodule update --init
- Run the script:
uv run cvegraph.py
- Additional data sources
- NVD
- CWE
- Exploit-DB
- CPE
- ATT&CK
- Presents a much larger problem than the others because mapping CVEs to ATT&CK TTPs is difficult. However, it would be immensely valuable and would make it possible to include data on APTs and threat actor groups
- Asynchronicity
- Logic is already fully implemented in async_cvegraph.py, but there are some quirks: many records go missing. For instance, a test run using only CVE-2024-* files finds 36,080 files on disk, but only ~28k are sent to Neo4j
- Add configuration options to use an external Neo4j database
- New ingestion options
- Full: check every single CVE file (current method)
- Quick: check first and last CVEs in the database, omit all files within that range
- Investigate CSV database import
- According to Neo4j documentation, this is the fastest possible way to perform bulk import
- May be able to speed up the construction of the database from scratch by first writing records to CSV and then importing them, rather than the current approach of sending batches of 1000 at a time
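The CSV idea in the last item could be prototyped along these lines. This is only a sketch under stated assumptions: write_nodes_csv is a hypothetical helper, not something cvegraph.py provides, and it assumes each record has already been flattened into a dict of strings. The column names in the usage example are placeholders rather than the project's actual schema. The resulting file would still need to be loaded with Neo4j's own import tooling, which is not shown here.

```python
import csv
from typing import Iterable, Mapping


def write_nodes_csv(records: Iterable[Mapping[str, str]],
                    path: str,
                    fieldnames: list[str]) -> int:
    """Stream flat records to a CSV file and return the number of rows written."""
    count = 0
    with open(path, "w", newline="", encoding="utf-8") as handle:
        # extrasaction="ignore" drops keys that are not listed in fieldnames;
        # keys missing from a record are written as empty strings.
        writer = csv.DictWriter(handle, fieldnames=fieldnames,
                                extrasaction="ignore")
        writer.writeheader()
        for record in records:
            writer.writerow(record)
            count += 1
    return count
```

For example, write_nodes_csv(rows, "cves.csv", ["cve_id", "state"]) writes a header line followed by one line per record. Because the records are consumed as an iterable, the full data set does not have to be held in memory at once.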