[TOC]
Honas is a system for collecting large quantities of hostname lookup requests so that it can be checked afterwards whether certain hostnames have been requested by clients.
It tries to prevent disclosing privacy-sensitive details about individual clients by using coarse-grained cardinality estimation with multiple probabilistic data structures in the form of Bloom filters.
Honas was originally developed by Quarantainenet and extended by SURFnet to fit an ISP context. The extended prototype features a dnstap input for untraceability. Furthermore, it allows network operators to identify the organization or network segment in which DNS queries were performed. With this feature, the prototype is useful for gaining insight into the threat landscape of an organization.
Honas is built using the Meson build system; for installation instructions and more information, see the Meson website.
Install the necessary dependencies (the command below is for Debian-based systems):

```
apt-get install check libyajl-dev libevent-dev libfstrm-dev libprotobuf-dev protobuf-c-compiler libldns-dev clang-tidy
```

Create the Meson build directory:

```
meson build/
```

Enter the Meson build directory and build all targets:

```
cd build/
ninja
```
Run from inside the meson build directory.

- Change the default `honas-gather.conf` path (default: `/etc/honas/gather.conf`):

  ```
  meson configure -Ddefault_honas_gather_config_path=/path/to/honas-gather.conf
  ```

- Disable the use of some hand-coded assembler optimizations:

  ```
  meson configure -Duse_asm=false
  ```
Run from inside the meson build directory.

- Release mode: `meson configure -Dbuildtype=release`
- Strip executables: `meson configure -Dstrip=true`
- More warnings: `meson configure -Dwarning_level=3`
- Fatal warnings: `meson configure -Dwerror=true`
- Address sanitization: `meson configure -Db_sanitize=address`
- For more options, see: `meson configure`
```
meson build/
cd build/
meson configure -Dbuildtype=release -Dwarning_level=1 -Dwerror=false -Dunity=on -Db_lto=true -Db_coverage=false -Db_sanitize=none
ninja
```
The interfaces are documented using Doxygen. The main page of the generated documentation also includes this text in HTML-rendered form.
To generate the documentation, install Doxygen and run (from the meson build directory):

```
ninja doc
```
The main Honas programming interface is the [honas state](@ref honas_state). But Honas is mainly designed to be run as a set of programs that generate and search through the honas state files managed through this interface.
Below is a diagram that shows the different parts of Honas and how they fit together:
Each of these parts is discussed in more detail below.
Honas is designed to be run as a separate process that processes DNS server logging. The extended prototype is designed to accept DNS logs via a dnstap interface. With that interface, Honas integrates easily into major DNS resolver packages such as Unbound and BIND. Moreover, the DNS logs are transferred from the resolver process to Honas in an untraceable way.
Honas stores the FQDN and all separate labels of the FQDN in the Bloom filters, including some combinations. This allows DGA detection to some extent. The example below shows which information from a DNS query is stored in the Bloom filters. The example domain name is `malicious.site.example.com`. The following information is stored:

- `malicious.site.example.com`
- `example.com`
- `malicious`
- `site`
- `example`

If the source address of the DNS query can be matched to a known IPv4 or IPv6 prefix, the following information is stored additionally. Suppose that the matched prefix belongs to organization A (the `@` sign is written without surrounding spaces in the actual stored values):

- `A@malicious.site.example.com`
- `A@example.com`
- `A@malicious`
- `A@site`
- `A@example`
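The selection of stored values can be illustrated with a short Python sketch (the helper name is ours, not part of Honas, which implements this in C):

```python
def extract_stored_values(fqdn, entity=None):
    """Return the strings stored for one DNS query, mirroring the example above.

    Stored are: the full FQDN, the registered domain (second-level label plus
    TLD), and every individual label except the TLD itself. If an entity was
    matched from the source prefix, each value is additionally stored with the
    entity prepended as "<entity>@<value>".
    """
    fqdn = fqdn.rstrip(".").lower()           # trailing dot is stripped before hashing
    labels = fqdn.split(".")
    values = [fqdn]                           # the full FQDN
    if len(labels) >= 2:
        values.append(".".join(labels[-2:]))  # e.g. example.com
    values.extend(labels[:-1])                # all labels except the TLD
    if entity is not None:
        values += [f"{entity}@{value}" for value in list(values)]
    return values
```

For `malicious.site.example.com` this yields exactly the five plain values listed above, and ten values in total when an entity is matched.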
The following notes apply to the collection and storage of DNS logs:

- The timestamp value is ignored and "now" is used when processing the entries in the state file
- All host names are expected to be in ASCII; other Unicode host names should be Punycode encoded
- A trailing dot after the host name is stripped before being hashed
- The TLD label is not stored because it does not identify a domain name in any way
- Only DNS CLASS 1 (`Internet (IN)`) gets processed
- Only DNS Resource Records of type 1 (`A`), 2 (`NS`), 15 (`MX`) and 28 (`AAAA`) are processed
Honas features a subsystem that tests the source IP address against a collection
of entity-prefix mappings. As discussed above, the entity is then prepended to
and stored with the domain name or label. The entity-prefix mappings are loaded
from a JSON file describing subnet activity. The required configuration entry is called
`subnet_activity_path`, as we will discuss later. An example of such a file is
shown below.
```json
{
    "subnet_activity" :
    [
        {
            "entity" : "SURFnet",
            "prefixes" :
            [
                { "192.87.0.0" : 16 },
                { "145.0.0.0" : 8 },
                { "192.42.0.0" : 16 }
            ]
        },
        {
            "entity" : "netSURF",
            "prefixes" :
            [
                { "192.42.113.0" : 24 },
                { "145.220.0.0" : 16 },
                { "2001:67c:6ec::" : 48 },
                { "2001:67c:6ec:201::" : 64 }
            ]
        }
    ]
}
```
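Conceptually, the gather process maps each source address to the most specific matching prefix and uses the associated entity. A minimal Python sketch of such a lookup (function names are hypothetical; Honas implements this in C):

```python
import ipaddress
import json

def load_subnet_activity(json_text):
    """Parse the subnet_activity JSON into (network, entity) pairs."""
    mapping = []
    for entry in json.loads(json_text)["subnet_activity"]:
        for prefix in entry["prefixes"]:
            # Each prefix is a one-item object: { "<address>": <length> }
            (addr, length), = prefix.items()
            mapping.append((ipaddress.ip_network(f"{addr}/{length}"), entry["entity"]))
    # Longest prefix first, so the most specific match wins.
    mapping.sort(key=lambda pair: pair[0].prefixlen, reverse=True)
    return mapping

def match_entity(mapping, source_ip):
    """Return the entity for the most specific matching prefix, or None."""
    ip = ipaddress.ip_address(source_ip)
    for network, entity in mapping:
        if ip.version == network.version and ip in network:
            return entity
    return None
```

With the example file above, a query from `145.220.1.1` would be attributed to `netSURF` (the /16 beats SURFnet's /8), and one from `145.1.2.3` to `SURFnet`.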
Manually crafting this JSON file can be time consuming. Therefore, we built a
script that generates this file from a CSV file containing prefixes and entities.
This script is located in the `scripts` directory and is called `subnet_definitions_generator.py`.
The script can be used as follows.
```
usage: subnet_definitions_generator.py [-h] -r INPUT_FILE [-w OUTPUT_FILE] [-v]

Honas subnet definitions file generator

optional arguments:
  -h, --help      show this help message and exit
  -r INPUT_FILE   Input file containing entity-prefix mappings
  -w OUTPUT_FILE  Output file receiving the JSON mappings
  -v              Verbose output
```
The INPUT_FILE is the entity-prefix mapping file in CSV format. The OUTPUT_FILE receives
the JSON output that can be loaded into `honas-gather`. If no output file is specified,
the JSON is written to stdout. Note that the entity-prefix mappings may change over time.
When the entity-prefix mappings change, the configuration file must be regenerated. We
developed a script that checks for differences between CRM files over time and regenerates
the JSON configuration file if required. This configuration file is then automatically
reloaded by `honas-gather` once the `period_length` interval passes. The script is called
`crm_diff.py` and is located in the `scripts` directory. It can be executed
automatically on a daily basis using cron. An example configuration is shown below.

```
# Checking for CRM changes, to keep the subnet activity configuration up to date.
0 7 * * * /home/gijs/honas/scripts/crm_diff.py -d ~crm-sync/ -v
```
The parsed host name lookups are saved to Honas state files. There are multiple state files, one for each period of time.
The finished state files are named after their period begin timestamp (the time
they were created) in UTC, like `2017-12-06T11:50:21.hs`. This ensures that
simply sorting the state files alphabetically also orders them correctly in
time.
The period ends on an exact multiple of the period duration since the unix epoch time. This ensures that results of multiple servers with the same period duration can be grouped together based on their period end time.
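As an illustration, the period alignment and file naming described above can be sketched as follows (a sketch of the behaviour, not the actual implementation):

```python
from datetime import datetime, timezone

def period_end(creation_time, period_length):
    # The period ends on the next exact multiple of the period length
    # (in seconds) since the Unix epoch.
    return ((creation_time // period_length) + 1) * period_length

def state_file_name(creation_time):
    # Finished state files are named after their period begin timestamp in UTC.
    begin = datetime.fromtimestamp(creation_time, tz=timezone.utc)
    return begin.strftime("%Y-%m-%dT%H:%M:%S") + ".hs"
```

Because the end time is an epoch-aligned multiple of the period length, two servers using the same `period_length` produce states with identical end times, which is what allows their results to be grouped.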
Each honas state file contains all the relevant metadata in the state file header; no separate configuration or other knowledge is necessary to interpret the results contained within that file.
This includes:

- Period begin and end timestamps
- The configuration of the bloom filters:
  - Number of bloom filters
  - Number of bits per bloom filter
  - Number of hashes per bloom filter
  - Number of bloom filters per client
- Statistical information:
  - First request
  - Last request
  - Number of requests
  - Estimated number of distinct clients
  - Estimated number of distinct host names
All state files have a version number that should follow Semantic Versioning rules.
Each state file contains a number of bloom filters. Host name lookups are registered based on a SHA-256 hash of that host name and of all parent domains, except the TLD itself. To check whether a host name might have been looked up, only the SHA-256 hash of the host name to be checked has to be provided.
Different clients will update a different subset of the bloom filters in the state file. The number of bloom filters a host name hash is found in can be used in combination with the number of bloom filters and the number of bloom filters per client to make very coarse grained estimations about the number of different clients that requested that host name.
Each state file also contains two HyperLogLog data structures. These are used to provide estimations of the number of distinct clients being active and the number of distinct host name hashes registered across all bloom filters.
These values can be used as a basis for optimally configuring the bloom filters based on historical data.
The HyperLogLog data structures are embedded in the state file to allow state files to be saved and loaded multiple times while they are being filled. When the state files are written to disk, the estimates are written to the state file header.
Search jobs are JSON-encoded data structures that can be used to get search
results from a state file using `honas-search`.
```
{
    "groups": [
        {
            "id": <integer>,
            "hostnames": {
                "<key>": "<hex encoded SHA-256 host name hash>",
                ...
            }
        },
        ...
    ]
}
```
- The most minimal valid search job specification is: `{}`
- The group `"id"` is optional and (currently) must be numeric if present
- The host name `"key"` can be anything, but it is recommended to keep it unique within that `"hostnames"` object (though the current implementation doesn't care).
```json
{
    "groups": [
        {
            "id": 1,
            "hostnames": {
                "www.java.com": "dfefecff1f2e77bfef84ef74920e77c23c811dd70df0b3b281521814e85c00ee"
            }
        }
    ]
}
```
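A search job like the one above can also be generated programmatically. A minimal Python sketch (the helper name is ours; the trailing-dot stripping mirrors the gather side):

```python
import hashlib
import json

def make_search_job(hostnames, group_id=1):
    """Build a honas-search job with one group of hex-encoded SHA-256 hashes.

    The key can be anything; using the host name itself keeps the
    results readable.
    """
    hashes = {
        name: hashlib.sha256(name.rstrip(".").encode("ascii")).hexdigest()
        for name in hostnames
    }
    return json.dumps({"groups": [{"id": group_id, "hostnames": hashes}]})
```

The resulting document can be piped straight into `honas-search` on stdin.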
The search results are JSON-encoded data structures generated by `honas-search` based upon a search job.
```
{
    "node_version": "<version string>",
    "state_file_version": "<version string>",
    "period_begin": <integer>,
    "first_request": <integer>,
    "last_request": <integer>,
    "period_end": <integer>,
    "estimated_number_of_clients": <integer>,
    "estimated_number_of_host_names": <integer>,
    "number_of_requests": <integer>,
    "number_of_filters": <integer>,
    "number_of_filters_per_user": <integer>,
    "number_of_hashes": <integer>,
    "number_of_bits_per_filter": <integer>,
    "flatten_threshold": <integer>,
    "filters" : [
        {
            "number_of_bits_set": <integer>,
            "estimated_number_of_host_names": <integer>,
            "actual_false_positive_rate": <string>
        },
        ...
    ],
    "flattened_results": <boolean>,
    "groups" : [
        {
            "id": <integer>,
            "hostnames": {
                "www.java.com": <integer>,
                ...
            },
            "hits_by_all_hostnames": <integer>
        },
        ...
    ]
}
```
- The fields `period_begin`, `first_request`, `last_request` and `period_end` are in seconds since the Unix epoch.
- All integer values can be represented as unsigned integers; no negative values are to be expected.
- If the search job contained no `groups` item then the results will also have no `groups` and no `flattened_results` items. Conversely, if the search job had a `groups` item, then the result will always have `groups` and `flattened_results` items, even if there were no hits for any of the host name hashes.
- If the search job group had no `id` item then the results will also have no `id` item.
- Host name hashes that didn't have enough hits (at least `number_of_filters_per_user`) will not be present in the result. The number reported is the number of filters the host name was probably present in.
- Groups for which none of the host names had enough hits will not be present in the result.
- The `hits_by_all_hostnames` field tells the number of filters that probably contained all the host name hashes in that group.
- All the other information fields are always included in the result, even if none of the queried host name hashes was found.
- If it was deemed that not enough distinct clients were active in this period (`estimated_number_of_clients` is less than the `flatten_threshold`) then the value of `flattened_results` will be true and all the hit counts will be at most 1.
- The `actual_false_positive_rate` provides an estimation of the actual false positive rate. This estimation depends on the fill rate (`number_of_bits_set`) of the Bloom filter in its current state. The value is stored as a string, but contains a double value.
```json
{
    "number_of_filters" : 1,
    "number_of_filters_per_user" : 1,
    "flattened_results" : false,
    "number_of_requests" : 175727076,
    "estimated_number_of_clients" : 14287,
    "period_end" : 1532127600,
    "number_of_hashes" : 10,
    "node_version" : "1.0.0",
    "flatten_threshold" : 1,
    "first_request" : 1532041245,
    "estimated_number_of_host_names" : 162956933,
    "groups" : [
        {
            "hits_by_all_hostnames" : 1,
            "id" : 1,
            "hostnames" : {
                "www.java.com" : 1
            }
        }
    ],
    "last_request" : 1532127645,
    "number_of_bits_per_filter" : 491040000,
    "filters" : [
        {
            "actual_false_positive_rate" : "0.0000079791",
            "estimated_number_of_host_names" : 18161591,
            "number_of_bits_set" : 151814229
        }
    ],
    "state_file_version" : "1.0",
    "period_begin" : 1532041245
}
```
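Processing such a result programmatically is straightforward. A small sketch that collects the reported hit counts per group (the helper name is ours):

```python
import json

def summarize_hits(result_json):
    """Return (group id, host name key, hit count) tuples from a search result.

    Hashes below number_of_filters_per_user hits are already absent from
    the result, so every count listed here is a (possible) hit.
    """
    result = json.loads(result_json)
    hits = []
    for group in result.get("groups", []):   # absent when the job had no groups
        for key, count in group["hostnames"].items():
            hits.append((group.get("id"), key, count))
    return hits
```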
Manually crafting a JSON search job as discussed above takes time. Therefore, we
introduced a script for testing a generic domain name blacklist against a
Honas state file. The script is located in the `scripts` directory and is
called `query_generic_blacklist.py`. Usage information is as follows.
```
usage: query_generic_blacklist.py [-h] -b BLACKLIST_FILE -s STATE_FILE -e ENTITY_FILE [-v]

Query tool for generic domain blacklists

optional arguments:
  -h, --help         show this help message and exit
  -b BLACKLIST_FILE  Input blacklist file containing domain names
  -s STATE_FILE      Honas state file to query
  -e ENTITY_FILE     Input file containing all stored entities
  -v                 Verbose output
```
Here the domain name blacklist is BLACKLIST_FILE, the state file to test
against is STATE_FILE, and ENTITY_FILE is a list of all possible entities that
could be stored in the Bloom filters. This information is used to generate all
queries of the form `<possible_entity>@<blacklisted_domain_name>`. Note that
the entity file must be updated when organizations are added or removed. The
number of queries that the script generates and tests is the Cartesian product
of the input blacklist and the entity file. The script outputs the JSON file
with results from the Honas search application. An example is given below.

```
~/honas/scripts/query_generic_blacklist.py -b example_blacklist.txt -s /data/mm-dd-yyyy/yyyy-mm-dd.hs -e possible_entities.csv
```
The example query above resulted in the JSON file `yyyy-mm-dd.hs.json`.
This JSON file contains all domain names that tested positive against the
Bloom filter in the specified state file. Suppose we want an overview of
which unique entities are responsible for the DNS queries that were performed.
To do so, we first use the `jq` command to remove all irrelevant JSON content.
We then mangle the output (one example way to do so is shown below) to get CSV
output listing all unique organizations and how many unique DNS queries came
from their network.

```
jq .groups[0].hostnames 2018-09-13.hs.json | grep @ | sed 's/ \"//' | sed 's/}//g' | sed 's/{//g' | sed 's/@.*//' | sed 's/\": 1//' | sed 's/,//' | grep -Ev "^$" | sort | uniq -c | sed 's/ //g' | sed -r 's/\s+/,/'
```
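The same per-entity aggregation can be done without the sed chain. A hedged Python sketch, assuming result keys of the form `<entity>@<domain>` as produced by `query_generic_blacklist.py` (the function name is ours):

```python
import json
from collections import Counter

def count_queries_per_entity(result_path):
    """Count unique hit host names per entity in a search result file."""
    with open(result_path) as handle:
        result = json.load(handle)
    counts = Counter()
    for group in result.get("groups", []):
        for key in group["hostnames"]:
            if "@" in key:                    # only entity-prefixed entries
                entity, _, _domain = key.partition("@")
                counts[entity] += 1
    return counts
```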
We now need to aggregate the retrieved information by entity type. We assume
that the entity types are available in a CSV file, used as input for the
`entities_to_sector.py` script. An example command is shown below.

```
~/honas/scripts/entities_to_sector.py -r <output_from_jq_command> -m <csv_file_with_entity_types> -w results_sectorized.csv
```
The script above results in a CSV output file containing the cumulative number of occurrences in the Bloom filter, aggregated per entity type. An example is given below.
```
sector,count
Type_A,9
Type_B,46
Type_C,31
Type_D,1
Type_E,15
.....
```
More examples are given in the real-world use cases, where actual real-world problems at a large network operator are discussed.
The testing of blacklists against Bloom filters is a simple process that can
be automated. To do so, we developed a script called `query_all_scenarios.py`.
This script is located in the `scripts` directory and can be used to automatically
check a set of blacklists against all Honas state files that have not been checked
yet. An example cron configuration is given below.

```
# Daily query scripts for all scenarios currently in place.
0 7 * * * ~/honas/scripts/query_all_scenarios.py
```

This script requires a configuration file containing the necessary information about
the scenarios it should perform queries for. This configuration file looks as follows
and should be placed in the Honas `etc` directory (by default `/etc/honas/`). The
configuration file must be named `periodic_search_jobs.conf`.
```json
{
    "entities_file" : "/etc/honas/entities_out.csv",
    "searchjobs" :
    [
        {
            "name" : "Booters",
            "blacklist" : "/var/experiments/booters/booterblacklist.csv",
            "result_directory" : "/var/experiments/booters/searchresults"
        },
        {
            "name" : "National Detection Network",
            "blacklist" : "/var/experiments/ndn/misp_blacklist-2018-09-27.txt",
            "result_directory" : "/var/experiments/ndn/searchresults"
        }
    ]
}
```
For each search job, a friendly name should be specified. Furthermore, the location of the blacklist and the location of the search result files should be specified. The script will query all state files for which no search result file exists in the target directory.
The `honas-gather` program is responsible for parsing all the host name
lookup logging, collecting the information in a honas state and writing these
states to separate files, one for each time period.

```
Usage: honas-gather [--help] [--config <file>]
  -h|--help           Show this message
  -c|--config <file>  Load config from file instead of /etc/honas/gather.conf
  -q|--quiet          Be more quiet (can be used multiple times)
  -s|--syslog         Log messages to syslog
  -v|--verbose        Be more verbose (can be used multiple times)
  -f|--fork           Fork the process as daemon (syslog must be enabled)
  -a|--aggregate      Aggregates queries by subnet per filter (predefined subnets)
  -d|--dry-run        Performs measurements and gives advice about Bloom filter configuration
```
The Honas gather process needs a number of configuration options to be specified before it can start processing host name lookups and writing honas state files.
The default location of the configuration file, used when none is supplied on the command line, can be changed through a build configuration option.
The supported configuration items, which are all required, are:

- `bloomfilter_path`: The directory where honas state files will be written
- `subnet_activity_path`: The input file used to perform entity-prefix mappings
- `period_length`: The maximum period length for each state file in seconds
- `number_of_bits_per_filter`: How many bits each bloom filter should have
- `number_of_filters`: How many bloom filters there should be per state file. This number should be `1`, as this is how the prototype was extended
- `number_of_hashes`: How many bits should be set per filter per looked-up host name
- `number_of_filters_per_user`: How many bloom filters to update for each host name lookup per client

Note: the configuration file is reloaded every `period_length` seconds. Therefore, the Honas gather process does not have to be restarted to change the Bloom filter parameters.
```
bloomfilter_path /var/spool/honas/
subnet_activity_path /etc/honas/subnet_definitions.json
period_length 3600
number_of_bits_per_filter 491040000
number_of_filters 1
number_of_filters_per_user 1
number_of_hashes 10
flatten_threshold 1
```
The Honas prototype uses Bloom filters, which have a fixed size. This size must be estimated before storing DNS queries, so that the false positive rate of the Bloom filters can be controlled. To this end, the extended prototype features a dry-run mode. It can be enabled using the `-d` switch, and it provides daily advice on suggested Bloom filter parameters, given the current query stream. It is strongly suggested to perform a dry run for at least a week, using a query stream that is representative of the actual use of the network. An example advice is given below.
```
------------------------------------ Advice ------------------------------------
[10-08-2018 13:32] The numbers are rounded up to the nearest hundred-thousand, and a tolerance of 10 percent is added.
-------------------------------- Hourly Filters --------------------------------
[10-08-2018 13:32] For a false positive rate of 1 / 1000, BF size (m) should be 40810000, based on 2578756 unique domain names
[10-08-2018 13:32] The number of hash functions (k) should be 10
[10-08-2018 13:32] For a false positive rate of 1 / 10000, BF size (m) should be 54450000, based on 2578756 unique domain names
[10-08-2018 13:32] The number of hash functions (k) should be 14
[10-08-2018 13:32] For a false positive rate of 1 / 100000, BF size (m) should be 67980000, based on 2578756 unique domain names
[10-08-2018 13:32] The number of hash functions (k) should be 16
-------------------------------- Daily Filters ---------------------------------
[10-08-2018 13:32] For a false positive rate of 1 / 1000, BF size (m) should be 305250000, based on 19300734 unique domain names
[10-08-2018 13:32] The number of hash functions (k) should be 10
[10-08-2018 13:32] For a false positive rate of 1 / 10000, BF size (m) should be 407000000, based on 19300734 unique domain names
[10-08-2018 13:32] The number of hash functions (k) should be 14
[10-08-2018 13:32] For a false positive rate of 1 / 100000, BF size (m) should be 508750000, based on 19300734 unique domain names
[10-08-2018 13:32] The number of hash functions (k) should be 16
-------------------------------------- End -------------------------------------
```
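The advised sizes follow the textbook Bloom filter dimensioning formulas, m = -n·ln(p)/ln(2)² and k = -log₂(p), with the stated 10 percent tolerance and round-up. A Python sketch (the function is ours; its rounding approximates, but does not exactly reproduce, the numbers Honas prints, in particular the hash counts at low false positive rates):

```python
import math

def advise_bloom_parameters(unique_names, false_positive_rate,
                            tolerance=0.10, round_to=100_000):
    """Suggest Bloom filter size m (bits) and hash count k.

    m = -n * ln(p) / ln(2)^2, then add the tolerance and round up
    to the nearest `round_to`; k = ceil(-log2(p)).
    """
    m = -unique_names * math.log(false_positive_rate) / math.log(2) ** 2
    m = math.ceil(m * (1 + tolerance) / round_to) * round_to
    k = math.ceil(-math.log2(false_positive_rate))
    return m, k
```

For 2578756 unique domain names and p = 1/1000 this yields m ≈ 40.8 million bits and k = 10, in line with the hourly advice above.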
The Bloom filters are stored in state files by Honas in the `bloomfilter_path` at
the end of every `period_length`. A set of scripts is shipped with Honas to rotate
and merge these state files, located in the `scripts` directory. The `honas_state_rotate.py`
script looks for state files in the `bloomfilter_path` to rotate and moves them
to a more structured directory. The `honas_daily_state_combine.py` script
then aggregates (merges) the Bloom filters in the state files. For example, states
that contain DNS queries for one hour can be combined into a state that contains
all queries for that day. It is possible to automate this process using cron. An
example configuration is shown below.

```
# Honas state rotation and merging scripts. Note the 'cd /data' command in
# front of the 'honas_daily_state_combine' call. This command is required
# because the combine script needs to be executed with the data archive
# directory as working directory.
0 4 * * * ~/honas/scripts/honas_state_rotate.py -v
0 6 * * * cd /data && ~/honas/scripts/honas_daily_state_combine.py
```

Note: Honas state timestamps are handled in UTC. Take a possible time difference into
account when installing the rotation scripts using cron. Furthermore, the scripts are
fairly verbose in their output; this output is written to syslog by default. Finally,
the daily state combination script requires the working directory to be the data
archive directory, because `honas-combine` requires this.
The `honas-search` program is responsible for generating a search result based
on a search job and a honas state file.

```
Usage: honas-search [<options>] <state-file>
Options:
  -h|--help            Show this message
  -j|--job <file>      File containing the search job (default: stdin)
  -r|--result <file>   File to which the results will be saved (default: stdout)
  -f|--flatten-threshold <clients>
                       If fewer than this amount of clients have been seen then
                       flatten the results (default: never flatten)
  -q|--quiet           Be more quiet (can be used multiple times)
  -s|--syslog          Log messages to syslog
  -v|--verbose         Be more verbose (can be used multiple times)
```
```
$ echo '{"groups":[{"id":1,"hostnames":{"www.java.com":"dfefecff1f2e77bfef84ef74920e77c23c811dd70df0b3b281521814e85c00ee"}}]}' | honas-search '/data/20-07-2018/2018-07-20.hs' | json_pp
{
    "number_of_filters" : 1,
    "number_of_filters_per_user" : 1,
    "flattened_results" : false,
    "number_of_requests" : 175727076,
    "estimated_number_of_clients" : 14287,
    "period_end" : 1532127600,
    "number_of_hashes" : 10,
    "node_version" : "1.0.0",
    "flatten_threshold" : 1,
    "first_request" : 1532041245,
    "estimated_number_of_host_names" : 162956933,
    "groups" : [
        {
            "hits_by_all_hostnames" : 1,
            "id" : 1,
            "hostnames" : {
                "www.java.com" : 1
            }
        }
    ],
    "last_request" : 1532127645,
    "number_of_bits_per_filter" : 491040000,
    "filters" : [
        {
            "actual_false_positive_rate" : "0.0000079791",
            "estimated_number_of_host_names" : 18161591,
            "number_of_bits_set" : 151814229
        }
    ],
    "state_file_version" : "1.0",
    "period_begin" : 1532041245
}
```
Note that the host name key 'www.java.com' was chosen purely for this example (to show which host name hash was being queried); the actual key values aren't interpreted by the search process and are simply used verbatim for reporting possible results. One could also use database IDs or simple incremental indexes, depending on what is easiest to process.
The `honas-info` program is mainly intended to get some basic information in a
human-readable format.

```
Usage: honas-info [<options>] <state-file>
Options:
  -h|--help      Show this message
  -q|--quiet     Be more quiet (can be used multiple times)
  -v|--verbose   Be more verbose (can be used multiple times)
  -p|--plotmode  Output timestamp and number of hostnames as CSV
```
```
$ honas-info '2018-07-01.hs'
## Version information ##
Node version      : 1.0.0
State file version: 1.0

## Period information ##
Period begin : 2018-07-01T01:00:38
First request: 2018-07-01T01:00:38
Last request : 2018-07-02T01:00:38
Period end   : 2018-07-02T01:00:00
Estimated number of clients   : 10163
Estimated number of host names: 128019870
Number of requests            : 155622232

## Filter configuration ##
Number of filters         : 1
Number of filters per user: 1
Number of hashes          : 10
Number of bits per filter : 491040000
Flatten threshold         : 1

## Filter information ##
1. Number of bits set: 118252441 (Estimated number of host names: 13528981)
   Fill Rate: 0.2408203833 (False positive probability: 0.0000006560)
```
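The last two reported numbers follow directly from the filter contents: the fill rate is the fraction of bits set, and the false positive probability is that fraction raised to the number of hash functions, since a lookup only reports a false positive when every probed bit happens to be set. A small Python sketch reproducing the figures above (the helper name is ours):

```python
def filter_statistics(bits_set, bits_per_filter, hashes):
    """Reproduce the honas-info fill rate and false positive probability."""
    fill_rate = bits_set / bits_per_filter
    false_positive_probability = fill_rate ** hashes
    return fill_rate, false_positive_probability
```

With the values shown above (118252441 bits set out of 491040000, 10 hashes) this gives a fill rate of about 0.2408 and a false positive probability of about 6.56e-7.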
The `honas-combine` program aggregates the Bloom filters and other parameters
and structures of two Honas state files. Note that the first parameter is also
the destination state for the aggregation.

```
Usage: honas-combine [<options>] <dst-state-file> <src-state-file>
Options:
  -h|--help     Show this message
  -q|--quiet    Be more quiet (can be used multiple times)
  -v|--verbose  Be more verbose (can be used multiple times)
```
Honas comes with unit tests implemented using Check. All unit tests reside under the
`tests/` directory.
To build and run the unit tests, run from inside the meson build directory:

```
ninja test
```
There are currently two ways of performing static code analysis.
The first is through the Meson-supplied scan-build functionality. Run from inside the meson build directory:

```
ninja scan-build
```

The second is through an additional target that calls clang-tidy on all sources. Run from inside the meson build directory:

```
ninja clang-tidy-all
```
Updates on large bloom filters probably have very bad locality of reference and thus lead to many CPU cache misses and consequent invalidations as the relevant cachelines are loaded into the cache. This not only impacts the performance of Honas itself, but might also adversely impact the performance of other programs.
If the performance of this program is "good enough" and there is a CPU core to spare then it might make sense to disable the CPU cache for accesses to the bloom filter memory. Unfortunately these options, while supported by the Linux kernel, are not available to user space programs. So this would either need a patched kernel, a custom kernel driver or some other dirty trickery (possibly using /dev/mem or similar).
Another option might be to add some kind of LRU cache in front of the bloom filter update process. This should reduce the number of bloom filter updates for frequently requested host names. We'd probably want some form of double LRU cache to better handle "bursts" of random host name lookups.
First indications are that on an i5-4570, with 4 filters of 134217728 bits each and 2 filter updates per user, Honas can process more than 30,000 host name lookups per second. This is expected to be adequate for our use case.
Should it turn out that this is not fast enough then there are some possible options to consider.
The current algorithm for host name hashing, SHA-256, is a safe choice. Using a cryptographic hash function guarantees a good spread of hash values, collision resistance and irreversibility.
But, depending on the hardware, calculating SHA-256 might not be very efficient. Furthermore, the bloom filter offset determination makes multiple passes over all those bits to make as optimal a spread as possible.
By using another hashing algorithm the hashing itself might be a lot faster. Any collision resistant hash with a wide spread should be a reasonable choice.
By using a smaller hash value (or perhaps by limiting ourselves to a part of the hash value) the transformations can be made cheaper.
Before considering these radical steps it's very important to verify that these
parts are the biggest bottlenecks, as it's very likely that cache thrashing and
main memory access are the culprit. The easiest way to verify this is to run
`honas-gather` under `perf record` and inspect the generated `perf.data` using
`perf report`.
Should the CPU usage be due to cache misses, then see above.
The National Detection Network is a community run by the NCSC (National Cyber Security Centre) in the Netherlands. Security intelligence is shared in this community, in the form of Indicators of Compromise (IoC). These IoCs often contain domain names, and hence can be tested against the Bloom filters. The IoCs are shared in a MISP (Malware Information Sharing Platform), which enables automated blacklist downloading and IoC lookup. However, this requires some scripting for Honas as well. Therefore, we built a set of scripts to help with these tasks.
The script `ndn_download_blacklist.py` downloads a domain name blacklist containing
all domain name attributes in IoCs. This blacklist can then be tested against the
Bloom filters using the `query_generic_blacklist.py` script we discussed above. The
script `ndn_misp_lookup.py` takes a domain name that was found in a Bloom filter and
looks up the associated IoC(s) in the MISP. It provides the threat level and TLP tags
in the output, to allow security officials to decide more accurately on a follow-up.
Automated access to the MISP requires an authentication key. The MISP URL and the required
authentication key are expected to be present in a Python file called `keys.py`. An example
keys file can be found in `keys.py.example`. Furthermore, the `scripts` directory contains
a series of undocumented scripts. These scripts were part of the development and validation
process and are not necessarily required in practice; they solely serve reproducibility of
the research results.
We consider an example. Suppose that we want to obtain an overview of all unique domain names that were found to be hit in the Bloom filters. These domain names can then be looked up in the MISP using the lookup script we discussed above. The example command below obtains the overview we want.

```
jq .groups[0].hostnames yyyy-mm-dd.hs.json | grep @ | sed 's/.*@//g' | sed 's/,//g' | sed 's/\": 1//g' | sort | uniq
```

Suppose that we now want to look up all domain names in the MISP, to determine which hits are possibly serious and require targeted investigation or some other follow-up. We use the following script (for example).

```
for dn in `grep @ yyyy-mm-dd.hs.json | sed 's/ //g' | sed 's/\"//g' | sed 's/:1//g' | sed 's/.*@//g' | sed 's/,//g' | sort | uniq`; do ./ndn_misp_lookup.py -s $dn; done
```
If the lookup was successful, the output should look like the following. In this example,
we looked up the domain name `malicious.site.example.com`.

```json
{
    "search_value": "malicious.site.example.com",
    "event_id": "1337",
    "info": "example IoC description",
    "threat_level": "Low",
    "tlp": "white"
}
```
Based on the IoC information, threat level or TLP (Traffic Light Protocol, https://first.org/tlp/), we could distinguish which threats are serious enough for a follow-up. The MISP features more information that is not yet included in the script output. However, this information can easily be added.
Booters are websites where one can buy a DDoS attack; they are also called *web stressers*. Similar
to the National Detection Network use case, the `query_generic_blacklist.py` script can be used to test a
blacklist of Booter domain names against the Bloom filters. Booter occurrences could suggest
that a DDoS attack that happened in the same timeframe was ordered from a Booter website. This
knowledge could help find the attacker more quickly.
During the development and validation of the Honas prototype, we of course had inspiration for more ideas and extensions. However, we did not have enough time to get to all of these. In this section, we describe a series of ideas for future work.
- Automatic rotation of other files, such as ground truth files, if applicable.
- Using URL blacklists in a mail filtering service, rather than IP blacklists. This way, we can find out whether users clicked on links in malicious emails.
- A user interface (UI) that enables network operators to gain insight into threat detection results, find possibly serious threats and create reports over a long period of time. The National Detection Network scripting could be integrated into this UI as well.