OpenVAS 9 Tips for Large Environments

Introduction

The OpenVAS vulnerability scanner (https://www.openvas.org) has a great UI, an up-to-date library of high-quality tests, and permissive licensing. You can purchase a turn-key appliance from Greenbone Networks (https://www.greenbone.net). It is popular among penetration testers and is easy to set up on Kali Linux. Deploying a multi-user OpenVAS system for a large enterprise network presents performance challenges, both when scanning and when reporting. These are some lessons learned from scaling OpenVAS to scan 10,000 nodes.

OpenVAS Structure, Architecture, and Installation

To understand the scaling behavior of OpenVAS, first we need to review the structure of the system. There are three primary daemons: openvassd, gvmd, and gsad. openvassd implements the Nessus Attack Scripting Language (NASL) and uses it to probe the targets. gvmd (called openvasmd in OpenVAS 9 and earlier) orchestrates openvassd, defining scan tasks to run and collecting the results. gsad is a lightweight web front-end to gvmd's XML API, making it easy to click through target definitions and reports.

For large workloads and segmented networks, OpenVAS supports master-slave operation, with one (or more!) master instance(s) requesting scans from many scanner instances. This can be accomplished with remote openvassd daemons directly, or by communication between a master gvmd and a scanner gvmd. The communication between openvassd and gvmd is chatty and has no native authentication, so it's better to use a gvmd-to-gvmd architecture in a distributed deployment. Master servers should run all three daemons, but scanner servers need only run openvassd and gvmd.

OpenVAS is fully multi-tenant. Any number of users can be administrators, who have full control over the features of the software but cannot see or change the data of other administrators. To share data between users, administrators must create permission records that authorize the data sharing. This is also how lower-privileged users gain access to data. Depending on your use case, this security model may be perfect, or it may be a huge administrative hassle. It also has serious performance implications. More on this later.

There are no official binary packages of OpenVAS, and in my opinion, production users should compile from source code. OpenVAS is a complicated system, and compiling from source will ensure you understand the system well enough to troubleshoot it when problems arise. Operators will often need to use strace or gdb to determine the nature of a problem, and compiling from source will give good context for those investigations.
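Each OpenVAS 9 component follows the usual cmake build flow. A minimal sketch, with example version numbers and an example install prefix; build in dependency order (libraries, then scanner, manager, and GSA):

    # repeat for openvas-scanner, openvas-manager, and greenbone-security-assistant
    tar xf openvas-libraries-9.0.3.tar.gz
    cd openvas-libraries-9.0.3
    mkdir build && cd build
    cmake -DCMAKE_INSTALL_PREFIX=/usr ..
    make && sudo make install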
Reporting Performance

Reporting is the first place you'll struggle with OpenVAS performance in a large deployment. Each scan produces 10 to 1,000 line-item results per host, which need to be joined against several other SQL tables to produce a report. This means that SQL performance is the bottleneck that controls the user experience when clicking through the gsad web interface.

First, you certainly should run a distributed master-scanner architecture, even if you only need one scanner node. Scanning and reporting are both highly CPU dependent, and a distributed architecture will protect your reporting performance from the load of scanning.

While the OpenVAS project recommends PostgreSQL for production deployments, experience shows that SQLite is necessary for the best performance in a large environment. This is because gvmd issues hundreds of thousands of SQL queries while displaying a large report, and it does so completely serially. SQL query latency is therefore the dominating factor in reporting performance. PostgreSQL is well known for scaling well under heavy concurrent load, but gvmd never creates a load like that. SQLite has lower latency at these light loads, and therefore performs significantly better than PostgreSQL for OpenVAS.

Single-core performance of the underlying hardware is the next most powerful influence on reporting performance. gvmd uses only one thread per request when calculating reports, so high core counts are unnecessary. Compared to an ordinary Xeon E5-2697v3 VMware ESXi guest VM, bare-metal Linux on a 5.0GHz overclocked i7-8700K system ran OpenVAS reporting workflows approximately twice as fast. Fast hardware is a must for the manager server of a large OpenVAS deployment.

SQLite compiler options can sometimes have a strong impact on the reporting performance of OpenVAS. When running OpenVAS on ESXi, switching from the stock Debian SQLite to upstream SQLite compiled with the CFLAGS used by Clear Linux gave a massive 5x performance increase. Surprisingly, on a 5.0GHz i7-8700K system, the CFLAGS used to compile SQLite had virtually no effect on OpenVAS performance. The systemd unit file option "Environment" can be used to force-load an optimized SQLite via LD_PRELOAD, making it easy to test different CFLAGS settings and determine what works best in a given environment; a sketch of such a drop-in appears at the end of this section. The Clear Linux CFLAGS can be found on benchmark blogs from organizations like Phoronix.

OpenVAS's permissions model is a performance drag in addition to an administrative overhead. In a large environment with many OpenVAS users, it can be beneficial to modify the permissions model to avoid that overhead. A relatively simple patch can change the OpenVAS security model so that all users can read everything without explicit permission records. This reduces the number of SQL queries significantly, for about a 2x performance improvement.

At the end of a scan is a wrap-up phase that takes place on the controlling manager server: all the results are reviewed and summarized into other parts of the database. This causes a huge read IO load, so ensure the manager server has enough RAM to cache the entire SQLite database; 8GB is probably sufficient. It eventually causes a huge write IO load as well, consisting of many sequential writes at a queue depth of 1. It is certainly important to have the manager server on SSD storage, and it's likely that this phase would be improved by running on Intel Optane or Samsung Z-SSD latency-optimized disks.
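As a concrete illustration of the LD_PRELOAD approach, the sketch below builds an upstream SQLite with candidate CFLAGS and preloads it into the manager via a systemd drop-in. The CFLAGS shown, the /opt/sqlite prefix, and the openvas-manager.service unit name are placeholders for whatever your environment actually uses:

    # Build an upstream SQLite (autoconf tarball, version elided) with candidate CFLAGS
    cd sqlite-autoconf-*/
    ./configure --prefix=/opt/sqlite CFLAGS="-O3 -march=native -pipe"
    make && sudo make install

    # Preload it into the manager with a systemd drop-in, e.g.
    # /etc/systemd/system/openvas-manager.service.d/sqlite-preload.conf
    [Service]
    Environment="LD_PRELOAD=/opt/sqlite/lib/libsqlite3.so.0"

    # Then apply:
    #   systemctl daemon-reload && systemctl restart openvas-manager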
Scanning Architecture

The first thing to note about scanning is that reporting performance needs drive the best designs for the scanning workflow. There are several ways to view reports in OpenVAS, but the quickest is to view the scan history for a given task. Therefore, it is important to structure the scan tasks according to anticipated reporting needs. For example, organizing scan tasks according to business unit and system criticality will steer users toward the higher-performance reporting workflows in OpenVAS.

Another way that scan task organization can help performance is by keeping scan tasks small enough. Smaller scan tasks finish sooner, produce fewer results, and therefore feel more responsive when reading the reports. When a task's reports start to approach 50,000 line items, it's a good idea to split that task into two or more smaller tasks.

Have enough scanner nodes. To avoid bloating firewall logs, one scanner node per network zone is a good start. Then add additional nodes to share the load when a given zone's scanner gets too busy; VMs are just fine for scanner nodes.

Scan Performance

The port-scanning phase of a vulnerability scan is dominated by the nmap options used by that task. These can be configured by duplicating the default "Full and Fast" scan config and altering its parameters. Tuning nmap scan performance is a balancing act between accuracy and speed that is discussed in many places; the Nmap book's performance chapter is a great place to start. OpenVAS supports "network-level scanning," but that feature is poorly supported, reduces scan accuracy, and does not improve performance. Stick with the default one-host-at-a-time port scanning and tune the nmap parameters as appropriate.

The testing phase of a vulnerability scan is dominated by the amount of parallelism configured in the task options. With enough target hosts available, openvassd will spawn up to host_limit x test_limit concurrent test processes. The maximum practical parallelism is eventually limited by the single-threaded redis server used to store intermediate results. That limit is about 16 cores in a scanner node, and a practical maximum parallelism is about 16 hosts and 4 tests per host. Most NASL tests are not CPU-intensive, so there is a benefit to running parallelism higher than the scanner CPU count. A small openvassd patch can improve scanner CPU utilization by loosening tight retry spin loops during connection attempts: openvassd calls poll() with a zero timeout in many places, and raising that timeout to 2ms allows the process to sleep, wasting less CPU time.

Future Work

Reporting performance can be further improved by using a higher-performance SQLite replacement such as LiteTree. Initial testing showed a 1.5x to 2x performance improvement compared to standard SQLite on a 5.0GHz i7-8700K system. Further testing is needed to validate the reliability and long-term performance of LiteTree.

Since the maximum effective parallelism of a scan task is limited by the single-threaded nature of redis, improving redis performance may allow more parallelism. Bare-metal systems with high clock speeds and memory frequencies can post redis benchmark numbers double or more those of an ordinary virtualized server. Further testing is needed to determine whether that translates into the ability to utilize more than 16 cores in an OpenVAS scanner node (see the redis-benchmark sketch at the end of this section).

Further testing is also needed to determine which hardware specifications have the most effect on SQLite and redis performance for OpenVAS. High clock speed, high memory frequency, and tight memory timings usually all come in the same package, but detailed testing could reveal which of those settings is most important and allow further tuning.

Database maintenance can affect reporting performance; often a gvmd job will be seen consuming a full CPU core running queries like "delete from report_counts where ..." that never seem to finish. report_counts is a cache that improves performance of the dashboard display, so it is not critical to prune its data so carefully. A small operational improvement can be made by scheduling a cron job that runs "echo 'delete from report_counts;' | sqlite3 /path/to/tasks.db", taking advantage of SQLite's optimization for delete queries without where clauses. Further improvement may be available by removing the "where" clause in the gvmd code, to simply truncate the table every time.
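A sketch of that cron job as a cron.d-style entry run as root; the schedule and the database path are placeholders:

    # /etc/cron.d/openvas-report-counts
    # Nightly at 03:15: empty the report_counts cache using the fast no-WHERE delete path
    15 3 * * * root echo 'delete from report_counts;' | sqlite3 /path/to/tasks.db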
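For the redis hardware comparisons above, redis ships a small benchmarking tool that makes it easy to compare candidate scanner hardware. A sketch, with an example request count and client count:

    # run against a local redis instance on each candidate system and compare the ops/sec numbers
    redis-benchmark -q -n 100000 -c 4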
openvassd 6 (part of OpenVAS 10) does not seem to have the poll() problem, perhaps because of the refactoring that took place.

gsad 8 (part of OpenVAS 10) has been completely rewritten in ReactJS. This shifts much of the reporting burden onto the client browser, which may make it harder to get good responsiveness on large result sets. More testing is needed.

gvmd 9 (currently in development) has dropped SQLite support. Without extensive database improvements, this seems likely to cause serious performance issues. More testing is needed.