Monitoring Corda Nodes (Part 1)

April 16, 2018


With the launch of Corda 3.0 bringing production readiness and with several projects going into production on Corda in the near future, it’s time to think about some of the issues around running Corda in real production networks. In this post we will look at one set of options for monitoring a Corda node.

For the purposes of this post, we will monitor a single Corda 3.0 node in a standard Ubuntu 17.10 virtual machine running on Microsoft Azure.

We will use Datadog as the metrics backend and dashboard solution, but the metrics and export mechanisms described here can be used with any common monitoring solution. (Disclaimer: Datadog is a paid-for service; we list open source alternatives at the end.) We will start by setting up collection in Part 1 and cover specific metrics, dashboards and alerting in Part 2.

Setting up Datadog

First we need to install and configure the Datadog agent on the Ubuntu virtual machine. Log in to your Datadog account at app.datadoghq.com and click Integrations>Agent, then Ubuntu in the menu. Copy the one-step install command in the form shown below (with your specific API key) and paste it into your Azure VM shell:

sudo DD_API_KEY=YOUR_API_KEY bash -c "$(curl -L https://raw.githubusercontent.com/DataDog/datadog-agent/master/cmd/agent/install_script.sh)"

This will install and start the agent, which should begin returning metrics to the Datadog application straight away. Check that everything is working by running the agent status command:

sudo datadog-agent status

If everything is working OK you should see a set of status information printed to the shell.

By default, this gives you a comprehensive set of metrics from the virtual machine “host”, including measurements of CPU, memory, disk, file handling, IO, load, network and uptime. We will come back to these later when we set up the monitoring dashboard.

Next we want to set up process level monitoring for the Corda applications themselves. The process agent is not on by default. We need to enable it in the datadog.yaml configuration file:

sudo nano /etc/datadog-agent/datadog.yaml

and edit the Process agent setting to true:

# Process agent specific settings
process_config:
  enabled: "true"

We also need to tell the process agent what processes we want to monitor. Create (or modify) the following file:

sudo nano /etc/datadog-agent/conf.d/process.yaml

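Add an entry for the Corda process to this file. Note that exact_match should be set to False: a Corda node runs as a java process, so the search string has to be matched against the command line rather than against the exact process name.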
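To illustrate the exact_match note, here is a simplified model of the two matching modes as we understand them from the check's documented behaviour. It is not the Datadog agent's code; the function name `matches` and the sample process entry are assumptions for illustration only.

```python
def matches(process_name, command_line, search_string, exact_match):
    """Simplified model of process matching (not the Datadog agent's code)."""
    if exact_match:
        # Exact mode: compare against the process name only.
        return process_name == search_string
    # Substring mode: look for the search string anywhere in the command line.
    return search_string in command_line


if __name__ == "__main__":
    # Hypothetical process entry for a node started with `java -jar corda.jar`.
    name = "java"
    cmdline = "java -jar corda.jar"
    for exact in (True, False):
        found = matches(name, cmdline, "corda.jar", exact_match=exact)
        print(f"exact_match={exact}: found={found}")
```

Because the process name is just java, only the substring mode finds the node when searching for corda.jar.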

Next you will want to set up a TCP check to monitor the ports that your Corda node communicates on. These are typically 10002 for peer-to-peer node communication and 10003 for RPC to the node, and are specified in your node.conf file. Typically you would also want to monitor these ports from a remote location to check that your counterparties can reach your node. One way to do this is to use an uptime service such as Pingdom; we will set up an example of this in Part 2.

Additionally you can use this functionality to monitor the Corda network services your node depends on, for example counterparties such as Business Network Operator nodes.
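As a rough illustration of what a TCP check does, the sketch below simply tries to open a connection to each port and reports whether it succeeded. It is not the Datadog agent's implementation; the function name `port_is_open`, the `CORDA_PORTS` mapping and the use of localhost are assumptions for illustration, with the port numbers taken from the typical values mentioned above.

```python
import socket

# Hypothetical mapping of check names to (host, port). The ports are the
# typical values mentioned above and should match your node.conf.
CORDA_PORTS = {
    "corda_p2p": ("localhost", 10002),
    "corda_rpc": ("localhost", 10003),
}


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


if __name__ == "__main__":
    for name, (host, port) in CORDA_PORTS.items():
        state = "UP" if port_is_open(host, port) else "DOWN"
        print(f"{name} {host}:{port} {state}")
```

Running the same script from a machine outside your network, with the node's public address in place of localhost, gives a quick manual version of the remote check described above.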

Enable TCP checks by creating (or modifying) /etc/datadog-agent/conf.d/tcp_check.yaml and adding an entry for each host and port you want to check.

Now all we need to do is restart the Datadog agent and check the status messages to see if these services are up and running:

sudo service datadog-agent restart
sudo datadog-agent status
...    
process
-------
Total Runs: 1
Metrics: 15, Total Metrics: 15
Events: 0, Total Events: 0
Service Checks: 1, Total Service Checks: 1
tcp_check
---------
Total Runs: 1
Metrics: 0, Total Metrics: 0
Events: 0, Total Events: 0
Service Checks: 1, Total Service Checks: 1
...

Setting up JVM and Corda metrics logging with Jolokia and JMX

The next step is to configure metrics export from the JVM and the Corda application itself. Both expose their metrics through the JMX framework.

By default, JMX uses Java serialisation, which can be considered insecure when used with untrusted data and is therefore disabled in Corda. Instead, we use Jolokia, which exposes JMX over HTTP, to collect JVM and Corda metrics. This also gives us a more generic solution when combined with Telegraf (see below).

Download the Jolokia JVM agent to your VM from https://jolokia.org/download.html (make sure to get the JVM agent).

Then run the following command line to start Corda with Jolokia:

java -jar -javaagent:./jolokia-jvm-1.5.0-agent.jar=port=7777,host=localhost corda.jar

After a few seconds the Jolokia agent should start serving metrics at the specified endpoints. Test this is running and working correctly by listing all the available metrics:

curl http://localhost:7777/jolokia/list

To see a specific metric read the appropriate endpoint:

curl http://localhost:7777/jolokia/read/java.lang:type=Memory/HeapMemoryUsage/used
{"request":{"path":"used","mbean":"java.lang:type=Memory","attribute":"HeapMemoryUsage","type":"read"},"value":37096952,"timestamp":1522766757,"status":200}
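If you want to read these endpoints from a script rather than with curl, a minimal sketch might look like the following. The helper names (`build_read_url`, `extract_value`, `read_metric`) are our own, and the base URL assumes the port and host passed to the Jolokia agent above; the sample response is the one shown in the curl output.

```python
import json
import urllib.request

# Assumed base URL: the host and port passed to the Jolokia agent above.
JOLOKIA_URL = "http://localhost:7777/jolokia"


def build_read_url(mbean, attribute, path=None, base_url=JOLOKIA_URL):
    """Build a read URL of the form <base>/read/<mbean>/<attribute>[/<path>].

    MBean names containing '/' need Jolokia's escaping, which this sketch
    does not handle.
    """
    parts = [base_url.rstrip("/"), "read", mbean, attribute]
    if path:
        parts.append(path)
    return "/".join(parts)


def extract_value(body):
    """Return the "value" field of a Jolokia response, raising on errors."""
    response = json.loads(body)
    if response.get("status") != 200:
        raise RuntimeError(f"Jolokia returned status {response.get('status')}")
    return response["value"]


def read_metric(mbean, attribute, path=None, timeout=5.0):
    """Fetch one metric from the running agent (the node must be up)."""
    url = build_read_url(mbean, attribute, path)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return extract_value(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Sample response copied from the curl output above.
    sample = ('{"request":{"path":"used","mbean":"java.lang:type=Memory",'
              '"attribute":"HeapMemoryUsage","type":"read"},'
              '"value":37096952,"timestamp":1522766757,"status":200}')
    used_bytes = extract_value(sample)
    print(f"heap used: {used_bytes / 1024 / 1024:.1f} MiB")
```

Against a live node, `read_metric("java.lang:type=Memory", "HeapMemoryUsage", "used")` requests the same endpoint as the curl command.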

Setting up Telegraf

The final step is to set up a metrics collection agent called Telegraf to collect the Jolokia metrics and forward them to the Datadog agent. Telegraf integrates with dozens of time-series databases, metrics backends and dashboard solutions, so if you don’t want to use Datadog as described above, you can configure Telegraf to send metrics to most common monitoring systems instead (see below for some suggestions).

Download and install Telegraf on your Azure VM from https://portal.influxdata.com/downloads.

Now we need to configure Telegraf to output data to the Datadog agent. Edit the configuration file:

sudo nano /etc/telegraf/telegraf.conf

Make sure the Datadog output section is present in this file, with your API key set. The API key can be found in the Datadog web application under Integrations>APIs.

Now add inputs for each set of Jolokia metrics you want to forward to Datadog. We recommend starting with a core set of metrics (you may want to add more, and we will be adding more Corda metrics in the future).

Now restart the Telegraf service and check it is running OK:

sudo systemctl restart telegraf
sudo systemctl status telegraf

We recommend adding a Datadog process monitor for Telegraf (see the process check configuration above) so you know if the agent stops running.

In Part 2 we will cover setting up a dashboard and alerting for specific metrics which are useful for monitoring the health of a Corda deployment.

Bonus Section: Alternative monitoring tools

We currently use Datadog for our monitoring because it is very easy to set up and get running, and it lets us focus on our software and deployments rather than on hosting monitoring systems. It also integrates well with all the cloud platforms we use. If you don’t want to use Datadog as the monitoring backend (it is a commercial product which you have to pay for), there are a number of established open source monitoring solutions which are compatible with Telegraf. Telegraf can also collect system metrics and input from many common applications out of the box. The following are a few other options for sending outputs from Telegraf.

1. InfluxDB and Grafana

InfluxDB is an open source time-series database made by the same folks who created Telegraf. It also integrates very nicely with the Grafana monitoring dashboard for a free, simple monitoring solution. Alerting is minimal.

2. Prometheus

Prometheus is a powerful open source monitoring system which includes a time-series database, query language, visualisation and alerting system.