This post shows how to set up a monitoring system for Java applications using InfluxDB, Grafana, Telegraf and StatsD-JVM-Profiler. The system collects memory usage, GC activity and Tomcat sessions from a Java web application. Take a look at a sample data visualization from Grafana:

This graph says a lot about the Java application. In my context, it reads as follows: the application went through a period of intensive activity, during which memory usage was very high and many GCs were triggered. Worse, after that period memory usage kept building up toward the point of OOM until GC kicked in, which suggests unhealthy memory usage in the application. Combined with the graph of Tomcat active sessions, this graph can tell you a lot about your Java application; how to read it depends on your context and your understanding of the application.

You can find other analysis examples in Web request performance analysis charts.

Architecture

The architecture of this system is as follows:

The Java application is instrumented with StatsD-JVM-Profiler (https://github.com/etsy/statsd-jvm-profiler). This profiler periodically fetches memory usage and GC activity from the running JVM and sends them to Telegraf.

Telegraf (https://www.influxdata.com/time-series-platform/telegraf/) is a data collector with many plugins; one of its input plugins is StatsD. In this monitoring system, the StatsD plugin accepts data from StatsD-JVM-Profiler and Telegraf forwards it to InfluxDB. Another good thing about Telegraf is that by default it collects the CPU and memory usage of the machine where it is installed and sends it to InfluxDB, so machine monitoring comes for free.

StatsD-JVM-Profiler can send data directly to InfluxDB, but I prefer to route it through Telegraf: it is more robust and flexible, and Telegraf monitors the machine for free.

InfluxDB (https://www.influxdata.com/time-series-platform/influxdb/) is a time-series database, well suited to storing monitoring data.

Grafana (http://grafana.org/) is a UI application, well suited to visualizing time-series data.

Note that StatsD itself is not installed in this architecture. StatsD is involved only in that Telegraf accepts data points written in the StatsD format (which is very easy to understand and use).
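For reference, a StatsD data point is a single line of text of the form name:value|type. The sketch below only builds such lines. The helper name statsd_line and the metric name heap.used are made up for illustration and are not taken from StatsD-JVM-Profiler.

```shell
# statsd_line NAME VALUE TYPE
# Hypothetical helper: formats one StatsD data point as <name>:<value>|<type>.
# Common types: c (counter), g (gauge), ms (timing), s (set).
statsd_line() {
    printf '%s:%s|%s\n' "$1" "$2" "$3"
}

statsd_line heap.used 512 g     # prints heap.used:512|g
statsd_line user.logins 1 c     # prints user.logins:1|c
```

Each line is sent as one UDP datagram, so the sender needs no StatsD library, only a way to write text to a UDP socket.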

Installation

I installed InfluxDB and Grafana on Ubuntu:

vagrant@exp:~$ cat /etc/*-release

DISTRIB_ID=Ubuntu

DISTRIB_RELEASE=15.10

DISTRIB_CODENAME=wily

Install InfluxDB

sudo apt-get update

sudo apt-get upgrade

curl -sL https://repos.influxdata.com/influxdb.key | sudo apt-key add -

source /etc/lsb-release

echo "deb https://repos.influxdata.com/${DISTRIB_ID,,} ${DISTRIB_CODENAME} stable" | sudo tee /etc/apt/sources.list.d/influxdb.list

sudo apt-get update && sudo apt-get install influxdb

sudo service influxdb start

By default, InfluxDB’s port is 8086.

Now that InfluxDB is installed, you can create a database:

$ influx

Connected to http://localhost:8086 version 1.1.0

InfluxDB shell version: 1.1.0

> create database test

> use test

To verify InfluxDB is in order:

curl -i -X POST "http://localhost:8086/write?db=test" --data-binary 'user.logins,service=payroll,region=us-west value=1 1478177907371000064'

Go back to the InfluxDB shell:

> show measurements

name: measurements

name

----

user.logins

> select * from "user.logins"

name: user.logins

time region service value

---- ------ ------- -----

1478177907371000064 us-west payroll 1
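The string posted with curl above is written in InfluxDB's line protocol: the measurement name, comma-separated tags, a space, the fields, and an optional nanosecond timestamp. As a rough sketch, the hypothetical helper influx_line below (not part of InfluxDB) assembles such a line from its parts.

```shell
# influx_line MEASUREMENT TAGS FIELDS [TIMESTAMP_NS]
# Hypothetical helper: joins the parts of one line-protocol point.
influx_line() {
    if [ -n "${4:-}" ]; then
        printf '%s,%s %s %s\n' "$1" "$2" "$3" "$4"
    else
        # Without a timestamp, InfluxDB assigns the server's current time.
        printf '%s,%s %s\n' "$1" "$2" "$3"
    fi
}

# Rebuilds the point used in the curl command above.
influx_line user.logins service=payroll,region=us-west value=1 1478177907371000064
# prints: user.logins,service=payroll,region=us-west value=1 1478177907371000064
```

The output can be passed to the same curl command with --data-binary "$(influx_line ...)".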

Install Grafana

Follow http://docs.grafana.org/installation/debian/:

$ wget https://grafanarel.s3.amazonaws.com/builds/grafana_3.1.1-1470047149_amd64.deb

$ sudo apt-get install -y adduser libfontconfig

$ sudo dpkg -i grafana_3.1.1-1470047149_amd64.deb

$ sudo /bin/systemctl daemon-reload

$ sudo /bin/systemctl enable grafana-server

$ sudo systemctl start grafana-server

The default port for Grafana is 3000.

Install Telegraf

Follow https://www.influxdata.com/downloads/:

To install on Ubuntu:

cd /tmp

wget https://dl.influxdata.com/telegraf/releases/telegraf_1.0.1_amd64.deb

sudo dpkg -i telegraf_1.0.1_amd64.deb

You will likely install Telegraf on several operating systems. For the sake of completeness, here are the commands to install it on Redhat and Windows.

To install on Redhat:

cd /tmp

wget https://dl.influxdata.com/telegraf/releases/telegraf-1.1.1.x86_64.rpm

sudo yum localinstall telegraf-1.1.1.x86_64.rpm

To install on Windows, follow https://github.com/influxdata/telegraf/blob/master/docs/WINDOWS_SERVICE.md:

Download and extract the windows distribution https://dl.influxdata.com/telegraf/releases/telegraf-1.1.2_windows_amd64.zi
Create the directory C:\Program Files\Telegraf (if you install in a different location simply specify the -config parameter with the desired location)
Place the telegraf.exe and the telegraf.conf config file into C:\Program Files\Telegraf
To install the service into the Windows Service Manager, run the following in PowerShell as an administrator (if necessary, you can wrap any spaces in the file paths in double quotes ""):
C:\"Program Files"\Telegraf\telegraf.exe --service install
If you install it in a different location, you need to specify the config file location with:
telegraf.exe --service install -config full_path_to_config_file

On Linux, the Telegraf configuration file is at /etc/telegraf/telegraf.conf. Make the following changes to it:

[[outputs.influxdb]]

## The full HTTP or UDP endpoint URL for your InfluxDB instance.

## Multiple urls can be specified as part of the same cluster,

## this means that only ONE of the urls will be written to each interval.

# urls = ["udp://localhost:8089"] # UDP endpoint example

urls = ["http://localhost:8086"] # required

## The target database for metrics (telegraf will create it if not exists).

database = "test" # required

…
[[inputs.statsd]]

## Address and port to host UDP listener on

service_address = ":8125"

## Delete gauges every interval (default=false)

delete_gauges = true

## Delete counters every interval (default=false)

delete_counters = true

## Delete sets every interval (default=false)

delete_sets = true

## Delete timings & histograms every interval (default=true)

delete_timings = true

## Percentiles to calculate for timing & histogram stats

percentiles = [90]

## separator to use between elements of a statsd metric

metric_separator = "_"

## Parses tags in the datadog statsd format

## http://docs.datadoghq.com/guides/dogstatsd/

parse_data_dog_tags = false

## Statsd data translation templates, more info can be read here:

## https://github.com/influxdata/telegraf/blob/master/docs/DATA_FORMATS_INPUT.md#graphite

##templates = [

## "cpu.* measurement*"

##]

## Number of UDP messages allowed to queue up, once filled,

## the statsd server will start dropping packets

allowed_pending_messages = 10000

## Number of timing/histogram values to track per-measurement in the

## calculation of percentiles. Raising this limit increases the accuracy

## of percentiles but also increases the memory usage and cpu time.

percentile_limit = 1000

After changing the configuration file, you need to restart Telegraf:

sudo systemctl restart telegraf

To test if Telegraf can successfully send data to InfluxDB:

echo "user.logins,service=payroll,region=us-west:1|c" | nc -C -w 1 -u localhost 8125

Now go back to the InfluxDB shell. You should see that a new measurement “user_logins” has been created. Notice that Telegraf automatically converts the dot to an underscore (this is the metric_separator setting above), which is nice because you do not have to quote measurement names when selecting from them.
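The renaming can be mimicked with a few lines of shell. This only approximates the behaviour observed in the test above with metric_separator = "_"; it is not Telegraf's actual implementation, and the helper name measurement_name is hypothetical.

```shell
# measurement_name STATSD_LINE
# Hypothetical helper: keeps the metric name (everything before the first
# comma or colon) and replaces each dot with an underscore.
measurement_name() {
    printf '%s\n' "${1%%[,:]*}" | tr . _
}

measurement_name 'user.logins,service=payroll,region=us-west:1|c'
# prints: user_logins
```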

Instrument a Java application

Just follow the instructions on https://github.com/etsy/statsd-jvm-profiler.

In my case, I wanted to present related data series in one graph and use Grafana’s templating feature to select data series within that graph. For example, all data series of heap usage are shown in one graph (see the previous graph). I modified StatsD-JVM-Profiler to send data series using InfluxDB tags, and I also added a profiler to collect active and expired sessions in Tomcat. More on this in A Fork from statsd-jvm-profiler.
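To illustrate the idea of tags (a sketch only, not the code of the fork): Telegraf's StatsD input accepts InfluxDB-style tags appended to the metric name, as in the user.logins test earlier. The measurement name heap, the tag key type and the helper name tagged_gauge are assumptions made up for this example.

```shell
# tagged_gauge MEASUREMENT TAGS VALUE
# Hypothetical helper: formats a StatsD gauge with InfluxDB-style tags,
# i.e. <measurement>,<tags>:<value>|g
tagged_gauge() {
    printf '%s,%s:%s|g\n' "$1" "$2" "$3"
}

# One measurement, several series distinguished by a tag value.
tagged_gauge heap type=used 512         # prints heap,type=used:512|g
tagged_gauge heap type=committed 1024   # prints heap,type=committed:1024|g
```

A line can be sent to Telegraf the same way as in the earlier test, for example tagged_gauge heap type=used 512 | nc -C -w 1 -u localhost 8125. Because all series share one measurement, a Grafana template variable can then select among them by tag value.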

PerfSpy

Wednesday, December 14, 2016

A monitoring system for Java applications using InfluxDB, Grafana, Telegraf and statsd-jvm-profiler