Wednesday, December 14, 2016

A monitoring system for Java applications using InfluxDB, Grafana, Telegraf and statsd-jvm-profiler

This post shows how to set up a monitoring system for Java applications using InfluxDB, Grafana, Telegraf and StatsD-JVM-Profiler. The system collects memory usage, GC activity and Tomcat sessions from a Java web application. Take a look at a sample data visualization from Grafana:

This graph says a lot about the Java application. In my context, it reads: the application went through a period of intensive activity, during which memory usage was very high and many GCs were triggered. Worse, after that period the memory usage kept building up towards the point of OOM until GC kicked in, which suggests unhealthy memory usage in the application. This graph, coupled with the graph of Tomcat session activity, can tell you a lot about your Java application; how to read it depends on your context and your understanding of the application.

You can find other analysis examples in Web request performance analysis charts.


The architecture of this system is as follows:

The Java application is instrumented with StatsD-JVM-Profiler, which periodically fetches memory usage and GC activity from the running JVM and sends them to Telegraf.

Telegraf is a data collector with many plugins, one of which is the StatsD input plugin. In this monitoring system, we use the StatsD plugin to accept data from StatsD-JVM-Profiler and send it to InfluxDB. Another good thing about Telegraf is that by default it collects the CPU and memory usage of the machine where it is installed and sends it to InfluxDB, so machine monitoring comes for free.

StatsD-JVM-Profiler can send data directly to InfluxDB, but I prefer to route it through Telegraf: it is more robust and flexible, and Telegraf monitors the machine for free.

InfluxDB is a time-series DB, perfect for storing monitoring data.

Grafana is a UI application, perfect for visualizing time-series data.

Note that StatsD itself is not installed in this architecture. StatsD is involved only in that data points are sent to Telegraf in the StatsD format (which is very easy to understand and use).
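To make the StatsD format concrete: a data point is a single line of text of the form name:value|type, where the type is c for a counter, g for a gauge, ms for a timing and s for a set. Telegraf's StatsD input additionally accepts InfluxDB-style tags appended to the name with commas. The following is a minimal sketch; the helper name statsd_line and the heap.used metric are made up for illustration and are not part of StatsD-JVM-Profiler or Telegraf.

```shell
# Hypothetical helper: build one StatsD line.
# Usage: statsd_line NAME VALUE TYPE [TAG=VALUE ...]
statsd_line() {
    name=$1
    value=$2
    kind=$3
    shift 3
    # Append each tag to the metric name, separated by commas.
    for tag in "$@"; do
        name="$name,$tag"
    done
    printf '%s:%s|%s\n' "$name" "$value" "$kind"
}

# A counter with two tags.
statsd_line user.logins 1 c service=payroll region=us-west
# A gauge without tags (illustrative metric name).
statsd_line heap.used 512 g
```

Each line produced this way can be sent as one UDP datagram to the port the StatsD listener is bound to.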


I install InfluxDB and Grafana on Ubuntu:
vagrant@exp:~$ cat /etc/*-release

Install InfluxDB

sudo apt-get update
sudo apt-get upgrade
curl -sL | sudo apt-key add -
source /etc/lsb-release
echo "deb${DISTRIB_ID,,} ${DISTRIB_CODENAME} stable" | sudo tee /etc/apt/sources.list.d/influxdb.list
sudo apt-get update && sudo apt-get install influxdb
sudo service influxdb start

By default, InfluxDB’s port is 8086. 

Now that InfluxDB is installed, you can open the influx shell and create a database:

$ influx
Connected to http://localhost:8086 version 1.1.0
InfluxDB shell version: 1.1.0

> create database test
> use test

To verify that InfluxDB is working, write a test point over HTTP:
curl -i -X POST "http://localhost:8086/write?db=test" --data-binary 'user.logins,service=payroll,region=us-west value=1 1478177907371000064'

Go back to the InfluxDB shell and query the data:
> show measurements
name: measurements
> select * from "user.logins"
name: user.logins
time                            region  service value
----                            ------  ------- -----
1478177907371000064             us-west payroll 1
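
The point written with curl above uses InfluxDB's line protocol: the measurement name, optional comma-separated tags, a space, the fields, and optionally a space followed by a timestamp in nanoseconds. As a rough sketch, a line can be assembled like this; the helper name influx_line and the heap_used measurement are made up for illustration, and the helper does no escaping of spaces or commas.

```shell
# Hypothetical helper: assemble one line-protocol point.
# Usage: influx_line MEASUREMENT TAGS FIELDS [TIMESTAMP_NS]
# TAGS may be an empty string; without a timestamp the server uses its own clock.
influx_line() {
    line=$1
    if [ -n "$2" ]; then line="$line,$2"; fi
    line="$line $3"
    if [ -n "${4:-}" ]; then line="$line $4"; fi
    printf '%s\n' "$line"
}

# The same point as in the curl command above.
influx_line user.logins service=payroll,region=us-west value=1 1478177907371000064
# A point without tags or timestamp (illustrative measurement name).
influx_line heap_used '' value=512

# To write a generated line, pipe it to curl and read the body from stdin:
#   influx_line ... | curl -i -X POST "http://localhost:8086/write?db=test" --data-binary @-
```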

Install Grafana

$ wget
$ sudo apt-get install -y adduser libfontconfig
$ sudo dpkg -i grafana_3.1.1-1470047149_amd64.deb

$ sudo /bin/systemctl daemon-reload
$ sudo /bin/systemctl enable grafana-server
$ sudo systemctl start grafana-server

The default port for Grafana is 3000.

Install Telegraf

To install on Ubuntu:
cd /tmp
sudo dpkg -i telegraf_1.0.1_amd64.deb

You will likely install Telegraf on all sorts of operating systems, so for the sake of completeness, here are the commands to install it on Redhat and Windows.

To install on Redhat:
cd /tmp
sudo yum localinstall telegraf-1.1.1.x86_64.rpm

To install on Windows:
  1. Download and extract the Windows distribution.
  2. Create the directory C:\Program Files\Telegraf (if you install in a different location, specify the -config parameter with the desired location).
  3. Place telegraf.exe and the telegraf.conf config file in C:\Program Files\Telegraf.
  4. To install the service into the Windows Service Manager, run the following in PowerShell as an administrator (if necessary, wrap any file paths containing spaces in double quotes ""):
     C:\"Program Files"\Telegraf\telegraf.exe --service install
     If you installed it in a different location, specify the config file location with:
     telegraf.exe --service install -config full_path_to_config_file

On Linux, the Telegraf configuration file is /etc/telegraf/telegraf.conf. Make the following changes to it, first in the [[outputs.influxdb]] section and then in the [[inputs.statsd]] section:

 ## The full HTTP or UDP endpoint URL for your InfluxDB instance.
 ## Multiple urls can be specified as part of the same cluster,
 ## this means that only ONE of the urls will be written to each interval.
 # urls = ["udp://localhost:8089"] # UDP endpoint example
  urls = ["http://localhost:8086"] # required
 ## The target database for metrics (telegraf will create it if not exists).
  database = "test" # required

   ## Address and port to host UDP listener on
   service_address = ":8125"
   ## Delete gauges every interval (default=false)
   delete_gauges = true
   ## Delete counters every interval (default=false)
   delete_counters = true
   ## Delete sets every interval (default=false)
   delete_sets = true
   ## Delete timings & histograms every interval (default=true)
   delete_timings = true
   ## Percentiles to calculate for timing & histogram stats
   percentiles = [90]

   ## separator to use between elements of a statsd metric
   metric_separator = "_"

   ## Parses tags in the datadog statsd format
   parse_data_dog_tags = false

   ## Statsd data translation templates, more info can be read here:
    ##templates = [
    ##    "cpu.* measurement*"
    ##]

   ## Number of UDP messages allowed to queue up, once filled,
   ## the statsd server will start dropping packets
   allowed_pending_messages = 10000

   ## Number of timing/histogram values to track per-measurement in the
   ## calculation of percentiles. Raising this limit increases the accuracy
   ## of percentiles but also increases the memory usage and cpu time.
   percentile_limit = 1000

After changing the configuration file, you need to restart Telegraf:
sudo systemctl restart telegraf

To test whether Telegraf forwards data to InfluxDB, send it a StatsD counter over UDP:
echo "user.logins,service=payroll,region=us-west:1|c" | nc -C -w 1 -u localhost 8125

Now go back to the InfluxDB command line; you should see that a new measurement “user_logins” has been created. Notice that Telegraf automatically converted the dot to an underscore (the metric_separator setting above). This is nice, as you do not have to quote measurement names when selecting from them.
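
To predict which measurement name a given StatsD line will end up under, the renaming can be approximated as below. This is only a sketch of the visible effect of metric_separator = "_" for simple names; the helper name measurement_name is made up, it is not how Telegraf implements the conversion, and it ignores templates.

```shell
# Hypothetical helper: approximate the measurement name Telegraf derives
# from a StatsD line when metric_separator = "_" and no templates are set.
measurement_name() {
    # Keep only the metric name: drop tags and the ":value|type" part.
    bucket=${1%%[,:]*}
    # Replace the dots in the name with the separator.
    printf '%s\n' "$bucket" | tr '.' '_'
}

measurement_name 'user.logins,service=payroll,region=us-west:1|c'
```

For the test line sent above, this prints user_logins, matching the measurement that shows up in InfluxDB.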

Instrument a Java application

Just follow the instructions on the StatsD-JVM-Profiler project page.

In my case, I wanted to present related data series in one graph and use Grafana’s templating feature to select data series within that graph; for example, all heap usage data series are shown in one graph (see the previous graph). I modified StatsD-JVM-Profiler to send data series using InfluxDB tags, and I also added a profiler to collect active and expired sessions in Tomcat. More on this in A Fork from statsd-jvm-profiler.
