Using ELK and Ntopng to monitor data downloads.

Jan. 27, 2017


The Cancer Genome Collaboratory is a cloud environment built using open-source technologies that aims to offer cancer researchers access to compute resources close to the large data sets used in their analyses.

Currently we have 4 PB of raw Ceph storage and 2,500 CPU cores, built on Ceph and OpenStack, and we also use a number of other open-source tools to manage and monitor the infrastructure.

All the servers send their logs to a central Logstash server, where they are parsed and indexed into Elasticsearch so that we can easily query the download requests.
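
The parsing step can be pictured with a small Python sketch. It is only an illustration, not our actual Logstash filter: it assumes the access logs follow the common Apache "combined" layout, and the field names (`clientip`, `verb`, `request`, `response`, `bytes`) as well as the sample line are made up for the example.

```python
import re

# Assumption: access logs use the Apache "combined" layout.
# Field names are illustrative, not taken from our real pipeline.
LOG_PATTERN = re.compile(
    r'(?P<clientip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<verb>\S+) (?P<request>\S+) [^"]*" '
    r'(?P<response>\d{3}) (?P<bytes>\d+|-)'
)

# Hypothetical log line (documentation IP range, invented object path).
SAMPLE_LINE = (
    '192.0.2.10 - - [27/Jan/2017:10:15:32 -0500] '
    '"GET /example-bucket/example-object HTTP/1.1" 200 1073741824 "-" "example-client"'
)


def parse_line(line):
    """Turn one access-log line into a dict of fields, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    event = match.groupdict()
    event["response"] = int(event["response"])
    # A "-" in the size column means no body was sent.
    event["bytes"] = 0 if event["bytes"] == "-" else int(event["bytes"])
    return event


if __name__ == "__main__":
    print(parse_line(SAMPLE_LINE))
```

Once each request is a structured event like this, filtering by size or client address becomes a simple query.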


Because we store the large genomics data files as S3 objects uploaded in 1 GB multipart chunks, we can focus the Kibana search on download requests for objects larger than 1 GB, while also excluding the IP addresses of some known clients (regular downloads that are part of our monitoring system).
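
For readers who prefer to query Elasticsearch directly, a search of this kind could be expressed roughly as follows. This is a sketch under assumptions: the field names `verb`, `bytes` and `clientip` are hypothetical (they depend on how the logs are parsed), `clientip` is assumed to be mapped as a keyword or IP field, "1 GB" is taken to mean 2**30 bytes, and the excluded address is a placeholder.

```python
import json

ONE_GB = 1024 ** 3  # assumption: "1 GB" is read as 2**30 bytes


def build_download_query(min_bytes=ONE_GB, excluded_ips=()):
    """Build an Elasticsearch bool query for large GET requests.

    Field names ("verb", "bytes", "clientip") are hypothetical.
    """
    bool_query = {
        "filter": [
            {"match": {"verb": "GET"}},
            {"range": {"bytes": {"gte": min_bytes}}},
        ]
    }
    if excluded_ips:
        # Drop known clients, e.g. hosts used by the monitoring system.
        bool_query["must_not"] = [{"terms": {"clientip": list(excluded_ips)}}]
    return {"query": {"bool": bool_query}}


if __name__ == "__main__":
    # 192.0.2.10 is a placeholder address from the documentation range.
    print(json.dumps(build_download_query(excluded_ips=["192.0.2.10"]), indent=2))
```

The resulting dictionary can be sent as the body of a search request; in Kibana the same conditions would be typed into the search bar instead.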


Finally, we can create a Kibana visualization from the search results to get a better idea of who the major download users are (each hit represents a 1 GB download).
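
The arithmetic behind the visualization is simple enough to sketch in Python. The example below assumes that each search hit is a dictionary carrying the client address in a field called `clientip` (a hypothetical name) and, as described above, counts every hit as 1 GB downloaded.

```python
from collections import Counter


def gigabytes_per_client(hits, ip_field="clientip"):
    """Count hits per client; each hit is treated as a 1 GB download.

    Returns (client, gigabytes) pairs, largest first.
    The field name is an assumption about how the logs are parsed.
    """
    counts = Counter(hit[ip_field] for hit in hits if ip_field in hit)
    return counts.most_common()


if __name__ == "__main__":
    # Placeholder hits using documentation IP ranges.
    sample_hits = [
        {"clientip": "198.51.100.7"},
        {"clientip": "203.0.113.5"},
        {"clientip": "198.51.100.7"},
    ]
    for client, gigabytes in gigabytes_per_client(sample_hits):
        print(f"{client}: about {gigabytes} GB")
```

A Kibana bar or pie chart split by client address presents the same tally graphically.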


In addition to the central aggregation of application logs in ELK, the core switch connecting the environment to the Internet sends sFlow statistics to Ntopng, for which we have a free license kindly provided by the Ntop maintainers.


This allows us to see external connections in real time, and even to display them geographically on a map.

George Mihaiescu, Cloud Architect
Cloud architect who knows his way around storage, virtualization, networking, security and new design patterns. Enjoys working on performance-enhancement problems and is proud of making a difference in cancer research.