The Cancer Genome Collaboratory is a cloud environment built on open-source technologies that offers cancer researchers compute resources close to the large data sets used in their analyses.
The environment currently provides 4 PB of raw storage and 2,500 CPU cores, deployed on Ceph and OpenStack, and we use a number of other open-source tools to manage and monitor the infrastructure.
Because we store the large genomics data files as S3 objects uploaded in 1 GB multipart chunks, we can restrict the Kibana search to download requests for objects larger than 1 GB, while excluding the IP addresses of known clients that perform regular downloads as part of our monitoring system.
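As a rough illustration of the filter described above, the following sketch builds an equivalent Elasticsearch query body in Python. The field names `verb`, `bytes`, and `clientip` are assumptions (they depend on how the logs are parsed before indexing), as are the example IP addresses, the helper name `build_download_query`, and the choice to treat 1 GB as 2**30 bytes.

```python
import json

# Assumption: "1 GB" is treated as 1 GiB; adjust to match the actual part size.
ONE_GB = 2**30


def build_download_query(excluded_ips, min_bytes=ONE_GB):
    """Build a query body for large GET requests, skipping known clients.

    Field names (verb, bytes, clientip) are hypothetical and must match
    the fields actually present in the indexed log documents.
    """
    return {
        "query": {
            "bool": {
                # Both conditions must hold: a download request of at least min_bytes.
                "filter": [
                    {"term": {"verb": "GET"}},
                    {"range": {"bytes": {"gte": min_bytes}}},
                ],
                # Drop requests from the monitoring clients.
                "must_not": [
                    {"terms": {"clientip": list(excluded_ips)}},
                ],
            }
        }
    }


if __name__ == "__main__":
    # Example addresses from the documentation range, not real clients.
    print(json.dumps(build_download_query(["192.0.2.10", "192.0.2.11"]), indent=2))
```

The same conditions can be typed directly into the Kibana search bar; the dictionary form is shown only because it makes the three parts of the filter explicit.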
Finally, we can create a Kibana visualization from the search results to get a better idea of who the major download users are (each hit corresponds to 1 GB downloaded).
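The arithmetic behind that visualization is simple enough to sketch outside Kibana. The example below assumes each search hit is available as a dictionary with a `clientip` field (a hypothetical field name) and applies the rule that one hit equals 1 GB downloaded; the function name `top_downloaders` is likewise only illustrative.

```python
from collections import Counter

# From the text: each matching hit represents 1 GB downloaded.
GB_PER_HIT = 1


def top_downloaders(hits, limit=10):
    """Return (client, gigabytes) pairs, largest downloaders first.

    `hits` is an iterable of dicts with a 'clientip' key (assumed field name).
    """
    counts = Counter(hit["clientip"] for hit in hits)
    return [(client, n * GB_PER_HIT) for client, n in counts.most_common(limit)]


if __name__ == "__main__":
    # Made-up sample hits using documentation-range addresses.
    sample = [
        {"clientip": "198.51.100.7"},
        {"clientip": "203.0.113.5"},
        {"clientip": "198.51.100.7"},
    ]
    for client, gigabytes in top_downloaders(sample):
        print(f"{client}: {gigabytes} GB")
```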
In addition to aggregating the application logs centrally in ELK, the core switch connecting the environment to the Internet sends sFlow statistics to ntopng, for which the ntop maintainers have kindly provided us with a free license.
This allows us to see external connections in real time, and even to display them geographically on a map.