Introduction
After a successful soft launch, the ICGC DCC Software Engineering team is proud to announce the availability of Jupyter notebooks as a new tool in the analysis toolbox within the ICGC Data Portal. Our goal is to give researchers and data scientists a way to programmatically explore the available ICGC data and get their hands dirty with some data science and analysis in a way that is shareable and reproducible.
Users with DACO approval for the ICGC dataset can log in using their Google credentials. Once inside, users are presented with a familiar-looking Jupyter Notebook environment with some value-adds provided by us.
Users can get to Jupyter either by visiting https://jupyterhub.cancercollaboratory.org/ directly in their browser or by navigating to the analysis toolbox in the ICGC Data Portal.
Technology
For a detailed description of all the moving pieces that went into this, please check out Kevin’s excellent blog post on the technical aspects of deploying JupyterHub.
The main project driving our ability to deploy this new feature is JupyterHub. JupyterHub itself combines three main technologies (Tornado, node-http-proxy, and Jupyter), which together allow notebook servers to be provisioned on a per-user or per-session basis.
The other secret sauce is the Docker spawner, which launches each new notebook server from a Docker image that we hand-rolled to include the particular customizations and features we want. JupyterHub and the Docker containers all run inside the Cancer Genome Collaboratory cloud.
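For readers who have not wired a Docker spawner into JupyterHub before, the fragment below is a minimal sketch of what such a configuration can look like. It is not our actual configuration: the image name is a placeholder, and the option names assume a reasonably recent DockerSpawner release.

```python
# jupyterhub_config.py (fragment). JupyterHub injects get_config() when it
# loads this file, so the fragment is not meant to be run on its own.
c = get_config()  # noqa: F821

# Launch every single-user notebook server in its own Docker container.
c.JupyterHub.spawner_class = "dockerspawner.DockerSpawner"

# Placeholder image name standing in for a custom, hand-rolled image.
# Older DockerSpawner releases call this option `container_image`.
c.DockerSpawner.image = "example/icgc-notebook:latest"

# Listen on all interfaces so the containers can reach the Hub.
c.JupyterHub.hub_ip = "0.0.0.0"
```

Baking tools and customizations into the image, rather than installing them at spawn time, keeps every user's environment identical and start-up fast.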
We’ve also extended the authorization mechanism to perform the required DACO checks for access to the ICGC dataset, and generalized the whole deployment using Ansible.
The Ansible deployment of this solution, from building the Docker images to launching the VMs, has been made available as an Overture component called Jukebox.
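To illustrate the idea of layering a DACO check on top of a login, here is a small sketch. It is an assumption-laden stand-in rather than our implementation: `authorize` and `lookup` are hypothetical names, and the real DACO lookup is replaced by a hard-coded set. It follows the JupyterHub convention that an authenticator returns a username on success and None to refuse the login.

```python
from typing import Callable, Optional


def authorize(email: str, has_daco_access: Callable[[str], bool]) -> Optional[str]:
    """Return the username to log in as, or None to refuse the login."""
    normalized = email.strip().lower()
    if not normalized:
        return None
    # Only users who pass the DACO check are allowed through.
    return normalized if has_daco_access(normalized) else None


# Hypothetical stand-in for the real DACO lookup.
approved = {"researcher@example.org"}


def lookup(email: str) -> bool:
    return email in approved


print(authorize("Researcher@Example.org", lookup))  # researcher@example.org
print(authorize("stranger@example.org", lookup))    # None
```

Passing the lookup in as a function keeps the gate itself easy to test without touching any external service.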
Features
The most important fact to know about ICGC Jupyter Notebooks is that they run inside the Cancer Genome Collaboratory, meaning they are co-located with the ICGC dataset itself. This provides the fastest possible access to the data stored there, and we bundle the icgc-storage-client in the notebook server containers so that users can interact with this data. We have also made sure that all the advanced features of the icgc-storage-client work inside the notebook containers. For example, users can use the FUSE mount capabilities of the client.
We’ve also bundled the ICGC Python API for interfacing with the data annotated and indexed in the ICGC Data Portal. This API provides a nice wrapper around the REST API and a programmatic mechanism for downloading the TSV data provided by the dynamic download service.
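One practical consequence of a FUSE mount is that the data shows up as ordinary files, so plain Python file handling works on it. The sketch below assumes the client has already been mounted; the mount point `/mnt/icgc`, the `.bam` suffix, and the helper name `list_data_files` are illustrative assumptions, not something the client prescribes.

```python
from pathlib import Path
from typing import List


def list_data_files(mount_point: str, suffix: str = ".bam") -> List[Path]:
    """Return files under mount_point whose names end with suffix, sorted."""
    root = Path(mount_point)
    if not root.is_dir():
        # Nothing mounted (or wrong path): report no files instead of failing.
        return []
    return sorted(
        p for p in root.rglob("*") if p.is_file() and p.name.endswith(suffix)
    )


# Hypothetical mount point; use whichever directory the client was mounted on.
for path in list_data_files("/mnt/icgc"):
    print(path, path.stat().st_size)
```

Any tool that accepts a file path can then be pointed at the paths this returns.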
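To give a feel for the kind of request such a wrapper issues on your behalf, here is a sketch that builds a portal REST query URL with only the standard library. It does not use the bundled ICGC Python API, whose function names are not covered in this post. The base URL, the `donors` entity, and the shape of the `filters` JSON are assumptions to be checked against the portal's API documentation, and `build_query_url` is a hypothetical helper.

```python
import json
from urllib.parse import urlencode

# Assumed base URL of the portal's REST API.
API_BASE = "https://dcc.icgc.org/api/v1"


def build_query_url(entity: str, filters: dict, size: int = 10) -> str:
    """Build a query URL for one entity type with a JSON-encoded filter."""
    query = urlencode(
        {"filters": json.dumps(filters, separators=(",", ":")), "size": size}
    )
    return f"{API_BASE}/{entity}?{query}"


# Assumed filter shape: donors whose primary site is the brain.
filters = {"donor": {"primarySite": {"is": ["Brain"]}}}
print(build_query_url("donors", filters))
```

The sketch only constructs the URL and sends nothing over the network, so it is safe to run anywhere.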
It’s a sandbox!
We did our best to provision this sandbox environment with the most useful tools, but users still retain the ability to install any missing packages they require. We make no guarantees, but the experience of installing Python packages with pip should be fairly good.
It is worth reminding users that this is a sandbox that will occasionally be updated with additional tools and extra features, so we recommend that users download important notebooks to their local machines. One of the most useful things about Jupyter notebooks is that they are downloadable and shareable, so users should build a habit of backing up their notebooks, possibly even going as far as using a VCS like Git to manage their revisions and backups.
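If you want to install a package from inside a notebook, one common approach is to run pip through the same interpreter the kernel is using, so the package lands where the notebook can import it. The sketch below shows that pattern; the helper names are hypothetical and `seaborn` is only an example package. If the environment's package directory turns out not to be writable, adding pip's `--user` flag is the usual workaround.

```python
import subprocess
import sys
from typing import List


def pip_install_command(package: str) -> List[str]:
    """Build a pip command bound to the interpreter running this notebook."""
    return [sys.executable, "-m", "pip", "install", package]


def pip_install(package: str) -> None:
    """Run the install; raises CalledProcessError if pip fails."""
    subprocess.check_call(pip_install_command(package))


print(pip_install_command("seaborn"))
# pip_install("seaborn")  # uncomment to actually install the package
```

Remember that packages installed this way live in the sandbox and may need to be reinstalled after the environment is updated.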
Happy data mining!