Kandinsky

Kandinsky, the Systems Biology KBase pilot hardware, is a machine designed for optimal support of the Hadoop architecture and runtime. The original recommendation of the science advisory board focused on examining the MapReduce programming paradigm and its applicability to bioinformatics applications. As a result, the hardware configuration includes more than 0.5 petabytes of node-local storage managed by the Hadoop Distributed File System (HDFS).

In addition to supporting Hadoop-based applications, support for private cloud virtualization will be added through Eucalyptus, an infrastructure software package for establishing private cloud computing environments. Eucalyptus is interface-compatible with the Amazon Web Services (AWS) cloud infrastructure, which means users can reuse existing AWS-compatible tools and scripts to manage their own private cloud, run Amazon Machine Images on their private cloud, and cloud-burst to public clouds. The last arrangement is also known as a hybrid cloud: a private on-premise cloud (in this case a Eucalyptus cloud) working seamlessly with a public cloud.

General Information
· Software Stack: Kandinsky is a 64-bit Linux system running CentOS version 5. The Hadoop distribution we use is Cloudera's Distribution of Hadoop, version 3 (CDH3), which comes with the following tool set:
o HDFS
o HBase
o MapReduce
o Hive, Pig, Oozie, Sqoop, Flume, Hue and ZooKeeper

User Documentation
· Getting Started: First-time users need to request a new account. Once the account is available, the user can log into the system and access both the local file system and the Hadoop Distributed File System (HDFS), which is part of the Hadoop cloud. To run any Hadoop-based application, users need to transfer their data from the local file system to the distributed file system. HBase, the other popular data storage scheme on the Hadoop cloud, can be accessed on Kandinsky through a shell.
· Available data: The Sequence Read Archive (SRA) data is available in HDFS and HBase. The compressed sequence files are stored in HDFS, whereas the metadata is stored in a single large HBase master table called SRAData.
· Using Installed Applications: The following applications are currently installed and ready to use:
o Crossbow: Crossbow is a scalable, portable, and automatic cloud computing tool for finding SNPs in genomes from short read data.
o CloudBurst: CloudBurst is a parallel read-mapping algorithm optimized for mapping next-generation sequence data to the human genome and other reference genomes, for use in a variety of biological analyses including SNP discovery, genotyping, and personal genomics.

Developer Documentation
· Code for creating your own private distributed SRA: We have developed code libraries based on the HDFS API and the HBase API that allow individual centers to download their own copy of the SRA data and distribute it over their private Hadoop cloud. The software is available on Bitbucket and can be checked out from the SRA_Hbase repository. We would be happy to discuss any development ideas regarding the code base.

Related Resources
· Hadoop, HBase
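
The Getting Started step of moving data from the local file system into HDFS can be sketched as follows. This is a minimal illustration, not a site-specific procedure: it assumes the standard `hadoop fs -put` and `hadoop fs -ls` commands are on the user's path, and the helper names (`hdfs_put_command`, `hdfs_ls_command`, `run_or_print`), the file `reads.fastq`, and the directory `/user/alice/input` are hypothetical. By default the sketch only prints the commands, so it can be tried without a Hadoop installation.

```python
import shlex
import subprocess
from typing import List


def hdfs_put_command(local_path: str, hdfs_dir: str) -> List[str]:
    """Build the argument list that copies a local file into an HDFS directory."""
    return ["hadoop", "fs", "-put", local_path, hdfs_dir]


def hdfs_ls_command(hdfs_dir: str) -> List[str]:
    """Build the argument list that lists the contents of an HDFS directory."""
    return ["hadoop", "fs", "-ls", hdfs_dir]


def run_or_print(argv: List[str], dry_run: bool = True) -> str:
    """Print the command (dry run) or execute it; return the shell-quoted line."""
    line = " ".join(shlex.quote(arg) for arg in argv)
    if dry_run:
        print(line)
    else:
        # Raises CalledProcessError if the hadoop command exits with an error.
        subprocess.run(argv, check=True)
    return line


if __name__ == "__main__":
    # Hypothetical file and target directory, used only for illustration.
    run_or_print(hdfs_put_command("reads.fastq", "/user/alice/input"))
    run_or_print(hdfs_ls_command("/user/alice/input"))
```

Passing `dry_run=False` on a login node would execute the same commands. Building the argument list separately from running it keeps the commands easy to inspect before any data is moved.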
