Chorus Brings Data Science Minds Together

The announcement of the OpenChorus Project a few months ago provided a glimpse into the upcoming EMC Greenplum Chorus 2.2 release and its integrations, which aim to accelerate Big Data time to value. Chorus 2.2 provides a single platform where users gain direct access to filtered, clean Twitter feeds from Gnip, perform advanced analysis faster with on-demand assistance from expert Kaggle data scientists, and share insights seamlessly through Tableau advanced visualizations.


Chorus 2.2 is now available as a free download, with the same code base also available through the OpenChorus Project download. For those of you not familiar with Chorus, it is the only collaborative Data Science platform that streamlines the complex analytic process, enabling users to quickly create their own sandboxes and easily collaborate around data sets, analysis, and findings.

Additionally, open sourcing Chorus brings greater freedom. Anyone can download the source code and get started, modifying and extending it for any environment. This also promotes an ecosystem of applications and startups around Big Data, bringing extensibility into the product at a much higher velocity than we would be able to achieve on our own. For example, the release of Greenplum Chorus 2.2 includes valuable contributions from the partners I mentioned earlier: Gnip, Kaggle, and Tableau. Have I piqued your interest in downloading Chorus 2.2? Here is a Q&A I conducted with Logan Lee, Director of Product Management at EMC Greenplum, to prepare you for success.

1. What are the system requirements or prerequisites for Chorus 2.2?

Greenplum Chorus can be installed on the Greenplum DCA (Data Computing Appliance) or on any Linux server with an Intel Pentium Pro compatible (P3/Athlon and above) CPU and 8 GB of RAM. In either case, Greenplum recommends 500 GB of free disk space. Greenplum Chorus operates on the following platforms and browsers:

  • Red Hat Enterprise Linux 5.5, 5.7, 6.2 (64 bit)
  • CentOS 5.5, 5.7, 6.2 (64 bit)
  • SuSE Linux Enterprise Server 11 (64 bit)
  • OS X Lion (x86_64)
  • Firefox 14.0 or later
  • Google Chrome 20 or later
  • Internet Explorer 8.0 with Google Chrome Frame
  • Internet Explorer 9.0 (Google Chrome Frame not required)

Greenplum Chorus interoperates with Greenplum Database and Greenplum Hadoop (both available at the EMC Download Center), as well as a few more flavors of Apache Hadoop:

  • Greenplum Database 4.0.5.x, 4.1.x, 4.2.x
  • Apache Hadoop 0.20.2, 0.20.203, 0.20.205
  • Greenplum HD 1.1, 1.2
  • Greenplum MR 1.0, 1.2.x
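As a rough illustration, the version lists above can be turned into a small compatibility check. This is only a sketch, not part of Chorus: the `SUPPORTED` table, the function names, and the reading of a trailing ".x" as "any patch level" are assumptions made for the example, and only the Greenplum Database and Apache Hadoop entries are included.

```python
# Supported version patterns, copied from the list above.
# Assumption: a trailing ".x" means "any patch level under this prefix".
SUPPORTED = {
    "Greenplum Database": ["4.0.5.x", "4.1.x", "4.2.x"],
    "Apache Hadoop": ["0.20.2", "0.20.203", "0.20.205"],
}


def matches(pattern: str, version: str) -> bool:
    """Return True if version fits pattern; '.x' acts as a wildcard suffix."""
    if pattern.endswith(".x"):
        prefix = pattern[:-2]
        # Accept the bare prefix ("4.1") or anything below it ("4.1.3").
        return version == prefix or version.startswith(prefix + ".")
    return version == pattern


def is_supported(product: str, version: str) -> bool:
    """Check a product/version pair against the SUPPORTED table."""
    return any(matches(p, version) for p in SUPPORTED.get(product, []))
```

A call such as `is_supported("Greenplum Database", "4.2.1")` could be run before installation to catch an unsupported back end early; the authoritative list remains the one in the Chorus documentation.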

2. Through an open SQL query layer, Chorus plugs into an existing IT environment with a mix of databases and tools. Does that mean I can configure Chorus to gain access to data in any database? What other applications have tight integration with Chorus beyond Tableau?

Yes, Chorus is designed to integrate with the data sources referenced above, as well as with other data sets through Greenplum Database's External Table functionality. Once sandboxes have been provisioned through Chorus, users can start analyzing data sets through direct SQL commands or with tools such as Tableau and Alpine Illuminator. This tight integration provides a single source of truth throughout the analytical process, eliminating the need to jump from tool to tool for analytical requirements that range from simple queries and charts to advanced analytics and visualizations.

3. There is now direct access to Twitter data through integration with Gnip’s Twitter API. What steps are needed to enable this?

You first need to set up an account with Gnip. Once you specify the filter criteria, Gnip will prepare the data and generate a URL that fetches your Twitter data set. In Chorus, you simply enter the Gnip URL, and Chorus will download, process, and insert the data into a table in your sandbox. This integration removes the time and complexity involved in transforming streaming Twitter data into a clean, tabular format for analysis.
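To give a feel for the kind of work this integration takes off your hands, here is a minimal sketch of flattening nested tweet records into tabular rows. It is not how Chorus or Gnip implement the import: the input is assumed to be newline-delimited JSON, and the field names (`postedTime`, `actor.preferredUsername`, `body`) and the `COLUMNS` layout are assumptions chosen for illustration.

```python
import csv
import io
import json

# Hypothetical column layout; the real Gnip payload and the table Chorus
# creates may use different fields.
COLUMNS = ["id", "posted_time", "author", "body"]


def flatten(record: dict) -> dict:
    """Pick a few fields out of one nested record and return a flat row."""
    return {
        "id": record.get("id", ""),
        "posted_time": record.get("postedTime", ""),
        "author": (record.get("actor") or {}).get("preferredUsername", ""),
        # Collapse newlines and repeated spaces so each tweet stays on one row.
        "body": " ".join(str(record.get("body", "")).split()),
    }


def to_csv(json_lines: str) -> str:
    """Convert newline-delimited JSON into CSV text, skipping bad lines."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for line in json_lines.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # drop malformed input rather than fail the whole load
        if isinstance(record, dict):
            writer.writerow(flatten(record))
    return out.getvalue()
```

Even this toy version has to deal with nested fields, embedded newlines, and malformed lines, which is the sort of cleanup the Gnip integration is meant to handle before the data reaches your sandbox table.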

4. Having on-demand access to a pool of Kaggle Data Scientists is also compelling. What steps are needed to enable this?

Data Scientists who are part of Kaggle Connect can opt in to contract work through Chorus. From within the Chorus interface, users wishing to engage the Kaggle community can search, browse, and drill into the profiles of Kaggle community members who are interested in collaborating. Through secure integration of the Chorus and Kaggle APIs, users can expose project-relevant information from Chorus Workspaces and communicate with the Kaggle Data Scientists through secure messages. Once the Kaggle Data Scientists review the material, they can respond directly to the Chorus user to discuss details and initiate the project.

The value of this integration is that organizations gain direct, on-demand access to credible Data Scientists who compete and win in Big Data projects that solve real-world problems. Because these competitions span diverse industries, domains, and geographic locations, an organization can easily identify Data Scientists with the right expertise to address its Big Data project needs.

5. How can an organization get started with Chorus and ensure success of a Big Data project?

If you are an existing Greenplum Database customer, you can download Chorus and connect it to your Greenplum Database to set up a workspace for a Big Data project that is already underway. Invite the Data Science team and subject matter experts from engineering and the business to the Chorus workspace so they can use the tool for data preparation, model building, testing, feedback, and so on. Users can then evaluate how Chorus has helped them succeed in their project roles and how the assets produced can be leveraged in future Big Data projects.

If you are not a Greenplum customer, you can download the Greenplum Database Community Edition to use with Chorus. So even if you are not a current Greenplum customer, we offer a complete analytics environment to evaluate and use for your current or next Big Data project.

About the Author: Mona Patel