UltraVista Comments on Forrester's Wave Enterprise Hadoop Solutions Report
When Forrester released its 2012 Wave report on Enterprise Hadoop Solutions, many vendors unhappy with their rankings rushed to criticize it. In spite of some minor comparison "glitches" in the Excel-based vendor comparison tool, we at UltraVista believe the report makes a meaningful attempt to establish a reasonable matrix for comparing enterprise "Big Data" solutions.
Based on a 15-criteria evaluation of enterprise Hadoop solution providers, Forrester found that:
- In the Leaders category Amazon Web Services led the pack due to its proven, feature-rich Elastic MapReduce subscription service
- IBM and EMC Greenplum offer strong EDW portfolios
- MapR and Cloudera impress with enterprise-grade distributions
- Hortonworks offers impressive professional services
- Pentaho provides an impressive Hadoop data integration tool
- Of the Contenders, DataStax provides a Hadoop platform for real-time, distributed, transactional deployments
- Datameer has a user-friendly Hadoop/MapReduce modeling tool
- Platform Computing and Zettaset offer best-of-breed cluster management tools
- Outerthought has a high-volume search and indexing optimized platform
- HStreaming offers a strong real-time Hadoop solution.
One of the strangest things we found in the report is that among the companies now leveraging Hadoop for enterprise "Big Data" are some that long set their course against open source, yet have somehow found the pride to start "milking" the open source cow for the very purposes they once disliked it for. If any lesson is to be learned from this report, "open source will rule the world" could well be one of them.
Hadoop - The Open Source Heart of Big Data
Hadoop is an open source software project from the Apache Software Foundation that enables the distributed processing of large data sets across clusters of commodity servers. It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance. Rather than relying on high-end hardware, the resiliency of these clusters comes from the software's ability to detect and handle failures at the application layer. Hadoop has two main subprojects:
- MapReduce - The framework that understands and assigns work to the nodes in a cluster;
- HDFS - A file system that spans all the nodes in a Hadoop cluster for data storage. It links together the file systems on many local nodes to make them into one big file system. HDFS assumes nodes will fail, so it achieves reliability by replicating data across multiple nodes.
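To make the MapReduce model above concrete, here is a minimal, hypothetical sketch in Python that simulates the map, shuffle, and reduce phases of a word count in a single process. A real job would be written against the Hadoop Java API or Hadoop Streaming; this only illustrates the data flow the framework distributes across cluster nodes.

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every input record.
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as Hadoop does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the grouped values -- here, sum the counts.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big clusters", "data rules"]
counts = reduce_phase(shuffle(map_phase(docs)))
# counts: {"big": 2, "data": 2, "clusters": 1, "rules": 1}
```

In a real cluster the map tasks run in parallel on the nodes holding the HDFS blocks of the input, and the shuffle moves intermediate pairs over the network to the reducers.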
Hadoop is supplemented by an ecosystem of Apache projects, such as Pig, Hive and ZooKeeper, that extend the value of Hadoop and improve its usability. Hadoop changes the economics and the dynamics of large-scale computing. Hadoop enables a computing solution that is:
- Scalable - New nodes can be added as needed, and added without needing to change data formats, how data is loaded, how jobs are written, or the applications on top.
- Cost effective - Hadoop brings massively parallel computing to commodity servers. The result is a sizeable decrease in the cost per terabyte of storage, which in turn makes it affordable to model all your data.
- Flexible - Hadoop is schema-less, and can absorb any type of data, structured or not, from any number of sources. Data from multiple sources can be joined and aggregated in arbitrary ways, enabling deeper analyses than any one system can provide.
- Fault tolerant - When you lose a node, the system redirects work to another location of the data and continues processing without missing a beat.
Forrester regards Hadoop as the nucleus of the next-generation EDW in the cloud. Hadoop implements the core features that are at the heart of most modern EDWs: cloud-facing architectures, MPP, in-database analytics, mixed workload management, and a hybrid storage layer. Essentially, application development and business process professionals should regard today's Hadoop market as the reinvention of the EDW for the new age of cloud-centric business models that require rapid execution of advanced, embedded analytics against big data.
Consistent with this trend, many EDW vendors, such as EMC Greenplum, IBM, Microsoft, and Oracle, are evolving their offerings to support Hadoop. Both incumbent EDW providers and startups now provide enterprise-grade distributions of Apache Hadoop that incorporate many of the core Hadoop subprojects with various proprietary revisions, extensions, and tools to add functionality, performance, high availability, security, and manageability (see Figure 1). Application development pros must consider the new approach to big data: the Apache Hadoop open source codebase and the commercial offerings that leverage and extend these technologies to help enterprises address critical business challenges that demand extreme scalability in EDW, advanced analytics, business intelligence (BI), online transaction processing (OLTP), and data integration.
The Results: Enterprise Hadoop Distributions Dominate the Market
The evaluation uncovered a market in which:
- Amazon Web Services, IBM, EMC Greenplum, Cloudera, and Hortonworks are Leaders.
- All of the Leaders have a strong Hadoop presence. Amazon leads the pack due to its proven, feature-rich Elastic MapReduce subscription service. IBM and EMC Greenplum offer Hadoop solutions within strong EDW portfolios. Cloudera and MapR impress with best-of-breed enterprise-grade distributions. And Hortonworks is building an impressive Hadoop professional services portfolio.
- Pentaho is a Strong Performer with an impressive Hadoop data integration tool. Among data integration vendors that have added Hadoop functionality to their products over the past year, it has the richest functionality and the most extensive integration with open source Apache Hadoop and with the Amazon, Cloudera, EMC Greenplum, MapR, and Hortonworks distributions of Hadoop.
- DataStax, Datameer, Platform Computing, Zettaset, Outerthought, and HStreaming are Contenders. DataStax provides a Hadoop platform for real-time, distributed, transactional deployments. Datameer offers a user-friendly Hadoop/MapReduce modeling tool. Platform Computing and Zettaset offer best-of-breed Hadoop cluster management tools. Outerthought has optimized its Hadoop platform for high-volume search and indexing.
- HStreaming is a Risky Bet. HStreaming is strong in real-time Hadoop and supports complex event processing (CEP). However, it lacks several key solution components — including a Hadoop modeling tool, an appliance or cloud/SaaS version, and business applications — and has a small professional services team.
Forrester emphasizes that the evaluation of the enterprise Hadoop solutions market is intended to be a starting point only, encouraging readers to view detailed product evaluations and adapt the criteria weightings to fit their individual needs through the Forrester Wave Excel-based vendor comparison tool.