|  | Airflow is a workflow automation and scheduling system that can be used to author and manage data pipelines. | 
 |  | Apex is an enterprise-grade, YARN-native big data-in-motion platform that unifies stream and batch processing. | 
 |  | Apache AsterixDB is a scalable big data management system (BDMS) that provides storage, management, and query capabilities for large collections of semi-structured data. | 
 |  | Apache Atlas is a scalable and extensible set of core foundational governance services that enables enterprises to effectively and efficiently meet their compliance requirements within Hadoop and allows integration with the complete enterprise data ecosystem. | 
 |  | The BatchEE project aims to provide a JBatch implementation (aka JSR-352) and a set of useful extensions for this specification. | 
 |  | Apache Beam is an open source, unified model and set of language-specific SDKs for defining and executing data processing workflows as well as data ingestion and integration flows, supporting Enterprise Integration Patterns (EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines simplify the mechanics of large-scale batch and streaming data processing and can run on a number of runtimes such as Apache Flink, Apache Spark, and Google Cloud Dataflow (a cloud service). Beam also provides DSLs in different languages, allowing users to easily implement their data integration processes. | 
 |  | Blur is a search platform capable of searching massive amounts of data in a cloud computing environment. | 
 |  | CMDA provides web services for multi-aspect physics-based and phenomenon-oriented climate model performance evaluation and diagnosis through the comprehensive and synergistic use of multiple observational data, reanalysis data, and model outputs. | 
 |  | Commons RDF is a set of interfaces and classes for RDF 1.1 concepts and behaviours. The commons-rdf-api module defines the interfaces and a test harness. The commons-rdf-simple module provides a basic reference implementation to exercise the test harness and clarify API contracts. | 
 |  | Apache Concerted is a Do-It-Yourself toolkit for building in-memory data engines. | 
 |  | DataFu provides a collection of Hadoop MapReduce jobs and functions in higher-level languages based on it to perform data analysis. It provides functions for common statistics tasks (e.g., quantiles, sampling), PageRank, stream sessionization, and set and bag operations. DataFu also provides Hadoop jobs for incremental data processing in MapReduce. | 
 |  | Eagle is a monitoring solution for Hadoop that instantly identifies access to sensitive data, recognizes attacks and malicious activities, and takes action in real time. | 
 |  | Fineract is an open source system for core banking as a platform. | 
 |  | FreeMarker is a template engine, i.e. a generic tool to generate text output based on templates. FreeMarker is implemented in Java as a class library for programmers. | 
 |  | Gearpump is a reactive real-time streaming engine based on the micro-service Actor model. | 
 |  | Geode is a data management platform that provides real-time, consistent access to data-intensive applications throughout widely distributed cloud architectures. | 
 |  | Guacamole is an enterprise-grade, protocol-agnostic, remote desktop gateway. Combined with cloud hosting, Guacamole provides an excellent alternative to traditional desktops. Guacamole aims to make cloud-hosted desktop access preferable to traditional, local access. | 
 |  | HAWQ is an advanced enterprise SQL on Hadoop analytic engine built around a robust and high-performance massively-parallel processing (MPP) SQL framework evolved from Pivotal Greenplum Database. | 
 |  | HORN is a neuron-centric programming API and execution framework for large-scale deep learning, built on top of Apache Hama. | 
 |  | HTrace is a tracing framework intended for use with distributed systems written in Java. | 
 |  | Impala is a high-performance C++ and Java SQL query engine for data stored in Apache Hadoop-based clusters. | 
 |  | Open source system that enables the orchestration of IoT devices. | 
 |  | Implementation of JSR-353 Java™ API for JSON Processing (renamed from Fleece). | 
 |  | Joshua is a statistical machine translation toolkit. | 
 |  | Kudu is a distributed columnar storage engine built for the Apache Hadoop ecosystem. | 
 |  | Logging for C++ | 
 |  | Big Data Machine Learning in SQL for Data Scientists. | 
 |  | Metron is a project dedicated to providing an extensible and scalable advanced network security analytics tool. It has strong foundations in the Apache Hadoop ecosystem. | 
 |  | Distributed Cryptography; M-Pin protocol for Identity and Trust | 
 |  | Mnemonic is a Java-based non-volatile memory library for in-place structured data processing and computing. | 
 |  | MRQL is a query processing and optimization system for large-scale, distributed data analysis, built on top of Apache Hadoop, Hama, Spark, and Flink. | 
 |  | Mynewt is a real-time operating system for constrained embedded systems like wearables, lightbulbs, locks and doorbells. It works on a variety of 32-bit MCUs (microcontrollers), including ARM Cortex-M and MIPS architectures. | 
 |  | Myriad enables co-existence of Apache Hadoop YARN and Apache Mesos together on the same cluster and allows dynamic resource allocations across both Hadoop and other applications running on the same physical data center infrastructure. | 
 |  | Java modules that allow programmatic creation, scanning, and manipulation of OpenDocument Format (ODF, ISO/IEC 26300) documents. | 
 |  | Omid is a flexible, reliable, high-performance, and scalable ACID transactional framework that allows client applications to execute transactions on top of MVCC key/value-based NoSQL datastores (currently Apache HBase), providing Snapshot Isolation guarantees on the accessed data. | 
 |  | Tools and libraries for developing Attribute-based Access Control (ABAC) Systems in a variety of languages. | 
 |  | Quarks is a stream processing programming model and lightweight runtime to execute analytics at devices on the edge or at the gateway. | 
 |  | Quickstep is a high-performance database engine. | 
 |  | The Ranger project is a framework to enable, monitor and manage comprehensive data security across the Hadoop platform. | 
 |  | Rya (pronounced "ree-uh" /rēə/) is a cloud-based RDF triple store that supports SPARQL queries. Rya is a scalable RDF data management system built on top of Accumulo. Rya uses novel storage methods, indexing schemes, and query processing techniques that scale to billions of triples across multiple nodes. Rya provides fast and easy access to the data through SPARQL, a conventional query mechanism for RDF data. | 
 |  | S2Graph is a distributed and scalable OLTP graph database built on Apache HBase to support fast traversal of extremely large graphs. | 
 |  | SAMOA provides a collection of distributed streaming algorithms for the most common data mining and machine learning tasks such as classification, clustering, and regression, as well as programming abstractions to develop new algorithms that run on top of distributed stream processing engines (DSPEs). It features a pluggable architecture that allows it to run on several DSPEs such as Apache Storm, Apache S4, and Apache Samza. | 
 |  | Singa is a distributed deep learning platform. | 
 |  | Monitoring Solution. | 
 |  | Slider is a collection of tools and technologies to package, deploy, and manage long running applications on Apache Hadoop YARN clusters. | 
 |  | Apache Streams is a lightweight server for ActivityStreams. | 
 |  | SystemML provides declarative large-scale machine learning (ML) that aims at flexible specification of ML algorithms and automatic generation of hybrid runtime plans ranging from single node, in-memory computations, to distributed computations such as Apache Hadoop MapReduce and Apache Spark. | 
 |  | Tamaya is a highly flexible configuration solution based on a modular, extensible, and injectable key/value-based design. It aims to provide a minimal but extensible, modern, and functional API for SE, ME, and EE environments. | 
 |  | Taverna is a domain-independent suite of tools used to design and execute data-driven workflows. | 
 |  | Tephra is a system for providing globally consistent transactions on top of Apache HBase and other storage engines. | 
 |  | TinkerPop is a graph computing framework written in Java. | 
 |  | Toree provides applications with a mechanism to interactively and remotely access Apache Spark. | 
 |  | Trafodion is a web-scale SQL-on-Hadoop solution enabling transactional or operational workloads on Hadoop. | 
 |  | Twill is an abstraction over Apache Hadoop YARN that reduces the complexity of developing distributed applications, allowing developers to focus more on their business logic. | 
 |  | Unomi is a reference implementation of the OASIS Context Server specification currently being worked on by the OASIS Context Server Technical Committee. It provides a high-performance user profile and event tracking server. | 
 |  | A wave is a hosted, live, concurrent data structure for rich communication. It can be used like email, chat, or a document. | 
 |  | A collaborative data analytics and visualization tool for distributed, general-purpose data processing systems such as Apache Spark and Apache Flink. |