Big Data and Apache Hivemall: Machine Learning with SQL

Anyone who applies machine learning techniques to data knows that it is a complex task: it requires knowledge of several programming languages, the underlying science, frameworks, and algorithms, and it is rarely quick work.

With Hivemall, Apache provides a scalable machine learning library built as a collection of Hive User Defined Functions (UDFs), which lets anyone with knowledge of SQL run machine learning algorithms. Hivemall runs on Hadoop-based data processing frameworks, so it builds on the distributed file system and the MapReduce parallel data processing model of that ecosystem. You can try these functions in Apache Hive or Spark environments.

Like Google BigQuery ML, Apache Hivemall lets us apply machine learning to our big data with a series of queries. Both solutions introduce a new paradigm, “machine learning in query language”, and follow the same basic logical approach, but Hivemall is more flexible in the choice of algorithms and platforms.
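
To give an idea of what "a series of queries" can look like, here is a minimal HiveQL sketch of training and applying a logistic regression model. It assumes that the Hivemall functions are already registered in the Hive session. The tables `training(features, label)` and `test_exploded(rowid, feature, value)` are hypothetical names chosen for illustration, and the exact functions and options available may differ between Hivemall versions.

```sql
-- Assumption: training(features array<string>, label float) is a hypothetical
-- table whose features are encoded as "name:value" strings.
-- Train a logistic regression model; each task emits (feature, weight) pairs,
-- which are averaged per feature to obtain a single model table.
CREATE TABLE lr_model AS
SELECT
  feature,
  avg(weight) AS weight
FROM (
  SELECT train_logistic_regr(add_bias(features), label) AS (feature, weight)
  FROM training
) t
GROUP BY feature;

-- Assumption: test_exploded(rowid, feature, value) is a hypothetical table
-- holding one row per (example, feature) pair, bias feature included.
-- Prediction is a join with the model followed by a weighted sum per example.
SELECT
  t.rowid,
  sigmoid(sum(m.weight * t.value)) AS probability
FROM test_exploded t
LEFT OUTER JOIN lr_model m ON (t.feature = m.feature)
GROUP BY t.rowid;
```

The point of the sketch is that both training and prediction are ordinary SQL statements: the model itself is just a table of feature weights that can be joined, stored, and queried like any other data.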
