- Hadoop as a Data Lake
- The Hadoop Distributed File System (HDFS)
- Direct File Transfer to Hadoop HDFS
- Importing Data from Files into Hive Tables
- Importing Data into Hive Tables Using Spark
- Using Apache Sqoop to Acquire Relational Data
- Using Apache Flume to Acquire Data Streams
- Manage Hadoop Work and Data Flows with Apache Oozie
- Apache Falcon
- What's Next in Data Ingestion?
Manage Hadoop Work and Data Flows with Apache Oozie
Apache Oozie is a workflow scheduler system designed to run and manage multiple related Apache Hadoop jobs. For instance, complete data input and analysis may require several discrete Hadoop jobs to be run as a workflow where the output of one job will be the input for a successive job. Oozie is designed to construct and manage these workflows.
Oozie is not a substitute for the YARN scheduler mentioned previously. That is, YARN manages resources for individual Hadoop jobs, and Oozie provides a way to connect and control multiple Hadoop jobs on the cluster.
Oozie workflow jobs are represented as DAGs of actions. There are three types of Oozie jobs:
Workflow: A specified sequence of Hadoop jobs with outcome-based decision points and control dependency. Progress from one action to another cannot happen until the first action is complete.
Coordinator: A scheduled workflow job that can run at various time intervals or when data becomes available.
Bundle: A higher-level Oozie abstraction that will batch a set of coordinator jobs.
Oozie is integrated with the rest of the Hadoop stack supporting several types of Hadoop jobs out of the box (such as Java MapReduce, Streaming MapReduce, Pig, Hive, Spark, and Sqoop) as well as system-specific jobs (such as Java programs and shell scripts). Oozie also provides a CLI and a Web UI for monitoring jobs. An example of a simple Oozie workflow is shown in Figure 4.7. In this example, Oozie runs a basic MapReduce operation. If the application was successful the job ends; if there was an error, the job is killed.
Figure 4.7 A simple Oozie DAG workflow.
Oozie workflow definitions are written in Hadoop Process Definition Language (hPDL), which is an XML-based process definition language. Oozie workflows contain several types of nodes.
Start/Stop control flow nodes define the beginning and the end of a workflow. These include start, end, and optional fail nodes.
Action nodes are where the actual processing tasks are defined. When an action node finishes, the remote systems notify Oozie and the next node in the workflow is executed. Action nodes can also include HDFS commands.
Fork/join nodes allow parallel execution of tasks in the workflow. The fork node allows two or more tasks to run at the same time. A join node represents a rendezvous point that must wait until all forked tasks complete.
Control flow nodes enable decisions to be made about the previous task. Control decisions are based on the results of the previous action (e.g. file size or file existence). Decision nodes are essentially switch-case statements that use JSP EL (Java Server Pages-Expression Language) that evaluates to either true or false.
Figure 4.8 A more complex Oozie DAG workflow.