Apache Falcon simplifies the configuration of data motion by providing replication, life cycle management, lineage, and traceability. These features provide consistent data governance across Hadoop components, which is not possible using Oozie alone. For instance, Falcon allows Hadoop administrators to centrally define their data pipelines, and Falcon then uses those definitions to auto-generate workflows in Apache Oozie. In simple terms, proper use of Falcon helps keep an active Hadoop cluster from becoming a confusing mess.
To see why, consider how Oozie works on its own. Oozie lets you define Hadoop processing through workflow and coordinator (a recurring workflow) jobs. The input datasets for data processing are often described as part of coordinator jobs, which specify properties such as path, frequency, schedule, and so on. If two coordinator jobs depend on the same data, these details have to be defined and managed twice. If you want to add deletion or movement of the shared data, a separate coordinator is required. Oozie will certainly work in these situations, but there is no easy way to define and track the entire data life cycle or to manage multiple independent Oozie jobs.
Oozie is useful when initially setting up and testing workflows and can be used when the workflows are independent and not expected to change often. If there are multiple dependencies between workflows or there is a need to manage the entire data life cycle, then Falcon should be considered.
As Hadoop’s high-level workflow scheduler, Oozie may be managing hundreds to thousands of coordinator jobs and files. This situation leads to some common mistakes: processes might use the wrong copies of datasets, datasets and processes may be duplicated, and it becomes increasingly difficult to track down where a particular dataset originated. At that level of complexity, managing so many dataset and process definitions becomes difficult.
To solve these problems, Falcon allows the creation of a pipeline that is defined by three key entities:
- A cluster entity defines where data, tools, and processes live on your Hadoop cluster. It contains things like the NameNode address and the Oozie URL, which Falcon uses to execute the other two entities: feeds and processes.
- A feed entity defines where data live on your cluster (in HDFS). The feed tells Falcon where your new data (ingested, processed, or both) live so that Falcon can retain the data (through retention policies) and replicate the data (through replication policies) on or from your cluster. A feed is typically, but not necessarily, the output of a process.
- A process entity defines what action or “process” takes place in a pipeline. Most typically, the process links to an Oozie workflow, which contains a series of actions to execute on your cluster (such as shell scripts, Java JARs, Hive actions, Pig actions, Sqoop actions, and so on). A process also, by definition, takes feeds as inputs or outputs and is where you define how often a workflow should run.
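The relationship among the three entities can be sketched in Python. This is only a conceptual model and not Falcon's API: Falcon entities are actually written as XML specifications, and every class, field, path, and URL below is a hypothetical name chosen for illustration. The point of the sketch is that a feed is defined once and then referenced by any number of processes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Cluster:
    """Where things run: the service endpoints of one Hadoop cluster."""
    name: str
    namenode_url: str
    oozie_url: str


@dataclass(frozen=True)
class Feed:
    """A dataset in HDFS, defined once and shared by every process."""
    name: str
    path: str
    cluster: Cluster
    retention_days: int


@dataclass
class Process:
    """A scheduled workflow that reads and writes feeds."""
    name: str
    workflow_path: str
    frequency_hours: int
    inputs: List[Feed] = field(default_factory=list)
    outputs: List[Feed] = field(default_factory=list)


def consumers(feed: Feed, processes: List[Process]) -> List[str]:
    """Return the names of the processes that read the given feed."""
    return [p.name for p in processes if feed in p.inputs]


# Hypothetical entities; none of these names come from a real cluster.
cluster = Cluster("primary", "hdfs://namenode.example.com:8020",
                  "http://oozie.example.com:11000/oozie")
raw = Feed("raw-input", "/data/raw", cluster, retention_days=90)
clean = Feed("clean-output", "/data/clean", cluster, retention_days=3 * 365)
report = Feed("report", "/data/report", cluster, retention_days=30)

# Two processes depend on the same feed, which is still defined only once.
pig_clean = Process("clean", "/apps/clean", 1, inputs=[raw], outputs=[clean])
audit = Process("audit", "/apps/audit", 24, inputs=[raw], outputs=[report])

print(consumers(raw, [pig_clean, audit]))  # ['clean', 'audit']
```

Because both processes point at the same `raw` object, a change to its path or retention is made in one place, which is the duplication problem that separate Oozie coordinators leave unsolved.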
The following example will help explain how Falcon is used. Assume there is raw input data that arrives every hour. These data are processed with a Pig script, and the results are saved for later processing. At a simple level, an Oozie workflow could easily manage the task. However, high-level features that are not available in Oozie are needed to automate the process. First, the input data have a retention policy of 90 days, after which old data are discarded. Second, the processing step may have a certain number of retries should the process fail. Finally, the output data have their own location and a retention policy of three years. It is also possible to query data lineage with Falcon (i.e., where did these data come from?). The simple job flow is shown in Figure 4.9.
Figure 4.9 A simple Apache Falcon workflow.
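The retention and retry policies in this example can be made concrete with a short Python sketch. It is not how Falcon implements these features; Falcon evaluates such policies itself from the entity definitions. The function names, the dates, and the retry count of three are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def expired_instances(instance_times, now, retention):
    """Return the dataset instances that are older than the retention window."""
    cutoff = now - retention
    return [t for t in instance_times if t < cutoff]


def run_with_retries(task, attempts):
    """Call task until it succeeds or attempts run out; return the tries used."""
    for attempt in range(1, attempts + 1):
        try:
            task()
            return attempt
        except RuntimeError:
            if attempt == attempts:
                raise  # out of retries, so report the failure


# Hourly input instances covering the last 92 days (hypothetical dates).
now = datetime(2016, 1, 1)
hourly = [now - timedelta(hours=k) for k in range(1, 92 * 24 + 1)]

# A 90-day retention policy discards the oldest two days of hourly data.
expired = expired_instances(hourly, now, timedelta(days=90))
print(len(expired))  # 48

# Simulated processing step that fails twice before it succeeds.
calls = {"count": 0}


def flaky_pig_job():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated failure")


tries = run_with_retries(flaky_pig_job, attempts=3)
print(tries)  # 3
```

The three-year output retention works the same way as the input case, with `timedelta(days=3 * 365)` as the window.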