- Hadoop as a Data Lake
- The Hadoop Distributed File System (HDFS)
- Direct File Transfer to Hadoop HDFS
- Importing Data from Files into Hive Tables
- Importing Data into Hive Tables Using Spark
- Using Apache Sqoop to Acquire Relational Data
- Using Apache Flume to Acquire Data Streams
- Manage Hadoop Work and Data Flows with Apache Oozie
- Apache Falcon
- What's Next in Data Ingestion?
- Summary
Summary
In this chapter:
- The Hadoop data lake concept was presented as a new model for data processing.
- Various methods for making data available to several Hadoop tools were outlined; the examples included copying files directly to HDFS, importing CSV files into Apache Hive and Spark, and importing JSON files into Hive with Spark (see the file-based ingestion sketch after this list).
- Apache Sqoop was presented as a tool for moving relational data into and out of HDFS (see the Sqoop sketch below).
- Apache Flume was presented as a tool for capturing and transporting continuous data, such as web logs, into HDFS (see the Flume sketch below).
- The Apache Oozie workflow manager was described as a tool for creating and scheduling Hadoop workflows (see the Oozie sketch below).
- The Apache Falcon tool provides a high-level framework for data governance (end-to-end management) by keeping Hadoop data and tasks organized and defined as pipelines.
- New tools like Apache NiFi and Apache Atlas were mentioned as options for governance and data flow on a Hadoop cluster.
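
The file-based ingestion sketch below is a minimal recap, assuming Hive is installed and Spark is built with Hive support; the file names, paths, and two-column table layout are illustrative placeholders rather than the chapter's exact examples.

```bash
# Direct file transfer: copy a local CSV file into HDFS (paths are placeholders)
hdfs dfs -mkdir -p /user/hands-on/data
hdfs dfs -put names.csv /user/hands-on/data/

# Hive import: define a delimited-text table and load the HDFS file into it
hive -e "CREATE TABLE IF NOT EXISTS names (id INT, name STRING)
         ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
         LOAD DATA INPATH '/user/hands-on/data/names.csv' INTO TABLE names;"

# Spark import: register a JSON file as a table from the spark-sql shell
spark-sql -e "CREATE TABLE IF NOT EXISTS names_json
              USING JSON OPTIONS (path '/user/hands-on/data/names.json')"
```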
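
For the relational path, a minimal Sqoop import could look like the following; the JDBC URL, database, table, and user names are assumptions, and the -P option prompts for the password at run time.

```bash
# Import one relational table from MySQL into HDFS using a single map task
sqoop import \
  --connect jdbc:mysql://dbhost/retail_db \
  --username dbuser -P \
  --table customers \
  --target-dir /user/hands-on/customers \
  -m 1
```

The companion sqoop export command reverses the direction, writing files from HDFS back into a database table.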
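
A Flume agent is driven by a properties file that names a source, a channel, and a sink; the sketch below tails a web server log into HDFS. The agent name, log path, and HDFS directory are hypothetical.

```bash
# Minimal Flume configuration: tail a web log and deliver the events to HDFS
cat > web-log.conf <<'EOF'
web_agent.sources  = tail_src
web_agent.channels = mem_ch
web_agent.sinks    = hdfs_sink

web_agent.sources.tail_src.type     = exec
web_agent.sources.tail_src.command  = tail -F /var/log/httpd/access_log
web_agent.sources.tail_src.channels = mem_ch

web_agent.channels.mem_ch.type = memory

web_agent.sinks.hdfs_sink.type          = hdfs
web_agent.sinks.hdfs_sink.hdfs.path     = /user/hands-on/weblogs
web_agent.sinks.hdfs_sink.hdfs.fileType = DataStream
web_agent.sinks.hdfs_sink.channel       = mem_ch
EOF

# Start the agent, pointing it at the configuration file
flume-ng agent -n web_agent -c conf -f web-log.conf
```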
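
An Oozie workflow is described by a workflow.xml stored in HDFS and launched with the oozie command-line client and a local job.properties file; the server URL and job ID below are placeholders.

```bash
# Submit and run a workflow; job.properties names the HDFS path of workflow.xml
oozie job -oozie http://localhost:11000/oozie -config job.properties -run

# Check the status of a submitted workflow (the job ID is a placeholder)
oozie job -oozie http://localhost:11000/oozie -info 0000001-230101000000000-oozie-oozi-W
```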