- Hadoop as a Data Lake
- The Hadoop Distributed File System (HDFS)
- Direct File Transfer to Hadoop HDFS
- Importing Data from Files into Hive Tables
- Importing Data into Hive Tables Using Spark
- Using Apache Sqoop to Acquire Relational Data
- Using Apache Flume to Acquire Data Streams
- Manage Hadoop Work and Data Flows with Apache Oozie
- Apache Falcon
- What's Next in Data Ingestion?
- Summary
Summary
In this chapter:
- The Hadoop data lake concept was presented as a new model for data processing.
- Various methods for making data available to several Hadoop tools were outlined; the examples included copying files directly to HDFS, importing CSV files into Apache Hive and Spark, and importing JSON files into Hive with Spark (see the file-based ingestion sketch after this list).
- Apache Sqoop was presented as a tool for moving relational data into and out of HDFS (see the Sqoop sketch below).
- Apache Flume was presented as a tool for capturing and transporting continuous data, such as web logs, into HDFS (see the Flume sketch below).
- The Apache Oozie workflow manager was described as a tool for creating and scheduling Hadoop workflows (see the Oozie sketch below).
- The Apache Falcon tool provides a high-level framework for data governance (end-to-end management) by keeping Hadoop data and tasks organized and defined as pipelines.
- New tools like Apache NiFi and Apache Atlas were mentioned as options for governance and data flow on a Hadoop cluster.
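
The file-based ingestion sketch below is a minimal recap, assuming Hive is installed and Spark is built with Hive support; the file names, paths, and two-column table layout are illustrative placeholders rather than the chapter's exact examples.

```bash
# Direct file transfer: copy a local CSV file into HDFS (paths are placeholders)
hdfs dfs -mkdir -p /user/hands-on/data
hdfs dfs -put names.csv /user/hands-on/data/

# Hive import: define a delimited-text table and load the HDFS file into it
hive -e "CREATE TABLE IF NOT EXISTS names (id INT, name STRING)
         ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
         LOAD DATA INPATH '/user/hands-on/data/names.csv' INTO TABLE names;"

# Spark import: register a JSON file as a table from the spark-sql shell
spark-sql -e "CREATE TABLE IF NOT EXISTS names_json
              USING JSON OPTIONS (path '/user/hands-on/data/names.json')"
```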
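
For the relational path, a minimal Sqoop import could look like the following; the JDBC URL, database, table, and user names are assumptions, and the -P option prompts for the password at run time.

```bash
# Import one relational table from MySQL into HDFS using a single map task
sqoop import \
  --connect jdbc:mysql://dbhost/retail_db \
  --username dbuser -P \
  --table customers \
  --target-dir /user/hands-on/customers \
  -m 1
```

The companion sqoop export command reverses the direction, writing files from HDFS back into a database table.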
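
A Flume agent is driven by a properties file that names a source, a channel, and a sink; the sketch below tails a web server log into HDFS. The agent name, log path, and HDFS directory are hypothetical.

```bash
# Minimal Flume configuration: tail a web log and deliver the events to HDFS
cat > web-log.conf <<'EOF'
web_agent.sources  = tail_src
web_agent.channels = mem_ch
web_agent.sinks    = hdfs_sink

web_agent.sources.tail_src.type     = exec
web_agent.sources.tail_src.command  = tail -F /var/log/httpd/access_log
web_agent.sources.tail_src.channels = mem_ch

web_agent.channels.mem_ch.type = memory

web_agent.sinks.hdfs_sink.type          = hdfs
web_agent.sinks.hdfs_sink.hdfs.path     = /user/hands-on/weblogs
web_agent.sinks.hdfs_sink.hdfs.fileType = DataStream
web_agent.sinks.hdfs_sink.channel       = mem_ch
EOF

# Start the agent, pointing it at the configuration file
flume-ng agent -n web_agent -c conf -f web-log.conf
```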
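
An Oozie workflow is described by a workflow.xml stored in HDFS and launched with the oozie command-line client and a local job.properties file; the server URL and job ID below are placeholders.

```bash
# Submit and run a workflow; job.properties names the HDFS path of workflow.xml
oozie job -oozie http://localhost:11000/oozie -config job.properties -run

# Check the status of a submitted workflow (the job ID is a placeholder)
oozie job -oozie http://localhost:11000/oozie -info 0000001-230101000000000-oozie-oozi-W
```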