- Hadoop as a Data Lake
- The Hadoop Distributed File System (HDFS)
- Direct File Transfer to Hadoop HDFS
- Importing Data from Files into Hive Tables
- Importing Data into Hive Tables Using Spark
- Using Apache Sqoop to Acquire Relational Data
- Using Apache Flume to Acquire Data Streams
- Manage Hadoop Work and Data Flows with Apache Oozie
- Apache Falcon
- What's Next in Data Ingestion?
- Summary
The Hadoop Distributed File System (HDFS)
Virtually all Hadoop applications operate on data that are stored in HDFS. HDFS is separate from the local file system that most users are accustomed to using; that is, the user must explicitly copy files to and from the HDFS file system. HDFS is not a general-purpose file system and as such cannot be used as a substitute for existing POSIX (or even POSIX-like) file systems.
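For example, explicit copies into and out of HDFS are done with the `hdfs dfs` command-line tool. The file names and HDFS paths below are illustrative, and the commands assume a running HDFS instance:

```shell
# Copy a local file into the user's HDFS home directory
# ("data.csv" and "/user/username" are hypothetical examples)
hdfs dfs -put data.csv /user/username/data.csv

# List the contents of an HDFS directory
hdfs dfs -ls /user/username

# Copy the file back out of HDFS to the local file system
hdfs dfs -get /user/username/data.csv data-copy.csv
```

Note that `hdfs dfs -put` and `-get` operate between the local file system and HDFS, while commands such as `-mv`, `-cp`, and `-rm` operate entirely within HDFS.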
In general, HDFS is a specialized streaming file system that is optimized for reading and writing of large files. When writing to HDFS, data are “sliced” and replicated across the servers in a Hadoop cluster. The slicing process creates many small sub-units (blocks) of the larger file and transparently writes them to the cluster nodes. The various slices can be processed in parallel (at the same time), enabling faster computation. The user does not see the file slices but interacts with whole files in HDFS as in a normal file system (i.e., files can be moved, copied, deleted, etc.). When transferring files out of HDFS, the slices are assembled and written as one file on the host file system.
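The arithmetic behind the slicing can be sketched as follows. The sketch assumes the common defaults of a 128 MB block size and a replication factor of 3 (both are configurable per cluster, so actual values may differ):

```python
import math

BLOCK_SIZE = 128 * 1024**2   # assumed default HDFS block size: 128 MB
REPLICATION = 3              # assumed default replication factor

def hdfs_block_count(file_size: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of blocks ("slices") a file of file_size bytes occupies."""
    return max(1, math.ceil(file_size / block_size))

def raw_storage(file_size: int, replication: int = REPLICATION) -> int:
    """Total bytes consumed across the cluster once every block is replicated."""
    return file_size * replication

# A 1 GB file is split into 8 blocks of 128 MB, each stored 3 times.
one_gb = 1024**3
print(hdfs_block_count(one_gb))   # 8
print(raw_storage(one_gb))        # 3221225472 bytes (3 GB)
```

Because each 128 MB block can be read by a different node, a MapReduce or Spark job over this file could run up to eight map tasks in parallel.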
The slices or sub-units are also replicated across different servers so that the failure of any single server will not result in lost data. Due to its design, HDFS does not support random reads or writes to files, but it does support appending to a file. Note that for testing purposes it is also possible to create a single instance of HDFS on a single hard drive (e.g., a laptop or desktop computer); in this situation there is no file slicing or replication performed on the file.
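In such a single-node test installation, replication is typically disabled by setting the replication factor to 1 in `hdfs-site.xml`. A minimal sketch of the relevant property (the rest of the configuration file is omitted):

```xml
<configuration>
  <!-- Keep one copy of each block: appropriate only for a
       single-node test install, never for a production cluster -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```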