- Hadoop as a Data Lake
- The Hadoop Distributed File System (HDFS)
- Direct File Transfer to Hadoop HDFS
- Importing Data from Files into Hive Tables
- Importing Data into Hive Tables Using Spark
- Using Apache Sqoop to Acquire Relational Data
- Using Apache Flume to Acquire Data Streams
- Manage Hadoop Work and Data Flows with Apache Oozie
- Apache Falcon
- What's Next in Data Ingestion?
- Summary
The Hadoop Distributed File System (HDFS)
Virtually all Hadoop applications operate on data that are stored in HDFS. HDFS is separate from the local file system that most users are accustomed to using; that is, the user must explicitly copy files to and from the HDFS file system. HDFS is not a general-purpose file system and as such cannot be used as a substitute for existing POSIX (or even POSIX-like) file systems.
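For example, explicit copies into and out of HDFS are done with the `hdfs dfs` command-line tool. The file names and HDFS paths below are illustrative, and the commands assume a running HDFS instance:

```shell
# Copy a local file into the user's HDFS home directory
# ("data.csv" and "/user/username" are hypothetical examples)
hdfs dfs -put data.csv /user/username/data.csv

# List the contents of an HDFS directory
hdfs dfs -ls /user/username

# Copy the file back out of HDFS to the local file system
hdfs dfs -get /user/username/data.csv data-copy.csv
```

Note that `hdfs dfs -put` and `-get` operate between the local file system and HDFS, while commands such as `-mv`, `-cp`, and `-rm` operate entirely within HDFS.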
In general, HDFS is a specialized streaming file system that is optimized for reading and writing of large files. When writing to HDFS, data are “sliced” and replicated across the servers in a Hadoop cluster. The slicing process creates many small sub-units (blocks) of the larger file and transparently writes them to the cluster nodes. The various slices can be processed in parallel (at the same time), enabling faster computation. The user does not see the file slices but interacts with whole files in HDFS as in a normal file system (i.e., files can be moved, copied, deleted, etc.). When transferring files out of HDFS, the slices are assembled and written as one file on the host file system.
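The arithmetic behind the slicing can be sketched as follows. The sketch assumes the common defaults of a 128 MB block size and a replication factor of 3 (both are configurable per cluster, so actual values may differ):

```python
import math

BLOCK_SIZE = 128 * 1024**2   # assumed default HDFS block size: 128 MB
REPLICATION = 3              # assumed default replication factor

def hdfs_block_count(file_size: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of blocks ("slices") a file of file_size bytes occupies."""
    return max(1, math.ceil(file_size / block_size))

def raw_storage(file_size: int, replication: int = REPLICATION) -> int:
    """Total bytes consumed across the cluster once every block is replicated."""
    return file_size * replication

# A 1 GB file is split into 8 blocks of 128 MB, each stored 3 times.
one_gb = 1024**3
print(hdfs_block_count(one_gb))   # 8
print(raw_storage(one_gb))        # 3221225472 bytes (3 GB)
```

Because each 128 MB block can be read by a different node, a MapReduce or Spark job over this file could run up to eight map tasks in parallel.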
The slices or sub-units are also replicated across different servers so that the failure of any single server will not result in lost data. Due to its design, HDFS does not support random reads or writes to files, but it does support appending to a file. Note that for testing purposes it is also possible to create a single instance of HDFS on a single hard drive (e.g., a laptop or desktop computer); in this situation there is no file slicing or replication performed on the file.
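In such a single-node test installation, replication is typically disabled by setting the replication factor to 1 in `hdfs-site.xml`. A minimal sketch of the relevant property (the rest of the configuration file is omitted):

```xml
<configuration>
  <!-- Keep one copy of each block: appropriate only for a
       single-node test install, never for a production cluster -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```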