Accessing and Managing HDFS Data
You can access the files or data stored in HDFS in many different ways. For example, you can use HDFS FS shell commands, leverage the Java API available in the classes of the org.apache.hadoop.fs package (http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/package-frame.html), write a MapReduce job, or write Hive or Pig queries. You can even use a web browser to browse the files from an HDFS cluster.
HDFS Command-Line Interface
The HDFS command-line interface (CLI), called FS Shell, enables you to write shell commands to manage your files in the HDFS cluster. It is useful when you need a scripting language to interact with the stored files and data. Figure 3.16 shows the hadoop fs command and the different parameter options you can use with it.
FIGURE 3.16 Hadoop CLI for managing the file system.
Some commonly used parameters with the hadoop fs command follow:
- mkdir—Creates a directory based on the passed URI.
- put—Copies one or more files from the local file system to the destination file system. It can also read input from stdin and write it to the destination file system.
- copyFromLocal—Copies one or more files from the local file system to the destination file system. The -f option overwrites the destination if it already exists.
- get—Copies one or more files from HDFS to the local file system.
- copyToLocal—Similar to get, except that the destination must be on the local file system.
- ls—Returns statistics on a file or lists the contents of a directory.
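- lsr—Same as ls but recursive in nature. In Hadoop 2, this command is deprecated in favor of ls -R.
- rm—Deletes the specified files. To delete a directory and its contents, add the -r option; an empty directory can be removed with rmdir.
- rmr—Same as rm but recursive in nature. In Hadoop 2, this command is deprecated in favor of rm -r.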
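Because FS Shell is meant for scripting, the commands in this list are often driven from another program. The following is a minimal sketch in Python, not a definitive implementation: the helper names hdfs_cmd and run_hdfs, the file sales.csv, and the directory /user/demo/sales are all hypothetical and chosen only for illustration. The sketch builds the hadoop fs command lines and prints them as a dry run; actually executing them assumes that the hadoop executable is on the PATH of a machine that can reach the cluster.

```python
import shlex
import subprocess
from typing import List


def hdfs_cmd(command: str, *args: str) -> List[str]:
    """Build the argument list for `hadoop fs -<command> <args...>`."""
    return ["hadoop", "fs", f"-{command}", *args]


def run_hdfs(command: str, *args: str) -> str:
    """Run the command and return its stdout (needs `hadoop` on the PATH)."""
    result = subprocess.run(
        hdfs_cmd(command, *args), check=True, capture_output=True, text=True
    )
    return result.stdout


# Hypothetical paths, used only for illustration.
LOCAL_FILE = "sales.csv"
HDFS_DIR = "/user/demo/sales"

EXAMPLE_COMMANDS = [
    hdfs_cmd("mkdir", HDFS_DIR),                             # create a directory
    hdfs_cmd("put", LOCAL_FILE, HDFS_DIR),                   # upload a local file
    hdfs_cmd("copyFromLocal", "-f", LOCAL_FILE, HDFS_DIR),   # upload and overwrite
    hdfs_cmd("ls", HDFS_DIR),                                # list the directory
    hdfs_cmd("get", f"{HDFS_DIR}/{LOCAL_FILE}", "sales_copy.csv"),  # download
    hdfs_cmd("rm", "-r", HDFS_DIR),                          # recursive delete
]

if __name__ == "__main__":
    # Dry run: print each command line instead of executing it.
    for argv in EXAMPLE_COMMANDS:
        print(shlex.join(argv))
```

To execute a command instead of printing it, call run_hdfs with the same arguments, for example run_hdfs("ls", HDFS_DIR). Passing the arguments as a list rather than as one shell string avoids quoting problems with paths that contain spaces.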
Using MapReduce, Hive, Pig, or Sqoop
FS Shell commands work well when you only need to move files or data back and forth between the local file system and HDFS. But what if you want to process the data stored in HDFS? For example, imagine that you have stored sales transaction data in HDFS and you want to know the top five states by revenue.
This is where you need MapReduce, Hive, or Pig. MapReduce requires programming skills and expertise in Java (see Hour 4, “The MapReduce Job Framework and Job Execution Pipeline”; Hour 5, “MapReduce—Advanced Concepts and YARN”; Hour 10, “Programming MapReduce Jobs”; and Hour 11, “Customizing HDInsight Cluster with Script Action”). People with a SQL background can use Hive (see Hour 12, “Getting Started with Apache Hive and Apache Tez in HDInsight,” and Hour 13, “Programming with Apache Hive, Apache Tez in HDInsight, and Apache HCatalog”). Pig helps with data flow and data transformation for data analysis (see Hour 17, “Using Pig for Data Processing”).
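To make the top-five-states question concrete, here is a small local sketch in plain Python that mimics the map and reduce phases on a handful of in-memory records. It is an illustration of the logic only, not a Hadoop job: the sample records, the "transaction_id,state,amount" layout, and the function names are all assumptions made for this example. On a real cluster, the same aggregation would run through MapReduce, Hive, or Pig against the files stored in HDFS.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical sample records in the form "transaction_id,state,amount".
SAMPLE_SALES = [
    "1,WA,100.00",
    "2,CA,250.50",
    "3,TX,80.00",
    "4,WA,50.00",
    "5,NY,310.00",
    "6,CA,49.50",
    "7,FL,125.00",
    "8,OR,10.00",
    "9,TX,40.00",
]


def map_sale(line: str) -> Iterator[Tuple[str, float]]:
    """Map phase: emit one (state, amount) pair per transaction record."""
    _txn_id, state, amount = line.split(",")
    yield state, float(amount)


def reduce_revenue(pairs: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Reduce phase: sum the amounts for each state key."""
    totals: Dict[str, float] = defaultdict(float)
    for state, amount in pairs:
        totals[state] += amount
    return dict(totals)


def top_states(lines: Iterable[str], n: int = 5) -> List[Tuple[str, float]]:
    """Return the n states with the highest total revenue, largest first."""
    pairs = (pair for line in lines for pair in map_sale(line))
    totals = reduce_revenue(pairs)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:n]


if __name__ == "__main__":
    for state, revenue in top_states(SAMPLE_SALES):
        print(f"{state}\t{revenue:.2f}")
```

The split into a per-record map step and a per-key reduce step mirrors how the work would be divided across nodes; only the final sort and cut to five rows needs to see all the state totals at once.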
You can use Sqoop to move data from HDFS to any relational database store (for example, SQL Server), and vice versa. You can learn more about Sqoop in Hour 18, “Using Sqoop for Data Movement Between RDBMS and HDInsight.”
GO TO Refer to Hour 18, “Using Sqoop for Data Movement Between RDBMS and HDInsight,” for more details on Sqoop and how you can use it for data movement.