Using HDFS

When you first set-up an HDFS cluster, the HDFS file system is empty. You need to do a few things to get it set-up so that your users can store files in it and reference those files in their Spark jobs.

NOTE: The commands below assume that you followed my instructions for installing HDFS.

Create the `/user` directory

HDFS assumes that a user's home directory is located in /user/<username> but the /user directory is not created by default. So, let's get that created.

su - hdfs
hadoop fs -mkdir /user

Create the home directory for a new user

su - hdfs
hadoop fs -mkdir /user/<username>
hadoop fs -chown <username>:<username> /user/<username>

Note: you should use the same usernames that you use outside HDFS. HDFS looks to the user account on the machine to determine certain access rights.

Solving permission problems with supergroup

If you run a Spark or MapReduce job and it fails because the job tried to write some data outside of your user directory on HDFS, you may have to add that user to the supergroup to allow that data to be created.

To do this, create the supergroup on the master node and then add the user to that group

sudo addgroup supergroup (only run once)
sudo usermod -a -G supergroup <username>

If you know this is going to be an issue for all users who try to run jobs on your cluster, you can modify the chown command above to read: hadoop fs -chown <username>:supergroup /user/<username> to ensure the user's own directories belong to that group.

Putting files into HDFS

To copy a file from the local file system to HDFS, use the put command:

hadoop fs -put <file1> <file2> ... <hdfs_dir>

Example: hadoop fs -put edges.txt /user/kena

Getting files out of HDFS

To copy a file out of HDFS to the local file system, use the get command:

hadoop fs -get <hdfs_file> <local_dir>

Working with Hadoop PART files

When you create an RDD and have Spark save it uwith the RDD.saveAsTextFile(".../path") command, it creates a directory that contains each partition of the RDD stored in a separate part-XXXXX file, i.e. part-00000, part-00001, etc. Each of those files is just a simple text file (hence the name: saveAsTextFile).

It would be a pain if you have to combine all of those part files yourself each time you run a job... fortunately, you can have Hadoop do that for you (!):

hadoop fs -getmerge /user/kena/counts ./counts.txt

or more generally

hadoop fs -getmerge /path/to/hdfs/dir ./path/to/local/combined/text/file.txt

Listing the size of files

hadoop fs -ls -h

Listing the size of directories

hadoop fs -du -s -h <hdfs_dir>

Deleting Files and Directories

HDFS deletes files and directories as you would expect:

hadoop fs -rm /path/to/hdfs/file.txt
hadoop fs -rm -r /path/to/hdfs/dir

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Using HDFS

Create the `/user` directory

Create the home directory for a new user

Solving permission problems with supergroup

Putting files into HDFS

Getting files out of HDFS

Working with Hadoop PART files

Listing the size of files

Listing the size of directories

Deleting Files and Directories

FilesExpand file tree

usingHDFS.md

Latest commit

History

usingHDFS.md

File metadata and controls

Using HDFS

Create the /user directory

Create the home directory for a new user

Solving permission problems with supergroup

Putting files into HDFS

Getting files out of HDFS

Working with Hadoop PART files

Listing the size of files

Listing the size of directories

Deleting Files and Directories

Create the `/user` directory