Get a List of HDFS Files in Python

For an ad hoc task, I needed to read files from multiple HDFS directories based on a date range.

The HDFS directory structure looks like the following:

/data
    /20170730
        /part-00000
        /...
    /20170731
    /20170801
    /20170802
    ...
    /20170903

Given a date, e.g. 20170801, I need to read the files from the folders /data/20170801, /data/20170802, …, /data/20170830, but not the others.

To achieve this inside my Python script, I searched online and arrived at the following solution.

import subprocess

dir_in = "/data"
# List the HDFS directory and keep only the 8th column of `hdfs dfs -ls`
# output, which is the full path of each entry.
args = "hdfs dfs -ls " + dir_in + " | awk '{print $8}'"
# universal_newlines=True decodes the output to str on Python 3, so the
# split below yields strings rather than bytes.
proc = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                        shell=True, universal_newlines=True)

s_output, s_err = proc.communicate()
all_data_dirs = s_output.split()
# ['/data/20170730', '/data/20170731', ...]
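
For reference, the same listing can be written more compactly with subprocess.check_output; a minimal sketch, assuming the same /data layout:

import subprocess

# Equivalent listing via check_output; unlike the Popen version above,
# this raises CalledProcessError if the shell pipeline exits non-zero.
out = subprocess.check_output("hdfs dfs -ls /data | awk '{print $8}'",
                              shell=True, universal_newlines=True)
all_data_dirs = out.split()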

Then, to fit my specific needs, I just need to apply a simple filter to the list, as sketched below.
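
As a concrete sketch of that filtering step (the start and end dates below are hypothetical examples, not values from the original script): since the folder names are YYYYMMDD strings, plain string comparison already matches date order.

date_start, date_end = "20170801", "20170830"  # hypothetical example range

# Keep only directories whose trailing date component falls in the range;
# YYYYMMDD strings sort lexicographically in date order, so string
# comparison is enough.
wanted_dirs = [d for d in all_data_dirs
               if date_start <= d.rsplit("/", 1)[-1] <= date_end]
# ['/data/20170801', '/data/20170802', ..., '/data/20170830']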

Coursera Downloader | A Python Crawler for Coursera Materials

A command-line Coursera downloader written in Python.

You can download the script from my GitHub repo here: https://github.com/yingchi/coursera-downloader

While taking the wonderful courses on Coursera, I found it quite frustrating to have to click in multiple places and move back and forth just to download the learning materials for one course.

After searching around, I found two GitHub repos that help people download course materials from Coursera. However, one of them is no longer updated and the other is not very convenient to use.

So I decided to combine the good features of both and write a new download tool. You will need your Coursera email and password to use the script, but don't worry: they are encrypted and saved to a local file.
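
For flavor, here is a purely illustrative sketch of what caching credentials in encrypted form can look like; this is not the script's actual implementation, and the cryptography package, key handling, and file name are all assumptions.

import getpass
from pathlib import Path
from cryptography.fernet import Fernet

# Illustrative only; NOT coursera-downloader's real mechanism.
# A real tool would also need to protect the key itself.
key = Fernet.generate_key()
cipher = Fernet(key)
email = input("Coursera email: ")
password = getpass.getpass("Coursera password: ")
Path("credentials.enc").write_bytes(
    cipher.encrypt("{}\n{}".format(email, password).encode()))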