For an ad hoc task, I needed to read files from multiple HDFS directories based on a date range.
The HDFS directory structure looks like this:
    /data
        /20170730
            /part-00000
            /...
        /20170731
        /20170801
        /20170802
        ...
        /20170903
Given a date, e.g. 20170801, I needed to read the files from the folders /data/20170801, /data/20170802, …, /data/20170830, but not the others.
To achieve this inside my Python script, I searched online and eventually arrived at the following solution.
    import subprocess
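One way such a listing can be obtained with `subprocess` is sketched below. It assumes the `hdfs` command-line client is on the PATH, that `hdfs dfs -ls` prints its usual eight-column entries after a "Found N items" header, and that paths contain no whitespace. The helper names `parse_ls_output` and `list_hdfs_dir` are hypothetical, and this is not necessarily the exact code I used at the time.

```python
import subprocess


def parse_ls_output(output):
    """Extract paths from the text printed by `hdfs dfs -ls`."""
    paths = []
    for line in output.splitlines():
        fields = line.split()
        # Entry lines have 8 columns (permissions, replication, owner,
        # group, size, date, time, path). Shorter lines, such as the
        # "Found N items" header or blank lines, are skipped.
        if len(fields) < 8:
            continue
        # Assumption: the path has no whitespace, so it is the last field.
        paths.append(fields[-1])
    return paths


def list_hdfs_dir(path):
    """Run `hdfs dfs -ls <path>` and return the listed paths."""
    output = subprocess.check_output(["hdfs", "dfs", "-ls", path])
    return parse_ls_output(output.decode("utf-8"))
```

Splitting the parsing from the subprocess call keeps the parsing logic checkable on a machine without a Hadoop client; `list_hdfs_dir("/data")` would then return entries such as `/data/20170730`.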
Then, to fit my specific needs, I just needed to do a simple filtering of the list.
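A minimal sketch of that filtering step, assuming the list holds full paths whose last component is a `YYYYMMDD` date. The function name `filter_by_date_range` and the explicit `end` argument are hypothetical additions for illustration. Because the date strings have a fixed width, plain string comparison orders them chronologically.

```python
from datetime import datetime


def filter_by_date_range(paths, start, end):
    """Keep paths whose last component is a YYYYMMDD date in [start, end]."""
    selected = []
    for path in paths:
        name = path.rstrip("/").split("/")[-1]
        # Ignore anything that is not an 8-digit name.
        if len(name) != 8 or not name.isdigit():
            continue
        # Ignore 8-digit names that are not real calendar dates.
        try:
            datetime.strptime(name, "%Y%m%d")
        except ValueError:
            continue
        # Fixed-width YYYYMMDD strings compare in chronological order.
        if start <= name <= end:
            selected.append(path)
    return selected


paths = [
    "/data/20170730",
    "/data/20170731",
    "/data/20170801",
    "/data/20170802",
    "/data/20170903",
]
print(filter_by_date_range(paths, "20170801", "20170830"))
# prints ['/data/20170801', '/data/20170802']
```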