6

Note: Loading Tab-Separated Data In Cascalog

 2 years ago
source link: https://blog.jakubholy.net/2012/10/09/note-loading-tab-separated-data-in-cascalog/
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

Note: Loading Tab-Separated Data In Cascalog

October 9, 2012
To load all fields from a tab-separated text file in Cascalog we need to use the generic hfs-tap and specify the "scheme" (notice that loading all fields and expecting tab as the separator is the default behavior of TextDelimited):

 (hfs-tap
   (cascading.scheme.hadoop.TextDelimited.)
   "hdfs:///user/hive/warehouse/playerevents/epoch_week=2196/output_aewa-analytics-ada_1334697041_1.log")


With a custom separator and fields:

 (hfs-tap
   (cascading.scheme.hadoop.TextDelimited. (cascalog.workflow/fields ["?f1" "?f2"]) "\t") ; or cascading.tuple.Fields/ALL inst. of (fields ...)
   "hdfs:///user/hive/warehouse/playerevents/epoch_week=2196/output_aewa-analytics-ada_1334697041_1.log")


Hadoop doesn't manage to load data files from nested sub-directories (for example from a Hive partitioned table). To load them, you need to use a "glob pattern" to turn the standard Hfs tap into a GlobHfs tap. This is how we would match all the subdirectories (Hadoop will then handle loading the files in them):

 (hfs-tap
   (cascading.scheme.hadoop.TextDelimited.)
   "hdfs:///user/hive/warehouse/playerevents/"
   :source-pattern "epoch_week=*/")


Enjoy.

Are you benefitting from my writing? Consider buying me a coffee or supporting my work via GitHub Sponsors. Thank you! You can also book me for a mentoring / pair-programming session via Codementor or (cheaper) email.

Allow me to write to you!

Let's get in touch! I will occasionally send you a short email with a few links to interesting stuff I found and with summaries of my new blog posts. Max 1-2 emails per month. I read and answer to all replies.


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK