MapReduce and Mrjob

Ah, distributed computing! It’s so much fun knowing you’ve got GPUs and processing power sitting in some locked closet full of servers. It can feel like the Wild West sometimes, because random batch errors show up somewhere in “the cloud” after you run a job. God forbid a memory leak of some sort.

I just wanted a place to put some links while I work through some distributed computing problems. It feels like a big stepping stone toward working with larger and larger datasets. (Is 90 GB scary enough yet?!) I recall running 5 GB of IMDb data in R and hoping it wouldn’t explode (don’t do that locally). Then I discovered the fun of throwing data into S3 buckets, and the promise of converting to Parquet.

Useful resources:

nice slides from Duke
formatting mrjob output
tips for configuring mrjob files
more mrjob and AWS tips
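
For orientation before diving into those links, here is a minimal sketch of the map, shuffle, reduce flow in plain Python, no cluster required. It is not mrjob itself and not taken from any of the resources above: the names `mapper`, `reducer`, and `run_job` are made up for illustration, and the word-count task is just the usual toy example. The mapper/reducer split does loosely mirror how mrjob jobs are written, where each step yields key-value pairs.

```python
from collections import defaultdict


def mapper(line):
    """Map step: emit a (word, 1) pair for every word in one line of input."""
    for word in line.lower().split():
        yield word, 1


def reducer(word, counts):
    """Reduce step: collapse all the counts for one word into a single total."""
    yield word, sum(counts)


def run_job(lines):
    """Hypothetical local driver standing in for the framework.

    A real cluster does the grouping ("shuffle") across machines;
    here it is just a dictionary in memory.
    """
    # Shuffle: group every mapped value under its key.
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)

    # Reduce: run the reducer once per key, in sorted key order.
    results = {}
    for key in sorted(groups):
        for out_key, out_value in reducer(key, groups[key]):
            results[out_key] = out_value
    return results


if __name__ == "__main__":
    sample = ["the cloud is just servers", "the servers are in a closet"]
    for word, count in run_job(sample).items():
        print(word, count)
```

The point of the split is that `mapper` only ever sees one line and `reducer` only ever sees one key, which is what lets a framework spread the work across many machines.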