Frank Kane's Taming Big Data with Apache Spark and Python
上QQ阅读APP看书,第一时间看更新

What Spark can do with key/value data?

As I mentioned earlier, one of the most useful things you can do with the key/value RDD is reduceByKey. What that does is to combine all the values that are found for the same key/value, using some function that you define. So, for example, if I want to add up all of the values for a given key, let's say all of the numbers of friends for a given age, something like this would do the job:

rdd.reduceByKey(lambda x, y: x + y) 

When you call a reduce function in Spark, it will get not just an x value but also an x and a y value. You can actually call those values whatever you want, but what you need to do is define a function that defines how to combine things together that are found for the same key. In this case, our function is just x plus y, so it says we're going to keep adding things together to combine things together for a given key/value. For example, if I was doing reduceByKey on a key/value RDD where the keys were ages and the values were a number of friends, I could get the total number of friends found for that age with something like this, because every value that's found for that key will be added together using that function x plus y:

rdd.reduceByKey(lambda x, y: x + y) 

There are other things you can do as well with key/value RDDs. You can call groupByKey, so if you don't actually want to combine them together quite yet, you can just get a list of all the values associated with each key using groupByKey. You can also use sortByKey, so if you want to sort your RDD by the key/values, sortByKey makes it easy to do. Third, you can split out the keys and the values into their own RDD using keys(), values(). You don't see that too often but it's good to know that it exists.

As I mentioned earlier, you've kind of created a NoSQL data store here. It's a giant key/value data store, so you can start to do SQLish sorts of things since we have keys and values in play here. We can use join , rightOuterJoin , leftOuterJoin , cogroup and subtractByKey, all ways to combine two key/value RDDs together to create some joined RDD as a result. Later on, when we look at making similar movie recommendations, we'll have an example of doing that by joining one key/value RDD with another to get every possible permutation of movies that were rated together.