RDD Commands (PySpark)

 

Creation

| Command | Description | Example |
|---|---|---|
| `sc.parallelize(data)` | Create an RDD from a Python list | `rdd = sc.parallelize([1, 2, 3])` |
| `sc.textFile(path)` | Read a text file (each line becomes one element) | `rdd = sc.textFile("data.txt")` |
| `sc.wholeTextFiles(path)` | Read files as (filename, content) pairs | `rdd = sc.wholeTextFiles("folder/")` |
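
A minimal sketch of the three creation routes, assuming a local PySpark installation; the paths "data.txt" and "folder/" are placeholders for your own data:

```python
# Minimal sketch: creating RDDs three ways. Assumes a local PySpark install;
# "data.txt" and "folder/" are placeholder paths.
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-cheatsheet")

nums  = sc.parallelize([1, 2, 3, 4, 5])   # RDD from a Python list
lines = sc.textFile("data.txt")           # one element per line of the file
files = sc.wholeTextFiles("folder/")      # (filename, content) pairs

print(nums.collect())                     # [1, 2, 3, 4, 5]
```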

Basic Transformations

| Command | Description | Example |
|---|---|---|
| `map(func)` | Apply a function to each element | `rdd.map(lambda x: x * 2)` |
| `flatMap(func)` | Apply a function and flatten the result | `rdd.flatMap(lambda x: x.split(" "))` |
| `filter(func)` | Keep elements matching a condition | `rdd.filter(lambda x: x > 10)` |
| `distinct()` | Remove duplicate elements | `rdd.distinct()` |
| `union(rdd2)` | Combine two RDDs | `rdd1.union(rdd2)` |
| `intersection(rdd2)` | Keep elements common to both RDDs | `rdd1.intersection(rdd2)` |
| `subtract(rdd2)` | Elements in the first RDD but not the second | `rdd1.subtract(rdd2)` |
| `cartesian(rdd2)` | All pairs between two RDDs | `rdd1.cartesian(rdd2)` |
| `sample(withReplacement, fraction)` | Take a random sample | `rdd.sample(False, 0.5)` |
| `sortBy(func, ascending)` | Sort elements by a computed key | `rdd.sortBy(lambda x: x, False)` |
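
Transformations are lazy and chain naturally. A short sketch (reusing the `sc` from the creation example above, which is an assumption):

```python
# Chain transformations; nothing executes until collect() is called.
lines = sc.parallelize(["spark makes rdds", "rdds are lazy", "spark is fast"])

result = (lines.flatMap(lambda l: l.split(" "))   # split each line into words
               .distinct()                        # drop duplicate words
               .filter(lambda w: len(w) > 4)      # keep words longer than 4 chars
               .map(lambda w: w.upper())          # transform each element
               .sortBy(lambda w: w))              # sort ascending

print(result.collect())                           # ['MAKES', 'SPARK']
```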

Pair RDD (Key, Value)

| Command | Description | Example |
|---|---|---|
| `mapValues(func)` | Apply a function to values only | `rdd.mapValues(lambda x: x + 1)` |
| `reduceByKey(func)` | Merge values that share a key | `rdd.reduceByKey(lambda a, b: a + b)` |
| `groupByKey()` | Group values by key | `rdd.groupByKey()` |
| `sortByKey()` | Sort key-value pairs by key | `rdd.sortByKey()` |
| `join(rdd2)` | Inner join by key | `rdd1.join(rdd2)` |
| `leftOuterJoin(rdd2)` | Keep all keys from the left RDD | `rdd1.leftOuterJoin(rdd2)` |
| `rightOuterJoin(rdd2)` | Keep all keys from the right RDD | `rdd1.rightOuterJoin(rdd2)` |
| `cogroup(rdd2)` | Group values from both RDDs by key | `rdd1.cogroup(rdd2)` |
| `combineByKey(...)` | Custom per-key aggregation (advanced) | `rdd.combineByKey(...)` |
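
The classic word count, plus an inner join, as a minimal sketch (again assuming `sc` exists):

```python
# Word count: map each word to (word, 1), then merge counts per key.
counts = (sc.parallelize(["a b a", "b c"])
            .flatMap(lambda l: l.split(" "))
            .map(lambda w: (w, 1))
            .reduceByKey(lambda a, b: a + b))
print(sorted(counts.collect()))           # [('a', 2), ('b', 2), ('c', 1)]

# Inner join by key: only keys present in both pair RDDs survive.
names    = sc.parallelize([(1, "alice"), (2, "bob")])
salaries = sc.parallelize([(1, 5000), (3, 7000)])
print(names.join(salaries).collect())     # [(1, ('alice', 5000))]
```

For aggregations, `reduceByKey` is usually preferred over `groupByKey` because it combines values on each partition before shuffling.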

Actions

| Command | Description | Example |
|---|---|---|
| `collect()` | Return all elements as a list (to the driver, so keep it small) | `rdd.collect()` |
| `count()` | Count the number of elements | `rdd.count()` |
| `first()` | Return the first element | `rdd.first()` |
| `take(n)` | Return the first n elements | `rdd.take(5)` |
| `top(n)` | Return the n largest elements, in descending order | `rdd.top(3)` |
| `reduce(func)` | Reduce elements using the given function | `rdd.reduce(lambda a, b: a + b)` |
| `sum()` | Sum of all elements | `rdd.sum()` |
| `max()` / `min()` | Maximum / minimum element | `rdd.max()` |
| `countByValue()` | Count the frequency of each value | `rdd.countByValue()` |
| `saveAsTextFile(path)` | Save the RDD to text files | `rdd.saveAsTextFile("output/")` |
| `foreach(func)` | Apply a function to each element (runs on the executors) | `rdd.foreach(print)` |
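
A sketch of the most common actions on a small RDD (assuming `sc`):

```python
rdd = sc.parallelize([3, 1, 4, 1, 5])

print(rdd.count())                      # 5
print(rdd.first())                      # 3
print(rdd.take(2))                      # [3, 1]
print(rdd.top(2))                       # [5, 4]  (largest first)
print(rdd.reduce(lambda a, b: a + b))   # 14
print(rdd.sum())                        # 14
print(dict(rdd.countByValue()))         # {3: 1, 1: 2, 4: 1, 5: 1}
```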

Persistence / Storage

| Command | Description | Example |
|---|---|---|
| `cache()` | Store the RDD in memory (MEMORY_ONLY) | `rdd.cache()` |
| `persist(level)` | Store with a chosen storage level | `rdd.persist(StorageLevel.MEMORY_AND_DISK)` |
| `unpersist()` | Remove the RDD from the cache | `rdd.unpersist()` |
| `is_cached` | Check whether the RDD is cached | `rdd.is_cached` |
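
Persistence is lazy: nothing is stored until the first action runs. A caching sketch (assuming `sc`; note that `StorageLevel` must be imported):

```python
from pyspark import StorageLevel

squares = sc.parallelize(range(1000)).map(lambda x: x * x)
squares.persist(StorageLevel.MEMORY_AND_DISK)   # pick a storage level explicitly

print(squares.sum())       # first action: computes and materializes the cache
print(squares.is_cached)   # True
squares.unpersist()        # release the cached blocks
```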

Partitioning / Inspection

| Command | Description | Example |
|---|---|---|
| `getNumPartitions()` | Get the number of partitions | `rdd.getNumPartitions()` |
| `repartition(n)` | Change the number of partitions (full shuffle) | `rdd.repartition(4)` |
| `coalesce(n)` | Decrease partitions, avoiding a full shuffle | `rdd.coalesce(2)` |
| `glom()` | Expose the data in each partition as a list | `rdd.glom().collect()` |
| `mapPartitions(func)` | Operate on each partition as an iterator | `rdd.mapPartitions(lambda it: [sum(it)])` |
| `mapPartitionsWithIndex(func)` | Like `mapPartitions`, but the function also receives the partition index | `rdd.mapPartitionsWithIndex(lambda i, it: [(i, list(it))]).collect()` |
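
A sketch for inspecting how elements are spread across partitions (assuming `sc`; the exact split shown relies on `parallelize` slicing a range evenly):

```python
rdd = sc.parallelize(range(8), 4)      # 8 elements across 4 partitions

print(rdd.getNumPartitions())          # 4
print(rdd.glom().collect())            # [[0, 1], [2, 3], [4, 5], [6, 7]]

# Tag each partition's contents with its partition index.
tagged = rdd.mapPartitionsWithIndex(lambda i, it: [(i, list(it))])
print(tagged.collect())                # [(0, [0, 1]), (1, [2, 3]), (2, [4, 5]), (3, [6, 7])]

print(rdd.coalesce(2).getNumPartitions())   # 2, without a full shuffle
```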

File Operations

| Command | Description | Example |
|---|---|---|
| `textFile(path)` | Load data from a text file | `sc.textFile("data.txt")` |
| `wholeTextFiles(path)` | Load (filename, content) pairs | `sc.wholeTextFiles("folder/")` |
| `saveAsTextFile(path)` | Save the RDD to text files | `rdd.saveAsTextFile("out/")` |
| `saveAsSequenceFile(path)` | Save a pair RDD as a Hadoop SequenceFile | `rdd.saveAsSequenceFile("seq_out")` |
| `saveAsPickleFile(path)` | Save elements as serialized (pickled) objects; PySpark's counterpart to Scala's `saveAsObjectFile` | `rdd.saveAsPickleFile("obj_out")` |
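
A save-and-reload round trip (a sketch that assumes the output directories do not already exist, since Spark refuses to overwrite an existing output path):

```python
rdd = sc.parallelize(["one", "two", "three"])

rdd.saveAsTextFile("out/")                   # one part-* file per partition
print(sc.textFile("out/").collect())         # read the text back

rdd.saveAsPickleFile("obj_out/")             # serialized (pickled) elements
print(sc.pickleFile("obj_out/").collect())   # pickle files are read back with pickleFile
```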
