pyspark.RDD

class pyspark.RDD(jrdd, ctx, jrdd_deserializer=AutoBatchedSerializer(PickleSerializer()))

A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable, partitioned collection of elements that can be operated on in parallel.

Methods

aggregate(zeroValue, seqOp, combOp)

Aggregate the elements of each partition, and then the results for all the partitions, using the given combine functions and a neutral “zero value.”
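The two-stage behavior of aggregate can be sketched in plain Python. This is a model of the semantics only, not PySpark code: partitions are represented as a list of lists, and the helper name aggregate_model and the partition layout are assumptions made for illustration.

```python
from functools import reduce

def aggregate_model(partitions, zero_value, seq_op, comb_op):
    """Hypothetical pure-Python model of aggregate over a list of partitions."""
    # Stage 1: fold each partition on its own, starting from the zero value.
    per_partition = [reduce(seq_op, part, zero_value) for part in partitions]
    # Stage 2: merge the per-partition results, again starting from the zero value.
    return reduce(comb_op, per_partition, zero_value)

# Compute (sum, count) in one pass; the two-partition layout is made up.
partitions = [[1, 2], [3, 4]]
seq_op = lambda acc, x: (acc[0] + x, acc[1] + 1)      # fold one element into an accumulator
comb_op = lambda a, b: (a[0] + b[0], a[1] + b[1])     # merge two accumulators
total, count = aggregate_model(partitions, (0, 0), seq_op, comb_op)
print(total, count)  # 10 4
```

Note that seqOp takes an accumulator and an element, while combOp takes two accumulators, so the accumulator type may differ from the element type.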

aggregateByKey(zeroValue, seqFunc, combFunc)

Aggregate the values of each key, using given combine functions and a neutral “zero value”.

barrier()

Marks the current stage as a barrier stage, where Spark must launch all tasks together.

cache()

Persist this RDD with the default storage level (MEMORY_ONLY).

cartesian(other)

Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of elements (a, b) where a is in self and b is in other.

checkpoint()

Mark this RDD for checkpointing.

coalesce(numPartitions[, shuffle])

Return a new RDD that is reduced into numPartitions partitions.

cogroup(other[, numPartitions])

For each key k in self or other, return a resulting RDD that contains a tuple with the list of values for that key in self as well as other.
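As a rough illustration of the result shape, the following pure-Python sketch models cogroup on two lists of key-value pairs. The helper name cogroup_model is hypothetical, and it returns plain lists sorted by key for readability, whereas a real RDD yields iterables with no guaranteed key order.

```python
def cogroup_model(left, right):
    """Hypothetical model: group the values of both inputs under each key."""
    keys = sorted({k for k, _ in left} | {k for k, _ in right})
    return [
        (k, ([v for lk, v in left if lk == k],     # values for k in the first input
             [w for rk, w in right if rk == k]))   # values for k in the second input
        for k in keys
    ]

x = [("a", 1), ("b", 4)]
y = [("a", 2)]
grouped = cogroup_model(x, y)
print(grouped)  # [('a', ([1], [2])), ('b', ([4], []))]
```

A key that appears on only one side still shows up in the result, paired with an empty group for the other side.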

collect()

Return a list that contains all of the elements in this RDD.

collectAsMap()

Return the key-value pairs in this RDD to the master as a dictionary.

collectWithJobGroup(groupId, description[, …])

When collect rdd, use this method to specify job group.

combineByKey(createCombiner, mergeValue, …)

Generic function to combine the elements for each key using a custom set of aggregation functions.

count()

Return the number of elements in this RDD.

countApprox(timeout[, confidence])

Approximate version of count() that returns a potentially incomplete result within a timeout, even if not all tasks have finished.

countApproxDistinct([relativeSD])

Return approximate number of distinct elements in the RDD.

countByKey()

Count the number of elements for each key, and return the result to the master as a dictionary.

countByValue()

Return the count of each unique value in this RDD as a dictionary of (value, count) pairs.

distinct([numPartitions])

Return a new RDD containing the distinct elements in this RDD.

filter(f)

Return a new RDD containing only the elements that satisfy a predicate.

first()

Return the first element in this RDD.

flatMap(f[, preservesPartitioning])

Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results.

flatMapValues(f)

Pass each value in the key-value pair RDD through a flatMap function without changing the keys; this also retains the original RDD’s partitioning.

fold(zeroValue, op)

Aggregate the elements of each partition, and then the results for all the partitions, using a given associative function and a neutral “zero value.”
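The reason the zero value has to be neutral can be seen in a small pure-Python model of fold. This is a sketch under assumptions, not PySpark code: fold_model is a hypothetical helper, and the partition layouts are made up. The zero value is applied once per partition and once more when the partition results are merged, so a non-neutral value makes the answer depend on the number of partitions.

```python
from functools import reduce
from operator import add

def fold_model(partitions, zero_value, op):
    """Hypothetical model of fold over a list of partitions."""
    # The zero value seeds every partition-level fold...
    per_partition = [reduce(op, part, zero_value) for part in partitions]
    # ...and seeds the final merge of the partition results as well.
    return reduce(op, per_partition, zero_value)

two_parts = [[1, 2], [3, 4]]
one_part = [[1, 2, 3, 4]]

neutral = fold_model(two_parts, 0, add)      # 10, independent of the layout
skewed_two = fold_model(two_parts, 1, add)   # 13: the 1 is added three times
skewed_one = fold_model(one_part, 1, add)    # 12: the 1 is added twice
print(neutral, skewed_two, skewed_one)
```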

foldByKey(zeroValue, func[, numPartitions, …])

Merge the values for each key using an associative function “func” and a neutral “zeroValue” which may be added to the result an arbitrary number of times, and must not change the result (e.g., 0 for addition, or 1 for multiplication).

foreach(f)

Applies a function to all elements of this RDD.

foreachPartition(f)

Applies a function to each partition of this RDD.

fullOuterJoin(other[, numPartitions])

Perform a full outer join of self and other.
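To make the difference from the one-sided outer joins concrete, here is a minimal pure-Python sketch of full outer join semantics on lists of key-value pairs. The helper name full_outer_join_model is hypothetical, and the output is sorted by key only for readability. Keys from either side are kept, with None standing in for the missing side.

```python
from collections import defaultdict

def full_outer_join_model(left, right):
    """Hypothetical model: keep every key from either input, padding with None."""
    lvals, rvals = defaultdict(list), defaultdict(list)
    for k, v in left:
        lvals[k].append(v)
    for k, w in right:
        rvals[k].append(w)
    out = []
    for k in sorted(set(lvals) | set(rvals)):
        # dict.get does not trigger the default factory, so a missing key gives None.
        for v in lvals.get(k) or [None]:
            for w in rvals.get(k) or [None]:
                out.append((k, (v, w)))
    return out

x = [("a", 1), ("b", 4)]
y = [("a", 2), ("c", 8)]
joined = full_outer_join_model(x, y)
print(joined)  # [('a', (1, 2)), ('b', (4, None)), ('c', (None, 8))]
```

A left outer join would drop the ("c", (None, 8)) row, and a right outer join would drop ("b", (4, None)).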

getCheckpointFile()

Gets the name of the file to which this RDD was checkpointed.

getNumPartitions()

Returns the number of partitions in this RDD.

getResourceProfile()

Get the pyspark.resource.ResourceProfile specified with this RDD or None if it wasn’t specified.

getStorageLevel()

Get the RDD’s current storage level.

glom()

Return an RDD created by coalescing all elements within each partition into a list.

groupBy(f[, numPartitions, partitionFunc])

Return an RDD of grouped items.

groupByKey([numPartitions, partitionFunc])

Group the values for each key in the RDD into a single sequence.

groupWith(other, *others)

Alias for cogroup but with support for multiple RDDs.

histogram(buckets)

Compute a histogram using the provided buckets.
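The sketch below models the bucket-counting step in plain Python. It assumes the buckets are given as a sorted list of boundaries, that every bucket is half-open on the right except the last, which also includes its upper bound, and that values outside the overall range are ignored. The helper name histogram_model is hypothetical and this is not the PySpark implementation.

```python
from bisect import bisect_right

def histogram_model(values, buckets):
    """Hypothetical model: count values per bucket for sorted boundaries."""
    counts = [0] * (len(buckets) - 1)
    for v in values:
        if v < buckets[0] or v > buckets[-1]:
            continue  # outside the overall range (assumed to be skipped)
        # Index of the bucket whose lower boundary is <= v; clamp so the
        # top boundary itself falls into the last bucket.
        i = min(bisect_right(buckets, v) - 1, len(counts) - 1)
        counts[i] += 1
    return buckets, counts

boundaries, counts = histogram_model(range(51), [0, 5, 25, 50])
print(counts)  # [5, 20, 26]
```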

id()

A unique ID for this RDD (within its SparkContext).

intersection(other)

Return the intersection of this RDD and another one.

isCheckpointed()

Return whether this RDD is checkpointed and materialized, either reliably or locally.

isEmpty()

Returns True if and only if the RDD contains no elements at all.

isLocallyCheckpointed()

Return whether this RDD is marked for local checkpointing.

join(other[, numPartitions])

Return an RDD containing all pairs of elements with matching keys in self and other.

keyBy(f)

Creates tuples of the elements in this RDD by applying f.

keys()

Return an RDD with the keys of each tuple.

leftOuterJoin(other[, numPartitions])

Perform a left outer join of self and other.

localCheckpoint()

Mark this RDD for local checkpointing using Spark’s existing caching layer.

lookup(key)

Return the list of values in the RDD for the given key.

map(f[, preservesPartitioning])

Return a new RDD by applying a function to each element of this RDD.

mapPartitions(f[, preservesPartitioning])

Return a new RDD by applying a function to each partition of this RDD.

mapPartitionsWithIndex(f[, …])

Return a new RDD by applying a function to each partition of this RDD, while tracking the index of the original partition.
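The function passed here receives the partition index and an iterator over that partition's elements, and returns an iterable. A pure-Python sketch of that contract follows; the helper map_partitions_with_index_model, the function tag_with_partition, and the partition layout are all assumptions for illustration.

```python
def map_partitions_with_index_model(partitions, f):
    """Hypothetical model: call f(index, iterator) once per partition."""
    return [list(f(i, iter(part))) for i, part in enumerate(partitions)]

def tag_with_partition(index, iterator):
    # Runs once per partition, so any setup here is shared by its elements.
    for x in iterator:
        yield (index, x)

partitions = [[1, 2], [3]]
tagged = map_partitions_with_index_model(partitions, tag_with_partition)
print(tagged)  # [[(0, 1), (0, 2)], [(1, 3)]]
```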

mapPartitionsWithSplit(f[, …])

Return a new RDD by applying a function to each partition of this RDD, while tracking the index of the original partition. Deprecated; use mapPartitionsWithIndex instead.

mapValues(f)

Pass each value in the key-value pair RDD through a map function without changing the keys; this also retains the original RDD’s partitioning.

max([key])

Find the maximum item in this RDD.

mean()

Compute the mean of this RDD’s elements.

meanApprox(timeout[, confidence])

Approximate operation to return the mean within a timeout or meet the confidence.

min([key])

Find the minimum item in this RDD.

name()

Return the name of this RDD.

partitionBy(numPartitions[, partitionFunc])

Return a copy of the RDD partitioned using the specified partitioner.

persist([storageLevel])

Set this RDD’s storage level to persist its values across operations after the first time it is computed.

pipe(command[, env, checkCode])

Return an RDD created by piping elements to a forked external process.

randomSplit(weights[, seed])

Randomly splits this RDD with the provided weights.

reduce(f)

Reduces the elements of this RDD using the specified commutative and associative binary operator.

reduceByKey(func[, numPartitions, partitionFunc])

Merge the values for each key using an associative and commutative reduce function.
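Because the reduce function is associative and commutative, values can be combined within each partition before the partial results are merged across partitions. The following pure-Python sketch models that two-step merge; reduce_by_key_model is a hypothetical helper and the partition layout is made up.

```python
from operator import add

def reduce_by_key_model(partitions, func):
    """Hypothetical model: combine per partition first, then merge the partials."""
    def merge_into(acc, pairs):
        for k, v in pairs:
            acc[k] = func(acc[k], v) if k in acc else v
        return acc

    merged = {}
    for part in partitions:
        local = merge_into({}, part)         # combine within the partition
        merge_into(merged, local.items())    # merge partial results across partitions
    return merged

partitions = [[("a", 1), ("b", 1)], [("a", 1)]]
counts = reduce_by_key_model(partitions, add)
print(sorted(counts.items()))  # [('a', 2), ('b', 1)]
```

With a function that is not associative and commutative, the result of this scheme would depend on how the data happens to be partitioned.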

reduceByKeyLocally(func)

Merge the values for each key using an associative and commutative reduce function, but return the results immediately to the master as a dictionary.

repartition(numPartitions)

Return a new RDD that has exactly numPartitions partitions.

repartitionAndSortWithinPartitions([…])

Repartition the RDD according to the given partitioner and, within each resulting partition, sort records by their keys.

rightOuterJoin(other[, numPartitions])

Perform a right outer join of self and other.

sample(withReplacement, fraction[, seed])

Return a sampled subset of this RDD.

sampleByKey(withReplacement, fractions[, seed])

Return a subset of this RDD sampled by key (via stratified sampling).

sampleStdev()

Compute the sample standard deviation of this RDD’s elements (which corrects for bias in estimating the standard deviation by dividing by N-1 instead of N).

sampleVariance()

Compute the sample variance of this RDD’s elements (which corrects for bias in estimating the variance by dividing by N-1 instead of N).
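The difference between the N and N-1 divisors can be checked with Python's standard statistics module, used here as a stand-in for the RDD methods. The mapping of pvariance to the population variant and variance to the sample variant is the point of the sketch; the data is made up.

```python
import statistics

data = [1, 2, 3]

population_variance = statistics.pvariance(data)  # divides by N
sample_variance = statistics.variance(data)       # divides by N - 1
sample_stdev = statistics.stdev(data)             # square root of the sample variance

print(population_variance, sample_variance, sample_stdev)
```

For this data the squared deviations sum to 2, so the population variance is 2/3 and the sample variance is 1.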

saveAsHadoopDataset(conf[, keyConverter, …])

Output a Python RDD of key-value pairs (of form RDD[(K, V)]) to any Hadoop file system, using the old Hadoop OutputFormat API (mapred package).

saveAsHadoopFile(path, outputFormatClass[, …])

Output a Python RDD of key-value pairs (of form RDD[(K, V)]) to any Hadoop file system, using the old Hadoop OutputFormat API (mapred package).

saveAsNewAPIHadoopDataset(conf[, …])

Output a Python RDD of key-value pairs (of form RDD[(K, V)]) to any Hadoop file system, using the new Hadoop OutputFormat API (mapreduce package).

saveAsNewAPIHadoopFile(path, outputFormatClass)

Output a Python RDD of key-value pairs (of form RDD[(K, V)]) to any Hadoop file system, using the new Hadoop OutputFormat API (mapreduce package).

saveAsPickleFile(path[, batchSize])

Save this RDD as a SequenceFile of serialized objects.

saveAsSequenceFile(path[, compressionCodecClass])

Output a Python RDD of key-value pairs (of form RDD[(K, V)]) to any Hadoop file system, using the “org.apache.hadoop.io.Writable” types that we convert from the RDD’s key and value types.

saveAsTextFile(path[, compressionCodecClass])

Save this RDD as a text file, using string representations of elements.

setName(name)

Assign a name to this RDD.

sortBy(keyfunc[, ascending, numPartitions])

Sorts this RDD by the given keyfunc.

sortByKey([ascending, numPartitions, keyfunc])

Sorts this RDD, which is assumed to consist of (key, value) pairs.

stats()

Return a StatCounter object that captures the mean, variance and count of the RDD’s elements in one operation.

stdev()

Compute the standard deviation of this RDD’s elements.

subtract(other[, numPartitions])

Return each value in self that is not contained in other.

subtractByKey(other[, numPartitions])

Return each (key, value) pair in self that has no pair with matching key in other.

sum()

Add up the elements in this RDD.

sumApprox(timeout[, confidence])

Approximate operation to return the sum within a timeout or meet the confidence.

take(num)

Take the first num elements of the RDD.

takeOrdered(num[, key])

Get the first num elements from an RDD, ordered in ascending order or as specified by the optional key function.
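The effect of the optional key function can be sketched with the standard heapq module on a local list. This models only the ordering semantics; take_ordered_model is a hypothetical helper and the data is made up.

```python
import heapq

def take_ordered_model(values, num, key=None):
    """Hypothetical model: the num smallest elements under the given key."""
    return heapq.nsmallest(num, values, key=key)

data = [10, 1, 2, 9, 3, 4, 5, 6, 7]

ascending = take_ordered_model(data, 6)
descending = take_ordered_model(data, 6, key=lambda x: -x)  # negate to reverse the order
print(ascending)   # [1, 2, 3, 4, 5, 6]
print(descending)  # [10, 9, 7, 6, 5, 4]
```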

takeSample(withReplacement, num[, seed])

Return a fixed-size sampled subset of this RDD.

toDebugString()

A description of this RDD and its recursive dependencies for debugging.

toLocalIterator([prefetchPartitions])

Return an iterator that contains all of the elements in this RDD.

top(num[, key])

Get the top num elements from an RDD.

treeAggregate(zeroValue, seqOp, combOp[, depth])

Aggregates the elements of this RDD in a multi-level tree pattern.

treeReduce(f[, depth])

Reduces the elements of this RDD in a multi-level tree pattern.

union(other)

Return the union of this RDD and another one.

unpersist([blocking])

Mark the RDD as non-persistent, and remove all blocks for it from memory and disk.

values()

Return an RDD with the values of each tuple.

variance()

Compute the variance of this RDD’s elements.

withResources(profile)

Specify a pyspark.resource.ResourceProfile to use when calculating this RDD.

zip(other)

Zips this RDD with another one, returning key-value pairs with the first element in each RDD, second element in each RDD, etc.

zipWithIndex()

Zips this RDD with its element indices.

zipWithUniqueId()

Zips this RDD with generated unique Long ids.
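The contrast between consecutive indices and unique ids can be sketched in plain Python. The model below assumes that items in the k-th of n partitions receive the ids k, n+k, 2n+k, and so on, which yields unique but non-consecutive ids without first counting the elements of each partition. The helper names and the three-partition layout are hypothetical.

```python
def zip_with_index_model(partitions):
    """Hypothetical model: consecutive indices, ordered by partition then position."""
    flat = [x for part in partitions for x in part]
    return [(x, i) for i, x in enumerate(flat)]

def zip_with_unique_id_model(partitions):
    """Hypothetical model: item i of partition k gets id i * n + k."""
    n = len(partitions)
    return [(x, i * n + k)
            for k, part in enumerate(partitions)
            for i, x in enumerate(part)]

partitions = [["a"], ["b", "c"], ["d", "e"]]  # assumed layout

indexed = zip_with_index_model(partitions)
unique = zip_with_unique_id_model(partitions)
print(indexed)  # [('a', 0), ('b', 1), ('c', 2), ('d', 3), ('e', 4)]
print(unique)   # [('a', 0), ('b', 1), ('c', 4), ('d', 2), ('e', 5)]
```

Consecutive indices need the size of every earlier partition, while the unique-id scheme depends only on the partition number and the position within the partition.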

Attributes

context

The SparkContext that this RDD was created on.