Window.rowsBetween(start, end) → WindowSpec

Creates a WindowSpec with the frame boundaries defined, from start (inclusive) to end (inclusive).
Both start and end are relative positions from the current row. For example, “0” means “current row”, while “-1” means the row before the current row, and “5” means the fifth row after the current row.
We recommend users use Window.unboundedPreceding, Window.unboundedFollowing, and Window.currentRow to specify special boundary values, rather than using integral values directly.
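For instance, here is a minimal sketch (assuming a local Spark context and a small single-partition DataFrame, set up in the same style as the Examples below) that grows the frame from the start of the partition to the current row to compute a running total:

>>> from pyspark import SparkContext
>>> from pyspark.sql import Window, SQLContext
>>> from pyspark.sql import functions as func
>>> sqlContext = SQLContext(SparkContext.getOrCreate())
>>> df = sqlContext.createDataFrame([(1, "a"), (2, "a"), (3, "a")], ["id", "category"])
>>> w = Window.partitionBy("category").orderBy("id").rowsBetween(
...     Window.unboundedPreceding, Window.currentRow)
>>> df.withColumn("total", func.sum("id").over(w)).orderBy("id").show()
+---+--------+-----+
| id|category|total|
+---+--------+-----+
|  1|       a|    1|
|  2|       a|    3|
|  3|       a|    6|
+---+--------+-----+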
A row-based boundary is based on the position of the row within the partition. An offset indicates the number of rows above or below the current row at which the frame for the current row starts or ends. For instance, given a row-based sliding frame with a lower bound offset of -1 and an upper bound offset of +2, the frame for the row with index 5 would range from index 4 to index 7.
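As an illustrative sketch of such a sliding frame (continuing the session above; collect_list is used here only to make each row's frame contents visible):

>>> w = Window.partitionBy("category").orderBy("id").rowsBetween(-1, 2)
>>> df.withColumn("frame", func.collect_list("id").over(w)).orderBy("id").show()
+---+--------+---------+
| id|category|    frame|
+---+--------+---------+
|  1|       a|[1, 2, 3]|
|  2|       a|[1, 2, 3]|
|  3|       a|   [2, 3]|
+---+--------+---------+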
New in version 2.1.0.
Parameters

start
    boundary start, inclusive. The frame is unbounded if this is Window.unboundedPreceding, or any value less than or equal to -9223372036854775808.
end
    boundary end, inclusive. The frame is unbounded if this is Window.unboundedFollowing, or any value greater than or equal to 9223372036854775807.
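For example (again a sketch, continuing the same session), passing both special values makes the frame span the entire partition, so every row sees the same aggregate:

>>> w = Window.partitionBy("category").rowsBetween(
...     Window.unboundedPreceding, Window.unboundedFollowing)
>>> df.withColumn("partition_sum", func.sum("id").over(w)).orderBy("id").show()
+---+--------+-------------+
| id|category|partition_sum|
+---+--------+-------------+
|  1|       a|            6|
|  2|       a|            6|
|  3|       a|            6|
+---+--------+-------------+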
Examples
>>> from pyspark import SparkContext
>>> from pyspark.sql import Window
>>> from pyspark.sql import functions as func
>>> from pyspark.sql import SQLContext
>>> sc = SparkContext.getOrCreate()
>>> sqlContext = SQLContext(sc)
>>> tup = [(1, "a"), (1, "a"), (2, "a"), (1, "b"), (2, "b"), (3, "b")]
>>> df = sqlContext.createDataFrame(tup, ["id", "category"])
>>> window = Window.partitionBy("category").orderBy("id").rowsBetween(Window.currentRow, 1)
>>> df.withColumn("sum", func.sum("id").over(window)).sort("id", "category", "sum").show()
+---+--------+---+
| id|category|sum|
+---+--------+---+
|  1|       a|  2|
|  1|       a|  3|
|  1|       b|  3|
|  2|       a|  2|
|  2|       b|  5|
|  3|       b|  3|
+---+--------+---+