python – Assigning row numbers to a PySpark DataFrame with monotonically_increasing_id()
I assign row numbers to a PySpark DataFrame using monotonically_increasing_id() with the following syntax:

from pyspark.sql.functions import monotonically_increasing_id

df1 = df1.withColumn("idx", monotonically_increasing_id())

Now df1 has 26,572,528 records, so I expected idx values ranging from 0 to 26,572,527.

But when I select max(idx), the value is unexpectedly large: 335,008,054,165.

What is going on with this function?
Is it reliable to use it when merging with other datasets that have a similar number of records?

I have around 300 DataFrames that I want to combine into a single DataFrame. One DataFrame contains the IDs, and the others contain different records corresponding to those rows.

Best answer
From the documentation:

A column that generates monotonically increasing 64-bit integers.

The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. The current implementation puts the partition ID in the upper 31 bits, and the record number within each partition in the lower 33 bits. The assumption is that the data frame has less than 1 billion partitions, and each partition has less than 8 billion records.

So it is not like an auto-increment ID in an RDB, and it is not reliable for merging.
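This also explains the very large max(idx) in the question: each value encodes a partition ID in the upper bits and a per-partition record number in the lower 33 bits. A minimal sketch decoding the value from the question, assuming the bit layout quoted above:

# Decode a monotonically_increasing_id value according to the documented
# bit layout (upper 31 bits: partition ID, lower 33 bits: record number)
max_idx = 335_008_054_165                        # the max(idx) observed in the question

partition_id = max_idx >> 33                     # -> 39
record_in_partition = max_idx & ((1 << 33) - 1)  # -> 605_077

print(partition_id, record_in_partition)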

If you need auto-increment behavior like in an RDB, and your data is sortable, you can use row_number:

df.createOrReplaceTempView('df')
spark.sql('select row_number() over (order by some_column) as num, * from df')
+---+-----------+
|num|some_column|
+---+-----------+
|  1|   ....... |
|  2|   ....... |
|  3| ..........|
+---+-----------+
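The same thing can be written with the DataFrame API instead of SQL; a sketch, where some_column is a placeholder for whatever column defines your ordering:

from pyspark.sql import Window
from pyspark.sql.functions import row_number

# row_number over a window ordered by the sort column gives consecutive
# numbers starting at 1, just like the SQL version above
w = Window.orderBy("some_column")
df_with_num = df.withColumn("num", row_number().over(w))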

If your data is not sortable, and you don't mind using RDDs to create the indexes and then falling back to a DataFrame, you can use rdd.zipWithIndex().

An example can be found here

In short:

# since you have a dataframe, use the rdd interface to create indexes with zipWithIndex()
df = df.rdd.zipWithIndex()
# return back to dataframe
df = df.toDF()

df.show()

# your data           | indexes
+---------------------+---+
|         _1          | _2| 
+---------------------+---+
|[data col1,data col2]|  0|
|[data col1,data col2]|  1|
|[data col1,data col2]|  2|
+---------------------+---+

After this, you will probably need to make a few more changes to get your DataFrame into the shape you need. Note: this is not a very performant solution.
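For example, if you want the original columns flat next to the index instead of nested under _1, one possible sketch (starting again from the original DataFrame, and assuming it has named columns) is:

# Unpack each Row before converting back, so the result keeps the original
# column names plus an "idx" column (sketch, not the answer's exact code)
indexed = (
    df.rdd
      .zipWithIndex()
      .map(lambda pair: (*pair[0], pair[1]))   # row fields + index
      .toDF(df.columns + ["idx"])
)
indexed.show()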

Hope this helps. Good luck!

Edit:
Come to think of it, you can combine monotonically_increasing_id with row_number:

# create a monotonically increasing id 
df = df.withColumn("idx", monotonically_increasing_id())

# then since the id is increasing but not consecutive, it means you can sort by it, so you can use the `row_number`
df.createOrReplaceTempView('df')
new_df = spark.sql('select row_number() over (order by idx) as num, * from df')

Not sure about the performance, though.
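The main cost is the global sort: a window ordered without partitionBy moves all rows into a single partition. For reference, a DataFrame-API sketch of the same combination that drops the helper column afterwards:

from pyspark.sql import Window
from pyspark.sql.functions import monotonically_increasing_id, row_number

# idx only serves as a sortable key; num is the consecutive row number
new_df = (
    df.withColumn("idx", monotonically_increasing_id())
      .withColumn("num", row_number().over(Window.orderBy("idx")))
      .drop("idx")
)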

