
Adding sequential IDs to a Spark Dataframe

 4 years ago
source link: https://www.tuicool.com/articles/qmuIVji

TL;DR

Adding sequential unique IDs to a Spark Dataframe is not very straightforward, especially given its distributed nature. You can do this using either zipWithIndex() or row_number() (depending on the amount and kind of your data), but in every case there is a catch regarding performance.

The idea behind this

[Figure: Typical usages for ids, besides the obvious: for identity purposes]

Coming from traditional relational databases, like MySQL, and non-distributed data frames, like Pandas, one may be used to working with ids (usually auto-incremented), for identification of course, but also for the ordering and constraints they make possible when used as a reference. For example, ordering your data by id (which is usually an indexed field) in descending order will give you the most recent rows first.
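As a small illustration of that habit, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for MySQL. The table name `events` and its rows are made up for the example; they do not come from the article.

```python
import sqlite3

# In-memory SQLite database, used here only as a stand-in for MySQL.
conn = sqlite3.connect(":memory:")

# Hypothetical table: the database hands out increasing ids on insert.
conn.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT)"
)
conn.executemany(
    "INSERT INTO events (name) VALUES (?)",
    [("first",), ("second",), ("third",)],
)

# Ordering by id in descending order returns the most recently inserted rows first.
latest_first = conn.execute(
    "SELECT id, name FROM events ORDER BY id DESC"
).fetchall()
print(latest_first)

conn.close()
```

On a single machine the database can simply keep one counter, which is exactly what becomes harder once the rows are spread over several machines.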

[Figure: A representation of a Spark Dataframe: what the user sees and what it is like physically]

Depending on our needs, we might find ourselves in a position where we would benefit from having (unique) auto-increment-like ids in a Spark dataframe. When the data is in one table or dataframe (on one machine), adding ids is pretty straightforward. What happens, though, when you have distributed data, split into partitions that might reside on different machines, as in Spark?

(More on partitions here.)
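To make the problem concrete, here is a minimal pure-Python sketch (no Spark involved) that models partitions as plain lists. The function name `add_sequential_ids` and the sample data are hypothetical. It shows one way to get consecutive ids across partitions: first count the rows in each partition, then give every partition a starting offset equal to the number of rows in all partitions before it. This is roughly the idea behind an approach like zipWithIndex(), and it hints at the cost: the data has to be visited once just to count it.

```python
from itertools import accumulate


def add_sequential_ids(partitions):
    """Assign consecutive ids (0, 1, 2, ...) to rows spread over several partitions."""
    # Pass 1: count the rows in each partition.
    counts = [len(part) for part in partitions]

    # The first id of a partition is the total row count of all earlier partitions.
    offsets = [0] + list(accumulate(counts))[:-1]

    # Pass 2: each partition numbers its own rows locally, shifted by its offset.
    return [
        [(offset + i, row) for i, row in enumerate(part)]
        for offset, part in zip(offsets, partitions)
    ]


# Hypothetical data: three partitions of different sizes.
partitions = [["a", "b", "c"], ["d"], ["e", "f"]]
with_ids = add_sequential_ids(partitions)
print(with_ids)
```

No partition can number its rows correctly without knowing how many rows come before it, which is why some coordination between partitions is unavoidable.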

Throughout this post, we will explore the obvious and not so obvious options, what they do, and the catch behind using them.

Notes

  • Please note that this article assumes that you have some working knowledge of Spark, and more specifically of PySpark. If not, here is a short intro to what it is, and I’ve put several helpful resources in the Useful links and notes section. I’ll be glad to answer any questions I can :).
  • Practicing Sketchnoting again: yes, there are terrible sketches throughout the article, trying to visually explain things as I understand them. I hope they are more helpful than they are confusing :).
