

Kotlin Dataframe 0.9.1 released!
source link: https://blog.jetbrains.com/kotlin/2023/01/kotlin-dataframe-0-9-1-released/

It’s time for another Kotlin Dataframe update to start off the new year.
There have been a lot of exciting changes since the last 0.8.0 preview release. So without any further ado, let’s jump right in!
TL;DR: OpenAPI type schema support, new JSON reading options, improved Apache Arrow integration, and a round of smaller fixes and features.
OpenAPI Type Schemas
JSON schema inference is great, but it’s not perfect. DataFrame has had the ability to generate data schemas based on given data for a while now, but this can lead to errors in types or nullability when the sample doesn’t correctly reflect how future data might look.
Today, more and more APIs offer OpenAPI (Swagger) specifications. Aside from API endpoints, these also hold Data Models (Schemas) with all the information about the types that can be returned from or supplied to the API. Obviously, we don't want to reinvent the wheel with our own schema inference when the API already provides a schema. Not only do we get the proper names of the types, but we also get enums, correct inheritance, and overall better type safety.
From DataFrame 0.9.1 onward, DataFrame supports the automatic generation of data schemas based on OpenAPI 3.0 type schemas.
To get started, simply import the OpenAPI specification file (.json or .yaml) as you would import any other data you want to generate data schemas for. An OpenAPI file can contain any number of type schemas; they will all be converted to data schemas.
We’ll use the pet store example from OpenAPI itself.
Your project does need an extra dependency for this to work:
implementation("org.jetbrains.kotlinx:dataframe-openapi:{VERSION}")
Importing data schemas can be done using a file annotation:
@file:ImportDataSchema(
    path = "https://petstore3.swagger.io/api/v3/openapi.json",
    name = "PetStore",
)

import org.jetbrains.kotlinx.dataframe.annotations.ImportDataSchema
Or using Gradle:
dataframes {
    schema {
        data = "https://petstore3.swagger.io/api/v3/openapi.json"
        name = "PetStore"
    }
}
And in Jupyter:
val PetStore = importDataSchema(
    "https://petstore3.swagger.io/api/v3/openapi.json"
)
After generating the data schemas, all type schemas from the OpenAPI spec file will have a corresponding data schema in Kotlin that’s ready to parse any JSON content adhering to it.
These will be grouped together under the name you provide, which in this case is PetStore. Since the pet store OpenAPI schema contains the type schemas Order, Customer, Pet, etc., you will have access to the data schemas PetStore.Order, PetStore.Customer, PetStore.Pet, etc., which you can use to read and parse JSON data. (Hint: You can explore this generated code in your IDE to see what it looks like.)
For example:
val df = PetStore.Pet.readJson(
    "https://petstore3.swagger.io/api/v3/pet/findByStatus?status=available"
)

val names: DataColumn<String> = df
    .filter { /* this: DataRow<Pet>, it: DataRow<Pet> */
        category.name == "Dogs" &&
            status == Status1.AVAILABLE
    }
    .name
If you’re interested in the specifics of how this is done, I’ll break down an example below. Otherwise, you can continue to the next section.
OpenAPI Deep Dive
We can compare and see how, for instance, Pet is converted from the OpenAPI spec to Kotlin DataSchema interfaces (examples have been cleaned up a bit). Pet, in the OpenAPI spec, is defined as:
"Pet": { "required": [ "name", "photoUrls" ], "type": "object", "properties": { "id": { "type": "integer", "format": "int64", "example": 10 }, "name": { "type": "string", "example": "doggie" }, "category": { "$ref": "#/components/schemas/Category" }, "photoUrls": { "type": "array", "items": { "type": "string" } }, "tags": { "type": "array", "items": { "$ref": "#/components/schemas/Tag" } }, "status": { "type": "string", "description": "pet status in the store", "enum": [ "available", "pending", "sold" ] } } }
As you can see, it's an object type with multiple properties. Some properties are required, like name and photoUrls. Others, like id and category, are not. No properties are nullable in this particular example, but since Kotlin has no concept of undefined properties, non-required properties will be treated as nullable too. There are primitive properties, such as id and name, but also references to other types, like Category and Tag. Let's see what DataFrame generates using this example:
enum class Status1(override val value: String) : DataSchemaEnum {
    AVAILABLE("available"),
    PENDING("pending"),
    SOLD("sold");
}

@DataSchema(isOpen = false)
interface Pet {
    val id: Long?
    val name: String
    val category: Category?
    val photoUrls: List<String>
    val tags: DataFrame<Tag?>
    val status: Status1?

    companion object {
        val keyValuePaths: List<JsonPath>
            get() = listOf()

        fun DataFrame<*>.convertToPet(convertTo: ConvertSchemaDsl<Pet>.() -> Unit = {}): DataFrame<Pet> =
            convertTo<Pet> {
                convertDataRowsWithOpenApi()
                convertTo()
            }

        fun readJson(url: java.net.URL): DataFrame<Pet> =
            DataFrame.readJson(url, typeClashTactic = ANY_COLUMNS, keyValuePaths = keyValuePaths)
                .convertToPet()

        fun readJson(path: String): DataFrame<Pet> = ...

        ...
    }
}
Let’s look at the generated interface Pet. All properties from the OpenAPI JSON appear to be there: id, name, and so on. Non-required or nullable properties are correctly marked with a ?. References to other types, like Category and Tag, work too and are present elsewhere in the generated file.
Interestingly, since tags is supposed to come in the form of an array of objects, it is represented as a List of DataRows, or more specifically, a data frame. Thus, when Pet is used as a DataFrame type, tags will become a FrameColumn.
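For instance (an illustrative line, not from the generated code, assuming df is the DataFrame<Pet> read in the earlier example):

// `tags` is a FrameColumn, so indexing a row yields a nested data frame of Tag rows.
val firstPetTags: DataFrame<Tag?> = df.tags[0]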
Finally, status was an enum that was defined inline in the OpenAPI JSON. We cannot define a type inline like that in Kotlin, so it's generated outside of Pet.
Since DataSchemaEnum is used here, this might also be a good opportunity to introduce it properly. Enums can implement this interface to control how their values are read from and written to data frames. This allows enums to be created with names that might be illegal in Kotlin (such as numbers or empty strings) but legal in other languages.
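As a minimal sketch (the enum name and values are made up for illustration; only the DataSchemaEnum interface itself comes from the library):

enum class Rating(override val value: String) : DataSchemaEnum {
    FIVE("5"), // "5" could never be a Kotlin enum entry name
    NONE("");  // neither could an empty string
}

When reading, cells are matched against value rather than the Kotlin entry name, and the same value is written back out.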
To be able to quickly read data as a certain type, the generated types have specific .readJson() methods. The example only shows the URL case in full, but the others are very similar. After calling one of them, the data frame is converted to the right type (in this case, using convertToPet(), which applies, corrects, and converts all the properties to the expected types). Those conversion functions can also be used to convert your own data frames to one of these generated types.
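For example (a sketch; someDf stands for any untyped data frame whose columns match the schema, and convertToPet() is the generated extension shown above, which may need importing from the generated companion object depending on the call site):

val pets: DataFrame<PetStore.Pet> = someDf.convertToPet()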
Adding support for OpenAPI type schemas was a difficult task. OpenAPI is very flexible in ways Kotlin and DataFrame cannot always follow. We're certain it will not work with 100% of the OpenAPI specifications out there, so if you notice some strange behavior with one of your APIs, please let us know on GitHub or Slack so we can improve the support. :-)
JSON Options
To make the OpenAPI integration work better, we made several changes to how JSON is read in DataFrame. While the default behavior is the same, we added some extra options that might be directly beneficial to you too!
Key/Value Paths
Have you ever encountered a JSON file that, when read into a data frame, resulted in hundreds of columns? This can happen if your JSON data contains an object with many properties (key/value pairs). Unlike a large list of data, a huge map like this is not so easily stored in a column-based fashion, making it easy for you to lose grip on your data. Plus, if you’re generating data schemas, the compiler will most likely run out of memory due to the sheer number of interfaces it needs to create.
It would make more sense to convert all these columns into just two columns: “key” and “value”. This is exactly what the new key/value paths achieve.
Let’s look at an example:
By calling the API from APIS.GURU (a website/API that holds a collection of OpenAPI APIs), we get a data frame of 2,366 columns, as shown here:
DataFrame.read("https://api.apis.guru/v2/list.json")
Inspecting the JSON as a data frame, we can find two places where conversion to keys/values might be useful: the root of the JSON and the versions property inside each website's object. Let's read it again, but now with these key/value paths. We can use the JsonPath class to help construct these paths (available in Gradle too, but not in KSP). Since we have a key/value object at the root, we'll need to unpack the result by taking the first row and first column:
DataFrame.readJson(
    path = "https://api.apis.guru/v2/list.json",
    keyValuePaths = listOf(
        JsonPath(), // generates '$'
        JsonPath() // generates '$[*]["versions"]'
            .appendWildcard()
            .append("versions"),
    ),
)[0][0] as AnyFrame

Way more manageable, right? To play around more with this example, check out the Jupyter notebook or Datalore. This notebook contains examples of key/value paths and examples of the new OpenAPI functionality.
Type Clash Tactics
A little-known feature of DataFrame is how type clashes are handled when creating data frames from JSON. Let’s look at an example:
Using the default type clash tactic, ARRAY_AND_VALUE_COLUMNS, JSON is read as follows:
[ { "a": "text" }, { "a": { "b": 2 } }, { "a": [6, 7, 8] } ]
⌌----------------------------------------------⌍ | | a:{b:Int?, value:String?, array:List<Int>}| |--|-------------------------------------------| | 0| { b:null, value:"text", array:[] }| | 1| { b:2, value:null, array:[] }| | 2| { b:null, value:null, array:[6, 7, 8] }| ⌎----------------------------------------------⌏
Clashes between array elements, value elements, and object elements are solved by creating a ColumnGroup in the data frame with the columns array (containing all arrays), value (containing all values), and a column for each property in all of the objects. For non-array elements, the array column will contain an empty list. For non-value elements, the value column will contain null. This also applies to elements that don't contain a property of one of the objects.
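For example, reading the three-element sample above with the default tactic (a sketch; readJsonStr reads JSON from a string):

import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.io.readJsonStr

val json = """[ { "a": "text" }, { "a": { "b": 2 } }, { "a": [6, 7, 8] } ]"""
val df = DataFrame.readJsonStr(json) // ARRAY_AND_VALUE_COLUMNS is the default
// `a` is now a ColumnGroup with `b`, `value`, and `array` child columns.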
If you're not very fond of this conversion and would rather have a more direct representation of the JSON data, you can use the type clash tactic ANY_COLUMNS. This tactic is also used by the OpenAPI integration to better represent the provided type schema. Using this tactic to read the same JSON sample as above results in the following data frame:
⌌-------------⌍
|  |     a:Any|
|--|----------|
| 0|    "text"|
| 1|   { b:2 }|
| 2| [6, 7, 8]|
⌎-------------⌏
We could consider more type clash tactics in the future. Let us know if you have any ideas!
How to use JSON Options
Both of these JSON options can be used when reading JSON with the DataFrame.readJson() functions and (for generating data schemas) with the Gradle and KSP plugins:
Functions:
DataFrame.readJson(
    path = "src/main/resources/someData.json",
    keyValuePaths = listOf(
        JsonPath()
            .appendArrayWithWildcard()
            .append("data"),
    ),
    typeClashTactic = JSON.TypeClashTactic.ARRAY_AND_VALUE_COLUMNS,
)
Gradle:
dataframes {
    schema {
        data = "src/main/resources/someData.json"
        name = "com.example.package.SomeData"
        jsonOptions {
            keyValuePaths = listOf(
                JsonPath()
                    .appendArrayWithWildcard()
                    .append("data"),
            )
            typeClashTactic = JSON.TypeClashTactic.ARRAY_AND_VALUE_COLUMNS
        }
    }
}
KSP:

@file:ImportDataSchema(
    path = "src/main/resources/someData.json",
    name = "SomeData",
    jsonOptions = JsonOptions(
        keyValuePaths = [
            """$[*]["data"]""",
        ],
        typeClashTactic = JSON.TypeClashTactic.ARRAY_AND_VALUE_COLUMNS,
    ),
)
Apache Arrow
Thanks to @Kopilov, our support for Apache Arrow files has further improved!
To use it, add the following dependency:
implementation("org.jetbrains.kotlinx:dataframe-arrow:{VERSION}")
On the reading side, this includes better reading of Date and Time types, UInts, and configurable nullability options. For more information, check out the docs.
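As a rough sketch of the configurable nullability (NullabilityOptions and this particular overload are my assumption of the dataframe-arrow API; double-check the exact names against the docs):

import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.NullabilityOptions
import org.jetbrains.kotlinx.dataframe.io.readArrowFeather
import java.io.File

// Widening: make a column's type nullable when the Arrow metadata
// and the actual data disagree, instead of failing.
val df = DataFrame.readArrowFeather(File("data.feather"), NullabilityOptions.Widening)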
On the writing side, well, this is completely new! DataFrame gained the ability to write to both the Arrow IPC Streaming format (.ipc) and the Arrow Random Access format (.feather). You can use both formats to save the data to a file, stream, byte channel, or byte array:
df.writeArrowIPC(file) // writes df to an .ipc file
df.writeArrowFeather(file) // writes df to a .feather file

val ipcByteArray: ByteArray = df.saveArrowIPCToByteArray()
val featherByteArray: ByteArray = df.saveArrowFeatherToByteArray()
If you need more configuration, you can use arrowWriter. For example:
// Get schema from anywhere you want. It can be deserialized from JSON, generated from another dataset
// (including the DataFrame.columns().toArrowSchema() method), created manually, and so on.
val schema = Schema.fromJSON(schemaJson)

df.arrowWriter(
    // Specify your schema
    targetSchema = schema,
    // Specify desired behavior mode
    mode = ArrowWriter.Mode(
        restrictWidening = true,
        restrictNarrowing = true,
        strictType = true,
        strictNullable = false,
    ),
    // Specify mismatch subscriber
    mismatchSubscriber = { message: ConvertingMismatch ->
        System.err.println(message)
    },
).use { writer: ArrowWriter ->
    // Save to any format and sink, like in the previous example
    writer.writeArrowFeather(file)
}
For more information, check out the docs.
Other New Stuff
Let’s finish this blog with a quick-fire round of some bug fixes and new features. Of course, there are far too many to mention, so we’ll stick to the ones that stand out:
- The skipRows parameter in the DataFrame.readExcel() function (thanks @Burtan for the idea).
- Locale fixes for parsing Double values.
- ISO_DATE_TIME support when parsing from String.
- Examples updated.
- Improved generated data frame accessors with regard to nullability.
- CSVFormat.withSkipHeaderRecord() now actually works in the writeCSV() function (thanks @Vhuc).
- Improved type recognition:
  - More consistent behavior. Nothing can now show up for empty lists, for instance.
- convertTo improvements:
  - Better exceptions: CellConversionException can now be thrown (thanks @Kopilov).
  - fill support for missing columns in the DSL (thanks @Nikitinas).
  - ColumnGroups and FrameColumns can now also be converted, not just ValueColumns.
  - New advanced convertIf function in the DSL for all the cases that convert<>.with {} cannot handle easily.
  - Empty data frame- or null-filled columns can now be generated if needed.
  - Empty rows/columns can be converted to anything.
  - See the docs for more information.
- Improved dataFrameConfig {} DSL in Jupyter.
- DataSchemaEnum: an interface enums can now implement to control how their values are stored in data frames.
- The unfold operation can now unwrap columns of any objects into ColumnGroups/FrameColumns; see the docs (thanks @Holgerbrandl for the idea) and the short sketch after this list.
- Many bugs and fixes – too many to name.
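As promised, a minimal sketch of unfold (the Person class, the column name, and the String-based overload are assumptions for illustration):

import org.jetbrains.kotlinx.dataframe.api.*

// Any plain object will do; Person is made up for this example.
data class Person(val name: String, val age: Int)

val people = dataFrameOf("person")(
    Person("Alice", 30),
    Person("Bob", 25),
)

// unfold unwraps the object column into a ColumnGroup
// with `name` and `age` child columns.
val unfolded = people.unfold("person")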
Have a nice Kotlin!