Swift String’s ABI and UTF-8

Michael_Ilseman (Michael Ilseman) 2018-11-05 23:17:09 UTC #1

String’s ABI and UTF-8

We just landed String’s proposed final ABI on master. This ABI includes some significant changes, the primary one being that native Swift strings are stored as UTF-8 where they were previously stored either as ASCII or UTF-16 depending on their contents. NSStrings are still lazily bridged in to String without copying.

This does not immediately surface in the API, but allows for some important performance wins and gives a more consistent basis for future APIs providing efficient and direct access to the underlying code units. UTF-8 is a variable-width Unicode encoding built from one-byte code units, and is the preferred encoding for interoperating with C, systems programming, server-side programming, scripting, client-side programming, and tools that process source code and textual formats.
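To make "code units" concrete, here is a small sketch using the existing `utf8` and `utf16` views. The helper name `codeUnitCounts` and the sample strings are illustrative assumptions, not part of the change described here.

```swift
// Hypothetical helper: counts the code units a string occupies in each encoding.
func codeUnitCounts(_ s: String) -> (utf8: Int, utf16: Int) {
    return (utf8: s.utf8.count, utf16: s.utf16.count)
}

// ASCII, Latin-1 supplement, CJK, and emoji samples, written as scalar escapes.
let samples = ["abc", "\u{E9}", "\u{4E2D}", "\u{1F436}"]
for s in samples {
    let counts = codeUnitCounts(s)
    print(s, "utf8:", counts.utf8, "utf16:", counts.utf16)
}
```

ASCII needs one byte per scalar, while the other samples need two to four bytes each in UTF-8.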

Performance

Unifying the storage representation for ASCII and Unicode-rich strings gives us a lot of performance wins. These wins are an effect of several compounding factors including a simpler model with less branching, on-creation encoding validation of native Strings (enabled by a faster validator), a unified implementation code path, a more efficient allocation and use of various bits in the struct, etc.

C Interoperability

By maintaining nul-termination in our storage, interoperability with C is basically free: we just use our pointer. This means that myString.withCString { ... } no longer needs to allocate, transcode, and later free its contents in order to supply the closure with a C-compatible string.

Quantifying this improvement as an n x faster ratio: it’s either millions of times faster or error: division by zero times faster, depending on how you measure.
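For reference, a minimal sketch of the call pattern in question. The helper name `cLength(of:)` is an assumption for illustration, and `strlen` comes from the C library via Foundation.

```swift
import Foundation

// Hypothetical helper: asks C's strlen for the byte length of a Swift string.
func cLength(of s: String) -> Int {
    return s.withCString { ptr in
        // `ptr` is a nul-terminated UnsafePointer<CChar>; it is only valid inside this closure.
        strlen(ptr)
    }
}

print(cLength(of: "hello"))
print(cLength(of: "h\u{E9}llo"))  // strlen counts UTF-8 bytes, not characters
```

The pointer must not escape the closure, regardless of whether an allocation happened behind the scenes.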

Decoding

Walking over and decoding the Unicode scalar values that comprise a string is much more efficient now.

Strings of Chinese characters are traditionally a worst-case scenario for UTF-8 decoding performance relative to UTF-16, as UTF-8 resorts to a multi-byte encoding sequence while UTF-16 just stores the scalar value directly as a code unit. This is even worse in reverse, because a continuation-byte in UTF-8 does not communicate the distance to the start of the scalar.

But, this isn’t really an issue: on modern CPUs this increase in encoding complexity is more than offset by the decrease in model complexity by having a unified storage representation.

Walking the Unicode scalar values forwards on Chinese text is now over 3x faster than before and walking in reverse (harder) is now over 2x faster. ASCII benefits even more, despite the old model having a dedicated storage representation and fast paths for ASCII-only strings.
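To illustrate why reverse walking is the harder direction, here is a rough sketch over a raw byte array. It assumes already-validated UTF-8, and the function names are mine rather than anything in the standard library.

```swift
// A UTF-8 continuation byte has the bit pattern 10xxxxxx.
func isContinuation(_ byte: UInt8) -> Bool {
    return (byte & 0b1100_0000) == 0b1000_0000
}

// Forward: the leading byte alone tells us how wide the scalar is.
// Assumes validated input, so anything past 0xEF is treated as a 4-byte lead.
func scalarWidth(leadByte: UInt8) -> Int {
    switch leadByte {
    case 0x00...0x7F: return 1
    case 0xC0...0xDF: return 2
    case 0xE0...0xEF: return 3
    default: return 4
    }
}

// Reverse: a continuation byte does not say how far away the lead byte is,
// so we have to step back one byte at a time until we leave the continuation run.
func scalarStart(before end: Int, in bytes: [UInt8]) -> Int {
    precondition(end > 0 && end <= bytes.count)
    var i = end - 1
    while i > 0 && isContinuation(bytes[i]) {
        i -= 1
    }
    return i
}

let bytes = Array("a\u{4E2D}".utf8)  // 0x61, then 0xE4 0xB8 0xAD for U+4E2D
print(scalarWidth(leadByte: bytes[1]))
print(scalarStart(before: bytes.count, in: bytes))
```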

Small UTF-8 Strings

Swift 4.2 introduced a small-string representation on 64-bit platforms for strings of up to 15 ASCII code units in length that stores the value directly in the String struct without requiring an allocation or memory management. With a unified code path that supports UTF-8, we’re able to enhance small strings to support up to 15 UTF-8 code units in length. This means your most important non-ASCII strings, such as "smol 🐶! 😍", can, in fact, be smol.

We also added small-string support on 32-bit platforms, where we pack in strings of up to 10 UTF-8 code units directly into the String struct.
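As a quick way to check a string against the 64-bit limit quoted above, something like the following works. Note that `fitsIn64BitSmallForm` is a hypothetical helper that only tests the length criterion; whether a given string is actually stored in the small form is an implementation detail with no public query.

```swift
// Hypothetical helper: checks only the 15-code-unit length limit quoted for 64-bit platforms.
func fitsIn64BitSmallForm(_ s: String) -> Bool {
    return s.utf8.count <= 15
}

// "smol", a dog face, "!", and a heart-eyes face, written with scalar escapes.
let smol = "smol \u{1F436}! \u{1F60D}"
print(smol.utf8.count, fitsIn64BitSmallForm(smol))

let tooLong = String(repeating: "a", count: 16)
print(tooLong.utf8.count, fitsIn64BitSmallForm(tooLong))
```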

Miscellaneous

Operations over the UTF-8 view are (obviously) dramatically faster on native Swift strings: ~10x depending on the nature of the operation.

Character-based String modifications, such as String.insert(_:Character), are around 5-10x faster.

Improved normality checking makes String hashing 2-4x faster when the contents are already in NFC (which is the case most of the time).

Creating a String from UTF-8 contents, à la String(decoding: codeUnits, as: UTF8.self), is around 5-6x faster.
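For readers unfamiliar with the operations mentioned in this section, a brief sketch of what they look like in use. The byte values and variable names are my own.

```swift
// "café" as UTF-8 bytes, with the é encoded as the two bytes 0xC3 0xA9.
let codeUnits: [UInt8] = [0x63, 0x61, 0x66, 0xC3, 0xA9]

// Creating a String from UTF-8 contents; this initializer validates the input.
var cafe = String(decoding: codeUnits, as: UTF8.self)

// A Character-based modification.
cafe.insert("!", at: cafe.endIndex)
print(cafe, cafe.utf8.count)

// Invalid input is repaired with U+FFFD rather than rejected.
let repaired = String(decoding: [0xFF] as [UInt8], as: UTF8.self)
print(repaired == "\u{FFFD}")
```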

Efficient Cocoa Interoperability

Efficient interoperability with Cocoa is a huge selling point for Swift, and strings are lazily bridged to Objective-C. String’s storage class is a subclass of NSString at runtime, and thus has to answer APIs assuming constant-time access to UTF-16 code units. We solved this with a breadcrumbing strategy: upon first request from one of these APIs on large strings, we perform a fast scan of the contents to check the UTF-16 length, leaving behind breadcrumbs at regular intervals. This allows us to provide amortized constant-time access to transcoded UTF-16 contents by scanning between breadcrumbs.

This is leveraged by String.UTF16View, so Swift code that imports Foundation and assumes constant-time access to the view also benefits.
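The general idea of breadcrumbing can be sketched roughly as follows. This is an illustration only, not the standard library's actual data structure: the type name `BreadcrumbIndex`, the stride, and the layout are all assumptions, and the input is assumed to be valid UTF-8.

```swift
// Width in bytes of a UTF-8 encoded scalar, read from its leading byte.
// Assumes the input has already been validated as UTF-8.
func utf8Width(leadByte: UInt8) -> Int {
    switch leadByte {
    case 0x00...0x7F: return 1
    case 0xC0...0xDF: return 2
    case 0xE0...0xEF: return 3
    default: return 4
    }
}

// Hypothetical type: maps UTF-16 offsets to UTF-8 offsets using sparse breadcrumbs.
struct BreadcrumbIndex {
    let utf8: [UInt8]
    let stride: Int
    // crumbs[k] is the first scalar boundary at or after UTF-16 offset k * stride.
    private(set) var crumbs: [(utf16: Int, utf8: Int)] = []
    private(set) var utf16Count = 0

    init(_ string: String, stride: Int = 4) {
        precondition(stride >= 2, "this sketch assumes at least 2 UTF-16 units per crumb")
        self.utf8 = Array(string.utf8)
        self.stride = stride
        var u8 = 0, u16 = 0
        // One linear scan: count UTF-16 units and drop a crumb at regular intervals.
        while u8 < utf8.count {
            if u16 >= crumbs.count * stride {
                crumbs.append((utf16: u16, utf8: u8))
            }
            let width = utf8Width(leadByte: utf8[u8])
            u8 += width
            u16 += (width == 4) ? 2 : 1  // 4-byte scalars become surrogate pairs
        }
        utf16Count = u16
    }

    // UTF-8 offset of the scalar that contains UTF-16 offset `i`.
    func utf8Offset(forUTF16 i: Int) -> Int {
        precondition(i >= 0 && i < utf16Count)
        var k = i / stride
        // If a surrogate pair straddles the interval boundary, use the previous crumb.
        if k >= crumbs.count || crumbs[k].utf16 > i { k -= 1 }
        var u16 = crumbs[k].utf16
        var u8 = crumbs[k].utf8
        // Scan forward from the crumb; this visits only a handful of scalars.
        while true {
            let width = utf8Width(leadByte: utf8[u8])
            let units = (width == 4) ? 2 : 1
            if u16 + units > i { return u8 }
            u16 += units
            u8 += width
        }
    }
}

let index = BreadcrumbIndex("a\u{1F436}b\u{4E2D}c\u{E9}d", stride: 2)
print(index.utf16Count, index.utf8Offset(forUTF16: 4))
```

The first query pays for the linear scan; later queries only scan between neighboring crumbs, which is where the amortized constant-time behavior comes from.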

We’ll be tweaking and tuning the granularity of these breadcrumbs and improving the scanning time, but this strategy has been proving sufficient for maintaining performance in realistic use cases.

For performance improvements in Cocoa interoperability, we’re working on some sweet bridging optimizations (simpler on a unified storage representation), but it’s too early to report back findings. We expect wins here to be far more important than a higher constant-factor on UTF-16 access.

Current Microbenchmark Issues

We landed with some known microbenchmark regressions that we knew we could fix with some elbow grease. We’re now applying elbow grease. Since this is such a substantial model change, it is far more important from a risk-management perspective to land this now to expose any unknown issues. Even so, net performance is substantially better.

We also have known gaps in our String benchmarking, which we will be closing, addressing any issues they expose.

Code Size

We haven’t started to tweak and tune code size, but this change already carries in some nice wins. A simpler model means less code and less reliance on heroic inlining for performance.

The stdlib binary is around 13% smaller with this change, which is a big win for Swift 5.0 applications that will back-deploy to pre-Swift-5 OSes. This also reduces memory usage and provides other system-wide benefits for post-Swift-5 OSes. The Foundation overlay is also around 5% smaller, as are others.

The source compatibility suite saw modest improvements, with an overall 2-3% shrinkage in total binary size. As I said, we haven’t started to tweak and tune, so this may improve more.

The Future of String Performance

Internal Improvements

We have many ideas for further performance enhancements to the internal implementation of String, such as:

Check for (or even guaranteeing) NFC-normalized contents upon creation, making canonical-equivalence comparison super fast
Cache more information on the storage class’s subsequent tail allocations, such as grapheme count and hash value
Perform fast single-scalar-grapheme checks and set relevant performance flags
Vectorize all the things, especially small strings
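As background for the first item, here is a small sketch of why normalization matters for comparison. String equality is based on canonical equivalence, so two strings can compare equal while their stored code units differ. The variable names are mine.

```swift
let precomposed = "caf\u{E9}"    // é as a single scalar (NFC form)
let decomposed = "cafe\u{301}"   // e followed by a combining acute accent (NFD form)

// Equal under canonical equivalence...
print(precomposed == decomposed)

// ...even though the scalars and the underlying bytes are different.
print(precomposed.unicodeScalars.count, decomposed.unicodeScalars.count)
print(Array(precomposed.utf8) == Array(decomposed.utf8))
```

If both sides were known to be in NFC, the comparison could reduce to comparing the bytes directly.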

Low Level APIs

The most exciting aspect of the future of String performance is exposing low-level performant APIs. The unified storage representation allows us to expose low-level APIs on String that directly access the underlying storage. Previously, we’d have to expose a pair of each, one for ASCII storage and one for UTF-16 storage, and hope the developer remembers to test both paths. Now, we can expose something akin to the following (details/spellings for demonstration purposes only):

myString.withCodeUnits { codeUnitBuffer in 
  // Access the contents as a contiguous buffer of `UInt8`
  // Awesome synergy with the character literals pitch
  ...
}

let str = String(withInitialCapacity: 42) { contentsPtr in 
  // Initialize the string directly
  ...

  // Return the actual size we wrote in UTF-8 code units
  return actualSize

  // (UTF-8 validation is performed by String after closure is finished)
}

Of course, we need to figure out a strategy for communicating whether some existing String is native or a lazily-bridged NSString that does not provide contiguous UTF-8 contents. There are approaches with various tradeoffs: do the eager bridge, make everything optional, throw, trap, etc. Figuring this out will be the most important part of designing these APIs.
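For comparison, the "make everything optional" and "do the eager bridge" approaches can be sketched with APIs that, as far as I know, shipped in later Swift releases (5.1 and up). This is not the design being proposed here, only an illustration of the two tradeoffs; the variable names are mine.

```swift
var greeting = "h\u{E9}llo"

// True when the contents are already contiguous UTF-8 in memory.
print(greeting.isContiguousUTF8)

// "Optional" flavor: yields nil when no contiguous UTF-8 buffer is available.
let byteSum = greeting.utf8.withContiguousStorageIfAvailable {
    (buffer: UnsafeBufferPointer<UInt8>) -> Int in
    buffer.reduce(0) { $0 + Int($1) }
}

// "Eager" flavor: a mutating call that may first copy the contents into
// contiguous UTF-8 storage, then always provides a buffer.
let byteCount = greeting.withUTF8 {
    (buffer: UnsafeBufferPointer<UInt8>) -> Int in
    buffer.count
}

print(byteSum as Any, byteCount)
```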

Shared Strings

The branch also introduces support in the ABI (but currently not exposed in any APIs) for shared strings, which provide contiguous UTF-8 code units through some externally-managed storage. These enable future APIs allowing developers to create a String with shared storage from a [UInt8], Data, ByteBuffer, or Substring without actually copying the contents. Reads would be slightly slower as they will require an extra level of pointer-indirection, but avoiding the copy could be a big win depending on the situation.

How You Can Help

While we are attacking our known-unknowns (regressions and gaps in the benchmark suite), we would really like to get early feedback on the new String ABI. If you encounter any issues or performance regressions, please let us know. I’ll update this thread when toolchains are available for download on swift.org.

Huge thanks to @lorentey, @lancep, @johannesweiss, @David_Smith, and @scanon for helping make this happen!

edit: Explicitly mentioned that NSStrings are still lazily-bridged in without copy.

tanner0101 (Tanner) 2018-11-05 23:50:30 UTC #2

Amazing work on this. I'm super excited to check it out as soon as the toolchains are ready.

Thanks @Michael_Ilseman et al!

ddunbar (Daniel Dunbar) 2018-11-06 00:22:40 UTC #3

Thanks @Michael_Ilseman for your amazing work to make this happen!!

taylorswift (^) 2018-11-06 00:37:04 UTC #4

Any movement on this? https://github.com/apple/swift-evolution/pull/939

Michael_Ilseman (Michael Ilseman) 2018-11-06 00:41:41 UTC #5

Off topic for this thread. Has it entered review? If you want it scheduled for review, ping a core team member.

Chris_Lattner3 (Chris Lattner) 2018-11-06 00:45:00 UTC #6

Great work @Michael_Ilseman, congratulations on landing this!

compnerd (Saleem Abdulrasool) 2018-11-06 02:37:50 UTC #7

Congratulations on getting this merged @Michael_Ilseman! It was quite an achievement.

anandabits (Matthew Johnson) 2018-11-06 02:45:45 UTC #8

Very impressive! It’s exciting to see this land. Thanks for your hard work @Michael_Ilseman!

duan (Daniel Duan) 2018-11-06 05:44:36 UTC #9

Frickin amazing. Congratulations Michael.

I look forward to diving into past projects where I used UTF16View for performance and fixing them :)

xaver.lohmueller (Xaverlohmueller) 2018-11-06 06:39:50 UTC #10

Really looking forward to this landing in Swift 5!

Will this also benefit regex performance? Swift has some room for improvement in that area: https://benchmarksgame-team.pages.debian.net/benchmarksgame/faster/swift.html (see regex-redux)

SDGGiesbrecht (Jeremy David Giesbrecht) 2018-11-06 09:00:09 UTC #11

Most of this sounds awesome. I do have one concern though:

What does “guaranteeing” mean? Are you thinking of force‐normalizing every String to NFC?!?

Please don’t. I depend on a wrapper structure that enforces NFD in order to do reliable and efficient searches for combining scalars.

import Foundation

let string = "café"
// Annotated as a scalar so it can be searched for in a UnicodeScalarView.
let acute: Unicode.Scalar = "\u{301}"

let nfd = string.decomposedStringWithCanonicalMapping.unicodeScalars
print(nfd.firstIndex(of: acute))
// Scalar at offset 4 (→ part of the cluster at offset 3)
// (✓: The expected answer.)

let nfc = string.precomposedStringWithCanonicalMapping.unicodeScalars
print(nfc.firstIndex(of: acute))
// nil
// (✗: Illogical side effect of legacy encoding designs.)

If “guarantee” means String becomes locked to NFC, then this sort of thing will become impossible, and I will have to write my own String type from scratch.

If “guarantee” means String still allows NFD, but is guaranteed to start in NFC when first initialized, then things will still work but it will be inefficient, because composition and decomposition will happen back and forth unnecessarily.

IanPartridge (Ian Partridge) 2018-11-06 10:51:29 UTC #12

Congratulations on landing this, @Michael_Ilseman - it's an amazing achievement. I am keen to bench various Kitura workloads against these changes. Do you have a Linux toolchain available or should I wait for one to appear on swift.org? Thanks.

dimpiax (Dmytro Pylypenko) 2018-11-06 12:32:05 UTC #13

Great improvement! Thanks for your work!

Michael_Ilseman (Michael Ilseman) 2018-11-06 16:40:07 UTC #14

No, this will not be a 100%-case guarantee. ABI stability (and lazily bridged NSString on Darwin platforms) dictates that String will always need to handle non-NFC-normalized contents. String has a performance flag that can be set when the contents are known to be NFC. If the bit is not set, the standard library has to treat the contents as non-NFC. ABI stability means that all future versions of the standard library have to handle strings with and without that bit set.

The performance goal is to find more ways to set that bit when we want performant comparison, but your example code won't break. There will always be ways of creating a String with a "leave my bits alone!" option. Whether that is the default, an explicit option, new or repurposed initializers, or it varies by use case, it's totally open to future change.
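The role of such a flag can be sketched like this. The type `FlaggedText` and the function `canonicallyEqual` are hypothetical stand-ins for illustration and do not reflect String's actual layout or comparison code.

```swift
// Hypothetical stand-in for a string whose storage carries a "known NFC" bit.
struct FlaggedText {
    let utf8: [UInt8]
    let isKnownNFC: Bool  // unset means "unknown", which must be treated as non-NFC
}

func canonicallyEqual(_ a: FlaggedText, _ b: FlaggedText) -> Bool {
    if a.isKnownNFC && b.isKnownNFC {
        // Fast path: two NFC strings are canonically equivalent exactly when their bytes match.
        return a.utf8 == b.utf8
    }
    // Slow path: fall back to String's == which compares by canonical equivalence.
    return String(decoding: a.utf8, as: UTF8.self) == String(decoding: b.utf8, as: UTF8.self)
}

let flagged = FlaggedText(utf8: Array("caf\u{E9}".utf8), isKnownNFC: true)
let unflagged = FlaggedText(utf8: Array("cafe\u{301}".utf8), isKnownNFC: false)
print(canonicallyEqual(flagged, unflagged))
```

The unflagged value keeps its decomposed bytes untouched; it just takes the slower comparison path.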

edit: grammar

SDGGiesbrecht (Jeremy David Giesbrecht) 2018-11-06 16:44:02 UTC #15

Thanks for answering. Sounds good.
