Swift字符串性能问题

Posted on

2020-11-29 Edited on 2020-11-30

用Swift做算法题时，经常遇到输入为String的情况，但是用Swift的String API遍历元素非常麻烦，每次都得Google一下。做这类题目时，我一般会直接把String转成Array，然后就能愉快地用数组下标访问元素了。

// 直接使用String遍历字符串
for i in 0..<s.count {
    let ch = s[s.index(s.startIndex, offsetBy: i)]
    // ...
}

// 转换为Array再遍历字符串
let array = Array(string)
for ch in array {
    // ...
}

但是转换为Array毕竟会产生额外的内存消耗，身为有追求的程序员，咱必须最求严格要求自己。于是有一天刷到有效的字母异位词这个题目时，我决定直接使用Swift的String API来写。

题目很简单，主要步骤就是遍历字符串，并存储在哈希表中。直接上代码（解法也很朴素😳）：

func isAnagram(_ s: String, _ t: String) -> Bool {
    guard s.count == t.count else {
        return false
    }
    var map = [Character: Int]()
    for i in 0..<s.count {
        let ch = s[s.index(s.startIndex, offsetBy: i)]
        if map[ch] != nil {
            map[ch] = map[ch]! + 1
        } else {
            map[ch] = 1
        }
    }
    for i in 0..<t.count {
        let ch = t[t.index(t.startIndex, offsetBy: i)]
        if map[ch] != nil {
            map[ch] = map[ch]! - 1
            if map[ch] == 0 {
                map[ch] = nil
            }
        } else {
            return false
        }
    }
    return map.count == 0
}

一提交，超时了。当输入为特别长（如50000）的字符串时，提示上述代码的运行时间超过限制。
我反复研究了自己写的代码，O(n)的时间复杂度，没啥毛病啊。怀着试一试的心态，我又提交了一次，还是超时。。

实在没办法了，我决定把String转成Array，然后直接基于Array来遍历。

func isAnagram(_ s: String, _ t: String) -> Bool {
    guard s.count == t.count else {
        return false
    }
    var map = [Character: Int]()
    let sArray = Array(s)
    let tArray = Array(t)
    for ch in sArray {
        if map[ch] != nil {
            map[ch] = map[ch]! + 1
        } else {
            map[ch] = 1
        }
    }
    for ch in tArray {
        if map[ch] != nil {
            map[ch] = map[ch]! - 1
            if map[ch] == 0 {
                map[ch] = nil
            }
        } else {
            return false
        }
    }
    return map.count == 0
}

再提交，竟然就通过了😂。

难道是LeetCode的问题？我决定用Swift Playground跑个性能测试看看结果。

class PerformanceTest: XCTestCase {
    let string = "..." // 长度为50000的字符串

func testPerformanceOfArray() {
        measure {
            let _ = solution.isAnagramUsingArrayApi(string, string)
        }
    }

func testPerformanceOfString() {
        measure {
            let _ = solution.isAnagramUsingStringApi(string, string)
        }
    }
}

PerformanceTest.defaultTestSuite.run()

使用Array的性能测试结果：

<unknown>:0: Test Case '-[__lldb_expr_20.PerformanceTest testPerformanceOfArray]' measured [Time, seconds] average: 1.709, relative standard deviation: 4.094%, values: [1.792613, 1.672665, 1.666029, 1.608078, 1.619144, 1.832425, 1.732586, 1.706398, 1.772683, 1.683238], performanceMetricID:com.apple.XCTPerformanceMetric_WallClockTime, baselineName: "", baselineAverage: , maxPercentRegression: 10.000%, maxPercentRelativeStandardDeviation: 10.000%, maxRegression: 0.100, maxStandardDeviation: 0.100

平均运行时间为 1.709 s。

使用String的性能测试结果：

<unknown>:0: Test Case '-[__lldb_expr_28.PerformanceTest testPerformanceOfString]' measured [Time, seconds] average: 33.674, relative standard deviation: 1.500%, values: [32.627662, 33.954215, 34.245659, 34.001036, 33.695257, 34.446044, 33.701735, 33.209314, 33.351506, 33.507718], performanceMetricID:com.apple.XCTPerformanceMetric_WallClockTime, baselineName: "", baselineAverage: , maxPercentRegression: 10.000%, maxPercentRelativeStandardDeviation: 10.000%, maxRegression: 0.100, maxStandardDeviation: 0.100

平均运行时间为 33.674 s。

眼前的结果不禁令人陷入沉思，Swift的String API不仅难用，性能也不行，到底是出了什么问题？

Swift的String变迁史

通过阅读Swift的文档，我们知道String其实是一个Collection类型（遵循Collection协议），也就是说，String是可以使用下标访问的。

A Unicode string value that is a collection of characters.
A string is a series of characters, such as “Swift”, that forms a collection. Strings in Swift are Unicode correct and locale insensitive, and are designed to be efficient. The String type bridges with the Objective-C class NSString and offers interoperability with C functions that works with strings.

在Swift语言的历史上，String有三次比较大的改动：

Swift1 版本，遵循Collection协议
Swift2-3 版本，不再遵循Collection协议
Swift4 版本，重新遵循Collection协议

可以看到，Swift Core Team对String该何如实现也是有过争议和动摇的，但是最终还是让遵循Collection协议了。
为什么最终会遵循Collection协议呢？我理解主要有以下两点：

更易于理解，字符串其实就是字符的集合
从设计上，String和Collection可以直接用一套API

关于最终的选型，Swift Core Team有更深入的思考，有兴趣的可以继续阅读https://github.com/apple/swift/blob/main/docs/StringManifesto.md。

遵循Collection协议本身对开发者应该是个好事，因为这样更容易理解，一套API就可以搞定所有常用的集合类型。但是为什么String的接口这么复杂，性能也不高呢？

字符串一直都很难

字符串一直是个老大难的问题，Unicode也一直是开发者的噩梦。

上文说过，Swift的String对外表现是字符的集合，其底层实现其实是UTF-8编码单位的集合，每个字符由1到4个UTF-8编码单位组成。也就是说，String的每一个元素的长度是不一定相等的，我们无法直接使用Int类型的数字下标去访问集合中的字符，这也是String与Array等其他集合类型的最大区别。

正是由于这个原因，Swift不得不放弃Int类型的下标，在内部通过自定义的Index类型记录对应元素的偏移量，并提供通过Index进行元素访问的接口。

/// A position of a character or code unit in a string.
@frozen public struct Index {
}

// 读取string的第2个字符
let string = "Swift"
let index = string.index(string.startIndex, offsetBy: 1)
string[index]

对性能的影响

无论是Int类型的下标还是Index类型的下标，作为集合类型，通过下标访问元素的时间复杂度都应该是 O(1)，那为什么Swift中方法String的元素会比访问Array的元素性能差这么多呢？

问题就出在Index的计算上。也就是

let index = string.index(string.startIndex, offsetBy: 1)

对于每个 index ，都需要根据当前 string 的实际字符情况计算真实的偏移量，不难看出，这一步的时间复杂度应该是 O(n)。
而对于数组来说，不需要计算这个 index ，自然性能会更高。

回到最开始的算法题，如下代码中，遍历获取字符串 s 中的字符时，每次都会重新计算 index，导致运行时间过长。

func isAnagram(_ s: String, _ t: String) -> Bool {
    // ...
    for i in 0..<s.count {
        let ch = s[s.index(s.startIndex, offsetBy: i)]
        if map[ch] != nil {
            map[ch] = map[ch]! + 1
        } else {
            map[ch] = 1
        }
    }
    // ...
}

那如果不要每次都重新计算 index ，而是保存当前的 index 结果，每次只需继续向后偏移 1 位，会不会提升性能呢？
我们将上面的代码稍作修改，重新跑一边性能测试。

func isAnagram(_ s: String, _ t: String) -> Bool {
    // ...
    var index = s.startIndex
    for _ in 0..<s.count {
        let ch = s[index]
        index = s.index(index, offsetBy: 1)
        if map[ch] != nil {
            map[ch] = map[ch]! + 1
        } else {
            map[ch] = 1 
        }
    }
    // ...
}

<unknown>:0: Test Case '-[__lldb_expr_38.PerformanceTest testPerformanceOfString]' measured [Time, seconds] average: 6.137, relative standard deviation: 6.898%, values: [5.295313, 6.041929, 7.026869, 6.305301, 6.466630, 5.881635, 5.932458, 5.979560, 6.214758, 6.226155], performanceMetricID:com.apple.XCTPerformanceMetric_WallClockTime, baselineName: "", baselineAverage: , maxPercentRegression: 10.000%, maxPercentRelativeStandardDeviation: 10.000%, maxRegression: 0.100, maxStandardDeviation: 0.100

果然，平均运行时间降低到了 6.137s，虽然还是比用Array慢，不过已经快了很多！验证结果符合预期✌️。

看来String内部并没有对index的计算结果做缓存，不过从语言设计上看，确实不该在内部做这个缓存。

Swift中的字符串历经多个版本的变迁，依然令人非常头疼。其之所以使用起来麻烦，是因为内部字符长度长度不固定，无法直接使用普通的数字下标进行索引，必须维护内部特殊的Index。而Index需要经过计算得出，也就带来了访问字符串元素时的性能问题。

但是无论如何，Swift一直在对String进行持续优化，包括提供更方便的接口（如Offset-Based Access）等。从Swift的设计理念上来说，正确性比易用性更重要。为了保证代码不出问题，稍微麻烦一点也是可以接受的。

String还有很多有意思的地方值得深入学习，接下来的文章继续探讨吧。

Swift字符串性能问题

Swift字符串性能问题

Swift的String变迁史

字符串一直都很难

对性能的影响

Recommend

Introducing Workspace ONE for Linux Endpoint Management

Second Life Maps | Gigglez City

GitHub - skydoves/android-developer-roadmap: 🗺 Android Developer Roadmap 2022 su...

What's New In DevTools (Chrome 97)

VMware Horizon and Google Chromebooks Are Enabling a Hybrid Workforce

Second Life Maps | Fireball

GitHub - ja-netfilter/ja-netfilter: A javaagent lib for network filter

A Guide to Load Testing Node.js APIs with Artillery

Grouping Related Objective-C Constants

Second Life Maps | IVES

About Joyk