7

Data Structures & Algorithms I Actually Used Working at Tech Companies

 3 years ago
source link: https://blog.pragmaticengineer.com/data-structures-and-algorithms-i-actually-used-day-to-day/
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

Do you actually use algorithms and data structures on your day to day job? I've noticed a growing trend of people assuming algorithms are pointless questions that are asked by tech companies purely as an arbitrary measure. I hear more people complain about how all of this pointless, and a purely academic exercise. This notion was definitely popularized after Max Howell, the author of Homebrew, posted his Google interview experience:

Google: 90% of our engineers use the software you wrote (Homebrew), but you can’t invert a binary tree on a whiteboard so fuck off.

— Max Howell (@mxcl) June 10, 2015

While I've also never needed to use binary tree inversion, but I have come across everyday use cases of data structures and algorithms when working at Skype/Microsoft, Skyscanner and Uber . This included writing code and making decisions based on these concepts. Even more, I used this knowledge to understand how and why some things were built and how I can use or modify them.

This article is a set of real-world examples where data structures like trees, graphs, and various algorithms were used in production. Almost all of these are first-hand experiences. I hope to illustrate that a generic data structures and algorithms knowledge is not "just for the interview" - but something that you'll find yourself reaching to both when working at, or when building up a fast-growing, innovative tech company.

I've used a very small subset of algorithms, but almost all data structures. It should be of no surprise that I am no fan of algorithm-heavy and non-practical interview questions with exotic data types like Red-Black trees or AVL trees. Never asked these, and never will. You can read about what I think about these interviews at theend of this article. Still, I find lots of value in being aware of what options for basic data types they can choose to tackle certain problems. With this, let's jump into examples.

Graphs and graph traversing: Skype and Uber

When we built Skype of Xbox One, we worked on a barebones Xbox OS, that was missing key libraries. We were building one of the first full-fledged applications on the platform. We needed a navigation solution that we could hook up both to touch gestures and to voice commands.

We built a generic navigation framework on top of WinJS. To do so, we needed to maintain a DOM-like graph to keep track of the actionable elements. To find these elements, we did DOM traversal - basically, a B-tree traversal - across the existing DOM. This is a classic case of BFS or DFS (breadth-first search or depth-first search).

At Uber, the team built many tools to visualize nodes, dependencies, and their connections. One example was a visualization tool for RIB nodes. The approach was the same in this case. The tool needed to maintain a tree , visualize this into an SVG, then update the tree, as the RIB tree on the mobile device changed. Also, RIBs themselves maintain a logical tree structure for state management that is different from the rendered objects: this is one of the key ideas behind their design.

Screenshot-2020-07-08-at-14.46.32.png RIBs tree visualization: a classic example of using trees to both represent data, and to visualize it. See the whole presentation here .

Weighed graphs and shortest paths: Skyscanner

Skyscanner finds the best deals on airline tickets. It does this by scanning all routes worldwide, then putting them together. While the nature of the problem is more on crawling, and less on caching - as airlines calculate the layover options - the multi-city planning option becomes the shortest path problem .

Multi-city was one of the features that took Skyscanner quite a bit of time to build - in all fairness, the difficulty was more on the product side, than anything. The best multi-city deals are calculated by using shortest path algorithms like Dijkstra or A*. Flight routes are represented as a directed graph, with each edge having a weight of the cost of the ticket. Calculating the cheapest price option between two cities was done via an implementation of a modified A* search algorithm per route. If you're interested in flights and shortest paths, the article on implementing the shortest flight search path using BFS by Sachin Malhotra is a good read.

With Skyscanner, the actual algorithm was far less important, though. Caching, crawling, and handling the varying website load were much more difficult things to crack. Still, a variation of the shortest paths problem comes up with many several travel companies that optimize for price based on combinations. Unsurprisingly, this topic was also a source of hallway discussions here.

Sorting: Skype (kind of)

Sorting is an algorithm family I rarely had an excuse to implement or needed to use in-depth. It's interesting to understand the different types of ways to sort, from bubble sort, insertion sort, merge sort, selection sort and - the most complex one - quicksort. Still, I found that there is rarely a reason I had to implement any of this, especially as I never had to write sort functions as part of a library.

At Skype, I got to exercise a bit on this knowledge, though. One of the other engineers decided to implement an insertion sort for listing contacts. In 2013, when Skype connected to the network, contacts would arrive in bursts, and it would take some time for all the contacts to arrive. So this engineer thought it's more performant to build the contact list organized by name, using insertion sort.

We had a back-and-forth on this, over why not just use the default sort algorithm. In the end, it was more work to properly test the implementation, and to benchmark it. I personally didn't see much point in doing so: but we were in the stage of the project that we had the time.

There are definitely some real-world use cases where efficient sorting matters, and having control over what type of sorting you use, based on the data, can make a difference. Insertion sort can be useful when streaming realtime data in large chunks and building realtime visualization for these data sources. Merge sort can work well with divide-and-conquer approaches if it comes to large amounts of data stored on different nodes. I've not worked with these, so I'll still mark sorting that I've had little use for, beyond the appreciation of the different approaches.

Hashtables and hashing: everywhere

The most frequent data structure I've used regularly was hashtables and the hashing function . It's such a handy tool from counting, to detecting duplications, to caching, all the way to distributed systems use cases like sharding . After arrays, it's easily the most common data structure I've used on countless occasions. Almost all languages come with this data structure, and it's simple to implement if you'd need it.

Stacks and queues: every now and then

The stack data structure will be very familiar to anyone who has debugged a language that has a stack trace. As a data structure, I've had a few problems to use it for, but debugging and performance profiling makes me intricately familiar with it.

I rarely chose queues as data structures for my code, but I came across it plenty of times in codebases, code popping, or pushing. For a build bottleneck detector tool that analyzed and profiled builds, I read code that was cleverly optimized with priority queues, using the Python heap queue algorithm .

Crypto algorithms: Uber

Uber adopted several languages and technologies early on, and cryptography was not implemented to the stability that we needed. A good example was iOS and Android. User-entered sensitive information coming from the clients needs to be encrypted before sending through the network, only to be decrypted on a specific service.

There were cases where we had to build our own encryption / decryption implementations, formally verifying and auditing them, in the absence of the framework supporting it, or audited libraries being available.

Building crypto is always a lot of fun. This is more of an implementation challenge: you don't come up with a new algorithm - with crypto, this can be one of the worst ideas when you're in engineering. Instead, you take an existing, well-documented standard that fits the bill, then code it, verify it, audit it, and audit it again. In this case, this was implementing the AES standard . It's a fun intellectual challenge. We've built some mobile and web crypto implementations, me learning in-depth details of the Advanced Encryption Standard ( AEP ), Hashed Message Authentication Codes ( HMAC ), or the RSA public-key encryption.

As part of auditing the solution, investigating attack vectors like message tampering or the impact of denial-of service. Verifying that a series of encryption steps are provably secure was another interesting thing to understand. As was getting to the bottom of how, between encrypt-and-MAC, MAC-then-encrypt, and encrypt-then-MAC, only one of them are provably secure , but that doesn't mean the others are not secure.

Probability theory and speculation: SubmitQueue at Uber

When I joined Uber in 2016, the biggest pain point on mobile was just how long the build took, and how much longer it took to merge your changes. A full build would run about 40 minutes, from build and all tests - unit, integration, E2E. We had about 300 developers pushing to production, and all changes had to go through the build before landed. Master needed to always be green.

While builds were parallelized, 60% of the time, after the build passed, the merge would fail. Someone else would change a similar code path, and their change would already be in the repo. So it took, on average, 2-3 hours to merge your change in, with multiple retries.

The Developer Experience team sitting next to me wanted to solve this problem, and they did it from two angles. First, they sped up the build and tests to take no more than 30 minutes. Second, their goal was that everyone's change should take 30 minutes, 95% of the time. But what about merge conflicts? Let's predict them, and queue builds accordingly. And this is what they did, by creating conflict and speculation graphs. There are a lot more details in this whitepaper , but the result was a much faster queue - called SubmitQueue - that optimized build time, and made the life of now 400 mobile engineers far more pleasant. With some algorithms behind the scenes.

Screenshot-2020-07-08-at-18.11.18.pngIncreasing the throughput of builds using SubmitQueue, and probabilistic speculation with a conflict analyzer. The result: faster builds, less merge conflicts.

Hexagonal Grids, Hierarchical Indexes

This last project is one I was not involved in, but one I've observed and have played around with the tools that have been built on top of it. Here, I learned about a new data structure: hexagonal grids with hierarchical indexes.

One of the most difficult and interesting problems to solve at Uber is how to optimize the pricing of trips, and the dispatching of partners. Prices can be dynamic, and drivers are constantly on the move. H3 is a grid system engineers at Uber built to both visualize and analyze data across cities, at an increasingly granular level. The data - and visualization - structure for this is a hexagonal grid with hierarchical indexes.

Screenshot-2020-07-08-at-17.24.25.png

The data structure has specific indexing, traversal, hierarchical grid, region, and unidirectional edge functions, detailed in the API reference . For a more detailed deep-dive, see the article on the H3 library , the source code , or the presentation on how and why this tool was built.

What I really liked about this experience is how I learned how creating your own specialized data structures can make sense in a niche area. There aren't many use cases where hexagonal grids with hierarchical indexes would make sense beyond combining mapping with various data levels within each cell. Still, if you're familiar with some data structures, understanding this new data structure is much easier - as it would be to design yet another data structure for a specialized need.

Interviews and algorithms and data structures

Those were the highlights of the actual data structures and algorithms I've used professionally between multiple companies and many years. So let's go back to the original tweet that complained about asking things like inverting a binary tree on a whiteboard. I'm on Matt's side on this one.

Knowing how popular algorithms or exotic data structures work are not something you should need to know to work at a tech company.You should know what an algorithm is, and should be able to come up with simple ones on your own, like a greedy one. You should also know about basic data structures that are pretty common, like hashtables, queues, or stacks. But specific algorithms like Dijkstra or A* are not ones you'd need to memorize: you'll have a reference for this, just like I had when implementing crypto. Same with exotic data structures like Red-Black trees or AVL trees. I've never had to use these data structures, and even if I did, I would look them up again. I have never asked questions that needed this kind of knowledge to solve them, and never will.

I am all for asking practical coding exercises, where there are many good solutions, from brute force or greedy approaches to potentially more sophisticated ones. For example, asking to implement a justify text function like this question is a fair one: it's something I did when I was building a simple text renderer on Windows Phone. You can solve this problem just by using an array and a few if/else statements, without any fancy data structures.

The reality is that many teams and companies are going overboard with algorithmic challenges.I can see the appeal of algorithmic questions: they give you signal in 45 minutes or less, and questions can be easily swapped around; thus, there is little harm if the question leaks. They are also easy to scale when recruiting, as you can have a question pool of 100+ questions, and any interviewer can evaluate any of them. Especially in Silicon Valley, it is more and more common to hear questions geared for dynamic programming or exotic data structures. These questions will help hire strong engineers - but also result in turning away people who would have excelled at a job that doesn't need advanced algorithms knowledge.

To anyone reading anyone whose company has a bar to hire people who know some of the advanced algorithms by heart: think again if this is what you need. I've hired fantastic teams at Skyscanner London and Uber Amsterdam without any tricky algorithm questions, covering no more than data structures and problem solving. You shouldn't need to know algorithms by heart. What you do need is awareness of the most common data structures and the ability to come up with simple algorithms to solve the problem at hand, as a toolset.

Data structures and algorithms are a toolset

If you work in a fast-moving, innovative tech company, you will almost certainly come across all sorts of data structures and different algorithm implementations in the codebase. When you build new and innovative solutions, you will often find yourself reaching to the right data structure. This is when you'll want to be aware of the options to choose from and their tradeoffs.

Data structures and algorithms are a tool that you should use with confidence when building software. Know these tools, and you'll be familiar with navigating codebases that use them. You'll also be far more confident in how to implement solutions to hard problems. You'll know the theoretical limits, the optimizations you can make, and you'll come up with solutions that are as good as they get - all tradeoffs considered.

To get started, I recommend the following resources:

  • Read up on the hashtable, linked list, tree, graphs, heap, queue and stack data structures. Play around with how to can use them in your language. For an overview, GeeksforGeeks has a good overview . For coding practice, I'd recommend the HackerRank Data Structures collection .
  • Grokking Algorithms Aditya Bhargava is hands down the best guide on algorithms from beginners to experienced engineers. A very approachable, and visual guide, covering all that most people - including myself - need to know on this topic. I am convinced that you don't need to know more about algorithms than this book covers.
  • The Algorithm Design  Manual and Algorithms: Fourth Edition are both books I picked up back in the day to refresh some of the stuff I studied at university. I gave up midway, and found them pretty dry, and not applicable to the work I've done.

About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK