
The Day My Script Killed 10,000 Phones in South America

source link: https://new.pythonforengineers.com/blog/the-day-i/


Shantnu Tiwari

Oct 18, 2021 • 7 min read

This post is a horror story about what happens when your code/test scripts go wrong. It's also a horror story on how not to test your code.

Most testing advice goes for the low-hanging fruit:

Kid, you should write unit tests.
Sure, grandpa

We won't be doing that. Instead, I want to show you why test scripts need as much care and planning as "proper" code.

The Night of Scripting Horror

Like all horror stories, mine started at 17:15 on a Friday evening, just as I was about to log off.

I got a message from a fellow employee:

Remember that script we wrote yesterday? We've locked thousands of phones in South America, and people are complaining. We might be fired.

I read that with increasing horror. This was the height of the covid lockdown, I was a contractor with no job security and jobs were few. Being fired would have been terrible.

How the hell did our script lock thousands of phones?

Tales of a K-Pop Phone company

I was working for a security company in the test automation group; the main product was an app, sold to mobile operators, that would lock the phone if it was stolen, or if the customer stopped paying. The app was built as part of the Android OS, so you couldn't uninstall it. It would lock the low-level features that allowed you to make calls, use Wifi, or even post pictures on Instagram/Facebook (the horror!) until you paid up.

All good. The company also made the front end for the operators to check which phones to lock. Our own internal tool worked well (I would like to say it was all due to me, which would be an arrogant thing to say, but also 100% true, so I will say it: It was all or mostly due to me).

But one mobile phone manufacturer (that I will call the K-Pop Phone Manufacturing Company), based in Korea (yes, that one), was too good to use our software and wanted us to use their crappy web-based tool. Which was fine, I was getting paid, so I would test what they gave me.

So where did it go wrong?

Like any perfect disaster, this one had many factors contributing to it.

The slow trainwreck

Our company had been bought by an investment firm, and they wanted their pound of flesh. All project deadlines were moved up, and at one point we were testing 3 products in parallel, all with different requirements.

Which meant we were only testing the happy path, assuming everything went okay.

One of the final tests we had to do was: confirm that when a mobile phone operator uploaded a CSV file with multiple phones, they were all locked.

Easy peasy. I wrote a Python script that would generate a few random phone numbers, log in to the web portal and lock the phones, then log in to a different portal and check the results. This way we could test tens of thousands of combinations with one script that only took a few hours to run.

The final script actually generated hundreds of thousands of random numbers, because we had a few different use cases/scenarios.

You can already see where this is going.

Because of time pressures, there was no time (or political will) to check the script was well written. As soon as I banged it out, it was live. And I mean literally, 10 seconds after I pressed save, the script was running on live production servers.

I could have tested it better, but that would have meant working late into the night. No thanks. I had already worked a few late nights/weekends and I was done.

I ran the script; it was fine. The managers confirmed it did what we expected it to do. Everyone was happy. We could release the product next Monday, no one would be working the weekend.

And then I got the email: We had locked thousands of phones in South America, places like Peru and Chile.

AAAGGGHHHH!

Remember I told you the phones were "randomly" generated? The Python script would randomly generate an 11-digit "phone number".

And of course, some of those phones were actual numbers!

And due to the way I wrote the script, and some weird IMEI hack (an IMEI is a unique number that identifies every phone), all those phones were in South America.
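One way to avoid this class of accident is to never invent numbers at all, and instead draw test targets from a pool of devices the test team actually owns. The following is only a minimal sketch of that idea, not the script from this story: `pick_test_numbers` and `OWNED_TEST_NUMBERS` are made-up names, and the numbers in the pool are placeholders.

```python
import random


def pick_test_numbers(owned_pool, count, seed=0):
    """Pick `count` targets from a pool of devices the test team owns.

    Nothing is generated: every returned value comes from `owned_pool`.
    A fixed seed makes the selection reproducible between runs.
    """
    pool = sorted(set(owned_pool))  # de-duplicate and fix the order
    if count > len(pool):
        raise ValueError(
            f"asked for {count} numbers, but only {len(pool)} test devices exist"
        )
    return random.Random(seed).sample(pool, count)


# Hypothetical inventory of test handsets (placeholder values).
OWNED_TEST_NUMBERS = ["00000000001", "00000000002", "00000000003"]

targets = pick_test_numbers(OWNED_TEST_NUMBERS, 2, seed=42)
```

The trade-off is obvious: you can only test as many phones as you own. Asking for more raises an error instead of quietly making numbers up.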

The long night

It was 17:30 on a Friday evening when I found all this out.

"Do you still have the CSVs you used when testing?" asked my colleague. We could just run the scripts again, but this time with an unlock setting.

Of course, I didn't! The script had been hacked together and overwrote the last values. I just had the most recent ones. It was meant to be a one-time thing for a quick test.
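For the record, keeping every batch is cheap. Here is a rough sketch, assuming the numbers are written out as a one-column CSV; the function names, the `locked_*.csv` file naming and the `phone_number` header are all invented for illustration. Each run gets its own timestamped file, and opening with mode `"x"` makes Python refuse to overwrite an existing one.

```python
import csv
from datetime import datetime, timezone
from pathlib import Path


def save_batch(numbers, out_dir, now=None):
    """Write one run's numbers to a new, timestamped CSV and return its path."""
    now = now or datetime.now(timezone.utc)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"locked_{now:%Y%m%dT%H%M%S%f}.csv"
    # Mode "x" raises FileExistsError rather than silently overwriting.
    with open(path, "x", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["phone_number"])
        writer.writerows([n] for n in numbers)
    return path


def load_batches(out_dir):
    """Read back every saved batch, oldest first, as one flat list."""
    numbers = []
    for path in sorted(Path(out_dir).glob("locked_*.csv")):
        with open(path, newline="") as fh:
            numbers.extend(row["phone_number"] for row in csv.DictReader(fh))
    return numbers
```

With something like this in place, "unlock everything we locked" becomes a loop over `load_batches(...)` rather than an evening of scraping.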

There was a way around the problem: I could log in to this Korean K-Pop mobile phone company's system and download the list. But they would only allow you to download 100 at a time, and I needed to download 10,000+.

So another script to download the phone numbers. And yet another one to clean up the shitty CSVs they gave us back because, of course, they mangled everything up.
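Getting 10,000+ rows out of something that hands over 100 at a time is a plain pagination loop. The sketch below assumes a hypothetical `fetch_page(offset, limit)` callable that returns a list of rows; the real portal's interface isn't described here, so treat that signature as a stand-in.

```python
def download_all(fetch_page, page_size=100):
    """Collect every row from a source that returns at most `page_size` rows.

    `fetch_page(offset, limit)` is a hypothetical callable returning a list.
    A page shorter than `page_size` is taken to mean "that was the last one".
    """
    rows = []
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        rows.extend(page)
        if len(page) < page_size:
            return rows
        offset += len(page)


# Usage with an in-memory fake standing in for the portal:
fake_rows = [f"phone-{i}" for i in range(250)]
all_rows = download_all(lambda offset, limit: fake_rows[offset:offset + limit])
```

A real version would also want a retry and an upper bound on the number of pages, in case the server ignores the offset and keeps returning the same full page forever.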

And a 3rd script to reconnect to the server and unlock all phones.

My colleague was helping me, so we got it all done in about an hour. By 18:30, we had unlocked all phones. A project manager confirmed it, and I took a breath of relief and quickly logged off before more fuckups could be found.

The Lessons we learnt (or more likely, didn't learn):

  • Test scripts need as much care and love as normal code
  • You should never test critical production code under deadlines / high pressure, no matter what management thinks

(and yes, no one follows the above advice; if you want to keep your job, you do as you're told or look for a new job elsewhere)

  • The Korean K-Pop mobile phone maker had no checks: it was a pain to register on their system, but once you were registered, you could lock any phone anywhere in the world

The K-pop company knew, based on the IMEI number, which operator "owned" a phone, and hence was legally responsible for it. So if you buy a phone from T-Mobile, this is registered (in multiple places and databases), and only T-Mobile is (should be?) allowed to lock the phone, at least until the customer pays off their phone or changes the provider.

And yet, K-Pop Central allowed anyone to lock any phone if they just knew its phone number or IMEI (neither of which is hard to get; the phone number is in many cases public knowledge).

The lesson here is: Do some f'king checks before you lock someone's phone, but I don't know if that lesson applied to us or K-pop (most likely, both of us).
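The check itself does not have to be clever. A minimal sketch, assuming the IMEI-to-operator registry can be treated as a simple lookup: `owner_by_imei`, `send_lock` and `NotOwnerError` are hypothetical names, and the IMEIs are placeholders, not anything from the real system.

```python
class NotOwnerError(PermissionError):
    """Raised when an operator tries to lock a phone it does not own."""


def lock_phone(imei, operator, owner_by_imei, send_lock):
    """Lock `imei` only if the registry says `operator` owns it.

    `owner_by_imei` is a hypothetical mapping of IMEI -> owning operator.
    `send_lock` is whatever actually performs the lock.
    Unknown IMEIs are refused as well: no record, no lock.
    """
    owner = owner_by_imei.get(imei)
    if owner is None or owner != operator:
        raise NotOwnerError(f"{operator!r} is not allowed to lock {imei}")
    send_lock(imei)


# Usage with placeholder data:
registry = {"000000000000001": "operator-a", "000000000000002": "operator-b"}
locked = []
lock_phone("000000000000001", "operator-a", registry, locked.append)
```

Here `operator-a` trying to lock the second IMEI, or any IMEI missing from the registry, gets an exception instead of a bricked phone.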

  • Testing should be done incrementally. Rather than testing the final system, we should have first tested each component. Why didn't we? Because we had no time
  • Happy path testing, where you just test the "best case", is never enough. Ideally, we should have considered testing for "bad" actors locking someone else's phone. Or just a rogue script doing it, as we did. This is not just good QA, but good security as well. But bah! None of that silliness for us, thanks.
  • We assumed that because our own software checked for this sort of thing, the K-pop phone's software would as well. And as the saying goes: When you assume, you make an ass out of you and me
  • Don't test shit in production! Now, we had a dev and staging server, but they weren't connected to the live app (due to time constraints and not enough people). Also, LOL, we needed to release the product in the next 2-3 days, so could we just shut up and get on with the testing, please?
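If a test script really does have to touch production, a couple of cheap guards limit the blast radius. The sketch below is illustrative only: `lock_all`, the `live` flag and the `max_live` cap are invented for this example. It does nothing unless explicitly told to go live, and even then refuses batches above a small limit.

```python
def lock_all(numbers, lock_fn, live=False, max_live=10):
    """Lock each number via `lock_fn`, with two safety rails.

    - Dry run is the default: nothing is locked unless live=True.
    - A live run refuses any batch larger than `max_live`.
    Returns a list of log lines describing what happened.
    """
    numbers = list(numbers)
    if not live:
        return [f"DRY RUN: would lock {n}" for n in numbers]
    if len(numbers) > max_live:
        raise RuntimeError(
            f"refusing to lock {len(numbers)} phones in one go (limit {max_live})"
        )
    log = []
    for n in numbers:
        lock_fn(n)
        log.append(f"locked {n}")
    return log


# Default call is harmless: it only reports what it would have done.
report = lock_all(["00000000001", "00000000002"], lock_fn=print)
```

Neither rail replaces a staging environment, but a script that has to be asked twice before locking thousands of phones is harder to fire off ten seconds after pressing save.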

As I said, we didn't learn any of these lessons: as soon as the product was officially released the next week, we were back to our shitty scripts dumping crap in databases and pretending we had "tested" the code.

Look ma, this box on this UI is ticked, the code is working!

We didn't get fired, because

a) We fixed that shit real quickly

but more importantly,

b) Some project managers and Important sales managers had also been doing the same thing – making up phone numbers and locking real phones. Of course, they were doing it manually so only locked 3-4 phones, while we locked 10,000, but the final effect was the same. We were all pissing in the pool.

Everyone pretended it was an honest mistake. Lol, look how silly we are. Not a big deal, live and learn ole chap, live and learn.

Except for the poor sod in Peru who couldn't make phone calls or post duck-face pictures to Instagram, but no one asked him.

Update 20/10/2021 based on lots and lots of Reddit/HN comments:

I'm seeing the same type of comments here and on Reddit, so I'll try to answer a few common ones.

1. Yes, I know it was stupid testing on production. But we'd been told we needed to release the product on Monday (this was Friday), no objections. A previous project manager had been fired for not being fast enough, and the whole test department was at risk of redundancy. Vice president level executives were asking for updates in daily standups.

Like I say in the post, sometimes you do what you're told or you find another job. If it hadn't been the height of Covid, I might have chosen option 2.

2. Why didn't we test on a staging / dev server? We did have those, but they didn't work with the K-Pop company's servers. And we didn't have time (or enough people) to make it work. Lots of other engineers decided to just quit. Besides, we never thought K-pop would actually lock phones we didn't own.

3. Some people have asked about it: This type of app is perfectly legal. If you get a phone on contract, you don't actually own it till you pay it off, and the company is entitled to lock it. This is built in on iPhones; Android phones need external apps to enforce this.

4. While it may seem stupid in retrospect (LOL, this guy calls himself an engineer, tests in production), at the time it was just one firefight after another, and there was no time to think or plan. I can joke about it now, but at the time, it was one of the worst pressure cooker environments.

Some updates: After we released the 3 products, the whole QA/test department was made redundant. Including my boss, who was fired while he was attending his mother's funeral. Funnily enough, I stayed because I knew details of a legacy product. So for the last 3-4 weeks, I had no boss or anyone to report to.

