Visualizing Repetitions in String using Python and Matplotlib

 3 years ago
source link: https://urish.medium.com/visualizing-repetitions-in-string-using-python-and-matplotlib-5e4e1ddff0c9

Stepping Out of My Comfort Zone in Python and a Security Conference

A few months ago, I was accepted to present at BSidesTLV, a cybersecurity conference. This was quite unusual for me, as I usually speak at Web / JavaScript / Angular related conferences. The conference covered topics such as malware analysis, blockchain security, security vulnerabilities in movie subtitles (that was epic) and even hacking into luxury yachts.

As you can probably imagine, speaking at that conference meant stepping well out of my comfort zone. And so did my talk subject: I spoke about a research project I did last summer, breaking into an embedded device firmware that was encrypted using a substitution cipher. The talk itself is not online yet; I will tweet when it goes public.

I didn’t have much hands-on experience in cryptography. And while I have used Python extensively in the past, and still use it for some of my IoT projects, I had to learn how to use numpy, Matplotlib and scipy for this project. Completing it proved to be a really difficult challenge for me.

Why I Needed to Visualize Character Repetitions in Python

One of the methods I used for defeating the encryption was a known-plaintext attack — basically, I had some indication that a specific string appeared somewhere inside the firmware (once decrypted). I didn’t know the specific location of the string, and didn’t even know it was there for sure.

Thus, I decided to look for the string by matching its repetition pattern. I assumed I was dealing with a substitution cipher, which means that for every character that repeats in the plaintext string, there has to be a matching character that repeats in the same pattern inside the encrypted firmware.
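To make the idea concrete, here is a minimal sketch of pattern matching under a simple (one-to-one) substitution cipher. It is not the code from the research project; the function names `repetition_pattern` and `find_pattern_matches` are made up for illustration. Each symbol is replaced by the index of its first occurrence, so two strings that differ only by a substitution end up with identical patterns.

```python
def repetition_pattern(s):
    """Replace each symbol by the index of its first occurrence in s.

    "HELLO" -> [0, 1, 2, 2, 4]: the second "L" points back at the first one.
    """
    first_seen = {}
    pattern = []
    for i, symbol in enumerate(s):
        first_seen.setdefault(symbol, i)  # remember only the first position
        pattern.append(first_seen[symbol])
    return pattern


def find_pattern_matches(ciphertext, plaintext):
    """Return every offset where a window of the ciphertext repeats
    its symbols in exactly the same way as the known plaintext."""
    target = repetition_pattern(plaintext)
    n = len(plaintext)
    return [
        start
        for start in range(len(ciphertext) - n + 1)
        if repetition_pattern(ciphertext[start:start + n]) == target
    ]
```

The same functions should work on `bytes` as well as `str`, which is closer to what a firmware image looks like. This brute-force scan recomputes the pattern for every window, which is fine for a demo but slow on large inputs.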

Explaining the above concept on stage is quite challenging — and so is seeing the repetition patterns of characters in a string. So I decided to devise a method to visualize these repetitions. I did the talk in live-coding format, as I usually do nowadays, and since I used Python and Matplotlib (inside Jupyter) for my code examples, I wanted the visualization to use the same tools.

Using Colors to Visualize

Initially, I thought about just painting different letters with different colors. However, this is quite confusing even for a simple example such as “Hello World”:

[Image: “Hello World” with each letter painted in a different color]

It’s not immediately obvious that the “o” appears twice, the “l” three times, and every other letter only once. In addition, coding this kind of visualization with Matplotlib proved challenging, as you have to draw each letter individually and take care of the layout yourself. Above all, using only colors to visualize repetitions means that people with color blindness will miss the point entirely.

Using Position to Visualize

My next attempt was to set the vertical position of each character according to its ASCII value:
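The original listing is missing from this copy of the article. Below is a minimal sketch of what such code might look like, based only on the description that follows; the function names `ascii_positions` and `plot_by_ascii` are hypothetical, and the line numbers mentioned in the text refer to the original listing, not to this sketch.

```python
def ascii_positions(text):
    """Return (x, y, character) triples: x is the index in the string,
    y is the character's ASCII value as returned by ord()."""
    return [(i, ord(ch), ch) for i, ch in enumerate(text)]


def plot_by_ascii(text):
    """Draw each character at a height equal to its ASCII value."""
    # Imported here so ascii_positions() stays usable without Matplotlib.
    import matplotlib.pyplot as plt

    for x, y, ch in ascii_positions(text):
        plt.text(x, y, ch)  # the y offset is the character's ASCII value

    # Text does not affect autoscaling, so set the visible area by hand.
    plt.xlim(-1, len(text))
    plt.ylim(0, 128)
    plt.show()
```

Calling `plot_by_ascii("Hello World")` in a Jupyter cell should place both “o” characters at the same height (111) and all three “l” characters at 108.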

The magic happens at line 7, where I set the y offset of each character to its ASCII value (returned by the ord() function). Lines 9–10 dictate which area of the chart should be visible. This is what the result looks like:

[Image: “Hello World” with each character drawn at a height equal to its ASCII value]

In my opinion, this is actually better — not as colorful and attractive as the previous one, but it is much easier to spot the repetitions.

However, when I tried this with the string I actually wanted to visualize during my talk, it didn’t look as good:

[Image: the longer string from the talk, with each character drawn at a height equal to its ASCII value]

You could still easily spot the repetition patterns, but if you try to figure out what the original string reads… well, good luck!

Additionally, you miss all the whitespaces (and can barely see dots and commas).

Combining the Approaches — Colors and Positions

My next attempt was to combine both approaches. Instead of scattering the letters along the vertical axis, which makes the string really hard to read, I’ll keep them all in a single line, and use star shapes above the text, where both the vertical position and the color of each star will be set according to the character below it.

In other words, all the A’s will have the same color and vertical position for the stars, all the B’s will share the same color and vertical position for the stars, etc. This is what it looked like:

[Image: the string on a single line, with a colored star above each character]
Suddenly the text is readable!

You can easily see the repetition patterns of each letter by looking at the star above it — for instance, the “O” in the word “TO” has a red star above it. If you scan vertically, you can easily see there are 5 more O’s, all pretty far away to the right.

In addition, it is easy to spot the dots (orange stars around Y = 46) and the whitespaces (blue stars at Y = 32, the ASCII code for space). And the source code:
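The listing itself did not survive in this copy. The sketch below is a reconstruction from the explanation that follows, not the original code: the names `char_indices` and `plot_repetitions` are hypothetical, and the line numbers cited in the text belong to the original listing.

```python
def char_indices(text):
    """Map each distinct character, in sorted order, to the list of
    indices where it appears in the string."""
    chars = sorted(set(text))
    return {ch: [i for i, c in enumerate(text) if c == ch] for ch in chars}


def plot_repetitions(text, label_y=55):
    """Draw the text on one line, with a star above each character."""
    # Imported here so char_indices() stays usable without Matplotlib.
    import matplotlib.pyplot as plt

    for ch, indices in char_indices(text).items():
        # One plot() call per character: every occurrence gets the same
        # height (the ASCII value) and the same automatically chosen color.
        plt.plot(indices, [ord(ch)] * len(indices), '*')

    # Draw the characters themselves at a fixed height below the letters'
    # stars and above the stars for spaces and punctuation.
    for i, ch in enumerate(text):
        plt.text(i, label_y, ch)

    plt.show()
```

For an uppercase sample such as `plot_repetitions("HELLO WORLD.")`, the letter stars sit between 65 and 90, the space at 32 and the dot at 46, which leaves 55 free for the text line.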

Lines 6–7 extract a list of all the characters that appear in the string, sorted. In other words, chars will contain the following value:

[' ', '.', 'A', 'B', 'C', 'D', 'E', 'H', 'I', 'K', 'L', 'M', 'N', 'O', 'R', 'T', 'U', 'W']

Then, we go over each character, and extract all the indices where it appears in the string (line 9). In line 10 we ask Python to draw stars in all these indices, setting the Y position to the ASCII value of the character.

You may wonder about the colors — the code never talks about colors. This is actually handled implicitly by Matplotlib: whenever you call the plot() method, it uses the next available color from a set of 10 predefined colors. That’s the reason why we sorted the characters: it ensures adjacent characters won’t accidentally get the same color.
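As a rough illustration of that cycling, the sketch below assigns Matplotlib's `C0` to `C9` color names to the characters listed earlier, in the order a sorted loop would visit them. It assumes the default 10-color cycle and one `plot()` call per character; the helper name `cycle_colors` is made up for this example.

```python
# The sorted characters of the string, as listed in the article.
chars = [' ', '.', 'A', 'B', 'C', 'D', 'E', 'H', 'I', 'K',
         'L', 'M', 'N', 'O', 'R', 'T', 'U', 'W']


def cycle_colors(chars, n_colors=10):
    """Pair each character with the cycle color its plot() call would get.

    "C0".."C9" are Matplotlib's names for the entries of the default
    color cycle; after the last one, the cycle starts over.
    """
    return {ch: f"C{i % n_colors}" for i, ch in enumerate(chars)}


colors = cycle_colors(chars)
```

With 18 distinct characters and 10 colors, the cycle wraps around: here `' '` and `'L'` both get `C0` (blue), and `'O'` gets `C3` (red). Characters that share a color are 10 places apart in the sorted list, so their stars still sit at clearly different heights.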

Lines 12–13 simply draw all the characters, one-by-one, at a predefined vertical position — I used 55, to fill the empty space between the letters and the punctuation / white spaces.

Final Touches

I’m a bit of a perfectionist, and if you are like me, you probably noticed that all the “I” characters are a bit off.

This is easily fixed by using a monospace font — just add a family='monospace' parameter to the plt.text() call in line 13:

[Image: the same chart with the text drawn in a monospace font]

A final touch was adding vertical guide lines, to make it easier to match a star with its respective letter. I achieved this by adding the following two lines of code just before calling plt.show():

# MultipleLocator needs this import at the top: from matplotlib.ticker import MultipleLocator
plt.gca().xaxis.grid(True, 'minor')
plt.gca().xaxis.set_minor_locator(MultipleLocator(1))

These lines draw a grid on the X axis (thus forming vertical lines), setting the distance between the grid lines to 1 (so it’s drawn between every two successive characters):

[Image: the final chart, with vertical grid lines between successive characters]

Can You Do Better?

At this point I was pretty happy with the visualization. The point I was trying to make in my talk is that the string I was looking for had a highly distinctive pattern of character repetitions, so if I did find a match in the encrypted firmware, it had to be this string: there was virtually no chance that I’d be wrong.

I think the visualization does make the point, but I would love to see if others can come up with even better ways to visualize repetitions in strings, be it in Python, CSS, SVG, D3, Three.js or whatever technology you like. How creative can you get here?

