
A Journey through a Buyer’s life and Shop similarity


A friend recently pointed me toward “Interactive Map of Reddit and Subreddit Similarity Calculator” and asked me if a similar approach could be used to cluster online shops and get a shop similarity calculator. The idea was interesting enough for me to spend the last few weeks exploring it.

Working at Shopify we get all sorts of interesting data on the merchants we serve (over 600k businesses worldwide) and on the customers who purchase from them. The analysis I will lay out here is similar to what has been done for reddit. Instead of looking at subreddits, we will look at shops on the Shopify platform, and instead of looking at comments posted by users in those subreddits, we will look at purchases made by customers of those shops.

In order to get significant clustering of those shops, I looked at one year of purchase history from buyers: close to a billion orders made on those shops. The principle is quite similar to the one used in the reddit analysis; here we have “guilt by shared buyers”: if the same buyer makes purchases in two stores, it is likely that those stores fulfill the needs of that buyer. By aggregating this over all of the buyers those shops serve, we observe trends where shops with shared buyers fulfill the needs of specific types of buyers. So each shop can be represented by a mathematical vector that summarizes how many buyers it shares with other shops.

I will go through the steps of the analysis in the second part of this post, but first let's see some of the results. After clustering the shops I ended up with about 80 clusters. Having only so many different colors at my disposal, the representation of those clusters is what you could literally call a clusterfuck! We can see there are some clusters, but with this many of them, not much is distinguishable.

[Figure: all ~80 shop clusters plotted together]

To get some order in this chaos I decided to look at two clusters on opposite ends of this figure and dug into a random sample of the shops constituting those clusters to figure out what they were about. This is how I got the theme of this activity: how to get from vaping to having kids in 9 easy steps!


Obviously this is not the only story that could be told here; it is only one of the possible paths through the 80 or so clusters I got from the analysis. Some paths are more mundane, some might be funnier. At least this one is somewhat entertaining!

As a first step, if you are vaping and not yet in the hip-hop / skater culture, you might want to look into it! It is a small move, but one in the right direction.


Next, you should hit the gym and get all you need to make it your new passion. By now, your vaping days should be behind you. You are ready for a new life!


But that’s probably not enough. You will have to get what you need for more weight lifting: protein bars, drinks and nutrients to get you in shape.


As I said, this is just one possible path. Other things need to happen at the same time, since your next step should be to purchase some jewelry, maybe an engagement ring?


Why stop there? Now look at luxurious vacations in the South Seas, think honeymoon, and have fun.


Now is a good time to get back to the source: get some “modest” clothing, look again at religion and prepare to have a baby.


If all goes according to plan, you or your partner is ready to get pregnant and experience the joys of expecting a baby!


And let me tell you, as a father of five kids, that in no time those babies will be kids! And these are the nine steps from vaping to kids!


And now it is time to give some more insight into the methodology followed. The source I based myself on provided the SQL syntax used to extract the subreddit representation vector input data. I will not go to that extent, as this is non-public data anyway, but I will comment on what we need to achieve here.

Representation Vector

The shop representation vector is built around the concept of latent semantic analysis, where instead of words and paragraphs we have shops and shared buyers. In order to build the representation vector we first need a table of the number of shared buyers for each shop against every other shop. As I mentioned, Shopify hosts more than 600k shops, so you can imagine that this table could become quite big to produce, and some rules and limits need to be enforced. In the same way that the reddit representation vector did not represent all subreddits versus all other subreddits (the vector only represents co-occurrence of comments with 2,000 of the top subreddits), we chose our vector to represent shared buyers with only the thousand or so most “shared” shops. We also excluded a number of extremely popular shops which are just “too” present everywhere and which are basically collinear with all other shops in terms of shared buyers.
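To make this concrete, here is a minimal sketch of how such a shared-buyer table could be built with pandas. The orders.csv file and the buyer_id / shop_id column names are hypothetical stand-ins for the actual order data, which I obviously cannot share:

import pandas as pd

# Hypothetical extract of one year of order history: one row per order,
# with the buyer and the shop (file and column names are assumptions).
orders = pd.read_csv("orders.csv", usecols=["buyer_id", "shop_id"])

# One row per distinct (buyer, shop) pair, so repeat purchases count only once.
pairs = orders.drop_duplicates()

# Self-join on buyer_id to count the distinct shared buyers for every pair of shops.
shared = (
    pairs.merge(pairs, on="buyer_id", suffixes=("", "_other"))
         .query("shop_id != shop_id_other")
         .groupby(["shop_id", "shop_id_other"])["buyer_id"]
         .nunique()
         .unstack(fill_value=0)
)

# Keep only the columns of the most "shared" shops (about a thousand of them).
top_shops = shared.sum(axis=0).nlargest(1000).index
shop_vectors = shared[top_shops]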

Once we have that table, the next step is to extract a positive pointwise mutual information matrix. The code takes as input a pandas dataframe whose index is the shop_id and whose columns hold the number of shared buyers with each of the thousand or so most “shared” shops.

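What follows is a minimal sketch of such a PPMI computation (a sketch, not the exact code we used), assuming the shared-buyer counts sit in a dataframe like the shop_vectors built above:

import numpy as np
import pandas as pd

def positive_pmi(counts: pd.DataFrame) -> pd.DataFrame:
    # counts: index = shop_id, columns = the ~1,000 most "shared" shops,
    # values = number of shared buyers.
    total = counts.values.sum()
    p_joint = counts.values / total                # p(shop_i, shop_j)
    p_rows = counts.sum(axis=1).values / total     # p(shop_i)
    p_cols = counts.sum(axis=0).values / total     # p(shop_j)

    # PMI = log( p(i, j) / (p(i) * p(j)) ), computed quietly where counts are zero.
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_joint / np.outer(p_rows, p_cols))

    # Positive PMI: keep only positive, finite values; everything else becomes zero.
    ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)
    return pd.DataFrame(ppmi, index=counts.index, columns=counts.columns)

ppmi = positive_pmi(shop_vectors)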

For more details on pointwise mutual information you can also refer to Document Summarization Using Positive Pointwise Mutual Information. Many kudos as well to Max Shenfield for his Python implementation of the subreddit distance calculation, and for pointing out that the Euclidean distance between normalized vectors is a metric equivalent to the cosine similarity between those vectors.
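That last point is easy to check numerically: for unit-length vectors the squared Euclidean distance equals 2 - 2 times the cosine similarity, so ranking neighbors by one is the same as ranking by the other.

import numpy as np

rng = np.random.default_rng(0)
u, v = rng.normal(size=50), rng.normal(size=50)
u, v = u / np.linalg.norm(u), v / np.linalg.norm(v)

# For unit vectors, ||u - v||^2 == 2 - 2 * cos(u, v).
assert np.isclose(np.sum((u - v) ** 2), 2 - 2 * (u @ v))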

Distance Calculation

The choice of distance metric and distance calculation is constrained, for the most part, by the choice of clustering algorithm and what it supports. We chose to go with the Minkowski distance with p=2 (which is basically the Euclidean distance). We also experimented with different values of p. Notably, p=3 seems to provide slightly better results (personal taste), although it is not necessarily supported by all the clustering algorithms we may want to try. Now that we have the positive pointwise mutual information matrix, the process of getting distances, and more importantly close neighbors, is quite straightforward. For example, with the code below you can get the ten closest neighbors of a specific point.

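A minimal sketch of that lookup with scikit-learn's NearestNeighbors (a sketch, not the exact code), starting from the ppmi dataframe of the previous step:

from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import normalize

# Normalize each shop's PPMI row so that Euclidean distance ranks neighbors
# the same way cosine similarity would (see the note above).
vectors = normalize(ppmi.values)

# Index the vectors for neighbor queries with the Minkowski distance, p=2.
nn = NearestNeighbors(n_neighbors=11, metric="minkowski", p=2).fit(vectors)

# Ten closest neighbors of one example shop (the first hit is the shop itself).
distances, indices = nn.kneighbors(vectors[[0]])
closest_shops = ppmi.index[indices[0][1:]]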

Clustering

Having selected this distance metric, we can now do the clustering. Here we had to go through a lot of back and forth. Our current intuition is that our clusters have certain properties:

  1. There are a number of outliers, shops with few or no close enough neighbors,
  2. Not all clusters present the same “density”.

How do we know about that? The first one is easy: by looking at the distance to the closest neighbor for all shops, we can see that the closest one is quite far for a good proportion of the shops. Those shops should not be clustered but should be “removed” as outliers, either as an initial step or by the clustering algorithm if it supports outlier detection. The second one is more of a hunch at this point, resulting from a number of observations. For example, we observed a number of clusters which were country or language specific. Although those clusters have a certain interest, it would be better if they could be further separated by center of interest. Clusters based on a specific center of interest are what we observe when we look only at US-based merchants, for example.

This hunch would lead us toward using a variable-density clustering algorithm, and we look forward to trying HDBSCAN in the future for that purpose; we haven't had time to explore it yet. We did try DBSCAN, which leads to some interesting results, however the implementation we use is quite slow… so for experimentation purposes we came to rely on k-means, after removing “outliers”, i.e. shops whose closest neighbor is further than a cut-off distance. After clustering, we also re-classify certain clusters as “outliers” when they contain fewer than a certain number of shops. We are in a way “faking” some of the DBSCAN properties using k-means.
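The outlier pre-filtering can be sketched as follows; the 90th-percentile cut-off is an illustrative assumption, not the threshold we actually used:

import numpy as np
from sklearn.neighbors import NearestNeighbors

# Distance from every shop to its single closest neighbor (column 0 is the shop itself).
dist, _ = NearestNeighbors(n_neighbors=2).fit(vectors).kneighbors(vectors)
dist_to_nearest = dist[:, 1]

# Shops whose nearest neighbor is beyond the cut-off are set aside as outliers.
cutoff = np.percentile(dist_to_nearest, 90)   # illustrative value
keep = dist_to_nearest <= cutoff
clusterable = vectors[keep]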

The basics required for k-means clustering can be summarized with the code below:

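A minimal sketch with scikit-learn's KMeans, including the re-classification of very small clusters as outliers (the choice of 80 clusters is explained just below; the minimum cluster size and random_state are assumptions):

import numpy as np
from collections import Counter
from sklearn.cluster import KMeans

# Fit k-means on the shops that survived the outlier cut-off.
labels = KMeans(n_clusters=80, n_init=10, random_state=42).fit_predict(clusterable)

# Re-classify very small clusters as outliers (label -1).
min_size = 10   # illustrative value
sizes = Counter(labels)
labels = np.array([lab if sizes[lab] >= min_size else -1 for lab in labels])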

The fact that we used 80 clusters as a “magic” number may raise some questions. That “magic” number in fact comes from the analysis of the Calinski-Harabasz score versus the number of clusters, for a range of possible values. Running the code below, we obtained the following graph and estimated the elbow at around 80.

[Figure: Calinski-Harabasz score versus number of clusters]
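A sketch of that sweep, assuming scikit-learn and matplotlib (the exact range of cluster counts we tried is not reproduced here):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

# Score k-means for a range of cluster counts and look for the elbow.
ks = list(range(10, 201, 10))
scores = []
for k in ks:
    candidate = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(clusterable)
    scores.append(calinski_harabasz_score(clusterable, candidate))

plt.plot(ks, scores)
plt.xlabel("number of clusters")
plt.ylabel("Calinski-Harabasz score")
plt.show()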

Presentation

Finally, we need to present those clusters visually, as in the pictures you saw in the first part of this post. For that purpose we used Linear Discriminant Analysis and kept the first two dimensions, the ones that best separate the clusters obtained from k-means.

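A minimal sketch with scikit-learn's LinearDiscriminantAnalysis (the plotting details are assumptions):

import matplotlib.pyplot as plt
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Project the shop vectors onto the two discriminant axes that best separate
# the k-means clusters; those two axes are what the scatter plots above show.
lda = LinearDiscriminantAnalysis(n_components=2)
projected = lda.fit_transform(clusterable, labels)

plt.scatter(projected[:, 0], projected[:, 1], c=labels, s=2, cmap="tab20")
plt.show()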

Obviously I’m skipping a lot of implementation details, for example removing outlier clusters from the 80 clusters detected by k-means. But the gist of the method lies in what I exposed above.

I hope this can help you apply a similar methodology to find similarity between different concepts. As you can see, this method not only yields interesting results for words in documents and for subreddit classification, but can also apply to similarity between shops on a SaaS platform. Let me know of other interesting applications you put the method to!

Cover picture by Ake on Rawpixel.

Originally published at thelonenutblog.wordpress.com on February 6, 2019.

