Part 2/3: Wikipedia Clickstream analysis with Neo4j - queries and exploration
source link: http://blog.bruggen.com/2021/03/part-23-wikipedia-clickstream-analysis.html
Monday, 29 March 2021
In the previous blogpost, I showed you how easy it was to import data into Neo4j from the official Wikipedia clickstream data. I am sure you would agree that it was surprisingly easy to import a reasonably sized dataset like that, within a very reasonable timeframe. So now we can have some fun with that data, and start applying some graph queries to it. All of these queries are also on GitHub, of course, and you can play around with them there as well.
So let's take a look at some of these queries.
Some data profiling and exploration
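// count the links per link type
match (n)-[r:LINKS_TO]->(m)
return distinct r.type, count(r);

// count all nodes
match (n)
return count(n);

// every page that links directly to the Neo4j page
match (source:Page)-[sourcelink:LINKS_TO]->(neo:Page {title:"Neo4j"})
return source, sourcelink, neo;

// paths of up to two hops that end at the Neo4j page
match path = (sourcepage:Page)-[:LINKS_TO*..2]->(neopage:Page {title: "Neo4j"})
return path
limit 100;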
Understanding the importance of the links
And we can actually do that for the links two hops away:
match (source:Page)-[sourcelink:LINKS_TO*..2]->(neo:Page {title:"Neo4j"})
return source.title, reduce(sumofq = 0, r in sourcelink | sumofq + r.quantity) as total
order by total desc
limit 10;
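To make the reduce() call in that query a little less opaque, here is a minimal sketch that runs it over a hard-coded list instead of real relationships. The three numbers are invented stand-ins for the quantity values along one path, not figures from the clickstream data.

```cypher
// Illustrative only: the list is made up and stands in for the
// r.quantity values of the relationships on a single path.
// sumofq starts at 0 and each element q is added to it in turn.
RETURN reduce(sumofq = 0, q IN [10, 20, 12] | sumofq + q) AS total;
```

This returns a single row with total = 42, which is the same accumulation the query above performs per path.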
match (sourceofsource:Page)-[sourceofsourcelink:LINKS_TO]->(source:Page)-[sourcelink:LINKS_TO]->(neo:Page {title:"Neo4j"})
return sourceofsource.title+"==>"+source.title+"==> Neo4j" as pages, sourceofsourcelink.quantity+sourcelink.quantity as total
order by total desc
limit 10;

match path = (source:Page)-[sourcelink:LINKS_TO*..2]->(neo:Page {title:"Neo4j"})
return [node in nodes(path) | node.title] as titles,
reduce(sumofq = 0, r in relationships(path) | sumofq + r.quantity) as total
order by total desc
limit 10;
Which highlights a very interesting link between the British royals and Neo4j :) ...
match path = (target:Page)<-[targetlink:LINKS_TO*..4]-(neo:Page {title:"Neo4j"})
return reverse([node in nodes(path) | node.title]) as titles,
reduce(sumofq = 0, r in relationships(path) | sumofq + r.quantity) as total
order by total desc
limit 10;
Tackling case sensitivity
One thing I noticed while I was working on the above examples is that it is pretty confusing to do querying that is case insensitive. Take the following example:
match (p:Page)
where p.title contains "beer"
return count(p);
The contains operator is case sensitive, so this only counts titles that include a lowercase "beer".
match (p:Page)
where p.title =~ "(?i).*beer.*"
return count(p);
CALL db.index.fulltext.createNodeIndex("pagetitleindex", ["Page"], ["title"]);
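The procedure call above is what the post used in 2021. As far as I know, it was deprecated later in the 4.x line and removed in Neo4j 5, so on a newer release the same index would be created with the CREATE FULLTEXT INDEX command instead. A minimal sketch, assuming a Neo4j version that supports this syntax and reusing the index name from above:

```cypher
// Assumes Neo4j 4.3 or later; not part of the original post.
// Creates the same full-text index on Page.title, skipping creation
// if an index with this name already exists.
CREATE FULLTEXT INDEX pagetitleindex IF NOT EXISTS
FOR (p:Page) ON EACH [p.title];
```

The db.index.fulltext.queryNodes calls work the same way against an index created like this.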
CALL db.index.fulltext.queryNodes("pagetitleindex", "*beer*") YIELD node
return count(node);

CALL db.index.fulltext.queryNodes("pagetitleindex", "*beer*") YIELD node, score
return node.title;
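Besides the regular expression and the full-text index, a third option that is not in the original post is to lowercase the title before comparing. This is only a sketch: it is easy to read, but because the function is applied to every title it will most likely scan all Page nodes rather than use an index.

```cypher
// Not from the original post: case-insensitive substring match
// by lowercasing each title before applying contains.
match (p:Page)
where toLower(p.title) contains "beer"
return count(p);
```

For a one-off count on a dataset of this size that may well be fine; for repeated lookups the full-text index is the better fit.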