Making sense of 2020's mad calendar with Neo4j
source link: http://blog.bruggen.com/2020/11/making-sense-of-2020s-mad-calendar-with.html
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
Wednesday, 4 November 2020
Making sense of 2020's mad calendar with Neo4j
As we enter November 2020, I - like many people I assume - can't help but feel quite "betwattled" by all of the events taking place this year. I took some time last weekend to look at all the crazy events that happened ... starting with pretty normal January and February, moving slowly to ominous March, and then living the weird, (semi-) locked down lives that we have been living until this very day I write this, which is the day after the bizarre US elections.In any case, I decided to have some fun while reflecting about all this. And in my world, that means playing with data, using my favourite tools... CSV files, Google Sheets, and of course, Neo4j. Let me take you for a ride.
Starting out with my calendar
The starting point of all this is of course my Google Calendar - which is buried in online calls and meetings these days.
So in order to work with that data, I had to go to the Google support page and learn how to export it into an "ICS" file. This turned out to be easy enough. ICS file format takes a little getting used to:But it is pretty easy to convert that into a CSV file using a tool like this one. The result is a flat file structure that is very useable:However, to make it more straightforward to follow this blogpost, I have created a sample file that completely corresponds to my actual import. It's in a Google Sheet over here, and that then obviously also corresponds to a CSV download that you could work with straight away in your queries.
load csv with headers from "file:/calendarfull.csv" as csv
create (e:Event)
set e = csv
set e.starttime = datetime(e.starttime)
set e.endtime = datetime(e.endtime)
set e.createdtime = datetime(e.createdtime)
set e.lastmodifiedtime = datetime(e.lastmodifiedtime);
create index on :Event(starttime);create index on :Event(endtime);create index on :Event(createdtime);create index on :Event(lastmodifiedtime);create index on :Event(summary);create index on :Event(description);create index on :Company(name);create constraint on (p:Person)assert p.email is unique;create index on :ConferenceProvider(name);
Graphifying the imported data
Then there's some clean-up that needs to be done. First, I will correct the fact that there are Events in the data that don’t have an organiser. I will just manually give them my own details as organiser with this query:MATCH (e:Event)
WHERE e.starttime.year < 2020 OR e.calendartype = "personal"
DELETE e;
And then lastly before we start the actual "graphification" process, let me consolidate all the email addresses, remove the "mailto:" prefix, and make these updates both in the "organizer" and the "attendee" fields of the Event nodes:match (e:Event)
where e.organizer is null
set e.organizer = "mailto:[email protected]";
These three steps proceed very quickly as you can imagine:match (e:Event)
set e.organizer = replace(e.organizer,"mailto:","")
set e.attendee = replace(e.attendee,"mailto:","")
set e.organizer = replace(e.organizer,"@neotechnology.com", "@neo4j.com")
set e.organizer = replace(e.organizer,"@neo4j.org", "@neo4j.com")
set e.attendee = replace(e.attendee,"@neotechnology.com", "@neo4j.com")
set e.attendee = replace(e.attendee,"@neo4j.org", "@neo4j.com");
and the we just need to look at the "attendee" field of the event. This field contains a list of people's email addresses, and therefore we need to split the field up, and unwind that collection. This is how we remove the attendees from the events and connect them:match (e:Event)
merge (p:Person {email: e.organizer})
create (p)-[:ORGANIZES]->(e);
match (e:Event)
with e, split(e.attendee,";") as CollectionOfAttendees
unwind CollectionOfAttendees as IndividualAttendee
merge (p:Person {email: IndividualAttendee})
merge (p)-[:ATTENDS]->(e);
The query for this process looks like this:
MATCH (p:Person)
WITH distinct split(p.email,"@")[1] as domain, p
WHERE not domain in ['email.com','163.com','free.fr','gmail.com','group.calendar.google.com','resource.calendar.google.com','qq.com','hotmail.com','mail.com','outlook.com','skynet.be','vanbruggen.be']
merge (c:Company {name: toUpper(domain)})
merge (p)-[:WORKS_FOR]->(c);
MATCH (c:Company)
set c.name = replace(c.name,".COM","")
set c.name = replace(c.name,".ORG","")
set c.name = replace(c.name,".BE","")
set c.name = replace(c.name,".NL","")
set c.name = replace(c.name,".FR","")
set c.name = replace(c.name,".CO.UK","")
set c.name = replace(c.name,".GOV.UK","")
set c.name = replace(c.name,".CO","")
set c.name = replace(c.name,".DK","")
set c.name = replace(c.name,".NO","")
set c.name = replace(c.name,".IO","");
So now we look at the the updated model, and see the graphification to take hold:match (c1:Company), (c2:Company)
where id(c1) < id(c2)
and c1.name = c2.name
with collect([c1,c2]) as companies
unwind companies as company
call apoc.refactor.mergeNodes(company,{properties:"discard",mergeRels:true}) yield node
return node;
match (e:Event)
where e.description contains "meet.google.com"
merge (cp:ConferenceProvider {name:"Google Meet"})
merge (e)-[:USES_CP]->(cp);
match (e:Event)
where e.description contains "neo4j.zoom.us"
merge (cp:ConferenceProvider {name:"Zoom"})
merge (e)-[:USES_CP]->(cp);
match (e:Event)
where e.description contains "teams.microsoft.com"
merge (cp:ConferenceProvider {name:"Microsoft Teams"})
merge (e)-[:USES_CP]->(cp);
match (e:Event)
where toUpper(e.description) contains "GOTOMEETING"
merge (cp:ConferenceProvider {name:"LogMeIn GoToMeeting"})
merge (e)-[:USES_CP]->(cp);
match (e:Event)
where toUpper(e.description) contains "WEBEX"
merge (cp:ConferenceProvider {name:"Cisco WebEx"})
merge (e)-[:USES_CP]->(cp);
Querying the CalendarGraph
You can see how I started switching more to Zoom.match (e:Event)--(cp:ConferenceProvider)
return cp.name, count(cp);
match (e:Event)
where not ((e)--(:ConferenceProvider))
return count(e);
match (e:Event)--(cp:ConferenceProvider)
return distinct e.starttime.month as month, count(cp)
order by month asc;
match (e:Event) return distinct date(e.starttime), count(e) order by
count(e) desc limit 10;
Inferring new relationships
call apoc.periodic.iterate(
"match (p1:Person)-[:ATTENDS]->(e:Event)<-[:ATTENDS]-(p2:Person)
where id(p1)<id(p2)
return p1,p2,e",
"merge (p1)-[mwr:MEETS_WITH]->(p2)
on create set mwr.frequency = 1
on match set mwr.frequency = mwr.frequency + 1",
{batchsize:1000, parallel:false});
and I can see the "usual suspects" at the top:match (p1:Person)-[mw:MEETS_WITH]->(p2:Person) return p1.email,
mw.frequency, p2.email order by mw.frequency desc limit 10;
Recommend
About Joyk
Aggregate valuable and interesting links.
Joyk means Joy of geeK