JavaScript vs Logs

source link: https://pointlessramblings.com/posts/javascript-vs-logs/

Mon 06 January 2020

A war has been raging over the last few years between consumers and advertisers. A never-ending cat and mouse game where each side tries to outdo the other. It’s tracking vs blocking, clicks vs privacy, readers vs publishers.

Gone are the days when adding the GA (or similar) snippet to your blog captured almost every user. My theory is that, for tech-oriented blogs, GA misses a significant portion of traffic. Tech readers are more aware of tracking and actually care about their privacy online.

But how big is the problem? What am I not seeing in GA? Am I talking a load of nonsense? To find out I used GoAccess to parse & track the stats for this blog from the nginx logs. The results have been quite revealing. Let’s look at February to July of 2018:

ga_vs_goaccess.png

That’s a huge difference! Much more than I ever expected. Note that this is after some cleanup. GoAccess uses server logs, so all traffic counts, including crawlers, static files and bad actors. The logs were filtered and processed as follows:

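# GET requests, 200 responses, no static/invalid files
# The third grep drops query strings, binary junk logged as literal \x
# escapes, and any line mentioning http. It is single-quoted so that \\x
# reaches grep intact and matches a backslash followed by x.
grep "GET" access-feb-jun.log | \
    grep '" 200 ' | \
    grep -vE '\?|\\x|http' | \
    grep -vE "\.(png|php|css|ico|xml|js|jpg|zip|cgi|gif|woff|woff2|eot|svg|ttf)" \
    > access-feb-jun-clean.log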

# Parse with goaccess, removing redirects and crawlers
goaccess \
    --log-format=COMBINED \
    --ignore-crawlers \
    --ignore-status=301 \
    --ignore-status=302 \
    -o report.csv \
    access-feb-jun-clean.log
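
To get a feel for what the grep filters keep and drop, here is a minimal sketch that runs them over a handful of made-up log lines. The `clean_log` and `sample_log` helpers and every sample line are hypothetical and only for illustration. The sketch also assumes the `\x` filter is meant to match a literal backslash-x sequence, so it escapes the backslash.

```shell
# Hypothetical helper: apply the grep filters to log lines read from stdin.
clean_log() {
    grep "GET" |
        grep '" 200 ' |
        grep -vE '\?|\\x|http' |
        grep -vE '\.(png|php|css|ico|xml|js|jpg|zip|cgi|gif|woff|woff2|eot|svg|ttf)'
}

# Made-up sample lines in the combined log format (not real traffic).
sample_log() {
    cat <<'EOF'
203.0.113.5 - - [06/Feb/2018:10:00:00 +0000] "GET /posts/hello/ HTTP/1.1" 200 5120 "-" "Mozilla/5.0"
203.0.113.5 - - [06/Feb/2018:10:00:01 +0000] "GET /css/site.css HTTP/1.1" 200 900 "-" "Mozilla/5.0"
198.51.100.7 - - [06/Feb/2018:10:01:00 +0000] "GET /wp-login.php HTTP/1.1" 404 150 "-" "curl/7.58.0"
198.51.100.9 - - [06/Feb/2018:10:02:00 +0000] "GET /?page=2 HTTP/1.1" 200 4000 "-" "Mozilla/5.0"
192.0.2.44 - - [06/Feb/2018:10:03:00 +0000] "POST /posts/hello/ HTTP/1.1" 200 10 "-" "Mozilla/5.0"
EOF
}

# Only the first sample line (a plain page view) should survive.
sample_log | clean_log
```

Because the `http` filter is case-sensitive and applies to the whole line, it also drops any request carrying a referrer URL, so the cleaned log likely errs on the side of undercounting.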

The result is a clean(ish) log of legitimate traffic, which is roughly what GA should be tracking. As we can see, it is not even close. Even if we assume 10% of the remaining log entries are still invalid, GA is missing roughly 70% of the traffic to this blog.
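
The arithmetic behind that kind of estimate can be sketched as a small shell function. The `missed_pct` name and the visit counts below are hypothetical, chosen only to show the calculation. They are not the figures from this blog.

```shell
# Hypothetical helper: missed_pct GA_VISITS LOG_VISITS [BAD_FRACTION]
# Prints the percentage of real visits that GA did not record, after
# discounting BAD_FRACTION (default 0.10) of the log entries as invalid.
missed_pct() {
    awk -v ga="$1" -v logs="$2" -v bad="${3:-0.10}" 'BEGIN {
        real = logs * (1 - bad)              # visits assumed to be legitimate
        printf "%.0f\n", (1 - ga / real) * 100
    }'
}

# Made-up numbers: GA reports 2700 visits, the cleaned logs show 10000.
missed_pct 2700 10000
```

With these made-up inputs, 10% of the 10000 logged visits are discarded, leaving 9000, of which GA saw 2700, so the function prints 70.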

I have removed GA as it’s both useless and a privacy concern for anyone not using a blocker. The results provided by GoAccess are more than enough, even if they contain a few false positives.


Bonus item: using server logs means (rough) RSS readership can be tracked too!

# Requests for the RSS feed only
grep "/index.xml" access-feb-jun.log \
    > access-feb-jun-rss.log

goaccess \
    --log-format=COMBINED \
    -o report.csv access-feb-jun-rss.log
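
For a quick number without a full report, one rough approach is to count the distinct client addresses that fetched the feed. This is a sketch, not part of the original workflow: the `rss_readers` and `sample_feed_log` helpers and the sample lines are hypothetical, and it assumes the client address is the first field, as in the combined log format.

```shell
# Hypothetical helper: count distinct client addresses requesting the feed.
rss_readers() {
    grep '/index.xml' |      # keep only feed requests
        awk '{ print $1 }' | # first field is the client address
        sort -u |            # one line per distinct address
        wc -l | tr -d ' '    # count them, trimming any padding from wc
}

# Made-up sample lines (not real traffic): two clients poll the feed.
sample_feed_log() {
    cat <<'EOF'
203.0.113.5 - - [06/Feb/2018:10:00:00 +0000] "GET /index.xml HTTP/1.1" 200 8000 "-" "FeedReader/1.0"
203.0.113.5 - - [06/Feb/2018:16:00:00 +0000] "GET /index.xml HTTP/1.1" 200 8000 "-" "FeedReader/1.0"
198.51.100.7 - - [06/Feb/2018:11:00:00 +0000] "GET /index.xml HTTP/1.1" 304 0 "-" "OtherReader/2.3"
192.0.2.44 - - [06/Feb/2018:12:00:00 +0000] "GET /posts/hello/ HTTP/1.1" 200 5120 "-" "Mozilla/5.0"
EOF
}

sample_feed_log | rss_readers
```

The count is only a rough proxy: several readers can share one address, and a hosted feed reader may fetch the feed once on behalf of many subscribers.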
