Enough with the broken "Java/x.y.z_nn" crawlers

I watch my web logs a lot. It's a good way to get inspired by seemingly random events. Seeing some bit of insanity arrive from the outside world can lead to a concept for a post. This is one such post.

For quite some time now, I've been seeing these boneheaded crawlers which send a truly generic user-agent string like "Java/1.6.0_33" or similar. The version changes, but the "Java/" part and the stupidity remain the same. The stupidity shows most clearly in how they follow things which aren't even links and get seriously confused by things like JavaScript.

Allow me to demonstrate.

xx.xx.xx.xx - - [12/Aug/2012:12:54:31 -0700] "GET /contact/ HTTP/1.1" 200 1575 "-" "Java/1.7.0_05" "rachelbythebay.com" "-"

Here, this robot asks for my contact page. Okay, big deal, that happens all day long. However, what happens next is a little wacky.

xx.xx.xx.xx - - [12/Aug/2012:12:54:34 -0700] "GET /contact/jquery-1.7.1/jquery.min.js HTTP/1.1" 200 93868 "-" "Java/1.7.0_05" "rachelbythebay.com" "-"

Okay, so now it's gone and requested something which only occurs in a SCRIPT tag in the HEAD part of that page. This is usually the sort of thing a real web browser would do. However, unlike a browser, this thing never pulls my CSS, despite having encountered those declarations earlier in the file.

What comes next, however, reveals a kind of cluelessness I seldom see outside of these idiots:

xx.xx.xx.xx - - [12/Aug/2012:12:54:35 -0700] "GET /contact/jquery-1.7.1/,data:c,complete:function(a,b,c){c=a.responseText,a.isResolved()&&(a.done(function(a){c=a}),i.html(g?f( HTTP/1.1" 404 411 "-" "Java/1.7.0_05" "rachelbythebay.com" "-"

xx.xx.xx.xx - - [12/Aug/2012:12:54:35 -0700] "GET /contact/jquery-1.7.1/]},bh=U(c);bg.optgroup=bg.option,bg.tbody=bg.tfoot= bg.colgroup=bg.caption=bg.thead,bg.th=bg.td,f.support.htmlSerialize||(bg. _default=[1, HTTP/1.1" 404 439 "-" "Java/1.7.0_05" "rachelbythebay.com" "-"

In two nearly simultaneous hits, this robot manages to prove just how much mind-boggling stupidity it can wield. It clearly snarfed a JavaScript file by parsing a SCRIPT tag, but then it somehow turned raw minified JavaScript gunk into URLs and tried to GET them?
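Spotting this sort of thing doesn't have to mean eyeballing the log by hand. Here's a rough sketch in Python that tallies response codes for anything announcing itself as "Java/". It assumes the log layout shown in the lines above (the usual combined format with two extra quoted fields on the end), and the names LOG_LINE and java_agent_statuses are made up for this illustration.

```python
import re
from collections import Counter

# Assumed layout: host, two ignored fields, [timestamp], "request",
# status, size, "referer", "user-agent", then any trailing fields.
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<when>[^\]]+)\] '
    r'"(?P<request>.*)" (?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def java_agent_statuses(lines):
    """Count HTTP status codes for hits whose user-agent starts with 'Java/'."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group("agent").startswith("Java/"):
            counts[match.group("status")] += 1
    return counts
```

A pile of 404s next to a handful of 200s from the same agent is a decent hint that something is inventing URLs instead of following links.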

What planet are these programmers on? Who is that broken in the head?

It's almost to the point where I'm thinking about blocking any UA which matches "^Java/" just to catch anyone who thinks they can build a crawler just by gluing some Java examples together. If they can't even get as far as setting a halfway interesting User-Agent, what hope is there for them parsing and crawling things properly?
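A minimal sketch of what that "^Java/" check might look like in Python. The function name should_block is hypothetical, and how the answer gets wired into an actual web server is left out entirely.

```python
import re

# Anchored at the start, so only agents that begin with "Java/" match.
GENERIC_JAVA_UA = re.compile(r"^Java/")


def should_block(user_agent):
    """Return True when the User-Agent header starts with 'Java/'."""
    # A missing header arrives as None here; treat it as an empty string.
    return bool(GENERIC_JAVA_UA.match(user_agent or ""))
```

The anchor matters: a client which merely mentions Java somewhere later in its user-agent string is left alone, and only the bare default gets caught.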

Randomly, at least one of these machines has port 139 open to the world and proudly proclaims that it is "Windows (R) Web Server 2008 6001 Service Pack 1", whatever that is. Maybe they're all just owned.
