Archive for January 13th, 2006
Human-like web crawling & SEO
Filed under: Improving Work, Optimization
At the ShmooCon hacker conference, Billy Hoffman, an engineer at Atlanta company SPI Dynamics, says he has a web crawler that acts like you or me. It hits links, JavaScript, Flash, and other page elements like a human does: slowly, with indeterminate pauses. Not only that, Hoffman’s crawler acts like a browser by keeping a cache, downloading only what’s changed as it moves from page to page, which kills another telltale crawler characteristic.
Web crawlers typically ignore JavaScript, images, and some Flash animations to save power. It’s faster to grab a web page’s text and index that than to try to grab everything. This makes spiders easy to detect… and to optimize for. Analytics software can treat that traffic differently than human traffic, and our reports can divide it out.
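Hoffman’s code isn’t shown here, so this is only a guess at the general shape: a minimal Python sketch of a crawler that pauses a random amount between requests and revalidates against a cache instead of re-downloading everything. The class name, the `fetch` callback signature, and the pause range are all assumptions made up for illustration.

```python
import random
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# Hypothetical fetch signature: (url, cached_etag_or_None) -> (status, body, etag).
# A real version would send the ETag in an If-None-Match request header.
Fetch = Callable[[str, Optional[str]], Tuple[int, str, str]]


@dataclass
class HumanLikeCrawler:
    fetch: Fetch
    min_pause: float = 2.0    # seconds; assumed values, not Hoffman's
    max_pause: float = 30.0
    sleep: Callable[[float], None] = time.sleep
    # url -> (etag, body), standing in for a browser cache
    cache: Dict[str, Tuple[str, str]] = field(default_factory=dict)

    def get(self, url: str) -> str:
        """Fetch a URL like a browser: revalidate, reuse the cache on 304."""
        cached = self.cache.get(url)
        etag = cached[0] if cached else None
        status, body, new_etag = self.fetch(url, etag)
        if status == 304 and cached:
            return cached[1]          # unchanged, so nothing was re-downloaded
        self.cache[url] = (new_etag, body)
        return body

    def visit(self, urls: Iterable[str]) -> List[str]:
        """Visit URLs in order, pausing a random interval after each one."""
        pages = []
        for url in urls:
            pages.append(self.get(url))
            self.sleep(random.uniform(self.min_pause, self.max_pause))
        return pages
```

The point of passing `fetch` and `sleep` in as callbacks is just to keep the sketch free of any particular HTTP library; the two behaviors that matter are the random pause and the 304 shortcut.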
You see, part of how we currently analyze and optimize web pages depends on spiders not looking at JavaScript and images. We’ve become used to looking at a site from an efficient, fast crawler’s perspective, and leveraging these elements in SEO work. But a human-simulating crawler would throw a wrench into this part of SEO.
It would also make it hard to block bots that are sucking up bandwidth. If you can’t tell that a visitor is a bot, you’ll have to assume it’s human.
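To make that concrete, here is a rough Python sketch of the kind of check that stops working. It assumes a simplified log of `(client, path)` hits and flags any client that never asked for a script, stylesheet, image, or Flash file as a likely spider. The function name, the extension list, and the log format are my own assumptions, not how any particular analytics package does it.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Assumed list of "asset" extensions that text-only spiders tend to skip.
ASSET_EXTENSIONS = (".js", ".css", ".png", ".jpg", ".gif", ".swf")


def split_traffic(hits: Iterable[Tuple[str, str]]) -> Tuple[List[str], List[str]]:
    """Split client ids into (likely_spiders, likely_humans).

    A client counts as a likely human if it requested at least one asset.
    """
    paths_by_client: Dict[str, List[str]] = defaultdict(list)
    for client, path in hits:
        paths_by_client[client].append(path)

    spiders: List[str] = []
    humans: List[str] = []
    for client, paths in paths_by_client.items():
        fetched_assets = any(p.lower().endswith(ASSET_EXTENSIONS) for p in paths)
        (humans if fetched_assets else spiders).append(client)
    return spiders, humans
```

A crawler that pulls down scripts and images the way a browser does lands in the second list every time, which is exactly the problem.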
I think I’ll write a crawler, too. Mine will be sixpackbot. It will stumble through websites, sometimes three or four times, in circles. It’ll stop for a nap occasionally and then start up again, stopping to hit the john a few times. I’ll ignore robots.txt files, though, ’cause when sixpackbot is drunk, he goes wherever he damn well pleases.
Happy Friday….
