
Human-like web crawling & SEO

Filed under: Improving Work, Optimization

Jan 13, 2006

At the ShmooCon hacker conference, Billy Hoffman, an engineer at Atlanta company SPI Dynamics, says he has a web crawler that acts like you or me. It hits links, JavaScript, Flash, and other page elements the way a human does: slowly, with indeterminate pauses. Not only that, Hoffman’s crawler acts like a browser by keeping a cache, downloading only what’s changed as it moves from page to page, which kills another telltale crawler characteristic. Web crawlers typically ignore JavaScript, images, and some Flash animations to save processing power. It’s faster to grab a web page’s text and index that than to try to grab everything. This makes spiders easy to detect… and to optimize for. Analytics software can treat that traffic differently than human traffic, and our reports can separate it out.
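To make the "easy to detect" part concrete, here is a rough sketch of the kind of rule analytics software could apply: a visitor that requests several pages but never a script, stylesheet, image, or Flash file looks like a text-only spider. The names `ASSET_EXTENSIONS`, `is_asset`, `classify_visitors`, and the `min_pages` threshold are all made up for this illustration, and real analytics packages use more signals than this. A returning human with a warm browser cache might also request few assets, so treat the labels as hints. A human-like crawler such as the one Hoffman describes would sail right past it.

```python
from collections import defaultdict
from urllib.parse import urlparse
import posixpath

# Extensions treated as page assets (an assumption for this sketch).
ASSET_EXTENSIONS = {".js", ".css", ".png", ".jpg", ".jpeg", ".gif", ".swf"}


def is_asset(url):
    """Return True if the URL path ends in a known asset extension."""
    path = urlparse(url).path
    return posixpath.splitext(path)[1].lower() in ASSET_EXTENSIONS


def classify_visitors(requests, min_pages=3):
    """Label each visitor 'likely bot', 'likely human', or 'unknown'.

    requests: iterable of (visitor_id, url) pairs, e.g. parsed from a log.
    min_pages: hypothetical threshold; below it there is too little data.
    """
    pages = defaultdict(int)
    assets = defaultdict(int)
    for visitor, url in requests:
        if is_asset(url):
            assets[visitor] += 1
        else:
            pages[visitor] += 1

    labels = {}
    for visitor in set(pages) | set(assets):
        if pages[visitor] < min_pages:
            labels[visitor] = "unknown"
        elif assets[visitor] == 0:
            # Several pages, zero scripts or images: text-only spider behavior.
            labels[visitor] = "likely bot"
        else:
            labels[visitor] = "likely human"
    return labels
```

Feed it `(visitor_id, url)` pairs pulled from your server log, then split your reports on the resulting labels.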

You see, part of how we currently analyze and optimize web pages depends on spiders not looking at JavaScript and images. We’ve become used to looking at a site from an efficient, fast crawler’s perspective and leveraging these elements in SEO work. A human-simulating crawler would throw a wrench into this part of SEO.

It would also make it hard to block bots that are sucking up bandwidth. If you can’t tell a bot from a person, you’ll have to assume it’s human.

I think I’ll write a crawler, too. Mine will be sixpackbot. It will stumble through websites, sometimes three or four times, in circles. It’ll stop for a nap occasionally and then start up again, pausing to hit the john a few times. I’ll ignore robots.txt files, though, ’cause when sixpackbot is drunk, he goes wherever he damn well pleases.

Happy Friday….

Posted by Scott Clark @ 10:30 pm  

