AdSense Bot Working Overtime
April 14, 2006
During last Tuesday’s Rockstar show, I mentioned that I had been working on a project that got a bit messed up because Google’s Mediapartner bot (aka Mediabot) was being used to index content for Google’s database. We had set up some 301s for Googlebot, but had neglected to redirect the Mediabot. The end result was a whole bunch of duplicate content, because we were serving Mediabot the old URL and Googlebot the new one. Both were getting indexed and added to the cache.
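To make the mistake concrete, here is a minimal sketch of a redirect check that treats both crawlers the same way. It is not the code from the project: the `REDIRECTS` map, the paths in it, and the function names are all hypothetical. The only real-world details are the user-agent tokens `Googlebot` and `Mediapartners-Google`.

```python
from typing import Optional, Tuple

# Hypothetical old-to-new URL map; the real paths are not given in this post.
REDIRECTS = {"/old-post": "/new-post"}

# Tokens that appear in the user-agent strings of Google's search crawler
# and its AdSense crawler. Leaving the second one out is the mistake
# described above.
CRAWLER_TOKENS = ("googlebot", "mediapartners-google")


def is_google_crawler(user_agent: str) -> bool:
    """Return True if the user agent looks like Googlebot or Mediabot."""
    ua = user_agent.lower()
    return any(token in ua for token in CRAWLER_TOKENS)


def resolve(path: str, user_agent: str) -> Tuple[int, Optional[str]]:
    """Return (status, location): a 301 for crawlers hitting an old URL, else 200."""
    target = REDIRECTS.get(path)
    if target is not None and is_google_crawler(user_agent):
        return 301, target
    return 200, None
```

With both tokens in the list, Mediabot and Googlebot end up at the same URL, so there is only one version of the page for either of them to fetch.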
Matt was in the chat room when I made the original comments, and he said that he’d like to see some examples. So I thought I’d post one from this site.
The content of that post got indexed in a template that we only serve to AdSense. It has no navigation and no comments; just the actual post. We built this template to experiment with getting better ads to display. The idea was that it might be possible to get ads other than blog-related products to show if we removed all the content that wasn’t part of the actual post.
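A rough sketch of how that kind of template switch might look is below. This is an illustration only, not the code running on this site: the template names and the `pick_template` function are made up, and the check is a plain substring match on the user-agent string.

```python
def pick_template(user_agent: str) -> str:
    """Choose a template name based on which crawler (if any) is asking."""
    ua = user_agent.lower()
    if "mediapartners-google" in ua:
        # AdSense crawler: the post content only, no navigation or comments.
        return "post_only"
    if "googlebot" in ua:
        # Search crawler: a separate template from the AdSense one.
        return "googlebot"
    # Everyone else gets the normal page.
    return "full"
```

The AdSense check comes first only for readability; the two tokens do not overlap, so the order does not change the result.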
The interesting thing to note about this page is that the post was originally made in January. And for quite some time it had a cached page that was a representation of what Googlebot was given. But then Mediabot visited on April 7th. And the page it was served on that date ended up replacing the Googlebot version in the cache.
Comments
7 Responses to “AdSense Bot Working Overtime”
Does the ad bot also fetch and respect robots.txt?
Yes it does, Jeremy. The problem is that you have to let the Adsense bot fetch a page if you want paid ads to display.
Greg, you’ve posted an example where the Adsense bot is putting pages into the cache. Do you have an example of that version actually being indexed? You know, like where you’ve maybe added the words “smoking gun” to the page you serve to the Adsense bot?
If you mean do I have a screen cap of that page when it was originally indexed by Googlebot, no I don’t. But I can tell you that it was originally indexed properly. We serve the pages from this site to Google in a stripped down “lite” template that is quite different than the template for the media bot.
The Googlebot template retains all the page titles and site navigation. The Mediabot template doesn’t. It just presents the content of the actual post, so it’s easy to determine which bot indexed the page in the cache.
Here’s an example of what a post looks like when it’s served to Googlebot.
Greg,
My question isn’t about what is being cached, it’s about what is being indexed. Caching and indexing are different things. The cached page isn’t necessarily what is being used to return search results.
Caching = storing an HTML page so that users can see how it looked
Indexing = dissecting the page and storing the word occurrences in the search engine’s index
You have shown how the Adsense bot is updating the cached version of a web page. What I am wondering is whether that’s what Google is indexing.
You could determine whether or not this is happening by adding a unique word to the page that you deliver to the Adsense bot, then searching for that word to see if it appears on the page that Google has indexed.
So if you added the words “smoking gun” to this post when you deliver it to the Adsense bot (but not the regular Googlebot), you could search Google for:
inurl:overtime site:google.webguerilla.com smoking gun
My guess is that the Adsense bot isn’t grabbing pages to be indexed, just updating the cached version.
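The test described above could be sketched roughly as follows. This is a hypothetical illustration, not code from the site: `add_marker` and `marker_query` are made-up names, and the user-agent check is a simple substring match on `Mediapartners-Google`.

```python
MEDIABOT_TOKEN = "mediapartners-google"


def add_marker(body: str, user_agent: str, marker: str = "smoking gun") -> str:
    """Append the marker phrase only to pages served to the AdSense crawler."""
    if MEDIABOT_TOKEN in user_agent.lower():
        return f"{body}\n{marker}"
    return body


def marker_query(url_word: str, site: str, marker: str = "smoking gun") -> str:
    """Build the search query that would reveal whether the marker was indexed."""
    return f"inurl:{url_word} site:{site} {marker}"
```

If the query returns the post, the version fetched by the AdSense crawler was indexed; if it returns nothing, only the cached copy was updated.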
The search results are reflecting the content of the cached page. It would certainly be a bit easier to tell if the mediabot template had words on it that were not included in the googlebot template (which isn’t the case), but you can still see that what is indexed and what is cached are the same, because the googlebot template does have unique words on it.
Example:
Search for godaddy sucks.
The #6 listing is a post that got hit by the mediabot. Notice that the title for the listing is being generated by the first words on the page. That’s because the mediabot template doesn’t have any page titles or heading tags.
Now do a search for “why godaddy sucks” (Using quotes)
The exact phrase is now bolded in the URL, but the excerpt doesn’t show any other occurrences of the exact phrase, despite the fact that the phrase appears in both the title and the heading tag of the page that was served to googlebot.
Now search for “why godaddy sucks” + “theme developed by webguerrilla”
The GoDaddy post isn’t returned because “theme developed by webguerrilla” doesn’t exist on the mediabot template. It only exists on the googlebot template. (You can see it highlighted at the bottom of the cached pages that were returned as a match).
Based on that, I think it’s a bit of a stretch to argue that the cache isn’t a representation of what was indexed.
Yeah, it looks like I guessed wrong. Thanks for posting that clarification, Greg. What a really, really stupid idea Google has had here. I misunderestimated them.
I’m sure you’re quaking in your black boots about “Matt also stated that you will gain zero advantage in search listings however if you are serving different content to MediaBot then to Googlebot then you could be in trouble.” Yep, quaking and shivering.
Nice find WG,
So will the trend be adding Adsense blended into hiding for quicker indexing of content change? (not that I see anything explicitly wrong with mediabot helping out indexing).
A while ago:
http://www.webmasterworld.com/forum89/14-2-16.htm
Googleguy: