Read my New Site

One Man's View of the Stock Market
Read my new tech stock site, TechStockJungle.com.
Posted by Michael Comeau at 7:37 PM
0 comments
Need a new camera? Check out my Panasonic LX5 review.
Posted by Michael Comeau at 1:36 PM
0 comments
Hello,
Just wanted to throw this out there - I started a new photo blog about the cheap Canon 50mm lens.
Posted by Michael Comeau at 7:41 PM
0 comments
Read my review of BJ Penn's new book Why I Fight on MMABookworm.com.
Posted by Michael Comeau at 9:12 PM
1 comment
Yes folks, I started another blog - ApertureLand, which is all about Apple's Aperture photo editing software.
Posted by Michael Comeau at 11:24 AM
0 comments
Okay, this is a long shot, but if anyone out there is:
1) Reading this
and
2) Living in Scarsdale
then please visit Scarsdale Patch.
Posted by Michael Comeau at 4:02 PM
0 comments
If anyone's still reading this, please click over to GadgetStocks.com. It's my new blog where I discuss all the latest, hottest news affecting technology investors.
Posted by Michael Comeau at 11:56 AM
0 comments
As you may know, BrokerSense.com has been my first foray into the WordPress world. So far I haven't been disappointed.
Using WordPress.org has been a lot easier than expected, especially since my host, BlueHost, makes everything pretty easy with one-click installation and all that. I'm also using the Thesis theme, which automates a lot of things for me. Obviously, the design is still pretty bare-bones, but for now I'm focused on content generation.
Traffic has picked up pretty quickly, and I'm ranking very high for some good keyword combinations. Overall, I'm very happy, and I imagine that I'll eventually move all of my blogs over to WordPress.
Here are some links to some of my recent work:
Review of Financial Warnings
Thinkorswim vs. Interactive Brokers
OptionsXpress and Motley Fool Launch Options Service
Posted by Michael Comeau at 7:08 PM
26 comments
Labels: blogger, brokersense, wordpress
I just added a new page to BrokerSense:
The Best Mouse for Trading Stocks
Enjoy!
Posted by Michael Comeau at 12:58 PM
0 comments
