Web wide crawl with initial seedlist and crawler configuration from March 2011. This uses the new HQ software for distributed crawling by Kenji Nagahashi.
What's in the data set:
Crawl start date: 09 March, 2011
Crawl end date: 23 December, 2011
Number of captures: 2,713,676,341
Number of unique URLs: 2,273,840,159
Number of hosts: 29,032,069
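As a rough aid for reading these figures, the short sketch below derives a few ratios from them: crawl length in days, average captures per unique URL, and average unique URLs per host. The variable and function names are illustrative assumptions, and the ratios are simple averages that say nothing about how captures were actually distributed across hosts.

```python
from datetime import date

# Figures copied from the data set summary above.
captures = 2_713_676_341
unique_urls = 2_273_840_159
hosts = 29_032_069
crawl_start = date(2011, 3, 9)
crawl_end = date(2011, 12, 23)


def summarize():
    """Return simple averages derived from the published totals."""
    days = (crawl_end - crawl_start).days
    return {
        "crawl_days": days,
        "captures_per_unique_url": captures / unique_urls,
        "unique_urls_per_host": unique_urls / hosts,
        "captures_per_day": captures / days,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:,.2f}")
```

A captures-per-URL average only slightly above 1 would suggest that most URLs were captured once, with a smaller share captured repeatedly.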
The seed list for this crawl was a list of Alexa's top 1 million web sites, retrieved close to the crawl start date. We used Heritrix (3.1.1-SNAPSHOT) crawler software and respected robots.txt directives. The scope of the crawl was not limited except for a few manually excluded sites.
However, this was a somewhat experimental crawl for us, as we were using newly minted software to feed URLs to the crawlers, and we know there were some operational issues with it. For example, in many cases we may not have crawled all of the embedded and linked objects in a page, since the URLs for these resources were added into queues that quickly grew bigger than the intended size of the crawl (and therefore we never got to them). We also included repeated crawls of some Argentinian government sites, so results broken down by country will be somewhat skewed.
We have made many changes to how we do these wide crawls since this particular example, but we wanted to make the data available "warts and all" for people to experiment with. We have also done some further analysis of the content.
If you would like access to this set of crawl data, please contact us at info at archive dot org and let us know who you are and what you're hoping to do with it. We may not be able to say "yes" to all requests, since we're just figuring out whether this is a good idea, but everyone will be considered.
January 18th, 2005 at 4:00 pm
Where will I be tomorrow evening?
Finally, a new post on H-Town detailing one of those “gathering” things. When: Wednesday, January 19th Time: 7:00 PM-ish Where: Onion Creek Coffeehouse, 3106 White Oak Dr, The Heights. Seeing as how a certain local Artist will show up that…
January 18th, 2005 at 7:55 pm
I’d like to go!
How does one identify other bloggers?
January 19th, 2005 at 11:48 am
I’ll be there, but a little later on. Have a basketball game at 6:30PM.
January 19th, 2005 at 2:04 pm
H-Town Blogs Happy Hour
H-Town Blogs is having a happy hour tonight around 7:00 for anyone that would like to come. It’s taking place at Onion Creek Coffeehouse, 3106 White Oak Drive, in The Heights. Hope to see you there!…
January 19th, 2005 at 4:16 pm
Great to see a gathering of bloggers, bad to see it supporting such a community eyesore as the Onion Creek Cafe… The ba$tard of an owner pissed off a lot of the surrounding homeowners when he originally said it was to be a breakfast place and coffee house, then went and got a liquor license to change it over to a bar. Originally it became an issue of parking; between that place and Fitzgerald’s down the way, the neighborhood started having plenty of tow trucks handy. Granted, he has bought the property at White Oak and Oxford to turn into parking spaces, but still, this guy has yet to redeem himself in my eyes.
January 20th, 2005 at 9:55 am
Aw, man, I missed it! I’d love to get “registered” or whatever and get on the mailing list so I can make it to these things. I’ve tried a couple of times, but I never seem to get my blog listed. Not sure what I’m doing wrong, but I’d love to meet y’all. Maybe next time.
January 20th, 2005 at 11:15 am
It was good to see everyone last night. Lots of fun, not to mention some really good steak. Who would have guessed: good steak at the Onion Creek.
January 20th, 2005 at 11:56 am
How did this go?
January 20th, 2005 at 12:55 pm
Note to self: Never buy steaks from gypsy grillers.
The steak was not good. Mostly fat.
Next one at Kennealy’s, eh.
January 20th, 2005 at 5:57 pm
Somewhere there’s wireless, pls.
January 20th, 2005 at 5:59 pm
Okay, okay, maybe “good” was a little too much. It was cheap steak, how about that?
January 23rd, 2005 at 1:03 am
I would really like to attend one of these when I’m back in Houston.
Hope everyone has a good time.
LHM
An American Expat in Southeast Asia
January 25th, 2005 at 3:37 am
Darn it! I can’t believe that I missed this! I stopped coming by to check the site because it had not been updated in a while. What makes matters worse is that I had a day off last Wednesday. Perhaps I shall catch the next one.
Danielle
January 28th, 2005 at 1:01 am
We ought to do this monthly (Meetup.com?) and yes, a hot spot is a must.
February 28th, 2005 at 7:04 pm
I have an idea… what about the Davenport, on Richmond and Shepherd? That place is cool enough.
August 16th, 2005 at 8:02 am
fight-fungal
Yet another happy hour!