
Thank you for visiting our web site. This privacy policy tells you how we use personal information collected at this site. By using the site, you are accepting the practices described in this privacy policy. These practices may be changed, but any changes will be posted and will apply only to activities and information on a going-forward, not retroactive, basis. You are encouraged to review the privacy policy often to make sure that you understand how any personal information you provide will be used.
Note: the privacy practices set forth in this privacy policy apply to reason.com, reason.org, and reason.tv only.
Collection of Information
We collect personally identifiable information, like names, postal addresses, email addresses, etc., when voluntarily submitted by our visitors. The information you provide is used to fulfill your specific request. Information submitted when donating to Reason Foundation or subscribing to Reason magazine may be traded with other like-minded organizations that we feel may be of interest to you.
We use third-party advertising companies to serve advertisements when you visit our website. These companies may use aggregated (not personally identifying) information about your visits to this website in order to provide advertisements about goods and services that may be of interest to you.
Credit Card Information
Credit card numbers are not stored by us. They are submitted directly to a secure third-party vendor for processing.
Cookie/Tracking Technology
The Site may use cookie and tracking technology depending on the features offered. Cookie and tracking technology are useful for gathering information such as browser type and operating system, tracking the number of visitors to the Site, and understanding how visitors use the Site. Cookies can also help customize the Site for visitors. Personal information cannot be collected via cookies and other tracking technology; however, if you previously provided personally identifiable information, cookies may be tied to such information. Aggregate cookie and tracking information may be shared with third parties.
Commitment to Data Security
Your personally identifiable information is kept secure. Only
authorized employees, agents and contractors (who have agreed to
keep information secure and confidential) have access to this
information. All emails and newsletters from this site allow you to
opt out of further mailings.
Privacy Contact Information
If you have any questions, concerns, or comments about our privacy policy, you may contact us using the information below:
By e-mail: feedback@reason.org
By phone: (310) 391-2245
We reserve the right to make changes to this policy. Any changes to this policy will be posted.

Site comments/questions:
Media Inquiries and Reprint Permissions:
(310) 367-6109
Editorial & Production Offices:
3415 S. Sepulveda Blvd.
Suite 400
Los Angeles, CA 90034
(310) 391-2245