Update Feb 2014:
We seeded this data for a month but due to bandwidth costs had to disable free public access. The torrent does occasionally get seeded by other downloaders. If you are interested in this data please email email@example.com for access options.
October 2012 was the official kickoff date for development of meanpath, our source code search engine. Our goal was to crawl as much of the web as we could using mostly open source software and a decent (although not Google level) financial investment. Besides many substantial technical challenges, we also needed to acquire a sizeable list of seed domains as the starting point for our crawler. Enter Common Crawl, an open crawl of the web that can be accessed and analysed by everyone. Of specific interest to us was the Common Crawl URL Index, which we combined with raw domain zone files and domains from the Internet Census 2012 to create our master domain list.
You can access the meanpath January 2014 Front Page Index in two ways:
- archive.meanpath.com – You can download the raw crawl files from here.
- Web front end – If you are not interested in grappling with the raw crawl files, you can use our web front end to do some sample searches.
Data Set Statistics:
- 149,369,860 seed domains. We started our crawl with a full zone file list of all domains in the .com (112,117,307), .net (15,226,877), .org (10,396,351), .info (5,884,505), .us (1,804,653), .biz (2,630,676), .mobi (1,197,682) and .xxx (111,809) top level domains (TLDs), for a total of 149,369,860 domains. We have a much larger set of domains that covers all TLDs, but very few TLDs allow you to download a zone file from the registrar, so we cannot guarantee 100% coverage for the rest. For statistical purposes, having a defined 100% starting point is necessary.
- 115,642,924 successfully crawled domains. Of the 149,369,860 domains, only 115,642,924 could be crawled, which is a coverage rate of 77.42%.
- 476 minutes of crawling. It took us a total of 476 minutes to complete the crawl, which was done in 5 passes. If a domain could not be crawled in the first pass, we tried 4 more passes before giving up (those excluded by robots.txt are not retried). The most common reason a domain could not be crawled is the lack of any valid A record for domain.com or www.domain.com.
- 1,500GB of uncompressed data. This has been compressed down to 352.40GB using gzip for ease of download.
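As a quick sanity check, the statistics above can be reproduced from the raw counts. The following is a minimal sketch in Python. The variable and function names are ours for illustration, and the per-second figure is only an average over the whole 476 minutes, not a measured crawl rate.

```python
# Zone-file domain counts per TLD, copied from the list above.
SEED_DOMAINS = {
    ".com": 112_117_307,
    ".net": 15_226_877,
    ".org": 10_396_351,
    ".info": 5_884_505,
    ".us": 1_804_653,
    ".biz": 2_630_676,
    ".mobi": 1_197_682,
    ".xxx": 111_809,
}
CRAWLED_DOMAINS = 115_642_924
CRAWL_MINUTES = 476
UNCOMPRESSED_GB = 1_500
COMPRESSED_GB = 352.40


def coverage_percent(crawled: int, seeded: int) -> float:
    """Share of seed domains that were successfully crawled, in percent."""
    return 100.0 * crawled / seeded


total_seeds = sum(SEED_DOMAINS.values())
coverage = coverage_percent(CRAWLED_DOMAINS, total_seeds)
compression_ratio = UNCOMPRESSED_GB / COMPRESSED_GB
# Average over the full run, including retry passes; not a peak rate.
avg_crawled_per_second = CRAWLED_DOMAINS / (CRAWL_MINUTES * 60)

print(f"seed domains:        {total_seeds:,}")
print(f"coverage:            {coverage:.2f}%")
print(f"compression ratio:   {compression_ratio:.2f}x")
print(f"avg crawled per sec: {avg_crawled_per_second:,.0f}")
```

The per-TLD counts sum to the stated 149,369,860 seed domains, and the coverage works out to the stated 77.42%.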
We have chosen to release this crawl as 2,330 SQLite files of varying size, each compressed with gzip. These can be easily viewed and dumped to CSV using SQLite Browser. Each file is named botXX.meanpath.com-crawl20140104-fileYY.db.gz, with XX being the bot number from 1-50 and YY being a sequential file number. Please note that bot46 is missing from this export as it was not part of the crawl cluster. Also, the file numbers are not in any particular order.
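If you would rather script against the files than use a GUI, the sketch below shows one way to parse the file names, decompress a file, and list the tables it contains, using only the Python standard library. It is an illustration under assumptions: the helper names are hypothetical, the regular expression accepts bot and file numbers with or without zero padding because the padding is not specified here, and the table layout inside each database is not documented in this post, so the code only inspects the schema rather than assuming any table or column names.

```python
import gzip
import re
import shutil
import sqlite3
from pathlib import Path

# botXX.meanpath.com-crawl20140104-fileYY.db.gz
# Assumption: XX and YY are plain decimal numbers, padded or not.
FILE_NAME = re.compile(
    r"^bot(\d+)\.meanpath\.com-crawl(\d{8})-file(\d+)\.db\.gz$"
)


def parse_crawl_file_name(name):
    """Return (bot_number, crawl_date, file_number), or None if no match."""
    match = FILE_NAME.match(name)
    if match is None:
        return None
    return int(match.group(1)), match.group(2), int(match.group(3))


def decompress(gz_path, out_dir):
    """Gunzip one crawl file into out_dir and return the .db path."""
    gz_path = Path(gz_path)
    db_path = Path(out_dir) / gz_path.with_suffix("").name  # drop ".gz"
    with gzip.open(gz_path, "rb") as src, open(db_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # streams, so large files are fine
    return db_path


def list_tables(db_path):
    """Return the table names found in a SQLite file, sorted by name."""
    conn = sqlite3.connect(str(db_path))
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [row[0] for row in rows]
```

A typical use would be to call `decompress()` on a downloaded file, then `list_tables()` on the result to discover the schema before writing any queries against it.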
- Crawl cluster – 49 x crawlers, each with 1 x Intel i5-3570S 3.1 GHz CPU, 16GB RAM, 2 x 2TB SATA2 in RAID1, and 100Mbps bandwidth. These run our custom built Haskell crawler and a heavily customised Unbound DNS resolver. Our robots.txt parser has been released as open source.
- Search cluster – 11 x index servers, each with 2 x Intel E5630 2.53 GHz CPUs, 256GB RAM, 2 x 600GB SAS in RAID0, and 750Mbps bandwidth. These run a fairly vanilla install of Elasticsearch.
- Other bits – Connecting the crawl and search clusters are many smaller Haskell, Ruby, Python, and Go pieces that make it all function automatically.
- Hosting – All our infrastructure is located in the Montreal data centre of OVH, a very reasonably priced dedicated server provider we have been extremely happy with.
We will be creating a showcase of interesting projects based on the data set we have released. Email us at firstname.lastname@example.org with a short description of what you have managed to achieve and we will include it in the showcase.
Frequently Asked Questions:
- Can you give me a list of all the domains in your index? Unfortunately we are restricted from providing raw access to the zone files as part of the zone file access agreements we had to sign. If you want access to the zone files yourself you will need to sign an agreement with each registrar individually.
- How do I remove my site(s) from your index? We cannot remove sites from the existing index. To be removed from any future index, follow the instructions at https://meanpath.com/meanpathbot.html. Our bots fully respect any site instructions issued in robots.txt.
- Why don’t you use AWS or other cloud providers? We did consider using cloud infrastructure but found it to be substantially more expensive for the same performance than dedicated servers from OVH. Our usage profile is 100% CPU, 100% bandwidth, max memory for 20 hours per day, which is not a great fit for shared infrastructure.
- Why can’t you crawl faster than 10,000 sites per second? We can crawl much faster than this, but the level of DNS resolution traffic it would generate would be considered excessive by operators of DNS servers that host many domains. To be polite and make sure our crawling does not impact other systems, we throttle our crawl speed to 10,000 sites per second across our whole crawl cluster (200 per bot per second). As a rule, more than a few hundred queries per second to an individual DNS server will be considered too much by some DNS systems administrators.
- How fast could you crawl? Theoretically, with our current 50-bot cluster and no limits on our DNS resolution activity, we could crawl at 30,000 sites per second. At that point our performance would be bound by the 100Mbps network connections these servers are on. Larger connections could sustain a much faster crawl speed, but then our crawl would look like a DDoS on the external DNS servers we need to get records from.
- How many docs per second can your Elasticsearch cluster do? In the current configuration it averages between 8,000 and 10,000 docs per second with 11 shards and no replicas. It takes 5 hours to index a crawl of 115,642,924 domains/docs. We add replicas once the indexing is complete so we can get access to the complete data for reports whilst the replicas are being created.
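
The throughput figures in the answers above can be cross-checked with some back-of-envelope arithmetic. The sketch below is only that: the function names are ours, the 50-bot count is taken from the FAQ answer, and the "bytes per site" value is simply what a saturated 100Mbps link would allow at the quoted theoretical rate, not a measured page size.

```python
CLUSTER_CAP_PER_SECOND = 10_000   # polite, throttled crawl rate
THEORETICAL_PER_SECOND = 30_000   # quoted ceiling without DNS limits
BOTS = 50                         # bot count used in the FAQ answer
LINK_MBPS = 100                   # per-bot network connection
DOCS = 115_642_924                # domains/docs in the index
INDEX_HOURS = 5                   # quoted time to index one crawl


def per_bot_rate(cluster_rate, bots):
    """Sites per second each bot handles at a given cluster-wide rate."""
    return cluster_rate / bots


def bytes_per_site_at_line_rate(link_mbps, sites_per_second):
    """Bytes available per site if the link is fully saturated."""
    return link_mbps * 1_000_000 / 8 / sites_per_second


def hours_to_index(docs, docs_per_second):
    """Hours needed to index `docs` at a steady docs-per-second rate."""
    return docs / docs_per_second / 3600


def effective_docs_per_second(docs, hours):
    """Average indexing rate implied by a total document count and time."""
    return docs / (hours * 3600)


throttled = per_bot_rate(CLUSTER_CAP_PER_SECOND, BOTS)
unthrottled = per_bot_rate(THEORETICAL_PER_SECOND, BOTS)
print(f"throttled per bot:    {throttled:.0f} sites/s")
print(f"unthrottled per bot:  {unthrottled:.0f} sites/s")
print(f"bytes per site:       {bytes_per_site_at_line_rate(LINK_MBPS, unthrottled):,.0f}")
print(f"hours at 8,000/s:     {hours_to_index(DOCS, 8_000):.2f}")
print(f"hours at 10,000/s:    {hours_to_index(DOCS, 10_000):.2f}")
print(f"rate implied by 5 h:  {effective_docs_per_second(DOCS, INDEX_HOURS):,.0f} docs/s")
```

The throttled rate matches the stated 200 sites per bot per second. Note that at a steady 8,000 to 10,000 docs per second the index would take roughly 3.2 to 4.0 hours, so the quoted 5 hours corresponds to an effective end-to-end rate of about 6,400 docs per second; the post does not say what accounts for the difference.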