4.7m sites using Bootstrap vs 334k on Foundation

A sample taken today by technology market share analysis company meanpath shows that Bootstrap (a popular front-end framework released in 2011) is being used by 4.7 million websites, or 1.79% of all sites. Competing framework Foundation was found on only 334k, or 0.13% of all websites.

For Bootstrap this is a massive 79% increase in just over seven months since our last analysis in July 2013, when we found only 1% of sites using it. At this rate of growth Bootstrap will hit 2% of all sites just in time for its third birthday in August 2014.

You can view a sample of the Bootstrap results and even search within all the Bootstrap sites in our index by replacing cats OR dogs with your own search terms. The same can be done with the full results for Foundation or you can search within them.

Some notes on our methodology for creating these numbers:

  1. We sampled 120,114,791 registered domains by analysing the source code on their front pages.
  2. We found Bootstrap in use on 2,145,409 of those domains and Foundation on 151,701, which is 1.79% and 0.13% respectively.
  3. From this sample we extrapolated totals of 4.7m for Bootstrap and 334k for Foundation, using the estimate of 265 million worldwide registered domains from the Verisign Domain Name Industry Brief (the arithmetic is sketched below).
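
If you want to reproduce the arithmetic, the extrapolation is a simple proportion. Here is a minimal Python sketch using only the figures quoted above:

```python
# Extrapolation used above: share of sample * estimated worldwide domains.
SAMPLE_SIZE = 120_114_791        # registered domains we sampled
WORLDWIDE_DOMAINS = 265_000_000  # Verisign Domain Name Industry Brief estimate

counts = {"Bootstrap": 2_145_409, "Foundation": 151_701}

for framework, hits in counts.items():
    share = hits / SAMPLE_SIZE            # share of the sample
    estimate = share * WORLDWIDE_DOMAINS  # extrapolated worldwide total
    print(f"{framework}: {share:.2%} of sample, ~{estimate:,.0f} sites worldwide")
```

Run as-is, this reproduces the 1.79% and 0.13% shares and (to rounding) the 4.7m and 334k extrapolations quoted above.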

Do your own searches on meanpath.

Follow us on Twitter.

* Image courtesy of Zing Design.

meanpath Jan 2014 Torrent – 1.6TB of crawl data from 115m websites

tl;dr you can download the 376 GB compressed (1.6 TB uncompressed) crawl via torrent from archive.meanpath.com

October 2012 was the official kick-off date for development of meanpath – our source code search engine. Our goal was to crawl as much of the web as we could using mostly open source software and a decent (although not Google-level) financial investment. Beyond the many substantial technical challenges, we also needed to acquire a sizeable list of seed domains as the starting block for our crawler. Enter Common Crawl, an open crawl of the web that can be accessed and analysed by everyone. Of specific interest to us was the Common Crawl URL Index, which we combined with raw domain zone files and domains from the Internet Census 2012 to create our master domain list.

We are firm supporters of open access to information, which is why we have chosen to release a free crawl of over 115 million sites. This index contains only the front page HTML, robots.txt, favicons, and server headers of every crawlable .com, .net, .org, .biz, .info, .us, .mobi, and .xxx domain that was in the 2nd of January 2014 zone files. It does not execute or follow JavaScript or CSS, so it is not 100% equivalent to what you see when you click view source in your browser. The crawl itself started at 2:00am UTC on the 4th of January 2014 and finished the same day.
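
For a rough idea of what is captured per domain, here is an illustrative sketch in Python. Our production crawler is custom Haskell, and the domain below is hypothetical; this only shows the four artefacts we store.

```python
# Illustrative sketch only -- not our Haskell crawler. Captures the four
# artefacts stored per domain: robots.txt, front page HTML, favicon, and
# server headers. No JavaScript or CSS is executed or followed.
import urllib.robotparser
import requests

domain = "example.com"   # hypothetical domain
base = f"http://{domain}"

robots_txt = requests.get(f"{base}/robots.txt", timeout=10).text
parser = urllib.robotparser.RobotFileParser()
parser.parse(robots_txt.splitlines())

if parser.can_fetch("meanpathbot", base + "/"):
    front_page = requests.get(base, timeout=10)
    favicon = requests.get(f"{base}/favicon.ico", timeout=10)
    record = {
        "domain": domain,
        "robots_txt": robots_txt,
        "headers": dict(front_page.headers),           # server headers
        "html": front_page.text,                       # front page HTML as served
        "favicon": favicon.content if favicon.ok else None,
    }
    print(record["headers"].get("Server", "unknown"))
```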

Get Started:
You can access the meanpath January 2014 Front Page Index in two ways:

  1. archive.meanpath.com – you can download the raw crawl files from here.
  2. Web front end – if you are not interested in grappling with the raw crawl files, you can use our web front end to run some sample searches.

Data Set Statistics:

  1. 149,369,860 seed domains. We started our crawl with the full zone file list of all domains in the .com (112,117,307), .net (15,226,877), .org (10,396,351), .info (5,884,505), .us (1,804,653), .biz (2,630,676), .mobi (1,197,682), and .xxx (111,809) top level domains (TLDs), for a total of 149,369,860 domains. We have a much larger set of domains covering all TLDs, but very few allow you to download a zone file from the registrar, so we cannot guarantee 100% coverage there. For statistical purposes, having a defined 100% starting point is necessary.
  2. 115,642,924 successfully crawled domains. Of the 149,369,860 seed domains, only 115,642,924 could be crawled, which is a coverage rate of 77.42%.
  3. 476 minutes of crawling. It took us a total of 476 minutes to complete the crawl, which was done in 5 passes. If a domain could not be crawled in the first pass we tried 4 more passes before giving up (domains excluded by robots.txt are not retried). The most common reason a domain cannot be crawled is the lack of any valid A record for domain.com or www.domain.com.
  4. 1,500GB of uncompressed data. This has been compressed down to 352.40GB using gzip for ease of download.

File Format:

We have chosen to release this crawl as 2,330 SQLite files of varying size, compressed with gzip. These can easily be viewed and dumped to CSV using SQLite Browser. Each file is named botXX.meanpath.com-crawl20140104-fileYY.db.gz, with XX being the bot number from 1-50 and YY being a sequential file number. Please note that bot46 is missing from this export as it was not part of the crawl cluster. Also note that the file numbers are not in any particular order.
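
If you would rather script against the files than use SQLite Browser, something like the sketch below works. Note that the table and column names (pages, domain, html) are placeholders for illustration; check the real schema with .schema in the sqlite3 shell first.

```python
# Sketch: decompress one crawl file and dump a table to CSV.
# The table/column names below are assumed for illustration only --
# inspect the actual schema with .schema in the sqlite3 shell.
import csv
import gzip
import shutil
import sqlite3

src = "bot01.meanpath.com-crawl20140104-file01.db.gz"
db_path = src[:-3]  # strip the .gz suffix

with gzip.open(src, "rb") as fin, open(db_path, "wb") as fout:
    shutil.copyfileobj(fin, fout)

conn = sqlite3.connect(db_path)
rows = conn.execute("SELECT domain, html FROM pages")  # hypothetical schema

with open("pages.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["domain", "html"])
    writer.writerows(rows)

conn.close()
```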

Infrastructure Statistics:

  • Crawl cluster – 49 x crawlers, each with 1 x Intel i5-3570S 3.1 GHz CPU, 16GB RAM, 2 x 2TB SATA2 in RAID1, and 100Mbps bandwidth. These run our custom-built Haskell crawler and a heavily customised Unbound DNS resolver. Our robots.txt parser has been released as open source.
  • Search cluster – 11 x index servers, each with 2 x Intel E5630 2.53 GHz CPUs, 256GB RAM, 2 x 600GB SAS in RAID0, and 750Mbps bandwidth. These run a fairly vanilla install of Elasticsearch.
  • Other bits – Connecting the crawl and search clusters are a lot of smaller Haskell, Ruby, Python, and Go pieces that make it all function automatically.
  • Hosting – All our infrastructure is located in OVH's Montreal data centre. OVH is a very reasonably priced dedicated server provider that we have been extremely happy with.

Project Showcase:
We will be creating a showcase of interesting projects based on the data set we have released, so email us at hello@meanpath.com with a short description of what you have managed to achieve and we will include it in the showcase.

Frequently Asked Questions:

  1. Can you give me a list of all the domains in your index? Unfortunately we are restricted from providing raw access to the zone files as part of the zone file access agreements we had to sign. If you want access to the zone files yourself you will need to sign an agreement with each registrar individually.
  2. How do I remove my site/s from your index? We cannot remove sites from the existing index. To be removed from any future index, follow the instructions at https://meanpath.com/meanpathbot.html. Our bots fully respect any site instructions issued in your robots.txt file.
  3. Why don’t you use AWS or other cloud providers? We did consider using cloud infrastructure but found the price/performance to be substantially worse than dedicated servers from OVH. Our usage profile is 100% CPU, 100% bandwidth, and maximum memory for 20 hours per day, which is not a great fit for shared infrastructure.
  4. Why can’t you crawl faster than 10,000 sites per second? We can crawl much faster than this, but the level of DNS resolution traffic it would generate would be considered excessive by operators of DNS servers that host many domains. To be polite and make sure our crawling does not impact other systems, we throttle our crawl speed to 10,000 sites per second across our whole crawl cluster (200 per bot per second). As a rule, more than a few hundred queries per second to an individual DNS server will be considered too much by some DNS systems administrators.
  5. How fast could you crawl? Theoretically with our current 50 bot cluster and no limits on our DNS resolution activity we could crawl at 30,000 sites per second. Our performance would be bound by the 100mbps network connections these servers are on at that point. Larger connections could sustain a much faster crawl speed but at that point our crawl would look like a DDOS on the external DNS servers we need to get records from.
  6. How many docs per second can your Elasticsearch cluster handle? In the current configuration it averages between 8,000 and 10,000 docs per second with 11 shards and no replicas. It takes 5 hours to index a crawl of 115,642,924 domains/docs. We add replicas once the indexing is complete so we can get access to the complete data for reports whilst the replicas are being created (see the sketch after this list).
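
The replica step described in the last answer is a standard Elasticsearch settings change. Here is a hedged sketch using Python's requests library; the host and index name are placeholders, not our actual cluster details.

```python
# Sketch of the replicas-after-indexing step: bulk index with no replicas,
# then raise number_of_replicas once indexing is finished. The host and
# index name below are placeholders.
import requests

ES = "http://localhost:9200"
INDEX = "crawl-20140104"  # hypothetical index name

# During the bulk load the index runs with its 11 primary shards and no replicas.
requests.put(f"{ES}/{INDEX}/_settings",
             json={"index": {"number_of_replicas": 0}})

# ... bulk indexing of the ~115m front-page documents happens here ...

# Once indexing completes, add a replica; reports can run against the
# primaries while Elasticsearch copies shards in the background.
requests.put(f"{ES}/{INDEX}/_settings",
             json={"index": {"number_of_replicas": 1}})
```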

Terms of Use:
Please note that use of this site, our service, and/or data constitutes your binding acceptance of the full Terms of Use, not just of this summary. You can find the full, legally binding document at https://meanpath.com/freeterms.html

Still Got Questions?
Contact us on support@meanpath.com or discuss on Hacker News.

Churn Mitigation vs Win Back

tl;dr Find customers testing a competing service and reach out before they churn. As an example, this search shows 1,136 sites using Mixpanel and KISSmetrics simultaneously. Or, if you compete against both of these companies, approach those sites yourself.

Customer churn will quickly kill a startup, which is why high churn is every SaaS startup's worst nightmare. Churn is a key metric VCs use to measure the health of your business. Some startups will spend over $400.00 USD to acquire a customer, and must then keep that customer for a minimum of 12 months just to repay the acquisition cost. If the customer churns before the acquisition cost has been repaid, this has a very detrimental effect on cash flow.

When a high value customer starts testing out a competing service and decides to churn, you normally only find out after the decision to leave has been made. Your only option at that point is to implement a win-back strategy. If you have rapid growth you can generally paper over a massive churn rate, but savvy investors will uncover it and, with some simple analysis, work out that your growth will stall and churn will eat your business.

What you may not realise is that your customers give off clear signals that they are considering churning. If, like Mixpanel and KISSmetrics, your service is one that customers can A/B test on their site, you can use a simple search on meanpath to show all of your customers who are also using a competitor's service. Reaching out with a personal email at this point, offering a free consultation, should be enough to find out why they are considering churning, and hopefully it is something you can solve.
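
The signal itself is easy to spot in page source. As a rough illustration, the sketch below flags pages that load both services; the substrings are illustrative signatures, not meanpath's actual fingerprints.

```python
# Sketch: flag front pages that load both Mixpanel and KISSmetrics.
# The substrings are illustrative signatures, not meanpath's fingerprints.
MIXPANEL_SIG = "cdn.mxpnl.com"
KISSMETRICS_SIG = "kissmetrics.com"

def uses_both(html: str) -> bool:
    html = html.lower()
    return MIXPANEL_SIG in html and KISSMETRICS_SIG in html

# Example over crawled front pages keyed by domain:
pages = {
    "example.com": '<script src="//cdn.mxpnl.com/libs/mixpanel-2-latest.min.js"></script>'
                   '<script src="//i.kissmetrics.com/i.js"></script>',
}
print([d for d, html in pages.items() if uses_both(html)])  # ['example.com']
```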

Either of these companies could use this data to build a churn mitigation strategy. Or, if you compete against both of them, you could use the same data to identify dissatisfied customers and approach them about your service as an alternative. It is impossible to eliminate churn completely, and a certain amount of churn can even be a positive thing. Your customers may also have a legitimate reason for using two competing products simultaneously.

Sitting on your hands whilst your competitors chip away at your customer base is a surefire way of ending up on the startup scrapheap. Look for churn signals, such as decreasing engagement with your service or A/B testing of competing services, and reach out before the decision to churn has been made.

Discuss this post on Hacker News.
Have a play with the beta meanpath index.
Follow us on Twitter.

DigitalOcean Now Hosting Over 64,000 Web Sites

From a standing start in late 2012, DigitalOcean is now hosting over 64,000 individual websites across its four data centres. By creating a winning combination of rock-bottom prices, enterprise-level support, savvy marketing, and SSD-based hardware, they have snatched a substantial portion of the hosting market. The meteoric rise of DigitalOcean is unprecedented for a new entrant to the very crowded and ultra-competitive hosting market.

The rapid migration of websites to DigitalOcean has at times outpaced their ability to deploy new servers and even caused them to run out of IP addresses in Amsterdam. With a recent funding round of $3.2M led by IA Ventures DigitalOcean should now have enough cash in the bank to scale well ahead of demand.

To create these statistics meanpath selected a random sample of 100 million domains and flagged those that were hosted on IP addresses owned by DigitalOcean.
You can view the raw data here:
New York Region (two locations) – 22,114 domains
San Francisco Region – 36,344 domains
Amsterdam Region – 5,726 domains
Total of 64,184 domain names as of 1st of July 2013.
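
In outline, the flagging step resolves each sampled domain and checks whether its address falls inside one of DigitalOcean's network ranges. A simplified sketch follows; the CIDR block below is a placeholder, and the real check uses DigitalOcean's full list of announced prefixes.

```python
# Sketch: flag domains whose A record resolves into a hosting provider's
# address space. The single CIDR below is a placeholder, not a real
# DigitalOcean prefix.
import ipaddress
import socket

DIGITALOCEAN_RANGES = [ipaddress.ip_network("192.0.2.0/24")]  # placeholder

def hosted_on_digitalocean(domain: str) -> bool:
    try:
        ip = ipaddress.ip_address(socket.gethostbyname(domain))
    except (socket.gaierror, ValueError):
        return False
    return any(ip in net for net in DIGITALOCEAN_RANGES)

print(hosted_on_digitalocean("example.com"))
```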

Discuss this post on Hacker News.
Have a play with the beta meanpath index.
Follow us on Twitter.

Twitter Bootstrap Now Powering 1% of The Web

Love it or hate it, Twitter Bootstrap is quickly taking over the web. The team at meanpath recently pulled a random selection of 100 million websites from our source code search engine and found clear Twitter Bootstrap signatures on 981,608 of them.
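
A Bootstrap signature can be as simple as a reference to the stock stylesheet or its trademark class names. Here is a naive, illustrative check; our production fingerprints are more involved than this.

```python
# Sketch: a naive Bootstrap fingerprint over front-page HTML. The hints
# below are illustrative; production signatures are more robust.
BOOTSTRAP_HINTS = ("bootstrap.min.css", "bootstrap.css", "navbar navbar-")

def looks_like_bootstrap(html: str) -> bool:
    html = html.lower()
    return any(hint in html for hint in BOOTSTRAP_HINTS)

print(looks_like_bootstrap('<link rel="stylesheet" href="/css/bootstrap.min.css">'))  # True
```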

What started as an internal project championed by Twitter developers Mark Otto (now a designer at GitHub) and Jacob Thornton (who left to work at Obvious Corporation) has enabled millions of website owners to quickly launch a clean, responsive website without having to reinvent the wheel. With minimal front-end development skills, we were still able to deploy a landing page for our site in under a day using Bootstrap and a $10.00 theme.

Version 3 of Bootstrap is currently at the RC1 stage. Bootstrap is still the most popular project on GitHub, with 53,916 stargazers and 18,119 forks.

August marks the second birthday for Twitter Bootstrap so the team at meanpath would like to extend an early happy birthday!

Discuss this post on Hacker News.
Have a play with the beta meanpath index.
Follow us on Twitter.

Image courtesy of Pink Cake Box.