When the order link for the Butterfly Marketing 2.0 launch was emailed to Mike Filsaime’s “Early Bird” list yesterday, the BFM servers crashed pretty spectacularly. Ustream chatrooms were packed with unhappy punters jabbing the refresh button like woodpeckers - fearful they might miss the opportunity to be one of the first 5000 to grab what was undeniably a bona fide bargain.
Let’s dispel one myth straight away. The crash was NOT part of the BFM2.0 launch strategy. The problems and errors I saw were absolutely consistent with a database server that was nowhere near able to cope with the number of connections it was hit with.
These included:
- Unable to connect to database server.
- Error 145 - a corrupted MyISAM table file. This is repairable with the command myisamchk -r *.MYI, but you really need to take the DB server down first.
- The State/Province select box on the order page was populated dynamically with an AJAX request which consistently failed, making the form un-submittable.
- The AJAX routines to check the availability of BFM affiliate usernames and Instant Affiliate Website usernames failed, again making the form un-submittable.
What’s interesting is that a large amount of proprietary information was exposed by the servers - server file paths, table schemas, database usernames (not passwords, thankfully) and PHP include filenames. All this information could potentially be used by a hacker to make their nasty task easier.
A cynic might still believe that the BFM server capacity was purposely kept low in the full knowledge that the technology would fail, in order to generate massive natural Social Proof (on ustream and twitter) for the offer.
No. They’d have to concede that in this case, the “juicy” error messages would have been suppressed first - a trivial programming task.
Most professional web developers have been here at some point. I’ve brought down lovefilm.com, jolt.co.uk, and a few other big sites in my time. When you do that you learn pretty quickly how to fire-fight…
With website scalability though, prevention is definitely better than cure!
So what can the next big launch do to get this right?
There are a number of links in the chain of serving a dynamic database-enabled web page to a user.
Any can become the bottle-neck that kills the site. Let’s go through them:
Hosting
You need serious hosting to handle serious traffic spikes. Shared hosting is out. Dedicated hosting is better, but you’re probably looking at one or more quad processor boxes at least. Those are the expensive ones! Nowadays services like Slicehost, Mosso and Amazon EC2 allow you to scale your hosting dynamically, which is perfect for launches, and far more economical than having a pile of serious hardware barely ticking over for 360 days a year.
Wide Area Network
Make sure the line coming into your hosting is FAT. This all depends on how much traffic you’re expecting in the spike, but talk to your host about upgrading your “pipe” to unmetered Gigabit - or at least 100 Megabit for the launch.
Also, does your host have multiple tier 1 connections? If Verizon dies for a while will your launch be dead in the water? Most hosts have at least a backup line with Cogent; better than nothing and cheap. Ideally though you want at least 2 of: AT&T, NTT Verio, Level 3, Global Crossing, Qwest, Sprint and Verizon.
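To put a rough number on the size of pipe discussed in this section, here is a back-of-envelope sketch in Python. The traffic figures, the page weight and the 2x safety margin are all made-up assumptions for illustration; plug in your own estimates.

```python
def required_mbps(page_loads_per_second: float, page_weight_kb: float,
                  headroom: float = 2.0) -> float:
    """Estimate the bandwidth (in megabits per second) needed for a traffic spike."""
    kilobits_per_second = page_loads_per_second * page_weight_kb * 8  # kilobytes -> kilobits
    return kilobits_per_second / 1000 * headroom  # kilobits -> megabits, plus a safety margin


# Hypothetical launch: 200 page loads per second, 500 KB per page (images included).
print(required_mbps(200, 500))  # 1600.0, i.e. more than a single Gigabit line can carry
```

The same sum also shows why cutting page weight matters: halve the kilobytes per page and you halve the pipe you need.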
Local Area Network
If you have more than one server, put them on a private LAN using Gigabit network links. Otherwise the slow links between your servers could be the bottleneck. Also, communications on your local (192.168.*.*) network probably don’t need to explain themselves to a firewall - it just slows everything down.
Server Configuration
On a dedicated platform, it goes without saying that no unnecessary software that uses CPU cycles or filesystem inodes should be running on D-Day.
Make sure there’s enough disk space for quickly-expanding log files - sticking /var on a separate (fast!) disk can help.
Are excessive file reads causing a problem? Ouch. Probably best fixed with a PHP bytecode optimiser or MySQL query optimisation (see below), though fast RAID 10 drives will help overall disk performance. NB the cloud hosting services like Mosso all use RAID 10 storage as a default.
Apache Web Server Optimization
If your web server won’t step up to the plate with big traffic, you won’t even get as far as showing users an error. Your site will simply look offline, and users will get their browser’s standard “cannot connect” page rather than anything served by you! Google “optimize httpd.conf”.
Use the Apache benchmarking tool or similar to test your server before the real traffic arrives. (Google apache ab)
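If you want to see what a tool like this does under the hood, the idea can be sketched in a few lines of Python. This is not how ab itself is implemented; the function name load_test and the fetch callback are my own inventions for illustration. It fires a fixed number of requests through a pool of worker threads and reports failures and throughput.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def load_test(fetch: Callable[[], bool], total: int, concurrency: int) -> Dict[str, float]:
    """Call fetch() `total` times using `concurrency` worker threads."""

    def timed_call(_: int):
        start = time.perf_counter()
        try:
            ok = bool(fetch())
        except Exception:
            ok = False  # a refused connection or timeout counts as a failure
        return ok, time.perf_counter() - start

    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed_call, range(total)))
    elapsed = time.perf_counter() - started

    return {
        "requests": total,
        "failures": sum(1 for ok, _ in results if not ok),
        "requests_per_second": total / elapsed if elapsed > 0 else float("inf"),
        "slowest_seconds": max(duration for _, duration in results),
    }
```

To point it at a real page, fetch could be a small function that opens your own test URL with urllib.request.urlopen(url, timeout=10) and returns True when the response status is 200. Only ever aim it at servers you own, and raise the concurrency gradually.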
Discuss the possibility of Apache clustering with your host if you’re at or near that level.
MySQL Configuration Optimization
I’d love to have seen the my.cnf file on the BFM MySQL server, but my guess is optimising this alone might have saved the day. I’m sure it’s been edited a bit in the last 24 hours! Google “optimize my.cnf”.
Using the Apache benchmarker on a database-heavy page should give you an idea of the concurrent DB connections and queries-per-second the server will need to service.
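As a rough sketch of how benchmark numbers turn into database requirements, the Python below applies Little’s Law (things in flight = arrival rate x time each one takes). It assumes each page request holds one DB connection for as long as the page takes to build, which is typical of a simple PHP setup but is still an assumption; the figures are hypothetical.

```python
import math


def estimate_db_load(pages_per_second: int, queries_per_page: int, page_time_ms: int):
    """Return (queries per second, concurrent DB connections) for a given page rate.

    Assumes one DB connection per page request, held for the whole request.
    """
    queries_per_second = pages_per_second * queries_per_page
    concurrent_connections = math.ceil(pages_per_second * page_time_ms / 1000)
    return queries_per_second, concurrent_connections


# Hypothetical benchmark result: 150 pages/sec, 12 queries per page, 400 ms per page.
print(estimate_db_load(150, 12, 400))  # (1800, 60)
```

Compare the second number with the max_connections value in my.cnf. Note that when the server slows down under load, page time goes up and the number of connections needed climbs with it, which is how a busy site tips over into “unable to connect”.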
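Switching on MySQL Query Caching is a no-brainer, and the QCache should be given a generous amount of memory, though not without limit: a very large cache can cost more in invalidation overhead than it saves, so benchmark it. Turning off query logging is a must!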
PHP Configuration Optimization
Editing the php.ini to load only the PHP modules required by the application will save disk reads and memory. Also switch off the display of error messages to visitors (the display_errors setting), just in case!
Using a PHP bytecode cache (often called a PHP Accelerator) will speed up PHP no end in most cases.
PHP Code Optimization
Optimizing PHP is really out of the scope of this article, but suffice it to say one single line of code can bring a site down (trust me, I’ve done it!)
If code will need to scale quickly, get more than one coder to look at it if possible. Look to reduce long loops, evaluation of unset variables, any access to external services and other potential bottle-necks. Usually though, the biggest bang-for-buck in terms of performance gain can be gotten from optimizing the database queries:
MySQL Query Optimization
Again, there are books on this. These are likely to be the quickest wins though:
Indices - Does every column that can be found after the word “WHERE” in a query have an index? Is querying against “text” fields avoided?
Joins - Will a well-written query employing 1 or more joins replace an entire loop of code where essentially the same query is sent dozens or hundreds of times?
Query Caching - Which queries use ever-changing functions like NOW()? These will never be cached. Does the application require the use of NOW() or will today’s date - generated in PHP - work just as well?
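The Indices and Joins points above can be sketched as follows. To keep the example self-contained it uses Python’s built-in sqlite3 module rather than PHP and MySQL, and the users and orders tables are invented for illustration; the pattern is the same whatever the stack.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
    CREATE INDEX idx_orders_user_id ON orders (user_id);  -- index the WHERE/JOIN column
""")
db.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ann"), (2, "bob"), (3, "cy")])
db.executemany("INSERT INTO orders (user_id, total) VALUES (?, ?)",
               [(1, 10.0), (1, 5.0), (2, 7.5)])


def totals_with_loop(conn):
    """Slow pattern: one query for the users, then one more query per user."""
    queries = 1
    result = {}
    for user_id, name in conn.execute("SELECT id, name FROM users").fetchall():
        row = conn.execute(
            "SELECT COALESCE(SUM(total), 0.0) FROM orders WHERE user_id = ?",
            (user_id,),
        ).fetchone()
        result[name] = row[0]
        queries += 1
    return result, queries


def totals_with_join(conn):
    """Same answer from a single query using a join."""
    rows = conn.execute("""
        SELECT u.name, COALESCE(SUM(o.total), 0.0)
        FROM users u LEFT JOIN orders o ON o.user_id = u.id
        GROUP BY u.id, u.name
    """).fetchall()
    return dict(rows), 1


print(totals_with_loop(db))  # ({'ann': 15.0, 'bob': 7.5, 'cy': 0.0}, 4)
print(totals_with_join(db))  # ({'ann': 15.0, 'bob': 7.5, 'cy': 0.0}, 1)
```

With three users the loop costs four queries; with five thousand buyers it costs five thousand and one, while the join stays at one.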
Consider implementing Memcached too.
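The usual way to use Memcached is the “cache-aside” pattern: look in the cache first, and only hit the database on a miss. Here is a minimal Python sketch of that pattern. TinyCache is a made-up, in-process stand-in so the example runs on its own; a real deployment would talk to a Memcached server through a client library instead.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TinyCache:
    """In-process stand-in for Memcached: get/set with an expiry time."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._items: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str):
        entry = self._items.get(key)
        if entry is None or entry[0] <= self._clock():
            return None  # missing or expired
        return entry[1]

    def set(self, key: str, value: Any, ttl_seconds: float) -> None:
        self._items[key] = (self._clock() + ttl_seconds, value)


def cached(cache: TinyCache, key: str, ttl_seconds: float, load: Callable[[], Any]):
    """Cache-aside: return the cached value, or call load() once and store it."""
    value = cache.get(key)
    if value is None:
        value = load()
        cache.set(key, value, ttl_seconds)
    return value


db_hits = []  # records each time we would have queried the database


def load_states():
    db_hits.append(1)
    return ["Alabama", "Alaska", "Arizona"]


cache = TinyCache()
for _ in range(1000):
    states = cached(cache, "states:US", 300, load_states)

print(len(db_hits))  # 1
```

A thousand page views, one database query. Data that barely changes, like a list of states, is the ideal candidate.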
Front End Architecture
I possibly made that term up, but there are a few things that would have helped the BFM servers:
- Stick ALL the images AND all the javascript AND all the movies (in fact, everything that isn’t a PHP file) on Amazon S3.
- Consider load-management as part of overall usability. For example:
  - Did the state dropdown need to use AJAX or could it just have been a blank text field?
  - Does the user need to pick affiliate etc usernames at order-time, or could that be done later once they’re safely converted?
  - Did the OTO need to access the user table, or would it have made sense to “downgrade” the OTO’s “security” to just a cookie for a while? Doing that would probably have enabled every buyer to see the OTO at the correct time.
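On the cookie idea: one way to skip the user-table lookup without letting just anybody forge the cookie is to sign it. The Python sketch below goes one step beyond the plain cookie suggested above, so treat the signing as my own assumption; the function names, cookie format and secret key are all hypothetical.

```python
import hashlib
import hmac

SECRET_KEY = b"change-me"  # hypothetical; keep the real key out of source control


def make_oto_cookie(order_id: str) -> str:
    """Build a cookie value of the form '<order id>.<signature>'."""
    signature = hmac.new(SECRET_KEY, order_id.encode(), hashlib.sha256).hexdigest()
    return f"{order_id}.{signature}"


def verify_oto_cookie(cookie: str):
    """Return the order id if the signature matches, otherwise None. No database needed."""
    order_id, _, signature = cookie.rpartition(".")
    expected = hmac.new(SECRET_KEY, order_id.encode(), hashlib.sha256).hexdigest()
    if order_id and hmac.compare_digest(signature.encode(), expected.encode()):
        return order_id
    return None


cookie = make_oto_cookie("order-1001")
print(verify_oto_cookie(cookie))     # order-1001
print(verify_oto_cookie("garbage"))  # None
```

The check is pure CPU work on the web server, so the OTO page keeps working even while the database is on its knees.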
Scaling a LAMP web application requires planning, investments in both time and cash, and a rather “holistic” approach to all the technical building blocks that go into making it work.
Someone (usually it’s the CTO) needs to take ownership of the “bullet-proofing” process rather than everyone assuming people will have “done their bit” before D-Day.
I hope the points above help you plan your next launch successfully! If you need any help, please get in touch with me.
EDIT
Please see the comment below from Rick at MikeFilsaime.com Inc.
One point Rick makes which I missed out from the article above is DNS load balancing. When a browser makes a request of a web server, it must ask a Domain Name Server where it can find the IP address for that site, a bit like looking up a phone number in a telephone directory. To save the internet slowing down to a crawl, DNS servers have a “time to live” (TTL) setting, which tells the browser that the IP address for the domain will remain unchanged for the next n seconds. This is usually set to 24 hours or even more.
In cases of possible high load, it makes sense to set the TTL on the DNS server to a very small number, perhaps as low as 5 seconds. This way you can quickly route web traffic through a DNS load balancer (or just to a beefier server) and you won’t have to wait overnight for the changes to take effect.
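To see why the TTL matters so much, here is a toy Python simulation. CachingResolver is a made-up class, not a real DNS client, and the IP addresses come from a range reserved for documentation. It models two visitors whose resolvers cached the old address, one with a 24 hour TTL and one with a 5 second TTL.

```python
from typing import Callable, Optional


class CachingResolver:
    """Toy resolver that remembers an answer until its TTL runs out."""

    def __init__(self, lookup: Callable[[], str], ttl_seconds: float):
        self._lookup = lookup
        self._ttl = ttl_seconds
        self._cached: Optional[str] = None
        self._expires_at = 0.0

    def resolve(self, now: float) -> str:
        if self._cached is None or now >= self._expires_at:
            self._cached = self._lookup()  # ask the authoritative DNS server
            self._expires_at = now + self._ttl
        return self._cached


record = {"ip": "203.0.113.10"}  # the struggling server
long_ttl = CachingResolver(lambda: record["ip"], ttl_seconds=86400)
short_ttl = CachingResolver(lambda: record["ip"], ttl_seconds=5)

long_ttl.resolve(now=0)
short_ttl.resolve(now=0)
record["ip"] = "203.0.113.20"  # one second later, DNS is switched to a beefier server

print(long_ttl.resolve(now=60))   # 203.0.113.10 (still hammering the old box)
print(short_ttl.resolve(now=60))  # 203.0.113.20
```

One practical consequence: the TTL has to be lowered well before the launch. Anyone who looked up the domain while the 24 hour TTL was in force will keep the old answer until it expires.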
As a sidenote, Google do most of their traffic management (as far as I can tell) by using very low TTLs for their domains - that way a dynamic DNS server can route traffic to the IP address most able to cope with it.
Thanks for commenting Rick; I can only imagine how busy you guys have been the last few weeks!
Tags: bfm · butterfly marketing · mike filsaime · mysql · php
4 Comments

Excellent info, Alex.
If marketers read and implement these tricks, we would see less of the “whoops, my server crashed” emails. Despite what people think about it being done on purpose, there’s no way that’s true… downtime during launch costs thousands of dollars in lost revenue for top marketers.
Neil.
[...] Alex Poole over at his blog has made an excellent post about how to keep a server up during a product launch. [...]
Hi Alex,
Very good article; you point out a great deal. There were a lot of “should have dones”, but the most important thing was the action taken at the time of crisis. For example, one line of code I was able to remove dropped the server load by 50%. We also dropped useless processes that should have been taken care of by our host, but we did it ourselves. At one point the load average was at 1000.58 before the code drop; I was able to get it down to 400. We quickly optimized the cnf and httpd, but it was a combination of factors that was killing us.
Anyhow, after some quick action and DNS load balancing techniques we were able to get control of the load.
But you are 100% right. Thank you for your input.
Cheers
Rick Mataka
MikeFilsaime.com Inc.
Excellent article, now when you are looking to implement these things be sure you call a real professional. Marketers who work closely with their webmaster (or systems administration) support get all the money from their launch and have happier customers.
I know because I’ve been helping commercial firms with these types of issues for more than a decade. Imagine dropping a postcard campaign tied to a catalog of 1 million units, then your website goes down. It just doesn’t happen.
Whatever your launch is, use this article as a checklist for what MUST be done before you say go. You’ll also want to add a quality assurance session where you run a few orders to make sure credit card processing is up to par.
Sincerely,
Justin