Archive for the ‘linux’ Category

LinkedIn and Hadoop

Interesting post about LinkedIn’s Data Infrastructure

Much of LinkedIn’s important data is offline – it moves fairly slowly. So they use daily batch processing with Hadoop as an important part of their calculations. For example, they pre-compute data for their “People You May Know” product this way, scoring 120 billion relationships per day in a mapreduce pipeline of 82 Hadoop jobs that requires 16 TB of intermediate data. This job uses a statistical model to predict the probability of two people knowing each other. Interestingly, they use bloom filters to speed up large joins, yielding a 10x performance improvement.

They have two engineers who work on this pipeline, and are able to test five new algorithms per week. To achieve this rate of change, they rely on A/B testing to compare new approaches to old ones, using a “fly by instruments” approach to optimize results. To achieve performance improvements, they also need to operate on large scale data, so they rely on large scale cluster processing. To achieve that they moved from custom graph processing code to Hadoop mapreduce code – this required some thoughtful design since many graph algorithms don’t translate into mapreduce in a straightforward manner.

The workflow management systems are starting to become interesting as jobs with this many steps and intermediate data can become confusing quickly.

I have to admit that FlashCache for Linux looks pretty cool.

It basically lets you use an SSD block device as a cache in front of slower disks.

Another hack is to mount the SSD as swap and tell InnoDB to use say 100GB of memory. I haven’t tested this but it might be a fun hack :)
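A rough sketch of that hack, assuming the SSD shows up as /dev/sdb (a placeholder device name) and you’re willing to point the buffer pool past physical RAM:

    # Turn the whole SSD into high-priority swap (placeholder device name).
    mkswap /dev/sdb
    swapon -p 10 /dev/sdb

    # Then oversize the buffer pool in my.cnf so the "extra" memory spills
    # onto the SSD via swap instead of onto spinning disk:
    #
    #   [mysqld]
    #   innodb_buffer_pool_size = 100G

Again, untested – swapping out buffer pool pages bypasses InnoDB’s own LRU logic, so treat it as an experiment rather than a recommendation.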

We’ve actually migrated away from using SSD in production – at least for now.

The performance just wasn’t THAT great in our configuration. I’m still semi optimistic but not as much as I was a year ago.

I think if I were to do it again I would drop RAID everywhere with SSD and make sure my database layer can correctly route queries to the right MySQL instance.

Each SSD would need to be on its own database server with its own replication thread.

We’re also ditching the use of caches and instead requiring that all of the application’s data reside in memory. It’s way cheaper on a per-IOP basis this way and just a lot easier to admin.

I spent the last couple of days playing with InnoDB page compression on the latest Percona build.

I’m pretty happy so far with Percona and the latest InnoDB changes.

Compression wasn’t living up to my expectations though.

I think the biggest problem is that compression can only use one core during replication and ALTER TABLE statements.

We have an 80GB database that was running on 96GB boxes filled with RAM.

I wanted to try to run this on much smaller instances (32GB-48GB boxes) by compressing the database.
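For reference, this is roughly what enabling compression looks like on the InnoDB plugin / Percona builds (a sketch – the database and table names are made up, and Barracuda plus file-per-table are prerequisites for compressed tables):

    # my.cnf prerequisites:
    #   innodb_file_per_table = 1
    #   innodb_file_format    = Barracuda

    # Rebuild a table compressed ('mydb' and 'events' are placeholder names).
    mysql -e "ALTER TABLE events ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;" mydb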

Unfortunately, after 24 hours of running an ALTER TABLE which would only use one core per table, the SQL replication thread went to 100% and started falling behind fast.

I think what might be happening is that the InnoDB page buffer is full because it can’t write to the disk fast enough which causes the insert thread to force compression of the pages in the foreground.

Having InnoDB only use one core / thread to compress pages seems like a very bad idea (especially on 8-16 core boxes, I’m testing on an 8 core box now but we have 16 core boxes in production).

The InnoDB page compression documentation doesn’t seem to yield any hints about when InnoDB pages are compressed or in which thread. Nor do there seem to be any configuration variables that we can change in this regard.

Perhaps a ‘compressed buffer pool only’ option could be interesting.

This way InnoDB does not have to maintain an LRU for compressed/decompressed pages. Further, it can read pages off disk, decompress them, and then leave the pages decompressed in a small buffer. Then a worker thread (executing on another core) can compress the pages and move them back into the buffer pool where they can be stored and placed back on disk.

This process could still become disk bottlenecked but at least it would use multiple cores.

Spinn3r is growing fast. Time to hire another engineer. Actually, we’re hiring for like four people right now so I’ll probably be blogging more on this topic.

My older post on this subject still applies for requirements.

If you’re a Linux or MySQL geek we’d love to have your help.

Did I mention we just moved to an awesome office on 2nd and Howard in downtown SF?


About two weeks ago we completed a pretty big project to migrate Spinn3r’s operations from ServerBeach over to Softlayer.

The entire project, from start to finish, took just over one month.

I’m also proud to note that not a single customer noticed any downtime or any issue with our migration. It cost a bit more money and more time but we were able to perform the migration live and in place.


ServerBeach had been a great dedicated hosting provider for the last two years but we were clearly outgrowing them. Their sweet spot (at least from our perspective) seems to be the lower end to mid-range market. We’re now in the mid to high end dedicated hosting market and Softlayer seems to be the best choice here…

I think my main suggestion to ServerBeach is that they are going to have to come down in price. If they want to compete with Softlayer they’re going to have to do so on price as Softlayer clearly has them beat on feature set.

I did an exhaustive comparison of using cloud providers like Amazon/EC2 vs Colo vs a Dedicated Hosting provider.

My initial thinking was that we’d just throw down $50k-100k and go the colo route. The problem is that the numbers just didn’t add up. It was still far too expensive to go down this route. Internally, there was also a push to look at EC2 and Amazon but they just couldn’t compete on pricing either.

Further, Amazon’s pricing only looks decent if you go down the reserved route. This requires you to put down $30k or so and pay for half of your servers up front. That’s great and all but if you factor in that their servers are a bit underpowered, then you have to buy more of them and your operations costs rise.

I decided to take another look at Softlayer as I evaluated them in mid-2008 and they came in a bit high on pricing. Fortunately, I was able to get an introduction to someone in their Sales department who could appreciate closing a 30-40 server purchase overnight.

Once we had a decent price quote in hand we took a dive into their hardware and pricing model.

I think if it wasn’t for Softlayer we would have had to go down the colo route. IMO they’re by far the best provider in the dedicated hosting space. Serverbeach, Rackspace, etc need to really get their act together if they don’t want Softlayer to eat their lunch.

Further, I have NO idea why these guys aren’t providing Debian. If you want the big Web 2.0 deals you’re going to have to ship Debian. Most of the good Operations Engineers I deal with flat out refuse to work on anything that’s not Debian.


The first thing we evaluated was their hardware. They’re using Supermicro 1/2U servers throughout their datacenter (in fact I think they only use Supermicro) which were the same machines we were thinking of purchasing if we went down the colo route.

These are great boxes and I know quite a few large scale shops that have clusters of these machines well into the hundreds of nodes.

Further, their configurations seem to be pretty solid. We can get boxes with up to 96GB of RAM and 12TB of storage.

We’re currently using three configurations. API servers are single disk boxes with 4GB of RAM. InnoDB/memory boxes have 32GB of RAM. Our bulk storage and archive boxes have 8GB of RAM, an Adaptec RAID controller, and 12TB of storage across 12 x 1TB disks.

They also have the Intel X-25 SSDs in stock, which we’ve been interested in as well.



Their network is pretty impressive as well. They have three datacenters (SEA, WDC, DAL) online now with a 10Gbit interconnect between them. Latency is reasonable too – not as low as being on local gigabit ethernet, but certainly workable for running replication or serial IO applications over their backend.

Their in-datacenter network is all Cisco, with 20Gbit between racks, and each machine comes with a 1Gbit link. In our benchmarks we were easily able to get full gigabit connectivity to all of our machines. This was something we could not get from ServerBeach.

API and Feature Set

I think the biggest differentiator (and what sold us from my perspective) is the Softlayer API.

There are some pretty interesting features here…

For example, we can shutdown a switch port by calling their API.

If a database is misbehaving, and we need to kill it so that we can do a master promotion, we can just shoot it in the head at the switch port and then promote another master.

This way I can verify that all clients timeout and reconnect to the new master. I can then SSH in via another ethernet port (which has a firewall only allowing SSH) and debug the problem.
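As a sketch of the idea – the endpoint, service, and method names below are placeholders rather than the real Softlayer API calls, so check their API documentation for the actual ones:

    # Hypothetical REST-style call; names and URL are illustrative only.
    curl -u "$SL_USER:$SL_API_KEY" \
      "https://api.example.com/SoftLayer_Hypothetical_SwitchPort/1234/disablePort"

    # ...verify that clients fail over to the new master, debug over the
    # management interface, then re-enable the port the same way.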

We can also check their inventory, reboot a machine, look at billing, change reverse DNS configuration, etc.

This would be a real pain to setup if we were going down the colo route.

Combine that with KVM over IP and you have a winner.

I had an interesting idea today to find bugs in networking code.

Design a VPN that deliberately introduces network packet corruption.

One could introduce a tunable to corrupt a certain % of packets.
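It turns out Linux can already do the corruption part without a custom VPN: the netem queueing discipline has a corrupt option that mangles a tunable percentage of packets. A minimal sketch:

    # Corrupt 1% of outgoing packets on eth0 (single-bit errors at random offsets).
    tc qdisc add dev eth0 root netem corrupt 1%

    # ...run the replication test across this link...

    # Remove the corruption when done.
    tc qdisc del dev eth0 root netem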

For example, you could bring up a MySQL master/slave on your ethernet network and then launch the VPN to transfer the replication binary log across the corrupted network link.

Then you could wait and find that MySQL will break in a few minutes.

Now you could implement a patch to checksum the binary log packets and resend them on error.

Then just launch the code on your corrupting VPN and verify that it works.

Could be a great way to find data corruption bugs in protocols that were originally designed to be resilient.

Ideally it would be able to build packets that can find collisions in TCP checksums. Either that or create a new packet with a new TCP checksum.

Using PPP and a network pipe could yield an easy proof of concept.

Update: I imagine a tool like this already exists, though I haven’t checked. If that is the case then the only change I would suggest is to introduce this tool into normal protocol testing.

When you have petabytes of data even a small data corruption can be dangerous because tracking it down can be exceedingly problematic.

One of the things that has always bothered me about replication is that the binary logs are written to disk and then read from disk.

There are two threads which are, for the most part, unaware of each other.

One thread reads the remote binary logs, and the other writes them to disk.

While the Linux page cache CAN work to buffer these logs, the first write will still cause additional disk load.

One strategy, which could seriously boost performance in some situations, would be to pre-read say 10-50MB of data and just keep it in memory.

If a slave is catching up, it could be pulling GIGABYTES of binary log data from the master. It would write all of this to disk, and the subsequent reads would NOT come from cache.

Simply using a small buffer could solve this problem.

One HACK would be to use a ram drive or tmpfs for logs. I assume that the log thread will block if the disk fills up… if it does so intelligently, one could just create a 50MB tmpfs to store binary logs. MySQL would then read these off tmpfs, and execute them.

50MB-250MB should be fine for a pre-read buffer. Once one of the files is removed, the thread would continue reading data.
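A sketch of the tmpfs hack – on the slave side the files in question are the relay logs, and the paths here are just placeholders:

    # Mount a small tmpfs and point the slave's relay logs at it.
    mkdir -p /var/lib/mysql-relay
    mount -t tmpfs -o size=256m tmpfs /var/lib/mysql-relay

    # my.cnf on the slave:
    #   relay_log             = /var/lib/mysql-relay/relay-bin
    #   relay_log_space_limit = 200M   # the IO thread pauses once the logs hit this size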

This might be a bit cutting edge, but the new fallocate() call in Linux > 2.6.23 might be able to improve InnoDB performance.

When InnoDB needs more space it auto-extends the current data file by 8MB. If this is writing out zeros to the new data (to initialize it) then using fallocate() would certainly be faster.
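A rough way to see the difference from userspace – the fallocate(1) utility (from newer util-linux releases) wraps the same syscall, and the file names here are just scratch paths:

    # Preallocate 8MB via the syscall: effectively instant, no zeros written.
    fallocate -l 8M /tmp/prealloc_test

    # Versus actually writing 8MB of zeros, which is what an auto-extend
    # that initializes the new space has to pay for.
    dd if=/dev/zero of=/tmp/zerofill_test bs=1M count=8 oflag=direct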

Apparently, XFS supports this too but needs an ioctl. XFS could support fallocate in the future as well…

I’m enamored by the middle path.

Basically, the idea is that extremism is an evil and that ideological perspectives are often non-optimal solutions.

The Dalai Lama has pursued a middle path solution to the issue of Tibetan independence.

The two opposing philosophies in this situation are total and complete control of Tibet by the Chinese or complete political freedom by the Tibetan people and self governance. The Dalai Lama proposes an alternative whereby China and Tibet pursue stability and co-existence between the Tibetan and Chinese peoples based on equality and mutual co-operation.

How does this apply to Linux?

The current question of swap revolves around using NO swap at all or using swap to page out some portion of your application onto disk.

Even with the SplitLRU patch we’re still seeing problems with Linux swap.

The entire question is Mu

The solution isn’t to disable swap. The solution is to use a SMALL amount of swap (100-200MB) and monitor your application to detect when it is swapping and then tune back the memory usage of your application.

The difficulty is that you often want to use the maximum amount of memory in the system. Imagine a theoretical system that efficiently uses 100% of system memory. An overcommitted application might allocate a BIT more and cause the OOM killer to kick in.

This would be horrible. Instead, why not just page out 10-20 bytes? It’s not going to have devastating effects on the system, and if you’re monitoring it you can decide to tune back your memory usage by 100MB or so.

We have this happen from time to time with MySQL. We allocate slightly too much memory only to have the OOM killer kick in and kill MySQL – not fun.

Using a large swap file isn’t the solution either. If you overtune you can swap out a large portion of your application. This can essentially yield a denial of service as the load on your server becomes so high that you can’t even SSH in after hours of usage. The only solution is to reboot the box.

Using a small swap file avoids this. If you’re swapping more than say 50MB you can have your monitoring system kick in and alert you before the limit is hit and your OOM killer kicks in.
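A minimal sketch of that check, reading swap usage out of /proc/meminfo – the threshold and the alerting mechanism are placeholders for whatever your monitoring system uses:

    #!/bin/bash
    # Alert if more than ~50MB of swap is in use.
    THRESHOLD_KB=$((50 * 1024))

    swap_total_kb=$(awk '/^SwapTotal:/ {print $2}' /proc/meminfo)
    swap_free_kb=$(awk '/^SwapFree:/ {print $2}' /proc/meminfo)
    swap_used_kb=$((swap_total_kb - swap_free_kb))

    if [ "$swap_used_kb" -gt "$THRESHOLD_KB" ]; then
        echo "WARNING: ${swap_used_kb}kB of swap in use - tune back memory usage"
        exit 1   # non-zero exit for Nagios-style checks
    fi
    exit 0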

We’re experimenting with this idea this week and I’m optimistic about the results.

Apple Is Getting Lazy

For a while I was using Forget Me Not to restore my window layout when I disconnected my laptop from my 30″ cinema display.

Users of portable Macs, how many times have you encountered this scenario? You’ve connected your laptop to a nice big monitor, you have your windows arranged for optimal creativity and productivity, then you realize that there’s a meeting in 10 minutes and you need your Mac for a presentation. Suddenly, you break out into a cold sweat, knowing that the window arrangement you spent countless hours on will be lost! (Ok, maybe not hours, but still…)

For the life of me I can’t imagine why Apple hasn’t implemented this in OS X.

A few months back it stopped working. Or at least I *thought* it stopped working as I don’t disconnect from my LCD as much as I once did.

Now I check back and notice this:

UPDATE: A while back I was contacted by an Apple representative who expressed interest in providing an API suitable for FMN. If you want to revive FMN, please file a bug report at (requires a free ADC membership) and say you want to support the enhancement in bug report 6018339. Apple considers duplicate bug reports “votes”, so every report is valuable!

RIP FMN: Unfortunately, Mac OS X 10.5 Leopard has killed FMN dead. FMN relied on being able to query the properties of windows during a “screen configuration is about to change” notification. Apple broke that ability in Leopard and doesn’t seem interested in restoring it. (They have closed the bug that I filed as “behaves as intended”.) I’m afraid I’m not realistically going to have the time to work around this problem. So as much as it pains me to say it, I’m going to have to declare FMN dead as of Leopard. Thanks to everybody who wrote telling me how they enjoyed the app — it was fun while it lasted!

Great job guys. People depended on this behavior. You should ship another API or compatible functionality.

To add insult to injury – I also use Spaces in OS X.

What a piece of garbage. Unfortunately, I’m stuck using it as I depend too heavily on virtual desktops.

Spaces is an amazingly stupid design. When you switch virtual desktops it actually slides the whole UI across the screen in front of you with a vertigo inducing and CPU hogging animation.

Honestly, this feature alone is going to push me to use Linux/Ubuntu again. I’ll probably end up using OS X for applications that aren’t available on Linux but then switch back to Linux for my development (Emacs, GCC, xterm, etc).

I’ve been thinking about SaaS (in the form of Spinn3r) and how it relates to Open Source for the past few months and I think I’ve come to some interesting conclusions. I think SaaS might be a strong competitor to Open Source in that it’s cheaper and higher quality in a number of situations.

Apparently, I’m not the only one:

Open source is always driven by some organisation – a central body that leads community development efforts to support developers and build revenue streams. In essence, that body gives away the base code and knowledge of the community version to encourage development of the service and expand distribution; to make the software go ‘viral’.

However, I believe software as a service (SAAS) has undermined this model.

SAAS offers ready access to beautifully crafted applications and services through the browser for little or no initial cost. These applications supersede centrally-held open source projects since a. they are finished products (rather than base codes, which must be developed into end-user services) and b. can be easily found, used and shared by the end users of the application/service.

More thoughts on the subject are floating in the blogosphere as well:

Opensource tends to build passionate users that consider themselves, to a certain extent “owners” and “developers” of the product in question. These communities tend to be rabidly loyal and have a tendency towards evangelisation. This is clearly a hugely powerful aspect of OSS and should be harnessed.

SaaS on the other hand tends to build networks or communities of individuals that share a commonality – be it use, interest whatever. SaaS users tend to be loyal to a point, but not nearly as loyal as opensource-ers.

To a certain extent SaaS enterprises have attempted to create the opensource level of community by embracing the concepts of beta-testing and user feedback and development. This however has been reasonably limited (mainly due to the fact that opensource is free, at some point a free beta-test of a SaaS product will generally swing over to a subscription based service).

My experience running Spinn3r has led me to similar conclusions.

First, we don’t compete with Open Source crawlers in our interactions with customers. Why? They’re amazingly expensive in comparison.

We run a cluster of 20-30 boxes and handle all aspects of running the crawler. We’re about 1/10th of the cost of doing it yourself since we can amortize our operations across dozens of customers.

In our situation, Open Source isn’t free. It’s 10x the cost of using Spinn3r. It seems counterintuitive, but TCO really comes into play here.

Second, we’re profitable and have no problem paying our developers, buying hardware, outsourcing development, buying tools, etc. Open Source (at least in its purest form) has traditionally had problems raising capital and has often depended on the patronage model. What’s worse, if they follow the MySQL/RedHat model they often put themselves at odds with their original community, which can lead to tension.

This isn’t to say that Open Source is going to go away. We’re big fans of Open Source. Most of our architecture is OSS. Heck, even our reference client is Open Source.

It just seems that SaaS is going to grow to push Open Source out of certain areas due to price, efficiency, and quality issues.

In the end I think this is good for the market and the industry as a whole.

Certainly, our customers are very happy.

PS. As an aside. I’ve always felt that free market economics and Open Source were always hand in hand. When I was doing News Monster (which was both Open Source and Free Software) I would joke that it wasn’t “free as in beer” it was “free as in $29.95.”

We made it easy to checkout NewsMonster directly from CVS and build your own version if you wanted but if you wanted the easy one click install (which included support) then you needed to pay $29.95. Most of our users (99%) opted to pay…

If you’re running MySQL with InnoDB and an in-memory buffer pool, and you’re having paging issues, you should probably upgrade to 2.6.28 ASAP.

We’ve been running it for about three months now. We’re still not able to use 100% of the memory without paging – more like 95%. However, this is a vast improvement over only being able to use 70% of our memory a few months back.

Here is the changelog:

Split the LRU lists in two, one set for pages that are backed by real file
systems (“file”) and one for pages that are backed by memory and swap
(“anon”). The latter includes tmpfs.

The advantage of doing this is that the VM will not have to scan over lots
of anonymous pages (which we generally do not want to swap out), just to
find the page cache pages that it should evict.

This patch has the infrastructure and a basic policy to balance how much
we scan the anon lists and how much we scan the file lists. The big
policy changes are in separate patches.

Btrfs Merged in Linux 2.6.29

Looks like Btrfs made it into 2.6.29.

Which is pretty sweet, because as Btrfs stabilizes it means a large porting effort down the road isn’t necessary.

What’s Btrfs? It’s a compelling ZFS alternative for Linux designed to work with Linux from day zero (as opposed to being ported from Open Solaris).

The most compelling feature for me is online snapshots. Maybe compression. Though to be honest, LVM snapshots seem to be working for us in production.

Experience with Infiniband

Anyone out there using Infiniband in production?

I’m curious if this stuff is coming down in price when compared to 10gE. The switch costs for 10gE are just insane so I imagine Infiniband is in a similar situation.

Another major advantage of Infiniband is the reduction in latency. Of course the Linux network stack will almost certainly add much of that latency back in – it’s apparently terrible for low packet latency applications.

Apparently, Linus has a new Intel drive he’s happy with:

In contrast, the Intel SSD does about 8,500 4kB random writes per second. Yeah, that’s over eight thousand IOps on random write accesses with a relevant block size, rather than some silly and unrealistic contiguous write test. That’s what I call solid-state media.

They are almost certainly using an internal log structured filesystem to see this type of performance.

Nice that we’re seeing actual numbers for this drive now.

For the last few weeks we’ve been running with the Split-LRU patch and 2.6.25.

So far so good…. our boxes don’t seem to page as much anymore and we’re no longer suffering from swap insanity – which essentially fixes the Linux swap problem.

However, it now seems that we’ve been hit by two additional swap related bugs which may or may not be influenced by the Split-LRU patch.

Linux lies about swap utilization:

This is a bit annoying as opposed to a hard stop… When you run free you end up seeing a few megabytes in swap. However, Linux is lying and the data isn’t actually being used.

What I think is happening is that there’s just an accounting bug. Linux is adding pages, then removing them, but not decrementing the counter.
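One way to cross-check that theory – at least on kernels new enough to expose a per-process VmSwap field in /proc/PID/status, which the kernels discussed here may not have – is to sum per-process swap usage and compare it to what free reports:

    # Sum per-process swap usage and compare it to the global counter.
    total_kb=0
    for f in /proc/[0-9]*/status; do
        kb=$(awk '/^VmSwap:/ {print $2}' "$f" 2>/dev/null)
        total_kb=$((total_kb + ${kb:-0}))
    done
    echo "per-process VmSwap total: ${total_kb} kB"
    free -k | awk '/^Swap:/ {print "free reports used:        " $3 " kB"}'

If the two numbers diverge wildly, that points at an accounting problem rather than real swap activity.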

Increased load in low memory configurations:

When we increased memory usage from 87% to 96% we saw radically increased load on the machine – so much so that SSH would not respond and the only solution was to physically reboot the machine.

Bumping the used memory back down to 87% fixed the problem. Of course on 32-128GB machines this is a non-trivial amount of memory to simply waste.

This might be an existing bug which we weren’t able to see before because Linux would swap so fast that this quickly hurt performance.

Any advice on resolving this load issue?

There doesn’t appear to be any documentation on how to configure XFS to stripe across a RAID array.

After about two hours of google searches I finally figured it out.

There are two variables – sunit and swidth:


sunit=value

This is used to specify the stripe unit for a RAID device or a logical volume. The value has to be specified in 512-byte block units. Use the su suboption to specify the stripe unit size in bytes. This suboption ensures that data allocations will be stripe unit aligned when the current end of file is being extended and the file size is larger than 512KiB. Also inode allocations and the internal log will be stripe unit aligned.


swidth=value

This is used to specify the stripe width for a RAID device or a striped logical volume. The value has to be specified in 512-byte block units. Use the sw suboption to specify the stripe width size in bytes. This suboption is required if -d sunit has been specified and it has to be a multiple of the -d sunit suboption.

So based on this here’s how to configure both values:

sunit = strip_size / 512      (strip_size being the per-disk chunk size, in bytes)
swidth = sunit * nr_raid_disks

In my situation I’m using RAID 0 across 5 disks so my sunit is 128 and my swidth is 640.

In hindsight I think the su and sw options would have been easier to compute.

su would just be 65536 (the strip size in bytes) and sw would just be 5 (sw is expressed as a number of stripe units – effectively the number of data disks – rather than in bytes).
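Putting that together for this 5-disk RAID 0 with a 64KiB strip size (the device name is a placeholder), either form should work:

    # Using sunit/swidth in 512-byte blocks:
    mkfs.xfs -d sunit=128,swidth=640 /dev/sdb1

    # Or the equivalent with su/sw (stripe unit in bytes, width in stripe units):
    mkfs.xfs -d su=64k,sw=5 /dev/sdb1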

In our tests this yielded a 30-40% performance boost so certainly worth the trouble.

XFS is supposed to do this internally by calling an ioctl on the device to obtain the stripe settings. It doesn’t appear that the megaraid driver that we’re using supports this and it came back as zero/zero for both sunit and swidth. I imagine if you’re using Linux software RAID that it will work just fine.

Spinn3r is hiring an experienced Senior Systems Administrator with solid Linux and MySQL skills and a passion for building scalable and high performance infrastructure.

About Spinn3r:

Spinn3r is a licensed weblog crawler used by search engines, weblog analytic companies, and generally anyone who needs access to high quality weblog data.

We crawl the entire blogosphere in realtime, remove spam, rank and classify blogs, and provide this information to our customers.

Spinn3r is rare in the startup world in that we’re actually profitable. We’ve proven our business model which gives us a significant advantage in future product design and expanding our current customer base and feature set.

We’ve also been smart and haven’t raised a dime of external VC funding, which gives us a lot more flexibility in terms of how we want to grow the company moving forward.


In this role you’ll be responsible for maintaining performance and availability of our cluster as well as future architecture design.

You’re going to need to have a high level overview of our architecture but shouldn’t be shy about diving into MySQL and/or Linux internals.

This is a great opportunity for the right candidate. You’re going to be working in a very challenging environment with a lot of fun toys.

You’re also going to be a core member of the team and will be given a great deal of responsibility.

We have a number of unique scalability challenges including high write throughput and massive backend database requirements.

We’re also testing some cutting edge technology including SSD storage, distributed database technology and distributed crawler design.


Responsibilities:

  • Maintaining 24 x 7 x 365 operation of our cluster
  • Tuning our MySQL/InnoDB database environment
  • Maintaining our current crawler operations
  • Monitoring application availability and historical performance tracking
  • Maintaining our hardware and linux environment
  • Maintaining backups, testing failure scenarios, suggesting database changes


Requirements:

  • Experience in managing servers in large scale environments
  • Advanced understanding of Linux (preferably Debian). You need to grok the kernel, filesystem layout, memory model, swap, tuning, etc.
  • Advanced understanding of MySQL including replication and the InnoDB storage engine
  • Knowledge of scripting languages (Bash and PHP are desirable)
  • Experience maintaining software configuration within a large cluster of servers.
  • Network protocols including HTTP, SSH, and DNS
  • BS in Computer Science (or comparable experience)


We’ve had our SSDs in production for more than 72 hours now. We’ve had them in a slave role for nearly a week but they’ve now replaced existing hardware including the master.

The drives are FAST. In our production roles they’re reading at about 45MB/s and writing at about 15MB/s while sitting at only about 22% disk utilization.

Not too bad.

We also have about 70GB free on these drives so that leaves plenty of room to grow.

There was one small problem that I didn’t anticipate.

When we were running our entire database out of memory we would only use one or two indexes per table. We had a set of reporting tasks which ran some queries once every 5 minutes.

The columns these queries were using didn’t have any indexes so InnoDB CPU would spike for a moment and continue.

Modern machines have memory bandwidth of about 15GB/s so these queries were mostly CPU bound but completed in a few seconds.

When we switched over to SSDs all of a sudden these queries needed to perform full table scans and were reading at about 100MB/s for two minutes at a time.

Fortunately, an ALTER TABLE later and a few more indexes fixed the problem.

We had dropped those indexes back when we were running everything out of memory, because the queries could be resolved so quickly. Now that the data was on disk again we had to revert to the olde school way of doing things.
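The fix itself was just a matter of putting indexes back on the columns those reporting queries filter on – something along these lines, with made-up database, table, and column names:

    # 'mydb', 'stats', and 'reported_at' are placeholder names.
    mysql -e "ALTER TABLE stats ADD INDEX idx_reported_at (reported_at);" mydb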

Fully RAM based databases are being used in more and more places. For a lot of use cases throwing ALL of your data into memory will have a major performance benefit.

But when should you use RAM vs SSD?

RAM is about $100/GB. SSD is about $30/GB.
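To put rough numbers on that: holding a 100GB working set entirely in RAM runs about $10,000 at those prices, versus about $3,000 on SSD.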

SSDs have a finite performance of about 100MB/s for reads. The only time you should throw money at an all-RAM setup is when you need to get more performance than a SATA disk can give you.

Of course, this is obviously going to depend on your application.


Another thought: 100MB/s is close to 800Mbit/s, which is close to 1Gbit/s. Gigabit ethernet is pretty much the only affordable networking option at the moment.

Even if you DID have a box that could saturate the gigabit link you’re not going to get any better performance by putting it all in RAM.


One could use dual ports, but that’s only going to allow you to use two SSDs.

You could use 10GbE but that’s going to cost you an arm and a leg!

We’re going to deploy with 3 SSDs per box. This gives us a cheaper footprint as we can run with a larger DB; however, I’ll never come anywhere near the full throughput of the drives if I’m on GbE.