Archive for the ‘open source’ Category

I’m not sure why REST needs defending but apparently it does

Dare steps in and provides a solid background, and Tim follows up.

What is really interesting about REST from my perspective (and not everyone will agree) is that you can actually solve real problems without getting permission from a standards board.

It’s pretty easy to distrust standards bodies. Especially new standards bodies. The RSS wars were a joke. Atom took far too long to become standardized.

At Wordcamp this weekend one of the developers was complaining about brain damage in XMLRPC – god only knows why we’re still using this train wreck.

I noted that WordPress should abandon XMLRPC and just use REST. They seem to be headed in that direction anyway (by their own admission).

Their feedback was that they didn’t want to go through the Atom standardization process to extend the format to their specific needs.

You know what? You don’t need permission. Just write a documented protocol taking into consideration current REST best practices and ship it.

If people use the spec and find value in it then it will eventually become a standard.

I’m a bit biased of course because Spinn3r is based on REST.

We burn about 40-50Mbit 24/7 indexing RSS and HTML via GET. We have tuned our own protocol stack within our crawler to be fast as hell and just as efficient.

Could I do this with SOAP? No way! I’ve had to gut and rewrite our HTTP and XML support a number of times now, and while it’s not fun, at least with REST it’s possible.

REST is complicated enough as it is… UTF-8 is not as straightforward as you would like. XML encoding issues do arise in production when you’re trying to squeeze maximum performance out of your code.

… and while REST is easy we STILL have customers who have problems getting up and running with Spinn3r.

We had to ship a reference client about six months ago that implements REST for our customers directly.

These guys are smart too… if they’re having problems with REST then SOAP/XMLRPC would be impossible.
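
For what it’s worth, the core of a REST client really is this small. Here’s a sketch in Java against a made-up endpoint (not the actual Spinn3r API, just a plain HTTP GET and a stream you hand to your XML parser):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class RestFetch {

    public static void main(String[] args) throws Exception {

        // Hypothetical endpoint and parameters for illustration only.
        URL url = new URL("http://api.example.com/feed?vendor=acme&limit=10");

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");

        if (conn.getResponseCode() != 200)
            throw new RuntimeException("HTTP error: " + conn.getResponseCode());

        // Read the XML response.  A real client would hand this stream to a parser.
        BufferedReader reader =
            new BufferedReader(new InputStreamReader(conn.getInputStream(), "UTF-8"));

        String line;
        while ((line = reader.readLine()) != null)
            System.out.println(line);

        reader.close();
    }
}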

In the past I’ve often used in-memory data structures (vs on disk) in situations where allocating say 5-10MB of data in the local VM is much better than continually hitting a database.

This has only worked because I’m not using very large data structures. My language classification code uses about 8MB of memory stored in a TreeMap for its trained corpus based on Wikipedia.

Recently, I’ve been playing with a much larger corpus for a new feature for Spinn3r.

It originally started off using 500MB which is WAY too much. Bumping it down to 350MB by using a TreeMap was a start but that’s nowhere NEAR far enough.

What’s happening is that Java itself has per-object overhead. In my analysis it takes 38 bytes for just a basic java.lang.Object.

This is way too much data. If you’re using 8-byte keys (64-bit ints) then using 46 bytes to represent them is nearly 6x additional overhead.

In my situation I’m seeing nearly 10x overhead. This 350MB memory image of my data structure can be represented as just 35MB on disk. This would be more than fine for my application.
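
A crude way to see this for yourself (the exact numbers will vary by JVM, heap settings, and 32 vs 64 bit) is to diff the heap before and after filling a map:

import java.util.TreeMap;

public class OverheadTest {

    public static void main(String[] args) {

        Runtime rt = Runtime.getRuntime();
        int entries = 1000000;

        System.gc();
        long before = rt.totalMemory() - rt.freeMemory();

        TreeMap<Long,Long> map = new TreeMap<Long,Long>();
        for (long i = 0; i < entries; ++i)
            map.put(i, i);

        System.gc();
        long after = rt.totalMemory() - rt.freeMemory();

        // The raw payload is 16 bytes per entry (two longs).  Everything beyond
        // that is boxing, object header, and TreeMap.Entry overhead.
        System.out.println("entries: " + map.size());
        System.out.println("bytes per entry: " + ((after - before) / (double) entries));
    }
}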

Which is when I had an idea – why am I using a pre-indexed data structure at all? Why am I keeping millions of copies of java.lang.String and java.lang.Integer instances?

Here’s what I’m going to do. I’m going to write a memory mapped file implementation of java.util.Map. Using the new NIO support for mmap files I’m going to write out a file with all the keys sorted, each followed by its data.

So the on-disk format will look like:

magic_number
nr_keys
key:data
key:data
...

I read the number of keys and then perform a binary search on the key region for my key based on its byte[] encoding.

The first pass will only work on fixed-width keys and values but it should be easy to add variable width (though this might require a bit more overhead).

This should yield a 10x memory savings with only a slight CPU hit while converting the key to a byte array and parsing the data byte array into an actual Java primitive.
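
Here’s a rough sketch of the lookup side. I’m assuming 4-byte header fields and 8-byte keys and values for illustration; the real thing would implement the full java.util.Map interface and handle the variable-width case:

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MmapLookup {

    private static final int HEADER = 8;    // magic_number (int) + nr_keys (int)
    private static final int RECORD = 16;   // 8-byte key followed by 8-byte value

    private final MappedByteBuffer buf;
    private final int nrKeys;

    public MmapLookup(String path) throws Exception {
        // Note: a single mapping is limited to 2GB, which is plenty here.
        FileChannel channel = new RandomAccessFile(path, "r").getChannel();
        buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        buf.getInt();            // magic_number
        nrKeys = buf.getInt();   // nr_keys
    }

    /** Binary search over the sorted, fixed-width key:data records. */
    public Long get(long key) {
        int low = 0, high = nrKeys - 1;
        while (low <= high) {
            int mid = (low + high) >>> 1;
            long candidate = buf.getLong(HEADER + mid * RECORD);
            if (candidate < key)
                low = mid + 1;
            else if (candidate > key)
                high = mid - 1;
            else
                return buf.getLong(HEADER + mid * RECORD + 8);
        }
        return null;   // not found
    }
}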

I just spent a few hours today setting up RAID on SSD to see what the performance boost would look like.

I’m very happy with the results but they’re not perfect.

Here are the results from comparing a single Mtron SSD vs a RAID 0 array running with a 64k chunk size.

Figure 1: Mtron SSD performance advantages with RAID.

The clear loser here is rndwr (random writes).

I’m pretty sure this has to do with the 64k stripe/chunk size. I’m willing to bet the RAID controller is deciding to write the entire chunk for small updates which would really hurt SSD performance since the ideal block size is 4k.

This is a MegaRAID controller so I need to figure out how to rebuild the array with a 4k stripe size.

I suspect I’ll see a 300% performance boost over a single Mtron drive but not much more. The random read numbers give us a 550% performance boost but I suspect this has to do with the buffer on the drive since we now have 3x the on-drive memory buffer.

It looks like Fusion IO has published more numbers about their SSD flash storage devices.

For starters, the price is totally reasonable. $2400 for an 80G device. This is $30/GB which puts it at roughly 2x the price of the Mtron per GB. Not too bad.

The raw performance numbers seem amazing:

Figure: Fusion IO’s published performance numbers.

However, these can be highly misleading. They don’t cite random write numbers and only quote 8k packets.

Here’s how you break apart these numbers.

‘Sequential’ here just means they’re doing ‘random’ 8k packets back to back. At 600MB/s this works out to 73K random IOPS. The size can be anywhere from 512B up to 8k. So let’s assume we’re FULLY random and doing 512B writes.

This puts us at only 35MB/s.

Still pretty impressive. The Mtron can only do 7.5MB/s random writes, which makes the Fusion IO about 4.6x faster for only 2x the cost.
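
Spelling out the back-of-the-envelope math above:

600 MB/s ÷ 8 KB per IO ≈ 73,000 IOPS
73,000 IOPS × 512 B ≈ 35-37 MB/s
35 MB/s ÷ 7.5 MB/s ≈ 4.6x the Mtron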

The REAL question is how you access the device. I assume they wrote their own kernel driver and it exposes itself as some kind of MTD, memory, or block device.

I’d want that driver to be OSS though.

Big DBA Head has run some independent MySQL benchmarks with the Mtron SSD drives that I’ve been playing with.

Great to see that we’re coming to the same conclusions. It’s nice to have your research validated.

Run time dropped from 1979 seconds on a single raptor drive to 379 Seconds on the standalone Mtron drive. An improvement of over 5X. Based on the generic disk benchmarks I would have thought I would have seen slightly lower runtimes.

I think he might be missing out on one advantage. If you’re doing more than 50% writes you would probably see lower throughput, but you can have a MUCH larger database, and when you NEED to do random reads you can easily get access to your data – and quickly!

An HDD prevents this since you’re saturating the disk IOPS with writes, which prevents you from doing any reads.

This allows you to take the money you’re paying for memory and get 10x more for a slight performance hit.

I suspect these problems will be resolved in the next six months.

There are a few possibilities for solving this issue:

* Someone will write a Linux block driver that implements a log structured filesystem. You can then export the SSD as an LSFS and re-mount it as XFS. The random write performance will then soar and you’ll have features of XFS like constant time snapshots.

* Log structured databases like PBXT will be tuned for SSD which will increase the performance.

* Someone could patch InnoDB to handle 4k pages. I tried to compile InnoDB with 8k pages and it just dumped core. I think the performance of InnoDB will really shine on SSD once this has been fixed. One other potential problem is that during my tests InnoDB became CPU bound when working with the buffer pool. I’m not sure what the problem here is though.

* SSD vendors will implement native LSFS on the drives themselves. This will also help with wear leveling and help negate the problems with the flash translation layers in the drives. I suspect STEC is already doing this.

* No Flash Translation Layer. Instead of a block device they could be exported as MTD devices. This could boost performance with filesystems that were MTD aware.

* Raw random write IOPS performance upgrades on the drives themselves. Instead of only 180 random write IOPS we could see drives doing 1k random write IOPS in Q2.

I spent some more time today comparing InnoDB and MyISAM on SSD.

I increased the data/cache ratio by about 5X. I allocated 1G of memory for MyISAM or InnoDB (restarting MySQL after each test). The resulting on-disk images are 6G for MyISAM and 7G for InnoDB.

This is on a 30M row DB which stores ints and char data.
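
I won’t reproduce my exact command lines, but a sysbench OLTP run of this shape looks roughly like the following (0.4-style flags, so treat it as a sketch):

sysbench --test=oltp --mysql-table-engine=innodb --mysql-user=root \
  --oltp-table-size=30000000 prepare

sysbench --test=oltp --mysql-table-engine=innodb --mysql-user=root \
  --oltp-table-size=30000000 --num-threads=16 --max-requests=100000 run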

I’m primarily testing the theory that SSD could be used to get near in-memory performance by using cheap SSDs since they can do lots of random reads.

MyISAM would clearly outpace InnoDB if I were to perform the initial ‘prepare’ (bulk insert) stage in multiple processes. MyISAM became CPU bottlenecked, which ended up slowing the write rate.

InnoDB on the other hand had the SSD at 100% utilization. I’m not sure why. It could either be an issue with the 16k page size or the write ahead log.

A 3x performance boost is more than acceptable here especially when you consider you can create a 96G SSD RAID array for the same price as 8G of RAM.

I might end up recompiling MySQL with an 8k and 4k page size for InnoDB just to see if it makes any difference.

Further, I might spend some time trying to figure out why InnoDB is so slow at performing the initial INSERTs.

Figure 1: MySQL performance time for sysbench for inserting and performing OLTP queries on the data. Times are in minutes (lower is better)

I’ve been reviewing the random write performance with SSDs over the last few days and have a few updates on their performance numbers.

It turns out that SSDs themselves need to handle random write IO intelligently to obtain ideal performance numbers. Due to the erase block latency on NAND flash, performance can start to suffer when your database does lots of random writes. OLTP applications REALLY suffer from this problem since databases tend to think their underlying storage system is a normal hard drive.

Some vendors like STEC claim that their SSDs can do high random write IOPS natively. This certainly has nothing to do with the underlying NAND flash but rather their use of an intelligent write algorithm.

So really it’s not a resource problem as much as it is an IP problem.

The NAND on these drives is pretty much the same; it’s just that we’re not doing a good job of interfacing with it.

Log structured filesystems can come into play here and seriously increase performance by turning all random writes into sequential writes. The drawback is that reads will then be fully random. For SSD this isn’t as much of an issue because random reads are free. The Mtron we’ve been benchmarking can do 70MB/s random reads.

To test this theory I threw both nilfs and jffs2 at the Mtron to see what performance looked like.

It turns out that nilfs performs really well in the sysbench random write benchmark. It completed all IO 10x faster than the internal HDD, which was a welcome sign. In practice it was continually writing at 50MB/s and was able to complete tests in 13 seconds vs 8.5 minutes for our HDD.

While random writes look good, it failed at the sysbench OLTP benchmark. I’m not sure why. In theory it should work fine since all blocks should be read quickly and then written to the end of the disk sequentially. This could be a problem with the nilfs implementation, the fact that erase blocks weren’t properly aligned, or a strange interaction issue with their continuous snapshot system.

Jffs2 also looked interesting but this is designed to work with the Linux MTD driver and not a block driver. It has a number of issues including bugs in implementation and the fact that it doesn’t log numbers into /proc which breaks iostat (at least with the block2mtd driver).

The biggest problem, I think, is the fact that all existing Linux log filesystems are designed for use with MTD devices, not block devices. This means all existing code won’t work out of the box.

To add insult to injury many of these filesystems require custom patched kernels which makes testing a bit difficult.

Update: Bigtable and append only databases would FLY on flash. Not only that but databases can easily be bigger than core memory because the random reads would be fast as hell.

I’m very jealous.

I’ve now had about 24 hours to play with the Mtron SSDs and had some time to benchmark them.

The good news is that the benchmarks look really solid. The drive is very competitive in terms of performance. I’m seeing about 100MB/s sequential read throughput and 80MB/s sequential write throughput.

I’ve had some time to benchmark them and they’re really holding up.

The bad news is that they can only do about 180 random writes per second. Here are the raw performance numbers from Mtron’s data sheet:

Figure: Raw performance numbers from Mtron’s data sheet.

I spent a lot of time reviewing this drive and didn’t notice this before.

The Battleship Mtron review went over this as well but didn’t spend much time on it:

Although they do perform astounding in random read operation, random write is still very sub-par on flash technology. Even though we are not benchmarking random write IOP’s I will give you some quick insight. Write performance is not yet a perfect and refined process using NAND flash and you will not have a drive that is going to write file operations as well as a high end U320 15K SCSI or SATA 10K setup. There is a company that I have been talking with directly about this NAND flash write issue called EasyCo in PA, USA. They are working on a process called MFT technology and they offer a simple MFT driver that is claiming to increase random write IOP’s on a single drive up to 15,000 IOP’s. Doug Dumitru had explained to me this technology will take your standard Mtron 16GB Professional drive and turn it into an enterprise behemoth.

I spent some time to see what EasyCo was up to and came across their Managed Flash Technology:

Managed Flash Technology (MFT) is a patent pending invention that accelerates the random write performance of both Flash Disks and Hard Disks by as much as a thousand fold.

It does this by converting random writes into chained linear writes. These writes are then done at the composite linear write speed of all the drives present in the file volume, subject only to the bandwidth limits of the disk control mechanism. In practice, even with as few as three drives present, this can result in the writing of as many as 75,000 4-kilobyte blocks a second.

As a result, MFT can dramatically improve the real-time performance of asymmetric storage devices such as Flash disks by making reads and writes symmetrical. Here, flash disk performance is typically improved 10 to 30 times, making some of these 60 times as fast as the fastest hard disk. Finally, it is possible to make clusters of as few as 20 flash drives run collectively as fast as RAM does but with a much larger storage space than RAM can practically have.

The question is what are they doing to get such substantial performance?

Here’s what I think is happening.

From what I’ve read they take a normal Mtron drive and install a new Linux kernel module which they use to interface with the drive. They then use a normal write ahead log and keep data in memory (probably something like a 500M buffer) and a binary tree of the block offsets. When the buffer fills they then take the data in memory, sort the results by offset, and apply the buffer to disk sequentially.

If the box crashes they have an on disk log that they apply, probably when the drive is first mounted.

Basically, a flash-aware write ahead log.
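
In spirit I imagine it looks something like this toy sketch (my guess at the mechanism, certainly not EasyCo’s actual code):

import java.io.RandomAccessFile;
import java.util.Map;
import java.util.TreeMap;

/**
 * Toy sketch of a flash aware write buffer: random block writes are logged,
 * buffered in memory, and applied to the device sorted by offset.
 */
public class BufferedBlockWriter {

    private static final int BLOCK_SIZE = 4096;
    private static final int MAX_BUFFERED = 128 * 1024;   // ~512MB of 4k blocks

    // offset -> most recent contents for that block (TreeMap keeps them sorted)
    private final TreeMap<Long,byte[]> buffer = new TreeMap<Long,byte[]>();

    private final RandomAccessFile log;
    private final RandomAccessFile device;

    public BufferedBlockWriter(String logPath, String devicePath) throws Exception {
        log = new RandomAccessFile(logPath, "rw");
        device = new RandomAccessFile(devicePath, "rw");
    }

    public void write(long offset, byte[] block) throws Exception {
        if (block.length != BLOCK_SIZE)
            throw new IllegalArgumentException("expected a 4k block");

        // 1. Append to the write ahead log so a crash can be replayed.
        log.writeLong(offset);
        log.write(block);

        // 2. Buffer in memory, keyed (and therefore sorted) by offset.
        buffer.put(offset, block);

        if (buffer.size() >= MAX_BUFFERED)
            flush();
    }

    /** Apply the buffered blocks to the device in ascending offset order. */
    public void flush() throws Exception {
        for (Map.Entry<Long,byte[]> entry : buffer.entrySet()) {
            device.seek(entry.getKey());
            device.write(entry.getValue());
        }
        // A real implementation would fsync the device before truncating the log.
        buffer.clear();
        log.setLength(0);
    }
}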

Fortunately, InnoDB has a write ahead log internally so this should save us from needing to run a custom kernel module. Any database with a write ahead log should be more than competitive.

I wrote a simple benchmarking utility (see Figure 1 below) to simulate an InnoDB box performing thousands of random reads and one sequential write.

The benchmark consists of 3500 dd processes running in the background reading from the SSD and writing to /dev/null. I then have one sequential write performing in the foreground writing out about 5G of data to the SSD.

The HDD holds up really well when compared to the SSD, which should have an unfair advantage. So much so that I think the Linux scheduler is interfering with my benchmark. I think what’s happening is that the first few dd’s start reading in parallel and block the remaining processes. This continues with 5-10 concurrent readers until the entire chain of 3500 completes.

I’m going to rewrite the benchmark to create one large 10G file and randomly read 10G from misc locations.
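
The rewritten random-read pass will probably look something like this little Java sketch (path and sizes are placeholders, and the reads go through the page cache unless the file is much bigger than RAM):

import java.io.RandomAccessFile;
import java.util.Random;

public class RandomReadBench {

    public static void main(String[] args) throws Exception {

        // Placeholder path to the pre-created 10G test file.
        RandomAccessFile file = new RandomAccessFile("/ssd/bench.dat", "r");

        long fileSize = file.length();
        byte[] block = new byte[4096];
        Random random = new Random();

        int reads = 100000;
        long start = System.currentTimeMillis();

        for (int i = 0; i < reads; ++i) {
            // Pick a random 4k-aligned offset and read one block.
            long offset = (long) (random.nextDouble() * (fileSize - block.length));
            offset = offset / 4096 * 4096;
            file.seek(offset);
            file.readFully(block);
        }

        long elapsed = System.currentTimeMillis() - start;
        System.out.println(reads + " random 4k reads in " + elapsed + "ms");

        file.close();
    }
}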

As you can see, while the SSD is very fast it’s only about 2.5x faster than the HDD. I’d expect it to be about 20-40x faster.

Figure 1. Performance of SSD vs HDD (measured in seconds)

I’ve been reviewing our settings for InnoDB prior to testing our new SSD drives later this week.

Here are some initial thoughts:

* Both sync_binlog and innodb_flush_log_at_trx_commit should be enabled. The extra seeks required aren’t really an issue on SSD and the extra reliability is worth the slight performance hit. (There’s a config sketch after this list.)

* Read ahead for InnoDB and the underlying block driver should probably be disabled. There’s no sense reading another 512 blocks on SSD. You can get the IO quickly enough, so why slow yourself down potentially reading content you don’t need? InnoDB uses a heuristic algorithm for read ahead but the best it can do is match the performance the SSD gives you anyway. At the very minimum, disabling disk-based read ahead is probably a good idea.

* If your database is primarily small fixed size rows it might make sense to recompile using a smaller block size. SSD performance seems to be a function of write size. If you constantly need to write 16k where 90% is re-written content you’re going to see an effective 4x slowdown. Jeremy Cole mentioned that changing this would bloat the DB. I’ll have to experiment. I’m also going to have to figure out if O_DIRECT can be used with less than 16k block sizes. I don’t think it can.

* The new thread concurrency stuff in MySQL 5.1 is probably going to be very important. There’s no reason multiple concurrent threads shouldn’t be able to mutate and access the DB in parallel since we’re no longer really bound by disk seeks. Letting the DB go full out seems like a big win. This is going to require MySQL 5.1 though which should be available any year now (*cough*).
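
To make the first couple of bullets concrete, the my.cnf fragment I’m planning to start from looks something like this (illustrative values only; the block device read ahead piece is an OS-level tweak rather than a MySQL setting):

# my.cnf (sketch)
sync_binlog = 1
innodb_flush_log_at_trx_commit = 1
innodb_flush_method = O_DIRECT

# OS level: disable block device read ahead (the device name is just an example)
# blockdev --setra 0 /dev/sdb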

Given all this, I think performance will still be outstanding for InnoDB on SSD, but there will probably be a good deal of variability in performance.

It looks like there’s another competitive SSD on the market. The Stec Zeus IOPS.

I foolishly dismissed this drive before because I thought they weren’t disclosing their write rate (which is what all the other vendors do to lie about their performance).

Turns out they’re claiming 200MB/s with 100MB/s write throughput. If these numbers are accurate then this would be 2x faster than the Mtron SSDs.

Storage Mojo has additional commentary. They’re comparing these drives to the RamSan which is not a fair comparison since this is a DRAM based SAN device. The RamSan-500 should trounce everything on the market but the pricing is astronomical.

The key win for SSDs is that they’re cheap and will soon be commodity. By mid-2008 I imagine 20% of the laptop market will be using SSDs and vendors like Toshiba, Samsung, Stec, and Mtron will be feverishly attacking each other in the enterprise market.

The key here with the Zeus will be price per GB. The Mtrons are about $15 per GB which is the price point I’m looking at for real world horizontal/diagonal scaled applications.

The Mailinator guys blogged about how they were using a modified Aho-Corasick style multiple string pattern matching algorithm to index 185 emails/s.

Aho-Corasick takes all the search strings and builds them into a trie so that it can scan the whole document for N strings in one pass.

The problem is that Aho-Corasick doesn’t support more advanced constructs such as string jumping made possible by Boyer-Moore.

Boyer-Moore says that it would be more intelligent to search from the end of a word. This way when you find a mismatch it can figure out a ‘jump’ and skip characters! This has the counterintuitive property of being able to index the document FASTER when the word you’re searching for gets longer.

Crazy huh.
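
To make the ‘jump’ concrete, here’s a minimal Boyer-Moore-Horspool search in Java, the simplified variant that only keeps the bad-character skip table:

public class Horspool {

    /** Returns the index of the first occurrence of needle in haystack, or -1. */
    public static int indexOf(byte[] haystack, byte[] needle) {

        int[] skip = new int[256];

        // By default a mismatch lets us jump the full length of the needle.
        for (int i = 0; i < 256; ++i)
            skip[i] = needle.length;

        // Bytes inside the needle only allow smaller, alignment-preserving jumps.
        for (int i = 0; i < needle.length - 1; ++i)
            skip[needle[i] & 0xff] = needle.length - 1 - i;

        int pos = 0;
        while (pos <= haystack.length - needle.length) {

            // Compare from the END of the needle, per Boyer-Moore.
            int i = needle.length - 1;
            while (i >= 0 && haystack[pos + i] == needle[i])
                --i;

            if (i < 0)
                return pos;   // full match

            // Jump based on the haystack byte aligned with the needle's last byte.
            pos += skip[haystack[pos + needle.length - 1] & 0xff];
        }

        return -1;
    }
}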

Now if only you could support Boyer-Moore jumping with Aho-Corasick style single pass indexing.

Turns out that’s already done.

Wu-Manber is another algorithm which combines the advantages of both: it can index the document in a single pass like Aho-Corasick while also skipping over characters using a jump table.

Spinn3r uses Aho-Corasick for one of our HTML indexing components. It turns out that Wu-Manber won’t give us much of a speed boost because HTML comments begin and end with only three characters. It’s a 3x performance gain to migrate to Wu-Manber but our Aho-Corasick code is proven and in production.

We’re also no longer CPU bound so this isn’t on my agenda to fix anytime soon.

The entire Internet is buzzing today about Sun buying MySQL.

… we’re putting a billion dollars behind the M in LAMP. If you’re an industry insider, you’ll know what that means – we’re acquiring MySQL AB, the company behind MySQL, the world’s most popular open source database.

Both sets of customers confirmed what we’ve known for years – that MySQL is by far the most popular platform on which modern developers are creating network services. From Facebook, Google and Sina.com to banks and telecommunications companies, architects looking for performance, productivity and innovation have turned to MySQL. In high schools and college campuses, at startups, at high performance computing labs and in the Global 2000. The adoption of MySQL across the globe is nothing short of breathtaking. They are the root stock from which an enormous portion of the web economy springs.

This was a good play by Sun. MySQL is used by a number of major companies and Web 2.0 startups (including Spinn3r, Digg, Technorati, Google, I could go on).

This is going to yield a great halo effect for Sun. MySQL customers need a solid OS and now they’re going to have one – Open Solaris. They’re going to need a solid filesystem – ZFS. They’re going to need a decent storage array. Sun just happens to have one they’re interested in selling you.

Rumor has it that Sun is even going to ship a new SSD device which I’d love to purchase, but it doesn’t look like it will be released on our timeframe.

This isn’t all bells and roses though. I’m a bit concerned that this could cause MySQL to defocus their support for Linux. Of course I’m sure that more than 50% of their customer base is running Linux so this might not be too easy.

This also puts MySQL in a good home. I’ve been worried that the rumored IPO would hurt them by changing their focus. Oracle was always a looming threat and a purchase by RedHat would be a disaster.

Schwartz is doing a great job running Sun. Since he’s taken the reins he’s really turned the ship around and brought it under control.

If they keep this up I might just end up becoming a customer.

I bought an Ergotron desktop mount for my laptop the other day and finally had time to install it last night.

Here’s my resulting setup:

Figure: My resulting desk setup with the Ergotron mount.

So far I’m pretty happy with the results. The hardware was pretty affordable – $150 with shipping.

It allows me to keep my laptop closer, and at eye level, and I actually have my mouse under my laptop which is a bit confusing at times.

The only pet peeve I have is that occasionally the laptop can shake while I’m typing. I wish they had a screw tightening system so that I could fix the arm in one position at the hinge to prevent it from moving while I type.

I haven’t fully figured out what I want to do with the second display. Sometimes it’s nice to play an old DVD like Star Wars while I hack on code. It would also be nice to have some sort of just in time source code visualization.

I’m also going to have to upgrade to the 30″ LCD soon which should be interesting as well.

I’m very excited to announce that Spinn3r 2.1 is now available.

A number of major new features have been implemented in this release which has taken us more than three months of hard work to get out the door.

We’ve also finished up another stage of our backend and are planning on buying a few more toys in 2008 which should make things interesting moving forward.

Read the full post on the Spinn3r blog.

Today, I’m announcing a new meta podcast about designing scalable systems named Scalecast.

I’ve been seeing more and more interesting conference presentations online but I can’t get them to work with my iPhone/iPod since they require streaming Flash video.

I now have a simple script that can fetch the Flash video from YouTube, transcode it to mp4 video, including AAC audio, and publish it to WordPress. It’s then available for use on any Apple device including legacy iPods and more modern iPhones.
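
The script boils down to something like this (a sketch only: it assumes youtube-dl and an ffmpeg build with x264 and AAC support, and the URL is a placeholder):

# fetch the Flash video from YouTube
youtube-dl -o talk.flv "http://www.youtube.com/watch?v=VIDEO_ID"

# transcode to an iPod/iPhone friendly MP4 with AAC audio
ffmpeg -i talk.flv -vcodec libx264 -acodec libfaac talk.mp4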

If you have a suggestion for a video to include in this podcast, just add it as a comment on this post and I’ll try to transcode it for you.

I primarily did this just for myself. I need to be able to view these videos for my work and my primary mechanism for doing so is my iPhone.

This LA Times article is a great xmas present for the MacAskill Family:

The company now employs 28 people — all MacAskills, family friends and SmugMug customers they hired — in five countries. The MacAskills have signed up more than 100,000 paying subscribers despite mounting competition from free services, in part by emphasizing their family-friendly approach. They post their own family photos and home videos on the website, spend countless hours chatting up their users in the company’s online forum and send lively customer service e-mails such as “Who loves you, baby?”

Our industry is far too much of a monoculture. There are only two types of companies – VC-funded startups and BigCos like Google, Yahoo, and Microsoft.

We don’t talk about it much but there’s too much corruption here. We need more companies like SmugMug that pop up and break the mold.

I’d like to think Spinn3r is helping to solve the problem but I won’t be that arrogant …

Our talk on the Spinn3r web crawler architecture entitled “Scaling MySQL and Java in High Write Throughput Environments” has been accepted at the 2008 MySQL Conference.

This is really exciting because we’re also hoping to Open Source more components before April.

We present the backend architecture behind Spinn3r – our scalable web and blog crawler.

Most existing work in scaling MySQL has been around high read throughput environments similar to web applications. In contrast, at Spinn3r we needed to complete thousands of write transactions per second in order to index the blogosphere at full speed.

We have achieved this through our ground-up development of a fault tolerant distributed database and compute infrastructure all built on top of cheap commodity hardware.

We’ve built out a number of technologies on top of MySQL that help enable us to easily scale operations.

We’ve implemented an Open Source load balancing JDBC driver named lbpool. Lbpool allows us to loosely couple our MySQL slaves, which allows us to gracefully handle system failures. It also supports load balancing, reprovisioning, slave lag, and other advanced features not available in the stock MySQL JDBC driver.

We’ve also built out a sharded database similar to infrastructure built at other companies such as Google (Adwords) and Yahoo (Flickr). Our sharded DB has a number of interesting properties including ultra high throughput requirements (we process 52TB per month), distributed sequence generation, and distributed query execution.
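
As a purely illustrative sketch of the sharding idea (not our actual schema or code), rows get routed to a shard by key and IDs are handed out in per-shard sequences so they never collide across shards:

import java.util.concurrent.atomic.AtomicLong;

public class ShardRouter {

    private final int nrShards;
    private final AtomicLong[] sequences;

    public ShardRouter(int nrShards) {
        this.nrShards = nrShards;
        this.sequences = new AtomicLong[nrShards];
        for (int i = 0; i < nrShards; ++i)
            sequences[i] = new AtomicLong(0);
    }

    /** Which shard a given key (say, a URL hash) lives on. */
    public int shardFor(long key) {
        return (int) ((key % nrShards + nrShards) % nrShards);
    }

    /**
     * Distributed sequence: IDs from different shards never collide because
     * id % nrShards always equals the shard number.  In the real system the
     * counter would live in the shard's database rather than in memory.
     */
    public long nextId(int shard) {
        return sequences[shard].getAndIncrement() * nrShards + shard;
    }
}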

Nick Carr is predicting Google will open its infrastructure to the world:

The article also includes an interesting, if ambiguous, passage in which Eric Schmidt implies that Google will rent out its supercomputer to outside developers and businesses the way that Amazon.com does through Amazon Web Services:

“Schmidt won’t say how much of its own capacity Google will offer to outsiders, or under what conditions or at what prices. “Typically, we like to start with free,” he says, adding that power users “should probably bear some of the costs.” And how big will these clouds grow? “There’s no limit,” Schmidt says. As this strategy unfolds, more people are starting to see that Google is poised to become a dominant force in the next stage of computing. “Google aspires to be a large portion of the cloud, or a cloud that you would interact with every day,” the CEO says.”

You can read this full article over on Business Week.

Where have we heard that before? Oh, that’s right. I told you about this back in September:

An audience member went up to the microphone and asked if Google had plans to provide BigTable, GFS, and MapReduce to the public as a web service. Larry looked RIGHT at Jeff Dean as if to say “if only they knew what we know”. I was in Larry’s direct line of sight so the look was plain as day.

It seems inevitable that Google will provide a similar feature (especially with Amazon doing it) but I think the main issue is a question of time.

Not to be outdone, Amazon announced SimpleDB today.

Amazon SimpleDB is a web service for running queries on structured data in real time. This service works in close conjunction with Amazon Simple Storage Service (Amazon S3) and Amazon Elastic Compute Cloud (Amazon EC2), collectively providing the ability to store, process and query data sets in the cloud. These services are designed to make web-scale computing easier and more cost-effective for developers.

Techcrunch thinks you should fire all your DBAs. Nitin is impressed too.

I’m still very skeptical about ALL of this.

First. Last time I checked, Amazon’s bandwidth pricing was insane. It would literally cost us 3x more to host Spinn3r. Granted, we process a LOT of data (from 60-160Mbits per month) but when your startup is successful you don’t want to burn all your AdSense revenues on bandwidth invoices courtesy of Amazon.

Second. Real-world applications are VERY complex. These systems are going to work out very well when you’re inside the firewall. What if EC2 starts to fall over for you? Now you’re STUCK at Amazon because you can’t port to an alternative database. API calls will also be highly latent.

This is going to work out for very early stage startups though. It’s also going to work out well for startups like Powerset who have a high ratio of compute time to work unit size (small work unit, tons of compute time). They can just send EC2 a small bit of work and wait until the result comes back.

However, what if you’re the next YouTube? You decide to host on Amazon and then all of a sudden Google comes knocking to acquire you. Now what? It’s going to be a VERY hard sell for Google. All your data is already on Amazon. They’re going to have to move it off. Worse, Amazon can come in and underbid Google because the application is PERFECT for them (it’s already on their infrastructure) and they know you can’t really switch.

No. What this does is put a lot more pressure on Sun.

Are you listening Jonathan Schwartz?

I don’t want Google Web Services. I don’t want Amazon Web Services.

I want raw machine power. I want root. I want to run my own databases. I want my machines racked together on 10Gbit. I want real HDDs (or even SSD). I want new machines provisioned within hours and MOST of all I want to LEASE the hardware.

Sun’s Startup Essentials program is GREAT but I have to pay cash for the hardware. You know what? I’d rather pay 5-15% more and lease my machines. Why? I’m a startup – limited resources. I want to use this cash to hire hotshot Java engineers.

The historical problem with leasing is that companies like Dell and Sun have to run credit checks on your new startup. News flash. Small startups that have been in business for 3 months don’t have credit yet.

Solution? Simple. Just lease me the machines but don’t EVER give me physical access to them. If I don’t pay – yank access and give them to another customer.

Who does this now? Serverbeach is doing a great job for us. They don’t have a lot of competition though. Rackspace is their only major competitor but in their infinite stupidity they have refused to support Debian.

The truth here is that there’s still a huge market in hardware. Companies like Technorati, Digg, Powerset, Spinn3r, etc. will NEVER trust the majority of their compute infrastructure to a large potential competitor.

Check out this performance analysis of the recent Mtron 16G drives

Here are some key things to take away:

* You can fully saturate the drive with 4k reads and still get 100MBps throughput.
* The disk can STILL do 80MB/s write throughput
* It’s cheap – $400

So, if you were to buy a new server, would you rather spend $1500 on 16G of memory or 60G of SSD?

Think about it. With ~4 SSDs you can get about 400MB/s sequential reads.

Even if you’re on a 4 core system the CPU is going to have a HARD time keeping up with whatever task you want to do.

You’re no longer seek bound.

So why do you even need memory any more?

Just buy 4-8 core machines with 2G of ram and put the rest of the cash into SSD.

Granted, there are some applications with HUGE data requirements. Say you need a 1T database. Would you rather buffer it with 16G of memory or 60G of SSD?

At 400MB/s I’m willing to bet the SSD benchmarks would look pretty solid.

How would you accomplish this though? Mount the SSDs as swap :)

Update: One problem here. The SSD could only be used as a disk device, which means that Linux buffers and the page cache can’t use the device, so DBs that rely on filesystem caching would really suffer (though InnoDB would work just fine).

Maybe the next step is to KEEP these as ram devices to begin with.

We’ve been doing a lot of tuning to Spinn3r lately and find that we’ve seriously improved the CPU performance of our application. So much so that we’re 100% disk bound and find that we have 40 1.8GHz Opteron processors free to do any additional work.

Of course, I’m sure we’ll find something. There are plenty of other information retrieval tasks we could perform for our clients.

Until then maybe we’ll play with folding@home. :)

Check out the performance boost in the graph below. This is just for one portion of our robot cluster. The graph shows percentage free for each core, so a value of 3000 means 30 cores’ worth of free CPU.

200712052012