
This chapter aims to help prioritize spending when planning the purchase of a new server intended to run PostgreSQL. If you already have a running database server, following the instructions in Chapter 3, Database Hardware Benchmarking and Chapter 8, Database Benchmarking might be a more appropriate place to start than here. Ultimately, you may end up taking a few round trips alternating between that material and what's covered here. For example, if you benchmark your server and the disks seem slow, the background here might give you an idea of what hardware change you could make to improve that. Once that's done, it's back to benchmarking again; repeat until performance matches expectations.
One of the reasons that working with open-source databases such as PostgreSQL can be so effective is that every dollar you save on software licensing can be put toward better hardware instead. The three main components you'll need to balance in your budget are CPUs, memory, and disks, with the disk controller as a related and critical part too.
Currently, available processors are bundling at least two and possibly as many as eight cores into each CPU, making the core count the figure of merit for most database applications. There are two basic decisions you need to make while deciding which CPU solution would best match your database application:
- Which processor family? Nowadays, this normally boils down to choosing among the various 64-bit product lines from Intel or AMD, although there are some other less popular choices still floating around (Itanium, SPARC, and so on).
- Do you get more cores or faster cores?
These choices are sometimes more tied together than you might think. Currently, Intel has a lead in delivering the fastest individual processor cores, often due to faster transfers between the processor and system RAM, but those processors and their related parts are more expensive too. If what you want instead is more cores per dollar, that's where AMD is stronger; its server-class designs also do a good job of making the best of the memory available to each core. AMD also has a better history of making its fastest processors available in configurations with many sockets, for when you want to put more than two physical CPUs into a single server.
The best way to figure out which class of database application you have, more cores or faster cores, is to monitor an existing server using tools such as top. If a small number of processes are each keeping a single CPU busy, that's the sort of workload where faster cores are better. That tends to happen if you have giant queries running in a batch fashion, for example when large quantities of data need to be sorted to deliver any single report. But if all the CPUs are active with many more concurrent processes instead, you'd likely benefit more from additional cores. That's normally what you'll see in applications with a larger user count, such as databases backing web applications.
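As a minimal sketch of that sort of monitoring, the following shell commands point top at just the PostgreSQL backends on a Linux system; the process name postgres and the use of pgrep are assumptions that may need adjusting for your platform and packaging:

```
# List the running PostgreSQL processes, then watch only those in top,
# showing full command lines so individual backends can be told apart.
# Note that top -p only accepts a limited number of PIDs (20 on many builds).
pgrep -l postgres
top -c -p "$(pgrep -d ',' postgres)"
```

If most of the CPU time is concentrated in one or two backends, that points toward faster cores; CPU time spread evenly across many backends points toward more of them.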
If you don't have the benefit of a working system to inspect, you can try to guess which type of situation you're in by noting the limitations of the database. PostgreSQL does not allow splitting a single query across more than one core, what's called parallel query by some other databases that support it. That means that if you have any one query or small number of queries that must run as fast as possible, the only way to do that is to prioritize getting faster cores.
Another situation where faster cores are the better choice is if you need to prioritize data loading or export. PostgreSQL's best performing data import method, COPY, can easily become (but isn't always) limited by CPU performance, where that turns into the bottleneck for the operation. While it's possible to split input files into pieces and load them in parallel, that's something you'll need to build or acquire yourself, rather than something the server knows how to do for you. Exporting a copy of the database using the pg_dump utility is another example of something that can become CPU limited on some systems.
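Splitting and loading in parallel can be as simple as a shell loop. Here's a minimal sketch, assuming a hypothetical measurements table already exists, the input is CSV, and the database is named mydb; the split size is just a placeholder:

```
# Break the input into one-million-line pieces, then run one COPY session
# per piece in the background so several CPU cores share the parsing work.
split -l 1000000 measurements.csv part_
for piece in part_*; do
    psql -d mydb -c "\copy measurements FROM '$piece' CSV" &
done
wait
```

Each psql session gets its own server backend, so the CPU cost of parsing and inserting rows is spread across cores even though any single COPY remains bound to one.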
How much to prioritize memory for your application really depends on the size of the working set of data needed to handle the most common operations. Generally, adding more RAM will provide a significant performance boost. There are a few situations where you'd be better served doing something else instead:
- If your data set is small enough to fit into a smaller amount of RAM, adding more won't help you much. You probably want faster processors instead.
- When running applications that scan tables much larger than what you can feasibly purchase as RAM, such as in many data warehouse situations, you might be better served by getting faster disks rather than more memory.
The normal situation where more memory helps most is when the data you access frequently will fit with the larger amount, but not with the smaller. This happens more often than you might think because of the way database B-tree indexes are stored. Even if you can't fit the entirety of a table or even its index in memory, being able to store a good sized fraction of the index can mean that index-based data lookups will be significantly sped up. Having the most "popular" blocks from the top of the tree structure cached is helpful even if you can't fit all the leaves into memory too.
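One quick way to gauge whether an index has a chance of staying cached is to compare its size to your RAM. Here's a minimal sketch using psql; the pgbench_accounts objects are just examples, substitute your own table and index names:

```
# Report how large one index and its whole table are on disk; if the index
# is a modest fraction of RAM, adding memory will likely speed up lookups.
psql -d mydb -c "SELECT pg_size_pretty(pg_relation_size('pgbench_accounts_pkey'))"
psql -d mydb -c "SELECT pg_size_pretty(pg_total_relation_size('pgbench_accounts'))"
```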
Once you have an application running, you can usually get a much better idea of how memory is being used by looking inside the PostgreSQL buffer cache (and potentially inside the operating system one as well) and seeing what data it prefers to keep around. The section on using pg_buffercache in this book shows how to monitor that data.
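As a preview of that material, here's a minimal sketch of the kind of query involved, run through psql against a database named mydb and assuming the pg_buffercache contrib module has already been installed there:

```
# Show which relations currently occupy the most shared_buffers space.
psql -d mydb <<'SQL'
SELECT c.relname, count(*) AS buffers
  FROM pg_buffercache b
  JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
 WHERE b.reldatabase IN (0, (SELECT oid FROM pg_database
                             WHERE datname = current_database()))
 GROUP BY c.relname
 ORDER BY buffers DESC
 LIMIT 10;
SQL
```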
While it's always possible to run into situations where the CPU in your database server is its bottleneck, it's downright likely that you'll run into a disk bottleneck—particularly if you only have a drive or two in the system. A few years ago, the basic two choices in hard drives were the inexpensive ATA (also known as IDE) drives used in desktops versus the more serious SCSI drives aimed at servers.
Both technologies have marched forward, and the current choice you're most likely to run into when configuring a database server is whether to use Serial ATA (SATA) or Serial Attached SCSI (SAS). It's possible to find nearly identical drives available in both interfaces, and there are even drive controllers that allow attaching either kind of drive. Combined with a narrowing performance difference between the two, picking between them is harder than ever.
The broad parameters of each technology are straightforward to compare. Here's the state of things as this is being written:
- SAS disks:
- The maximum available RPM is higher: 10,000 or 15,000
- Not as much drive capacity: 73 GB to 1 TB are popular sizes
- Cost per MB is higher
- SATA disks:
- Drives typically have a slower RPM: 7200 is standard; some 10,000 RPM designs exist, such as the Western Digital VelociRaptor
- Higher drive capacity: 2 TB available
- Cost per MB is lower
Generally, you'll find individual SAS disks to be faster even than SATA ones with similar specifications. In particular, you're likely to see better seek performance on random I/O due to faster drive mechanics in SAS, and sometimes a faster transfer rate from the disk too. Also, because the SAS drives have supported advanced features such as command queuing for longer, it's more likely your operating system will have matching support to take advantage of them.
The Redundant Array of Inexpensive Disks (RAID) approach is the standard way to handle both the performance and reliability limitations of individual disk drives. A RAID array puts many disks, typically of exactly the same configuration, into a set that acts like a single disk—but with either enhanced performance, reliability, or both. In some cases the extra reliability comes from computing what's called parity information for writes to the array. Parity is a form of checksum on the data, which allows reconstructing it even if some of the information is lost. RAID levels that use parity are efficient from a space perspective at writing data in a way that will survive drive failures, but the parity computation overhead can be significant for database applications.
The most common basic forms of RAID arrays used are:
- RAID 0: Also called striping. Multiple disks are used at the same time, spreading reads and writes over each of them in parallel. This can provide an almost linear improvement (two disks reading roughly twice as fast as a single one), but a failure of any drive in the set loses all the data.
- RAID 1: Also called mirroring. Copies of the same data are written to multiple disks. This can sometimes improve performance: a good RAID 1 implementation mirroring across two disks might handle two reads by sending one to each drive, and reads executed in parallel against both drives can effectively double the average seeks per second. But generally, the reason for RAID 1 is redundancy: if a single drive fails, the system will continue operating using the other one.
- RAID 10 or 1+0: This first takes pairs of disks and mirrors them using RAID 1. Then, the resulting set is striped using RAID 0. The result provides both high performance and the ability to tolerate any single disk failure, without as many ways for speed to suffer in the average and worst cases as RAID 5/6. RAID 10 is particularly appropriate for write-heavy environments, where the parity computation overhead of RAID 5/6 can cause disk performance to suffer. Accordingly, it's the preferred RAID level for high-performance database systems.
- RAID 5: Also called striping with parity. This approach sits midway between 0 and 1. You stripe data across multiple drives similarly to RAID 0, which improves read performance, but parity information is also written, spread across the drives in the array. If one of the disks in the array is lost, the missing data can be recomputed from the remaining ones using that parity information. While this is efficient in terms of how little space is wasted relative to the tolerance for disk failures provided, write performance in particular can suffer in RAID 5, because every write also requires updating the associated parity.
- RAID 6: Similar to RAID 5, except with more parity information, enabling survival even with two disk failures. It has the same fundamental advantages and disadvantages. RAID 6 is an increasingly common way to cope with the fact that rebuilding a RAID 5 array after a disk loss can take a really long time on modern, high capacity drives. The array has no additional fault tolerance during that period, and seeing a second drive failure before that rebuild finishes is not that unlikely when it takes many hours of intense disk activity to rebuild. Disks manufactured in the same batch are surprisingly likely to fail in groups.
To be fair in any disk performance comparison, you need to consider that most systems are going to have a net performance from several disks, such as in a RAID array. Since SATA disks are individually cheaper, you might be able to purchase considerably more of them for the same budget than had you picked SAS instead. If you believe your application will get faster if it is spread over more disks, being able to buy more of them per dollar spent can result in an overall faster system. Note that the upper limit here will often be your server's physical hardware. You only have so many storage bays or controller ports available, and larger enclosures can cost more both up-front and over their lifetime. It's easy to find situations where a smaller number of faster drives, which is what SAS provides, is the better way to go. This is why it's so important to constantly benchmark both hardware and your database application, to get a feel for how well it improves as the disk count increases.
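Chapter 3 covers this in depth, but even a crude sequential throughput check can tell you whether adding drives to an array is paying off. Here's a minimal sketch assuming GNU dd, a server with roughly 8 GB of RAM, and a placeholder mount point of /data for the array being tested:

```
# Write a 16 GB file (about twice RAM, so caching doesn't dominate the
# result), forcing data to disk before dd reports a rate; then read it back.
dd if=/dev/zero of=/data/testfile bs=1M count=16384 conv=fdatasync
dd if=/data/testfile of=/dev/null bs=1M
```

Repeat the pair after changing the disk count; if the numbers don't scale up, the bottleneck has moved somewhere other than the drives themselves.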
Just because it's possible to buy really inexpensive SATA drives and get good performance from them doesn't necessarily mean you want to put them in a database server. It doesn't matter how fast your system normally runs if a broken hard drive has taken it down and made the server unusable.
The first step towards reliable hard drive operation is for the drive to accurately report the errors it does run into. This happens through two mechanisms: error codes reported during read and write operations, and drive status reporting through the SMART protocol. SMART provides all sorts of information about your drive, such as its temperature and results of any self-tests run.
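On Linux, the smartmontools package is the usual way to read that SMART data. Here's a minimal sketch, assuming the drive appears as /dev/sda:

```
# Dump drive identity, overall health, attributes (including temperature),
# and the results of any past self-tests.
smartctl -a /dev/sda
# Kick off a short self-test; the result appears in the log a few minutes later.
smartctl -t short /dev/sda
smartctl -l selftest /dev/sda
```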
When they find bad data, consumer SATA drives are configured to be aggressive in retrying and attempting to correct that error automatically. This makes sense given that there's typically only one copy of that data around, and it's fine to spend a while retrying rather than report the data lost. But in a RAID configuration, as often found in a database environment, you don't want that at all. It can lead to a timeout and generally makes things more difficult for the RAID controller. Instead, you want the drive to report the error quickly, so that an alternate copy or copies can be used instead. This change in error handling is usually the main difference in SATA drives labeled "enterprise" or "RAID edition". Drives labeled that way will report errors quickly, where the non-RAID versions will not. Having to purchase the enterprise version of a SATA hard drive to get error handling suitable for server use does close some of the price gap between them and SAS models. This behavior is sometimes adjustable in the drive firmware; see http://en.wikipedia.org/wiki/TLER for more information about drive features in this area.
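On drives that expose this setting, recent versions of smartmontools can report and adjust the error recovery timeout (sold as TLER, ERC, or CCTL depending on the vendor). Here's a minimal sketch, assuming a drive at /dev/sda that supports SCT Error Recovery Control; many consumer drives will simply refuse the command, and the setting may not survive a power cycle:

```
# Show the current read/write error recovery timeouts, if supported.
smartctl -l scterc /dev/sda
# Limit both to 7.0 seconds (the values are in tenths of a second), the
# quick-failure behavior a RAID controller expects.
smartctl -l scterc,70,70 /dev/sda
```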
Generally, all SAS disks will favor returning errors so data can be reconstructed rather than trying to self-repair. Additional information about this topic is available as part of a commentary from Network Appliance about the drive reliability studies mentioned in the next section: http://storagemojo.com/2007/02/26/netapp-weighs-in-on-disks/
Tip
Using an external drive for a database
External drives connected over USB or Firewire can be quite crippled in their abilities to report SMART and other error information, due to both the limitations of the common USB/Firewire bridge chipsets used to connect them and the associated driver software. They may not properly handle write caching for similar reasons. You should avoid putting a database on an external drive using one of those connection methods. Newer external drives using external SATA (eSATA) are much better in this regard, because they're no different from directly attaching the SATA device.
General expected reliability is also an important thing to prioritize. There have been three excellent studies of large numbers of disk drives published in the last few years:
- Google: Failure Trends in a Large Disk Drive Population http://research.google.com/archive/disk_failures.pdf
- Carnegie Mellon Study: Disk failures in the real world http://www.cs.cmu.edu/~bianca/fast07.pdf
- University of Wisconsin-Madison and Network Appliance: An Analysis of Data Corruption in the Storage Stack http://www.usenix.org/event/fast08/tech/full_papers/bairavasundaram/bairavasundaram.pdf (long version) or http://www.usenix.org/publications/login/2008-06/openpdfs/bairavasundaram.pdf (shorter version)
The data in the Google and Carnegie Mellon studies don't show any significant bias toward the SCSI/SAS family of disks being more reliable. But the U of W/NetApp study suggests "SATA disks have an order of magnitude higher probability of developing checksum mismatches than Fibre Channel disks". That matches the idea suggested above, that error handling under SAS is usually more robust than on similar SATA drives. Since SAS drives are more expensive too, whether this improved error handling is worth paying for depends on your business requirements for reliability. This may not even be the same for every database server you run; as one example, systems where a single master database feeds multiple slaves will obviously favor using better components in the master.
You can find both statistically reliable and unreliable hard drives with either type of connection. One good practice is to only deploy drives that have been on the market long enough that you can get good data from actual customers on the failure rate. Newer drives using newer technology usually fail more often than slightly older designs that have been proven in the field, so if you can't find any reliability surveys for a drive, that's reason to be suspicious.
In some disk array configurations, it can be important to match the firmware version of all the drives used for reliable performance. SATA disks oriented at consumers regularly have large changes made to the drive firmware, sometimes without a model number change. In some cases it's impossible to revert to an earlier version once you've upgraded a drive's firmware, and newer replacement drives may not support running an earlier firmware either. It's easy to end up where you can't purchase a replacement for a damaged drive even if the model is still on the market.
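Checking whether the drives in an array actually match is straightforward with smartmontools; here's a minimal sketch, with the device names as placeholders for the array members:

```
# Print the model and firmware revision of each drive so mismatched
# firmware can be spotted before it causes trouble.
for dev in /dev/sd[a-d]; do
    echo "== $dev"
    smartctl -i "$dev" | grep -E 'Device Model|Firmware Version'
done
```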
If you're buying a SAS drive, or one of the RAID-oriented enterprise SATA ones, these tend to have much better firmware stability. This is part of the reason these drives lag behind consumer ones in terms of maximum storage capacity. Anyone who regularly buys the latest, largest drives available on the market can tell you how perilous that is; new drive technology is rather unreliable. It's fair to say that the consumer market is testing out the new technology. Only once the hardware and associated firmware have stabilized does work begin on the more expensive, business-oriented SAS versions. This makes it easier for the manufacturer to keep firmware revisions stable as new drive revisions are released; the beta testing by consumers has already worked out many of the possible bugs.
The latest drive technology on the market right now is the Solid State Drive (SSD), also called a flash disk. These provide permanent storage without any moving parts. They can vastly outperform disks with moving parts, particularly for applications where disk seeking is important, which is often the case for databases, and they should be more reliable too.
There are three major reasons why more databases don't use SSD technology yet:
- Maximum storage capacity is low, and databases tend to be big
- Cost for the storage you do get is fairly high
- Most SSD designs do not have a well designed write-cache on them
Due to how the erasing mechanism on an SSD works, the drive must have a write cache, typically a few kilobytes in size to match the block size of the flash cells, for it to operate in a way that will let it last a long time. Writes are cached until a full block of data is queued up, then the flash is erased and the new data written.
While small, these are still effectively a write-back cache, with all the potential data corruption issues any such design has for database use (as discussed in detail later in this chapter). Some SSDs include a capacitor or similar battery backup mechanism to work around this issue; the ones that do not may have very rare but still real corruption concerns.
Until SSD manufacturers get better about describing exactly under what conditions the contents of their device's write cache can be lost, this technology remains an unknown risk for database use, and should be approached with caution. The window for data loss, and therefore database corruption, is very small, but it's often there. Also, you usually can't resolve that issue by using a controller card with its own battery-backed cache. In many cases, current generation cards won't talk to SSDs at all. And even if they can, the SSD may not accept and honor the same commands to disable its write cache that a normal drive would, because a working write cache is so central to the drive's longevity.
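For a regular SATA drive, checking and disabling the on-drive write cache is a one-line operation; here's a minimal sketch with /dev/sda as a placeholder, and the caveat just described that an SSD may ignore or misreport the setting:

```
# Report whether the drive's volatile write cache is currently enabled.
hdparm -W /dev/sda
# Turn the write cache off; safer for a database without a battery-backed
# controller cache, at a real cost in write performance.
hdparm -W 0 /dev/sda
```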
One of the most critical aspects of PostgreSQL performance is also one of the easiest to overlook. Several hardware trends have made the disk controller seem less important now:
- Increasing CPU speeds and improved Direct Memory Access (DMA) approaches make off-loading disk work to a controller card seemingly less valuable. There is little chance that your controller will be any faster at previously expensive tasks such as RAID parity computation than your main system processor(s).
- Recent operating system file-system technology such as Linux Software RAID and ZFS makes hardware RAID, with its often proprietary storage, seem less desirable.
- Virtual machines and cloud deployments make the underlying disk hardware almost completely disconnected from the interface that your operating system sees.
- Faster drives such as SSD seem to eliminate the need for an intermediate intelligent controller.
Do not be fooled, however. When it comes to committing information into a database, spinning disk media, and even flash media such as SSDs, has some limitations that no amount of software cleverness can overcome.
When implementing a RAID array, you can do so with special hardware intended for that purpose. Many operating systems nowadays, from Windows to Linux, include software RAID that doesn't require anything beyond the disk controller on your motherboard.
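As an illustration of how little is involved on the software side, here's a minimal sketch of building the RAID 10 layout recommended earlier using Linux software RAID; the four device names are placeholders for identical drives:

```
# Create a 4-disk RAID 10 array, then watch its initial synchronization.
mdadm --create /dev/md0 --level=10 --raid-devices=4 \
      /dev/sdb /dev/sdc /dev/sdd /dev/sde
cat /proc/mdstat
```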
There are some advantages to hardware RAID. In many software RAID setups, such as the Linux implementation, you have to be careful to ensure the system will boot off either drive in case of a failure. The BIOS provided by hardware cards normally takes care of this for you. Also, in cases where a failed drive has been replaced, it's usually easier to set up an automatic rebuild with a hardware RAID card.
When hard drives fail, they can take down the entire system in the process if they start sending bad data to the motherboard. Hardware RAID controllers tend to be tested for that scenario; motherboard drive controllers aren't necessarily.
The main disadvantage to hardware RAID, beyond cost, is that if the controller fails, you may not be able to access the drives anymore using a different controller. There is an emerging standard for storing the RAID metadata in this situation, the SNIA RAID Disk Data Format (DDF). It's not well supported by controller card vendors yet, though.
The biggest advantage to hardware RAID in many systems is the reliable write caching they provide. This topic is covered in detail later in this chapter.
There are plenty of disk controllers on the market that don't do well at database tasks. Here are a few products that are known to work well with the sort of hardware that PostgreSQL is deployed on:
- LSI's MegaRAID line has been a source of reliable, medium-performance SCSI and now SAS/SATA controllers for many years. They tend to have smaller cache sizes, and their older products in particular were not always the fastest choice available, but their technology is mature and the drivers you get tend to be quite stable. The current SAS products perform extremely well in RAID 10, the usual preferred topology for databases. Their RAID 5 performance is still not very impressive.
- Dell has offered a rebranded LSI MegaRAID card as their PowerEdge RAID Controller (PERC) for some time now. The PERC6 is based on the LSI SAS design mentioned above, as are its replacements, the PERC H700 and H800 (avoid the H200, which has no write cache at all). The PERC5 and earlier models tended to be slow, and the Dell-customized firmware often didn't work as well as that in the genuine LSI models. These issues are all cleared up in the PERC6 and later models, which can easily clear 1 GB/s of reads from a properly configured 24-disk array.
- 3ware was one of the first companies to offer SATA RAID solutions, and they're particularly well known for providing excellent Linux drivers. Some of the earlier 3ware models had unimpressive performance, but the current 9690SA is a solid midrange performer if configured correctly. 3ware has gone through some company transitions; they were bought by AMCC, who were then bought by LSI. Eventually you can expect that 3ware will be just another LSI line.
- HP provides a few RAID cards in their Smart Array series of products, including the P400, P600, and P800. The main difference between these cards is performance. The P800 is well respected as a card with high performance, while the P400 and P600 are considered at best medium speed performers.
- Emulex and QLogic provide the most popular high-performance cards for attaching Fiber Channel disk arrays to a server.
- Areca is a less well known company than the rest on this list, but they've gained a following among fans of high-performance SATA cards; some models support SAS as well. Areca cards are featured in a few "white-box" vendor systems provided by resellers, for those who prefer not to deal with the big vendors mentioned above. One concern with Areca is getting a management utility that is compatible with your system. The more expensive models that include a built-in management network port, what they call their "Out-of-Band Manager", are easiest to deal with here—just access the card over the network via its web console.
Note
Driver support for Areca cards depends heavily upon the operating system you're using, so be sure to check this carefully. Under Linux, for example, you may have to experiment a bit to find a kernel version whose Areca driver is fully reliable, because this driver isn't popular enough to get a large amount of testing. The 2.6.22 kernel works well for several heavy PostgreSQL users with these cards.
Typically, the cheapest of the cards you'll find on the above list currently sells for around $300 USD. If you find a card that's cheaper than that, it's not likely to work well. Most of the cheaper products are what's referred to as Fake RAID: cards that don't actually include a dedicated storage processor, which is one of the parts that pushes the price up substantially.
Instead, Fake RAID cards use your system's CPU to handle these tasks. That's not necessarily bad from a performance perspective, but you'd be better off using a simple operating system RAID (such as the ones provided with Linux or even Windows) directly. Fake RAID tends to be buggy and have low-quality drivers, and you'll still have concerns about the volume not being portable to another type of RAID controller. These cards won't have a battery-backed cache, either, which is another major component worth paying for in many cases.
Prominent vendors of Fake RAID cards include Promise and HighPoint. The RAID support you'll find on most motherboards, such as Intel's RAID, also falls into the fake category. There are some real RAID cards available from Intel though, and they manufacture the I/O processor chips used in several of the cards mentioned above.
Even just considering the real hardware RAID options here, it's impossible to recommend any one specific card because business purchasing limitations tend to reduce the practical choices. If your company likes to buy hardware from HP, the fact that Areca might be a better choice is unlikely to matter; the best you can do is know that the P800 is a good card, while their E200 is absent from the above list for good reason—it's slow. Similarly, if you have a big Dell purchasing contract already, you're likely to end up with a PERC6 or H700/800 as the only practical choice. There are too many business-oriented requirements that filter down what hardware is practical to provide a much narrower list of suggestions than what's above.
If you are connecting hard drives directly to your server through the motherboard or add-in cards, without leaving the case itself, that's referred to as direct-attached storage (DAS). The other alternative is to use an external interface, usually Fiber Channel or Ethernet, and connect a Storage Area Network (SAN) or Network Attached Storage (NAS) device to hold the database disks. SAN and NAS hardware is typically much more expensive than DAS, and easier to manage in complicated ways. Beyond that, comparing the two is somewhat controversial.
A SAN or NAS (when SAN is used below in this section, it's intended to refer to both) has a few clear advantages over direct storage:
- Easier to use many drives. It's hard to get direct storage to go much over 24 drives without moving into multiple external storage units, while that's only a medium-sized SAN, still managed as a single component.
- Read/write caching is often much larger. 16 GB of cache is not unusual in a SAN, while direct storage will normally top out at closer to 1 GB. Also, SAN designs can include a way to dump even that large cache to disk cleanly in the event of any power failure, making it very hard to end up with a corrupted database in the process no matter how long the outage.
- SANs are easier to make redundant across multiple servers. Typically you'll have at least two fiber-channel ports in a SAN, making it easy to connect to two systems if you want. Then either server can access the data, as long as you partition access between the two properly.
- The management interface for a SAN will usually include fancier rebuild possibilities, and features such as snapshots that can make backup and mirroring operations much easier.
There are a few potential drawbacks as well:
- Performance can suffer compared to direct storage. Everything else being equal, going through the SAN/NAS interface (Fiber Channel, Ethernet, and so on) adds latency no matter what, and can introduce a write bottleneck too. This can really slow your system down if, for example, you're using an interface like Gigabit Ethernet with many disks behind it; a single gigabit link tops out at roughly 100 to 120 MB/s, which even a handful of modern drives can exceed during sequential scans. You can easily fill the network capacity long before you've reached the combined speed of all the available drives in that case.
- SANs are very complicated. It's common to need to hire a consultant just to get the system working as you expect, for your workload.
- Costs are much higher in terms of performance per dollar on a SAN compared to direct storage. If you really need the redundancy and management features a SAN provides, they can make perfect sense. This is particularly true if the SAN is so large it's serving out work for many systems at once, which can make the whole system easier to cost justify.
If you want performance at a reasonable price, direct storage is where you'll end up. If you need a SAN or NAS, the reasons why are likely related to your business requirements rather than to performance.
Note
If you follow the disk benchmarking practices recommended in this book, you should be able to nail down results quickly enough to make good performance a required component of the sales cycle for the SAN. Considering the price tag, you certainly can make the SAN vendor do some work as part of selling it: benchmark before the system is completely paid for, in a standard way whose results aren't likely to be disputed. This can sidestep the need to hire an external consultant, by leveraging the resources inside the vendor instead.