Table for Operating System – File Cache Alignment

Geeking Out on SSD Hardware Developments (Linux Magazine, February 9th, 2010)

Intel and Smaller Cells
“According to this link, the theoretical lower limit on NAND cells appears to be around 20nm. One of the big reasons for this limitation is that as cells shrink in size they naturally sit closer together. However, the voltage required to program a cell remains about the same (typically 12V). This means that the power density (the amount of power in a given area) increases, raising the probability that the programming voltage will disturb neighboring cells and corrupt data. Consequently, increasing the density of cells can be a dicey proposition, hence the lower limit.

Recently, IMFT (Intel-Micron Flash Technologies) LLC announced that it had begun sampling 2-bits-per-cell (MLC) NAND flash chips manufactured using a 25nm process. This announcement is significant because of the increased density and how close that density is getting to the theoretical limit. Add the fact that one of the best-performing SSDs comes from Intel, one of the participating companies in the LLC, and we can see a fairly significant shift in technology.

IMFT is a joint project by Intel and Micron to develop new technologies around Flash storage (primarily NAND). The project started production several years ago with a 72nm process. It then moved to a 50nm process in 2008, followed by a further reduction to a 34nm process in 2009. It is the 34nm process that current Intel SSDs utilize (Intel X25-M G2). The 34nm process produces a 4GB MLC NAND chip with a die size of 172 mm². The new 25nm process is targeted for a first product which is an 8GB MLC NAND flash chip with a die size of 167 mm². So going from 34nm to 25nm doubles the die density.

In addition to the doubling of die density, the new chips will have some other changes. The current 34nm generation of chips has a page size of 4KB and 128 pages per block, resulting in a block size of 512KB. The new chip will have a page size of 8KB and 256 pages per block. This means that the new block size is 8KB * 256 = 2,048KB (2MB). This change in block size can have a significant impact on performance.

Recall that a block is the smallest amount of storage that goes through an erase/write cycle when any single bit of the block is changed. For example, if any bit within the block is changed, the unchanged data first has to be copied from the block to cache, and then the block is erased. Finally, the updated information is merged with the cached (unchanged) data and the entire result is written back to the erased block. This process can take a great deal of time to complete and also uses up a rewrite cycle for all of the cells in the block (remember that NAND cells have a limited number of rewrite cycles before they can no longer hold data).

The new 25nm chip will switch from 512KB blocks to 2MB blocks (2,048KB), increasing the amount of data that has to go through the read/erase/write cycle compared to the 34nm chips. To adjust to this change, Intel will have to make adjustments to the firmware to better handle the larger blocks. In addition, it is likely that Intel will have to at least double the cache size to accommodate the larger block sizes. It may have to increase the number of spare pages as well, since a single-bit change could cause a greater number of blocks to be tagged for updating. However, on the plus side, with larger pages the controller can do much more optimization for writes, including a much greater amount of write coalescing. But this too could increase the amount of cache needed.”
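
To put numbers on that read/erase/write penalty, here is a minimal Python sketch (my own illustration, not Intel’s firmware logic). It assumes the worst case described above: any in-place update drags every erase block it touches through a full cycle.

```python
# Minimal model of the read-modify-write penalty: any write that touches a
# block forces the whole block through an erase/program cycle, so larger
# blocks mean more data physically rewritten per small logical update.

def rmw_cost(write_bytes: int, offset: int, block_bytes: int):
    """Return (blocks erased, write amplification) for one in-place update."""
    first_block = offset // block_bytes
    last_block = (offset + write_bytes - 1) // block_bytes
    blocks_touched = last_block - first_block + 1
    physically_written = blocks_touched * block_bytes
    return blocks_touched, physically_written / write_bytes

# A 4 KiB update on the 34nm generation (512 KiB blocks) vs. the new
# 25nm generation (2 MiB blocks):
for block in (512 * 1024, 2048 * 1024):
    blocks, amp = rmw_cost(write_bytes=4096, offset=0, block_bytes=block)
    print(f"block = {block // 1024:>4} KiB: {blocks} block(s) erased, "
          f"write amplification = {amp:.0f}x")
# block =  512 KiB: 1 block(s) erased, write amplification = 128x
# block = 2048 KiB: 1 block(s) erased, write amplification = 512x
```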

Chatwin’s Table

                                   Conventional Allocation         Improved Allocation
                                   4 KiB (Alignment: 1 MB)         64 KiB (Alignment: 2 MB)

OS – 32bit, File Level Cache       Page – 32 K, Size – 128 MB      Page – 64 K, Size – 256 MB
OS – 64bit, File Level Cache       Page – 64 K, Size – 256 MB      Page – 128 K, Size – 512 MB
OS – 32bit, Storage Level Cache    Page – 64 K, Size – 256 MB      Page – 128 K, Size – 512 MB
OS – 64bit, Storage Level Cache    Page – 128 K, Size – 512 MB     Page – 256 K (stripe unit), Size – 1 GB

                        Windows XP x32 (8*4)     Windows XP x64 (8*8)
System RAM              2 GB                     4 GB
Physical Page Size      32 Kibit (4 KB)          64 Kibit (8 KB)
Logical Block Size      128 KBytes               256 KBytes
Cluster Unit Aligned    256 Kibit (32 KB)        512 Kibit (64 KB)
Sector Size             512 bytes                1024 bytes

Partition Alignment for SSDs

Microsoft Support: Article ID: 929491 – Last Review: June 8, 2009 – Revision: 4.0 (Summary)

“Disk performance may be slower than expected when you use multiple disks in Microsoft Windows Server 2003, in Microsoft Windows XP, and in Microsoft Windows 2000. For example, performance may slow when you use a hardware-based redundant array of independent disks (RAID) or a software-based RAID.

This issue may occur if the starting location of the partition is not aligned with a stripe unit boundary in the disk partition that is created on the RAID. A volume cluster may be created over a stripe unit boundary instead of next to the stripe unit boundary. This is because Windows uses a factor of 512 bytes to create volume clusters. This behavior causes a misaligned partition. Two disk groups are accessed when a single volume cluster is updated on a misaligned partition.

To verify that an existing partition is aligned, divide the partition’s starting offset by the size of the stripe unit. Use the following syntax:

((Partition offset) * (Disk sector size)) / (Stripe unit size)

Example of alignment calculations in bytes for a 256-KB stripe unit size:

(63 * 512) / 262144 = 0.123046875
(64 * 512) / 262144 = 0.125
(128 * 512) / 262144 = 0.25
(256 * 512) / 262144 = 0.5
(512 * 512) / 262144 = 1

Example of alignment calculations in kilobytes for a 256-KB stripe unit size:

(63 * .5) / 256 = 0.123046875
(64 * .5) / 256 = 0.125
(128 * .5) / 256 = 0.25
(256 * .5) / 256 = 0.5
(512 * .5) / 256 = 1

These examples show that the partition is not aligned correctly for a 256-KB stripe unit size until the partition is created by using an offset of 512 sectors (512 bytes per sector).”
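
The same check can be expressed as a short Python helper (my own sketch, not part of the KB article): a partition is aligned exactly when its starting offset in bytes is a whole multiple of the stripe unit size.

```python
# Alignment check from KB 929491: offset (sectors) * sector size must be a
# whole multiple of the stripe unit size.

def is_aligned(offset_sectors: int,
               sector_size: int = 512,
               stripe_unit: int = 256 * 1024) -> bool:
    """True if the partition start falls on a stripe unit boundary."""
    return (offset_sectors * sector_size) % stripe_unit == 0

for offset in (63, 64, 128, 256, 512):
    ratio = (offset * 512) / (256 * 1024)
    print(f"offset {offset:>3} sectors: ratio = {ratio:<12} "
          f"aligned = {is_aligned(offset)}")
# Only the 512-sector offset yields a whole number; 63 sectors is the
# classic misaligned Windows XP default.
```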

SMART Modular Technologies (author: Esther Spanjer), April 2010

“NAND flash devices are divided into erasable blocks composed of multiple pages (up to 256 pages per block, and up to 8KB per page). A flash block must be fully erased prior to re-writing, and a single-block erase process can take up to several milliseconds. The write speed may suffer a great deal if the SSD controller has to perform unnecessary block erase operations due to partition misalignment. Proper partition alignment is one of the most critical attributes that can greatly boost the I/O performance of an SSD due to reduced read-modify-write operations.

Windows XP and Windows Server 2000/2003 start the partition offset at 31.5KB (32,256 bytes). Due to this misalignment, clusters of data are spread across physical memory block boundaries, incurring a read-modify-write penalty. As a result, the host ends up writing at least 2X more I/O for every write, as illustrated in this figure:

[Figure: Alignment]
When choosing a partition starting offset, SMART Modular recommends that system integrators correlate the partition offset with the RAID stripe size and cluster size to achieve optimal SSD I/O performance. The figure shows an example of a misaligned partition offset and an example of an aligned partition offset for Windows Server.”
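
To see where the “at least 2X more I/O” comes from, here is a tiny sketch (my illustration; the 4 KB NAND page size and the 1 MiB aligned offset are assumptions): with the old 31.5 KB offset, every 4 KB cluster write straddles two flash pages.

```python
# Count how many flash pages a single cluster write touches, for a
# misaligned (31.5 KB) vs. an aligned (1 MiB) partition start.

FLASH_PAGE = 4 * 1024  # assumed NAND page size
CLUSTER = 4 * 1024     # NTFS default cluster size

def pages_touched(partition_offset: int, cluster_index: int) -> int:
    start = partition_offset + cluster_index * CLUSTER
    end = start + CLUSTER - 1
    return end // FLASH_PAGE - start // FLASH_PAGE + 1

print(pages_touched(32256, 0))        # misaligned XP offset -> 2 pages per 4 KB write
print(pages_touched(1024 * 1024, 0))  # 1 MiB-aligned offset -> 1 page per 4 KB write
```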

Chatwin says: Based on the conclusions mentioned above, I would recommend the following alignment and cluster settings for NAND flash devices (unit = 512 bytes):

((Partition offset / Flash unit) * (Disk sector size)) / (Erase block)

((2048 / 2) * .5) / 512K = 1
((4096 / 4) * .5) / 512K = 1

With a flash unit of 2 for SLC flash (a.k.a. 1K, 1 byte/cell group [32×32]) and 4 for MLC flash (a.k.a. 2K, 2 bytes/cell group [32×32]), this results in an offset of 1 MB (2,048 sectors) for SSDs with Single-Level Cell memory, and 2 MB (4,096 sectors) for Multi-Level Cell memory, when you want to align your partitions on erase block boundaries. The best cluster size (allocation unit) can be calculated with this formula:

((Single erase block / Flash unit) * (Disk sector size)) = Cluster Size

((128 / 2) * .5) = 32K (SLC),
((512 / 4) * .5) = 64K (MLC, “The Force“)

With 8K as the ideal sector size for flash memory (4K for SLC), this is in my opinion the best choice when you want the fastest performance for your SSD (Indilinx, Samsung, JMicron controllers), unless you’re more concerned about maximizing your storage capacity at any price. And not to forget: a longer lifespan, since it reduces a lot of overhead and fragmentation compared to the standard NTFS cluster size of 4,096 bytes. Keep in mind that this default dates from a time when hard disks were limited to 2-4 GB, not the 2 TB of this day and age.
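
For readers who want to plug in their own numbers, here is a small Python version of the two formulas above (a sketch of my reading of them; the flash units of 2 for SLC and 4 for MLC, the 512 KB erase block and the 128/512 pages per block all come from the text):

```python
# Chatwin's two formulas: a partition offset that lands on an erase block
# boundary, and the matching cluster (allocation unit) size.

SECTOR_KB = 0.5  # disk sector size: 512 bytes

def partition_offset_sectors(flash_unit: int, erase_block_kb: int = 512) -> int:
    """Offset in 512-byte sectors that makes the alignment formula equal 1."""
    return int((erase_block_kb / SECTOR_KB) * flash_unit)

def cluster_size_kb(pages_per_block: int, flash_unit: int) -> float:
    """Cluster size = (single erase block / flash unit) * disk sector size."""
    return (pages_per_block / flash_unit) * SECTOR_KB

for name, unit, pages in (("SLC", 2, 128), ("MLC", 4, 512)):
    offset = partition_offset_sectors(unit)
    print(f"{name}: offset = {offset} sectors ({offset // 2048} MB), "
          f"cluster = {cluster_size_kb(pages, unit):.0f}K")
# SLC: offset = 2048 sectors (1 MB), cluster = 32K
# MLC: offset = 4096 sectors (2 MB), cluster = 64K
```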

AS SSD Benchmark

Benchmark Reviews (© 2010):

“Solid State Drives have traveled a long, winding course to finally get where they are today. Up to this point in technology, there have been several key differences separating Solid State Drives from magnetic rotational Hard Disk Drives. While the DRAM-based buffer size on desktop HDDs has recently reached 32 MB and is ever-increasing, there is still a hefty delay in the initial response time. This is one key area in which flash-based Solid State Drives continually dominate, because they lack moving parts to “get up to speed”. However, the benefits inherent to SSDs have traditionally fallen off once the throughput begins, even though data reads or writes are executed at a high constant rate whereas the HDD tapers off in performance. This makes the average transaction speed of an SSD comparable to the data burst rate mentioned in HDD tests, albeit usually lower than the HDD’s speed. Comparing a Solid State Disk to a standard Hard Disk Drive is always relative, even if you’re comparing the fastest rotational spindle speeds. One is going to be many times faster in response (SSDs), while the other is usually going to have higher throughput bandwidth (HDDs).”

Patriot Inferno SSD Kit PI100GS25SSDR (24 May 2010)

“The biggest mistake PC hardware enthusiasts make with SSDs is grading them by their speed. File transfer speed is important, but only so long as the operational IOPS performance can sustain that bandwidth under load. Benchmark Reviews tests the 100GB Patriot Inferno SSD, model PI100GS25SSDR, against some of the most popular storage devices available and demonstrates that 4K IOPS performance is more important than speed.

For decades, the slowest component in any computer system was the hard drive. Most modern processors operate within approximately 1 ns (nanosecond = one billionth of one second) response time, while system memory responds between 30-90 ns. Traditional Hard Disk Drive (HDD) technology utilizes magnetic spinning media, and even the fastest spinning desktop storage products exhibit a 9,000,000 ns – or 9 ms (millisecond = one thousandth of one second) – initial response time. In more relevant terms: the processor receives the command and waits for system memory to fetch related data from the storage drive. The difference an SSD makes to operational reaction times and program speeds is dramatic, and takes the storage drive from a slow ‘walking’ speed to a much faster ‘driving’ speed. Solid State Drive technology improves initial response times by more than 450x (45,000%) for applications and Operating System software, when compared to their HDD counterparts.

Alex Schepeljanski of Alex Intelligent Software develops the free AS SSD Benchmark utility for testing storage devices. The AS SSD Benchmark tests sequential read and write speeds, input/output operational performance, and response times. Because this software receives frequent updates, Benchmark Reviews recommends that you compare results only within the same version family.

Beginning with sequential read and write performance, the Patriot Inferno Solid State Drive produced 207.95 MB/s read speed, and 130.60 MB/s write performance. The sequential file transfer speeds have traditionally been low with this benchmark tool, especially for SandForce controllers, which is why we will concentrate on the operational IOPS performance for this section. Single-threaded 4K IOPS performance delivers 21.54 MB/s read and 61.18 MB/s write, which is among the highest results we’ve recorded. Similarly, the 64-thread 4K reads recorded 124.05 MB/s while write performance was 94.46 MB/s… both earning the Patriot Inferno SSD a spot at the very top of our charts.”

Chatwin’s opinion: The Patriot Inferno (on the left) has excellent random R/W values, thanks to the SandForce controller, which uses complex algorithms to compress small user data into 4K flash pages. The manufacturer also implements some kind of data redundancy to ensure data integrity. All of this increases the IOPS and efficiency of the SSD and reduces the need for high-quality (SLC) NAND flash or large DRAM buffers. The Indilinx controller of the Vertex Turbo (on the right) has a straightforward page-level mapping (with a 64 MB memory cache to combine write requests). In combination with a logical/physical 1:1 mapping (LBA/SSD), the Vertex (RAID 0, specifications: “The Force“) outperforms every other MLC SSD I’ve seen so far in access time (except the Intel X25-M G2 Postville; with a narrowed write bandwidth of 95 MB/s and 32K clustering on-the-fly, these SSDs are simply unbeatable…). The sequential read/write speed is quite impressive. The two OCZ Vertex Turbos score lower IOPS values than the Patriot, but the bandwidth is much higher, even for RAID configurations.

Besides an exceptional response time, the result is better write coalescing when using my workstation for day-to-day stuff: email checking, editing office documents, internet browsing (caching redirected to the hard disk, of course). An average of 30 MB/hour at the most (zero to 3 MB/hour when left idle). I use Diskeeper HyperFast for maintenance, but so far it has never initiated an automatic (free space) defragmentation, even after several months of installing software and restoring system backups.

Note: AS SSD Benchmark was run in 2 sessions (first: Seq./Acc. Time, then 4K (-64) Thrd). No tricks or manipulations involved…
[Screenshots: as-ssd-bench Patriot Inferno AHCI / as-ssd-bench Vertex0 18-6-2010]
