Flash: Write Amplification, Bit Error Rate & ECC Algorithms

Just like any new technology, along with the “wow” factor come the limitations. Flash storage in its native state has issues that every vendor needs to deal with to make it more reliable, provide better endurance and thus increase the life of a flash chip. I am using this blog post as a centre stage to discuss those concerns and the methods every vendor adopts to increase the longevity of their flash chips.

Write Amplification

“Write Amplification”, as the name implies, is a phenomenon which increases the number of writes: the actual amount of physical information written is a multiple of the logical amount intended to be written. Keeping it low plays a critical role in increasing the endurance of a flash chip.

The lower the write amplification, the longer the flash will last. Flash architects pay special attention to this aspect of controller design.

From what I have explained so far, many operations like garbage collection, wear leveling, etc. keep happening in the background, and performing these operations results in moving (or rewriting) user data and metadata more than once.

Thus, rewriting even a small amount of data requires an already used portion of flash to be read, updated and written to a new location, together with erasing the new location first if it was used at some point in time. Due to the way flash works, much larger portions of flash must be erased and rewritten than the amount of new data actually requires.

This multiplying effect increases the number of “writes” required over the life of the flash, which shortens the time it can reliably operate.

Vendors use various techniques like compression and deduplication to reduce write amplification and increase the lifespan of the chip. (Deduplication has its own cost: the garbage collection and wear leveling algorithms have to coordinate with the hashing layer before erasing, which increases the load on the processor.)
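To put a number on the effect, the write amplification factor is simply the bytes physically written to flash divided by the bytes the host logically wrote. A minimal sketch (the 4 KB/16 KB figures below are invented for the example):

```python
# Hypothetical illustration: write amplification factor (WAF) is the ratio of
# bytes physically written to flash to bytes the host logically wrote.
def write_amplification(host_bytes, flash_bytes):
    return flash_bytes / host_bytes

# e.g. the host updates 4 KB, but the controller relocates a full 16 KB page
print(write_amplification(4096, 16384))  # 4.0
```

A factor of 1.0 would mean no extra writes at all; compression and deduplication can even push the effective factor below 1.0.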

Bit Error Rate

Bit error rate, as the definition says, is the number of bit errors that occur in a particular interval of time. Errors can occur for a number of reasons, but there are two main causes:

  • Read / Write disturb
  • Charge getting trapped

Program (or Write) / Read Disturb – Due to the small size of flash gates in MLC, when you apply a threshold voltage (as explained in “Understanding flash at its core”) to read or program a cell, there is every chance that nearby cells (within the same block) get disturbed: their programmed state may change, or they may become slightly programmed. This would otherwise be nullified when an erase operation happens, or corrected by error correction code (ECC) algorithms.

Charge Getting Trapped – When you are programming or erasing a cell, there is a chance that electrons get trapped in the tunnel oxide layer between the floating gate and the semiconductor while tunnelling. This usually happens when cells have been programmed and erased quite a few times and the tunnel oxide layer has become weak. Advanced error correction codes are stored alongside user data to ensure that incorrect information is spotted and dealt with, while any failing pages are marked as unusable.

NAND flash errors can also be caused by elevated heat, manufacturing defects, or simply repeated use, also known as wear-out. Hence, the error correction code (ECC) algorithms used, and the level at which the vendor handles these errors, become important when measuring the endurance of a flash system.
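As a quick illustration, the bit error rate is just the errors observed divided by the bits read over the same interval (the numbers below are invented for the example):

```python
# Hypothetical illustration: bit error rate (BER) is bit errors observed
# divided by total bits read in the same interval.
def bit_error_rate(bit_errors, bits_read):
    return bit_errors / bits_read

# e.g. 1 flipped bit per billion bits read
print(bit_error_rate(1, 10**9))  # 1e-09
```

Datasheets quote a raw BER for the media and an uncorrectable BER after ECC; the gap between the two is exactly what the algorithms in the next section buy you.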

Error Correction Code (ECC) Algorithms

Error correction code algorithms have a big impact on the endurance of a flash chip.

All NAND flash requires ECC to correct random bit errors. In the drive to make NAND flash cheap, the voltage windows separating cell states have become very narrow, which is why errors like read disturb, program disturb, wear-related errors, etc. occur.

Error correction code algorithms typically address two classes of errors:

  1. Predictable error checks, which correct errors caused by internal mechanisms inherent to the design of the chip. A prime example of such an error is adjacent cell disturb.
  2. Unpredictable error checks, which correct the more unpredictable wear that occurs due to charge getting trapped, elevated heat, etc., which the measures already in place do not anticipate. In short, this part of the algorithm handles the more complex errors.

More sophisticated ECC requires more processing power in the controller and may be slower than less sophisticated algorithms. Also, the number of errors that can be corrected can depend upon how large a segment of memory is being corrected. A controller with elaborate ECC capabilities is likely to use more compute resources and more internal RAM than one with simpler ECC. These enhancements make the controller more expensive, hence the increased cost of a flash device.
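To make the idea concrete, here is a sketch of a classic Hamming(7,4) code, one of the simplest ECC schemes: three parity bits stored alongside four data bits let the decoder locate and flip any single-bit error. This is an illustration only, not the far more sophisticated (typically BCH- or LDPC-based) codes that real flash controllers use:

```python
# Illustrative Hamming(7,4): 4 data bits + 3 parity bits, corrects 1 bit error.
# Codeword layout (positions 1..7): p1 p2 d1 p3 d2 d3 d4
def hamming74_encode(d):
    p1 = d[0] ^ d[1] ^ d[3]          # parity over positions 1,3,5,7
    p2 = d[0] ^ d[2] ^ d[3]          # parity over positions 2,3,6,7
    p3 = d[1] ^ d[2] ^ d[3]          # parity over positions 4,5,6,7
    return [p1, p2, d[0], p3, d[1], d[2], d[3]]

def hamming74_decode(c):
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    pos = s1 + 2 * s2 + 4 * s3       # syndrome = 1-based position of bad bit
    if pos:
        c[pos - 1] ^= 1              # flip the corrupted bit back
    return [c[2], c[4], c[5], c[6]]  # recover the data bits

cw = hamming74_encode([1, 0, 1, 1])
cw[4] ^= 1                           # simulate a single bit error in storage
print(hamming74_decode(cw))          # [1, 0, 1, 1] -- error corrected
```

The trade-off mentioned above is visible even here: stronger codes spend more parity bits and more decode work per sector, which is exactly the compute and RAM cost the controller pays.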

One thing I have not mentioned is the math behind error correction code algorithms. I am mesmerized by the aptitude of the folks who have mastered ECC, and their ability to extract more life out of a flash block than any mere mortal would think possible. I am not attempting to explain it all here, as I want this post to be simple enough for anyone to understand.

Last but not least, every error correction code algorithm is designed to correct only a limited number of errors (the limit usually depends on the research an organization has done on the flash chip), and that limit effectively determines the lifespan of a flash chip.

References: SanDisk, Toshiba, Micron, IBM Redbooks

Flash: P/E Cycles, Wear Leveling & Garbage Collection

Starting with this post, I am going to explain the real deal in the functioning of flash storage, which, in turn, may change how you evaluate a flash product.

Program Erase (P/E) Cycles

As explained in my previous blog, writing data onto flash is called programming: that is when you hold electrons in the floating gate. Write operations happen at the page level (typically 8-16 KB in size), and so do read operations.

An interesting part of the operations in a flash chip is updating data that has already been written. Unlike disk storage, you cannot simply update or change a particular piece of data in place. If you want to update or change data already written, you have to erase the old data first and rewrite the whole thing, and erase operations happen at the block level.

Every time you have to erase a page or update data in a page, you have to erase the whole block, even if you don’t want to update the other pages in it. Erase operations take longer than reads because they affect the whole block. This is why the life of a flash chip is measured in program/erase cycles (also referred to as P/E cycles): programs and erases always come as a pair, each cycle damages the oxide layer a little (please refer to my blog “Understanding flash at its core”), and each flash chip can take only a limited number of program and erase cycles.

An alternative to this erase operation is to mark the page as invalid and write the new data to a new page. This way you avoid the costly erase cycle and increase the life of the chip, but once data is written to another location, reads of the page marked invalid have to be redirected to the new location. This is where the flash translation layer kicks in.

So to make flash a friendly medium for storing our data, we have an abstraction layer (Flash translation layer) which will:

  1. Write updated information to a new empty page and then divert all subsequent read requests to its new address
  2. Ensure that newly-programmed pages are evenly distributed across all of the available flash so that it wears evenly
  3. Keep a list of all the old invalid pages so that at some point, later on, they can all be recycled for reuse
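The first and third duties above can be sketched in a toy flash translation layer. This is a hypothetical illustration only (the class and method names are invented), not any vendor's implementation:

```python
# Hypothetical sketch of a flash translation layer: updates never overwrite in
# place; they go to a fresh page and the logical->physical map is redirected.
class SimpleFTL:
    def __init__(self, num_pages):
        self.pages = [None] * num_pages  # physical pages
        self.l2p = {}                    # logical -> physical address map
        self.invalid = set()             # stale pages awaiting garbage collection
        self.next_free = 0               # naive free-page cursor (toy model)

    def write(self, logical, data):
        if logical in self.l2p:          # old physical copy becomes stale
            self.invalid.add(self.l2p[logical])
        self.pages[self.next_free] = data
        self.l2p[logical] = self.next_free
        self.next_free += 1

    def read(self, logical):
        return self.pages[self.l2p[logical]]

ftl = SimpleFTL(8)
ftl.write(0, "v1")
ftl.write(0, "v2")    # the update lands on a new physical page
print(ftl.read(0))    # v2 -- read is transparently redirected
print(ftl.invalid)    # {0} -- the old physical page is marked stale
```

A real FTL also persists the map, handles the wrap-around when free pages run out, and feeds the stale set to garbage collection, which the next sections cover.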

Wear Leveling

Wear leveling sounds pretty simple when you first hear about it. You have a flash device with a defined set of blocks and P/E cycles (programs/writes happen at page level and erases at block level). As constant program and erase cycles wear out the flash blocks, instead of erasing a block every time you have to update a page within it, you mark that page as invalid and write to a new page. These invalid pages have to be erased at some point to reutilize the space. This helps the flash blocks wear out evenly, rather than a few blocks wearing out early and reducing the capacity promised to the customer.

There is also another part of wear leveling which we tend not to look at: within the flash storage there are blocks which are frequently read but never updated, where the data doesn’t change. These are cold blocks, while the blocks being constantly updated are hot blocks. Cold blocks would never wear out, which again leads to uneven wearing of flash blocks. To avoid such a situation, the system deliberately relocates that cold data, otherwise those blocks won’t ever wear… and that means we are actually adding write workload to the system, which ultimately means increasing the wear.

In other words, the more aggressive we are at wear leveling, the earlier we wear out the system; but if we skip wear leveling because of those cons, we end up with hot and cold spots and uneven wearing of the system. Hence, it is a question of the right balance.
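The balancing idea can be sketched in a couple of lines, assuming the controller tracks an erase count per block (the block names and counts below are invented for the example): the next write simply goes to the least-worn candidate block.

```python
# Hypothetical sketch of wear leveling: pick the least-worn erased block for
# the next program, so erase counts stay balanced across the chip.
erase_counts = {"blk0": 120, "blk1": 45, "blk2": 300, "blk3": 44}

def pick_block(counts):
    return min(counts, key=counts.get)  # block with the fewest erases so far

target = pick_block(erase_counts)
print(target)               # blk3 -- the least-worn block is programmed next
erase_counts[target] += 1   # its erase count ticks up when recycled
```

Static wear leveling adds the second half of the story: occasionally moving cold data *off* low-count blocks so they become candidates at all.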

Garbage collection

We have so far talked about marking pages invalid and writing the new data to a fresh page. These invalid pages have to be recycled, i.e. they have to be erased. Of course, erasing is a big operation, since many pages in the same block are still in use, and in flash you have to erase a complete block and cannot just recycle a single page.

Let me explain the tricky part.

[Figure]

In the picture above, you will see that 30% of the blocks are written and the rest are empty.

[Figure]

Now, if the data has to be updated, then instead of erasing the whole block and rewriting it, the controller marks those pages as invalid or stale and writes the data to another page in the same block or in another block.

[Figure]

In the diagram above, there is 50% free space in each block, which the garbage collection algorithm can use to copy the valid data from a second block, erase that second block completely and reclaim the space.

What if a block is 50%-70% full, like in the diagram below?

[Figure]

How will the garbage collection algorithm erase the invalid pages without being able to copy the complete block data into other blocks?

This situation is a disaster, because at this point the system can never free up the stale pages, which means I have effectively turned my flash system into a read-only device, and if you look at the capacity graph I have used only 70% of the capacity. Does this mean I can never use my flash system to 100%?

This is the reason why all flash vendors over-provision the storage (as below) at no extra cost, to help you utilize 100% of the flash storage.

[Figure]

Yes, there is more flash in your device than you can actually see. This extra area is not pinned, by the way; it is not a dedicated set of blocks or pages. It is just mandatory headroom that stops situations like the one described above from ever occurring.
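The whole mechanism can be sketched in a few lines. This is a hypothetical illustration (the function and variable names are invented): valid pages from a victim block are copied into spare, over-provisioned space, and only then is the victim erased whole.

```python
# Hypothetical sketch of garbage collection with over-provisioned headroom.
def garbage_collect(block, spare):
    """block: list of (data, valid) pages; spare: list with free slots (None)."""
    moved = 0
    for data, valid in block:
        if valid:                            # copy only the still-valid pages
            spare[spare.index(None)] = data  # into over-provisioned free space
            moved += 1
    return ["erased"] * len(block), moved    # the whole block is erased at once

victim = [("a", True), ("b", False), ("c", True), ("d", False)]
spare = [None] * 4                           # hidden over-provisioned area
erased, moved = garbage_collect(victim, spare)
print(moved)   # 2 -- two valid pages relocated; four pages freed by one erase
```

The spare area is what guarantees `spare.index(None)` always succeeds: however full the visible capacity gets, there is always somewhere to park valid pages during a reclaim.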

Having explained three important concepts of flash I will take a break now.

I will explain a few more interesting features of NAND flash functioning in the coming few blogs, hence………to be continued.

NAND Flash: SLC or MLC or eMLC or TLC

A big question that I come across all the time is, “Why should a customer go with MLC (or cMLC) or eMLC flash?”

The storage market is so dominated by MLC NAND flash (I will not mention NAND further in this article, but I am talking about NAND flash only in this post) that no one even looks at SLC or TLC, and many are not even aware that SLC and TLC exist. So when I decided to write this blog post, I thought I should cover the complete picture of flash, to give my readers an end-to-end view of flash solutions, which will help them in planning the storage needs of their organization.

[Figure]

Picture used from Toshiba’s document

As discussed in my previous post (Understanding flash at its core), a cell is programmed by applying a voltage to the control gate and erased by applying a negative voltage to the control gate. If the cell holds only a “0” or a “1”, it is called a Single Level Cell (SLC): for example, if you apply a maximum of 5V at the control gate, a charge below 2.5V (i.e. 50% of the maximum charge of 5V) is read as “0”, and any charge above 2.5V is read as “1”.

SLC flash is always in one of two states, programmed (0) or erased (1). As there are only two choices, zero or one, the state of the cell can be interpreted very quickly and the chance of bit errors due to varying voltage is reduced. Hence, each SLC cell can be programmed or erased easily with coarse voltage margins. This increases the endurance of the cell and hence its program-erase cycles.

SLC flash is generally used in commercial and industrial applications and embedded systems that require high performance and long-term reliability. SLC uses a high-grade of flash media which provides good performance and endurance, but the trade-off is its high price.  SLC flash is typically more than twice the price of multi-level cell (MLC) flash.

A Multi Level Cell (MLC), on the other hand, uses more states or levels of the cell than just “0” or “1”. Taking the 5V example above and breaking the voltage into four levels: 0V – 1.25V is 00 (level 1), 1.25V – 2.50V is 01 (level 2), 2.50V – 3.75V is 10 (level 3) and 3.75V – 5V is 11 (level 4). Hence, a more precise voltage has to be measured. This increased density gives MLC a lower cost per bit stored, but also creates a higher probability of bit errors due to the very precise voltages used.

Therefore, reads, writes and erases in MLC take much longer than in SLC, as the voltage now has to be much more precise. The number of program and erase cycles decreases too, shortening the lifetime of the cell.

Triple Level Cell (TLC) takes it a step further and stores three bits per cell, or eight voltage states (000, 001, 010, 011, 100, 101, 110 and 111). Using 4V as an example to make it easy to understand: 0V – 0.5V is 000 (level 1), 0.5V – 1V is 001 (level 2), 1V – 1.5V is 010 (level 3), 1.5V – 2V is 100 (level 4), 2V – 2.5V is 011 (level 5), 2.5V – 3V is 101 (level 6), 3V – 3.5V is 110 (level 7) and 3.5V – 4V is 111 (level 8).

From the above example, you can see how precise the measurement of voltage has to become, which increases the time taken for reads, writes and erases. The same die as SLC or MLC becomes denser, but the endurance of the cell drops considerably, decreasing its program and erase cycles.
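To show why the read gets harder as bits per cell grow, here is a toy decoder for the 4V TLC example above. It is purely illustrative, and for simplicity it labels each 0.5V window with its plain binary index rather than the exact level-to-bits assignment listed above:

```python
# Hypothetical sketch: sensing a TLC cell means placing its voltage into one of
# eight 0.5 V windows (the post's 4 V example), i.e. recovering 3 bits.
def decode_tlc(voltage, vmax=4.0):
    levels = 8                           # 3 bits/cell -> 8 voltage windows
    step = vmax / levels                 # 0.5 V per window in this example
    level = min(int(voltage / step), levels - 1)
    return format(level, "03b")          # window index as a 3-bit pattern

print(decode_tlc(0.2))   # 000 -- level 1 window (0 V - 0.5 V)
print(decode_tlc(3.8))   # 111 -- level 8 window (3.5 V - 4 V)
```

With SLC the same sense only has to distinguish two windows 2.5V wide; at eight windows, a drift of a quarter volt is already enough to land in the wrong one, which is exactly the precision problem described above.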

TLC is targeted at environments with predominantly read-heavy workloads and has not been commonly used.

You might be wondering why I haven’t touched on eMLC till now and moved straight on to TLC. I kept eMLC for last because I want to take some time to explain in detail what eMLC is (a lot of vendors in the market are trying to convince customers that they need eMLC and that MLC just doesn’t cut it) and how vendors like IBM, Violin, etc. have today built technologies around MLC and made it equivalent to eMLC.

Enterprise Multi Level Cell (eMLC)

For a long time, MLC was never considered for enterprise applications, as MLC on its own can take only 3,000 to 10,000 program and erase cycles, which reduces the endurance and reliability of an MLC flash chip. Customers used to depend on SLC for enterprise applications, while MLC was used only in consumer devices such as cameras, smartphones, media players and USB sticks.

SLC being very expensive, enterprises used to find it difficult to adopt, even though it offers far more (around 100,000) program and erase cycles. Hence, vendors in the market started looking for a middle way between MLC and SLC.

To try to address the reliability issues of MLC, NAND flash manufacturers created a grade of MLC called eMLC. In eMLC they decreased the density of the data that can be written to the cell (i.e. increased the voltage margin between states) and slowed down the speed at which data is written (or programmed) to the flash device, increasing the program and erase cycles of MLC by about 3x and thus the endurance of the chip.

The benefits achieved by this:

  • Decrease in bit error rate (the larger margin separating the states lowers the number of errors)
  • MLC program & erase (P/E) cycles have increased to 30,000.
  • Lower cost than SLC.

The cons that came with it:

  • By decreasing the density of the data that can be written to a cell, they also increased the number of cells that must be written for the same amount of data compared with MLC.
  • The decrease in write speed leads to a decrease in the performance of the chip (which comes at a high cost).
[Table]

Data taken from tests run on eMLC and MLC by “SSD guy”

The write parameters in the table above show how different these two technologies are:

  • 4K write IOPS are only 75% of the write IOPS of the MLC version
  • Sequential write speed is 74% as fast as MLC
  • The 70/30 read/write IOPS test (similar to a standard workload) has a lot of reads. Since the read speed of eMLC is equal to that of MLC, the speed gap for this test is smaller: the eMLC SSD is 85% as fast as the MLC SSD
  • In the case of write latency, MLC has only 83% of the latency of the eMLC version.

In short, while eMLC provides a 3x increase in endurance over an MLC chip, performance drops by 15-25%.

Also, as eMLC drives are sold 100x less often, they are tested less in the field and, in turn, tend to contain a much higher frequency of bugs in their firmware. MLC, on the other hand, is found in millions of consumer devices, so manufacturers rigorously test MLC drives to avoid early failures and widespread issues. In a few tests conducted by industry-leading MLC vendors, failure rates on eMLC were found to be 10x worse than MLC due to firmware issues.

In the present era, storage vendors like IBM, Violin Memory, etc. have taken MLC chips and improved the program & erase cycles (or endurance) of MLC itself by up to 9 times through over-provisioning, smarter controllers and proprietary error correction code algorithms.

Hence, I would suggest my readers understand and do thorough research on the features below (and how they would work with their applications) of any enterprise flash device before purchasing, rather than relying only on the superficial knowledge given by storage vendors in the market.

  • Program Erase Cycles
  • Wear Leveling
  • Garbage Collection
  • Write Amplification
  • Error Correction Code Algorithms
  • Bit Error Rate.

Having done a marathon job of explaining the difference between SLC, MLC, eMLC and TLC, I want to end this post with a note that I will explore the above-mentioned topics as the basis of my next posts, to help you understand flash further.

References: Toshiba, Micron, TechTarget