Flash: Write Amplification, Bit Error Rate & ECC Algorithms

Just like any new technology, the "wow" factor comes with limitations. Flash storage in its native state has issues that every vendor needs to deal with to make it more reliable, provide better endurance and thus increase the life of a flash chip. I am using this blog post as a centre stage to discuss those concerns and the methods every vendor adopts to increase the longevity of their flash chips.

Write Amplification

"Write amplification", as the name implies, is a phenomenon that increases the number of writes: the actual amount of physical information written to flash is a multiple of the logical amount the host intended to write. Keeping it under control plays a critical role in increasing the endurance of a flash chip.

The lower the write amplification, the longer the flash will last. Flash architects pay special attention to this aspect of controller design.

As I have explained so far, operations like garbage collection and wear leveling keep happening in the background, and performing them results in user data and metadata being moved (or rewritten) more than once.

Thus, rewriting even a small piece of data requires an already used portion of flash to be read, updated and written to a new location, together with erasing the new location first if it had been used at some point. Because of the way flash works, much larger portions of flash must be erased and rewritten than the amount of new data actually requires.

This multiplying effect increases the number of writes required over the life of the flash, which shortens the time it can reliably operate.

Vendors use various techniques such as compression and deduplication to bring down write amplification and increase the lifespan of the chip (deduplication has its own implications: the garbage collection or wear leveling algorithm has to coordinate with the hashing engine before an erase, which increases the load on the processor).
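To put a number on it, write amplification is usually expressed as a factor: the physical bytes actually programmed to flash divided by the logical bytes the host asked to write. Below is a minimal sketch of that calculation; the counter names and figures are purely illustrative, not taken from any vendor.

```python
def write_amplification_factor(host_bytes_written, flash_bytes_written):
    """Write amplification = physical bytes programmed to flash / logical bytes
    written by the host. A factor of 1.0 is ideal; higher means extra wear."""
    if host_bytes_written == 0:
        return 0.0
    return flash_bytes_written / host_bytes_written

# Example: the host wrote 100 GB, but garbage collection and metadata
# updates caused 280 GB of actual program operations on the flash.
waf = write_amplification_factor(100e9, 280e9)
print(f"Write amplification factor: {waf:.1f}")  # 2.8
```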

Bit Error Rate

Bit error rate, as the definition says, is the number of bit errors that occur in a particular interval of time. Errors can occur for a number of reasons, but there are two main causes of bit errors:

  • Read / Write disturb
  • Charge getting trapped

Program (or Write) or Read Disturb – Due to the small size of the flash gates in MLC, when you apply a threshold voltage (as explained in "Understanding flash at its core") to read or program a cell, there is every chance that nearby cells (within the same block) get disturbed: their programmed state might change, or they might become slightly programmed. This would otherwise be nullified when an erase operation happens, or corrected by error correction code (ECC) algorithms.

Charge Getting Trapped – When you are programming or erasing a cell, there is a chance that electrons get trapped in the tunnel oxide layer between the floating gate and the semiconductor while tunneling. This usually happens when cells have been programmed and erased quite a few times and the tunnel oxide layer has become weak. Error correction data is stored alongside the user data so that incorrect information can be spotted and dealt with, while any failing pages are marked as unusable.

NAND flash errors can also be caused by elevated heat, manufacturing defects, or simply repeated use, also known as wear-out. Hence, which error correction code (ECC) algorithms a vendor uses, and at what level they are applied to handle these errors, becomes important when measuring the endurance of a flash system.
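For completeness, the raw bit error rate is nothing more than the ratio of flipped bits to bits read. A tiny, purely illustrative sketch (the numbers are made up, not a datasheet figure):

```python
def raw_bit_error_rate(bit_errors, bits_read):
    """Raw BER = flipped bits observed / total bits read."""
    return bit_errors / bits_read

# Example: 12 flipped bits observed while reading 1 GiB (8 * 2**30 bits).
ber = raw_bit_error_rate(12, 8 * 2**30)
print(f"Raw BER ~ {ber:.2e}")  # about 1.4e-09
```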

Error Correction Code (ECC) Algorithms

Error correction code algorithms have a big impact on the endurance of a flash chip.

All NAND flash requires ECC to correct random bit errors. In the drive to make NAND flash cheaper, the voltage margins between cell states have become very narrow, which is why errors such as read disturb, program disturb and wear-related errors occur.

Error correction code algorithms typically include two types of checks:

  1. Predictable error checks, which help correct errors caused by internal mechanisms inherent to the design of the chip. A prime example of such an error is adjacent cell disturb.
  2. Unpredictable error checks, which correct errors caused by trapped charge, elevated heat and similar effects that the measures already in place are not expected to catch. In short, these handle the more complex errors.

More sophisticated ECC requires more processing power in the controller and may be slower than less sophisticated algorithms. Also, the number of errors that can be corrected can depend on how large a segment of memory is being corrected. A controller with elaborate ECC capabilities is likely to use more compute resources and more internal RAM than one with simpler ECC. These enhancements make the controller more expensive, hence the increase in the cost of a flash device.
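Real controllers use far stronger codes (BCH or LDPC class), but a classic Hamming(7,4) code is enough to show the basic idea of spotting and repairing a single flipped bit. This is a teaching sketch, not any vendor's algorithm.

```python
def hamming74_encode(d):
    """Encode 4 data bits as a 7-bit codeword (positions 1..7).
    Positions 1, 2 and 4 hold parity bits; positions 3, 5, 6, 7 hold data."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4          # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4          # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(c):
    """Locate and flip a single erroneous bit, then return the 4 data bits."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3    # 0 = clean, otherwise the error position
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]

codeword = hamming74_encode([1, 0, 1, 1])
codeword[5] ^= 1                       # simulate one bit flipping in the cell array
print(hamming74_correct(codeword))     # [1, 0, 1, 1] -- the flip is repaired
```

Production ECC works over much larger codewords (an entire page plus its spare area) and corrects many more errors per codeword, which is exactly why it consumes the controller resources described above.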

One thing I have not covered is the math behind error correction code algorithms. I am mesmerized by the aptitude of the folks who have mastered ECC, and their ability to extract more life out of a flash block than any mere mortal would think possible. I am not attempting to explain it here, as I want this blog to stay simple enough for anyone to understand.

Last but not least, every error correction code algorithm is designed to correct only a limited number of errors (the limit usually depends on the research an organization has done on the flash chip), and that limit is ultimately what determines the lifespan of a flash chip.

References: SanDisk, Toshiba, Micron, IBM Redbooks


Flash: P/E Cycles, Wear Leveling & Garbage Collection

Starting with this post, I am going to explain the real mechanics of how flash storage functions, which in turn will change how you evaluate a flash product.

Program Erase (P/E) Cycles

As explained in my previous post, writing data onto flash is called programming: that is when you trap the electrons in the floating gate. Write operations happen at the page level (typically 8-16KB in size), and read operations also happen at the page level.

An interesting part of the operations in a flash chip is updating data that has already been written. Unlike disk storage, you cannot simply update, undo or change a particular piece of data in place. If you want to update or change data already written to a flash chip, you have to erase the old data first and rewrite it, and erase operations happen at the block level.


Every time you have to erase a page, or update data in a page, you have to erase the whole block, even if you don't want to touch the other pages in the block. An erase operation takes longer than a read operation because you are changing the whole block. This is why the life of a flash chip is measured in program/erase cycles (also referred to as P/E cycles): programs and erases go hand in hand, and each cycle damages the oxide layer a little more (please refer to my post "Understanding flash at its core"). Each flash chip can take only a limited number of program and erase cycles.
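As a rough back-of-the-envelope illustration (all the numbers below are invented for the example, not a vendor specification), the rated P/E cycle count can be turned into an expected lifetime once you know how much is written per day:

```python
def estimated_lifetime_years(capacity_gb, pe_cycles, host_gb_per_day,
                             write_amplification=1.0):
    """Very rough endurance estimate: the total program/erase budget of the
    device divided by the physical writes it absorbs per day."""
    total_writable_gb = capacity_gb * pe_cycles
    physical_gb_per_day = host_gb_per_day * write_amplification
    return total_writable_gb / physical_gb_per_day / 365

# Example: 400 GB of MLC rated for 3,000 P/E cycles, 500 GB of host writes
# per day, and a write amplification factor of 3.
print(f"{estimated_lifetime_years(400, 3000, 500, 3.0):.1f} years")  # ~2.2
```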

An alternative to this erase operation is to mark the page as invalid and write the new data to a fresh page. This way you avoid the costly erase cycle and increase the life of the chip, but once the data lives at another location you have to redirect reads of the invalidated page to the new location. This is where the flash translation layer kicks in.

So, to make flash a friendly medium for storing our data, we have an abstraction layer (the flash translation layer, or FTL) which will do the following (a toy sketch follows the list):

  1. Write updated information to a new empty page and then divert all subsequent read requests to its new address
  2. Ensure that newly-programmed pages are evenly distributed across all of the available flash so that it wears evenly
  3. Keep a list of all the old invalid pages so that at some point, later on, they can all be recycled for reuse
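The sketch below shows a toy flash translation layer covering the first and third duties from the list: nothing more than a logical-to-physical map, a pool of free pages and a list of invalidated pages. The structure and names are purely illustrative, not how any real controller is written.

```python
class ToyFTL:
    """A deliberately simplified flash translation layer: logical pages are
    always written to a fresh physical page, and the stale copy is remembered
    so that garbage collection can recycle it later."""

    def __init__(self, total_pages):
        self.free_pages = list(range(total_pages))   # physical pages not yet used
        self.l2p = {}                                # logical page -> physical page
        self.invalid_pages = []                      # stale copies awaiting erase

    def write(self, logical_page, data):
        new_phys = self.free_pages.pop(0)            # always program an empty page
        if logical_page in self.l2p:                 # the old copy becomes invalid
            self.invalid_pages.append(self.l2p[logical_page])
        self.l2p[logical_page] = new_phys
        print(f"logical {logical_page} -> physical {new_phys}: {data}")

    def read(self, logical_page):
        return self.l2p[logical_page]                # reads simply follow the map

ftl = ToyFTL(total_pages=8)
ftl.write(0, "v1")
ftl.write(0, "v2")                        # update: old physical page marked invalid
print(ftl.read(0), ftl.invalid_pages)     # 1 [0]
```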

Wear Leveling

Wear leveling sounds pretty simple the first time you hear it. You have flash with a defined set of blocks and P/E cycles (program/write happens at the page level and erase happens at the block level). Since constant program and erase cycles wear out the flash blocks, instead of erasing a block every time you have to update a page within it, you mark that page as invalid and write to a new page. The invalid pages have to be erased at some point to reclaim the space. This helps the flash blocks wear out evenly, rather than a few blocks wearing out early and reducing the capacity promised to the customer.

There is also another part of wear leveling we tend not to look at: within the flash there will be some blocks that are frequently read but never updated, where the data doesn't change. These are cold blocks, while the blocks being updated are hot blocks. The cold blocks would never wear out, which again leads to uneven wearing of the flash. To avoid that situation, the system takes steps to relocate the cold data, otherwise those blocks would never wear… but that means we are actually adding write workload to the system, which ultimately means increasing the wear.

In other words, the more aggressive we are at wear leveling, the earlier we wear out the system; but if we skip wear leveling because of those downsides, we end up with hot and cold spots and uneven wearing of the system. Hence, it is a question of striking the right balance.
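One simple way to picture that balance (purely illustrative, not any vendor's policy): send new writes to the least-worn free block, and only relocate cold data when the gap between the most-worn and least-worn blocks grows too large.

```python
def pick_block_for_write(erase_counts, free_blocks):
    """Of the blocks that are free, choose the one erased the fewest times."""
    return min(free_blocks, key=lambda b: erase_counts[b])

def cold_block_to_relocate(erase_counts, threshold):
    """If the wear gap exceeds the threshold, the least-worn (cold) block is a
    candidate for relocation; otherwise leave it alone to avoid extra writes."""
    most, least = max(erase_counts), min(erase_counts)
    if most - least > threshold:
        return erase_counts.index(least)
    return None

erase_counts = [120, 15, 118, 119]          # block 1 is holding cold data
print(pick_block_for_write(erase_counts, free_blocks=[0, 2, 3]))   # 2
print(cold_block_to_relocate(erase_counts, threshold=50))          # 1
```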

Garbage Collection

So far we have talked about marking pages invalid and writing the new data to a fresh page. These invalid pages have to be recycled, i.e. erased. Of course, erasing is a big operation: many pages in the same block are still in use, and in flash you have to erase a complete block; you cannot recycle a single page.

Let me explain the tricky part

[Figure: a set of flash blocks, roughly 30% written and the rest empty]

In the picture above, you will see that 30% of the blocks are written and the rest are empty.

[Figure: updated data written to new pages while the old pages are marked invalid]

Now, if the data has to be updated, then instead of erasing the whole block and rewriting it, the controller marks those pages as invalid (stale) and writes the data to another page in the same block or in another block.

[Figure: blocks about 50% full, leaving enough free space for garbage collection to relocate valid data]

In the diagram above, there is 50% free space in each block, which the garbage collection algorithm can use to copy the valid data out of a second block, erase that block completely and reclaim the space.

What if the blocks are 50%-70% full, like in the diagram below?

[Figure: blocks 50%-70% full, with too little free space to absorb another block's valid data]

How will the garbage collection algorithm erase the invalid pages without being able to copy the complete block data into other blocks?

This situation is a disaster: at this point the stale pages can never be freed up, which means I have effectively turned my flash system into a read-only device, and if you look at the capacity graph I have used only 70% of the capacity. Does this mean I can never use my flash system to 100%?

This is why all flash vendors over-provision the storage (as shown below), at no extra charge, to help you utilize 100% of the flash storage you bought.

[Figure: over-provisioned flash, with extra hidden capacity beyond the advertised size]

Yes, there is more flash in your device than you can actually see. This extra area is not pinned, by the way; it is not a dedicated set of blocks or pages. It is simply mandatory headroom that stops situations like the one described above from ever occurring.
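A minimal sketch of the idea (again illustrative, not a real controller's algorithm): pick the block with the most stale pages as the victim, move its surviving valid pages into over-provisioned space, and only then erase it.

```python
def garbage_collect(blocks, spare_pages):
    """Pick the block with the most invalid pages, copy its still-valid pages
    into the spare (over-provisioned) area, then erase the victim block.
    'blocks' maps block id -> list of pages, each marked 'valid' or 'invalid'."""
    victim = max(blocks, key=lambda b: blocks[b].count("invalid"))
    survivors = [p for p in blocks[victim] if p == "valid"]
    spare_pages.extend(survivors)      # relocate the live data first
    blocks[victim] = []                # then erase the whole victim block
    return victim

blocks = {0: ["valid", "invalid", "invalid", "invalid"],
          1: ["valid", "valid", "invalid", "valid"]}
spare = []
print(garbage_collect(blocks, spare))  # 0 -- block 0 had the most stale pages
print(blocks, spare)                   # block 0 is now empty; its live page moved
```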

Having explained three important concepts of flash, I will take a break now.

I will explain a few more interesting aspects of how NAND flash functions in the coming posts, hence… to be continued.

NAND Flash: SLC or MLC or eMLC or TLC

A big question that I come across all the time is, "Why should a customer go with MLC (or cMLC) or eMLC flash?"

The storage market is so dominated by MLC NAND flash (I will drop the word NAND from here on, but everything in this post is about NAND flash) that no one even looks at SLC or TLC, and many are not even aware that SLC and TLC exist. So when I decided to write this post, I thought I should cover the complete picture of flash, to give my readers an end-to-end view of flash solutions and help them plan the storage needs of their organization.

[Figure: SLC, MLC and TLC cell voltage levels, picture from Toshiba's documentation]

As discussed in my previous post (Understanding flash at its core), a cell is programmed by applying a voltage to the control gate and erased by applying a negative voltage to the control gate. If the cell is programmed to just one of two levels, "0" or "1", it is called a Single Level Cell (SLC). For example, if you apply a maximum of 5V at the control gate, a charge below 2.5V (i.e. 50% of the maximum) is taken as "0" and any charge above 2.5V is taken as "1".

SLC flash is always in one of two states, programmed (0) or erased (1). As there are only two choices, the state of the cell can be interpreted very quickly and the chance of a bit error due to voltage drift is reduced. Each SLC cell can therefore be programmed or erased easily with a relatively low voltage, which increases the endurance of the cell and hence the number of program/erase cycles.

SLC flash is generally used in commercial and industrial applications and embedded systems that require high performance and long-term reliability. SLC uses a high grade of flash media which provides good performance and endurance, but the trade-off is its high price. SLC flash is typically more than twice the price of multi-level cell (MLC) flash.

A Multi Level Cell (MLC), on the other hand, uses more states or levels of the cell than just "0" or "1". Taking the same 5V example and breaking the range into four levels: 0V – 1.25V is 00 (level 1), 1.25V – 2.50V is 01 (level 2), 2.50V – 3.75V is 10 (level 3) and 3.75V – 5V is 11 (level 4). A much more precise voltage therefore has to be measured. This increased density gives MLC a lower cost per bit stored, but also creates a higher probability of bit errors because of the very fine voltage margins involved.

Therefore, the time taken for read, write and erase in MLC is much longer than in SLC, as the voltage now has to be much more precise. The number of program/erase cycles the cell can endure also decreases, reducing its lifetime.

A Triple Level Cell (TLC) takes it a step further and stores three bits per cell, or eight voltage states (000, 001, 010, 011, 100, 101, 110 and 111). Using 4V as an example to keep it simple: 0V – 0.5V is 000 (level 1), 0.5V – 1V is 001 (level 2), 1V – 1.5V is 010 (level 3), 1.5V – 2V is 100 (level 4), 2V – 2.5V is 011 (level 5), 2.5V – 3V is 101 (level 6), 3V – 3.5V is 110 (level 7) and 3.5V – 4V is 111 (level 8).

From the above example, you can see how precise the voltage measurement has to become, which increases the time taken for read, write and erase. The same die as SLC or MLC becomes denser, but the endurance of the cell drops sharply, decreasing the number of program/erase cycles it can take.
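Using the illustrative voltage ranges from the examples above (not datasheet values), reading a cell boils down to comparing its charge against a ladder of reference voltages: the more bits per cell, the more and narrower the windows. The sketch below simply labels each window with its index in binary, whereas real parts often use Gray-style orderings such as the TLC mapping listed above.

```python
def decode_cell(voltage, reference_levels, bits_per_cell):
    """Find which voltage window the measured charge falls into and return the
    corresponding bit pattern (window index in binary)."""
    level = sum(voltage >= ref for ref in reference_levels)
    return format(level, f"0{bits_per_cell}b")

# SLC: one reference at 2.5 V splits 0-5 V into two states.
print(decode_cell(3.1, [2.5], bits_per_cell=1))                     # '1'
# MLC: three references split 0-5 V into four states.
print(decode_cell(3.1, [1.25, 2.50, 3.75], bits_per_cell=2))        # '10'
# TLC: seven references split 0-4 V into eight states.
print(decode_cell(3.1, [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5], 3))     # '110'
```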

TLC is targeted at environments with predominantly read-heavy usage and has not been commonly used.

You might be wondering why I haven't touched eMLC yet and moved on to explain TLC. I kept eMLC for last because I want to take some time to explain in detail what eMLC is (a lot of vendors in the market try to convince customers that they need eMLC and that MLC just doesn't cut it) and how vendors like IBM, Violin and others have today built technologies around MLC that make it equivalent to eMLC.

Enterprise Multi Level Cell (eMLC)

For a long time, MLC was never considered for enterprise applications, as MLC on its own can take only 3,000 to 10,000 program/erase cycles, limiting the endurance and reliability of an MLC flash chip. Customers used to depend on SLC for enterprise applications, while MLC was used only in consumer devices such as cameras, smartphones, media players and USB sticks.

SLC being very expensive, corporates found it difficult to adopt even though it offers far more (around 100,000) program/erase cycles. Hence, vendors in the market started looking for a middle ground between MLC and SLC.

To address the reliability issues of MLC, NAND flash manufacturers created a grade of MLC called eMLC. In eMLC they decreased the density of the data that can be written to the cell (i.e. increased the voltage margin between states) and slowed down the speed at which data is written (programmed) to the flash device, increasing the program/erase cycles of MLC by about 3x and thus the endurance of the chip.

The following benefits were achieved as a result:

  • A lower bit error rate (the wider margin separating the states means fewer errors)
  • Program/erase (P/E) cycles increased to around 30,000.
  • Lower cost than SLC.

The following drawbacks came with it:

  • By decreasing the density of the data that can be written to a cell, they also increased the number of cells that must be written for the same amount of data compared with MLC.
  • The lower write speed leads to a decrease in performance compared with an MLC chip (and at a higher cost).

[Table: eMLC vs MLC SSD performance comparison]

Data taken from tests run on eMLC and MLC by “SSD guy”

The write parameters in the table above show how different these two technologies are:

  • The eMLC drive's 4K write IOPS are only 75% of the write IOPS of the MLC version
  • Its sequential write speed is 74% as fast as MLC's
  • A 70/30 read/write IOPS mix (similar to a standard workload) has a lot of reads. Since the read speed of eMLC equals that of MLC, the gap in this test is smaller: the eMLC SSD is 85% as fast as the MLC SSD
  • In the case of write latency, MLC has only 83% of the latency of the eMLC version.

In short, while eMLC provides a 3x increase in endurance over an MLC chip, your performance drops by 15-25%.

Also, as eMLC drives are sold roughly 100x less often, they are tested less in the field and, in turn, contain a much higher frequency of firmware bugs. MLC, on the other hand, is found in millions of consumer devices, so manufacturers rigorously test MLC drives to avoid early failures and widespread issues. In tests conducted by industry-leading MLC vendors, failure rates on eMLC were found to be 10x worse than MLC due to firmware issues.

Today, storage vendors like IBM, Violin Memory and others have taken MLC chips and improved the program/erase cycles (i.e. the endurance) of MLC itself by up to 9 times through over-provisioning, more intelligent controllers and proprietary error correction code algorithms.

Hence, I would suggest my readers do thorough research on the features below (and how they interact with their applications) before purchasing any enterprise flash device, rather than relying only on the superficial information given by storage vendors in the market.

  • Program Erase Cycles
  • Wear Leveling
  • Garbage Collection
  • Write Amplification
  • Error Correction Code Algorithms
  • Bit Error Rate.

Having done a marathon job of explaining the differences between SLC, MLC, eMLC and TLC, I want to end this post with a note that I will explore the above-mentioned topics in my next posts to help you understand flash further.

References: Toshiba, Micron, TechTarget

Understanding flash chip at its core

When I started researching flash, I delved so deep into the technology that I felt it would be difficult for my readers to understand SLC, MLC, eMLC and the rest without first explaining the basics of flash chip construction and the physics behind it. Hence I have decided to write a bit on the basics first, and then jump into the more detailed topics I intend to cover.

Flash Architecture

Flash is a type of non-volatile, solid-state storage technology. In enterprise applications, multiple Flash chips are used together to produce modules in the form of external rack mount systems or internal cards or drives.

A Flash memory chip is divided into multiple nested entities as below.

[Figure: nested structure of a flash chip, from chip to die to plane to block to page]

  • The flash chip is the black box or rectangle you see in every picture online. If you look at an SSD, a flash card or the internals of a flash array, you will see many flash chips, each of which is produced by one of the big flash manufacturers such as Toshiba, Samsung, Micron, Intel or SanDisk.
  • Each flash chip contains eight dies. The die is the smallest unit that can independently execute commands or report status.
  • Each die contains two planes. Identical, concurrent operations can take place on each plane although with some restrictions.
  • Each plane contains 2048 blocks, which are the smallest unit that can be erased.
  • Each block contains 64 pages, which are the smallest unit that can be programmed (i.e. written to), and this is where error correction code algorithms are applied. A page is a collection of cells (which I will talk about when explaining how flash works).

Write operations take place at the page level and are typically 8-16KB in size, while erase operations take place at the block level and are typically 4-8MB in size.
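Multiplying out the nesting described above (using the example figures from this post; real chips vary by manufacturer and generation) gives a feel for the sizes involved:

```python
# Geometry taken from the example above; real chips differ between vendors.
dies_per_chip    = 8
planes_per_die   = 2
blocks_per_plane = 2048
pages_per_block  = 64
page_size_kb     = 16            # assuming the larger 16 KB page size

block_size_mb = pages_per_block * page_size_kb / 1024
chip_size_gb  = (dies_per_chip * planes_per_die * blocks_per_plane *
                 pages_per_block * page_size_kb) / (1024 * 1024)

print(f"Block size: {block_size_mb:.0f} MB")  # 1 MB with these example figures;
                                              # parts quoting 4-8 MB blocks have
                                              # more or larger pages per block
print(f"Chip size:  {chip_size_gb:.0f} GB")   # 32 GB per chip with these figures
```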

How does a flash chip function?

This is a bit complicated for those from a non-physics background, so I shall simplify and explain its workings in layman's terms.

[Figure: MOSFET structure, showing source, drain, control gate and oxide insulator]

In the picture above, imagine the source as the starting point of the flow of electrons and the drain as the destination. The control gate is where the charge is applied to make the semiconductor move the electrons from source to drain. You will also see an insulator, which is nothing but the oxide layer that prevents the control gate from attaching directly to the source or drain. A transistor working this way is called a MOSFET (Metal Oxide Semiconductor Field Effect Transistor).

In the case of a flash cell, what you find is an FGMOSFET (Floating Gate Metal Oxide Semiconductor Field Effect Transistor).

[Figure: floating-gate MOSFET, with a floating gate and tunnel oxide layer added between the control gate and the semiconductor]

If you compare the two pictures above, you will see an additional gate in the flash cell, called the floating gate (because it is completely isolated), sitting between the control gate and the semiconductor. You will also notice an additional oxide layer, called the tunnel oxide layer, which is thinner than the blocking oxide layer.

How the floating gate functions is what tells us how a flash cell works. When you apply a high charge at the control gate, electrons flowing from source to drain tunnel (or jump) through the tunnel oxide layer into the floating gate, and the charge is retained there; this is the programmed state of the flash cell.

To erase the charge stored on the floating gate, a high voltage is applied from source to drain and a negative voltage is applied to the control gate, which makes the electrons stored on the floating gate tunnel back along their original path.

With electrons held in the floating gate (the programmed state), the control gate has to apply a higher charge to make the semiconductor conduct.

Having understood how programming (write) and erase operations happen, it is also important to understand how a read operation happens.

For a read operation, the control gate applies a voltage (VT) that sits between the two threshold voltages: the one the cell exhibits in the programmed state (VT0) and the one at which an erased cell conducts (VT1). If there are no electrons in the floating gate, the semiconductor conducts and returns a logical value of "1"; if there are electrons in the floating gate, it does not conduct and returns a logical value of "0". This again varies between cell types (SLC, MLC, etc.), which I will explain in further posts.
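A tiny sketch of that read decision for an SLC cell, with made-up threshold values purely for illustration:

```python
def read_slc_cell(cell_threshold_voltage, read_voltage):
    """Apply a read voltage between the erased and programmed thresholds:
    if the cell conducts (its threshold is below the read voltage) it reads
    as 1 (erased); if it does not conduct, it reads as 0 (programmed)."""
    conducts = cell_threshold_voltage < read_voltage
    return 1 if conducts else 0

# Illustrative values only: an erased cell might switch on at ~1 V, a
# programmed cell (electrons held on the floating gate) at ~4 V, so a read
# voltage of ~2.5 V sits safely between the two.
print(read_slc_cell(cell_threshold_voltage=1.0, read_voltage=2.5))  # 1 (erased)
print(read_slc_cell(cell_threshold_voltage=4.0, read_voltage=2.5))  # 0 (programmed)
```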

Having explained this complicated story of how a flash chip works, there are a few points I want you to remember, as they will form the basis of my further posts when I explain the differences between the various cell types and how they are used.

  • In an FGMOSFET the tunnel oxide layer which isolates the floating gate from the semiconductor is designed to be thin enough to allow tunneling of electrons when a high enough charge is applied, but this process gradually damages the layer.
  • Reads are not a problem because only lower voltages are used and no electron tunneling takes place.
  • In the case of program and erase operations it’s a different story, which is why wear is measured by the number of program/erase cycles.
  • As the layer gets more damaged, the isolation of the floating gate is increasingly affected and the probability of electrons leaking out will increase.

To conclude, in this article I have tried to explain how a flash chip works in the simplest possible way. Having covered the basics, the next post will explain the different cell types and how their design is modified to achieve specific purposes (and operations).

References: Micron, SanDisk