Firm and Steady Wins the Race (Fibre Channel Vs iSCSI)

Fibre Channel (FC) vs Internet Small Computer Systems Interface (iSCSI) has been one of the most disputed topics in storage area networking (SAN) for at least a decade. It reminds me of the tortoise and hare story we used to hear as kids. Most people still have the misconception that iSCSI is low cost, lower performance and easy to deploy because it runs on the same kind of Ethernet network that servers and clients are already running on. In this blog I am going to explain FC and iSCSI and how each has developed over the years.

Figure 1: Block Protocol Summary

Please take a close look at the Block Protocol layout in Figure 1, which I will refer to throughout this blog as the basis of my discussion.

Fibre Channel (FC)

Fibre Channel is a set of related physical-layer networking standards. It was developed to transport data at high speed with low overhead.

There are two basic types of data communication:

  1. Between processors and peripherals – Channels
  2. Between processors – Networks

A channel provides a direct or switched point-to-point connection between the communicating devices. A channel is typically hardware-intensive and transports data at high speed with low overhead.

In contrast, a Network is an aggregation of distributed nodes with its own protocol that supports interaction among these nodes. A network has relatively high overhead since it is software-intensive, and consequently slower than a channel. Networks can handle a more extensive range of tasks than channels as they operate in an environment of unanticipated connections, while channels operate amongst only a few devices with predefined addresses.

Fibre Channel attempts to combine the best of these two methods of communication into a new I/O interface that meets the needs of channel users and also network users.

Like Ethernet, its main competitor, Fibre Channel can utilize copper wiring. However, copper limits Fibre Channel to a maximum recommended reach of 30 meters, whereas with more expensive fiber optic cables, it reaches up to 10 kilometers.

The technology was specifically named Fibre Channel rather than Fiber Channel to distinguish it as supporting both fiber and copper cabling.

Fibre Channel does not follow the OSI model layering and is split into five layers:

  • FC-4 – Protocol-mapping layer, in which upper-layer protocols such as SCSI, IP or FICON are encapsulated into Information Units (IUs) for delivery to FC-2. Current FC-4s include FCP-4, FC-SB-5, and FC-NVMe.
  • FC-3 – Common services layer, a thin layer that could eventually implement functions like encryption or RAID redundancy algorithms, as well as multiport connections.
  • FC-2 – Signaling protocol layer, defined by the Fibre Channel Framing and Signaling standard, consisting of the low-level Fibre Channel protocols and port-to-port connections.
  • FC-1 – Transmission protocol layer, which implements line coding of signals.
  • FC-0 – Physical layer, which includes cabling, connectors, etc.

Figure 2: Layers of the Fibre Channel Protocol (Reference used from BytePile.com)

When it was first introduced, Fibre Channel enabled campus-wide consolidation of high-throughput storage. Network-attached storage using Ethernet had in fact existed for many years before the introduction of Fibre Channel, but its low throughput (based on the 10 Mbps rates of the time) made it unsuitable for many applications compared to directly attached storage, which was about 20x faster.

Following are the advantages that made Fibre Channel popular in the past:

  • Price Performance Leadership – Fibre Channel delivers cost-effective solutions for storage and networks.
  • Solutions Leadership – Fibre Channel provides versatile connectivity with scalable performance.
  • Reliability – Fibre Channel, one of the most reliable forms of communication, sustains an enterprise with assured information delivery: it is a lossless network. It also reduced the amount of internal cabling within servers and storage systems, because Fibre Channel is a serial interface, whereas SCSI was then a parallel interface (it now offers both serial and parallel options) with a very high cable core and connector count.
  • Multiple Topologies – Dedicated point-to-point, shared loops, and scaled switched topologies meet application requirements.
  • Multiple Protocols – Fibre Channel delivers data. SCSI, TCP/IP, video, or raw data can all take advantage of high-performance, reliable Fibre Channel technology.
  • Scalable – From single point-to-point gigabit links to integrated enterprises with hundreds of servers, Fibre Channel delivers unmatched performance.
  • Fault Tolerance – Fibre Channel included fault-tolerant mechanisms for rerouting around failed cable loops, although it was many years before these were properly supported by transparent failover software.
  • High Efficiency – Real price performance is directly correlated to the efficiency of the technology. Fibre Channel has very little transmission overhead. Most important, the Fibre Channel protocol is specifically designed for highly efficient operation in hardware.

As the decades have passed, the original advantages of Fibre Channel have blurred as other technologies adopted similar attributes. We are now used to the concept of networking data at high speed over large distances using many different types of technologies. At the building or campus network level, IP technology in the form of iSCSI offers identical performance these days for most common tasks like backup, replication, etc. The term SAN, which originally referred just to a Fibre Channel-connected network, has also become defocused by common usage: so we get IP-SAN, which usually refers to iSCSI and uses Ethernet (not Fibre Channel) as the underlying transport in a local area network.

Over the years the Fibre Channel protocol has also morphed into new variants – Fibre Channel over Ethernet (FCoE), Fibre Channel over IP (FCIP) and the Internet Fibre Channel Protocol (iFCP) – to serve various needs of the IT industry, but each has its own pros and cons, which I will discuss in coming blogs. In this blog I will stick to just FC and iSCSI.

Internet Small Computer Systems Interface (iSCSI)

iSCSI, which stands for Internet Small Computer System Interface, is an Internet Protocol (IP)-based storage networking standard for linking data storage facilities. It provides networked block-level shared access to storage devices by carrying SCSI commands over a TCP/IP network. iSCSI is used to facilitate data transfers over intranets and to manage storage over long distances. It can be used to transmit data over local-area networks (LANs), wide-area networks (WANs) or the Internet.

IBM & Cisco developed iSCSI as a proof of concept in 1998, and presented the first draft of the iSCSI standard to the Internet Engineering Task Force (IETF) in 2000. The protocol was ratified in 2003.

How iSCSI Works

Before I explain how iSCSI works, you need to understand how SCSI works, as it is part of iSCSI.

The Small Computer Systems Interface (SCSI) is a popular family of protocols for communicating with I/O devices, especially storage devices.

There are two types of devices in the SCSI protocol:

  • SCSI initiators (mostly the OS), which start the communication.

Initiators are devices that request that commands be executed.

  • Targets (file servers or storage), which respond.

Targets are devices that carry out the commands. The endpoint within the target that executes the command is referred to as a “logical unit” (LU). A target is a collection of logical units, in general of the same type, which are directly addressable.

The structure used to communicate a command from an application client to a device server is referred to as a Command Descriptor Block (CDB). A SCSI command or a linked set of commands is referred to as a “task.” Only one command in a task can be outstanding at any given time.

SCSI command execution results in an optional data phase and a status phase:

  • Data phase

In the data phase, data travels either from the initiator to the target, as in a WRITE command, or from the target to the initiator, as in a READ command.

  • Status phase

In the status phase, the target returns the final status of the operation. The status response terminates a SCSI command or task.

The basic function of the SCSI driver is to build SCSI Command Descriptor Blocks (CDBs) from requests issued by the application and forward them to the iSCSI layer. The SCSI driver also receives CDBs from the iSCSI layer and forwards the data to the application layer.


Figure 3: iSCSI Network connectivity

iSCSI provides initiators and targets with unique names as well as a discovery method (described below). The iSCSI protocol establishes communication sessions between initiators and targets, and provides methods for them to authenticate one another. An iSCSI session may contain one or more TCP connections and provides recovery in the event that connections fail. Here is the way data is transmitted over iSCSI:

  • SCSI CDBs are passed from the SCSI generic layer to the iSCSI transport layer.
  • The iSCSI transport layer encapsulates the SCSI CDB into an iSCSI Protocol Data Unit (PDU) and forwards it to the Transmission Control Protocol (TCP) layer.
  • On a read, the iSCSI transport layer extracts the CDB from the iSCSI PDU received from the TCP layer and forwards the CDB to the SCSI generic layer.
  • iSCSI thus provides the SCSI generic command layer with a reliable transport.
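To make the layering concrete, here is a hedged Python sketch. The READ(10) CDB layout is the standard 10-byte SCSI one, but the PDU wrapper is deliberately simplified; the real iSCSI Basic Header Segment defined in the RFCs is a 48-byte structure with many more fields, so treat this as an illustration of the encapsulation idea, not the wire format.

```python
import struct

def read10_cdb(lba: int, blocks: int) -> bytes:
    # A 10-byte SCSI READ(10) CDB: opcode 0x28, then the logical block
    # address and the transfer length, both big-endian.
    return struct.pack(">BBIBHB", 0x28, 0, lba, 0, blocks, 0)

def toy_pdu(cdb: bytes, task_tag: int) -> bytes:
    # NOT the real 48-byte iSCSI Basic Header Segment -- just enough
    # framing to show a CDB being wrapped for delivery over TCP.
    return struct.pack(">IH", task_tag, len(cdb)) + cdb

pdu = toy_pdu(read10_cdb(lba=2048, blocks=8), task_tag=1)
print(len(pdu), pdu.hex())   # this byte string is what would ride inside TCP
```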

The following diagram illustrates the layering of the various SCSI command sets and data over different transport and physical layers.


Figure 4: Layering of SCSI command sets and data over different transport and physical layers (Reference used from diskdrive.com)

iSCSI Naming & Addressing

In an iSCSI network, each component (initiator or target) has its own unique name. Let’s have a look at the naming types. iSCSI provides three name formats:

1. iSCSI Qualified Name (IQN)

Briefly, the fields are:

  • Literal iqn (iSCSI Qualified Name)
  • Date (yyyy-mm) that the naming authority took ownership of the domain
  • Reversed domain name of the authority (e.g. org.alpinelinux, com.example, to.yp.cr)
  • Optional “:” prefixing a storage target name specified by the naming authority.

Figure 5: IQN Naming Format

2. Extended Unique Identifier (EUI)

Format: eui.{EUI-64 address} (e.g. eui.02004567A425678D)

3. T11 Network Address Authority (NAA)

Format: naa.{NAA 64- or 128-bit identifier} (e.g. naa.52004567BA64678D)

IQN format addresses occur most commonly. They are qualified by a date (yyyy-mm) because domain names can expire or be acquired by another entity.

NAA name formats were added to iSCSI to provide compatibility with naming conventions used in Fibre Channel and Serial Attached SCSI (SAS) storage technologies.

Usually, an iSCSI participant can be defined by three or four fields:

  1. Hostname or IP Address (e.g., “iscsi.example.com”)
  2. Port Number (e.g., 3260)
  3. iSCSI Name (e.g., the IQN “iqn.2003-01.com.ibm:00.fcd0ab21.shark128”)
  4. An optional CHAP secret (e.g., “secretsarefun”)
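To illustrate the naming fields above, here is a small Python sketch; the regular expression is a loose sanity check good enough for this blog’s examples, not a full implementation of the IQN grammar from the iSCSI RFCs.

```python
import re

IQN_RE = re.compile(r"^iqn\.(\d{4})-(\d{2})\.([a-z0-9.-]+)(?::(.+))?$")

def parse_iqn(name: str) -> dict:
    # Split an IQN into the date, the reversed-domain naming authority
    # and the optional ":"-prefixed target string chosen by that authority.
    m = IQN_RE.match(name)
    if not m:
        raise ValueError(f"not an IQN: {name}")
    year, month, authority, target = m.groups()
    return {"date": f"{year}-{month}", "authority": authority, "target": target}

print(parse_iqn("iqn.2003-01.com.ibm:00.fcd0ab21.shark128"))
# {'date': '2003-01', 'authority': 'com.ibm', 'target': '00.fcd0ab21.shark128'}
```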

iSCSI Discovery

An iSCSI initiator can discover an iSCSI target in the following different ways:

  • By configuring the target’s address on the initiator.
  • By configuring a default target address on the initiator; the initiator then connects to that target and requests a list of iSCSI names via a separate SendTargets command.
  • By issuing Service Location Protocol (SLP) multicast requests, to which the targets may respond.
  • By querying a storage name server (iSNS) for a list of targets it can access.

iSNS: Internet Storage Name Service, a centralized server that holds the iSCSI configurations of initiators and targets.

SLP: Service Location Protocol, which is not widely implemented but helps computers find iSCSI services across the network.

iSCSI Security

iSCSI supports two separate security mechanisms:

  1. In-band authentication between initiator and target at the iSCSI connection level (such as CHAP authentication), which occurs during login to the storage.
  2. Packet protection by IPsec at the IP level (all packets are secured).

An iSCSI initiator first logs in to the storage (this is where in-band authentication is applied, if needed); after a successful login it starts exchanging packets (this is where IPsec can be applied, if needed).
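CHAP itself is simple enough to sketch. Per RFC 1994, the initiator proves knowledge of the shared secret by hashing a one-octet identifier, the secret and the challenge sent by the target; the minimal Python sketch below shows only that hash step (the real iSCSI login carries these values inside text key=value pairs, which is omitted here).

```python
import hashlib, os

def chap_response(identifier: int, secret: bytes, challenge: bytes) -> bytes:
    # RFC 1994: response = MD5(identifier || secret || challenge)
    return hashlib.md5(bytes([identifier]) + secret + challenge).digest()

# Target issues a random challenge at login time...
challenge, ident = os.urandom(16), 1
# ...the initiator answers, and the target recomputes and compares.
answer = chap_response(ident, b"secretsarefun", challenge)
assert answer == chap_response(ident, b"secretsarefun", challenge)
```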

By now you might have understood that the iSCSI protocol enables universal access to storage devices and storage-area networks (SANs) over standard Ethernet-based TCP/IP networks.

IP Ethernet network infrastructures provide major advantages for interconnecting servers to block-oriented storage devices. The following are the advantages you get using iSCSI:

  • Easy installation and maintenance of iSCSI SANs – Skills developed in the design and management of IP local-area networks (LANs) can be applied to native IP SANs, and trained, experienced IP networking staff are available to install and operate these networks.
  • Low cost – iSCSI uses the existing network infrastructure, so there is no need to buy expensive equipment, and economies are achieved from using standard IP infrastructure, products, and services across the organization.
  • Excellent performance – iSCSI is a very good alternative to the more expensive Fibre Channel technology. Gigabit Ethernet switches and routers have advanced capabilities including ultra-low error rates, flow control, link aggregation, and full-duplex operation, and they transfer data at optimal rates over LAN, WAN, and metropolitan-area networks (MANs).
  • No distance limitation – using IP networking solves the problem of data replication to remote sites.
  • Interoperability and flexibility – iSCSI uses standard Ethernet switches, so there is no need to install the special cabling and switches required by Fibre Channel; it can also run at different Ethernet speeds (users can choose between Gigabit, 10GbE or higher).
  • Compatibility – iSCSI is compatible with many commonly used standards, respected and recognized by the Internet Engineering Task Force (IETF). It is compatible with existing Ethernet and IP WAN infrastructures and will coexist with other IP protocols on a network infrastructure.
  • Multipathing – iSCSI supports multipathing to improve network resiliency.
  • Security – iSCSI offers security features such as the Challenge Handshake Authentication Protocol (CHAP) and Internet Protocol Security (IPsec).

After reading this blog you might have understood that neither FC nor iSCSI is inherently the lower-performance or lower-cost protocol at this point; each comes with its own advantages and disadvantages, and each is competing by mutating into different forms to meet market demand and price points. So it is actually up to you to choose what best fits each purpose and deploy accordingly.

References: BytePile.com, Wikipedia, http://www.tomshardware.com, www.diskdrive.com

Exploiting Point in Time Copy Services

Of late, I have been working on a lot of cases where copies are inherent to the environment. I also noticed that in many cases the data occupied by copies is a lot more than the actual production data. Hence I decided to bounce back into writing my blog with “Point in Time Copy Services” (which many people would identify as snapshots, FlashCopy, or simply copies).

Point in Time Copies

The Storage Networking Industry Association (SNIA) defines a point-in-time copy as:

A fully usable copy of a defined collection of data that contains an image of the data as it appeared at a single point-in-time. The copy is considered to have logically occurred at that point-in-time, but implementations may perform part or all of the copy at other times, as long as the result is a consistent copy of the data that appeared at a particular point-in-time.

Before the invention of point-in-time copy, to create a consistent copy of the data the application had to be stopped while the data was physically copied. For large data sets, this could easily involve a stoppage of several hours, and this overhead meant that there were practical limits on making copies. Today’s point-in-time copy facilities allow a copy to be created with almost no impact on the application; in other words, other than perhaps a very brief period of seconds or minutes while the copy or bitmap is established, the application can continue running.

Over the years many kinds of point-in-time copy have been developed. In this blog series, I am going to explain the benefits and drawbacks of the various point-in-time copies, their architecture and operational considerations, and how they are used in the industry.

Snapshots

Snapshot is a commonly used industry term (snapshots are also called space-efficient copies) for the ability to capture or record the state of data at any given moment or point in time, and to preserve that snapshot as a guide for restoring the data in case of a failure of the storage device. Typically, a snapshot copy is made instantly and made available for use by other applications such as backup, UAT, reporting, and data replication. The original copy of the data continues to be available to the applications without interruption, while the snapshot copy is used to perform other functions on the data.


Figure 1: Snapshot Copies

Snapshots enable better application availability, faster recovery, easier backup management of large volumes of data, reduced exposure to data loss, and virtual elimination of backup windows, and they lower total cost of ownership (TCO) because a snapshot doesn’t occupy any space unless the production data is changed.

There are different approaches to the way snapshots are implemented. Each approach has its own benefits and drawbacks, so it is very important to understand the various approaches and how they fit your needs or your applications. Below are the most commonly used methodologies:

Copy on Write (CoW):

When the snapshot is first created, only the metadata about where the original data is stored is copied. No physical copy of the data is made at the time the snapshot is created. Therefore, the creation of the snapshot is both time- and space-efficient.


Figure 2: Snapshots when first created (Only Bitmap or metadata gets copied)

As blocks on the original volume change, the original data is copied (moved over) into the pre-designated space (reserved storage capacity) set aside for the snapshot prior to the original data being overwritten. The original data blocks are copied just once at the first write request (after the snapshot was taken; this technique is also called copy-on-first-write). This process ensures that snapshot data is consistent with the exact time the snapshot was taken, and is why the process is called “copy-on-write.”


Figure 3: Copy on Write (CoW) Snapshots

After the initial creation of a snapshot, the snapshot copy tracks the changing blocks on the original volume as writes to the original volume are performed. Hence the implementation of “copy-on-write” snapshots requires the configuration of a pre-designated space (typically 10-20% of the size of the volume/LUN) to store the snapshots.

Pros of Copy on Write:

  • In workloads that use easy tiering, where the hot data sits on SSD or flash: when a snapshot is taken and a change or update happens to the hot data, the snapshot algorithm moves the old data block from SSD to SAS and writes the new data on SSD, thus maintaining performance even when data is updated. Hence it becomes very important to choose the right snapshot implementation approach based on your workload.

Cons of Copy on Write:

Any changes/updates to the original volume incur a double write penalty. Here is what happens:

  • The file system reads in the original data blocks (1 x read I/O penalty) in preparation for the copy. In Figure 3, blocks B and D will be updated with new data.
  • Once the original data (B, D) is read from the production LUN, it is copied (1 x write I/O penalty) into the designated storage pool set aside for the snapshot before the original data is overwritten, hence the name “copy-on-write”. (Figure 3)
  • It then writes the new and modified data blocks (B (deleted), D+) to the original data block locations (1 x write I/O penalty) and re-links the blocks to the snapshot. (Figure 3)

In short, a write (change/update) to a volume with a copy-on-write snapshot takes:

  • 1 read (1 x read I/O) and
  • 2 writes (2x write I/O)

Thus, it becomes very important to see what kind of workload you are using it for.
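To make the penalty concrete, here is a toy Python model of the copy-on-write path just described. It only counts I/Os; block names mirror Figure 3, and no real array implements it this literally.

```python
# Toy I/O accounting for a copy-on-write snapshot, assuming the
# 1-read + 2-write penalty described above.
class CowVolume:
    def __init__(self, blocks):
        self.data = dict(blocks)       # live volume: block id -> contents
        self.snapshot = {}             # preserved originals, copied on first write
        self.reads = self.writes = 0

    def write(self, block, new_value):
        if block not in self.snapshot:       # first write since the snapshot
            self.reads += 1                  # read the original block
            self.snapshot[block] = self.data[block]
            self.writes += 1                 # copy it into the snapshot area
        self.data[block] = new_value
        self.writes += 1                     # finally, write the new data

vol = CowVolume({"A": 1, "B": 2, "C": 3, "D": 4})
vol.write("B", 20); vol.write("D", 40)
print(vol.reads, vol.writes)   # 2 reads, 4 writes: 1R + 2W per first update
```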

Redirect on Write (RoW):

Like copy on write, redirect on write copies only the metadata about where the original data is stored when the snapshot is first created, so it is also time- and space-efficient. By design a redirect-on-write (RoW) snapshot is optimized for write performance: any changes/updates are redirected to new blocks. Instead of writing one copy of the original data to the snapshot reserved space (cache, LUN reserve, or snapshot pool, as various vendors call it) plus a copy of the changed data, as copy on write (CoW) requires, redirect on write (RoW) writes only the changed data, redirected to new blocks.


Figure 4: Redirect on Write (RoW) Snapshots

Pros of Redirect on Write:

Any changes/updates to the original volume are performed as follows:

  • The filesystem writes updates to new blocks. The filesystem keeps track of available blocks, which allows changes to be made very efficiently. For example, as data blocks (B, D) are changed/updated, pointers in the active file system (the original copy) are redirected to new blocks (B (deleted), D+); however, the snapshot pointers still point to the original blocks to preserve that point-in-time image. (Figure 4)

In short, a write to a volume takes:

  • 1 write (1x write I/O)
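And here is the redirect-on-write counterpart, as a toy Python model mirroring the copy-on-write sketch earlier: the update goes straight to a fresh block and only the live pointer moves, so each change costs a single write.

```python
# The snapshot keeps pointing at the old blocks; each update is a
# single write to a freshly allocated block.
class RowVolume:
    def __init__(self, blocks):
        self.store = dict(blocks)                # physical blocks
        self.live = {b: b for b in blocks}       # live (active) pointers
        self.snap = dict(self.live)              # snapshot pointers, frozen
        self.writes = 0

    def write(self, block, new_value):
        new_loc = block + "'"                    # allocate a fresh block
        self.store[new_loc] = new_value
        self.live[block] = new_loc               # only the live map moves
        self.writes += 1                         # 1 write, no read

vol = RowVolume({"A": 1, "B": 2, "C": 3, "D": 4})
vol.write("B", 20); vol.write("D", 40)
print(vol.writes)   # 2 -- one write per update
```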

Cons of Redirect on Write:

  • With redirect-on-write, the original copy holds the point-in-time data (that is, the snapshot), while the changed data resides in the snapshot storage. When a snapshot is deleted, the data in the snapshot storage must be reconciled back into the original volume.
  • As multiple snapshots are created, access to the original data, tracking of the data in snapshots and the original volume, and reconciliation upon snapshot deletion all become more complicated.
  • The snapshot relies on the original copy of the data, and the original data set can quickly become fragmented.
  • When working in tandem with solutions like easy tiering, if the hot data is on SSD or flash and has to be edited, the new update will be written to the SAS drives, causing a performance impact.

Therefore, it becomes very important to understand the type of workloads and the type of storage being used to plan your snapshot implementation approach accordingly.

After reading so much about snapshots, I am sure you might already be thinking about when and why you should use each type of snapshot. I am ending this blog with snapshots, but please don’t conclude that I have finished explaining point-in-time copies. There is a lot more to understand about point-in-time copies and many more types of copies to help you plan your storage implementation accordingly.

References: SNIA, IBM developerWorks, IBM Redbooks

Flash: Write Amplification, Bit Error Rate & ECC Algorithms

Just like any new technology, along with the “wow” factor come the limitations. “Flash storage” in its native state also has issues which need to be dealt with by every vendor to make it more reliable, provide better endurance and thus increase the life of a flash chip. I am hereby using this blog post as a center stage to discuss those concerns and the methods every vendor adopts to increase the longevity of their flash chips.

Write Amplification

“Write amplification”, as the name implies, is a phenomenon which increases the number of “writes”: the actual amount of physical data written is a multiple of the logical amount intended to be written. Keeping it under control plays a critical role in the endurance of a flash chip.

The lower the write amplification, the longer the flash will last. Flash architects pay special attention to this aspect of controller design.

From what I have explained so far, many operations like garbage collection, wear leveling, etc. keep happening in the background, and performing these operations results in moving (or rewriting) user data and metadata more than once.

Thus, rewriting some data (even one cell) requires an already used portion of flash to be read, updated and written to a new location, together with initially erasing the new location if it was previously used at some point in time. Due to the way flash works, much larger portions of flash must be erased and rewritten than the amount of new data actually requires.

This multiplying effect increases the number of “writes” required over the life of the flash, which shortens the time it can reliably operate.

Vendors use various techniques like compression and deduplication (the latter has other implications, as the garbage collection or wear leveling algorithm has to communicate with the hash algorithm to do the erase, increasing the load on the processor) to reduce write amplification and increase the lifespan of the chip.
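The usual way to quantify all of this is the write amplification factor (WAF): physical bytes written to the flash divided by the logical bytes the host asked to write. A quick back-of-the-envelope in Python, with all the numbers made up for illustration:

```python
# Write amplification factor = physical writes / logical (host) writes.
host_writes_gb = 100
gc_rewrites_gb = 45          # data moved around by garbage collection
wear_level_moves_gb = 15     # cold data relocated for wear leveling

waf = (host_writes_gb + gc_rewrites_gb + wear_level_moves_gb) / host_writes_gb
print(f"write amplification factor: {waf:.2f}")          # 1.60

# For a hypothetical 3,000-P/E-cycle MLC chip, the endurance the host
# actually sees scales down by the WAF:
print(f"effective host-visible cycles: {3000 / waf:.0f}")  # 1875
```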

Bit Error Rate

Bit error rate, as the definition says, is the rate at which “bit” errors occur in the data read back. Errors can occur for a number of reasons; there are two main causes of “bit” errors:

  • Read / Write disturb
  • Charge getting trapped

Program (or Write) / Read Disturb – Due to the small size of flash gates in MLC, when you apply a threshold voltage (as explained in “Understanding flash at its core”) to read or program a cell, there is every chance that nearby cells (within the same block) get disturbed, so that their programmed state changes or they become slightly programmed. This would otherwise be nullified when an erase operation happens, or corrected by error correction code (ECC) algorithms.

Charge Getting Trapped – When you are programming or erasing a cell, there is a chance that electrons get trapped in the tunnel oxide layer between the floating gate and the semiconductor while tunneling. This usually happens when cells have been programmed and erased quite a few times and the tunnel oxide layer has become weak. Advanced error correction code algorithms are stored alongside user data to ensure that incorrect information is spotted and dealt with, while any underlying pages are marked as unusable.

NAND flash errors can also be caused by elevated heat, manufacturing defects, or even simply repeated use, also known as wear-out. Hence, the error correction code (ECC) algorithms used, and the level at which the vendor has programmed them to handle these errors, become important when measuring the endurance of a flash system.

Error Correction Code (ECC) Algorithms

Error correction code algorithms have a big impact on the endurance of a flash chip.

All NAND flash requires ECC to correct random “bit” errors. In the drive to make NAND flash cheap, the voltage margins inside NAND flash have become very narrow, which is why errors like read disturb, program disturb and errors due to wear occur.

Error correction code algorithms also come in two types:

  1. Predictable error checks, which help to correct errors caused by internal mechanisms inherent to the design of the chip. A prime example of such an error is adjacent cell disturb.
  2. Unpredictable error checks, which help in correcting the more unpredictable wear that occurs due to charge getting trapped, elevated heat, etc. and is not expected to happen given the measures already in place. In short, this type handles more complex errors.

More sophisticated ECC requires more processing power in the controller and may be slower than less sophisticated algorithms. Also, the number of errors that can be corrected can depend upon how large a segment of memory is being corrected. A controller with elaborate ECC capabilities is likely to use more compute resources and more internal RAM than one with simpler ECC. These enhancements make the controller more expensive, hence the increased cost of a flash device.
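Production NAND controllers use heavyweight codes such as BCH or LDPC across whole pages, but the principle is easy to show with a toy single-error-correcting Hamming(7,4) code in Python:

```python
def encode(d):                       # d = 4 data bits, e.g. [1, 0, 1, 1]
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4                # parity over codeword positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4                # parity over codeword positions 3, 6, 7
    p3 = d2 ^ d3 ^ d4                # parity over codeword positions 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]   # positions 1..7

def correct(c):
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]   # recheck p1 (positions 1, 3, 5, 7)
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]   # recheck p2 (positions 2, 3, 6, 7)
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]   # recheck p3 (positions 4, 5, 6, 7)
    pos = s1 + 2 * s2 + 4 * s3       # syndrome = 1-based position of the error
    if pos:
        c[pos - 1] ^= 1              # flip the bad bit back
    return [c[2], c[4], c[5], c[6]]  # recovered data bits

word = encode([1, 0, 1, 1])
word[4] ^= 1                         # simulate a read-disturb bit flip
assert correct(word) == [1, 0, 1, 1]
```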

Another thing I have not covered is the math behind error correction code algorithms. I am mesmerized by the aptitude of the folks who have mastered ECC, and their ability to extract more life out of a flash block than any mere mortal would think possible. I am not attempting to explain it in depth in this blog, as I want this blog to be simple enough for anyone to understand.

Last but not least, every error correction code algorithm is designed to correct only a limited number of errors (the limit usually depends on the research an organization has done on the flash chip), and this actually determines the lifespan of a flash chip.

References: SanDisk, Toshiba, Micron, IBM Redbooks

Flash: P/E Cycles, Wear Leveling & Garbage Collection

Starting with this post, I am going to explain the real deal in the functioning of flash storage, which in turn may change your outlook on how you evaluate a flash product.

Program Erase (P/E) Cycles

As explained in my previous blog, when you write data onto flash the cell is put into the programmed state, that is, you hold electrons in the floating gate. Write operations happen at the page level (pages are typically 8-16KB in size). Read operations also happen at the page level.

An interesting part of the operations in a flash chip is updating data that has already been written: unlike disk storage, you cannot just perform an in-place update or undo or change a particular piece of data. In a flash chip, if you want to update or change data already written, you have to erase the old data first and rewrite the whole data again, and erase operations happen at the block level.


Every time you have to erase a page or update data in a page, you have to erase the whole block, even if you don’t want to update the other pages in the block. An erase operation takes longer than a read operation, as you have to change the whole block. This is why the life of a flash chip is measured in program/erase cycles (also referred to as P/E cycles): program and erase go hand in hand, and together they damage the oxide layer (please refer to my blog “Understanding flash at its core”), and each flash chip can take only a limited number of program and erase cycles.

An alternative to this “erase operation” is to mark the page as invalid and write the new data to a new page. This way you can avoid the costly erase cycle and increase the life of the chip, but when you write the data to another location you have to redirect reads of the page marked as invalid to the new location. This is where the flash translation layer kicks in.

So, to make flash a friendly medium for storing our data, we have an abstraction layer (the flash translation layer) which will do three things (sketched in code after this list):

  1. Write updated information to a new empty page and then divert all subsequent read requests to its new address
  2. Ensure that newly programmed pages are evenly distributed across all of the available flash so that it wears evenly
  3. Keep a list of all the old invalid pages so that at some point, later on, they can all be recycled for reuse
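Here is a minimal Python sketch of those three jobs; a real FTL also juggles block-level erases, wear counters and power-loss safety, all ignored here.

```python
# Logical pages are remapped to fresh physical pages on every update;
# the old page is only marked invalid -- no erase on the write path.
class ToyFTL:
    def __init__(self, num_pages):
        self.free = list(range(num_pages))   # physical pages not yet written
        self.map = {}                        # logical page -> physical page
        self.invalid = set()                 # stale pages awaiting recycling

    def write(self, logical, data, flash):
        phys = self.free.pop(0)              # always program a fresh page
        flash[phys] = data
        if logical in self.map:
            self.invalid.add(self.map[logical])  # old copy becomes stale
        self.map[logical] = phys             # reads now divert here

flash = {}
ftl = ToyFTL(num_pages=8)
ftl.write(0, "v1", flash)
ftl.write(0, "v2", flash)                    # update: remap, don't erase
print(ftl.map[0], ftl.invalid)               # 1 {0}
```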

Wear Leveling

Wear leveling sounds pretty simple when you first hear of it. You have flash with a defined set of blocks and P/E cycles (programs/writes happen at the page level and erases happen at the block level). As constant program and erase cycles wear out the flash blocks, instead of erasing a block every time you have to update a page within it, you mark that page as invalid and write to a new page. These invalid pages at some point have to be erased to reutilize the space. This helps the flash blocks wear out evenly, rather than a few blocks wearing out early and reducing the capacity promised to the customer.

There is also another part of wear leveling which we often overlook: within the flash storage there will be some blocks which are frequently read but rarely or never updated, where data doesn’t change. These are cold blocks, while the ones being updated are hot blocks. The cold blocks would never wear out, which again leads to uneven wearing of the flash. To avoid such a situation, the system takes steps to relocate that cold data; otherwise those blocks won’t ever wear… and that means we are actually adding write workload to the system, which ultimately means increasing the wear.

In other words, the more aggressive we are at wear leveling, the earlier we wear out the system; but if we skip wear leveling because of these cons, we end up with hot and cold spots and uneven wearing of the system. Hence, it is a question of the right balance.

Garbage collection

We have so far talked about marking pages invalid and writing the new data to a fresh page. These invalid pages have to be recycled, i.e. they have to be erased. Of course, an erase is a big operation, as there are many pages in the same block which are still in use, and in flash you have to erase a complete block; you cannot just recycle a page.

Let me explain the tricky part

[Figure: a volume with 30% of its blocks written, the rest empty]

In the above picture, you will see that 30% of the blocks are written and the rest are empty.

[Figure: pages marked invalid/stale as data is updated]

Now if the data has to be updated, then instead of erasing the whole block and re-writing it, those pages are marked as invalid or stale and the data is written to another page in the same block or in another block.

[Figure: garbage collection copies the valid pages of one block into free space elsewhere, then erases the block]

In the above diagram, there is 50% free space in each block, which the garbage collection algorithm can use to copy the data from a second block, erase that block completely and reclaim the space.

What if the blocks are 50-70% full, like in the diagram below?

[Figure: blocks that are 50-70% full]

How will the garbage collection algorithm erase the invalid pages without being able to copy the complete block data into other blocks?

This situation is a disaster, because at this point the system can never free up the stale pages, which means I have effectively turned my flash system into a read-only device; and if you look at the capacity graph, I have used only 70% of the capacity. Does this mean I can never use my flash system to 100%?

This is why all flash vendors over-provision the storage (as below) for free, to help you utilize 100% of the flash storage.

[Figure: over-provisioned flash capacity hidden above the usable capacity]

Yes, there is more flash in your device than you can actually see. This extra area is not pinned, by the way; it is not a dedicated set of blocks or pages. It is just mandatory headroom that stops situations like the one described above from ever occurring.
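In the same toy spirit as the FTL sketch earlier, here is a greedy garbage-collection policy in Python: pick the block with the most stale pages, relocate whatever is still valid, then erase the block. The over-provisioned headroom described above is exactly what guarantees there is always somewhere to relocate those valid pages. The geometry here is made up.

```python
PAGES_PER_BLOCK = 4

def pick_victim(blocks):
    # blocks: {block_id: set of invalid (stale) page offsets}
    # Greedy policy: the block with the most stale pages costs the
    # fewest valid-page copies to reclaim.
    return max(blocks, key=lambda b: len(blocks[b]))

blocks = {0: {0, 1, 3}, 1: {2}, 2: set()}
victim = pick_victim(blocks)
valid = PAGES_PER_BLOCK - len(blocks[victim])
print(f"erase block {victim}: relocate {valid} valid page(s), "
      f"reclaim {len(blocks[victim])} stale page(s)")
```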

Having explained three important concepts of flash I will take a break now.

I will explain a few more interesting aspects of NAND flash functioning in the coming blogs, hence………to be continued.

NAND Flash: SLC or MLC or eMLC or TLC

A big question that I come across all the time is, “Why should a customer go with MLC (or cMLC) or eMLC flash?”

The storage market is so dominated by MLC NAND flash (I will not mention NAND further in this article, but I am talking only about NAND flash in this post) that no one even looks at SLC or TLC, and many are not even aware that SLC and TLC exist. So when I decided to write this blog post, I thought I should cover the complete picture of flash, to give my readers an end-to-end view of flash solutions, which will help them in planning the storage needs of their organization.


Picture used from Toshiba’s document

As discussed in my previous post (Understanding flash at its core), a cell is programmed by applying a voltage to the control gate and erased by applying a negative voltage to the control gate. If the cell is programmed to either a “0” or a “1” level – for example, if you apply a maximum of 5V at the control gate and any charge below 2.5V (50% of the maximum charge of 5V) is read as “0” while any charge above 2.5V is read as “1” – it is called a single-level cell (SLC).

SLC flash is always in one of two states, programmed (0) or erased (1). As there are only two choices, zero or one, the state of the cell can be interpreted very quickly, and the chance of a bit error due to varying voltage is reduced. Hence each SLC cell can be programmed with a fairly low voltage or erased easily. This increases the endurance of the cell and hence the program/erase cycles.

SLC flash is generally used in commercial and industrial applications and embedded systems that require high performance and long-term reliability. SLC uses a high-grade of flash media which provides good performance and endurance, but the trade-off is its high price.  SLC flash is typically more than twice the price of multi-level cell (MLC) flash.

Multi-level cell (MLC), on the other hand, uses more states or levels of the cell than just “0” and “1”. Using the 5V example above, we can break the voltage into four levels: 0V-1.25V as 00 (level 1), 1.25V-2.50V as 01 (level 2), 2.50V-3.75V as 10 (level 3) and 3.75V-5V as 11 (level 4). Hence a more precise voltage has to be measured. This increased density gives MLC a lower cost per bit stored, but also creates a higher probability of bit errors due to the very precise voltages involved.

Therefore, reads, writes and erases take much longer in MLC than in SLC, as the voltage now has to be measured much more precisely. Program and erase cycles decrease too, thus decreasing the lifetime of the cell.

Triple-level cell (TLC) takes it a step further and stores three bits per cell, or eight voltage states (000, 001, 010, 011, 100, 101, 110, and 111). Using 4V as an example to make it easy to understand: 0V-0.5V as 000 (level 1), 0.5V-1V as 001 (level 2), 1V-1.5V as 010 (level 3), 1.5V-2V as 100 (level 4), 2V-2.5V as 011 (level 5), 2.5V-3V as 101 (level 6), 3V-3.5V as 110 (level 7), 3.5V-4V as 111 (level 8).

From the above example you can see how precise the measurement of voltage has to become, which increases the time for reads, writes and erases. The same die as SLC or MLC becomes denser, but the wear tolerance and endurance of the cell drop a lot, decreasing its program and erase cycles.

TLC is targeted at environments with predominantly read workloads and has not been commonly used.
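The whole SLC/MLC/TLC distinction boils down to how finely one voltage range is binned, which a few lines of Python can show. The thresholds below simply reuse this post’s example numbers (5V full scale for SLC/MLC, 4V for TLC, with the level-to-bits mapping given above); real chips use calibrated reference voltages, not these illustrative figures.

```python
import bisect

def read_cell(voltage, thresholds, labels):
    # Reading a cell is comparing its voltage against thresholds and
    # returning the bit pattern of the bin it lands in.
    return labels[bisect.bisect_left(thresholds, voltage)]

slc = read_cell(3.1, [2.5], ["0", "1"])
mlc = read_cell(3.1, [1.25, 2.50, 3.75], ["00", "01", "10", "11"])
tlc = read_cell(3.1, [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5],
                ["000", "001", "010", "100", "011", "101", "110", "111"])
print(slc, mlc, tlc)   # 1 10 110 -- the same voltage, in ever finer bins
```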

You might be wondering why I haven’t touched eMLC till now and moved on to explain TLC. I kept eMLC for last because I want to take some time to explain in detail what eMLC is (a lot of vendors in the market are trying to convince customers that they need eMLC and that MLC just doesn’t cut it) and how vendors like IBM, Violin, etc. today have built technologies around MLC and made it equivalent to eMLC.

Enterprise Multi Level Cell (eMLC)

For a long time MLC was never considered for enterprise applications, as MLC on its own can take only 3,000 to 10,000 program and erase cycles, which limits the endurance and reliability of an MLC flash chip. Customers used to depend on SLC for enterprise applications; MLC was used only in consumer devices, such as cameras, smartphones, media players and USB sticks.

SLC being very expensive, corporations found it difficult to adopt even though it offers higher (100,000) program and erase cycles. Hence, vendors in the market started looking for a middle way between MLC and SLC.

To address the reliability issues of MLC, NAND flash manufacturers created a grade of MLC called eMLC. In eMLC they decreased the density of the data that can be written to the cell (i.e. increased the voltage difference between the states) and slowed down the speed with which data is written (or programmed) to the flash device, increasing the program and erase cycles of MLC by 3x and thus increasing the endurance of the chip.

Following are the benefits achieved by this:

  • Decrease in bit error rate (the wider margin separating the states lowers the error count).
  • MLC program & erase (P/E) cycles have increased to 30,000.
  • Lower cost than SLC.

Following are the cons that came with this:

  • By decreasing the density of the data that can be written to a cell, they have also increased the number of cells to be written for the same amount of data compared to MLC.
  • The decrease in write speed leads to a decrease in the performance of the chip (which comes at a high cost).

[Table: eMLC vs MLC test results]

Data taken from tests run on eMLC and MLC by “SSD guy”

The write parameters in the table above show how different these two technologies are:

  • eMLC 4K write IOPS are only 75% of the write IOPS of the MLC version.
  • eMLC sequential write speed is 74% as fast as MLC.
  • The 70/30 read/write IOPS test (similar to a standard workload) has a lot of reads. Since the read speed of eMLC is equal to that of MLC, the speed gap in this test is smaller: the eMLC SSD is 85% as fast as the MLC SSD.
  • In the case of write latency, MLC has only 83% of the latency of the eMLC version.

In short, while eMLC gives you a 3x increase in endurance over an MLC chip, your performance drops by 15-25%.

Also, as eMLC drives are sold 100x less often, they are tested less in the field and, in turn, contain a much higher frequency of bugs in their firmware. MLC, on the other hand, is found in millions of consumer devices, and manufacturers rigorously test MLC drives to avoid early failures and widespread issues. In a few tests conducted by industry-leading MLC vendors, failure rates on eMLC were found to be 10x worse than on MLC due to firmware issues.

In the present era, storage vendors like IBM, Violin Memory, etc. have taken MLC chips and improved the program & erase cycles (or endurance) of MLC itself by up to 9x through over-provisioning, more intelligent controllers and proprietary error correction code algorithms.

Hence, I would suggest that my readers understand and do thorough research on the features below (and how they would work with their applications) for any enterprise flash device they purchase, rather than relying only on the superficial knowledge given by storage vendors in the market.

  • Program Erase Cycles
  • Wear Leveling
  • Garbage Collection
  • Write Amplification
  • Error Correction Code Algorithms
  • Bit Error Rate.

Having done a marathon job of explaining the differences between SLC, MLC, eMLC and TLC, I want to end this post with a note that I will explore the above-mentioned topics in my next posts to help you understand flash even further.

References: Toshiba, Micron, TechTarget

Understanding flash chip at its core

When I started researching flash, I delved into the technology so deeply that I felt it would be difficult for my readers to understand the SLC, MLC, eMLC, etc. of flash technology without first explaining the basics of flash chip construction and the physics behind it. Hence I have decided to write a bit on the basics first, and then jump into the intended topics in more detail.

Flash Architecture

Flash is a type of non-volatile, solid-state storage technology. In enterprise applications, multiple Flash chips are used together to produce modules in the form of external rack mount systems or internal cards or drives.

A flash memory chip is divided into multiple nested entities, as below.

[Figure: the nested structure of a flash chip – chip, die, plane, block, page]

  • The flash chip is the black box or rectangle you see in every picture online. If you look at an SSD, a flash card or the internals of a flash array, you will see many flash chips, each of which is produced by one of the big flash manufacturers like Toshiba, Samsung, Micron, Intel, SanDisk, etc.
  • Each flash chip contains eight dies. The die is the smallest unit that can independently execute commands or report status.
  • Each die contains two planes. Identical, concurrent operations can take place on each plane, although with some restrictions.
  • Each plane contains 2048 blocks, which are the smallest unit that can be erased.
  • Each block contains 64 pages, which are the smallest unit that can be programmed (i.e. written to) and the level at which error correction code algorithms are applied. A page is a collection of cells (which I will talk about when explaining how flash works).

Write operations take place on a page and are typically 8-16KB in size, while erase operations take place on a block and are 4-8MB in size.
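For a feel for the numbers, here is the capacity arithmetic for the example hierarchy above, assuming 8KB pages (geometry varies widely between generations and vendors):

```python
# chip = dies x planes x blocks x pages; capacity follows directly.
dies, planes, blocks_per_plane, pages_per_block, page_kb = 8, 2, 2048, 64, 8

total_pages = dies * planes * blocks_per_plane * pages_per_block
print(f"{total_pages:,} pages -> {total_pages * page_kb / 2**20:.0f} GiB per chip")
# 2,097,152 pages -> 16 GiB per chip
```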

How does a flash chip function?

This is a bit complicated for those from a non-physics background, so I shall simplify and explain its working in layman’s terms.

[Figure: a MOSFET – source, drain, control gate and oxide insulator]

In the picture above, imagine the source as the starting point of the flow of electrons and the drain as the destination. The control gate is where the charge is applied to make the semiconductor move electrons from source to drain. You will also see that there is an insulator, which is nothing but an oxide layer preventing the control gate from directly attaching to the source or drain. A transistor working this way is called a MOSFET (Metal Oxide Semiconductor Field Effect Transistor).

In the case of a flash cell, what you find is an FGMOSFET (Floating Gate Metal Oxide Semiconductor Field Effect Transistor).

[Figure: an FGMOSFET – a floating gate and thin tunnel oxide layer added between the control gate and the semiconductor]

If you compare the two pictures above, you will see an additional gate, called the floating gate (so called as it is completely isolated), in the flash cell between the control gate and the conductor. You will also notice an additional oxide layer, called the tunnel oxide layer, which is thinner than the blocking oxide layer.

How the floating gate functions is what tells us how a flash cell works. When you apply a high charge at the control gate, electrons flowing from source to drain tunnel or jump (this is called tunneling) through the tunnel oxide layer into the floating gate, which retains the charge; this is called the programmed state of the flash cell.

To erase the charge stored on the floating gate, a high voltage is applied from source to drain and a negative voltage is applied to the control gate, which makes the electrons stored on the floating gate tunnel back along their original path.

With electrons held in the floating gate (the programmed state), the control gate has to apply a higher charge to make the semiconductor conduct.

Having understood how a programming/write operation and an erase operation happen, it is also important for us to understand how a read operation happens.

For a read operation, a voltage (VT) is applied that is intermediate between the threshold voltage of the programmed state (VT0) and the voltage (VT1) the control gate applies to make the semiconductor conduct. If there are no electrons in the floating gate, the semiconductor conducts and returns a logical value “1”, but if there are electrons present in the floating gate, it does not conduct and returns a logical value “0”. This varies between different cell types (SLC, MLC, etc.), which I will explain in further posts.

Having explained this complicated story of how a flash chip works, there are a few points I want you to remember, which will form the basis of my further posts as I explain the differences between the various cells and how they are used.

  • In an FGMOSFET the tunnel oxide layer which isolates the floating gate from the semiconductor is designed to be thin enough to allow tunneling of electrons when a high enough charge is applied, but this process gradually damages the layer.
  • Reads are not a problem because only lower voltages are used and no electron tunneling takes place.
  • In the case of program and erase operations it’s a different story, which is why wear is measured by the number of program/erase cycles.
  • As the layer gets more damaged, the isolation of the floating gate is increasingly affected and the probability of electrons leaking out will increase.

To conclude, in this article I have tried to explain how a flash chip works in the simplest of terms. Having covered the basics, the next chapter will embark on explaining the different cells, and how their design is modified to achieve specific purposes (and operations).

References: Micron, SanDisk