Introduction to Deduplication

According to wikipedia, “Data deduplication is a specific form of compression where redundant data is eliminated, typically to improve storage utilization. In the deduplication process, duplicate data is deleted, leaving only one copy of the data to be stored. However, indexing of all data is still retained should that data ever be required. Deduplication is able to reduce the required storage capacity since only the unique data is stored.

Methods For DedupLication Algorithm:

  • File-level Deduplication
  • Block-level Deduplication

File-level deduplication watches for multiple copies of the same file, stores the first copy, and then just links the other references to the first file. Only one copy gets stored on the disk/tape archive. Ultimately, the space you save on disk relates to how many copies of the file there were in the file system.

Lets assume a company having a 1000 employee share a common file say “data.txt” which is 10MB in Size. Each employee does the same changes and save the exact similar 1000 copies of file on server. So the estimated storage require to save a file on server side is 10 GB.

If all the files are identical then there is no point in uploading all of them on the server. Just save a single copy on the server and put pointers in a users folder that points to a single copy on server. This is how Data Deduplication technique used to save the TB’s of storage.

Block-level Deduplication, sometimes called variable block-level deduplication, looks at the data block itself to see if another copy of this block already exists. If so, the second (and subsequent) copies are not stored on the disk/tape, but a link/pointer is created to point to the original copy.

Let’s assume we have three users each having 4 data blocks. The green and Gray blocks are common in three users and are backed up in data center. The blue, red and purple blocks are common between two users and are backed up in Data center. Here the Important point is the memory required for block data in data storage is very less which is equal to size of(block) in memory.

Block level deduplication is always efficient than file-level deduplication because in file level one has to dump whole file in data storage if its version changed where as in block level you have to dump the changed block which takes comparatively less space in data storage.

Implemention Of Deduplication Algorithm

1. Start
2. Declare Variable
3. Initialize variable
4. Read 1024 bytes from file in tone iteration
5. Read from file until reach EOF
   5.1 Generate Hash Value from strBuff[BLOCKSIZE]
   5.2 if (FirstBlock)
          Consider node as root element
          Inc BlockCtr
          search the generated Hash in BST
          if (Find Hash == True)
             Compute the Node
             Add the Node to a linked List
             Change the EndLink of SLL
             Add the node in BST
             Inc The BLockCounter
6. Calculate Deduplication Ratio
7. Display the Result for each iteration
8. END

We have a diagrammatic representation which includes BST and SLL that we are created during parsing of file at block level

Here we are using two Important structures.

Binary Search Tree Structure

typedef struct treeNode
     TCHAR  tszHash[MAX];
     int iBlockNo;
     struct Sll *structSlink;
     struct Sll *structLlink;
     struct treeNode *left;
     struct treeNode *right;
  1. Hash <- This filled is used to store the hash generated by CRC64 bit hash algorithm
  2. iBlockno <- each block is having a specific block no.repeated blocks are points to that particular block no.
  3. Slink <- it is start link pointer that contains the address of first node of single linked list
  4. Elink <- it is start link pointer that contains the address of last node of single linked list
  5. left <- it is a pointer that store the address of left child
  6. right <- it is a pointer that store the address of right child

Single Linked List Structure

struct Sll
     int iRepBlockNo;
     struct Sll *pLink;
  1. irepBlockNo <- Counter variable that keeps track for no of blocks
  2. pLink <- Pointer that holds the address of next nod

Core Logic Loop

  1. As the loop runs it reads 256 bytes of data in one iteration and generate the hash for the same .
  2. Search() procedure ensures that entry is not already present in BST
  3. If the hash already exist
  • Compute the node
  • Add the node to Single linked list
  • change the slink and elink pointers of tree node.

4. if hash doesn’t exist

  • Compute the node
  • Use hash as a key for comparison
  • Add the node in BST

5.Calculate deduplication percentage as a measurement also known as Deduplication ratio


    mbstowcs(wcstring, tszBuff,sizeof(tszBuff));
    wstring strTempHash = HashString(wcstring).c_str();
    if(iIncCtr == 0)
         root = Insert(root, strTempHash.c_str(), iBlockCtr, &amp;ppTempStore)
         pTemp = Find(root,tszTempHash.c_str());
         if(pTemp == NULL)
             root = Insert(root,tszTempHash.c_str(),iBlockCtr,&amp;ppTempStore);
            pNwLink = (Sll*)malloc(sizeof(Sll));
            if(pTemp-&gt;structLlink == NULL &amp;&amp; pTemp-&gt;structSlink == NULL)
  • Code is having a Time complexity of O(nlogn)

As it shows the output of Block Level Deduplication on 50MB of File

Links And Abbreviation

  1. SLL <- Singly Linked List
  2. BST <- Binary Search Tree
  3. Dedup<- Deduplication

To know more email: