Deduplication Algorithm

Introduction to Deduplication

According to Wikipedia, “Data deduplication is a specific form of compression where redundant data is eliminated, typically to improve storage utilization. In the deduplication process, duplicate data is deleted, leaving only one copy of the data to be stored. However, indexing of all data is still retained should that data ever be required. Deduplication is able to reduce the required storage capacity since only the unique data is stored.”

Methods for Deduplication:

  • File-level Deduplication
  • Block-level Deduplication

File-level deduplication watches for multiple copies of the same file, stores the first copy, and then just links the other references to the first file. Only one copy gets stored on the disk/tape archive. Ultimately, the space you save on disk relates to how many copies of the file there were in the file system.

Let's assume a company has 1000 employees who share a common file, say “data.txt”, which is 10 MB in size. Each employee makes the same changes and saves an identical copy of the file on the server, giving 1000 copies. The estimated storage required to hold these files on the server is therefore about 10 GB (1000 × 10 MB).

If all the files are identical, there is no point in uploading all of them to the server. Just save a single copy on the server and put pointers in each user's folder that point to that single copy. This is how data deduplication is used to save terabytes of storage.
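
The file-level idea above can be sketched in C++ as follows. This is a minimal illustration under stated assumptions: the class name FileStore, its methods, and the choice to use the file's full contents as the lookup key are hypothetical and not part of the original implementation. A real system would more likely key on a strong content hash than on the raw bytes.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical file-level deduplicating store (illustration only).
class FileStore {
public:
    // Saves `content` under `userPath`; identical content is stored once
    // and every further path only receives a pointer (an index) to it.
    void Save(const std::string& userPath, const std::string& content) {
        std::size_t id;
        auto it = idByContent_.find(content);
        if (it == idByContent_.end()) {
            id = blobs_.size();          // first copy: store it
            blobs_.push_back(content);
            idByContent_.emplace(content, id);
        } else {
            id = it->second;             // duplicate: reuse the stored copy
        }
        pointerByPath_[userPath] = id;
    }

    // Returns the content a user path points to.
    const std::string& Read(const std::string& userPath) const {
        return blobs_.at(pointerByPath_.at(userPath));
    }

    // Bytes physically stored (unique copies only).
    std::size_t StoredBytes() const {
        std::size_t total = 0;
        for (const auto& blob : blobs_) total += blob.size();
        return total;
    }

    // Number of user-visible files (pointers).
    std::size_t FileCount() const { return pointerByPath_.size(); }

private:
    std::vector<std::string> blobs_;                    // unique file contents
    std::map<std::string, std::size_t> idByContent_;    // content -> blob index
    std::map<std::string, std::size_t> pointerByPath_;  // user path -> blob index
};
```

If 1000 users call Save() with the same 10 MB content under different paths, FileCount() reports 1000 while StoredBytes() stays at 10 MB.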

Block-level Deduplication, sometimes called variable block-level deduplication, looks at the data block itself to see if another copy of this block already exists. If so, the second (and subsequent) copies are not stored on the disk/tape, but a link/pointer is created to point to the original copy.

Let’s assume we have three users, each having 4 data blocks. The green and gray blocks are common to all three users and are backed up in the data center. The blue, red and purple blocks are common between two users and are also backed up in the data center. The important point here is that each shared block is stored only once, so the space it needs in data storage is very small: equal to the size of a single block.

Block-level deduplication is generally more space-efficient than file-level deduplication. With file-level deduplication the whole file has to be stored again whenever its version changes, whereas with block-level deduplication only the changed blocks have to be stored, which takes comparatively less space in data storage.
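
The difference can be made concrete with a small C++ sketch. The class BlockStore and its AddVersion() method are hypothetical names used only for illustration, and the sketch assumes fixed-size blocks and compares raw block contents instead of hashes.

```cpp
#include <cstddef>
#include <set>
#include <string>

// Hypothetical block store that keeps each distinct block once.
class BlockStore {
public:
    // blockSize must be greater than zero.
    explicit BlockStore(std::size_t blockSize) : blockSize_(blockSize) {}

    // Adds one version of a file and returns the number of bytes that
    // actually had to be written, i.e. the bytes of previously unseen blocks.
    std::size_t AddVersion(const std::string& data) {
        std::size_t written = 0;
        for (std::size_t off = 0; off < data.size(); off += blockSize_) {
            // substr() clamps the length, so the last block may be shorter.
            std::string block = data.substr(off, blockSize_);
            if (blocks_.insert(block).second) {  // true only for new blocks
                written += block.size();
            }
        }
        return written;
    }

private:
    std::size_t blockSize_;
    std::set<std::string> blocks_;  // all unique blocks seen so far
};
```

With a block size of 4, storing "AAAABBBBCCCCDDDD" writes 16 bytes, and storing the changed version "AAAABBBBXXXXDDDD" afterwards writes only the 4 bytes of the new block. A file-level scheme would have to store all 16 bytes of the second version again.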

Implementation of Deduplication Algorithm

1. Start
2. Declare Variable
3. Initialize variable
4. Read 1024 bytes from file in one iteration
5. Read from file until reach EOF
   5.1 Generate Hash Value from strBuff[BLOCKSIZE]
   5.2 if (FirstBlock)
          Consider node as root element
          Inc BlockCtr
       else
          search the generated Hash in BST
          if (Find Hash == True)
             Compute the Node
             Add the Node to a linked List
             Change the EndLink of SLL
          else
             Add the node in BST
             Inc BlockCtr
6. Calculate Deduplication Ratio
7. Display the Result for each iteration
8. END

The diagrammatic representation includes the BST and the SLLs that are created while parsing the file at block level.

Here we use two important structures.

Binary Search Tree Structure

typedef struct treeNode
{
     TCHAR  tszHash[MAX];
     int iBlockNo;
     struct Sll *structSlink;
     struct Sll *structLlink;
     struct treeNode *left;
     struct treeNode *right;
}treeNode;
  1. tszHash <- This field stores the hash generated by the 64-bit CRC (CRC64) hash algorithm
  2. iBlockNo <- Each block has a specific block number; repeated blocks point to that block number
  3. structSlink <- Start link pointer that contains the address of the first node of the singly linked list
  4. structLlink <- End link pointer that contains the address of the last node of the singly linked list
  5. left <- Pointer that stores the address of the left child
  6. right <- Pointer that stores the address of the right child
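
The article does not show the search and insert routines, so the following C++ sketch is only an assumption of how a BST keyed by the hash string might work. Node is a simplified stand-in for treeNode (the linked-list pointers are omitted), and FindNode(), InsertNode() and FreeTree() are hypothetical names, not the original functions.

```cpp
#include <string>

// Simplified stand-in for treeNode; the SLL pointers are left out.
struct Node {
    std::wstring hash;  // key used for ordering
    int blockNo;        // number of the first block that produced this hash
    Node* left;
    Node* right;
    Node(const std::wstring& h, int b)
        : hash(h), blockNo(b), left(nullptr), right(nullptr) {}
};

// Returns the node holding `hash`, or nullptr if the hash is not in the tree.
Node* FindNode(Node* root, const std::wstring& hash) {
    while (root != nullptr && root->hash != hash) {
        root = (hash < root->hash) ? root->left : root->right;
    }
    return root;
}

// Inserts a new hash and returns the root of the updated subtree.
// An existing hash leaves the tree unchanged.
Node* InsertNode(Node* root, const std::wstring& hash, int blockNo) {
    if (root == nullptr) {
        return new Node(hash, blockNo);
    }
    if (hash < root->hash) {
        root->left = InsertNode(root->left, hash, blockNo);
    } else if (root->hash < hash) {
        root->right = InsertNode(root->right, hash, blockNo);
    }
    return root;
}

// Releases every node of the tree.
void FreeTree(Node* root) {
    if (root == nullptr) return;
    FreeTree(root->left);
    FreeTree(root->right);
    delete root;
}
```

This plain BST does no rebalancing, so its shape depends on the order in which hashes arrive.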

Singly Linked List Structure

struct Sll
{
     int iRepBlockNo;
     struct Sll *pLink;
};
  1. iRepBlockNo <- Holds the counter value that identifies the repeated block, i.e. its block number
  2. pLink <- Pointer that holds the address of the next node
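
Keeping both a start and an end pointer lets a repeated block be appended in constant time. The C++ sketch below illustrates this. RepList, Append(), Length() and Clear() are hypothetical helpers introduced here. In the original structures the two pointers live inside treeNode as structSlink and structLlink.

```cpp
#include <cstddef>

struct Sll {
    int iRepBlockNo;
    Sll* pLink;
};

// Hypothetical holder for the two list pointers kept in each tree node.
struct RepList {
    Sll* head;  // plays the role of structSlink
    Sll* tail;  // plays the role of structLlink
};

// Appends a repeated block number in O(1) by using the tail pointer.
void Append(RepList& list, int blockNo) {
    Sll* node = new Sll{blockNo, nullptr};
    if (list.head == nullptr) {
        list.head = node;         // first repeat: node is also the head
    } else {
        list.tail->pLink = node;  // link the old tail to the new node
    }
    list.tail = node;             // the tail always moves to the newest node
}

// Counts the nodes in the list.
std::size_t Length(const RepList& list) {
    std::size_t n = 0;
    for (const Sll* p = list.head; p != nullptr; p = p->pLink) ++n;
    return n;
}

// Frees all nodes and resets both pointers.
void Clear(RepList& list) {
    while (list.head != nullptr) {
        Sll* next = list.head->pLink;
        delete list.head;
        list.head = next;
    }
    list.tail = nullptr;
}
```

Without the step that moves the tail forward, a third repeat would overwrite the link to the second one.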

Core Logic Loop

  1. As the loop runs, it reads one block of data (BLOCKSIZE bytes) in one iteration and generates the hash for that block.
  2. The Search() procedure (Find() in the code below) checks whether the entry is already present in the BST.
  3. If the hash already exists:
     • Compute the node
     • Add the node to the singly linked list
     • Change the Slink and Elink pointers of the tree node
  4. If the hash doesn’t exist:
     • Compute the node
     • Use the hash as the key for comparison
     • Add the node to the BST
  5. Calculate the deduplication percentage as a measurement, also known as the deduplication ratio.
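
The article does not give the formula for this measurement. The C++ sketch below therefore assumes two commonly used definitions: the ratio as total blocks read divided by unique blocks stored, and the percentage as the share of blocks that did not have to be stored. The function names DedupRatio() and DedupPercentage() are hypothetical.

```cpp
#include <cstddef>

// Assumed definition: total blocks read divided by unique blocks stored.
double DedupRatio(std::size_t totalBlocks, std::size_t uniqueBlocks) {
    if (uniqueBlocks == 0) return 0.0;  // nothing stored, ratio undefined
    return static_cast<double>(totalBlocks) /
           static_cast<double>(uniqueBlocks);
}

// Assumed definition: percentage of blocks that were duplicates
// and therefore did not need to be stored.
double DedupPercentage(std::size_t totalBlocks, std::size_t uniqueBlocks) {
    if (totalBlocks == 0 || uniqueBlocks > totalBlocks) return 0.0;
    return 100.0 * static_cast<double>(totalBlocks - uniqueBlocks) /
           static_cast<double>(totalBlocks);
}
```

For example, 1000 blocks read with 250 unique blocks stored gives a ratio of 4 (often written 4:1) and a saving of 75 percent under these definitions.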

Syntax

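// Note: read() fails on a trailing partial block, so that block is skipped.
while (is.read(strBuff, BLOCKSIZE))
{
    // Convert the block to a wide string; mbstowcs stops at the first NUL byte.
    mbstowcs(wcstring, strBuff, sizeof(strBuff));
    wstring strTempHash = HashString(wcstring).c_str();
    if (iIncCtr == 0)
    {
        // The first block becomes the root of the BST.
        root = Insert(root, strTempHash.c_str(), iBlockCtr, &ppTempStore);
        iBlockCtr++;   // count the unique block (step 5.2 of the algorithm)
    }
    else
    {
        pTemp = Find(root, strTempHash.c_str());
        if (pTemp == NULL)
        {
            // Unseen hash: store it as a new unique block.
            root = Insert(root, strTempHash.c_str(), iBlockCtr, &ppTempStore);
            iBlockCtr++;
        }
        else
        {
            // Duplicate block: record its number in the node's linked list.
            pNwLink = (Sll*)malloc(sizeof(Sll));
            pNwLink->pLink = NULL;
            pNwLink->iRepBlockNo = iIncCtr;
            if (pTemp->structLlink == NULL && pTemp->structSlink == NULL)
            {
                pTemp->structSlink = pNwLink;
                pTemp->structLlink = pNwLink;
            }
            else
            {
                pTemp->structLlink->pLink = pNwLink;
                pTemp->structLlink = pNwLink;   // advance the end link to the new tail
            }
        }
    }
    iIncCtr++;
}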
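  • The code has an average time complexity of O(n log n) for n blocks. This assumes the BST stays reasonably balanced; in a degenerate tree each search takes O(n).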

The output shows the result of block-level deduplication on a 50 MB file.

Abbreviations

  1. SLL <- Singly Linked List
  2. BST <- Binary Search Tree
  3. Dedup <- Deduplication

To know more email: marketing@calsoftinc.com

 