Every day, 2.5 quintillion bytes of data are created. In fact, 90% of the data in the world today was created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, digital photos, posts to social media, videos posted online, transaction records of online purchases, and cell phone GPS signals, to name a few. This data is “Big Data”.
Big Data spans three dimensions:
- Variety – Big Data extends beyond structured data, including unstructured data of all varieties: text, audio, video, click streams, log files, and more.
- Velocity – Often time-sensitive, Big Data must be used as it is streaming into an enterprise in order to maximize its value to the business.
- Volume – Big Data comes in one size: large. Enterprises are awash with data, easily amassing terabytes and even petabytes of information.
As Big Data accumulates across these dimensions, it aggregates a great deal of redundant data. Data deduplication is an effective way to eliminate this redundancy: a de-duplication system identifies and removes duplicate blocks of data, significantly reducing physical storage requirements.
Figure: Illustration of a typical de-duplication system functions
- File-level deduplication
Commonly referred to as Single-Instance Storage (SIS), file-level data deduplication compares a file to be backed up or archived with those already stored by checking its attributes against an index. If the file is unique, it is stored and the index is updated; if not, only a pointer to the existing file is stored. As a result, only one instance of a file is saved, and subsequent copies are replaced with a "stub" that points to the original file.
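The index-lookup-and-stub flow above can be sketched in Python. This is a minimal illustration, not a production SIS implementation: it uses a SHA-256 content digest as the index key (real systems may also compare file attributes), and the names `file_digest` and `dedup_store` are hypothetical.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file's contents."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def dedup_store(paths):
    """File-level dedup: keep one instance per unique file, stubs for the rest.

    Returns (index, stubs): index maps digest -> stored path;
    stubs maps each duplicate path -> the original it points to.
    """
    index = {}
    stubs = {}
    for path in paths:
        digest = file_digest(path)
        if digest in index:
            stubs[path] = index[digest]  # duplicate: store only a pointer ("stub")
        else:
            index[digest] = path         # unique: store the file, update the index
    return index, stubs
```

Given two identical files and one distinct file, `dedup_store` records two stored instances and one stub pointing back at the first copy.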
- Block-level deduplication
Block-level data deduplication operates at the sub-file level. As its name implies, the file is broken down into segments -- chunks or blocks -- that are examined for redundancy against previously stored information.
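A minimal sketch of block-level deduplication, assuming fixed-size chunks and SHA-256 digests (many real systems use variable-size, content-defined chunking instead). The function names `dedup_blocks` and `reassemble` are hypothetical; the returned "recipe" plays the role of the master index entries that later drive reassembly.

```python
import hashlib

CHUNK_SIZE = 4096  # fixed block size for this sketch


def dedup_blocks(data, store):
    """Split data into fixed-size blocks and keep only unique blocks.

    `store` maps block digest -> block bytes and persists across calls,
    so identical blocks from any file are stored exactly once.
    Returns the file's recipe: the ordered list of block digests.
    """
    recipe = []
    for i in range(0, len(data), CHUNK_SIZE):
        block = data[i:i + CHUNK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        if digest not in store:   # unique block: store it
            store[digest] = block
        recipe.append(digest)     # always record the pointer
    return recipe


def reassemble(recipe, store):
    """Rebuild the original data from its recipe (the reassembly step)."""
    return b"".join(store[d] for d in recipe)
```

Backing up a slightly changed version of a file then adds only the changed blocks to the store, which is exactly the space saving block-level deduplication is after.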
Pros and cons of file-level and block-level data de-duplication:

| Sr. No. | File Level deduplication | Block Level deduplication |
|---------|--------------------------|---------------------------|
| 1 | Saves the entire file a second time when any part of it changes | Saves only the changed blocks between one version of the file and the next |
| 2 | Indexes are significantly smaller, which takes less computational time when duplicates are being determined | Indexes are larger, so it takes more computational time when duplicates are being determined |
| 3 | Backup performance is less affected by the deduplication process | Backup performance is significantly affected by the deduplication process |
| 4 | Requires less processing power due to the smaller index and reduced number of comparisons | Requires more processing power due to the larger index and higher number of comparisons |
| 5 | Stores unique files and pointers to existing unique files, so there is less to reassemble | Requires "reassembly" of the chunks based on the master index |
Data can be de-duplicated at target or source:
- Target Based Data De-duplication
Target-based deduplication acts on the target data storage media. In this case the client server is unmodified and unaware of any deduplication. The deduplication engine can be embedded in the storage array, which can then be used as a NAS/SAN device with deduplication capabilities. Alternatively, it can be offered as an independent software or hardware appliance that acts as an intermediary between the backup server and the storage arrays.
- Source Based Data De-duplication
By contrast, source-based deduplication acts on the data at the source, before it is moved. A deduplication-aware backup agent is installed on the server and backs up only unique data. The result is improved bandwidth and storage utilization, but this imposes additional computational load on the backup client.
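The source-side bandwidth saving can be sketched as follows. This is a simplified model, not a real backup protocol: `server_has` stands in for a hypothetical round-trip that asks the backup target which digests it already stores, and only blocks it lacks are transmitted.

```python
import hashlib


def source_side_backup(blocks, server_has):
    """Client-side dedup sketch: transmit only blocks the server lacks.

    `blocks` is an iterable of raw block bytes; `server_has(digest)` is a
    stand-in for querying the backup server's index. The hashing work
    happens on the client, which is the extra computational load
    source-based deduplication imposes.
    """
    sent = []
    for block in blocks:
        digest = hashlib.sha256(block).hexdigest()
        if not server_has(digest):
            sent.append((digest, block))  # new data: send digest + payload
        # else: send nothing; the server just records another reference
    return sent
```

If the server already holds most of the blocks, the agent ships only the few new ones, which is where the bandwidth improvement comes from.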
Calsoft has helped ISVs in developing data deduplication solutions that protect a full range of environments — from small distributed offices to the largest enterprise data centers. Contact Calsoft today to solve your Big Data related challenges.