{"id":516130,"date":"2024-06-05T16:16:48","date_gmt":"2024-06-05T20:16:48","guid":{"rendered":"https:\/\/www.commvault.com\/?post_type=cmv_glossary&p=516130"},"modified":"2024-06-05T16:16:50","modified_gmt":"2024-06-05T20:16:50","slug":"data-deduplication","status":"publish","type":"cmv_glossary","link":"https:\/\/www.commvault.com\/glossary-library\/data-deduplication","title":{"rendered":"Data Deduplication"},"content":{"rendered":"\n
Data deduplication, also called "dedup," eliminates duplicate or redundant information in a dataset. In short, dedup is a process that ensures only one copy of a given piece of data exists in a dataset.
The process improves storage capacity utilization and removes redundancy without compromising the fidelity or integrity of the data. References, or pointers, to the single saved copy replace the deleted redundant information. Very often, data deduplication is coupled with data compression for greater storage savings.
There are two ways to classify deduplication based on where it takes place. The first is source-side deduplication, in which dedup occurs at the source, where the data originates. The second is target-side deduplication, in which dedup occurs on the target storage where the data is written.
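To make the distinction concrete, here is a minimal Python sketch of the source-side approach, in which an in-memory dictionary stands in for the target store (the names target_store and source_side_backup are purely illustrative): the source fingerprints its blocks first and transmits only the blocks the target does not already hold, which is also why source-side dedup saves network bandwidth.

```python
import hashlib

# Hypothetical in-memory "target" store: fingerprint -> block contents.
target_store: dict[str, bytes] = {}

def source_side_backup(blocks: list[bytes]) -> None:
    """The source hashes blocks locally and sends only unknown ones."""
    for block in blocks:
        digest = hashlib.sha256(block).hexdigest()
        if digest not in target_store:    # ask the target before sending
            target_store[digest] = block  # only new data crosses the wire

source_side_backup([b"org chart", b"org chart", b"budget"])
print(len(target_store))  # 2 unique blocks stored, not 3
```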
Data deduplication is the process of removing duplicate copies of data to optimize storage resources and enhance their performance. By eliminating redundant information, the system frees storage space and reduces the size of datasets. These changes lower the cost of storage and improve the performance of applications that can carry out their tasks using smaller sets of data.
Data deduplication typically makes the most sense when applied to secondary storage locations holding backup data. The secondary storage repositories used for backing up data tend to have high duplication rates that make dedup worthwhile. Primary storage used for production environments, on the other hand, prioritizes performance over other factors and may forgo dedup.
When deploying deduplication, it is important to be aware that some relational databases, such as Oracle and Microsoft SQL Server, do not benefit much from dedup. Such databases often keep a unique key in each record, preventing deduplication engines from recognizing otherwise identical records as duplicates.
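A toy illustration of why this happens (the record layout below is hypothetical): two rows that are identical except for their unique key produce different fingerprints, so a dedup engine that compares fingerprints cannot treat them as duplicates.

```python
import hashlib

# Two database records, identical except for their unique primary key.
row_a = b"id=1001|name=Smith|dept=Sales"
row_b = b"id=1002|name=Smith|dept=Sales"

# Different keys yield different fingerprints: no dedup opportunity.
print(hashlib.sha256(row_a).hexdigest() == hashlib.sha256(row_b).hexdigest())  # False
```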
Let us look at a common example.
The CEO of a company sends an email to all 100 employees with an updated organization chart as an attachment. All 100 employees receive the same file and save it to their desktops so they can refer to it later. When the backup runs later that night, it encounters all 100 copies of that file, but because of deduplication only one actual copy is saved, along with 99 references, or pointers, to the original.
In this example, we saw a deduplication ratio of 100 to 1. When paired with data compression, only the saved instances of the data are compressed, further enhancing storage capacity through the efficient encoding of data.
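The arithmetic behind such a ratio is simple; here is a quick sketch, with an assumed attachment size of 2MB chosen purely for illustration:

```python
copies = 100       # identical files encountered by the backup
file_size_mb = 2   # assumed size of the attachment (illustrative)

logical_mb = copies * file_size_mb  # 200 MB referenced by the backup
physical_mb = 1 * file_size_mb      # 2 MB actually written to storage
ratio = logical_mb / physical_mb    # 100.0, i.e. a 100:1 dedup ratio

print(f"dedup ratio = {ratio:.0f}:1, savings = {logical_mb - physical_mb} MB")
```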
Deduplication involves analyzing the data to identify unique blocks before storing them. When duplicates of an individual data block are encountered, the redundant copies are deleted and replaced with references, or pointers, to the saved unique data. The agent performing the deduplication assigns a unique identifier, typically a hash of the block's contents, to each stored block of data.
As part of the deduplication process, incoming blocks, or chunks, of data are compared by checking their identifiers against those already assigned to stored data. When an incoming chunk's identifier matches one already on record, the agent treats the chunk as a redundant copy and discards it rather than writing it again.
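A minimal sketch of this identifier-based check, assuming SHA-256 as the fingerprint (commercial products use their own signature schemes and on-disk formats):

```python
import hashlib

class DedupStore:
    """Content-addressable block store: each unique block is kept once."""

    def __init__(self) -> None:
        self.blocks: dict[str, bytes] = {}  # identifier -> unique block

    def write(self, block: bytes) -> str:
        digest = hashlib.sha256(block).hexdigest()
        if digest not in self.blocks:       # unseen identifier: store it
            self.blocks[digest] = block
        return digest                       # callers keep only this reference

store = DedupStore()
refs = [store.write(b) for b in [b"alpha", b"beta", b"alpha"]]
print(len(refs), "writes,", len(store.blocks), "blocks stored")  # 3 writes, 2 blocks stored
```

Because the identifier is derived from the block's contents, two identical blocks always produce the same identifier, which lets the store spot a duplicate without comparing the blocks byte for byte.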
Data deduplication can be performed either "in-line," as the data is being written, or "post-process," after the information has already been written to the storage devices.
Post-process deduplication runs in the background and follows prescribed optimization policies to identify files to optimize. The dedup agent breaks the target files into variable-size data chunks, then identifies and stores the unique ones under their assigned identifiers. Duplicate data chunks are removed and replaced with references to the identical saved data. Finally, the dedup agent can re-establish the original file stream from the optimized chunks.
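Variable-size chunking is commonly implemented with content-defined chunking, where a rolling hash over a small sliding window declares a chunk boundary whenever it matches a chosen bit pattern. The sketch below uses a simple rolling sum; the window size and boundary mask are illustrative, not any product's actual parameters.

```python
WINDOW = 16           # bytes in the sliding window (illustrative)
MASK = (1 << 6) - 1   # boundary when the low 6 bits are zero: ~64-byte average chunks

def chunk(data: bytes) -> list[bytes]:
    """Split data at content-defined boundaries using a rolling sum."""
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= WINDOW:
            rolling -= data[i - WINDOW]   # drop the byte leaving the window
        if i - start >= WINDOW and (rolling & MASK) == 0:
            chunks.append(data[start:i + 1])  # boundary found: emit a chunk
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])           # trailing partial chunk
    return chunks

print([len(c) for c in chunk(b"some repeated payload " * 50)])
```

Because boundaries depend only on the bytes inside the window, inserting data near the start of a file shifts boundaries only locally, so most later chunks still match previously stored ones. That is what makes variable-size chunking dedup-friendly.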
As the email example suggests, it is not uncommon (with certain data types) for deduplication ratios to reach 90:1 or higher, which magnifies the benefits of deduplication. Further storage efficiencies are realized by coupling deduplication with data compression, which shrinks the stored data and its storage footprint even more.
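A sketch of that coupling, using Python's standard zlib module to compress only the unique blocks that survive deduplication (the block contents are made up for the demonstration):

```python
import hashlib
import zlib

blocks = [b"report " * 100, b"report " * 100, b"notes " * 100]

compressed_store: dict[str, bytes] = {}
for block in blocks:
    digest = hashlib.sha256(block).hexdigest()
    if digest not in compressed_store:                    # dedup first...
        compressed_store[digest] = zlib.compress(block)   # ...then compress

raw = sum(len(b) for b in blocks)
stored = sum(len(c) for c in compressed_store.values())
print(f"{raw} logical bytes -> {stored} stored bytes")
```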
Data deduplication is a simple concept that contributes dramatically to reducing storage resources and the costs associated with them. When coupled with data compression, the savings increase, and the performance of storage devices and applications improves.
Unchecked exponential data growth, coupled with shrinking backup windows, data retention SLAs, and regulatory requirements, strains IT resources and constrains business growth. Data deduplication technology reduces physical disk capacity requirements while still meeting data retention requirements.
The amount of data created daily is staggering, and all of it must be optimized for better utilization of storage capacity and protected against loss. On average, in 2020, humans generated 1.7MB of data every second for every person on earth.[1] By 2025, it is estimated that 463 exabytes of data will be created each day.[2] According to Microsoft's estimates, general-purpose file servers may see typical space savings of up to 50%, while virtualization libraries might realize up to 95% space savings.[3] The savings vary depending on the dataset.
Unfortunately, much of the stored data consists of duplicates that increase customers' storage costs and lower cloud systems' performance and utilization. For example, file sharing among users results in many duplicate copies of the same file. In virtualized environments, the guest operating systems of several virtual machines (VMs) may be almost identical. Even backup snapshots may differ only slightly from one day to the next. Deduplication removes these inefficiencies from storage systems.