Deduplication
Overview
When similar systems are backed up to the same data storage device, there is the potential for redundancy within the backed up data. However, a data repository only needs to store one copy of the files in order to restore them.
How does deduplication work?
SEP sesam Si3 applies deduplication at the block level. In this deduplication technique, data is divided into blocks, which are then checked and duplicates are skipped. Only unique blocks are sent to storage. By eliminating redundant blocks, the size of the backed up data is reduced as no duplicate data is backed up. Storing the identical data only once results in reduced storage space requirements and network load as no duplicates are transferred over the network.
SEP sesam offers a hybrid of two methods:
- target-based (Si3T) and
- source-based (Si3S) deduplication
to enable the best possible scenarios for efficient data backup in different environments. Both methods use a configured Si3 deduplication data store that requires a special licence. See Licensing for details.
- Data chunks are compared and transferred.
- Only new chunks are saved and/or only pointers are set.
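The block-level idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the Si3 implementation: the `BlockStore` class, the fixed block size and the use of SHA-256 are all assumptions made for the example.

```python
import hashlib


class BlockStore:
    """Toy deduplicating store (hypothetical): keeps each unique block once."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.blocks = {}  # block hash -> block bytes (unique blocks only)

    def backup(self, data):
        """Split data into blocks, store unseen blocks, return the list of hashes."""
        recipe = []
        for offset in range(0, len(data), self.block_size):
            block = data[offset:offset + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            if digest not in self.blocks:   # new chunk: save it
                self.blocks[digest] = block
            recipe.append(digest)           # known chunk: only a pointer is kept
        return recipe

    def restore(self, recipe):
        """Rebuild the original data by following the pointers."""
        return b"".join(self.blocks[digest] for digest in recipe)

    def stored_bytes(self):
        return sum(len(block) for block in self.blocks.values())


# A tiny block size keeps the example readable.
store = BlockStore(block_size=4)
first = store.backup(b"AAAABBBBAAAACCCC")   # 4 blocks, 3 of them unique
second = store.backup(b"AAAABBBBDDDD")      # 3 blocks, only DDDD is new

print(store.stored_bytes())                 # 16 bytes stored for 28 bytes backed up
print(store.restore(second))                # b'AAAABBBBDDDD'
```

Both backups remain fully restorable even though the store holds only the four unique blocks.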
Deduplication options: HW and SW
Hardware (HW) deduplication refers to dedicated hardware that deduplicates the data on the storage device. SEP sesam supports appliances such as HPE StoreOnce, Fujitsu CS800 and Quantum DXi, as well as any disk array with deduplication. (For the complete list, check the SEP sesam Storage Hardware Support Matrix.)
The advantage of hardware deduplication is that the integration, management and monitoring of deduplication are already handled by the storage device. Such appliances offer high scalability and a performance guarantee, but come at a higher cost.
They can be used as backup storage and the deduplication on the hardware is completely transparent to the SEP sesam software. If the deduplication appliance also performs replication, the replica is unknown to the SEP sesam backup server (except for the Catalyst copy).
Software (SW) deduplication here refers to SEP Si3 integrated deduplication. The advantage of Si3 deduplication is flexible pricing: it is already included in the volume license (see Licensing or contact sales@sep.de). Si3 deduplication supports any direct-attached disk, provides global deduplication with source-side (Si3S) and target-side (Si3T) deduplication, replication and encryption, and allows single file restore (SFR) and instant recovery.
Tip: As of SEP sesam v. 5.0.0 Jaglion, two Si3 data store types are available. It is strongly recommended to use the newer type, SEP Si3 deduplication store, as the old type (Si3 V1) will soon be obsolete. The newer Si3 has advantages over Si3 V1: it offers better performance and resource savings, and enables you to back up your data directly to S3 cloud storage or Azure storage and restore the items you want directly from there. It also provides a new immutable storage feature, SiS. For more details, see Configuring Si3 Deduplication Store.
You can also duplicate backups to HPE StoreOnce Catalyst stores; in this case, however, different requirements must be met. For details, see the HPE StoreOnce documentation.
Cloud options
- SEP sesam enables you to back up your data to S3 cloud storage and Azure (≥ Jaglion V2).
- It provides integration with the Hewlett Packard Enterprise (HPE) StoreOnce Catalyst storage system and supports HPE Cloud Bank Storage as a Catalyst copy target for data replication and HPE Cloud Volumes for direct backup and replication. For details, see HPE StoreOnce Configuration.
Si3 encryption
To protect your data from unauthorised access, SEP sesam Si3 deduplication provides Si3 encryption defined at the data store level. It is different from encryption at the backup task level. By default, Si3 encryption is disabled.
When Si3 encryption is enabled (by specifying the encryption password), the data is encrypted after deduplication. The data is encrypted during transmission to the storage server and is stored in encrypted form. Without the password, the data on the Si3 data store cannot be read. For details, see Encrypting Si3 Deduplication Store.
Si3 target deduplication (Si3T)
Si3T means that deduplication takes place on the target destination, the Si3 repository. This is an inline block-level data deduplication method where the data is written directly from the SEP sesam Server or Remote Device Server to the backup media. The backups are deduplicated on the fly as the data is written to the storage target. Because redundant data is transferred over the network unreduced and is deduplicated only at the destination, the network load is higher than with source-side deduplication, but the storage savings are still substantial.
SEP sesam analyses data blocks and determines whether the data is unique or has already been copied to the Si3 repository. Only single instances of unique data are sent to the repository, while each deduplicated file is replaced by a stub object. This stub object points to the repository and is used to retrieve stored data.
Not all data is suitable for deduplication: encrypted files, disk blocks with a non-standard size, etc. cannot be deduplicated. See Data Deduplication Use Cases for more information.
Si3 source deduplication (Si3S)
Si3S means that data is deduplicated before it is sent over the network, making the backup extremely bandwidth efficient. During the backup, SEP sesam calculates the hash values of the data to be backed up on the client and queries the storage to determine whether the hash value of each block is already stored there. If it is, SEP sesam sends only the hash value; if not, it sends the block itself. As a result, only blocks that are changed or unknown to the target Si3 deduplication store are transferred to the backup server.
The advantage of Si3S deduplication is that only new or changed data is transferred to the backup server during the backup. This optimises bandwidth usage and requires less storage capacity. In contrast to target-based deduplication at the storage location, however, source-based deduplication requires much more computing effort on the client and is therefore not suitable for every scenario. Whether the backup window is actually reduced depends on your data structure: hashing chunks of data is very CPU-intensive, so such backups might take even longer. You should consider which clients could be overloaded in this way. In general, source-based deduplication can be an excellent solution for environments with a low daily data change rate and low bandwidth between the backup server and the backed up client.
Not all data is suitable for deduplication: encrypted files, disk blocks with a non-standard size, etc. cannot be deduplicated. See Data Deduplication Use Cases for more information.
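The difference in network load between the two methods can be sketched as follows. This is a simplified model under stated assumptions, not SEP sesam code: `DedupStore`, `backup_target_side` and `backup_source_side` are hypothetical names, blocks have a fixed size of 4096 bytes, and SHA-256 digests (32 bytes) stand in for the hash values exchanged with the store.

```python
import hashlib

BLOCK_SIZE = 4096   # assumed fixed block size
HASH_BYTES = 32     # size of a SHA-256 digest


def split_blocks(data):
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]


class DedupStore:
    """Hypothetical stand-in for a deduplication store on the backup server."""

    def __init__(self):
        self.blocks = {}

    def has(self, digest):
        return digest in self.blocks

    def put(self, digest, block):
        self.blocks[digest] = block


def backup_target_side(data, store):
    """Every block crosses the network; duplicates are dropped on arrival."""
    sent = 0
    for block in split_blocks(data):
        sent += len(block)
        digest = hashlib.sha256(block).digest()
        if not store.has(digest):
            store.put(digest, block)
    return sent


def backup_source_side(data, store):
    """The client hashes each block and sends the block only if it is unknown."""
    sent = 0
    for block in split_blocks(data):
        digest = hashlib.sha256(block).digest()   # CPU cost is paid on the client
        sent += HASH_BYTES                        # the hash is always sent
        if not store.has(digest):
            store.put(digest, block)
            sent += len(block)                    # unknown block: send the data too
    return sent


# Day 1: ten distinct blocks. Day 2: the same data with the last block changed.
day1 = b"".join(bytes([i]) * BLOCK_SIZE for i in range(10))
day2 = day1[:-BLOCK_SIZE] + bytes([99]) * BLOCK_SIZE

target_store, source_store = DedupStore(), DedupStore()
backup_target_side(day1, target_store)
backup_source_side(day1, source_store)

target_sent = backup_target_side(day2, target_store)
source_sent = backup_source_side(day2, source_store)
print(target_sent)   # 40960 bytes: all ten blocks are transferred
print(source_sent)   # 4416 bytes: ten hashes plus the one changed block
```

Both stores end up with the same eleven unique blocks; only the amount of data sent over the network and the place where the hashing work happens differ.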
Why use SEP sesam Si3 deduplication?
- Si3 deduplication is inline deduplication that removes redundancies from data before writing it to backup storage. Compared to post-process deduplication, it reduces the required raw disk capacity and saves a lot of disk storage and bandwidth, as data in its original size is never written to disk and less data has to be transported over the network.
- Si3 deduplication is very effective because it uses a variable-length block approach. The deduplication algorithm uses advanced context-aware anchor points to look at a sequence of data and divide it into variable-length blocks based on the characteristics of the data itself. In this way, when a block is repeated, a pointer to the original is stored instead of storing the block again. This results in significant space savings compared to fixed-length deduplication.
- As each chunk is compressed, the data size is smaller at the binary level, consuming less storage space and storing more data in the available storage capacity.
- Si3 encryption for the Si3 deduplication store is one of the SEP sesam encryption types that can help comply with data protection legislation as you get a fully encrypted deduplication store. For details, see Encrypting Si3 Deduplication Store.
- SEP sesam provides replication to support the uninterrupted operation of business-critical applications and ensure remote access to critical data and applications in the event of a disaster. Only changed data blocks are sent over a network and replicated to the target server.
- Replication is asynchronous (based on a schedule at a time specified by the user) and is typically performed after a backup at the primary site to a secondary data server or to the cloud.
- By using near-CDP, based on scheduled frequent replication jobs (similar to CDP in terms of RPOs), SEP sesam replication can be more cost-effective and less resource intensive than true-CDP. Since SEP sesam only stores incremental changes, the load on the network is minimal while throughput is accelerated.
- As the retention period of the backed up and replicated data is based on the media pool EOL, allowing a different retention time for each media pool, you can customize your retention policies to keep only the data you need and efficiently manage your storage space.
- Replication only works for disk storage (no tape).
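The benefit of variable-length blocks mentioned in the list above can be illustrated with a small content-defined chunking sketch. The actual Si3 anchor-point algorithm is not described here, so the example swaps in a deliberately simple stand-in: a boundary is declared wherever the CRC32 of the last `WINDOW` bytes is divisible by `DIVISOR`. Both constants and both function names are assumptions chosen for illustration.

```python
import random
import zlib

WINDOW = 16    # bytes of context examined at each position (assumed value)
DIVISOR = 64   # on average one boundary every 64 bytes (assumed value)


def chunk_variable(data):
    """Split data at anchor points derived from the content itself."""
    chunks, start = [], 0
    for i in range(WINDOW - 1, len(data)):
        window = data[i - WINDOW + 1:i + 1]
        if zlib.crc32(window) % DIVISOR == 0:   # anchor point found
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])             # remainder after the last anchor
    return chunks


def chunk_fixed(data, size=DIVISOR):
    """Split data into fixed-length blocks for comparison."""
    return [data[i:i + size] for i in range(0, len(data), size)]


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(4096))
shifted = b"X" + original   # a single byte inserted at the front

for name, chunker in (("fixed", chunk_fixed), ("variable", chunk_variable)):
    known = set(chunker(original))
    new_chunks = chunker(shifted)
    reused = sum(1 for chunk in new_chunks if chunk in known)
    print(f"{name}: {reused} of {len(new_chunks)} chunks already stored")
```

With fixed-length blocks, the inserted byte shifts every later block boundary, so the blocks no longer match what is already stored. With content-defined boundaries, every chunk after the first anchor point is cut at the same place relative to the data and can be replaced by a pointer.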
What works best
Note: When choosing your deduplication method to eliminate redundant backup data, carefully analyze your existing infrastructure, network constraints and the type of data you want to protect.
- Typically, source-side deduplication is well suited for environments with low LAN/WAN bandwidth and smaller amounts of data. Another typical source-side deduplication case is remote data (ROBO) backup, that is, protecting and storing the data created by remote and branch offices.
- On the other hand, target deduplication might be more suitable for large data sets in a fast network, such as structured databases, where data volumes need to be significantly reduced, or for data located on clients for which you do not want to increase CPU overhead.
- You should be aware of deduplication limitations before configuring it. For example, it does not make sense to deduplicate certain data, such as media files, which cannot be actively deduplicated because the files are unique and exist in compressed media formats. Examples include MP3, MP4, JPEG, PNG and zipped files.
- For different data types, different deduplication policies and methods should be set up, e.g. databases in one Si3 store and path backups in another Si3 store.
Si3 deduplication can be used together with Si3 replication to provide backup redundancy for disaster recovery and reduce the data transferred over the network.
Data deduplication use cases
The achievable deduplication ratio depends on several factors:
- Data type
  - Data that is encrypted, pre-compressed or rich in metadata has the lowest deduplication values (pdf; audio files, such as mp3, wma; video files, such as avi, mkv; image files, such as jpg, png). The only exception is when the same data is backed up repeatedly; in that case the deduplication rate is very good even for such data.
  - Relational databases such as SQL and Oracle cannot achieve a high deduplication ratio because they have a unique key for each DB record that prevents the deduplication process from identifying them as duplicates.
  - The greatest benefit can be achieved in virtual environments where multiple VMs with application deployments and those used for testing and development result in duplicate guests and associated data.
  - Similarly, deduplication is beneficial for virtual desktop infrastructures and endpoint clients, as these tend to create duplicate data.
  Tip: Instead of encrypting your data, use Si3 encryption.
- Change rate
  - The higher the daily data change rate, the lower the deduplication ratio.
  - Primary storage contains little duplicate data, periodic archives contain a moderate amount, and recurring backups contain largely duplicated data.
- Retention time
  - Data with a longer retention time and more copies has a better deduplication ratio.
- Backup level
  - With a daily full backup, the deduplication rate is higher than with an incremental or differential backup due to data redundancy.
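How change rate and retention time interact can be made concrete with a deliberately simplified model. It assumes that the first full backup is stored completely and that each later full backup adds only the changed fraction of its data as new blocks. Real ratios also depend on data type and compression, and the function name is hypothetical.

```python
def dedup_ratio_full_backups(copies, change_rate):
    """Logical data divided by stored data for `copies` daily full backups.

    Sizes are expressed in units of one full backup. Assumption: each
    backup after the first contributes `change_rate` of a full backup
    in new, unique blocks.
    """
    logical = copies
    stored = 1 + (copies - 1) * change_rate
    return logical / stored


# 30 retained full backups with a 1 % daily change rate
print(round(dedup_ratio_full_backups(30, 0.01), 2))   # 23.26
# same retention, 10 % daily change rate: the ratio drops
print(round(dedup_ratio_full_backups(30, 0.10), 2))   # 7.69
# 1 % change rate, but only 7 retained copies: the ratio also drops
print(round(dedup_ratio_full_backups(7, 0.01), 2))    # 6.6
```

In this model, a higher change rate or a shorter retention time both lower the ratio, which matches the factors listed above.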
Space reduction ratio and percent
- The deduplication ratio is the measure of the original size of the data compared to the size of the data after the redundancy has been removed.
- The deduplication ratio usually depends on the use case and time; the ratio should be at least 4:1, which means that four times more data is protected than the storage space needed to store it.
- Note that the ratios can only be meaningfully compared under the same assumptions.
- Even relatively low ratios can still result in significant space savings because less storage space is needed.
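The relationship between the deduplication ratio and the percentage of space saved is simple arithmetic: a ratio of r:1 means that 1/r of the original size is stored, so the reduction is (1 - 1/r) × 100 percent. A short sketch (the function name is chosen for illustration):

```python
def space_reduction_percent(ratio):
    """Convert a deduplication ratio of ratio:1 into percent of space saved."""
    return (1 - 1 / ratio) * 100


for ratio in (2, 4, 10, 20):
    print(f"{ratio}:1 -> {space_reduction_percent(ratio):.0f} % less storage")
# 2:1 -> 50 % less storage
# 4:1 -> 75 % less storage
# 10:1 -> 90 % less storage
# 20:1 -> 95 % less storage
```

The numbers show why low ratios already matter: going from no deduplication to 2:1 halves the required space, while going from 10:1 to 20:1 saves only a further 5 percentage points.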
Deduplication example
Watch the SEP sesam video Why and how to use deduplication with SEP sesam for additional information on calculating the deduplication ratio and on the deduplication options in a SEP sesam environment.
See also
Si3 Deduplication Hardware Requirements – Configuring Si3 NG Deduplication Store – Configuring Source-side Deduplication – Replication – HPE StoreOnce – Licensing – Encrypting Si3 NG Deduplication Store