Deduplication is becoming more and more prevalent among proprietary data backup solutions. Meanwhile, an open source deduplication solution has been quietly emerging for some time and is beginning to mature: OpenDedup.
For those who have forgotten or do not know this technology, here is Wikipedia's definition:
« Data deduplication is a specific form of compression where redundant data is eliminated, typically to improve storage utilization. In the deduplication process, duplicate data is deleted, leaving only one copy of the data to be stored. However, indexing of all data is still retained should that data ever be required. Deduplication is able to reduce the required storage capacity since only the unique data is stored. For example, a typical email system might contain 100 instances of the same one megabyte (MB) file attachment. If the email platform is backed up or archived, all 100 instances are saved, requiring 100 MB storage space. With data deduplication, only one instance of the attachment is actually stored; each subsequent instance is just referenced back to the one saved copy. In this example, a 100 MB storage demand could be reduced to only 1 MB. Different applications have different levels of data redundancy. Backup applications generally benefit the most from de-duplication due to the nature of repeated full backups of an existing file system. »
Note also that, to optimize deduplication, data is usually stored as blocks, as shown in the diagram below:
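To get an intuition for block-level deduplication, here is a minimal sketch using standard tools (the file name somefile.bin is hypothetical): we split a file into 4 KB blocks, hash each block, and count how many blocks share the same hash. Blocks with identical hashes are exactly what a deduplicating file system would store only once.
# split -a 5 -b 4096 somefile.bin block_
# sha256sum block_* | awk '{print $1}' | sort | uniq -c | sort -rn | head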
Great Applications for Deduplication
- Virtual Machines
- Network shares for unstructured data such as office documents and PSTs
- Any application with a large amount of duplicate data
Applications that are not a good fit for Deduplication
- Anything that has totally unique data
- Music Files
- Encrypted Data
Deduplication with OpenDedup
SDFS leverages data deduplication for primary storage. It acts as a normal file system that can be used for typical IO operations, similar to EXT3, NTFS, etc. The difference is that SDFS hashes blocks of data as they are written to the file system and only writes the unique ones to disk. Blocks that are not unique simply reference the data that is already on disk. SDFS has the following requirements:
- x64 Linux distribution (the application was tested and developed on Ubuntu 10.04)
- FUSE 2.8+. Debian packages are available at: http://opendedup.googlecode.com/files/debian-fuse.tar.gz
- 2 GB of RAM
- Java 7 – available at https://jdk7.dev.java.net/
- attr (setfattr and getfattr) if you plan on doing snapshots or setting extended file attributes.
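Before installing, these prerequisites can be checked quickly. A short sketch: the first command should report x86_64, the second at least 2 GB of RAM, and the last two confirm that the FUSE packages and the attr tools are present (exact package names may vary by distribution).
# uname -m
# free -m
# dpkg -l | grep fuse
# getfattr --version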
Install (on Debian 6)
This assumes that Debian 6 (Squeeze) is already installed.
# wget http://opendedup.googlecode.com/files/sdfs-1.0.7.tar.gz
# tar -zxf sdfs-1.0.7.tar.gz
# mv sdfs-bin /opt/sdfs
# apt-get install attr
# wget http://opendedup.googlecode.com/files/debian-fuse.tar.gz
# tar -zxf debian-fuse.tar.gz
# cd debian-fuse
# apt-get install libselinux1-dev libsepol1-dev
# dpkg -i libfuse2_2.8.3-opendedup0_amd64.deb \
    libfuse-dev_2.8.3-opendedup0_amd64.deb \
    fuse-utils_2.8.3-opendedup0_amd64.deb
# tar -zxf jdk-7-fcs-bin-b146-linux-x64-20_jun_2011.tar.gz
# mkdir /usr/lib/jvm
# mv jdk1.7.0 /usr/lib/jvm/jdk
# export JAVA_HOME=/usr/lib/jvm/jdk
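Note that the export above only lasts for the current shell. To make JAVA_HOME persistent, one might append it to /etc/profile (a suggestion, not part of the original procedure), then verify the installation:
# echo 'export JAVA_HOME=/usr/lib/jvm/jdk' >> /etc/profile
# echo 'export PATH=$JAVA_HOME/bin:$PATH' >> /etc/profile
# $JAVA_HOME/bin/java -version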
Create SDFS file system
To list all available parameters of mkfs.sdfs: mkfs.sdfs --help
--volume-capacity and --volume-name are required, but I also recommend --volume-maximum-full-percentage, which makes the file system return an error when it is full. Otherwise the « df » command will show 100% while the storage space continues to grow. By default, data is stored in /opt/sdfs/<volume name>.
# cd /opt/sdfs
/opt/sdfs# ./mkfs.sdfs --volume-name=sdfs_vol1 --volume-capacity=500MB --volume-maximum-full-percentage=100
Attempting to create volume ...
Volume [sdfs_vol1] created with a capacity of [500MB]
check [/etc/sdfs/sdfs_vol1-volume-cfg.xml] for configuration details if you need to change anything
Mount SDFS file system
/opt/sdfs# mkdir /mnt/sdfs
/opt/sdfs# ./mount.sdfs -v sdfs_vol1 -m /mnt/sdfs
Running SDFS Version 1.0.7
reading config file = /etc/sdfs/sdfs_vol1-volume-cfg.xml
-f /mnt/sdfs -o direct_io,big_writes,allow_other,fsname=sdfs_vol1-volume-cfg.xml
11:11:05.114 main INFO [fuse.FuseMount]: Mounting filesystem
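To unmount the volume when finished, the standard FUSE behavior applies: as root a plain umount works, or fusermount -u for an unprivileged user.
# umount /mnt/sdfs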
Two identical copies of files
To begin, we'll copy the same file twice, under two different names, onto the SDFS file system:
# du -hc /opt/sdfs/volumes/sdfs_vol1/
[...]
20K     total
# cp jdk-7-fcs-bin-b146-linux-x64-20_jun_2011.tar.gz /mnt/sdfs/
# df -h /mnt/sdfs/
Filesystem               Size  Used Avail Use% Mounted on
sdfs_vol1-volume-cfg.xml 500M   91M  410M  19% /mnt/sdfs
# du -hc /opt/sdfs/volumes/sdfs_vol1/
[...]
91M     total
# cp jdk-7-fcs-bin-b146-linux-x64-20_jun_2011-copie.tar.gz /mnt/sdfs/
# df -h /mnt/sdfs/
Filesystem               Size  Used Avail Use% Mounted on
sdfs_vol1-volume-cfg.xml 500M  181M  319M  37% /mnt/sdfs
# du -hc /opt/sdfs/volumes/sdfs_vol1/
[...]
91M     total
We can see that the disk space actually occupied in the backing store stays the same (91 MB), while « df » reports the logical sum of the two files (181 MB).
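As an extra check not in the original run, one could confirm that the two copies really are byte-identical, and therefore fully deduplicable, by comparing their checksums:
# md5sum /mnt/sdfs/*.tar.gz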
Copying two files where the second contains the data of the first twice
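The second file can be produced by concatenating the first with itself (my assumption about how ldap2x.ldif was built):
# cat ldap.ldif ldap.ldif > ldap2x.ldif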
# ls -lh ldap*
-rw-r--r-- 1 root root 42M  8 Jul 13:56 ldap2x.ldif
-rw-r--r-- 1 root root 21M 14 Mar  2006 ldap.ldif
# cp ldap*.ldif /mnt/sdfs/
# df -h /mnt/sdfs/
Filesystem               Size  Used Avail Use% Mounted on
sdfs_vol1-volume-cfg.xml 500M   63M  438M  13% /mnt/sdfs
# du -hc /opt/sdfs/volumes/sdfs_vol1/
[...]
42M     total
For this test, one third of the data was deduplicated: « df » reports a logical total of 63 MB, while only 42 MB are actually stored.
Copying 500 MB of files (text, jpg, pdf, mp3 …) until the mounted file system is saturated
# ls -rlh /mnt/sdfs/
-rw-r--r-- 1 root root 7.8M 21 Sep  2010 Water - Evolution.mp3
-rw-r--r-- 1 root root 643K 10 Jun 10:27 terrain vague.jpg
-rw-r--r-- 1 root root  34M  7 Jun 09:48 squeezeboxserver_7.5.4_all.deb
[...]
# df -h /mnt/sdfs
Filesystem               Size  Used Avail Use% Mounted on
sdfs_vol1-volume-cfg.xml 500M  500M     0 100% /mnt/sdfs
# du -hc /mnt/sdfs
[...]
564M    total
# du -hc /opt/sdfs/volumes/sdfs_vol1/
[...]
583M    total
Here, I confess, I have some difficulty interpreting these results: « df » shows the volume full at 500 MB, while « du » reports 564 MB in the mount point and 583 MB in the backing store!
For testing, I used a virtual machine with 4 GB of RAM, 2 CPUs and 3 virtual disks, on which I installed Debian 6.0 (Squeeze).
# hdparm -t /dev/sda
/dev/sda:
 Timing buffered disk reads: 486 MB in 3.02 seconds = 160.71 MB/sec
# hdparm -t /dev/sdb
/dev/sdb:
 Timing buffered disk reads: 484 MB in 3.00 seconds = 161.18 MB/sec
# hdparm -t /dev/sdc
/dev/sdc:
 Timing buffered disk reads: 482 MB in 3.00 seconds = 160.43 MB/sec
- Test copy of a 698 MB file:
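This test amounts to timing a plain copy onto each file system (the file name below is hypothetical; the original article does not give it):
# time cp big-698MB-file.iso /mnt/ext3/
# time cp big-698MB-file.iso /mnt/ext4/
# time cp big-698MB-file.iso /mnt/sdfs/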
Unsurprisingly, EXT4 leads the race, while SDFS trails near the back of the pack.
- Test with dd
# time sh -c "dd if=/dev/zero of=/mnt/ext3/test bs=4096 count=175000 && sync"
# time sh -c "dd if=/dev/zero of=/mnt/ext4/test bs=4096 count=175000 && sync"
# time sh -c "dd if=/dev/zero of=/mnt/sdfs/test bs=4096 count=175000 && sync"
In this example, we create a test file on each partition (ext3, ext4 and sdfs) by writing 175,000 blocks of 4 KB, i.e. 175,000 × 4,096 bytes ≈ 717 MB. dd reports the elapsed time and the throughput achieved.
SDFS is almost four times slower than EXT3.
# time sh -c "dd if=/mnt/ext3/test of=/dev/null bs=4096 count=175000 && sync"
# time sh -c "dd if=/mnt/ext4/test of=/dev/null bs=4096 count=175000 && sync"
# time sh -c "dd if=/mnt/sdfs/test of=/dev/null bs=4096 count=175000 && sync"
Here we read back the same test file, sending the output to /dev/null. dd returns the same information as before, but this time for reads.
In this second test the gap is even greater, with a ratio of 1/15!
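A methodological aside, not part of the original test: read results can be inflated by the Linux page cache, so to measure the disks rather than RAM one might drop the caches between runs:
# sync
# echo 3 > /proc/sys/vm/drop_caches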
- Test with bonnie++
The next step is to analyze performance using bonnie++. This program simulates database-style access to a single file, as well as the creation, reading and deletion of many small files, mimicking the usage patterns of programs such as Squid, INN, or Maildir-based mail software (e.g. qmail).
# bonnie++ -d /mnt/ext3 -s 512 -r 256 -u0
# bonnie++ -d /mnt/ext4 -s 512 -r 256 -u0
# bonnie++ -d /mnt/sdfs -s 512 -r 256 -u0
The command runs the test with 512 MB of data (-s 512) in the mounted file system. The other options tell bonnie++ the amount of RAM (-r 256, i.e. 256 MB) and the user to run as (-u0, i.e. root).
Note that « +++++ » in certain cells means that the test took less than 500 ms and a reliable result could not be computed.
[bonnie++ results table for EXT3, EXT4 and SDFS: sequential output, sequential input and random seeks (KB/s and % CPU), followed by sequential and random file creations (files/sec and % CPU).]
The ranking is confirmed: EXT4 comes out slightly ahead of EXT3, with SDFS in last place, far behind the other two.
OpenDedup is certainly attractive and promising. However, to get good performance, I think it should be used with 15,000 rpm disks and a minimum of 4 GB of RAM. I also noticed an encoding problem when file names contain accented characters. Moreover, the deduplication, which is supposed to operate on blocks of data, does not seem very effective here. The documentation is not abundant, but it contains sufficient information, so I must have missed something …
Your comments are welcome if you want to explore this subject.