I recently evaluated a few Linux deduplicating filesystems, namely opendedup, lessfs and ddumbfs, for use as backup repository storage.
I was frustrated with opendedup because it would return "filesystem full" when there was no obvious reason why the filesystem might be full. As a result, I did not feel confident that I could monitor the system reliably enough to be alerted before the filesystem filled up.
Lessfs was too slow in my environment with the Berkeley DB and Tokyo Cabinet backends, and I could not get the hamsterdb backend to create a clean filesystem.
In the end I was happiest with ddumbfs (http://www.magiksys.net/ddumbfs/). In my tests, the dedup ratio dropped by 40% every time I doubled the block size, so I wanted to use a 4kB block size to maximise the deduplication ratio. Unfortunately, good performance at that block size required a lot of RAM for the index (just under 1% of the raw disk capacity); without it, performance was poor. Buying 90GB of RAM for 9TB of raw storage would eliminate most of the cost benefits of using deduplicated storage, so we looked at SSD as a compromise. I ended up applying the attached patch, which improved the performance of ddumbfs 4x when the index is not locked in RAM.
Note: Every time you double the block size, you cut your SSD/RAM requirements in half. Depending on how your data is aligned, you may see a significant reduction in your dedup ratio every time you double the block size.
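To make the trade-off concrete, here is a rough sketch of the index size arithmetic. It assumes the figures above hold exactly (an index of about 1% of raw capacity at a 4kB block size, halving with each doubling of the block size); the function name index_size_gb is only illustrative and is not part of ddumbfs.

```shell
#!/bin/sh
# Estimate the ddumbfs index size in GB for a given raw capacity and block size.
# Assumption: a 4kB block size needs an index of about 1% of the raw capacity,
# and each doubling of the block size halves that requirement.
index_size_gb() {
    raw_gb=$1      # raw data capacity in GB
    block_kb=$2    # block size in kB (4, 8, 16, ...)
    # 1% of raw capacity at 4kB, scaled by 4/block_kb; integer division rounds down
    echo $(( raw_gb * 4 / (100 * block_kb) ))
}

# Print the estimate for a 9TB (9000GB) data store at several block sizes
for bs in 4 8 16 32 64; do
    echo "block=${bs}k index=$(index_size_gb 9000 "$bs")GB"
done
```

For 9000GB of raw storage this gives 90GB at 4kB and 45GB at 8kB. Whether the smaller index is worth it depends on how much dedup ratio you lose on your own data.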
Ddumbfs requires a regular Linux filesystem (the parent) in which it creates its files. Each file on this filesystem contains the hash values of the deduplicated blocks and an index position within the data device that holds the real data. With no deduplication, this filesystem needs the same capacity as your index file; otherwise, multiply the index size by your expected deduplication ratio (e.g. if the index is 1% of your storage and you expect every unique block to be used 5 times, you need 5% of your storage to be available in this filesystem).
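The sizing rule above can be sketched as a one-line calculation. This is only an estimate under the assumption that the space needed grows linearly with the dedup ratio; parent_fs_gb is a hypothetical helper name, not a ddumbfs tool.

```shell
#!/bin/sh
# Estimate the space needed on the parent filesystem, in GB.
# Assumption: required space = index size x expected deduplication ratio.
parent_fs_gb() {
    index_gb=$1      # index size in GB (about 1% of raw capacity at 4kB blocks)
    dedup_ratio=$2   # expected number of times each unique block is used
    echo $(( index_gb * dedup_ratio ))
}

# 90GB index, every unique block used 5 times on average
parent_fs_gb 90 5   # prints 450
```

With a 90GB index and a 5x ratio this comes to 450GB, i.e. 5% of 9TB.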
The end result was the following system configuration:
8x2.4GHz CPU cores
28GB RAM
512GB Samsung XP941 SSD (only 90GB required)
9TB RAID5 Data store
The 9TB drive was partitioned with 500GB as an ext4 filesystem in /dev/sdb1 and the remaining space as raw storage in /dev/sdb2.
The 512GB drive was partitioned with a 128GB partition in /dev/sdc1.
First, I created my ext4 filesystem and mounted it in /.ddumbfs/veeam as follows:
mkfs.ext4 -m0 /dev/sdb1
/etc/fstab entry:
UUID=0ba3c8d7-8d62-48bb-a837-a588ad1dc8be /.ddumbfs/veeam ext4 relatime 0 2
mount /.ddumbfs/veeam
Second, I created my ddumbfs filesystem as follows:
mkddumbfs -i /dev/sdc1 -b /dev/sdb2 -B 4k /.ddumbfs/veeam/
Third, I mounted the ddumbfs as follows:
/etc/fstab entry:
-oparent=/.ddumbfs/veeam,nolock_index /var/veeam fuse.ddumbfs defaults 0 0
mount /var/veeam
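After mounting, it is worth confirming that both filesystems are actually in place before pointing backups at them. The following is a minimal sketch that looks the mount points up in /proc/mounts; is_mounted is a hypothetical helper, and its optional second argument (an alternative mounts file) is there only to make it easy to test.

```shell
#!/bin/sh
# Return success if the given mount point appears in a mounts table.
# The table defaults to /proc/mounts; another file in the same format
# can be passed as the second argument.
is_mounted() {
    mount_point=$1
    mounts_file=${2:-/proc/mounts}
    # Field 2 of each line in /proc/mounts is the mount point
    awk -v mp="$mount_point" '$2 == mp { found = 1 } END { exit !found }' "$mounts_file"
}

# Check the parent filesystem and the ddumbfs mount used in this setup
for mp in /.ddumbfs/veeam /var/veeam; do
    if is_mounted "$mp"; then
        echo "$mp is mounted"
    else
        echo "$mp is NOT mounted"
    fi
done
```

The same check could be run from cron or a monitoring agent, so that a missing mount is noticed before a backup job writes into the empty directory underneath.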