IJRCS – Volume 1 Issue 4 Paper 3


Authors: Amal Thankachan | R Sujitha

Volume 01 Issue 04  Year 2014  ISSN No:  2349-3828  Page no:  10-14



Deduplication has become a widely deployed technology in cloud data centers to improve IT resource efficiency. However, traditional techniques face a great challenge in big data deduplication: striking a sensible tradeoff between the conflicting goals of scalable deduplication throughput and a high duplicate elimination ratio. We propose AppDedupe, an application-aware, scalable, inline distributed deduplication framework for cloud environments, to meet this challenge by exploiting application awareness, data similarity, and locality to optimize distributed deduplication with inter-node two-tiered data routing and intra-node application-aware deduplication. It first dispatches application data at the file level with application-aware routing to preserve application locality, then assigns similar application data to the same storage node at super-chunk granularity using a handprinting-based stateful data routing scheme, which maintains high global deduplication efficiency while balancing the workload across nodes. AppDedupe builds application-aware similarity indices from super-chunk handprints to speed up the intra-node deduplication process. Our experimental evaluation of AppDedupe against state-of-the-art schemes, driven by real-world datasets, demonstrates that AppDedupe achieves the highest global deduplication efficiency, with a higher global deduplication effectiveness than the high-overhead and poorly scalable traditional scheme, but at an overhead only slightly higher than that of the scalable but low duplicate-elimination-ratio approaches.
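The two-tiered routing described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `Node`, `Router`, `chunk_fingerprints`, and `handprint` are hypothetical, the fixed-size chunking is a simplification, and the handprint is assumed here to be the k smallest chunk fingerprints of a super-chunk. Tier 1 selects the candidate nodes for an application type; tier 2 picks the candidate whose similarity index overlaps most with the handprint, breaking ties in favor of the node that has received the least data.

```python
import hashlib
from collections import defaultdict


def chunk_fingerprints(super_chunk, chunk_size=4):
    """Split a super-chunk into fixed-size chunks and hash each one with SHA-1."""
    return [hashlib.sha1(super_chunk[i:i + chunk_size]).hexdigest()
            for i in range(0, len(super_chunk), chunk_size)]


def handprint(fingerprints, k=4):
    """Assumed handprint: the k smallest distinct chunk fingerprints."""
    return set(sorted(set(fingerprints))[:k])


class Node:
    """A storage node with one similarity index per application type."""

    def __init__(self, name):
        self.name = name
        self.index = defaultdict(set)  # application type -> known fingerprints
        self.routed_bytes = 0          # crude load measure for balancing


class Router:
    def __init__(self, nodes_by_app):
        # Tier 1 (assumed mapping): application type -> candidate nodes.
        self.nodes_by_app = nodes_by_app

    def route(self, app, super_chunk):
        hp = handprint(chunk_fingerprints(super_chunk))
        candidates = self.nodes_by_app[app]
        # Tier 2: most handprint matches wins; ties go to the least-loaded node.
        best = max(candidates,
                   key=lambda n: (len(hp & n.index[app]), -n.routed_bytes))
        best.index[app] |= hp          # stateful: remember what this node holds
        best.routed_bytes += len(super_chunk)
        return best


a, b = Node("a"), Node("b")
router = Router({"vm": [a, b]})
first = router.route("vm", b"AAAABBBBCCCCDDDD")    # no matches anywhere yet
second = router.route("vm", b"EEEEFFFFGGGGHHHH")   # no matches; load decides
similar = router.route("vm", b"AAAABBBBCCCCZZZZ")  # shares chunks with `first`
```

In this sketch the third super-chunk follows the first one to the same node because three of its four fingerprints already appear in that node's index, which is the behavior that keeps similar data together for intra-node deduplication.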


Big Data Deduplication, Application Awareness, Data Routing, Handprinting, Similarity Index

