On Mon, Feb 11, 2013 at 6:22 AM, SaE!o Kiselkov <skiselkov.ml@gmail.com>wrote:
So I've been talking to some people around storage and found out that SHA-256 hashing *is* a significant cost in implementing dedup.
I know I'm arriving late to this discussion. For my own understanding, I'd like to attempt to summarize the various positions and get everyone's feedback. SHA-256 is slow; we'd like to find a faster algorithm to replace it, which will use less CPU. Such a replacement must be usable for dedup without verification. (Performance and behavior with verification is a secondary concern to the points outlined here.) Dedup inherently relies on probabilistic correctness: if there is a hash collision, incorrect data will be returned. This is true even with the best hash algorithms (including SHA-256), and even without the possibility of storing malicious data. However, the probability involved is exceedingly small. Given 2^64 bytes (16 exabytes) in 8KB blocks, the odds of a collision are approximately 1/2^155. This is less likely than consecutively buying 5 jackpot-winning lottery tickets (assuming lottery odds 1/100 million). Dedup is sometimes used with user-generated data (i.e. untrusted, possibly malicious users provide the data to store). In the case, the hash algorithm should be such that it is infeasible to find a block which hashes to a given value. Otherwise an untrusted user may cause ZFS to return incorrect data. Dedup is sometimes used only with trusted data (i.e. none of the data can be maliciously generated). In this case, the algorithm need only distribute input blocks evenly over all 2^256 possible hash values. It is OK if it is feasible to find a block with a given hash value, because the risk associated is no worse than the ideal case (e.g 1/2^155 chance of returning incorrect data with the workload described above). SHA-256 is currently used for both of these use cases (trusted and untrusted data). So a replacement should also be usable for both. Optionally, we could implement a third, even faster, algorithm for use only in the trusted case. Some people believe that this choice may be misused (i.e. used even when the data can not be trusted), and therefore this option should not be offered. I'm not an expert in crypto algorithms; I've only read the wikipedia page on the SHA-3 competition ( http://en.wikipedia.org/wiki/NIST_hash_function_competition). Being fairly paranoid mainly due to my lack of expertise in this area, I would prefer to choose one of the finalists as a replacement for SHA-256. It sounds like BLAKE and Skein have reasonable performance. I think that the easiest way forward will be to first agree on a high-performance replacement for SHA-256, which is usable in all cases that SHA-256 is, including with untrusted data. We can then evaluate demand for an even faster algorithm to be used only with trusted data. --matt ------------------------------------------- illumos-zfs Archives: https://www.listbox.com/member/archive/182191/=now RSS Feed: https://www.listbox.com/member/archive/rss/182191/22842876-6fe17e6f Modify Your Subscription: https://www.listbox.com/member/?member_id=22842876&id_secret=22842876-a25d3366 Powered by Listbox: http://www.listbox.com ----- End forwarded message ----- -- Eugen* Leitl <a href="http://leitl.org">leitl</a> http://leitl.org ______________________________________________________________ ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE