com.google.common.hash.HashFunction Maven / Gradle / Ivy
/*
* Copyright (C) 2011 The Guava Authors
*
* Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
* in compliance with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software distributed under the License
* is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
* or implied. See the License for the specific language governing permissions and limitations under
* the License.
*/
package com.google.common.hash;
import com.google.common.annotations.Beta;
import com.google.common.primitives.Ints;
import java.nio.charset.Charset;
/**
* A hash function is a collision-averse pure function that maps an arbitrary block of data to a
* number called a hash code.
*
* Definition
*
* Unpacking this definition:
*
*
* - block of data: the input for a hash function is always, in concept, an ordered byte
* array. This hashing API accepts an arbitrary sequence of byte and multibyte values (via
* {@link Hasher}), but this is merely a convenience; these are always translated into raw byte
* sequences under the covers.
*
*
- hash code: each hash function always yields hash codes of the same fixed bit length
* (given by {@link #bits}). For example, {@link Hashing#sha1} produces a 160-bit number, while
* {@link Hashing#murmur3_32()} yields only 32 bits. Because a {@code long} value is clearly
* insufficient to hold all hash code values, this API represents a hash code as an instance of
* {@link HashCode}.
*
*
- pure function: the value produced must depend only on the input bytes, in the order
* they appear. Input data is never modified. {@link HashFunction} instances should always be
* stateless, and therefore thread-safe.
*
*
- collision-averse: while it can't be helped that a hash function will sometimes produce
* the same hash code for distinct inputs (a "collision"), every hash function strives to
* some degree to make this unlikely. (Without this condition, a function that always
* returns zero could be called a hash function. It is not.)
*
*
* Summarizing the last two points: "equal yield equal always; unequal yield unequal
* often." This is the most important characteristic of all hash functions.
*
*
Desirable properties
*
* A high-quality hash function strives for some subset of the following virtues:
*
*
* - collision-resistant: while the definition above requires making at least some
* token attempt, one measure of the quality of a hash function is how well it succeeds
* at this goal. Important note: it may be easy to achieve the theoretical minimum collision
* rate when using completely random sample input. The true test of a hash function is
* how it performs on representative real-world data, which tends to contain many hidden
* patterns and clumps. The goal of a good hash function is to stamp these patterns out as
* thoroughly as possible.
*
*
- bit-dispersing: masking out any single bit from a hash code should yield only
* the expected twofold increase to all collision rates. Informally, the "information" in
* the hash code should be as evenly "spread out" through the hash code's bits as possible. The
* result is that, for example, when choosing a bucket in a hash table of size 2^8, any
* eight bits could be consistently used.
*
*
- cryptographic: certain hash functions such as {@link Hashing#sha512} are designed to
* make it as infeasible as possible to reverse-engineer the input that produced a given hash
* code, or even to discover any two distinct inputs that yield the same result. These
* are called cryptographic hash functions. But, whenever it is learned that either of
* these feats has become computationally feasible, the function is deemed "broken" and should
* no longer be used for secure purposes. (This is the likely eventual fate of all
* cryptographic hashes.)
*
*
- fast: perhaps self-explanatory, but often the most important consideration. We have
* published microbenchmark results for many common hash functions.
*
*
* Providing input to a hash function
*
* The primary way to provide the data that your hash function should act on is via a
* {@link Hasher}. Obtain a new hasher from the hash function using {@link #newHasher}, "push" the
* relevant data into it using methods like {@link Hasher#putBytes(byte[])}, and finally ask for the
* {@code HashCode} when finished using {@link Hasher#hash}. (See an {@linkplain #newHasher example}
* of this.)
*
*
If all you want to hash is a single byte array, string or {@code long} value, there are
* convenient shortcut methods defined directly on {@link HashFunction} to make this easier.
*
*
Hasher accepts primitive data types, but can also accept any Object of type {@code
* T} provided that you implement a {@link Funnel Funnel} to specify how to "feed" data from that
* object into the function. (See {@linkplain Hasher#putObject an example} of this.)
*
* Compatibility note: Throughout this API, multibyte values are always interpreted in
* little-endian order. That is, hashing the byte array {@code {0x01, 0x02, 0x03, 0x04}} is
* equivalent to hashing the {@code int} value {@code 0x04030201}. If this isn't what you need,
* methods such as {@link Integer#reverseBytes} and {@link Ints#toByteArray} will help.
*
*
Relationship to {@link Object#hashCode}
*
* Java's baked-in concept of hash codes is constrained to 32 bits, and provides no separation
* between hash algorithms and the data they act on, so alternate hash algorithms can't be easily
* substituted. Also, implementations of {@code hashCode} tend to be poor-quality, in part because
* they end up depending on other existing poor-quality {@code hashCode} implementations,
* including those in many JDK classes.
*
*
{@code Object.hashCode} implementations tend to be very fast, but have weak collision
* prevention and no expectation of bit dispersion. This leaves them perfectly suitable for
* use in hash tables, because extra collisions cause only a slight performance hit, while poor bit
* dispersion is easily corrected using a secondary hash function (which all reasonable hash table
* implementations in Java use). For the many uses of hash functions beyond data structures,
* however, {@code Object.hashCode} almost always falls short -- hence this library.
*
* @author Kevin Bourrillion
* @since 11.0
*/
@Beta
public interface HashFunction {
/**
* Begins a new hash code computation by returning an initialized, stateful {@code
* Hasher} instance that is ready to receive data. Example:
{@code
*
* HashFunction hf = Hashing.md5();
* HashCode hc = hf.newHasher()
* .putLong(id)
* .putBoolean(isActive)
* .hash();}
*/
Hasher newHasher();
/**
* Begins a new hash code computation as {@link #newHasher()}, but provides a hint of the expected
* size of the input (in bytes). This is only important for non-streaming hash functions (hash
* functions that need to buffer their whole input before processing any of it).
*/
Hasher newHasher(int expectedInputSize);
/**
* Shortcut for {@code newHasher().putInt(input).hash()}; returns the hash code for the given
* {@code int} value, interpreted in little-endian byte order. The implementation might
* perform better than its longhand equivalent, but should not perform worse.
*
* @since 12.0
*/
HashCode hashInt(int input);
/**
* Shortcut for {@code newHasher().putLong(input).hash()}; returns the hash code for the given
* {@code long} value, interpreted in little-endian byte order. The implementation might
* perform better than its longhand equivalent, but should not perform worse.
*/
HashCode hashLong(long input);
/**
* Shortcut for {@code newHasher().putBytes(input).hash()}. The implementation might
* perform better than its longhand equivalent, but should not perform worse.
*/
HashCode hashBytes(byte[] input);
/**
* Shortcut for {@code newHasher().putBytes(input, off, len).hash()}. The implementation
* might perform better than its longhand equivalent, but should not perform worse.
*
* @throws IndexOutOfBoundsException if {@code off < 0} or {@code off + len > bytes.length} or
* {@code len < 0}
*/
HashCode hashBytes(byte[] input, int off, int len);
/**
* Shortcut for {@code newHasher().putUnencodedChars(input).hash()}. The implementation
* might perform better than its longhand equivalent, but should not perform worse. Note
* that no character encoding is performed; the low byte and high byte of each {@code char} are
* hashed directly (in that order).
*
* Warning: This method will produce different output than most other languages do when
* running the same hash function on the equivalent input. For cross-language compatibility, use
* {@link #hashString}, usually with a charset of UTF-8. For other use cases, use {@code
* hashUnencodedChars}.
*
* @since 15.0 (since 11.0 as hashString(CharSequence)).
*/
HashCode hashUnencodedChars(CharSequence input);
/**
* Shortcut for {@code newHasher().putString(input, charset).hash()}. Characters are encoded using
* the given {@link Charset}. The implementation might perform better than its longhand
* equivalent, but should not perform worse.
*
*
Warning: This method, which reencodes the input before hashing it, is useful only for
* cross-language compatibility. For other use cases, prefer {@link #hashUnencodedChars}, which is
* faster, produces the same output across Java releases, and hashes every {@code char} in the
* input, even if some are invalid.
*/
HashCode hashString(CharSequence input, Charset charset);
/**
* Shortcut for {@code newHasher().putObject(instance, funnel).hash()}. The implementation
* might perform better than its longhand equivalent, but should not perform worse.
*
* @since 14.0
*/
HashCode hashObject(T instance, Funnel super T> funnel);
/**
* Returns the number of bits (a multiple of 32) that each hash code produced by this hash
* function has.
*/
int bits();
}