/*
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package com.addthis.basis.chars;

import com.google.common.annotations.Beta;

/**
 * A variation on ByteBufs for Character Strings. This variation has three primary goals:
 *
 * 1. Faster serialization and deserialization. Character Strings that are only
 * infrequently treated as anything more than byte sequences waste a lot of CPU
 * and generate heap garbage (which itself costs CPU). This is especially egregious for
 * the all too frequent case of deserializing a string, passing it around a few
 * threads, and then serializing it again, but it is almost as bad when the only
 * operations are comparisons to other Strings.
 *
 * 1 (Example). In hydra, bundles are sent from a query worker to the master with
 * many String values serialized as byte arrays in the UTF-8 format. It is entirely
 * possible for that String to be passed to the user without ever being manipulated.
 * That means it was deserialized and then reserialized back to the same byte array
 * for essentially no reason. That worst case could be resolved by lazy loading or
 * a special never-deserialized value, but this does not scale well for the long tail
 * of infrequent, low-intensity operations like comparisons to other Character Strings.
 * Additionally, a lazy loading implementation would likely be implemented as a wrapper
 * class. That would add another layer of indirection and waste memory. This solution
 * is closer to 'lazy loading of chars', which actually turns out to be pretty cheap.
 *
 * 2. Reduced memory overhead. Standard Java char types are 16 bits, but for the common case
 * of all or mostly ASCII characters, this is twice (or near that) as much memory as needed.
 *
 * 3. More flexible char[] semantics, similar to the difference between byte[]s and ByteBufs. E.g., decreasing
 * the number of readable values is possible as a constant-time operation without creating a new array.
 * String itself is also really, deeply into making char[] copies. See AsciiSequence.toString() for
 * an example of how easy it can be to accidentally make lots of array copies, and how hard it is to avoid even
 * when you are trying to. (In hydra, AbstractBufferingHttpBundleEncoder ran into a similar issue where it
 * was mistakenly creating an unnecessary copy.)
 *
 * * * *
 * Secondary goals/benefits:
 * * * *
 *
 * - Specializing in one encoding with one backing structure allows for much more efficient
 * encode and decode methods than those in the standard library, which are constrained by their abstractions.
 *
 * - Gets around some of the other more egregious inefficiencies of JDK UTF-8 encoding/decoding,
 * like decoding pre-allocating three times as much space as needed for the ASCII-only case and
 * then cutting down by re-allocating to the smaller char array. This implementation allows and
 * encourages providing hints about how much to allocate, and should be able to more easily support
 * correcting under-estimates. (As far as I can tell, the JDK NIO coding library does support that --
 * it just isn't actually used anywhere I can find. Possibly that is because benchmarks showed it wasn't worth
 * it, but it is also possible that was due to limitations we do not have here.)
 *
 * - Using CharSequence here and other places gives us more options with respect to optimizing
 * things like sub-string semantics (shared/unshared), and efficient streaming cache hit
 * detection.
 *
 * - Using ByteBufs directly makes integration with other ByteBuf based IO easy and efficient.
 *
 * This interface combines several related ones and additionally imposes the following contracts:
 *
 * - all backing data should be stored in UTF-8 format only. UTF-8 is the one
 * true format, and heretics will be persecuted without remorse.
 *
 * - hashCode and equals should return consistent values across implementations
 * for the same underlying logical character sequence.
 * -- for lack of other motivations, and possibly for no actual benefit, these
 * will be the same values that an equivalent String representation would return.
 *
 * - compareTo should perform lexicographical string comparison.
 * -- Note that while such comparisons are likely to be consistent with other
 * CharSequence implementations, we cannot actually guarantee that to be the
 * case because CharSequence does not require it. Accordingly, we do not derive
 * much benefit from declaring Comparable of type CharSequence because, e.g.,
 * native Strings declare Comparable only for other Strings.
 * -- Also note that the UTF-8 format (which you are required to implement)
 * should be able to do lexicographical comparisons without converting to chars
 * (comparing the bytes as unsigned values should suffice, and orders by code point).
 *
 * Component reasoning
 *
 * CharSequence: to sub in for arbitrary String usages
 *
 * Appendable: Convenient for building CharSequences, and CharBufs are likely efficient at doing so
 *
 * Comparable: so that CharBuf-only CharSequence environments can use sorted data structures
 *
 * ByteBufHolder: subject to change, but helpful for resource management, and exposing
 * the underlying data store for more efficient operations than per-char method calls.
 * Possible replacements for ByteBufHolder might be directly extending ByteBuf with more or
 * different char methods, or simply creating a whole char-based equivalent with conversions.
 *
 * Maybe add Iterable of Character, or a primitive equivalent?
 */
@Beta
public interface CharBuf extends ReadableCharBuf, Appendable {

}
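
The hashCode and compareTo contracts in the Javadoc above can be illustrated with a small standalone sketch that works on raw UTF-8 byte arrays. The class `Utf8ContractSketch` and its methods are hypothetical names invented for this example; they are not part of the basis library and do not use ByteBuf. The hash helper is deliberately limited to the all-ASCII case, where each byte is exactly one char. One caveat worth keeping in mind: unsigned byte-wise comparison orders by code point, which agrees with String.compareTo except when a supplementary character is compared against a char in the range U+E000 to U+FFFF.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration only; not part of com.addthis.basis. */
final class Utf8ContractSketch {

    private Utf8ContractSketch() {}

    /** Encodes a String as UTF-8 so the examples have bytes to work with. */
    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /**
     * Lexicographic comparison of two valid UTF-8 byte arrays without decoding.
     * Java bytes are signed, so each byte is masked to its unsigned value first;
     * comparing signed bytes would sort non-ASCII characters before ASCII ones.
     */
    static int compareUtf8(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xFF;
            int y = b[i] & 0xFF;
            if (x != y) {
                return x - y;
            }
        }
        // One array is a prefix of the other: the shorter one sorts first.
        return a.length - b.length;
    }

    /**
     * Computes the value String.hashCode() would return for an all-ASCII
     * sequence, reading the UTF-8 bytes directly instead of building a String.
     * Assumption: input is ASCII only; anything else is rejected.
     */
    static int asciiHash(byte[] utf8) {
        int h = 0;
        for (byte b : utf8) {
            if (b < 0) {
                throw new IllegalArgumentException("non-ASCII byte in input");
            }
            h = 31 * h + b;
        }
        return h;
    }
}
```

With this sketch, `compareUtf8(utf8("apple"), utf8("banana"))` is negative and `asciiHash(utf8("hello"))` equals `"hello".hashCode()`. A full implementation would also need to hash multi-byte sequences as their UTF-16 code units to keep matching String for non-ASCII text.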