/*-
 * Copyright (C) 2011, 2018 Oracle and/or its affiliates. All rights reserved.
 *
 * This file was distributed by Oracle as part of a version of Oracle NoSQL
 * Database made available at:
 *
 * http://www.oracle.com/technetwork/database/database-technologies/nosqldb/downloads/index.html
 *
 * Please see the LICENSE file included in the top-level directory of the
 * appropriate version of Oracle NoSQL Database for a copy of the license and
 * additional information.
 */

package oracle.kv.avro;

import java.util.Map;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.specific.SpecificRecord;

import oracle.kv.Consistency;
import oracle.kv.KVStore;
import oracle.kv.KVStoreConfig;
import oracle.kv.Value;

/**
 * A catalog of Avro schemas and bindings for a store.
 * 
 * Manages schemas and provides {@link AvroBinding}s for use with the Avro
 * data format.  The bindings are used along with {@link KVStore} APIs for
 * storing and retrieving key-value pairs.  The bindings are used to serialize
 * Avro values before writing them, and deserialize Avro values after reading
 * them.  An AvroCatalog is obtained by calling {@link KVStore#getAvroCatalog}.
 *
 * WARNING: We strongly recommend using an {@link AvroBinding}.  NoSQL
 * Database will leverage Avro in the future to provide additional features and
 * capabilities.
 *
 * WARNING: To take advantage of the Avro data format, the bindings in
 * this class must be used.  The {@link Value} byte array is constructed by the
 * binding to include an internal reference to the schema used for
 * serialization.  The {@link Value} byte array may not be manipulated directly
 * by the application.
 *
 * Avro Schemas
 *
 * When the Avro data format is used, each stored value must be associated with
 * an Avro schema.  The Avro schema describes the fields allowed in the value,
 * along with their data types.  An Avro schema is created by the application
 * developer, added to the store using the NoSQL Database administration
 * interface, and used in the client API via the {@code AvroCatalog} class.
 * 
 * An Avro schema is created in JSON format, typically using a text editor and
 * initially saved in a text file.  Of course, to create an Avro schema the
 * developer must understand the Avro schema syntax.  For more information see
 * Avro Schemas in the Getting Started Guide and the Avro schema
 * specification.
 *
 * Once created and saved in a text file, the schema is added to the store
 * using the {@code ddl add-schema} administrative command, using the text file
 * as input; see Adding Schema in the Getting Started Guide.  Until a schema is
 * added, it may not be used in the client API to store values.  The use of the
 * schema in the client API is described further below.
 *
 * Note that the use of Avro schemas allows serialized values to be stored in a
 * very space-efficient binary format.  Each value is stored without any
 * metadata other than a small internal schema identifier, between 1 and 4
 * bytes in size.  One such reference is stored per key-value pair.  In this
 * way, the serialized Avro data format is always associated with the schema
 * used to serialize it, with minimal overhead.  This association is made
 * transparently to the application, and the internal schema identifier is
 * managed by the bindings supplied by the {@code AvroCatalog} class.  The
 * application never sees or uses the internal identifier directly.
 *
 * Two example schemas are shown below along with the administrative commands
 * for adding them to the store.  These schemas are used further below in other
 * examples.
 *
 * The schemas might be stored in a simple text file, {@code schema1.txt}:
 * 
 *  {
 *    "type": "record",
 *    "name": "MemberInfo",
 *    "namespace": "avro",
 *    "fields": [
 *      {"name": "name", "type": {
 *          "type": "record",
 *          "name": "FullName",
 *          "fields": [
 *              {"name": "first", "type": "string", "default": ""},
 *              {"name": "last", "type": "string", "default": ""}
 *          ]
 *      }, "default": {}},
 *      {"name": "age", "type": "int", "default": 0}
 *    ]
 *  }
 *
 * The administrative command for adding the above schemas is:
 *
 *  > ddl add-schema -file schema1.txt
 *
 * Schema Evolution
 *
 * A schema may be changed, even after data values are stored using that
 * schema, using the {@code ddl add-schema} administrative command with the
 * {@code -evolve} option; see Changing Schema in the Getting Started Guide.
 * The modified schema is saved in a text file, which is passed to this command
 * as input.  For example, fields may be added, removed or renamed.
 * 
 * For example, if a middle name property is added in the future to the
 * schema, it might be stored in {@code schema2.txt}.  Note that a new field
 * must be given a default value.
 * 
 *  {
 *    "type": "record",
 *    "name": "MemberInfo",
 *    "namespace": "avro",
 *    "fields": [
 *      {"name": "name", "type": {
 *          "type": "record",
 *          "name": "FullName",
 *          "fields": [
 *              {"name": "first", "type": "string", "default": ""},
 *              {"name": "middle", "type": "string", "default": ""},
 *              {"name": "last", "type": "string", "default": ""}
 *          ]
 *      }, "default": {}},
 *      {"name": "age", "type": "int", "default": 0}
 *    ]
 *  }
 *
 * The administrative command for adding the new version of the schema is:
 *
 *  > ddl add-schema -file schema2.txt -evolve
 * 
 * When a schema is changed, multiple versions of the schema will exist and be
 * maintained by the store.  The version of the schema used to serialize a
 * value, before writing it to the store, is called the writer
 * schema.  The writer schema is specified by the application when
 * creating a binding.  It is associated with the value when calling the
 * binding's {@link AvroBinding#toValue} method to serialize the data.  As
 * mentioned above, the writer schema is associated internally with every
 * stored value.
 *
 * The reader schema is used to deserialize a value after reading it
 * from the store. Like the writer schema, the reader schema is specified by
 * the client application when creating a binding.  It is used to deserialize
 * the data when calling the binding's {@link AvroBinding#toObject} method,
 * after reading a value from the store.
 *
 * When the reader and writer schemas are different, schema evolution is
 * applied during deserialization.  Schema evolution is applied by transforming
 * the data during deserialization, so that data stored according to the writer
 * schema is transformed to conform to the reader schema.  When the reader and
 * writer schemas are the same, no data transformation is necessary.  Also note
 * that no data transformation takes place during serialization; i.e., data is
 * always written according to the writer schema.
 *
 * Reader and writer schemas can be different when a client is changed to use a
 * new version of the schema, and then reads data that was written using the
 * old version.  Schema versions can also be different when two clients are
 * operating concurrently using two different versions of a schema.  In a
 * distributed system such as NoSQL Database, it is normally not possible or
 * desirable to upgrade all clients simultaneously, since this would require
 * downtime.  Therefore, for some period of time there will be a mix of clients
 * operating concurrently using different versions of a schema.  Fortunately,
 * this situation is handled gracefully by virtue of schema evolution.
 *
 * For example, imagine that a new field is added to a schema and there are two
 * versions of the schema.  The new field is only present in the new version of
 * the schema.  The new field must be assigned a default value in the new
 * schema.  There are three possible cases.
 *
 *   The writer schema and reader schema are the same.  Schema evolution is
 *   not necessary and no data transformation is applied.
 *   
 *   The writer schema is the old version and the reader schema is the new
 *   version.  Because the writer schema is the old version, the new field is
 *   not present in the stored data.  When a client uses the new version as a
 *   reader schema, the new field will appear to the client as having the
 *   default value.
 *   
 *   The writer schema is the new version and the reader schema is the old
 *   version.  Because the writer schema is the new version, the new field is
 *   present in the stored data.  When a client uses the old version as a
 *   reader schema, the new field will not appear at all to the client.
 *
 * If instead a field were deleted from a schema, the same rules would apply
 * but with the roles reversed.  Renaming a field is also possible by adding a
 * field alias to the schema; in this case the field is accessible by both the
 * old and new name.  For more information see Schema Evolution in the Getting
 * Started Guide and the detailed rules for schema evolution in the Avro schema
 * specification.
 * 
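 * The three cases above can be modeled in plain Java.  The following is a
 * minimal sketch, not the resolution code used by the bindings: it assumes a
 * record can be pictured as a map of field names to values, and the names
 * {@code EvolutionSketch} and {@code resolve} are hypothetical.
```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java model of field resolution during deserialization.
// It does not use the Avro library; all names here are hypothetical.
final class EvolutionSketch {

    // Projects data written with the writer schema onto the reader schema.
    // readerDefaults maps each reader field name to its default value.
    static Map<String, Object> resolve(Map<String, Object> written,
                                       Map<String, Object> readerDefaults) {
        Map<String, Object> result = new LinkedHashMap<>();
        for (Map.Entry<String, Object> field : readerDefaults.entrySet()) {
            String name = field.getKey();
            // A field absent from the stored data takes the reader default;
            // a stored field unknown to the reader is never copied.
            result.put(name, written.containsKey(name)
                                 ? written.get(name)
                                 : field.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Object> oldDefaults = new LinkedHashMap<>();
        oldDefaults.put("first", "");
        oldDefaults.put("last", "");
        Map<String, Object> newDefaults = new LinkedHashMap<>();
        newDefaults.put("first", "");
        newDefaults.put("middle", "");
        newDefaults.put("last", "");

        Map<String, Object> oldData = new LinkedHashMap<>();
        oldData.put("first", "Ada");
        oldData.put("last", "Lovelace");
        Map<String, Object> newData = new LinkedHashMap<>(oldData);
        newData.put("middle", "King");

        // Old writer, new reader: "middle" appears with its default value.
        System.out.println(resolve(oldData, newDefaults));
        // New writer, old reader: "middle" does not appear at all.
        System.out.println(resolve(newData, oldDefaults));
    }
}
```
 * When the writer and reader field sets are identical, {@code resolve}
 * returns the stored values unchanged, which corresponds to the first case.
 *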
 * To support schema evolution, be sure never to change a schema's name or
 * namespace.  A schema is uniquely identified by its Avro full name, which is
 * similar to a full Java class name and consists of a combination of the Avro
 * schema namespace and the schema name.
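 *
 * As an illustration of how the full name is formed, the sketch below joins a
 * namespace and a name following the naming rules of the Avro specification.
 * It is plain Java and uses no Avro classes; {@code FullNameSketch} and
 * {@code fullName} are hypothetical names.
```java
// Plain-Java sketch of how an Avro full name is formed; no Avro classes used.
final class FullNameSketch {

    // Joins namespace and name with a dot.  A name that already contains a
    // dot is treated as a full name, and an empty namespace is omitted.
    static String fullName(String namespace, String name) {
        if (name.indexOf('.') >= 0
            || namespace == null
            || namespace.isEmpty()) {
            return name;
        }
        return namespace + "." + name;
    }

    public static void main(String[] args) {
        // Namespace and name taken from the example schema above.
        System.out.println(fullName("avro", "MemberInfo"));
    }
}
```
 * The full name computed this way is the value that would be used as the key
 * in the schema maps accepted and returned by the methods of this interface.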
 *
 * Avro schema restrictions
 *
 * The top-level schema of a value stored in a key-value pair must be of the
 * Avro type record.
 *
 * Choosing a Binding
 *
 * The {@code AvroCatalog} provides a variety of {@link AvroBinding}s that
 * serialize and deserialize the Avro data format.  A summary of each binding
 * is below.
 *   
 *   {@link SpecificAvroBinding} is recommended when the schema(s) of the
 *   object(s) in the database are known when the application is being written.
 *   The names of the fields, and how to access them, are known at build time.
 *   A POJO (Plain Old Java Object) class for each schema is generated using
 *   the Avro compiler tools.  The POJO classes have property getters and
 *   setters that provide type safety.  This makes the {@code
 *   SpecificAvroBinding} the easiest of the bindings to use.
 *   
 *   {@link GenericAvroBinding} is recommended when the schema(s) of the
 *   object(s) in the database are not known at build-time. Rather than access
 *   the objects using predefined getters and setters, a program using {@code
 *   GenericAvroBinding} passes in the names of the fields to a generalized
 *   getter to retrieve data from an Avro object. For example, a generalized
 *   NoSQL Database record browser would require this capability.
 *   
 *   {@link JsonAvroBinding} is recommended when interoperability with
 *   other components or external systems that use JSON objects is needed.
 *   With the {@code JsonAvroBinding}, the Jackson API is used to manipulate
 *   JSON data objects.  Note that certain Avro data types are not conveniently
 *   represented as JSON values; see {@code JsonAvroBinding} for details.
 *   
 *   {@link RawAvroBinding} is recommended when an "escape" from the
 *   built-in serialization provided by the other bindings is needed.  The
 *   {@code RawAvroBinding} does not perform serialization, but instead allows
 *   specifying the Avro binary data as a byte array.  Serialization can be
 *   performed in any way desired, or not at all in the case where Avro binary
 *   data is exchanged with other components or external systems. Because it is
 *   low level and provides complete flexibility, the {@code RawAvroBinding}
 *   provides the least safety and is the most difficult of the bindings to
 *   use.
 *   
 * 
 * The detailed trade-offs for using each type of binding are described in
 * their javadoc: {@link SpecificAvroBinding}, {@link GenericAvroBinding},
 * {@link JsonAvroBinding}, and {@link RawAvroBinding}.
 *
 * Single schema and multiple schema bindings
 *
 * Specific, generic and JSON bindings have a single schema variant ({@link
 * #getSpecificBinding getSpecificBinding}, {@link
 * #getGenericBinding getGenericBinding} and {@link #getJsonBinding
 * getJsonBinding}) and a multiple schema variant ({@link
 * #getSpecificMultiBinding getSpecificMultiBinding}, {@link
 * #getGenericMultiBinding getGenericMultiBinding} and {@link
 * #getJsonMultiBinding getJsonMultiBinding}).
 * 
 * A single schema binding provides type checking.  Only values with the given
 * schema (or class, in the case of a specific class binding) can be used with
 * the binding.  A single schema specific class binding provides
 * compile-time type checking, while a single schema generic or JSON binding
 * provides run-time type checking.
 *
 * A single schema binding is safer than a multiple schema binding and often
 * preferable for that reason.  However, a multiple schema binding may be more
 * useful when retrieving key-value pairs of different types.  A {@link
 * KVStore} method may return values of different types if the application
 * stores multiple types for a single key, or if a method is called that
 * returns multiple key-value pairs such as {@link KVStore#multiGet multiGet},
 * {@link KVStore#multiGetIterator multiGetIterator}, or {@link
 * KVStore#storeIterator storeIterator}.  There are several ways of determining
 * which type is returned in these cases.
 *
 *   The key in the key-value pair may indicate the value type according
 *   to application specific knowledge of the key structure.  In this case,
 *   using a single schema binding may be appropriate.  The application can
 *   choose which binding to use based on the key structure.
 *   
 *   The schema name or a common property of the object may be used to
 *   determine the value type.  In this case a multiple schema binding can be
 *   used to return a {@link SpecificRecord}, {@link GenericRecord} or {@link
 *   JsonRecord}, and then the schema name or a property of the object can be
 *   examined.
 *   
 *   For a specific binding, the class may be used to determine the value
 *   type.  In this case a multiple schema binding can be used to return the
 *   {@link SpecificRecord}, and then {@code instanceof} can be used to
 *   determine the concrete class.
 *
 * Note that both single and multiple schema bindings perform schema evolution
 * when deserializing a value.  The deserialized value will conform to the
 * schema specified as an argument of the getXxxBinding or getXxxMultiBinding
 * method.
 *
 * A special use case for a generic or JSON multiple schema binding is when the
 * application treats values dynamically based on their schema, rather than
 * using a fixed set of schemas that is known in advance to the client
 * application.  In this case the {@link #getCurrentSchemas getCurrentSchemas}
 * method can be used to obtain a map of the most current schemas, which can
 * be passed to {@link #getGenericMultiBinding getGenericMultiBinding} or
 * {@link #getJsonMultiBinding getJsonMultiBinding}.
 *
 * Using Schemas with Bindings
 *
 * A client application normally embeds a copy of the schemas it uses, rather
 * than getting the current schemas from the store.  The client's schemas are
 * specified when a binding is created by one of the getXxxBinding methods.
 * This supports schema evolution (as described above), in that the {@link
 * AvroBinding#toObject toObject} method will transform the serialized data
 * such that the returned object conforms to the schema known to the
 * application.
 * 
 * The application specifies its known, embedded schemas in different ways,
 * depending on the type of binding used.
 *
 *   If an Avro specific binding is used, the schema is specified when the
 *   specific class is generated using the Avro compiler tools.  The schema
 *   text (in JSON format) is included in the generated code as a static String
 *   field, and is internally available to the binding.
 *   
 *   If a generic binding or JSON binding is used, the application's
 *   schemas must be explicitly embedded in the application.  For example, the
 *   application might maintain the text (in JSON format) of its schemas in the
 *   application source code (in static String fields) or in a resource file
 *   included in the application jar.  To create {@link Schema} objects from
 *   the schema text, the {@link org.apache.avro.Schema.Parser Schema.Parser}
 *   class may be used by the application.  After creating it, the schema
 *   object is passed to the getXxxBinding method.  A schema object is also
 *   passed to the constructor of {@link GenericRecord}, {@link JsonRecord} and
 *   {@link RawRecord}.
 *
 * As described further above, all schemas used by an application must be
 * defined using the NoSQL Database administrative interface.  If a schema
 * specified by the application via the client API has not been defined in the
 * store, an {@link UndefinedSchemaException} will be thrown by the
 * getXxxBinding method (if the schema is passed to this method), or by one of
 * the methods of the returned binding.  Matching of the application specified
 * schemas with schemas in the store is performed using the {@link
 * Schema#equals} method.
 *
 * One exception to the above is that an application may choose to use the
 * current version of schemas in the store that are returned by {@link
 * #getCurrentSchemas getCurrentSchemas}; in this case the set of schemas used
 * in the application need not be fixed at build time.  A second exception is
 * when the application chooses to use a raw binding and does not serialize or
 * deserialize the data, for example, when the serialized byte array is copied
 * to or from another component or system.
 *
 * WARNING: The application should not create new {@code Schema} objects
 * unnecessarily, since schema creation is an expensive operation.  The
 * expected approach is to create each distinct {@code Schema} only once, and
 * reuse that object whenever it is needed.  Also note that all {@code Schema}
 * objects created by the application and passed to an API method in this
 * package are cached.  This cache is associated with the {@code AvroCatalog}
 * instance, which is associated with the {@code KVStore} instance.  The cached
 * references to the {@code Schema} objects are not discarded until the {@code
 * KVStore} instance is closed and discarded.  For example, a very undesirable
 * approach would be for the application to create a new {@code Schema} object
 * for each serialization or deserialization operation; in this case,
 * performance would suffer greatly and the cached schemas would eventually
 * fill the JVM heap.
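 *
 * One way to follow this advice is to parse each distinct schema text once
 * and hand out the same object afterwards.  The sketch below is a minimal
 * illustration under stated assumptions: {@code ParseOnceSketch} is a
 * hypothetical helper, its type parameter stands in for the parsed schema
 * type, and the function passed to it stands in for a real parser call.
```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Parse-once holder.  P stands in for the parsed schema type and the parser
// function stands in for a real parser call; both are hypothetical.
final class ParseOnceSketch<P> {
    private final Map<String, P> parsed = new ConcurrentHashMap<>();
    private final Function<String, P> parser;

    ParseOnceSketch(Function<String, P> parser) {
        this.parser = parser;
    }

    // Returns the same parsed object for the same schema text on every call.
    P get(String schemaText) {
        return parsed.computeIfAbsent(schemaText, parser);
    }

    public static void main(String[] args) {
        int[] parseCount = {0};
        ParseOnceSketch<Object> schemas = new ParseOnceSketch<Object>(text -> {
            parseCount[0]++;          // counts how often parsing really runs
            return new Object();      // placeholder for a parsed schema
        });
        Object first = schemas.get("{\"type\": \"string\"}");
        Object second = schemas.get("{\"type\": \"string\"}");
        System.out.println(first == second);
        System.out.println(parseCount[0]);
    }
}
```
 * A simpler alternative with the same effect is to keep each parsed schema
 * in a static final field that is initialized once when the class is loaded.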
 *
 * @since 2.0
 *
 * @deprecated as of 4.0, use the table API instead.
 */
@Deprecated
public interface AvroCatalog {

    /**
     * Returns a binding for representing values as instances of a generated
     * Avro specific class, for a single given class.
     *
     * @param cls an Avro specific class that was previously generated using
     * the Avro code generation tools.
     *
     * @return the AvroBinding that can be used for serialization and
     * deserialization.
     *
     * @throws UndefinedSchemaException if the schema associated with the given
     * class parameter has not been defined using the NoSQL Database
     * administration interface.
     *
     * @see SpecificAvroBinding
     */
    public <T extends SpecificRecord> SpecificAvroBinding<T>
        getSpecificBinding(Class<T> cls);

    /**
     * Returns a binding for representing values as instances of generated Avro
     * specific classes, for any Avro specific class.
     *
     * @return the AvroBinding that can be used for serialization and
     * deserialization.
     *
     * @see SpecificAvroBinding
     */
    public SpecificAvroBinding<SpecificRecord> getSpecificMultiBinding();

    /**
     * Returns a binding for representing a value as an Avro {@link
     * GenericRecord}, for values that conform to a single given expected
     * schema.
     *
     * @param schema the Avro schema expected for all values and {@link
     * GenericRecord}s used with this binding.
     *
     * @return the AvroBinding that can be used for serialization and
     * deserialization.
     *
     * @throws UndefinedSchemaException if the given schema has not been
     * defined using the NoSQL Database administration interface.
     */
    public GenericAvroBinding getGenericBinding(Schema schema);

    /**
     * Returns a binding for representing a value as an Avro {@link
     * GenericRecord}, for values that conform to multiple given expected
     * schemas.
     *
     * @param schemas the Avro schemas expected for all values and {@link
     * GenericRecord}s used with this binding.  The key in the map is the full
     * name of the schema.
     *
     * @return the AvroBinding that can be used for serialization and
     * deserialization.
     *
     * @throws UndefinedSchemaException if any of the given schemas has not
     * been defined using the NoSQL Database administration interface.
     */
    public GenericAvroBinding
        getGenericMultiBinding(Map<String, Schema> schemas);

    /**
     * Returns a binding for representing a value as a {@link JsonRecord}, for
     * values that conform to a single given expected schema.
     *
     * @param schema the Avro schema expected for all values and {@link
     * JsonRecord}s used with this binding.
     *
     * @return the AvroBinding that can be used for serialization and
     * deserialization.
     *
     * @throws UndefinedSchemaException if the given schema has not been
     * defined using the NoSQL Database administration interface.
     */
    public JsonAvroBinding getJsonBinding(Schema schema);

    /**
     * Returns a binding for representing a value as a {@link JsonRecord}, for
     * values that conform to multiple given expected schemas.
     *
     * @param schemas the Avro schemas expected for all values and {@link
     * JsonRecord}s used with this binding.  The key in the map is the full
     * name of the schema.
     *
     * @return the AvroBinding that can be used for serialization and
     * deserialization.
     *
     * @throws UndefinedSchemaException if any of the given schemas has not
     * been defined using the NoSQL Database administration interface.
     */
    public JsonAvroBinding getJsonMultiBinding(Map<String, Schema> schemas);

    /**
     * Returns a binding for representing a value as a {@link RawRecord}
     * containing the raw Avro serialized byte array and its associated schema.
     *
     * @return the AvroBinding that can be used for packaging and unpackaging
     * the serialized value.
     */
    public RawAvroBinding getRawBinding();

    /**
     * Returns an immutable Map containing the most current version of all
     * schemas from the {@link KVStore} client schema cache.  The Map key is
     * the full name of the schema.
     *
     * A special use case for a generic or JSON multiple schema binding is when
     * the application treats values dynamically based on their schema, rather
     * than using a fixed set of known schemas.  The {@link #getCurrentSchemas
     * getCurrentSchemas} method can be used to obtain a map of the most
     * current schemas, which can be passed to {@link #getGenericMultiBinding
     * getGenericMultiBinding} or {@link #getJsonMultiBinding
     * getJsonMultiBinding}.  See {@link GenericAvroBinding} and {@link
     * JsonAvroBinding} for an example of this use case.
     *
     * @return an immutable Map of full schema name to schema object.
     */
    public Map<String, Schema> getCurrentSchemas();

    /**
     * Refreshes the cache of stored schemas, adding any new schemas or new
     * versions of schemas to the cache that have been stored via the
     * administration interface since the cache was last refreshed.
     *
     * Calling this method is normally not necessary, since the schema cache is
     * automatically refreshed whenever a schema is specified via any of the
     * Avro binding APIs, and that schema is not already present in the cache.
     *
     * Calling this method periodically may be necessary when the {@link
     * KVStore} handle is long lived, the {@link #getCurrentSchemas} method is
     * used to obtain current schemas, and the application wishes to obtain
     * schemas that were recently added using the administration interface.
     * 
     * WARNING: Calling this method often from multiple threads may cause
     * blocking during the query for schema changes.  Also note that calling
     * this method often could impact the performance of other operations,
     * since it queries key-value pairs in the store.
     *
     * @param consistency determines the consistency associated with the read
     * used to query for new schemas.  If null, the {@link
     * KVStoreConfig#getConsistency default consistency} is used.
     */
    public void refreshSchemaCache(Consistency consistency);
}