/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/**
 * Restrict the domain of a data attribute, oftentimes to fulfill business rules/requirements.
 *
 * Table of Contents
 * - Overview
 * - Concurrency and Atomicity
 * - Caveats
 * - Example usage
 *
 * Overview
 *
 * Constraints are used to enforce business rules in a
* database. By checking all {@link org.apache.hadoop.hbase.client.Put Puts} on a given table, you
* can enforce very specific data policies. For instance, you can ensure that a certain column
* family-column qualifier pair always has a value between 1 and 10. Otherwise, the
* {@link org.apache.hadoop.hbase.client.Put} is rejected and the data integrity is maintained.
*
* Constraints are designed to be configurable, so a constraint can be used across different tables,
* but implement different behavior depending on the specific configuration given to that
* constraint.
*
 * By adding a constraint to a table (see Example Usage), constraints will
 * automatically be enabled. You then also have the option to disable (just 'turn off') or remove
 * (delete all associated information) all constraints on a table. If you remove all constraints
 * (see
 * {@link org.apache.hadoop.hbase.constraint.Constraints#remove(org.apache.hadoop.hbase.client.TableDescriptorBuilder)}),
 * you must re-add any {@link org.apache.hadoop.hbase.constraint.Constraint} you want on that table.
 * However, if they are just disabled (see
 * {@link org.apache.hadoop.hbase.constraint.Constraints#disable(org.apache.hadoop.hbase.client.TableDescriptorBuilder)}),
 * all you need to do is enable constraints again, and everything will be turned back on as it was
 * configured. Individual constraints can also be enabled, disabled or removed without
 * affecting other constraints.
*
 * By default, constraints are disabled on a table. This means you will not see any slowdown
 * on a table if constraints are not enabled.
*
 * Concurrency and Atomicity
 *
 * Currently, no attempt is made at enforcing
 * correctness in a multi-threaded scenario when modifying a constraint, via
 * {@link org.apache.hadoop.hbase.constraint.Constraints}, on the
 * {@link org.apache.hadoop.hbase.client.TableDescriptorBuilder}. This is particularly important
 * when adding constraints to the {@link org.apache.hadoop.hbase.client.TableDescriptorBuilder},
 * as it first retrieves the next priority from a custom value set in the descriptor, adds each
 * constraint (with increasing priority) to the descriptor, and then stores the next available
 * priority back in the {@link org.apache.hadoop.hbase.client.TableDescriptorBuilder}.
*
 * Locking is recommended around each of the Constraints add methods:
* {@link org.apache.hadoop.hbase.constraint.Constraints#add(org.apache.hadoop.hbase.client.TableDescriptorBuilder, Class...)},
* {@link org.apache.hadoop.hbase.constraint.Constraints#add(org.apache.hadoop.hbase.client.TableDescriptorBuilder, org.apache.hadoop.hbase.util.Pair...)},
* and
* {@link org.apache.hadoop.hbase.constraint.Constraints#add(org.apache.hadoop.hbase.client.TableDescriptorBuilder, Class, org.apache.hadoop.conf.Configuration)}.
* Any changes on a single TableDescriptor should be serialized, either within a single
* thread or via external mechanisms.
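 *
 * As a minimal sketch of such external serialization (the lock object and helper method are
 * assumptions for illustration, not part of this package), all descriptor mutations can be
 * funneled through one lock:
 *
 private static final Object DESCRIPTOR_LOCK = new Object();

 void addConstraints(TableDescriptorBuilder builder) throws IOException {
   synchronized (DESCRIPTOR_LOCK) {
     // every mutation of this builder happens under one lock, so the
     // priority counter stored in the descriptor is read and updated atomically
     Constraints.add(builder, IntegerConstraint.class, MyConstraint.class);
   }
 }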
*
* Note that having a higher priority means that a constraint will run later; e.g. a constraint with
* priority 1 will run before a constraint with priority 2.
*
 * Since Constraints currently are designed to just implement simple checks (e.g. is the value in
 * the right range), there will be no atomicity conflicts. Even if one of the puts finishes its
 * constraint check first, the single row will not be corrupted and the 'fastest' write will win;
 * the underlying region takes care of breaking the tie and ensuring that writes get serialized to
 * the table. Note that this does not ensure any specific ordering of writes or even a fully
 * consistent view of the underlying data.
*
* Each constraint should only use local/instance variables, unless doing more advanced usage.
* Static variables could cause difficulties when checking concurrent writes to the same region,
 * leading to either highly locked situations (decreasing throughput) or higher probability of
* errors. However, as long as each constraint just uses local variables, each thread interacting
* with the constraint will execute correctly and efficiently.
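 *
 * As a hedged illustration of that guidance (this class and its configuration keys are
 * assumptions for the example, not part of this package), a configurable range check can keep all
 * of its state in instance fields populated from its Configuration:
 *
 public class RangeConstraint extends BaseConstraint {
   // hypothetical configuration keys, for illustration only
   public static final String MIN_KEY = "constraint.range.min";
   public static final String MAX_KEY = "constraint.range.max";

   private int min;
   private int max;

   @Override
   public void setConf(Configuration conf) {
     super.setConf(conf);
     if (conf != null) {
       // per-instance state only; no statics shared between threads
       this.min = conf.getInt(MIN_KEY, 1);
       this.max = conf.getInt(MAX_KEY, 10);
     }
   }

   public void check(Put p) throws ConstraintException {
     for (List<KeyValue> kvs : p.getFamilyMap().values()) {
       for (KeyValue kv : kvs) {
         int value;
         try {
           value = Integer.parseInt(new String(kv.getValue()));
         } catch (NumberFormatException e) {
           throw new ConstraintException("Value in Put (" + p + ") is not an integer", e);
         }
         if (value < min || value > max) {
           throw new ConstraintException(
             "Value " + value + " in Put (" + p + ") is outside [" + min + ", " + max + "]");
         }
       }
     }
   }
 }
 *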
 * Caveats
 *
 * In traditional (SQL) databases, Constraints are often used
* to enforce referential
* integrity. However, in HBase, this will likely cause significant overhead and dramatically
* decrease the number of {@link org.apache.hadoop.hbase.client.Put Puts}/second possible on a
* table. This is because to check the referential integrity when making a
* {@link org.apache.hadoop.hbase.client.Put}, one must block on a scan for the 'remote' table,
* checking for the valid reference. For millions of {@link org.apache.hadoop.hbase.client.Put Puts}
 * a second, this will break down very quickly. There are several options around the blocking
* behavior including, but not limited to:
*
* - Create a 'pre-join' table where the keys are already denormalized
* - Designing for 'incorrect' references
* - Using an external enforcement mechanism
*
 * There are also several general considerations that must be taken into account when using
 * Constraints:
*
 * - All changes made via {@link org.apache.hadoop.hbase.constraint.Constraints} will make
 * modifications to the {@link org.apache.hadoop.hbase.client.TableDescriptor} for a given table. As
 * such, the usual re-enabling of tables should be used for propagating changes to the table. When
 * at all possible, Constraints should be added to the table before the table is created.
* - Constraints are run in the order that they are added to a table. This has implications for
* what order constraints should be added to a table.
* - Whenever new Constraint jars are added to a region server, those region servers need to go
* through a rolling restart to make sure that they pick up the new jars and can enable the new
* constraints.
* - There are certain keys that are reserved for the Configuration namespace:
*
* - _ENABLED - used server-side to determine if a constraint should be run
* - _PRIORITY - used server-side to determine what order a constraint should be run
*
 * If these items are set, they will be respected in the constraint configuration, but they are
 * taken care of by default when adding constraints to a
 * {@link org.apache.hadoop.hbase.client.TableDescriptorBuilder} via the usual method.
*
*
* Under the hood, constraints are implemented as a Coprocessor (see
* {@link org.apache.hadoop.hbase.constraint.ConstraintProcessor} if you are interested).
 * Example usage
 *
 * First, you must define a
* {@link org.apache.hadoop.hbase.constraint.Constraint}. The best way to do this is to extend
* {@link org.apache.hadoop.hbase.constraint.BaseConstraint}, which takes care of some of the more
* mundane details of using a {@link org.apache.hadoop.hbase.constraint.Constraint}.
*
 * Let's look at one possible implementation of a constraint - an IntegerConstraint (there are also
 * several simple examples in the tests). The IntegerConstraint checks to make sure that the value
 * is a String-encoded int. It is really simple to implement this kind of constraint;
 * the only method that needs to be implemented is
 * {@link org.apache.hadoop.hbase.constraint.Constraint#check(org.apache.hadoop.hbase.client.Put)}:
 *
 public class IntegerConstraint extends BaseConstraint {

   public void check(Put p) throws ConstraintException {
     Map&lt;byte[], List&lt;KeyValue&gt;&gt; familyMap = p.getFamilyMap();
     for (List&lt;KeyValue&gt; kvs : familyMap.values()) {
       for (KeyValue kv : kvs) {
         // just make sure that we can actually pull out an int;
         // parseInt throws a NumberFormatException if the stored
         // value is not a String-encoded integer
         try {
           Integer.parseInt(new String(kv.getValue()));
         } catch (NumberFormatException e) {
           throw new ConstraintException("Value in Put (" + p
             + ") was not a String-encoded integer", e);
         }
       }
     }
   }
 }
 *
* Note that all exceptions that you expect to be thrown must be caught and then rethrown as a
* {@link org.apache.hadoop.hbase.constraint.ConstraintException}. This way, you can be sure that a
* {@link org.apache.hadoop.hbase.client.Put} fails for an expected reason, rather than for any
* reason. For example, an {@link java.lang.OutOfMemoryError} is probably indicative of an inherent
* problem in the {@link org.apache.hadoop.hbase.constraint.Constraint}, rather than a failed
* {@link org.apache.hadoop.hbase.client.Put}.
*
* If an unexpected exception is thrown (for example, any kind of uncaught
* {@link java.lang.RuntimeException}), constraint-checking will be 'unloaded' from the regionserver
* where that error occurred. This means no further
* {@link org.apache.hadoop.hbase.constraint.Constraint Constraints} will be checked on that server
* until it is reloaded. This is done to ensure the system remains as available as possible.
* Therefore, be careful when writing your own Constraint.
*
* So now that we have a Constraint, we want to add it to a table. It's as easy as:
 *
 TableDescriptorBuilder builder = TableDescriptorBuilder.newBuilder(TABLE_NAME);
 ...
 Constraints.add(builder, IntegerConstraint.class);
 *
 * Once we have added the IntegerConstraint, constraints will be enabled on the table (once it is
 * created) and we will always check to make sure that the value is a String-encoded integer.
*
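 * As a hedged sketch of that creation step (the cluster configuration and connection handling
 * here are assumptions, not prescribed by this package), the table can be created from the
 * configured builder:
 *
 try (Connection connection = ConnectionFactory.createConnection(clusterConf);
      Admin admin = connection.getAdmin()) {
   // the constraints added above travel with the descriptor built here
   admin.createTable(builder.build());
 }
 *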
 * However, suppose we also write our own constraint, MyConstraint.java. First, you
 * need to make sure this class file is on the classpath (in a jar) on the region servers where the
 * constraint will be run (this could require a rolling restart on the region servers - see
 * Caveats above).
*
* Suppose that MyConstraint also uses a Configuration (see
* {@link org.apache.hadoop.hbase.constraint.Constraint#getConf()}). Then adding MyConstraint looks
* like this:
 *
 TableDescriptorBuilder builder = TableDescriptorBuilder.newBuilder(TABLE_NAME);
 Configuration conf = new Configuration(false);
 ...
 (add values to the conf)
 (modify the table descriptor)
 ...
 Constraints.add(builder, new Pair&lt;&gt;(MyConstraint.class, conf));
 *
 * At this point, we have added both the IntegerConstraint and MyConstraint to the table; the
 * IntegerConstraint will be run first, followed by MyConstraint.
*
 * Suppose we realize that the {@link org.apache.hadoop.conf.Configuration} for MyConstraint was
 * actually wrong when it was added to the table. Note, when it is added to the table, it is
* not added by reference, but is instead copied into the
* {@link org.apache.hadoop.hbase.client.TableDescriptor}. Thus, to change the
* {@link org.apache.hadoop.conf.Configuration} we are using for MyConstraint, we need to do this:
 *
 (add/modify the conf)
 ...
 Constraints.setConfiguration(builder, MyConstraint.class, conf);
 *
 * This will overwrite the previous configuration for MyConstraint, but will not change the order
 * of the constraint nor whether it is enabled/disabled.
*
* Note that the same constraint class can be added multiple times to a table without repercussion.
* A use case for this is the same constraint working differently based on its configuration.
*
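 * As a hedged sketch (reusing the hypothetical RangeConstraint from above; the keys are
 * illustrative assumptions), adding the same constraint class twice with different settings could
 * look like:
 *
 Configuration lowRange = new Configuration(false);
 Configuration highRange = new Configuration(false);
 // hypothetical keys, read back by the constraint via getConf()
 lowRange.set("constraint.range.max", "10");
 highRange.set("constraint.range.max", "1000");
 Constraints.add(builder, new Pair&lt;&gt;(RangeConstraint.class, lowRange),
   new Pair&lt;&gt;(RangeConstraint.class, highRange));
 *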
 * Suppose then we want to disable just MyConstraint. It's as easy as:
 *
 * Constraints.disable(builder, MyConstraint.class);
 *
 * This just turns off MyConstraint, but retains the position and the configuration associated with
 * MyConstraint. Now, if we want to re-enable the constraint, it's just another one-liner:
 *
 * Constraints.enable(builder, MyConstraint.class);
 *
* Similarly, constraints on the entire table are disabled via:
 *
 * Constraints.disable(builder);
 *
* Or enabled via:
 *
 * Constraints.enable(builder);
 *
 * Lastly, suppose you want to remove MyConstraint from the table, including the position at which
 * it should be run and its configuration. This is similarly simple:
 *
 * Constraints.remove(builder, MyConstraint.class);
 *
* Also, removing all constraints from a table is similarly simple:
 *
 * Constraints.remove(builder);
 *
 * This will remove all constraints (and associated information) from the table
 * and turn off the constraint processing.
*
 * NOTE
 *
 * It is important to note the use above of
 *
 * Configuration conf = new Configuration(false);
 *
 * If you just use new Configuration(), then the Configuration
 * will be loaded with the default properties. While in the simple case, this is not going to be an
 * issue, it will cause pain down the road. First, these extra properties are going to cause serious
 * bloat in your {@link org.apache.hadoop.hbase.client.TableDescriptor}, meaning you are keeping
 * around a ton of redundant information. Second, it is going to make examining your table in the
 * shell, via describe 'table', a huge pain as you will have to dig through a ton of
 * irrelevant config values to find the ones you set. In short, just do it the right way.
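 *
 * As a quick, hedged illustration of the difference (the printed counts depend on your Hadoop
 * version and are not exact):
 *
 Configuration defaults = new Configuration();   // loads the default resources
 Configuration empty = new Configuration(false); // starts with no properties
 System.out.println(defaults.size() + " vs " + empty.size());
 *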
*/
package org.apache.hadoop.hbase.constraint;