org.apache.hadoop.hbase.constraint.package-info Maven / Gradle / Ivy

/*
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

/**
 * Restrict the domain of a data attribute, often to fulfill business rules/requirements.
 * 

Table of Contents

  • Overview
  • Concurrency and Atomicity
  • Caveats
  • Example Usage

Overview

Constraints are used to enforce business rules in a database. By checking all {@link org.apache.hadoop.hbase.client.Put Puts} on a given table, you can enforce very specific data policies. For instance, you can ensure that a certain column family-column qualifier pair always has a value between 1 and 10. Otherwise, the {@link org.apache.hadoop.hbase.client.Put} is rejected and data integrity is maintained.
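For illustration only (this class is not part of HBase; it anticipates the machinery described under Example Usage below), a minimal sketch of such a range check might look like this:

 // Hypothetical example - rejects any cell whose value is not a
 // String-encoded integer between 1 and 10.
 public class RangeConstraint extends BaseConstraint {
   public void check(Put p) throws ConstraintException {
     for (List<KeyValue> kvs : p.getFamilyMap().values()) {
       for (KeyValue kv : kvs) {
         int value;
         try {
           value = Integer.parseInt(new String(kv.getValue()));
         } catch (NumberFormatException e) {
           throw new ConstraintException("Value in Put (" + p + ") was not an integer", e);
         }
         if (value < 1 || value > 10) {
           throw new ConstraintException("Value " + value + " in Put (" + p
               + ") is outside the allowed range [1, 10]");
         }
       }
     }
   }
 }

A real constraint would typically also filter on the specific column family and qualifier it is meant to guard, rather than checking every cell in the Put.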

Constraints are designed to be configurable, so a constraint can be used across different tables, but implement different behavior depending on the specific configuration given to that constraint.

By adding a constraint to a table (see Example Usage), constraints will automatically be enabled. You then also have the option to disable (just 'turn off') or remove (delete all associated information) all constraints on a table. If you remove all constraints (see {@link org.apache.hadoop.hbase.constraint.Constraints#remove(org.apache.hadoop.hbase.client.TableDescriptorBuilder)}), you must re-add any {@link org.apache.hadoop.hbase.constraint.Constraint} you want on that table. However, if they are just disabled (see {@link org.apache.hadoop.hbase.constraint.Constraints#disable(org.apache.hadoop.hbase.client.TableDescriptorBuilder)}), all you need to do is enable constraints again, and everything will be turned back on as it was configured. Individual constraints can also be enabled, disabled or removed without affecting other constraints.

By default, constraints are disabled on a table. This means you will not see any slowdown on a table if constraints are not enabled.


Concurrency and Atomicity

Currently, no attempt is made at enforcing correctness in a multi-threaded scenario when modifying a constraint, via {@link org.apache.hadoop.hbase.constraint.Constraints}, to the {@link org.apache.hadoop.hbase.client.TableDescriptorBuilder}. This is particularly important when adding one or more constraints to the {@link org.apache.hadoop.hbase.client.TableDescriptorBuilder}, as it first retrieves the next priority from a custom value set in the descriptor, adds each constraint (with increasing priority) to the descriptor, and then the next available priority is re-stored back in the {@link org.apache.hadoop.hbase.client.TableDescriptorBuilder}.

Locking is recommended around each of the Constraints add methods: {@link org.apache.hadoop.hbase.constraint.Constraints#add(org.apache.hadoop.hbase.client.TableDescriptorBuilder, Class...)}, {@link org.apache.hadoop.hbase.constraint.Constraints#add(org.apache.hadoop.hbase.client.TableDescriptorBuilder, org.apache.hadoop.hbase.util.Pair...)}, and {@link org.apache.hadoop.hbase.constraint.Constraints#add(org.apache.hadoop.hbase.client.TableDescriptorBuilder, Class, org.apache.hadoop.conf.Configuration)}. Any changes on a single TableDescriptor should be serialized, either within a single thread or via external mechanisms.
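As a minimal sketch (not taken from the HBase source; the lock object and the builder are assumptions of this example), serializing the add calls might look like:

 // All modifications to this particular TableDescriptorBuilder go through the
 // same lock, so the stored 'next priority' is read and updated atomically.
 private final Object constraintLock = new Object();

 void addMyConstraints(TableDescriptorBuilder builder) throws IOException {
   synchronized (constraintLock) {
     Constraints.add(builder, IntegerConstraint.class);
     Constraints.add(builder, MyConstraint.class);
   }
 }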

Note that having a higher priority means that a constraint will run later; e.g. a constraint with priority 1 will run before a constraint with priority 2.

Since Constraints currently are designed to just implement simple checks (e.g. is the value in the right range), there will be no atomicity conflicts. Even if one of the puts finishes the constraint first, the single row will not be corrupted and the 'fastest' write will win; the underlying region takes care of breaking the tie and ensuring that writes get serialized to the table. So yes, this doesn't ensure that we are going to get specific ordering or even a fully consistent view of the underlying data.

Each constraint should only use local/instance variables, unless doing more advanced usage. Static variables could cause difficulties when checking concurrent writes to the same region, leading to either highly locked situations (decreasing throughput) or higher probability of errors. However, as long as each constraint just uses local variables, each thread interacting with the constraint will execute correctly and efficiently.
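As a purely hypothetical illustration (this class is not part of HBase), the kind of constraint to avoid looks like this:

 // Problematic: static state is shared by every thread checking Puts on the
 // region server, so it either races or forces locking.
 public class CountingConstraint extends BaseConstraint {
   private static long putsChecked = 0;
   public void check(Put p) throws ConstraintException {
     putsChecked++;   // not thread-safe; keep state local to check() instead
   }
 }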

Caveats

In traditional (SQL) databases, Constraints are often used to enforce referential integrity. However, in HBase, this will likely cause significant overhead and dramatically decrease the number of {@link org.apache.hadoop.hbase.client.Put Puts}/second possible on a table. This is because to check the referential integrity when making a {@link org.apache.hadoop.hbase.client.Put}, one must block on a scan for the 'remote' table, checking for the valid reference. For millions of {@link org.apache.hadoop.hbase.client.Put Puts} a second, this will break down very quickly. There are several options around the blocking behavior, including but not limited to:
  • Creating a 'pre-join' table where the keys are already denormalized
  • Designing for 'incorrect' references
  • Using an external enforcement mechanism
There are also several general considerations that must be taken into account when using Constraints:
  1. All changes made via {@link org.apache.hadoop.hbase.constraint.Constraints} will make modifications to the {@link org.apache.hadoop.hbase.client.TableDescriptor} for a given table. As such, the usual disabling and re-enabling of tables should be used for propagating changes to the table (see the sketch after this list). When at all possible, Constraints should be added to the table before the table is created.
  2. Constraints are run in the order that they are added to a table. This has implications for what order constraints should be added to a table.
  3. Whenever new Constraint jars are added to a region server, those region servers need to go through a rolling restart to make sure that they pick up the new jars and can enable the new constraints.
  4. There are certain keys that are reserved for the Configuration namespace:
       • _ENABLED - used server-side to determine if a constraint should be run
       • _PRIORITY - used server-side to determine what order a constraint should be run
     If these items are set, they will be respected in the constraint configuration, but they are taken care of by default when adding constraints to a {@link org.apache.hadoop.hbase.client.TableDescriptorBuilder} via the usual method.
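As a sketch of the first consideration above (the exact client calls depend on your HBase version; an open Connection conn and a TableName tableName are assumptions of this example), propagating an updated descriptor to an existing table looks roughly like:

 try (Admin admin = conn.getAdmin()) {
   admin.disableTable(tableName);
   admin.modifyTable(builder.build());   // descriptor with the constraints added
   admin.enableTable(tableName);
 }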

Under the hood, constraints are implemented as a Coprocessor (see {@link org.apache.hadoop.hbase.constraint.ConstraintProcessor} if you are interested).

Example Usage

First, you must define a {@link org.apache.hadoop.hbase.constraint.Constraint}. The best way to do this is to extend {@link org.apache.hadoop.hbase.constraint.BaseConstraint}, which takes care of some of the more mundane details of using a {@link org.apache.hadoop.hbase.constraint.Constraint}.

Let's look at one possible implementation of a constraint - an IntegerConstraint (there are also several simple examples in the tests). The IntegerConstraint checks to make sure that the value is a String-encoded int. It is really simple to implement this kind of constraint; the only method that needs to be implemented is {@link org.apache.hadoop.hbase.constraint.Constraint#check(org.apache.hadoop.hbase.client.Put)}:

 public class IntegerConstraint extends BaseConstraint {
   public void check(Put p) throws ConstraintException {

     Map<byte[], List<KeyValue>> familyMap = p.getFamilyMap();

     for (List<KeyValue> kvs : familyMap.values()) {
       for (KeyValue kv : kvs) {

         // just make sure that we can actually pull out an int
         // this will automatically throw a NumberFormatException if we try to
         // store something that isn't an Integer.

         try {
           Integer.parseInt(new String(kv.getValue()));
         } catch (NumberFormatException e) {
           throw new ConstraintException("Value in Put (" + p
               + ") was not a String-encoded integer", e);
         }
       }
     }
   }
 }

Note that all exceptions that you expect to be thrown must be caught and then rethrown as a {@link org.apache.hadoop.hbase.constraint.ConstraintException}. This way, you can be sure that a {@link org.apache.hadoop.hbase.client.Put} fails for an expected reason, rather than for any reason. For example, an {@link java.lang.OutOfMemoryError} is probably indicative of an inherent problem in the {@link org.apache.hadoop.hbase.constraint.Constraint}, rather than a failed {@link org.apache.hadoop.hbase.client.Put}.

If an unexpected exception is thrown (for example, any kind of uncaught {@link java.lang.RuntimeException}), constraint-checking will be 'unloaded' from the regionserver where that error occurred. This means no further {@link org.apache.hadoop.hbase.constraint.Constraint Constraints} will be checked on that server until it is reloaded. This is done to ensure the system remains as available as possible. Therefore, be careful when writing your own Constraint.
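One defensive pattern (purely illustrative; the validate helper here is hypothetical) is to catch unexpected runtime failures inside the constraint itself, so that a bug fails the single {@link org.apache.hadoop.hbase.client.Put} instead of unloading constraint processing:

 public void check(Put p) throws ConstraintException {
   try {
     validate(p);   // hypothetical helper containing the real checks
   } catch (RuntimeException e) {
     throw new ConstraintException("Unexpected failure while checking Put (" + p + ")", e);
   }
 }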

So now that we have a Constraint, we want to add it to a table. It's as easy as:

 TableDescriptorBuilder builder = TableDescriptorBuilder.newBuilder(TABLE_NAME);
 ...
 Constraints.add(builder, IntegerConstraint.class);

Once we have added the IntegerConstraint, constraints will be enabled on the table (once it is created) and we will always check to make sure that the value is a String-encoded integer.

However, suppose we also write our own constraint, MyConstraint.java. First, you need to make sure its class files are on the classpath (in a jar) on the region server where that constraint will be run (this could require a rolling restart on the region server - see Caveats above).

Suppose that MyConstraint also uses a Configuration (see {@link org.apache.hadoop.hbase.constraint.Constraint#getConf()}). Then adding MyConstraint looks like this:

 TableDescriptorBuilder builder = TableDescriptorBuilder.newBuilder(TABLE_NAME);
 Configuration conf = new Configuration(false);
 ...
 (add values to the conf)
 (modify the table descriptor)
 ...
 Constraints.add(builder, new Pair(MyConstraint.class, conf));

At this point, we have added both the IntegerConstraint and MyConstraint to the table; the IntegerConstraint will be run first, followed by MyConstraint.

Suppose we realize that the {@link org.apache.hadoop.conf.Configuration} for MyConstraint is actually wrong when it was added to the table. Note, when it is added to the table, it is not added by reference, but is instead copied into the {@link org.apache.hadoop.hbase.client.TableDescriptor}. Thus, to change the {@link org.apache.hadoop.conf.Configuration} we are using for MyConstraint, we need to do this:

 (add/modify the conf)
 ...
 Constraints.setConfiguration(builder, MyConstraint.class, conf);

This will overwrite the previous configuration for MyConstraint, but will not change the order of the constraint nor whether it is enabled or disabled.

Note that the same constraint class can be added multiple times to a table without repercussion. A use case for this is the same constraint working differently based on its configuration.
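For example (a sketch only; the configuration key below is hypothetical and would be whatever MyConstraint reads via getConf()), the same class can be registered twice with different settings:

 Configuration first = new Configuration(false);
 first.set("myconstraint.max.value", "10");      // hypothetical key
 Configuration second = new Configuration(false);
 second.set("myconstraint.max.value", "100");    // hypothetical key

 Constraints.add(builder,
     new Pair(MyConstraint.class, first),
     new Pair(MyConstraint.class, second));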

Suppose then we want to disable just MyConstraint. It's as easy as:

 Constraints.disable(builder, MyConstraint.class);

This just turns off MyConstraint, but retains the position and the configuration associated with MyConstraint. Now, if we want to re-enable the constraint, it's just another one-liner:

 Constraints.enable(builder, MyConstraint.class);

Similarly, constraints on the entire table are disabled via:

 Constraints.disable(builder);

Or enabled via:

 Constraints.enable(builder);

Lastly, suppose you want to remove MyConstraint from the table, including the position at which it should be run and its configuration. This is similarly simple:

 Constraints.remove(builder, MyConstraint.class);

Also, removing all constraints from a table is similarly simple:

 Constraints.remove(builder);
This will remove all constraints (and associated information) from the table and turn off the constraint processing.

NOTE

It is important to note the use above of:

 Configuration conf = new Configuration(false);
If you just use new Configuration(), then the Configuration will be loaded with the default properties. While this is not going to be an issue in the simple case, it will cause pain down the road. First, these extra properties are going to cause serious bloat in your {@link org.apache.hadoop.hbase.client.TableDescriptor}, meaning you are keeping around a ton of redundant information. Second, it is going to make examining your table in the shell, via describe 'table', a huge pain as you will have to dig through a ton of irrelevant config values to find the ones you set. In short, just do it the right way.
 */
package org.apache.hadoop.hbase.constraint;



