/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/**
 * Restrict the domain of a data attribute, oftentimes to fulfill business rules/requirements.
 *
 * Table of Contents
 * - Overview
 * - Concurrency and Atomicity
 * - Caveats
 * - Example usage
 *
 * Overview
 *
 * Constraints are used to enforce business rules in a
* database. By checking all {@link org.apache.hadoop.hbase.client.Put Puts} on a given table, you
* can enforce very specific data policies. For instance, you can ensure that a certain column
* family-column qualifier pair always has a value between 1 and 10. Otherwise, the
* {@link org.apache.hadoop.hbase.client.Put} is rejected and the data integrity is maintained.
*
* Constraints are designed to be configurable, so a constraint can be used across different tables,
* but implement different behavior depending on the specific configuration given to that
* constraint.
*
 * By adding a constraint to a table (see Example Usage), constraints will
 * automatically be enabled. You then also have the option to disable (just 'turn off') or remove
 * (delete all associated information) all constraints on a table. If you remove all constraints
 * (see
 * {@link org.apache.hadoop.hbase.constraint.Constraints#remove(org.apache.hadoop.hbase.client.TableDescriptorBuilder)}),
 * you must re-add any {@link org.apache.hadoop.hbase.constraint.Constraint} you want on that table.
 * However, if they are just disabled (see
 * {@link org.apache.hadoop.hbase.constraint.Constraints#disable(org.apache.hadoop.hbase.client.TableDescriptorBuilder)}),
 * all you need to do is enable constraints again, and everything will be turned back on as it was
 * configured. Individual constraints can also be enabled, disabled or removed without
 * affecting other constraints.
*
 * By default, constraints are disabled on a table. This means you will not see any slowdown
 * on a table if constraints are not enabled.
*
 * Concurrency and Atomicity
 *
 * Currently, no attempt is made at enforcing
 * correctness in a multi-threaded scenario when modifying a constraint, via
 * {@link org.apache.hadoop.hbase.constraint.Constraints}, on the
 * {@link org.apache.hadoop.hbase.client.TableDescriptorBuilder}. This is particularly important
 * when adding constraints to the {@link org.apache.hadoop.hbase.client.TableDescriptorBuilder},
 * as it first retrieves the next priority from a custom value set in the descriptor, adds each
 * constraint (with increasing priority) to the descriptor, and then stores the next available
 * priority back in the {@link org.apache.hadoop.hbase.client.TableDescriptorBuilder}.
*
 * Locking is recommended around each of the Constraints add methods:
* {@link org.apache.hadoop.hbase.constraint.Constraints#add(org.apache.hadoop.hbase.client.TableDescriptorBuilder, Class...)},
* {@link org.apache.hadoop.hbase.constraint.Constraints#add(org.apache.hadoop.hbase.client.TableDescriptorBuilder, org.apache.hadoop.hbase.util.Pair...)},
* and
* {@link org.apache.hadoop.hbase.constraint.Constraints#add(org.apache.hadoop.hbase.client.TableDescriptorBuilder, Class, org.apache.hadoop.conf.Configuration)}.
* Any changes on a single TableDescriptor should be serialized, either within a single
* thread or via external mechanisms.
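 *
 * As a minimal sketch of such external serialization (the lock object and helper method are
 * assumptions for illustration, not part of this package), all descriptor mutations can be
 * funneled through one lock:
 *
 private static final Object DESCRIPTOR_LOCK = new Object();

 void addConstraints(TableDescriptorBuilder builder) throws IOException {
   synchronized (DESCRIPTOR_LOCK) {
     // every mutation of this builder happens under one lock, so the
     // priority counter stored in the descriptor is read and updated atomically
     Constraints.add(builder, IntegerConstraint.class, MyConstraint.class);
   }
 }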
*
* Note that having a higher priority means that a constraint will run later; e.g. a constraint with
* priority 1 will run before a constraint with priority 2.
*
 * Since Constraints currently are designed to just implement simple checks (e.g. is the value in
 * the right range), there will be no atomicity conflicts. Even if one of the puts finishes its
 * constraint check first, the single row will not be corrupted and the 'fastest' write will win;
 * the underlying region takes care of breaking the tie and ensuring that writes get serialized to
 * the table. Note that this does not ensure any specific ordering of writes or even a fully
 * consistent view of the underlying data.
*
* Each constraint should only use local/instance variables, unless doing more advanced usage.
* Static variables could cause difficulties when checking concurrent writes to the same region,
 * leading to either highly locked situations (decreasing throughput) or higher probability of
* errors. However, as long as each constraint just uses local variables, each thread interacting
* with the constraint will execute correctly and efficiently.
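 *
 * As a hedged illustration of that guidance (this class and its configuration keys are
 * assumptions for the example, not part of this package), a configurable range check can keep all
 * of its state in instance fields populated from its Configuration:
 *
 public class RangeConstraint extends BaseConstraint {
   // hypothetical configuration keys, for illustration only
   public static final String MIN_KEY = "constraint.range.min";
   public static final String MAX_KEY = "constraint.range.max";

   private int min;
   private int max;

   @Override
   public void setConf(Configuration conf) {
     super.setConf(conf);
     if (conf != null) {
       // per-instance state only; no statics shared between threads
       this.min = conf.getInt(MIN_KEY, 1);
       this.max = conf.getInt(MAX_KEY, 10);
     }
   }

   public void check(Put p) throws ConstraintException {
     for (List<KeyValue> kvs : p.getFamilyMap().values()) {
       for (KeyValue kv : kvs) {
         int value;
         try {
           value = Integer.parseInt(new String(kv.getValue()));
         } catch (NumberFormatException e) {
           throw new ConstraintException("Value in Put (" + p + ") is not an integer", e);
         }
         if (value < min || value > max) {
           throw new ConstraintException(
             "Value " + value + " in Put (" + p + ") is outside [" + min + ", " + max + "]");
         }
       }
     }
   }
 }
 *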
 * Caveats
 *
 * In traditional (SQL) databases, Constraints are often used
* to enforce referential
* integrity. However, in HBase, this will likely cause significant overhead and dramatically
* decrease the number of {@link org.apache.hadoop.hbase.client.Put Puts}/second possible on a
* table. This is because to check the referential integrity when making a
* {@link org.apache.hadoop.hbase.client.Put}, one must block on a scan for the 'remote' table,
* checking for the valid reference. For millions of {@link org.apache.hadoop.hbase.client.Put Puts}
 * a second, this will break down very quickly. There are several options around the blocking
* behavior including, but not limited to:
*
* - Create a 'pre-join' table where the keys are already denormalized
* - Designing for 'incorrect' references
* - Using an external enforcement mechanism
*
 * There are also several general considerations that must be taken into account when using
 * Constraints:
*
 * - All changes made via {@link org.apache.hadoop.hbase.constraint.Constraints} will make
 * modifications to the {@link org.apache.hadoop.hbase.client.TableDescriptor} for a given table. As
 * such, the usual re-enabling of tables should be used for propagating changes to the table. When
 * at all possible, Constraints should be added to the table before the table is created.
* - Constraints are run in the order that they are added to a table. This has implications for
* what order constraints should be added to a table.
* - Whenever new Constraint jars are added to a region server, those region servers need to go
* through a rolling restart to make sure that they pick up the new jars and can enable the new
* constraints.
* - There are certain keys that are reserved for the Configuration namespace:
*
* - _ENABLED - used server-side to determine if a constraint should be run
* - _PRIORITY - used server-side to determine what order a constraint should be run
*
 * If these items are set, they will be respected in the constraint configuration, but they are
 * taken care of by default when adding constraints to a
 * {@link org.apache.hadoop.hbase.client.TableDescriptorBuilder} via the usual method.
*
*
* Under the hood, constraints are implemented as a Coprocessor (see
* {@link org.apache.hadoop.hbase.constraint.ConstraintProcessor} if you are interested).
 * Example usage
 *
 * First, you must define a
* {@link org.apache.hadoop.hbase.constraint.Constraint}. The best way to do this is to extend
* {@link org.apache.hadoop.hbase.constraint.BaseConstraint}, which takes care of some of the more
* mundane details of using a {@link org.apache.hadoop.hbase.constraint.Constraint}.
*
 * Let's look at one possible implementation of a constraint - an IntegerConstraint (there are also
 * several simple examples in the tests). The IntegerConstraint checks to make sure that the value
 * is a String-encoded int. It is really simple to implement this kind of constraint;
 * the only method that needs to be implemented is
 * {@link org.apache.hadoop.hbase.constraint.Constraint#check(org.apache.hadoop.hbase.client.Put)}:
 *
 public class IntegerConstraint extends BaseConstraint {

   public void check(Put p) throws ConstraintException {
     Map&lt;byte[], List&lt;KeyValue&gt;&gt; familyMap = p.getFamilyMap();
     for (List&lt;KeyValue&gt; kvs : familyMap.values()) {
       for (KeyValue kv : kvs) {
         // just make sure that we can actually pull out an int;
         // parseInt throws a NumberFormatException if the stored
         // value is not a String-encoded integer
         try {
           Integer.parseInt(new String(kv.getValue()));
         } catch (NumberFormatException e) {
           throw new ConstraintException("Value in Put (" + p
             + ") was not a String-encoded integer", e);
         }
       }
     }
   }
 }
 *
* Note that all exceptions that you expect to be thrown must be caught and then rethrown as a
* {@link org.apache.hadoop.hbase.constraint.ConstraintException}. This way, you can be sure that a
* {@link org.apache.hadoop.hbase.client.Put} fails for an expected reason, rather than for any
* reason. For example, an {@link java.lang.OutOfMemoryError} is probably indicative of an inherent
* problem in the {@link org.apache.hadoop.hbase.constraint.Constraint}, rather than a failed
* {@link org.apache.hadoop.hbase.client.Put}.
*
* If an unexpected exception is thrown (for example, any kind of uncaught
* {@link java.lang.RuntimeException}), constraint-checking will be 'unloaded' from the regionserver
* where that error occurred. This means no further
* {@link org.apache.hadoop.hbase.constraint.Constraint Constraints} will be checked on that server
* until it is reloaded. This is done to ensure the system remains as available as possible.
* Therefore, be careful when writing your own Constraint.
*
* So now that we have a Constraint, we want to add it to a table. It's as easy as:
 *
 TableDescriptorBuilder builder = TableDescriptorBuilder.newBuilder(TABLE_NAME);
 ...
 Constraints.add(builder, IntegerConstraint.class);
 *
 * Once we have added the IntegerConstraint, constraints will be enabled on the table (once it is
 * created) and we will always check to make sure that the value is a String-encoded integer.
*
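 * As a hedged sketch of that creation step (the cluster configuration and connection handling
 * here are assumptions, not prescribed by this package), the table can be created from the
 * configured builder:
 *
 try (Connection connection = ConnectionFactory.createConnection(clusterConf);
      Admin admin = connection.getAdmin()) {
   // the constraints added above travel with the descriptor built here
   admin.createTable(builder.build());
 }
 *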
 * However, suppose we also write our own constraint, MyConstraint.java. First, you
 * need to make sure this class file is on the classpath (in a jar) on the region servers where the
 * constraint will be run (this could require a rolling restart on the region servers - see
 * Caveats above).
*
* Suppose that MyConstraint also uses a Configuration (see
* {@link org.apache.hadoop.hbase.constraint.Constraint#getConf()}). Then adding MyConstraint looks
* like this:
 *
 TableDescriptorBuilder builder = TableDescriptorBuilder.newBuilder(TABLE_NAME);
 Configuration conf = new Configuration(false);
 ...
 (add values to the conf)
 (modify the table descriptor)
 ...
 Constraints.add(builder, new Pair&lt;&gt;(MyConstraint.class, conf));
 *
 * At this point, we have added both the IntegerConstraint and MyConstraint to the table; the
 * IntegerConstraint will be run first, followed by MyConstraint.
*
 * Suppose we realize that the {@link org.apache.hadoop.conf.Configuration} for MyConstraint was
 * actually wrong when it was added to the table. Note, when it is added to the table, it is
* not added by reference, but is instead copied into the
* {@link org.apache.hadoop.hbase.client.TableDescriptor}. Thus, to change the
* {@link org.apache.hadoop.conf.Configuration} we are using for MyConstraint, we need to do this:
 *
 (add/modify the conf)
 ...
 Constraints.setConfiguration(builder, MyConstraint.class, conf);
 *
 * This will overwrite the previous configuration for MyConstraint, but will not change the order
 * of the constraint nor whether it is enabled/disabled.
*
* Note that the same constraint class can be added multiple times to a table without repercussion.
* A use case for this is the same constraint working differently based on its configuration.
*
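 * As a hedged sketch (reusing the hypothetical RangeConstraint from above; the keys are
 * illustrative assumptions), adding the same constraint class twice with different settings could
 * look like:
 *
 Configuration lowRange = new Configuration(false);
 Configuration highRange = new Configuration(false);
 // hypothetical keys, read back by the constraint via getConf()
 lowRange.set("constraint.range.max", "10");
 highRange.set("constraint.range.max", "1000");
 Constraints.add(builder, new Pair&lt;&gt;(RangeConstraint.class, lowRange),
   new Pair&lt;&gt;(RangeConstraint.class, highRange));
 *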
 * Suppose then we want to disable just MyConstraint. It's as easy as:
 *
 * Constraints.disable(builder, MyConstraint.class);
 *
 * This just turns off MyConstraint, but retains the position and the configuration associated with
 * MyConstraint. Now, if we want to re-enable the constraint, it's just another one-liner:
 *
 * Constraints.enable(builder, MyConstraint.class);
 *
* Similarly, constraints on the entire table are disabled via:
 *
 * Constraints.disable(builder);
 *
* Or enabled via:
 *
 * Constraints.enable(builder);
 *
 * Lastly, suppose you want to remove MyConstraint from the table, including the position at which
 * it should be run and its configuration. This is similarly simple:
 *
 * Constraints.remove(builder, MyConstraint.class);
 *
* Also, removing all constraints from a table is similarly simple:
 *
 * Constraints.remove(builder);
 *
 * This will remove all constraints (and associated information) from the table
 * and turn off the constraint processing.
*
 * NOTE
 *
 * It is important to note the use above of
 *
 * Configuration conf = new Configuration(false);
 *
 * If you just use new Configuration(), then the Configuration
 * will be loaded with the default properties. While in the simple case, this is not going to be an
 * issue, it will cause pain down the road. First, these extra properties are going to cause serious
 * bloat in your {@link org.apache.hadoop.hbase.client.TableDescriptor}, meaning you are keeping
 * around a ton of redundant information. Second, it is going to make examining your table in the
 * shell, via describe 'table', a huge pain as you will have to dig through a ton of
 * irrelevant config values to find the ones you set. In short, just do it the right way.
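 *
 * As a quick, hedged illustration of the difference (the printed counts depend on your Hadoop
 * version and are not exact):
 *
 Configuration defaults = new Configuration();   // loads the default resources
 Configuration empty = new Configuration(false); // starts with no properties
 System.out.println(defaults.size() + " vs " + empty.size());
 *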
*/
package org.apache.hadoop.hbase.constraint;