HBase modified with patches needed for hbasene






    HBase


HBase is a scalable, distributed database built on Hadoop Core.

Table of Contents

  Requirements
    Windows
  Getting Started
    Standalone
    Distributed Operation: Pseudo- and Fully-distributed modes
      Pseudo-distributed
      Fully-distributed
  Running and Confirming Your Installation
  Upgrading
  Example API Usage
  Related Documentation


Requirements

  Java 1.6.x, preferably from Sun. Use the latest version available except u18 (u19 is fine).
  This version of HBase will only run on Hadoop 0.20.x.
  
    ssh must be installed and sshd must be running to use Hadoop's scripts to manage remote Hadoop daemons.
   You must be able to ssh to all nodes, including your local node, using passwordless login
   (Google "ssh passwordless login").
  
  
    HBase depends on ZooKeeper as of release 0.20.0.
    In ZooKeeper, HBase keeps the location of its root table, the identity of the current master,
    and the list of regions currently participating in the cluster.
    Clients and servers must now know the location of their ZooKeeper quorum before
    they can do anything else (usually they pick up this information from configuration
    supplied on their CLASSPATH). By default, HBase will manage a single ZooKeeper instance for you.
    In standalone and pseudo-distributed modes this is usually enough, but for
    fully-distributed mode you should configure a ZooKeeper quorum (more info below).
  
  Hosts must be able to resolve the fully-qualified domain name of the master.
  
    The clocks on cluster members should be roughly in sync. Some skew is tolerable, but
    wild skew can produce odd behavior. Run NTP
    on your cluster, or an equivalent.
  
  
    This is the current list of patches we recommend you apply to your running Hadoop cluster:
    
      
        HDFS-630: "In DFSOutputStream.nextBlockOutputStream(), the client can exclude specific datanodes when locating the next block".
        Dead DataNodes take ten minutes to time out at the NameNode.
        In the meantime the NameNode can still send DFSClients to the dead DataNode as the host for
        a replicated block, and a DFSClient can get stuck trying to fetch a block from the
        dead node. This patch allows DFSClients to pass the NameNode a list of known dead DataNodes.
      
    
  
  
    HBase is a database and keeps a lot of files open at the same time. The default ulimit -n of 1024 on *nix systems is insufficient.
    Any significant amount of loading will lead you to
    FAQ: Why do I see "java.io.IOException...(Too many open files)" in my logs?.
    You will also notice errors like:
    2010-04-06 03:04:37,542 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.EOFException
2010-04-06 03:04:37,542 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-6935524980745310745_1391901
    
    Do yourself a favor and raise this limit to more than 10k, following the FAQ.
    Also, HDFS has an upper bound on the number of files that it can serve at the same time, called xcievers (yes, this is misspelled). Again, before doing any loading,
    make sure you have configured Hadoop's conf/hdfs-site.xml with this:
    <property>
  <name>dfs.datanode.max.xcievers</name>
  <value>2047</value>
</property>
    
    See the background of this issue here: Problem: "xceiverCount 258 exceeds the limit of concurrent xcievers 256".
    Failure to follow these instructions will result in data loss.
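
As a quick sanity check of the open-file limit discussed above, the following Python sketch reads the limit that the current process inherited. It uses only the standard `resource` module (Unix-only). The threshold of 10000 is an assumption standing in for "more than 10k", and the function name is made up for this example. Run it as the user that starts the HBase and Hadoop daemons, since limits are per user and per session.

```python
import resource

# Assumption: "more than 10k" is read here as strictly greater than 10000.
RECOMMENDED_MIN_NOFILE = 10000


def nofile_is_sufficient(soft_limit, minimum=RECOMMENDED_MIN_NOFILE):
    """Return True when the soft open-file limit is above the minimum."""
    if soft_limit == resource.RLIM_INFINITY:
        return True  # unlimited
    return soft_limit > minimum


# RLIMIT_NOFILE is the limit reported by `ulimit -n`.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
if nofile_is_sufficient(soft):
    print("open-file limit looks sufficient:", soft)
else:
    print("open-file limit is too low for HBase:", soft)
```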
  



Windows
If you are running HBase on Windows, you must install Cygwin
to have a *nix-like environment for the shell scripts. The full details
are explained in the Windows Installation guide.


Getting Started
What follows presumes you have obtained a copy of HBase (see Releases)
and are installing for the first time. If you are upgrading an existing
HBase instance, see Upgrading.

Three modes are described: standalone, pseudo-distributed (where all servers are run on
a single host), and fully-distributed. If you are new to HBase, start by following the standalone instructions.

Begin by reading Requirements.

Whatever your mode, define ${HBASE_HOME} to be the location of the root of your HBase installation, e.g.
/usr/local/hbase. Edit ${HBASE_HOME}/conf/hbase-env.sh. In this file you can
set the heap size for HBase, among other things. At a minimum, set JAVA_HOME to point at the root of
your Java installation.

Standalone mode
If you are running a standalone operation, there should be nothing further to configure; proceed to
Running and Confirming Your Installation. If you are running a distributed
operation, continue reading.

Distributed Operation: Pseudo- and Fully-distributed modes
Distributed modes require an instance of the Hadoop Distributed File System (DFS).
See the Hadoop 
requirements and instructions for how to set up a DFS.

Pseudo-distributed mode
A pseudo-distributed mode is simply a distributed mode run on a single host.
Once you have confirmed your DFS setup, configuring HBase for use on one host requires modification of
${HBASE_HOME}/conf/hbase-site.xml, which needs to be pointed at the running Hadoop DFS instance.
Use hbase-site.xml to override the properties defined in
${HBASE_HOME}/conf/hbase-default.xml (hbase-default.xml itself
should never be modified). At a minimum the hbase.rootdir property should be redefined
in hbase-site.xml to point HBase at the Hadoop filesystem to use. For example, adding the property
below to your hbase-site.xml says that HBase should use the /hbase directory in the
HDFS whose namenode is at port 9000 on your local machine:

<configuration>
  ...
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:9000/hbase</value>
    <description>The directory shared by region servers.
    </description>
  </property>
  ...
</configuration>



Note: Let HBase create the directory. If you don't, you'll get a warning saying HBase
needs a migration run because the directory is missing files expected by HBase (it'll
create them if you let it).
Also Note: Above we bind to localhost.  This means that a remote client cannot
connect.  Amend accordingly, if you want to connect from a remote location.
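
For readers who generate their configuration from a script, here is a minimal Python sketch that renders a dictionary of properties in the same <configuration>/<property> layout as the hbase-site.xml fragment above. The helper name `render_site_xml` is made up for this example; it is not part of HBase or Hadoop, and it leaves out the optional <description> element.

```python
import xml.etree.ElementTree as ET


def render_site_xml(properties):
    """Render {name: value} pairs as a Hadoop-style <configuration> document."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# The value below is the example from the text: the /hbase directory in the
# HDFS whose namenode listens on port 9000 of the local machine.
site_xml = render_site_xml({"hbase.rootdir": "hdfs://localhost:9000/hbase"})
print(site_xml)
```

Writing the result to ${HBASE_HOME}/conf/hbase-site.xml is left out on purpose, so the sketch never overwrites an existing file.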

Fully-Distributed Operation
For running a fully-distributed operation on more than one host, the following
configurations must be made in addition to those described in the
pseudo-distributed operation section above.

In hbase-site.xml, set hbase.cluster.distributed to true.

<configuration>
  ...
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
    <description>The mode the cluster will be in. Possible values are
      false: standalone and pseudo-distributed setups with managed Zookeeper
      true: fully-distributed with unmanaged Zookeeper Quorum (see hbase-env.sh)
    </description>
  </property>
  ...
</configuration>



In fully-distributed mode, you probably want to change your hbase.rootdir
from localhost to the name of the node running the HDFS NameNode. In addition
to hbase-site.xml changes, a fully-distributed mode requires that you
modify ${HBASE_HOME}/conf/regionservers.
The regionservers file lists all hosts running HRegionServers, one host per line
(this file in HBase is like the Hadoop slaves file at ${HADOOP_HOME}/conf/slaves).

A distributed HBase depends on a running ZooKeeper cluster. All participating nodes and clients
need to be able to get to the running ZooKeeper cluster.
HBase by default manages a ZooKeeper cluster for you, or you can manage it on your own and point HBase to it.
To toggle HBase management of ZooKeeper, use the HBASE_MANAGES_ZK variable in ${HBASE_HOME}/conf/hbase-env.sh.
This variable, which defaults to true, tells HBase whether to
start/stop the ZooKeeper quorum servers alongside the rest of the servers.

When HBase manages the ZooKeeper cluster, you can specify ZooKeeper configuration
using its canonical zoo.cfg file (see below), or
just specify ZooKeeper options directly in ${HBASE_HOME}/conf/hbase-site.xml
(if you are new to ZooKeeper, take the path of specifying your configuration in HBase's hbase-site.xml).
Every ZooKeeper configuration option has a corresponding property in the HBase hbase-site.xml
XML configuration file named hbase.zookeeper.property.OPTION.
For example, the clientPort setting in ZooKeeper can be changed by
setting the hbase.zookeeper.property.clientPort property.
For the full list of available properties, see ZooKeeper's zoo.cfg.
For the default values used by HBase, see ${HBASE_HOME}/conf/hbase-default.xml.
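
The naming rule above (a ZooKeeper option OPTION becomes hbase.zookeeper.property.OPTION) can be illustrated with a short Python sketch that translates key=value lines in zoo.cfg style into HBase property names. The function name and the sample input are illustrative assumptions; the sketch only performs the renaming and does not check that an option is valid for your ZooKeeper version.

```python
PREFIX = "hbase.zookeeper.property."


def zoo_cfg_to_hbase_properties(zoo_cfg_text):
    """Map key=value lines from zoo.cfg-style text to HBase property names."""
    properties = {}
    for raw_line in zoo_cfg_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not key=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        properties[PREFIX + key.strip()] = value.strip()
    return properties


sample = """
# sample zoo.cfg content (illustrative values)
clientPort=2222
tickTime = 2000
"""
for name, value in zoo_cfg_to_hbase_properties(sample).items():
    print(name, "=", value)
```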

At minimum, you should set the list of servers that you want ZooKeeper to run
on using the hbase.zookeeper.quorum property.
This property defaults to localhost which is not suitable for a
fully distributed HBase (it binds to the local machine only and remote clients
will not be able to connect).
It is recommended to run a ZooKeeper quorum of 3, 5 or 7 machines, and give each
ZooKeeper server around 1GB of RAM, and if possible, its own dedicated disk.
For very heavily loaded clusters, run ZooKeeper servers on separate machines from the
Region Servers (DataNodes and TaskTrackers).
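
The odd quorum sizes recommended above can be motivated with a little arithmetic. ZooKeeper needs a strict majority of its servers to be up, so an even-sized ensemble tolerates no more failures than the odd size just below it. The Python sketch below works this out; the function names are made up for this example.

```python
def majority(ensemble_size):
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """Number of servers that can fail while a majority is still up."""
    return ensemble_size - majority(ensemble_size)


for size in (3, 4, 5, 6, 7):
    print(size, "servers: majority", majority(size),
          "tolerates", tolerated_failures(size), "failure(s)")
```

Going from 3 to 4 servers, or from 5 to 6, adds a machine without adding fault tolerance, which is why 3, 5, or 7 are the usual choices.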

To point HBase at an existing ZooKeeper cluster, add 
a suitably configured zoo.cfg to the CLASSPATH.
HBase will see this file and use it to figure out where ZooKeeper is.
Additionally set HBASE_MANAGES_ZK in ${HBASE_HOME}/conf/hbase-env.sh
to false so that HBase doesn't mess with your ZooKeeper setup:

   ...
  # Tell HBase whether it should manage its own instance of Zookeeper or not.
  export HBASE_MANAGES_ZK=false



As an example, to have HBase manage a ZooKeeper quorum on nodes
rs{1,2,3,4,5}.example.com, bound to port 2222 (the default is 2181), use:

  ${HBASE_HOME}/conf/hbase-env.sh:

       ...
      # Tell HBase whether it should manage its own instance of Zookeeper or not.
      export HBASE_MANAGES_ZK=true

  ${HBASE_HOME}/conf/hbase-site.xml:

  <configuration>
    ...
    <property>
      <name>hbase.zookeeper.property.clientPort</name>
      <value>2222</value>
      <description>Property from ZooKeeper's config zoo.cfg.
      The port at which the clients will connect.
      </description>
    </property>
    ...
    <property>
      <name>hbase.zookeeper.quorum</name>
      <value>rs1.example.com,rs2.example.com,rs3.example.com,rs4.example.com,rs5.example.com</value>
      <description>Comma separated list of servers in the ZooKeeper Quorum.
      For example, "host1.mydomain.com,host2.mydomain.com,host3.mydomain.com".
      By default this is set to localhost for local and pseudo-distributed modes
      of operation. For a fully-distributed setup, this should be set to a full
      list of ZooKeeper quorum servers. If HBASE_MANAGES_ZK is set in hbase-env.sh
      this is the list of servers which we will start/stop ZooKeeper on.
      </description>
    </property>
    ...
  </configuration>
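
With more than a handful of quorum nodes, the comma-separated host list is easier to generate than to type. The Python sketch below builds the hbase.zookeeper.quorum value for the rs{1,2,3,4,5}.example.com example above. It also builds the host:port list that ZooKeeper clients use as a connect string. All function names are made up for this example.

```python
def quorum_hosts(prefix, count, domain):
    """Expand e.g. ("rs", 5, "example.com") into rs1.example.com .. rs5.example.com."""
    return [f"{prefix}{i}.{domain}" for i in range(1, count + 1)]


def quorum_property_value(hosts):
    """Value for hbase.zookeeper.quorum: host names only, comma separated."""
    return ",".join(hosts)


def client_connect_string(hosts, client_port):
    """Comma-separated host:port pairs, as used by ZooKeeper clients."""
    return ",".join(f"{host}:{client_port}" for host in hosts)


hosts = quorum_hosts("rs", 5, "example.com")
print(quorum_property_value(hosts))
print(client_connect_string(hosts, 2222))
```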



When HBase manages ZooKeeper, it will start/stop the ZooKeeper servers as a part
of the regular start/stop scripts. If you would like to run it yourself, you can
do:

${HBASE_HOME}/bin/hbase-daemons.sh {start,stop} zookeeper


Note that you can use HBase in this manner to spin up a ZooKeeper cluster,
unrelated to HBase. Just make sure to set HBASE_MANAGES_ZK to
false if you want it to stay up so that when HBase shuts down it
doesn't take ZooKeeper with it.

For more information about setting up a ZooKeeper cluster on your own, see
the ZooKeeper Getting Started Guide.
HBase currently uses ZooKeeper version 3.2.0, so any cluster setup with a
3.x.x version of ZooKeeper should work.

Of note, if you have made HDFS client configuration changes on your Hadoop cluster, HBase will not
see this configuration unless you do one of the following:

  Add a pointer to your HADOOP_CONF_DIR to CLASSPATH in hbase-env.sh.
  Add a copy of hdfs-site.xml (or hadoop-site.xml) to ${HBASE_HOME}/conf, or
  if only a small set of HDFS client configurations, add them to hbase-site.xml.


An example of such an HDFS client configuration is dfs.replication. If, for example,
you want to run with a replication factor of 5, HBase will create files with the default of 3 unless
you do one of the above to make the configuration available to HBase.
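
The dfs.replication example can be sketched in Python as a lookup across layered configuration files. This is a deliberately simplified model, not Hadoop's actual configuration loader: it assumes later files override earlier ones and that a built-in default applies when no visible file sets the property. The function names and the inline XML are illustrative assumptions.

```python
import xml.etree.ElementTree as ET


def read_properties(xml_text):
    """Parse a Hadoop-style <configuration> document into a {name: value} dict."""
    root = ET.fromstring(xml_text)
    return {
        (prop.findtext("name") or "").strip(): (prop.findtext("value") or "").strip()
        for prop in root.findall("property")
    }


def effective_value(name, config_texts, default=None):
    """Return the value from the last document that sets `name`, else the default."""
    value = default
    for text in config_texts:
        properties = read_properties(text)
        if name in properties:
            value = properties[name]
    return value


hbase_site = ("<configuration><property><name>hbase.cluster.distributed</name>"
              "<value>true</value></property></configuration>")
hdfs_site = ("<configuration><property><name>dfs.replication</name>"
             "<value>5</value></property></configuration>")

# HBase sees only its own file: the default of 3 applies.
print(effective_value("dfs.replication", [hbase_site], default="3"))
# hdfs-site.xml has been made visible to HBase: the cluster setting applies.
print(effective_value("dfs.replication", [hbase_site, hdfs_site], default="3"))
```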


Running and Confirming Your Installation
If you are running in standalone, non-distributed mode, HBase by default uses the local filesystem.

If you are running a distributed cluster you will need to start the Hadoop DFS daemons and
ZooKeeper Quorum before starting HBase and stop the daemons after HBase has shut down.

Start the Hadoop DFS daemons by running ${HADOOP_HOME}/bin/start-dfs.sh (stop them later with ${HADOOP_HOME}/bin/stop-dfs.sh).
You can ensure it started properly by testing the put and get of files into the Hadoop filesystem.
HBase does not normally use the mapreduce daemons.  These do not need to be started.

Start up your ZooKeeper cluster.

Start HBase with the following command:

${HBASE_HOME}/bin/start-hbase.sh


Once HBase has started, enter ${HBASE_HOME}/bin/hbase shell to obtain a
shell against HBase from which you can execute commands.
Type 'help' at the shell's prompt to get a list of commands.
Test your running install by creating tables, inserting content, viewing content, and then dropping your tables.
For example:

hbase> # Type "help" to see shell help screen
hbase> help
hbase> # To create a table named "mylittletable" with a column family of "mylittlecolumnfamily", type
hbase> create "mylittletable", "mylittlecolumnfamily"
hbase> # To see the schema for your just-created "mylittletable" table and its single "mylittlecolumnfamily", type
hbase> describe "mylittletable"
hbase> # To add a row whose id is "myrow", to the column "mylittlecolumnfamily:x" with a value of 'v', do
hbase> put "mylittletable", "myrow", "mylittlecolumnfamily:x", "v"
hbase> # To get the cell just added, do
hbase> get "mylittletable", "myrow"
hbase> # To scan your new table, do
hbase> scan "mylittletable"



To stop HBase, exit the HBase shell and enter:

${HBASE_HOME}/bin/stop-hbase.sh


If you are running a distributed operation, be sure to wait until HBase has shut down completely
before stopping the Hadoop daemons.

The default location for logs is ${HBASE_HOME}/logs.

HBase also puts up a UI listing vital attributes. By default it is deployed on the master host
at port 60010 (HBase RegionServers listen on port 60020 by default and put up an informational
HTTP server at port 60030).
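
To confirm that the daemons are listening on the default ports listed above, a small Python probe like the one below can help. It only tests whether a TCP connection can be opened; it does not verify that HBase is the process answering. The function name, the "localhost" host, and the timeout are assumptions to adapt to your cluster.

```python
import socket

# Default ports as given in the text above.
DEFAULT_PORTS = {
    "master web UI": 60010,
    "regionserver": 60020,
    "regionserver web UI": 60030,
}


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


for label, port in DEFAULT_PORTS.items():
    state = "open" if port_is_open("localhost", port) else "not reachable"
    print(f"{label} (port {port}): {state}")
```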

Upgrading
After installing a new HBase on top of data written by a previous HBase version, and before
starting your cluster, run the ${HBASE_HOME}/bin/hbase migrate migration script.
It makes any adjustments to the filesystem data under hbase.rootdir that the new
HBase version needs. It does not change your install unless you explicitly ask it to.

Example API Usage
For sample Java code, see org.apache.hadoop.hbase.client documentation.

If your client is NOT Java, consider the Thrift or REST libraries.

Related Documentation

  HBase Home Page
  HBase Wiki
  Hadoop Home Page
  Setting up Multiple HBase Masters
  Rolling Upgrades
  Transactional HBase
  Table Indexed HBase
  Stargate -- a RESTful Web service front end for HBase.