Overview of MarkLogic Connector for Hadoop
This bundle provides an API for a MarkLogic Server content
connector for Apache Hadoop MapReduce. For detailed information,
see the MarkLogic Connector for Hadoop Developer's Guide.
Introduction
The MarkLogic Connector for Hadoop API allows you to
use MarkLogic Server as a Hadoop MapReduce input source,
an output destination, or both.
The following classes are provided for defining MarkLogic-specific
key and value types for your MapReduce key-value pairs:
- {@link com.marklogic.mapreduce.NodePath} for keys
- {@link com.marklogic.mapreduce.DocumentURI} for keys
- {@link com.marklogic.mapreduce.MarkLogicNode} for values
You may also use Apache Hadoop MapReduce types, such as Text, in
certain circumstances. See {@link com.marklogic.mapreduce.ValueInputFormat}
and {@link com.marklogic.mapreduce.KeyValueInputFormat}.
You may generate input data using MarkLogic Server lexicon functions
by subclassing one of the lexicon function wrapper classes in
com.marklogic.mapreduce.functions. Use lexicon functions
with {@link com.marklogic.mapreduce.ValueInputFormat} and
{@link com.marklogic.mapreduce.KeyValueInputFormat}.
The following classes are provided for defining
MarkLogic-specific MapReduce input and output formats.
Input and output formats need not be of the same type.
A mapper sketch using these types follows the list.
- {@link com.marklogic.mapreduce.DocumentInputFormat}
- {@link com.marklogic.mapreduce.NodeInputFormat}
- {@link com.marklogic.mapreduce.ValueInputFormat}
- {@link com.marklogic.mapreduce.KeyValueInputFormat}
- {@link com.marklogic.mapreduce.ContentOutputFormat}
- {@link com.marklogic.mapreduce.NodeOutputFormat}
- {@link com.marklogic.mapreduce.PropertyOutputFormat}
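For illustration, here is a minimal mapper sketch, assuming a job
configured with {@link com.marklogic.mapreduce.DocumentInputFormat},
so that each map input pair is a document URI and the document content.
The class name and output types are hypothetical, chosen only for this
example.

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import com.marklogic.mapreduce.DocumentURI;
import com.marklogic.mapreduce.MarkLogicNode;

// Sketch: with DocumentInputFormat, each map input pair is a
// document's URI and its content as a MarkLogicNode.
public class UriMapper
        extends Mapper&lt;DocumentURI, MarkLogicNode, Text, Text&gt; {
    @Override
    protected void map(DocumentURI uri, MarkLogicNode doc, Context context)
            throws IOException, InterruptedException {
        // Emit the document URI with an empty payload; a real mapper
        // would examine the node content here.
        context.write(new Text(uri.getUri()), new Text(""));
    }
}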
Configuration
Configure the connector using the standard Hadoop configuration
mechanism. That is, use a Hadoop configuration file to define
property values, or set properties programmatically on your
Job's {@link org.apache.hadoop.conf.Configuration} object.
The configuration properties available for the connector are
described in {@link com.marklogic.mapreduce.MarkLogicConstants}.
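For example, a job driver might set connection properties
programmatically. This is a minimal sketch; the host, port, and
credentials shown are placeholders for your environment.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class JobSetup {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Connector properties defined in MarkLogicConstants; the
        // values below are placeholders.
        conf.set("mapreduce.marklogic.input.host", "localhost");
        conf.set("mapreduce.marklogic.input.port", "8000");
        conf.set("mapreduce.marklogic.input.username", "user");
        conf.set("mapreduce.marklogic.input.password", "password");
        Job job = Job.getInstance(conf, "marklogic example");
        // ... configure input/output formats, mapper, and reducer here ...
    }
}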
When using MarkLogic Server as an input source for MapReduce
tasks, you may use either basic or advanced input mode. The default
is basic mode. The mode is controlled through the
{@link com.marklogic.mapreduce.MarkLogicConstants#INPUT_MODE
mapreduce.marklogic.input.mode} property. The following sections
describe the input modes briefly. For details, see the
MarkLogic Connector for Hadoop Developer's Guide.
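For instance, a sketch of setting the mode explicitly; since basic
is the default, the property only needs to be set to override it.

import org.apache.hadoop.conf.Configuration;

public class InputModeSetup {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Accepted values are "basic" (the default) and "advanced".
        conf.set("mapreduce.marklogic.input.mode", "basic");
    }
}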
Configuring the Input Query with a Path Expression
In basic mode, you may supply components of an XQuery path expression
which the connector uses to generate input data. You may not use this
option along with a lexicon function class.
To allow MarkLogic Server to optimize the input query, the path
expression is constructed from two components: A
{@link com.marklogic.mapreduce.MarkLogicConstants#DOCUMENT_SELECTOR
document node selector} and a
{@link com.marklogic.mapreduce.MarkLogicConstants#SUBDOCUMENT_EXPRESSION
sub-document expression}.
The input split is not configurable in basic mode. The splits are
based on a rough count of the number of fragments in each forest.
Use advanced input mode for more control over input split generation.
Conceptually, the input data for each task is constructed from a
path expression similar to:
$document-selector/$subdocument-expression
Both components of the input path expression are optional. If no
document selector is given, fn:collection()
is used.
If no subdocument expression is given, the document nodes returned
by the document selector are used as the input values.
Examples:

document selector: (none)
subdocument expression: (none)
=> All document nodes in fn:collection()

document selector: fn:collection("wiki-topics")
subdocument expression: (none)
=> All document nodes in the "wiki-topics" collection

document selector: fn:collection("wiki-topics")
subdocument expression: //wp:a[@href]
=> All wp:a elements with href attributes in the "wiki-topics"
collection

document selector: fn:collection("wiki-topics")
subdocument expression: //wp:a[@href]/@title
=> The title attributes of those same wp:a elements
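To express the last example as job configuration, one might write
the following sketch. The property names correspond to the document
selector, subdocument expression, and path namespace constants in
{@link com.marklogic.mapreduce.MarkLogicConstants}; the wp namespace
URI is a placeholder and must match your documents.

import org.apache.hadoop.conf.Configuration;

public class BasicModeSetup {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Path expression components: selector/expression.
        conf.set("mapreduce.marklogic.input.documentselector",
                "fn:collection(\"wiki-topics\")");
        conf.set("mapreduce.marklogic.input.subdocumentexpr",
                "//wp:a[@href]/@title");
        // Namespace bindings used by the path expression, as
        // comma-separated alias, URI pairs.
        conf.set("mapreduce.marklogic.input.namespace",
                "wp, http://www.example.com/wp-namespace");
    }
}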
Configuring the Input Query with a Lexicon Function
In basic mode, you may gather input data using a MarkLogic Server
lexicon function. This option may not be combined with the
XPath-based configuration properties described above. If both are
configured for a job, the lexicon function takes precedence.
To use a lexicon function for input, implement a subclass of
one of the lexicon function wrapper classes in
com.marklogic.mapreduce.functions. For example, to use
cts:element-values, implement a subclass of
{@link com.marklogic.mapreduce.functions.ElementValues}.
Override the methods corresponding to the function parameter values
you want to include in the call, as in the sketch below.
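A minimal sketch of such a subclass, assuming the getElementNames()
override and the serialized QName form shown in the Developer's
Guide; the element name and namespace URI are placeholders.

import com.marklogic.mapreduce.functions.ElementValues;

// Sketch: wraps cts:element-values over a placeholder element.
public class TitleValues extends ElementValues {
    @Override
    public String[] getElementNames() {
        // Element names are passed as serialized XQuery QName
        // constructor expressions.
        return new String[] {
            "xs:QName(fn:QName(\"http://www.example.com/wp\", \"title\"))"
        };
    }
}

The job is then pointed at this class through the lexicon function
class property described in
{@link com.marklogic.mapreduce.MarkLogicConstants}.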
For details, see "Using a Lexicon to Generate Key-Value Pairs" in
the MarkLogic Connector for Hadoop Developer's Guide.
Configuring the Input Query in Advanced Mode
In advanced
input mode, you must supply an
{@link com.marklogic.mapreduce.MarkLogicConstants#SPLIT_QUERY
input split query} and an
{@link com.marklogic.mapreduce.MarkLogicConstants#INPUT_QUERY
input query}.
The split query is used to generate metadata for Hadoop's
input splits. It must return a sequence of triples, each
consisting of a forest id, a record (fragment) count, and a
list of host names. The count may be an estimate.
The input query is used to fetch the input data for each map task.
This query must return data that matches the configured InputFormat
subclass.
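A configuration sketch for advanced mode follows. The query bodies
are placeholders and must be replaced with real XQuery that meets
the requirements above.

import org.apache.hadoop.conf.Configuration;

public class AdvancedModeSetup {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.set("mapreduce.marklogic.input.mode", "advanced");
        // The split query must return (forest id, count, host name)
        // triples; the count may be an estimate.
        conf.set("mapreduce.marklogic.input.splitquery",
                "xquery version \"1.0-ml\"; (: split query placeholder :)");
        // The input query must return values compatible with the
        // configured InputFormat subclass.
        conf.set("mapreduce.marklogic.input.query",
                "xquery version \"1.0-ml\"; (: input query placeholder :)");
    }
}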