/* This file is part of VoltDB.
* Copyright (C) 2008-2018 VoltDB Inc.
*
* This program is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as
* published by the Free Software Foundation, either version 3 of the
* License, or (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU Affero General Public License for more details.
*
* You should have received a copy of the GNU Affero General Public License
 * along with VoltDB. If not, see <http://www.gnu.org/licenses/>.
*/
/**
* What is the package org.voltdb.planner?
* This package contains the cost-based, flexible, distributed
* planner/optimizer for VoltDB.
*
 * The execution of any query is some kind of scan. By scan we mean a loop,
 * or perhaps a nested loop. The most naive, and slowest, scan would be just a scan of a table.
 * Each row of the table is fetched, evaluated and either written to the output or skipped, depending
 * on the filters of the join conditions or the WHERE clause. A more sophisticated method would be
 * to scan an index, which touches only a fraction of the rows. But it's still a scan.
*
*
* The planner creates plans. A plan is a kind of description of the execution of
* a query. It may help to think of a plan as an executable program file, say an ELF or
* a.out file. The plan contains the instructions needed to do the computation, but the
* plan doesn't have any resources to do any calculations. The plan is just a data structure.
*
*
* Each plan is made of a tree of plan nodes. Each plan node describes
* some computation which operates on tables and produces a single table. The result table
* and the argument tables may be persistent tables or they may be temporary tables. Temporary
* tables created during the execution of a plan are destroyed after the execution of the
* plan.
*
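 * As a purely illustrative sketch, assuming simplified stand-in classes rather
 * than the real planner API, a plan can be pictured as a small tree in which
 * every node consumes the tables produced by its children and produces exactly
 * one table:
 *
 *   // Hypothetical, simplified classes -- not the actual VoltDB plan node types.
 *   import java.util.ArrayList;
 *   import java.util.List;
 *
 *   abstract class PlanNodeSketch {
 *       final List<PlanNodeSketch> children = new ArrayList<>();
 *       // Each node describes one computation that yields exactly one table.
 *   }
 *   class SeqScanSketch extends PlanNodeSketch { String targetTable; }
 *   class SendSketch    extends PlanNodeSketch { }
 *
 *   // A plan for "SELECT * FROM T" could then be the two-node tree
 *   //   SEND -> SEQUENTIAL_SCAN(T),
 *   // built as plain data, with no ability to execute anything by itself.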
*
* For example, a SEQUENTIAL_SCAN node describes a computation which operates on a
* table, reads each row of the table and sends it to the output, possibly filtering
* with a predicate or applying a limit or an offset. A NESTED_LOOP plan node describes
* a join computation, either an inner or outer join.
* This join computation will be given two tables which must be scanned
* in two nested loops. For each row of the left hand table, which we call the outer
* table, we look at each row of the right hand table, which we call the inner
* table. If the pair of rows satisfy the join condition, we write the row to the
* output table.
* Note: The computation for NESTED_LOOP is somewhat more complicated
* than this, as we may be able to skip the scan of the inner table. There
* may be other optimizations.
* Note Also: The tree formed by a plan is not very similar to the parse tree formed
* from a SQL or DDL statement. The plan relates to the SQL text approximately the
* same way an assembly language program relates to a string of C++ program text.
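 *
 * A minimal sketch of the nested loop idea, assuming plain Java lists of rows
 * rather than the EE's real tables and executors (the helper shown here is
 * invented for illustration):
 *
 *   import java.util.ArrayList;
 *   import java.util.List;
 *   import java.util.function.BiPredicate;
 *
 *   // Illustrative inner join: the real work happens in the EE's C++ executors.
 *   static List<Object[]> nestedLoopJoin(List<Object[]> outer, List<Object[]> inner,
 *                                        BiPredicate<Object[], Object[]> joinCondition) {
 *       List<Object[]> output = new ArrayList<>();
 *       for (Object[] outerRow : outer) {            // loop over the outer (left) table
 *           for (Object[] innerRow : inner) {        // loop over the inner (right) table
 *               if (joinCondition.test(outerRow, innerRow)) {
 *                   // Concatenate the matching pair into one output row.
 *                   Object[] joined = new Object[outerRow.length + innerRow.length];
 *                   System.arraycopy(outerRow, 0, joined, 0, outerRow.length);
 *                   System.arraycopy(innerRow, 0, joined, outerRow.length, innerRow.length);
 *                   output.add(joined);
 *               }
 *           }
 *       }
 *       return output;
 *   }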
*
*
* We say a plan node describes a computation because the plan node does
* not actually have the code to do the computation. Plan nodes are generated by
* the planner in the Java code in the VoltDB server. Plans are transmitted to the
* Execution Engine, or EE, in the form of JSON objects, represented as strings.
* The EE has its own definitions for all the plan nodes in C++. The
 * EE can reconstruct the plan in a C++ data structure from the JSON. The EE also
* has a hierarchy of classes called executor classes which is parallel to
* the plan node hierarchy. Executor class objects contain the code which does
* the real computations of the plan nodes. Most dynamic data, such as table
* data, are stored in the executor objects.
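 *
 * To make the hand-off concrete, the string below shows a deliberately
 * simplified, invented JSON shape for a plan; the field names are illustrative
 * only and are not the exact serialization the EE parses:
 *
 *   // Illustrative JSON shape only -- not the real plan serialization schema.
 *   String planJson = """
 *       { "PLAN_NODES": [
 *           { "ID": 1, "TYPE": "SEND",            "CHILDREN": [2] },
 *           { "ID": 2, "TYPE": "SEQUENTIAL_SCAN", "TARGET_TABLE": "T",
 *             "INLINE_NODES": [ { "TYPE": "PROJECTION" } ] }
 *       ] }
 *       """;
 *   // The Java planner produces a string along these lines; the EE rebuilds the
 *   // equivalent C++ plan node tree from it and hands each node to its matching
 *   // executor object, which holds the code and the dynamic data.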
*
*
* The children of a plan
* node describe computations which produce temporary tables. The parents of a plan
* node describe computations which consume the result of the plan node. So, child
* nodes could also be called producer nodes, and parent nodes could also
* be called consumer nodes.
 * Note: A node may have only one parent node. A node, therefore, produces only
 * one output table.
*
* The output schema of a plan node gives the set of columns which the computation
* of the plan node will create. For each row which is not rejected, some computation
* is done to the row, and the results are put into an output row.
*
*
* Some operations can be computed on a row-wise basis. For example, if
* a table has N columns, but only M < N columns are used in the output, the projection
* from N columns to M columns happens independently for all rows.
* Similarly, an output column can be a function of input columns. These
 * functions can be computed row-wise. Finally, most
* aggregate functions can be computed row-wise. For example, the sum
* aggregate function needs only a running sum and values from the current row
* to update its state. Not all operations can be computed row-wise.
* For example, a sort requires an entire table.
* For row-wise operations, we avoid creating extra temporary
* tables by combining scans. The row-by-row operations are placed in inline
* nodes in a scan node. The inline nodes are applied to each row of
* the scan which passes the filters during the scan. The inline nodes have
* their own output columns, as discussed above.
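 *
 * The running-sum case can be sketched as follows, assuming hypothetical
 * helpers scanRows, passesFilters and project, and a hypothetical Table type,
 * standing in for the scan, its predicate and an inline PROJECTION node:
 *
 *   // Illustrative: a row-wise SUM folded into the enclosing scan loop.
 *   long runningSumScan(Table table) {
 *       long sum = 0;
 *       for (Object[] row : scanRows(table)) {          // the enclosing scan
 *           if (!passesFilters(row)) continue;          // scan-level predicate
 *           Object[] projected = project(row);          // inline PROJECTION: N columns -> M columns
 *           sum += ((Number) projected[0]).longValue(); // inline aggregate: update the running sum
 *       }
 *       return sum;
 *   }
 *   // A sort, by contrast, needs every row before it can emit any output,
 *   // which is why an ORDERBY cannot be computed row-wise like this.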
*
*
* The classes of plan nodes are organized in a hierarchy. The root of the
* hierarchy is the AbstractPlanNode class.
*
*
* How Does The Planner Work In Detail?
*
* The planner takes a data structure describing a DML or DQL SQL
* statement. It produces a plan of execution which can be used to
* evaluate the statement. The Execution Engine executes the plan.
* Note that DDL is processed entirely differently, and is generally
* not part of this discussion. DDL which has embedded DQL or DML
* will need to be planned, as detailed here. For example, materialized
* views, single statement procedure definitions and delete statement
* limit plans are DDL statements which have plannable elements.
 *
 * The Input
*
* The input to the planner is originally a SQL statement, which is a
* text string. The HSQL front end translates the user’s text into an
* HSQL data structure which represents the statement. This data
* structure is translated into a second data structure called
* VoltXML, which is similar to XML. VoltXML has elements, called
* VoltXMLElements, and elements have string valued attributes, just
* like XML. There is no support for namespaces in VoltXML, though it
* is hardly needed. This VoltXML is translated again into another
* data structure called AbstractParsedStatement and its subclasses.
* This AbstractParsedStatement, along with a catalog defining the
* tables, indexes and other DDL objects, is the input to the
* planner.
*
*
* An AbstractParsedStatement object contains several pieces.
*
* - All the parts of the user’s original SQL statement,
* including
*
* - The display list,
* - The list of joined tables, including
*
* - Joined tables,
* - Derived tables, and
 * - Expressions used in WHERE and ON conditions,
*
*
* - The list of GROUP BY expressions,
* - The HAVING expression, if any,
* - Any LIMIT or OFFSET constants.
 * - An indication of whether the statement is a SELECT, DELETE,
 * INSERT, UPSERT or UPDATE statement, or a set operation like
 * UNION or INTERSECT. As we said
* before, DDL is generally not planned, but some parts of
* DDL statements may require planning.
*
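 * A hypothetical, heavily simplified data holder gives a feel for what the
 * pieces listed above amount to; the real AbstractParsedStatement is much
 * richer, and these field and type names are invented for illustration:
 *
 *   // Illustrative stand-in for AbstractParsedStatement, not the real class.
 *   class ParsedStatementSketch {
 *       List<Expression> displayList;    // the SELECT display list
 *       List<TableScan>  tableScans;     // joined tables and derived tables
 *       Expression       joinAndWhere;   // WHERE and ON conditions
 *       List<Expression> groupByExprs;   // GROUP BY expressions
 *       Expression       having;         // HAVING expression, if any
 *       Long             limit, offset;  // LIMIT and OFFSET constants
 *       StatementKind    kind;           // SELECT, INSERT, UPDATE, DELETE, UPSERT, UNION, ...
 *   }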
*
*
* Note that a joined or derived table is associated with a tablescan,
* and that the planner generally works with tablescans rather than
* tables. For example, in this SQL statement:
*
 *   select *
 *   from t as lhs join t as rhs
 *   on lhs.a = rhs.a;
 *
 * the aliases lhs and rhs name
 * the table T, which is scanned twice independently as the
 * left and right scans of the join.
*
*
 * The SQL standard and HSQL refer to tablescans as range
 * variables, but the planner calls them tablescans. A tablescan is
 * essentially a table together with an alias that distinguishes different
 * references to the same table in a self join.
*
* The Plan Nodes In Detail
*
* Plan nodes compute some operation on one or more input tables. For
* example, an ORDERBY plan node sorts its only input table. A
* NESTEDLOOPJOIN node takes two tables as input and computes a join
* in a nested loop, scanning the outer table in the outer loop, and
* the inner table in the inner loop.
*
*
* There are about 24 different plan nodes, organized in five categories. The categories are:
*
* - Operator nodes, for executing DML operations such as DELETE, INSERT and UPDATE,
* - Scan nodes for scanning tables and indexes,
* - Join nodes for calculating different kinds of joins,
* - Communication nodes for sending result tables to the
* coordinator or receiving results from distributed executions,
* - Miscellaneous nodes. These include
*
* - Nodes to compute aggregates and GROUP BY. The aggregation algorithms include
*
* - Serial Aggregation, which is aggregation based on sorting,
* - Hash Aggregation, which is aggregation based on hash tables,
* - Partial Aggregation, which is a form of parallelized
* serial aggregation.
*
*
* - Nodes to compute set operations, like UNION or INTERSECT,
* - Nodes to compute ORDERBY,
* - Nodes to compute PROJECTION,
* - Nodes to compute MATERIALIZE,
* - Nodes to compute LIMIT and OFFSET,
* - Nodes to compute WINDOWFUNCTION.
 *
 *
* The Execution Model
* Plans and Fragments
*
* The plan nodes are connected in one or two fragments. Each
* fragment is a plan node tree.
*
* - A one-fragment plan is a single partition plan. It is
* scheduled on a particular partition, and the resulting table is
* sent to the user.
* - A two-fragment plan is a multi partition plan. The two
* fragments are called the coordinator fragment and the distributed
* fragment.
* - The distributed fragment is scheduled to be sent to all the
* partitions and executed in parallel on the partitions.
* - The results of the executions of the distributed fragment on
* each of the partitions are sent to the coordinator node which
* executes the coordinator fragment. The coordinator fragment
* has a RECEIVE or a MERGERECEIVE node which knows how to
* combine the distributed results into a single table.
* - Plan node execution is similar for both the distributed and
* coordinator fragment.
 *
*
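 * As a small illustration of the two-fragment division of labor, a
 * multi-partition COUNT(*) can be thought of as each partition's distributed
 * fragment producing a partial count, with the coordinator fragment summing
 * them (illustrative Java, not the actual executor code):
 *
 *   import java.util.List;
 *
 *   // Illustrative: combine per-partition results in the coordinator fragment.
 *   static long coordinatorCount(List<Long> perPartitionCounts) {
 *       // Each element is the output of the distributed fragment on one partition;
 *       // the coordinator fragment's aggregate node adds them into the final answer.
 *       return perPartitionCounts.stream().mapToLong(Long::longValue).sum();
 *   }
 *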
* Plan nodes have different kinds of children. A plan node can have
 * child nodes which are out-of-line or inline. The latter are used for
 * computations which can be done row-wise, in a single pass through
 * the input table along with the parent node. For example, a LIMIT
* or OFFSET node is often made inline, since it just counts how many
* initial rows to allow or ignore. PROJECTION nodes can also be
* inlined, since they operate only row-wise. An ORDERBY node can
* generally not be inlined, since the entire table must be present to
* sort it. But if the ORDERBY node’s input table comes from a
* RECEIVE node, and the planner can tell that the input is sorted,
 * then the RECEIVE node can be replaced with a MERGERECEIVE node and the
* ORDERBY node can be inlined in the MERGERECEIVE node. The
* MERGERECEIVE node gets sorted input tables from distributed
 * fragments. The MERGERECEIVE node merges the input tables by
 * repeatedly taking the smallest of the current first rows of its input tables.
* Indeed, the MERGERECEIVE node must have an inline ORDERBY node to
* carry the ordering expressions.
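 *
 * The merge itself is an ordinary k-way merge of already-sorted inputs. A
 * sketch, assuming simple in-memory row lists rather than the EE's streams:
 *
 *   import java.util.ArrayList;
 *   import java.util.Comparator;
 *   import java.util.List;
 *   import java.util.PriorityQueue;
 *
 *   // Illustrative k-way merge of sorted per-partition row lists.
 *   static <T> List<T> mergeSorted(List<List<T>> sortedInputs, Comparator<T> order) {
 *       // Heap entry: {rowIndex within its list, listIndex}; ordered by the row it points at.
 *       PriorityQueue<int[]> heap = new PriorityQueue<>(
 *           (a, b) -> order.compare(sortedInputs.get(a[1]).get(a[0]),
 *                                   sortedInputs.get(b[1]).get(b[0])));
 *       for (int i = 0; i < sortedInputs.size(); i++) {
 *           if (!sortedInputs.get(i).isEmpty()) heap.add(new int[] {0, i});
 *       }
 *       List<T> merged = new ArrayList<>();
 *       while (!heap.isEmpty()) {
 *           int[] head = heap.poll();
 *           List<T> source = sortedInputs.get(head[1]);
 *           merged.add(source.get(head[0]));                // take the smallest head row
 *           if (head[0] + 1 < source.size()) {
 *               heap.add(new int[] {head[0] + 1, head[1]}); // advance that input
 *           }
 *       }
 *       return merged;
 *   }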
*
 * Fragment Initialization
*
* The plan nodes have a method for initialization. They first
* initialize their children, then they initialize themselves.
*
* Fragment Execution
*
 * A fragment tree executes the operations of each of its plan nodes,
 * starting with the leaves. At each node the execution function reads rows from
 * the input tables, computes the node's function to produce a single row, perhaps
 * alters that row with inline plan nodes, and writes the output row to
 * a single output table. A plan node can have only a single parent and a
 * single output table.
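 *
 * A recursive sketch of that bottom-up evaluation, reusing the hypothetical
 * PlanNodeSketch idea from the earlier sketch and assuming an invented
 * computeOutput method and Table type; the real evaluation is performed by
 * the EE's executor objects:
 *
 *   // Illustrative: run the children first, then this node, yielding one table.
 *   static Table executeSketch(PlanNodeSketch node) {
 *       List<Table> inputs = new ArrayList<>();
 *       for (PlanNodeSketch child : node.children) {
 *           inputs.add(executeSketch(child));     // leaves execute first
 *       }
 *       // Read rows from the input tables, apply this node's operation and any
 *       // inline nodes row by row, and write the results to one output table.
 *       return node.computeOutput(inputs);
 *   }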
*
* The Planner’s Passes
*
* The planner doesn’t have passes in the way a compiler might. It’s
* more of a pass itself. But it does have several phases.
*
 * - First, the planner calculates some join orders. The
 * statement has a set of tablescans, and the efficiency of a
 * plan depends on the order in which the tablescans are
 * scanned. There is a large amount of confusing code devoted
 * to this, but essentially each possible order of the
 * tablescans is generated (a sketch of this enumeration
 * appears after this list). There is a limit to the size of
 * this order set: if there are more than 5 tables, the order
 * in the statement is used, to avoid a combinatorial explosion.
 * A Java stored procedure may also have an explicit
 * user-provided join order specification.
 * - Each join order results in a join tree, whose Java type is
 * JoinNode.
*
* - The subtypes of JoinNode are SubqueryLeafNode, BranchNode
* and TableLeafNode. These represent derived tables, which
* we often call subqueries, joined tables and tablescans
* respectively.
* - We do some analysis of the join predicates, to
* distribute them to the proper JoinNode nodes. For example,
*
* -
 * If an expression contains only references to a single
 * tablescan, that expression can be attached to
 * the TableLeafNode for the tablescan, so that we can
 * later decide whether the expression can be used with an index.
* -
 * Similarly, if the tablescan is the outer scan of a
 * join, the expression can also be added to an outer-only list in
 * the join node. In this case we know to evaluate
 * these outer-only expressions while scanning the outer table, and to skip the
 * inner table entirely if the expression evaluates to
 * false.
* -
* Expressions which reference both branches of a join
* might be outer expressions of a higher level
* join.
* - For OUTER joins, the differing semantics of ON clauses
* and WHERE clauses also factor into where in the join
* tree a predicate should apply.
*
*
* -
* The planner calculates a set of access paths for the tablescans.
* Note that this does not depend on the join order.
*
* -
* These access paths will tell how the tablescan will be
* scanned. It may be scanned directly or through an index.
* -
* The scan associated with the access path may provide
* ordering for a statement level order by or a window
* function. This is stored in the access path as well.
* -
* For each join order, the planner creates a subplan in
* SubPlanAssembler. This class traverses the join tree
* and, consulting with the access paths, turns the
* TableLeafNode and SubqueryLeafNode join nodes into some
* kind of AbstractScanPlanNode, and turns the BranchNodes
* into some kind of AbstractJoinPlanNode. The result is a
* set of subplans. These are just trees of scans and
 * joins. The other parts of the statement, such as order by,
 * group by, limit, offset, having, and the DML operations of
 * insertion, update and deletion, are not represented
 * here. In particular, if the plan is a multi-partition
* plan, it needs a SEND/RECEIVE pair to designate the
* boundary between the coordinator and distributed plans.
* But this pair will not be part of the subplan.
* -
* For each subplan, the plan assembler in PlanAssembler
* adds new nodes on the top of the subplan tree, to form
* the complete tree. These may include:
*
* - A SEND node on the top of the plan, to send the
* result to the server.
* - SEND/RECEIVE pairs in multi-partition queries,
* - Aggregate nodes, to calculate GROUP BY groups and
* aggregate functions.
* - Window function nodes to calculate window
* functions,
* - ORDERBY nodes to implement sorting.
* - PROJECTION nodes to calculate display lists.
*
* All but the first two are optional.
*
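 * The join order enumeration mentioned in the first step above can be sketched
 * as plain permutation generation with a cutoff; the planner's real code also
 * honors user-supplied join orders and outer-join constraints, and the names
 * here are invented for illustration:
 *
 *   import java.util.ArrayList;
 *   import java.util.Collections;
 *   import java.util.List;
 *
 *   // Illustrative: enumerate candidate join orders, falling back to the
 *   // statement's own order once the permutation count would explode.
 *   static List<List<String>> candidateJoinOrders(List<String> tableScans, int maxTablesToPermute) {
 *       if (tableScans.size() > maxTablesToPermute) {
 *           List<List<String>> fallback = new ArrayList<>();
 *           fallback.add(new ArrayList<>(tableScans));   // just the statement's own order
 *           return fallback;
 *       }
 *       List<List<String>> orders = new ArrayList<>();
 *       permute(new ArrayList<>(tableScans), 0, orders);
 *       return orders;
 *   }
 *   static void permute(List<String> scans, int from, List<List<String>> out) {
 *       if (from == scans.size()) { out.add(new ArrayList<>(scans)); return; }
 *       for (int i = from; i < scans.size(); i++) {
 *           Collections.swap(scans, from, i);
 *           permute(scans, from + 1, out);
 *           Collections.swap(scans, from, i);
 *       }
 *   }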
 *
* Note that the transformation from ORDERBY->RECEIVE to
* MERGERECEIVE(ORDERBY inlined) discussed above is not done in the
* plan assembler. Also, sometimes nodes which could be made inline
* are not made so by the plan assembler. These and other final
* transformations could be performed in the plan assembler. But it
* seems simpler to do them locally, after the plan tree is formed.
*
* The Microoptimizer
*
 * The Microoptimizer takes a plan and applies various transformations. The current set of transformations includes:
*
* -
 * Pushing limit and offset nodes into distributed fragments and
 * making them inline. If a statement has LIMIT 100, for example,
 * there is no reason for a distributed fragment to generate more
 * than 100 rows, since the coordinator may not use all of them.
* -
* Transforming plans with ORDERBY, LIMIT and OFFSET into plans
* with MERGERECEIVE and inline ORDERBY, LIMIT and OFFSET plan
* nodes.
 * - Making aggregate nodes inline; a sketch of the serial case appears after this list.
*
* -
* A hash aggregate node reads each input row, calculates group
* by keys and a hash value from this calculation. The hash
* aggregate node puts the group by key value into a hash
* table. It also updates the aggregate function values. For
 * example, if GBV1 and GBV2 are group by keys, then the
* expression MIN(GBV1 * GBV2) would compute the minimum of
* the product of GBV1 and GBV2. This is a row-wise
* operation.
* -
* A serial aggregate node reads each input row and calculates
 * the group by keys. The input table will be sorted, so when
 * the group by key values change we know we have seen an
 * entire group and can output a row. In any case we update
* aggregate values, like the MIN(GBV1*GBV2) expression above.
* So serial aggregate nodes can be calculated row-wise.
* -
* Partial aggregation is a kind of serial aggregation.
*
* So, the aggregate node’s calculation can be done row-wise. This
* means the aggregate node can be inlined in some cases.
* -
 * Replacing an aggregate plan node with an index count plan node or
 * a table count plan node.
 * -
 * Replacing an aggregate plan node with an index scan plan node that has
 * an inline limit node.
*
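 * A sketch of the serial aggregation idea from the list above, assuming the
 * input rows arrive already sorted on their group-by key and using a generic
 * per-row value in place of an expression such as GBV1 * GBV2 (illustrative
 * Java, not the EE implementation):
 *
 *   import java.util.ArrayList;
 *   import java.util.List;
 *
 *   // Illustrative serial MIN: emit a result row each time the sorted group key
 *   // changes, updating the running minimum row by row in between.
 *   static List<Object[]> serialMin(Iterable<Object[]> rowsSortedByKey) {  // row = {groupKey, value}
 *       List<Object[]> out = new ArrayList<>();
 *       Object currentKey = null;
 *       long runningMin = Long.MAX_VALUE;
 *       for (Object[] row : rowsSortedByKey) {
 *           if (currentKey != null && !currentKey.equals(row[0])) {
 *               out.add(new Object[] {currentKey, runningMin});  // the group is complete
 *               runningMin = Long.MAX_VALUE;
 *           }
 *           currentKey = row[0];
 *           runningMin = Math.min(runningMin, ((Number) row[1]).longValue());
 *       }
 *       if (currentKey != null) {
 *           out.add(new Object[] {currentKey, runningMin});      // emit the final group
 *       }
 *       return out;
 *   }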
*
*/
package org.voltdb.planner;