/* This file is part of VoltDB.
 * Copyright (C) 2008-2018 VoltDB Inc.
 *
 * This program is free software: you can redistribute it and/or modify
 * it under the terms of the GNU Affero General Public License as
 * published by the Free Software Foundation, either version 3 of the
 * License, or (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU Affero General Public License for more details.
 *
 * You should have received a copy of the GNU Affero General Public License
 * along with VoltDB.  If not, see <http://www.gnu.org/licenses/>.
 */
/**
 * 

What is the package org.voltdb.planner?

This package contains the cost-based, flexible, distributed planner/optimizer for VoltDB.

The execution of any query is some kind of scan. By scan we mean a loop, or perhaps a nested loop. The most naive, and slowest, scan would be just a scan of a table. Each row of the table is fetched, evaluated and written to the output or else skipped, depending on the filters of the join conditions or the WHERE clause. A more sophisticated method would be to scan an index. This scans only a fraction of the rows, but it is still a scan.
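
To make "a scan is a loop" concrete, here is a minimal Java sketch of a sequential scan applying a WHERE-style predicate to every row. The Object[]-per-row layout and all names are illustrative, not VoltDB classes.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.Predicate;

    // A toy sequential scan: the loop is the scan. Each row is fetched,
    // evaluated against the filter, and either emitted or skipped.
    public final class ScanSketch {
        static List<Object[]> sequentialScan(List<Object[]> table,
                                             Predicate<Object[]> wherePredicate) {
            List<Object[]> output = new ArrayList<>();
            for (Object[] row : table) {
                if (wherePredicate.test(row)) {   // WHERE clause or join filter
                    output.add(row);              // emit the row
                }
            }
            return output;
        }

        public static void main(String[] args) {
            List<Object[]> t = List.of(new Object[] {1, "a"},
                                       new Object[] {2, "b"},
                                       new Object[] {3, "c"});
            // Roughly: SELECT * FROM t WHERE col0 > 1
            System.out.println(sequentialScan(t, row -> (int) row[0] > 1).size());  // prints 2
        }
    }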

The planner creates plans. A plan is a kind of description of the execution of a query. It may help to think of a plan as an executable program file, say an ELF or a.out file. The plan contains the instructions needed to do the computation, but the plan doesn't have any resources to do any calculations. The plan is just a data structure.

Each plan is made of a tree of plan nodes. Each plan node describes some computation which operates on tables and produces a single table. The result table and the argument tables may be persistent tables or they may be temporary tables. Temporary tables created during the execution of a plan are destroyed after the execution of the plan.

For example, a SEQUENTIAL_SCAN node describes a computation which operates on a table, reads each row of the table and sends it to the output, possibly filtering with a predicate or applying a limit or an offset. A NESTED_LOOP plan node describes a join computation, either an inner or outer join. This join computation will be given two tables which must be scanned in two nested loops. For each row of the left hand table, which we call the outer table, we look at each row of the right hand table, which we call the inner table. If the pair of rows satisfies the join condition, we write the row to the output table.

Note: The computation for NESTED_LOOP is somewhat more complicated than this, as we may be able to skip the scan of the inner table. There may be other optimizations.

Note also: The tree formed by a plan is not very similar to the parse tree formed from a SQL or DDL statement. The plan relates to the SQL text approximately the same way an assembly language program relates to a string of C++ program text.
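
Purely as an illustration of a plan being a tree of nodes, each of which produces a single table, the following hedged Java sketch uses made-up node classes (not the real org.voltdb.plannodes types) to build the shape such a tree might take for a simple join query.

    import java.util.List;

    // Made-up stand-ins for plan nodes (not the real org.voltdb.plannodes
    // classes): each node holds its children and conceptually produces one table.
    abstract class PlanNodeSketch {
        final List<PlanNodeSketch> children;
        PlanNodeSketch(PlanNodeSketch... children) { this.children = List.of(children); }
        @Override public String toString() {
            String name = getClass().getSimpleName();
            return children.isEmpty() ? name : name + children;
        }
    }
    class SeqScan extends PlanNodeSketch {
        final String table;
        SeqScan(String table) { this.table = table; }
        @Override public String toString() { return "SeqScan(" + table + ")"; }
    }
    class NestedLoop extends PlanNodeSketch {
        NestedLoop(PlanNodeSketch outer, PlanNodeSketch inner) { super(outer, inner); }
    }
    class Projection extends PlanNodeSketch { Projection(PlanNodeSketch child) { super(child); } }
    class Send extends PlanNodeSketch { Send(PlanNodeSketch child) { super(child); } }

    public final class PlanTreeSketch {
        public static void main(String[] args) {
            // The rough shape of a plan for: SELECT r.a, s.b FROM r JOIN s ON r.a = s.a
            PlanNodeSketch plan = new Send(new Projection(
                    new NestedLoop(new SeqScan("r"), new SeqScan("s"))));
            System.out.println(plan);
            // prints: Send[Projection[NestedLoop[SeqScan(r), SeqScan(s)]]]
        }
    }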

We say a plan node describes a computation because the plan node does not actually have the code to do the computation. Plan nodes are generated by the planner in the Java code in the VoltDB server. Plans are transmitted to the Execution Engine, or EE, in the form of JSON objects, represented as strings. The EE has its own definitions for all the plan nodes in C++. The EE can reconstruct the plan in a C++ data structure from the JSON. The EE also has a hierarchy of classes called executor classes which is parallel to the plan node hierarchy. Executor class objects contain the code which does the real computations of the plan nodes. Most dynamic data, such as table data, are stored in the executor objects.

The children of a plan node describe computations which produce temporary tables. The parents of a plan node describe computations which consume the result of the plan node. So, child nodes could also be called producer nodes, and parent nodes could also be called consumer nodes.

Note: A node may have only one parent node. A node may, therefore, create only one output table.

The output schema of a plan node gives the set of columns which the computation of the plan node will create. For each row which is not rejected, some computation is done to the row, and the results are put into an output row.

Some operations can be computed on a row-wise basis. For example, if a table has N columns, but only M < N columns are used in the output, the projection from N columns to M columns happens independently for all rows. Similarly, an output column can be a function of input columns. These functions can be computed row-wise. Finally, most aggregate functions can be computed row-wise. For example, the sum aggregate function needs only a running sum and values from the current row to update its state. Not all operations can be computed row-wise. For example, a sort requires an entire table. For row-wise operations, we avoid creating extra temporary tables by combining scans. The row-by-row operations are placed in inline nodes in a scan node. The inline nodes are applied to each row of the scan which passes the filters during the scan. The inline nodes have their own output columns, as discussed above.
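
The following hedged Java sketch illustrates the inlining idea: a filter, a projection and a LIMIT are all applied in one pass over the input, rather than each producing its own temporary table. The types and names are illustrative only, not the VoltDB inline-node machinery.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.Function;
    import java.util.function.Predicate;

    // One pass over the input applies the scan filter, an inline projection and
    // an inline LIMIT, the way inline plan nodes piggyback on a scan: no extra
    // temporary table per operation.
    public final class InlineScanSketch {
        static List<Object[]> scanWithInlineNodes(List<Object[]> table,
                                                  Predicate<Object[]> filter,
                                                  Function<Object[], Object[]> projection,
                                                  int limit) {
            List<Object[]> output = new ArrayList<>();
            for (Object[] row : table) {
                if (!filter.test(row)) continue;       // scan predicate
                output.add(projection.apply(row));     // inline PROJECTION (row-wise)
                if (output.size() >= limit) break;     // inline LIMIT (row-wise counter)
            }
            return output;                             // one pass, one output table
        }

        public static void main(String[] args) {
            List<Object[]> t = List.of(new Object[] {1, "a", 10.0},
                                       new Object[] {2, "b", 20.0},
                                       new Object[] {3, "c", 30.0});
            // Roughly: SELECT col0, col2 FROM t WHERE col0 > 1 LIMIT 1
            List<Object[]> out = scanWithInlineNodes(
                    t, row -> (int) row[0] > 1, row -> new Object[] {row[0], row[2]}, 1);
            System.out.println(out.size());  // prints 1
        }
    }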

The classes of plan nodes are organized in a hierarchy. The root of the hierarchy is the AbstractPlanNode class.

How Does The Planner Work In Detail?

The planner takes a data structure describing a DML or DQL SQL statement. It produces a plan of execution which can be used to evaluate the statement. The Execution Engine executes the plan. Note that DDL is processed entirely differently, and is generally not part of this discussion. DDL which has embedded DQL or DML will need to be planned, as detailed here. For example, materialized views, single statement procedure definitions and delete statement limit plans are DDL statements which have plannable elements.

The Input

The input to the planner is originally a SQL statement, which is a text string. The HSQL front end translates the user's text into an HSQL data structure which represents the statement. This data structure is translated into a second data structure called VoltXML, which is similar to XML. VoltXML has elements, called VoltXMLElements, and elements have string valued attributes, just like XML. There is no support for namespaces in VoltXML, though it is hardly needed. This VoltXML is translated again into another data structure called AbstractParsedStatement and its subclasses. This AbstractParsedStatement, along with a catalog defining the tables, indexes and other DDL objects, is the input to the planner.

An AbstractParsedStatement object contains several pieces, including all the parts of the user's original SQL statement:

  1. The display list,
  2. The list of joined tables, including
    1. Joined tables,
    2. Derived tables, and
    3. Expressions used in WHERE and ON conditions,
  3. The list of GROUP BY expressions,
  4. The HAVING expression, if any,
  5. Any LIMIT or OFFSET constants,
  6. An indication of whether the statement is a SELECT, DELETE, INSERT, UPSERT or UPDATE statement, or a set operation like UNION or INTERSECT. As we said before, DDL is generally not planned, but some parts of DDL statements may require planning.
Note that a joined or derived table is associated with a tablescan, and that the planner generally works with tablescans rather than tables. For example, in this SQL statement:

    select *
    from t as left join t as right
    on left.a = right.a;

the aliases left and right name the table T, which is scanned twice independently as the left and right scans of the join.

The SQL standard and HSQL refer to tablescans as range variables, but the planner calls them tablescans. These are essentially a table with an alias, to distinguish between different references of the same table in a self join.

The Plan Nodes In Detail

Plan nodes compute some operation on one or more input tables. For example, an ORDERBY plan node sorts its only input table. A NESTEDLOOPJOIN node takes two tables as input and computes a join in a nested loop, scanning the outer table in the outer loop, and the inner table in the inner loop.

There are about 24 different plan nodes, organized in five categories. The categories are:

  • Operator nodes, for executing DML operations such as DELETE, INSERT and UPDATE,
  • Scan nodes, for scanning tables and indexes,
  • Join nodes, for calculating different kinds of joins,
  • Communication nodes, for sending result tables to the coordinator or receiving results from distributed executions,
  • Miscellaneous nodes. These include
    • Nodes to compute aggregates and GROUP BY. The aggregation algorithms include
      • Serial Aggregation, which is aggregation based on sorting,
      • Hash Aggregation, which is aggregation based on hash tables,
      • Partial Aggregation, which is a form of parallelized serial aggregation.
    • Nodes to compute set operations, like UNION or INTERSECT,
    • Nodes to compute ORDERBY,
    • Nodes to compute PROJECTION,
    • Nodes to compute MATERIALIZE,
    • Nodes to compute LIMIT and OFFSET,
    • Nodes to compute WINDOWFUNCTION.

The Execution Model

Plans and Fragments

The plan nodes are connected in one or two fragments. Each fragment is a plan node tree.

  • A one-fragment plan is a single partition plan. It is scheduled on a particular partition, and the resulting table is sent to the user.
  • A two-fragment plan is a multi partition plan. The two fragments are called the coordinator fragment and the distributed fragment.
  • The distributed fragment is scheduled to be sent to all the partitions and executed in parallel on the partitions.
  • The results of the executions of the distributed fragment on each of the partitions are sent to the coordinator node which executes the coordinator fragment. The coordinator fragment has a RECEIVE or a MERGERECEIVE node which knows how to combine the distributed results into a single table. (A sketch of this two-fragment flow follows the list.)
  • Plan node execution is similar for both the distributed and coordinator fragment.
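
Here is a rough, illustrative Java sketch of the two-fragment flow: a made-up "distributed fragment" runs on every partition in parallel, and a made-up "coordinator fragment" combines the per-partition results. None of this is the actual VoltDB scheduler or executor code.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.function.Function;

    // Two-fragment execution, very roughly: the distributed fragment runs on
    // every partition in parallel, then the coordinator fragment combines the
    // per-partition results. Partition contents and fragments are made up.
    public final class TwoFragmentSketch {
        public static void main(String[] args) throws Exception {
            List<List<Integer>> partitions = List.of(
                    List.of(3, 1, 4), List.of(1, 5), List.of(9, 2, 6));

            // Distributed fragment: count the rows on one partition that pass a filter.
            Function<List<Integer>, Integer> distributedFragment =
                    rows -> (int) rows.stream().filter(v -> v > 2).count();

            ExecutorService pool = Executors.newFixedThreadPool(partitions.size());
            List<Future<Integer>> partials = new ArrayList<>();
            for (List<Integer> partition : partitions) {
                Callable<Integer> work = () -> distributedFragment.apply(partition);
                partials.add(pool.submit(work));   // run in parallel on each partition
            }

            // Coordinator fragment: receive the per-partition results and combine them.
            int total = 0;
            for (Future<Integer> partial : partials) {
                total += partial.get();
            }
            pool.shutdown();
            System.out.println(total);   // 5 rows with value > 2 across all partitions
        }
    }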

Plan nodes have different kinds of children. A plan node can have child nodes which are out-of-line or inline. The latter are used for computations which can be done row-wise, in a single pass through the input table along with the parent node. For example, a LIMIT or OFFSET node is often made inline, since it just counts how many initial rows to allow or ignore. PROJECTION nodes can also be inlined, since they operate only row-wise. An ORDERBY node can generally not be inlined, since the entire table must be present to sort it. But if the ORDERBY node's input table comes from a RECEIVE node, and the planner can tell that the input is sorted, then the RECEIVE node can be replaced with a MERGERECEIVE node and the ORDERBY node can be inlined in the MERGERECEIVE node. The MERGERECEIVE node gets sorted input tables from distributed fragments. The MERGERECEIVE node merges the input tables by repeatedly choosing the smallest of the first values of its input tables. Indeed, the MERGERECEIVE node must have an inline ORDERBY node to carry the ordering expressions.
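
The merging behavior of MERGERECEIVE is essentially a k-way merge of already-sorted inputs. The following Java sketch shows that algorithm with a priority queue; it is an illustration of the idea, not the actual MERGERECEIVE executor, and all names are made up.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.Iterator;
    import java.util.List;
    import java.util.PriorityQueue;

    // K-way merge of sorted per-partition tables: repeatedly take the smallest
    // head row across all inputs. This illustrates the merging idea only.
    public final class MergeReceiveSketch {
        record Head(int value, Iterator<Integer> rest) { }

        static List<Integer> merge(List<List<Integer>> sortedInputs) {
            PriorityQueue<Head> heads = new PriorityQueue<>(Comparator.comparingInt(Head::value));
            for (List<Integer> input : sortedInputs) {
                Iterator<Integer> it = input.iterator();
                if (it.hasNext()) {
                    heads.add(new Head(it.next(), it));
                }
            }
            List<Integer> merged = new ArrayList<>();
            while (!heads.isEmpty()) {
                Head smallest = heads.poll();               // smallest head across all inputs
                merged.add(smallest.value());
                if (smallest.rest().hasNext()) {            // refill from the same input
                    heads.add(new Head(smallest.rest().next(), smallest.rest()));
                }
            }
            return merged;
        }

        public static void main(String[] args) {
            // Each partition delivered its rows already ordered by the sort key.
            System.out.println(merge(List.of(List.of(1, 4, 9), List.of(2, 3), List.of(5))));
            // [1, 2, 3, 4, 5, 9]
        }
    }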

Fragment Initialization

The plan nodes have a method for initialization. They first initialize their children, then they initialize themselves.

Fragment Execution

A fragment tree executes the operations of each of its plan nodes, starting with the leaves. At each node the execution function reads rows from the input tables, computes a function into a single row, perhaps alters that row with inline plan nodes and writes the output row to a single output table. A plan node can only have a single parent and a single output table.
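
A hedged Java sketch of this bottom-up order follows: each node first executes its children, then applies its own computation to the child output tables, producing exactly one output table. The classes and table representation are illustrative only.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.Function;

    // Execute a node tree starting at the leaves: each node first runs its
    // children, then applies its own computation to the child output tables.
    public final class ExecSketch {
        static final class Node {
            final List<Node> children = new ArrayList<>();
            final Function<List<List<String>>, List<String>> compute;  // child tables -> one output table
            Node(Function<List<List<String>>, List<String>> compute, Node... kids) {
                this.compute = compute;
                children.addAll(List.of(kids));
            }
        }

        static List<String> execute(Node node) {
            List<List<String>> childTables = new ArrayList<>();
            for (Node child : node.children) {
                childTables.add(execute(child));      // leaves execute first
            }
            return node.compute.apply(childTables);   // one output table per node
        }

        public static void main(String[] args) {
            Node leftScan  = new Node(ignored -> List.of("r1", "r2"));
            Node rightScan = new Node(ignored -> List.of("s1"));
            Node join = new Node(kids -> {
                List<String> out = new ArrayList<>();
                for (String l : kids.get(0)) {
                    for (String r : kids.get(1)) {
                        out.add(l + "+" + r);         // toy nested-loop join
                    }
                }
                return out;
            }, leftScan, rightScan);
            System.out.println(execute(join));        // [r1+s1, r2+s1]
        }
    }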

The Planner's Passes

The planner doesn't have passes in the way a compiler might; it's more of a pass itself. But it does have several phases:

  1. First, the planner calculates some join orders. The statement has a set of tablescans, and the efficiency of a plan depends on the order in which the tablescans are scanned. There is a large amount of confusing code devoted to this, but essentially each possible order of the tablescans is generated. There is a limit to the size of this order set: if there are more than 5 tables, the order in the statement is used, to avoid exponential explosion. A Java stored procedure may also have an explicit user-provided join order specification. (A sketch of this enumeration follows the list.)
  2. Each join order results in a join tree, whose Java type is JoinNode.
    1. The subtypes of JoinNode are SubqueryLeafNode, BranchNode and TableLeafNode. These represent derived tables (which we often call subqueries), joined tables and tablescans respectively.
    2. We do some analysis of the join predicates, to distribute them to the proper JoinNode nodes. For example,
      • If an expression contains only references to a single tablescan, that expression can be placed along with the TableLeafNode for the tablescan, to know if the expression can be used with an index.
      • Similarly, if the tablescan is the outer scan of a join, it can also be added to an outer-only list in the join node. In this case we will know to evaluate these outer-only nodes when scanning and not scan the inner table at all if the expression's evaluation is false.
      • Expressions which reference both branches of a join might be outer expressions of a higher level join.
      • For OUTER joins, the differing semantics of ON clauses and WHERE clauses also factor into where in the join tree a predicate should apply.
    3. The planner calculates a set of access paths for the tablescans. Note that this does not depend on the join order.
      1. These access paths tell how the tablescan will be scanned. It may be scanned directly or through an index.
      2. The scan associated with the access path may provide ordering for a statement level order by or a window function. This is stored in the access path as well.
      3. For each join order, the planner creates a subplan in SubPlanAssembler. This class traverses the join tree and, consulting the access paths, turns the TableLeafNode and SubqueryLeafNode join nodes into some kind of AbstractScanPlanNode, and turns the BranchNodes into some kind of AbstractJoinPlanNode. The result is a set of subplans. These are just trees of scans and joins. The other parts of the statement (order by, group by, limit, offset, having, and the DML operations of table insertion, update and deletion) are not represented here. In particular, if the plan is a multi-partition plan, it needs a SEND/RECEIVE pair to designate the boundary between the coordinator and distributed plans. But this pair will not be part of the subplan.
      4. For each subplan, the plan assembler in PlanAssembler adds new nodes on the top of the subplan tree, to form the complete tree. These may include:
        • A SEND node on the top of the plan, to send the result to the server,
        • SEND/RECEIVE pairs in multi-partition queries,
        • Aggregate nodes, to calculate GROUP BY groups and aggregate functions,
        • Window function nodes, to calculate window functions,
        • ORDERBY nodes, to implement sorting,
        • PROJECTION nodes, to calculate display lists.
        All but the first two are optional.
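
As mentioned in step 1 above, the join-order enumeration essentially generates every permutation of the tablescans, falling back to the statement's own order when there are more than 5 tables. Here is an illustrative Java sketch of that idea; only the 5-table cutoff comes from the text above, everything else (names, types) is made up.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    // Enumerate candidate join orders: every permutation of the tablescans,
    // unless there are too many tables, in which case only the statement's own
    // order is used (avoiding a factorial blow-up).
    public final class JoinOrderSketch {
        static final int MAX_TABLES_FOR_FULL_ENUMERATION = 5;

        static List<List<String>> candidateJoinOrders(List<String> tablescans) {
            if (tablescans.size() > MAX_TABLES_FOR_FULL_ENUMERATION) {
                return List.of(tablescans);            // fall back to the statement order
            }
            List<List<String>> orders = new ArrayList<>();
            permute(new ArrayList<>(tablescans), 0, orders);
            return orders;
        }

        private static void permute(List<String> scans, int from, List<List<String>> out) {
            if (from == scans.size()) {
                out.add(new ArrayList<>(scans));
                return;
            }
            for (int i = from; i < scans.size(); i++) {
                Collections.swap(scans, from, i);
                permute(scans, from + 1, out);
                Collections.swap(scans, from, i);      // undo the swap
            }
        }

        public static void main(String[] args) {
            System.out.println(candidateJoinOrders(List.of("R", "S", "T")).size());  // 6 orders
        }
    }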

Note that the transformation from ORDERBY->RECEIVE to MERGERECEIVE (with the ORDERBY inlined) discussed above is not done in the plan assembler. Also, sometimes nodes which could be made inline are not made so by the plan assembler. These and other final transformations could be performed in the plan assembler, but it seems simpler to do them locally, after the plan tree is formed.

The Microoptimizer

The microoptimizer takes a plan and applies various transformations. The current set of transformations includes:

  1. Pushing limit and offset nodes into distributed fragments and making them inline. If a statement has LIMIT 100, for example, there is no reason for distributed fragments to generate more than 100 rows; not all of those rows may be used anyway.
  2. Transforming plans with ORDERBY, LIMIT and OFFSET into plans with MERGERECEIVE and inline ORDERBY, LIMIT and OFFSET plan nodes.
  3. Making aggregate nodes inline. (See the sketch after this list.)
    1. A hash aggregate node reads each input row, and calculates the group by keys and a hash value from them. The hash aggregate node puts the group by key value into a hash table. It also updates the aggregate function values. For example, if GBV1 and GBV2 are group by keys, then the expression MIN(GBV1 * GBV2) would compute the minimum of the product of GBV1 and GBV2. This is a row-wise operation.
    2. A serial aggregate node reads each input row and calculates the group by keys. The input table will be sorted, so if the group by key values change we know we have seen an entire group and can output a row. In any case we update the aggregate values, like the MIN(GBV1*GBV2) expression above. So serial aggregate nodes can also be calculated row-wise.
    3. Partial aggregation is a kind of serial aggregation.
    So, the aggregate node's calculation can be done row-wise. This means the aggregate node can be inlined in some cases.
  4. Replacing an aggregate plan node with an index count plan node or a table count plan node.
  5. Replacing an aggregate plan node with an index scan plan node with an inline limit node.
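
To make the difference between hash and serial aggregation concrete, here is a hedged Java sketch of both strategies computing a per-group SUM. The row layout and class names are illustrative, not the VoltDB aggregate executors.

    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Objects;

    // Two row-wise strategies for: SELECT key, SUM(val) ... GROUP BY key.
    // Hash aggregation keeps running sums in a hash table keyed by the group-by
    // value; serial aggregation assumes the input is sorted by the group-by key
    // and emits a group as soon as the key changes.
    public final class AggregationSketch {
        record Row(String key, long val) { }

        static Map<String, Long> hashAggregate(List<Row> rows) {
            Map<String, Long> sums = new LinkedHashMap<>();
            for (Row r : rows) {                            // one pass, any input order
                sums.merge(r.key(), r.val(), Long::sum);    // update the group's running sum
            }
            return sums;
        }

        static Map<String, Long> serialAggregate(List<Row> sortedRows) {
            Map<String, Long> sums = new LinkedHashMap<>();
            String currentKey = null;
            long running = 0;
            for (Row r : sortedRows) {                      // input must be sorted by key
                if (!Objects.equals(currentKey, r.key())) {
                    if (currentKey != null) {
                        sums.put(currentKey, running);      // key changed: the group is complete
                    }
                    currentKey = r.key();
                    running = 0;
                }
                running += r.val();
            }
            if (currentKey != null) {
                sums.put(currentKey, running);              // flush the last group
            }
            return sums;
        }

        public static void main(String[] args) {
            List<Row> rows = List.of(new Row("a", 1), new Row("a", 2), new Row("b", 5));
            System.out.println(hashAggregate(rows));    // {a=3, b=5}
            System.out.println(serialAggregate(rows));  // {a=3, b=5} (this input happens to be sorted)
        }
    }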

 */
package org.voltdb.planner;



