/*
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *      http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.hudi.common.table.cdc;

/**
 * Defines the five CDC inference cases. The inference case determines which file is read to
 * extract the change data, and how the extraction is done.
 *
 * AS_IS:
 *   For this type, there must be a real cdc log file from which we get all or part of the change data.
 *   When `hoodie.table.cdc.supplemental.logging.mode` is {@link HoodieCDCSupplementalLoggingMode#DATA_BEFORE_AFTER},
 *     it keeps all the fields about the change data, including `op`, `ts_ms`, `before` and `after`.
 *     So it can be read and returned directly; no other files need to be loaded.
 *   When `hoodie.table.cdc.supplemental.logging.mode` is {@link HoodieCDCSupplementalLoggingMode#DATA_BEFORE},
 *     it keeps the `op`, the key and the `before` of the changing record.
 *     When `op` is equal to 'i' or 'u', get the current record from the current base/log file as `after`.
 *   When `hoodie.table.cdc.supplemental.logging.mode` is {@link HoodieCDCSupplementalLoggingMode#OP_KEY_ONLY},
 *     it keeps only the `op` and the key of the changing record.
 *     When `op` is equal to 'i', `before` is null, and get the current record
 *     from the current base/log file as `after`.
 *     When `op` is equal to 'u', get the previous record from the previous file slice as `before`,
 *     and get the current record from the current base/log file as `after`.
 *     When `op` is equal to 'd', get the previous record from the previous file slice as `before`, and `after` is null.
 *
 * BASE_FILE_INSERT:
 *   For this type, there must be a base file at the current instant. All the records in this
 *   file are newly inserted, so we can load it, mark all the records with `i`, and treat them as
 *   the value of `after`. The value of `before` for each record is null.
 *
 * BASE_FILE_DELETE:
 *   For this type, there must be an empty file at the current instant, but a non-empty base file
 *   at the previous instant. First we find the base file that is in the same file group and belongs
 *   to the previous instant. Then load it, mark all the records with `d`, and treat them as
 *   the value of `before`. The value of `after` for each record is null.
 *
 * LOG_FILE:
 *   For this type, a normal log file of a MOR table will be used. First we need to load the previous
 *   file slice (including the base file and other log files in the same file group). Then for each
 *   record (called `current record` hereafter) from the log file, get its key, and execute the following steps:
 *     1) if the current record is deleted,
 *       a) if there is a record with the same key in the data loaded (called `loaded record` hereafter),
 *          `op` is 'd', `before` is the loaded record, `after` is null;
 *       b) if the loaded record does not exist, just skip.
 *     2) if the current record is not deleted,
 *       a) if there is a loaded record, `op` is 'u', `before` is the loaded record, `after` is the current record;
 *       b) if the loaded record does not exist, `op` is 'i', `before` is null, `after` is the current record.
 *
 * REPLACE_COMMIT:
 *   For this type, it must be a replacecommit, like INSERT_OVERWRITE and DROP_PARTITION. It drops
 *   a whole file group. First we find this file group. Then load it, mark all the records with
 *   `d`, and treat them as the value of `before`. The value of `after` for each record is null.
 */
public enum HoodieCDCInferenceCase {

  AS_IS,
  BASE_FILE_INSERT,
  BASE_FILE_DELETE,
  LOG_FILE,
  REPLACE_COMMIT;

}
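
// ---------------------------------------------------------------------------
// Illustrative sketch only; not part of Hudi. It restates the LOG_FILE rules
// from the Javadoc above as plain Java so the four outcomes are easy to see.
// The class name, the Change holder, and the use of String records in a Map
// keyed by record key are all assumptions made for this example; Hudi's real
// CDC readers operate on their own record and file-slice types.
// ---------------------------------------------------------------------------
final class LogFileInferenceSketch {

  /** One inferred change: `op` plus the `before` and `after` images (either may be null). */
  static final class Change {
    final String op;
    final String before;
    final String after;

    Change(String op, String before, String after) {
      this.op = op;
      this.before = before;
      this.after = after;
    }
  }

  /**
   * Applies the LOG_FILE rules to a single record.
   *
   * @param loaded  records of the previous file slice, keyed by record key (hypothetical shape)
   * @param key     key of the current log record
   * @param current the current record, or null if the log record is a delete
   * @return the inferred change, or null when the record should be skipped
   */
  static Change infer(java.util.Map<String, String> loaded, String key, String current) {
    String before = loaded.get(key);
    if (current == null) {
      // 1) deleted: emit 'd' if a loaded record exists, otherwise skip
      return before == null ? null : new Change("d", before, null);
    }
    // 2) not deleted: 'u' if a loaded record exists, otherwise 'i'
    return before == null
        ? new Change("i", null, current)
        : new Change("u", before, current);
  }
}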