0.5.1 - 2015-01-22
1. Added enforcement of the maximum number of export shards (currently 500)
when calculating splits for BigQueryInputFormat.
2. Fixed a bug where BigQueryOutputCommitter.needsTaskCommit() incorrectly
depended on a Bigquery.Tables.list() call; table listing is only eventually
consistent, so occasionally a task would erroneously fail to commit data.
3. Removed extraneous table-deletion in BigQueryOutputCommitter.abortTask();
cleanup occurs during job cleanup anyway, and this would incorrectly
(but harmlessly) try to delete a nonexistent table for map tasks.
0.5.0 - 2014-12-16
1. BigQueryInputFormat has been renamed GsonBigQueryInputFormat to better
reflect its nature as a gson-based format. A forwarding declaration
was left in place to maintain compatibility.
2. JsonTextBigQueryInputFormat was added to provide lines of JSON text as
they appear in the BigQuery export.
3. When using sharded BigQuery exports (the default), the keys will no
longer be in increasing order per mapper. Instead, the keys will be
as they are reported by the delegate RecordReader which is generally
going to be the byte position within the current file. However, the
sharded export creates many files per mapper so this position will
appear to reset to 0 when we switch between files. The record reader's
getProgress() will still report progress across the entire dataset that
the record reader is responsible for.
4. The BigQuery connector can now ingest Avro-based BigQuery exports. Using
an Avro-based export should result in less data transferred between your
MapReduce job and Google Cloud Storage and should require less CPU time
to parse the data files. To use Avro, set the input format to
AvroBigQueryInputFormat and update your map code to expect LongWritable
keys and Avro GenericData.Record values (see the sketch after these notes).
5. Hadoop 2 support was added for java MapReduce. Streaming support for
Hadoop 2 will be included in a future release.
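
A minimal sketch of a mapper consuming the Avro-based export described in
item 4 above, assuming the Hadoop 2 mapreduce API from item 5; the "word"
field name and the output types are hypothetical, not part of the connector.

  import java.io.IOException;
  import org.apache.avro.generic.GenericData;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  // The driver would set the input format with
  // job.setInputFormatClass(AvroBigQueryInputFormat.class).
  public class AvroWordMapper
      extends Mapper<LongWritable, GenericData.Record, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);

    @Override
    protected void map(LongWritable key, GenericData.Record value, Context context)
        throws IOException, InterruptedException {
      Object word = value.get("word");  // hypothetical column in the input table
      if (word != null) {
        context.write(new Text(word.toString()), ONE);
      }
    }
  }
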
0.4.5 - 2014-10-17
1. Attempting to acquire an OAuth access token will now be retried when
using .p12 or installed application (JWT) credentials if there is a
recoverable error such as an HTTP 5XX response code or an IOException.
0.4.4 - 2014-09-18
1. Added new classes implementing the hadoop.mapred.* interfaces by wrapping
the existing hadoop.mapreduce.* implementations and delegating
appropriately. This enables backwards compatibility for some stacks which
depend on the "old api" interfaces, including now being able to use
the standard "hadoop-streaming.jar" to run binary mappers/reducers with
the BigQuery connector. Note that in the absence of a blocking driver
program to call BigQueryInputFormat.cleanupJob, you must instead explicitly
clean up the temporary exported files after a hadoop-streaming job if
using the input connector. Extra cleanup is not necessary if only using
the output connector in hadoop-streaming. The new top-level classes:
com.google.cloud.hadoop.io.bigquery.mapred.BigQueryMapredInputFormat
com.google.cloud.hadoop.io.bigquery.mapred.BigQueryMapredOutputFormat
See the javadocs for the associated RecordReader/Writer, InputSplit, and
OutputCommitter classes.
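
A hedged sketch of wiring these wrapper classes into an old-API JobConf;
the driver class name is a placeholder, and the cleanup comment restates
the caveat above rather than naming a specific connector call.

  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;
  import com.google.cloud.hadoop.io.bigquery.mapred.BigQueryMapredInputFormat;
  import com.google.cloud.hadoop.io.bigquery.mapred.BigQueryMapredOutputFormat;

  public class OldApiDriver {
    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(OldApiDriver.class);
      conf.setInputFormat(BigQueryMapredInputFormat.class);
      conf.setOutputFormat(BigQueryMapredOutputFormat.class);
      // ... set mapper/reducer classes and the usual mapred.bq.* properties ...
      JobClient.runJob(conf);  // blocks until the job finishes
      // Because this driver blocks, input-side temporary exports can be
      // cleaned up here; a hadoop-streaming job without such a driver must
      // clean them up separately, as noted above.
    }
  }
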
0.4.3 - 2014-08-07
1. Added better validation to BigQueryUtils.getSchemaFromString used by
BigQueryOutputFormat to throw descriptive IllegalArgumentExceptions
instead of NullPointerExceptions for most types of malformed schemas.
2. Fixed a bug in BigQueryUtils.getSchemaFromString to support 'repeated'
fields inside of nested records; previously an IllegalStateException was
thrown if a nested record contained more than one inner field.
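
A hedged illustration of the schema shape from item 2, a nested record whose
inner fields include a 'repeated' one; the field names are made up.

  public class RepeatedNestedSchemaExample {
    // A schema string with a nested record containing a repeated field, the
    // case item 2 fixes; per these notes it can be passed to
    // BigQueryUtils.getSchemaFromString without an IllegalStateException.
    static final String SCHEMA =
        "[{\"name\": \"person\", \"type\": \"RECORD\", \"fields\": ["
        + "{\"name\": \"fullName\", \"type\": \"STRING\"},"
        + "{\"name\": \"phoneNumbers\", \"type\": \"STRING\", \"mode\": \"REPEATED\"}"
        + "]}]";
  }
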
0.4.2 - 2014-06-05
1. Misc updates in library dependencies.
0.4.1 - 2014-05-08
1. Removed mapred.bq.output.num.records.batch in favor of
mapred.bq.output.buffer.size.
2. Misc updates in library dependencies.
0.4.0 - 2014-04-09
1. Preview release of BigQuery connector.
2. Added support for different projectIds owning the input/output tables in
BigQuery versus the projectId performing the BigQuery jobs. This is necessary
for reading public tables like publicdata:samples.shakespeare. Distinguished
by mapred.bq.input.project.id and mapred.bq.output.project.id (see the
configuration sketch after these notes).
3. Deprecated mapred.bq.input.query and modified the WordCount sample to
eliminate usage of the query.
4. Added a new BigQueryOutputFormat mode which uses resumable uploads; can be
enabled with mapred.bq.output.async.write.enabled.
5. Jar file renamed to bigquery-connector-jar.
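
A minimal configuration sketch for items 2 and 4 above; the output project
id is a placeholder.

  import org.apache.hadoop.conf.Configuration;

  public class ProjectIdConfig {
    public static void apply(Configuration conf) {
      // Item 2: the projects that own the input and output tables may differ
      // from the project performing the BigQuery jobs. Here the input table
      // is a public dataset.
      conf.set("mapred.bq.input.project.id", "publicdata");
      conf.set("mapred.bq.output.project.id", "my-project-id");  // placeholder
      // Item 4: opt in to the resumable-upload output mode.
      conf.setBoolean("mapred.bq.output.async.write.enabled", true);
    }
  }
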
0.3.0 - 2014-03-21
1. Added CHANGES.txt for release notes to be included in connector jarfile.
2. Fixed a bug where different task attempts could collide (either through
retries or speculative execution) by using full TaskAttemptID for temp
tableIds in BigQueryOutputFormat.
3. Expanded debug logging, especially in the OutputFormat and helpers.
4. Modified BigQueryOutputCommitter to correctly use the CopyTable API and
avoid incurring "query" costs in the output path.
5. Eliminated the need to call BigQueryInputFormat.setInputs(Job) in the main
class; this is now set automatically, if necessary, inside getSplits.
6. The field 'mapred.bq.temp.gcs.path' is now optional; auto-generated based
on JobID in BigQueryInputFormat.getSplits from 'mapred.bq.gcs.bucket' if
unspecified. No longer set in BigQueryConfiguration.configureBigQueryInput.
7. Fixed BigQueryInputFormat to only delete the input BigQuery table if a
query was actually run AND mapred.bq.query.results.table.delete is true;
changed the latter's default value from 'true' to 'false'.
8. Fixed a bug where JsonRecordReader.initialize was getting called twice,
which resulted in extraneous GCS API calls.
9. Implemented a new "sharded" export mode for the BigQueryInputFormat which
allows the MapReduce to progress concurrently with the BigQuery export;
RecordReaders will read files as they become available, or block if more
files are expected but are not yet available. Toggled on/off using
"mapred.bq.input.sharded.export.enable" = [true|false]. The number of
export directories is based on "mapred.map.tasks", but may automatically
choose a smaller number of shards if the BigQuery table is small.
This mode is now the default export mode for BigQueryInputFormat.
10. Changed credential configuration values to allow per-connector overrides.
To use the metadata service, no extra configuration is required. To use a
PKCS12 private key file, specify "mapred.bq.auth.service.account.email"
and "mapred.bq.auth.service.account.keyfile". To use the installed app
workflow, set "mapred.bq.service.account.enable" to "false" and
"mapred.bq.auth.client.id", "mapred.bq.auth.client.secret" and
"mapred.bq.auth.client.file" to appropriate values. Both PKCS12 and
installed app workflows require the credential file to exist on all
nodes and at the same path (see the configuration sketch after these notes).
11. Removed BigQueryInputFormat-related cleanup from the
BigQueryOutputCommitter so that the input and output formats operate
independently. The main class must now explicitly call
BigQueryInputFormat.cleanupJob(JobContext) at the end of a job.
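
A hedged sketch pulling together the 0.3.0 configuration knobs from items 6,
9 and 10 and the explicit cleanup call from item 11; the bucket, service
account email, and key-file path are placeholders, and the newer mapreduce
Job API plus the cleanupJob(JobContext) entry point named in item 11 are
assumed.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.mapreduce.Job;
  import com.google.cloud.hadoop.io.bigquery.BigQueryInputFormat;

  public class ShardedExportDriver {
    public static void run(Configuration conf) throws Exception {
      // Item 6: only the bucket is required; the temp path is derived from the JobID.
      conf.set("mapred.bq.gcs.bucket", "my-bucket");  // placeholder
      // Item 9: sharded export is the default; the flag makes the choice explicit.
      conf.setBoolean("mapred.bq.input.sharded.export.enable", true);
      // Item 10: PKCS12 credentials; the key file must exist at the same path
      // on every node.
      conf.set("mapred.bq.auth.service.account.email", "svc-account@example.com");  // placeholder
      conf.set("mapred.bq.auth.service.account.keyfile", "/etc/secrets/bq-key.p12");  // placeholder

      Job job = Job.getInstance(conf, "bigquery-sharded-export");
      // ... input/output formats, mapper, reducer, output table settings ...
      boolean ok = job.waitForCompletion(true);

      // Item 11: input-side cleanup is no longer done by the OutputCommitter,
      // so the driver calls it explicitly (Job is a JobContext).
      BigQueryInputFormat.cleanupJob(job);
      if (!ok) {
        throw new RuntimeException("BigQuery MapReduce job failed");
      }
    }
  }
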
0.2.0 - 2014-02-12
1. Compiled connector examples and scripts for running the sample mapreduce
are now available in bq-toolkit-0.2.0.tar.gz.
2. Added low-level retries for transient server errors.
3. Improved debug logging.
4. Fixed a bug where the connector failed to perform the BigQuery export if
DEFAULT_FS was set to 'hdfs'.
5. GCS export paths no longer contain the jobIdentifier, and instead are
written to gs:///hadoop/tmp/bigquery/data-/data-*.json.