0.5.1 - 2015-01-22
1. Added enforcement of the maximum number of export shards (currently 500)
when calculating splits for BigQueryInputFormat.
2. Fixed a bug where BigQueryOutputCommitter.needsTaskCommit() incorrectly
depended on a Bigquery.Tables.list() call; table listing is only eventually
consistent, so occasionally a task would erroneously fail to commit data.
3. Removed extraneous table-deletion in BigQueryOutputCommitter.abortTask();
cleanup occurs during job cleanup anyway, and this would incorrectly
(but harmlessly) try to delete a nonexistent table for map tasks.
0.5.0 - 2014-12-16
1. BigQueryInputFormat has been renamed GsonBigQueryInputFormat to better
reflect its nature as a gson-based format. A forwarding declaration
was left in place to maintain compatibility.
2. JsonTextBigQueryInputFormat was added to provide lines of JSON text as
they appear in the BigQuery export.
3. When using sharded BigQuery exports (the default), the keys will no
longer be in increasing order per mapper. Instead, the keys will be
as they are reported by the delegate RecordReader which is generally
going to be the byte position within the current file. However, the
sharded export creates many files per mapper so this position will
appear to reset to 0 when we switch between files. The record reader's
getProgress() will still report progress across the entire dataset that
the record reader is responsible for.
4. The BigQuery connector can now ingest Avro-based BigQuery exports. Using
an Avro-based export should result in less data transferred between your
MapReduce job and Google Cloud Storage and should require less CPU time
to parse the data files. To use Avro, set the input format to
AvroBigQueryInputFormat and update your map code to expect LongWritable
keys and Avro GenericData.Record values (see the sketch after these notes).
5. Hadoop 2 support was added for java MapReduce. Streaming support for
Hadoop 2 will be included in a future release.
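
A minimal sketch of a mapper consuming the Avro-based export described in
item 4 above, assuming the Hadoop 2 mapreduce API from item 5; the "word"
field name and the output types are hypothetical, not part of the connector.

  import java.io.IOException;
  import org.apache.avro.generic.GenericData;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  // The driver would set the input format with
  // job.setInputFormatClass(AvroBigQueryInputFormat.class).
  public class AvroWordMapper
      extends Mapper<LongWritable, GenericData.Record, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);

    @Override
    protected void map(LongWritable key, GenericData.Record value, Context context)
        throws IOException, InterruptedException {
      Object word = value.get("word");  // hypothetical column in the input table
      if (word != null) {
        context.write(new Text(word.toString()), ONE);
      }
    }
  }
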
0.4.5 - 2014-10-17
1. Attempting to acquire an OAuth access token will now be retried when
using .p12 or installed application (JWT) credentials if there is a
recoverable error such as an HTTP 5XX response code or an IOException.
0.4.4 - 2014-09-18
1. Added new classes implementing the hadoop.mapred.* interfaces by wrapping
the existing hadoop.mapreduce.* implementations and delegating
appropriately. This enables backwards compatibility for some stacks which
depend on the "old api" interfaces, including now being able to use
the standard "hadoop-streaming.jar" to run binary mappers/reducers with
the BigQuery connector. Note that in the absence of a blocking driver
program to call BigQueryInputFormat.cleanupJob, you must instead explicitly
clean up the temporary exported files after a hadoop-streaming job if
using the input connector. Extra cleanup is not necessary if only using
the output connector in hadoop-streaming. The new top-level classes:
com.google.cloud.hadoop.io.bigquery.mapred.BigQueryMapredInputFormat
com.google.cloud.hadoop.io.bigquery.mapred.BigQueryMapredOutputFormat
See the javadocs for the associated RecordReader/Writer, InputSplit, and
OutputCommitter classes.
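
A hedged sketch of wiring these wrapper classes into an old-API JobConf;
the driver class name is a placeholder, and the cleanup comment restates
the caveat above rather than naming a specific connector call.

  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;
  import com.google.cloud.hadoop.io.bigquery.mapred.BigQueryMapredInputFormat;
  import com.google.cloud.hadoop.io.bigquery.mapred.BigQueryMapredOutputFormat;

  public class OldApiDriver {
    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(OldApiDriver.class);
      conf.setInputFormat(BigQueryMapredInputFormat.class);
      conf.setOutputFormat(BigQueryMapredOutputFormat.class);
      // ... set mapper/reducer classes and the usual mapred.bq.* properties ...
      JobClient.runJob(conf);  // blocks until the job finishes
      // Because this driver blocks, input-side temporary exports can be
      // cleaned up here; a hadoop-streaming job without such a driver must
      // clean them up separately, as noted above.
    }
  }
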
0.4.3 - 2014-08-07
1. Added better validation to BigQueryUtils.getSchemaFromString used by
BigQueryOutputFormat to throw descriptive IllegalArgumentExceptions
instead of NullPointerExceptions for most types of malformed schemas.
2. Fixed a bug in BigQueryUtils.getSchemaFromString to support 'repeated'
fields inside of nested records; previously an IllegalStateException was
thrown if a nested record contained more than one inner field.
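
A hedged illustration of the schema shape from item 2, a nested record whose
inner fields include a 'repeated' one; the field names are made up.

  public class RepeatedNestedSchemaExample {
    // A schema string with a nested record containing a repeated field, the
    // case item 2 fixes; per these notes it can be passed to
    // BigQueryUtils.getSchemaFromString without an IllegalStateException.
    static final String SCHEMA =
        "[{\"name\": \"person\", \"type\": \"RECORD\", \"fields\": ["
        + "{\"name\": \"fullName\", \"type\": \"STRING\"},"
        + "{\"name\": \"phoneNumbers\", \"type\": \"STRING\", \"mode\": \"REPEATED\"}"
        + "]}]";
  }
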
0.4.2 - 2014-06-05
1. Misc updates in library dependencies.
0.4.1 - 2014-05-08
1. Removed mapred.bq.output.num.records.batch in favor of
mapred.bq.output.buffer.size.
2. Misc updates in library dependencies.
0.4.0 - 2014-04-09
1. Preview release of BigQuery connector.
2. Added support for different projectIds owning the input/output tables in
BigQuery versus the projectId performing the BigQuery jobs. This is necessary
for reading public tables like publicdata:samples.shakespeare. Distinguished
by mapred.bq.input.project.id and mapred.bq.output.project.id (see the
configuration sketch after these notes).
3. Deprecated mapred.bq.input.query and modified the WordCount sample to
eliminate usage of the query.
4. Added a new BigQueryOutputFormat mode which uses resumable uploads; can be
enabled with mapred.bq.output.async.write.enabled.
5. Jar file renamed to bigquery-connector-jar.
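
A minimal configuration sketch for items 2 and 4 above; the output project
id is a placeholder.

  import org.apache.hadoop.conf.Configuration;

  public class ProjectIdConfig {
    public static void apply(Configuration conf) {
      // Item 2: the projects that own the input and output tables may differ
      // from the project performing the BigQuery jobs. Here the input table
      // is a public dataset.
      conf.set("mapred.bq.input.project.id", "publicdata");
      conf.set("mapred.bq.output.project.id", "my-project-id");  // placeholder
      // Item 4: opt in to the resumable-upload output mode.
      conf.setBoolean("mapred.bq.output.async.write.enabled", true);
    }
  }
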
0.3.0 - 2014-03-21
1. Added CHANGES.txt for release notes to be included in connector jarfile.
2. Fixed a bug where different task attempts could collide (either through
retries or speculative execution) by using full TaskAttemptID for temp
tableIds in BigQueryOutputFormat.
3. Expanded debug logging, especially in the OutputFormat and helpers.
4. Modified BigQueryOutputCommitter to correctly use the CopyTable API and
avoid incurring "query" costs in the output path.
5. Eliminated the need to call BigQueryInputFormat.setInputs(Job) in the main
class; this is now set automatically, if necessary, inside getSplits.
6. The field 'mapred.bq.temp.gcs.path' is now optional; auto-generated based
on JobID in BigQueryInputFormat.getSplits from 'mapred.bq.gcs.bucket' if
unspecified. No longer set in BigQueryConfiguration.configureBigQueryInput.
7. Fixed BigQueryInputFormat to only delete the input BigQuery table if a
query was actually run AND mapred.bq.query.results.table.delete is true;
changed the latter's default value from 'true' to 'false'.
8. Fixed a bug where JsonRecordReader.initialize was getting called twice,
which resulted in extraneous GCS API calls.
9. Implemented a new "sharded" export mode for the BigQueryInputFormat which
allows the MapReduce to progress concurrently with the BigQuery export;
RecordReaders will read files as they become available, or block if more
files are expected but are not yet available. Toggled on/off using
"mapred.bq.input.sharded.export.enable" = [true|false]. The number of
export directories is based on "mapred.map.tasks", but may automatically
choose a smaller number of shards if the BigQuery table is small.
This mode is now the default export mode for BigQueryInputFormat.
10. Changed credential configuration values to allow per-connector overrides.
To use the metadata service, no extra configuration is required. To use a
PKCS12 private key file, specify "mapred.bq.auth.service.account.email"
and "mapred.bq.auth.service.account.keyfile". To use the installed app
workflow, set "mapred.bq.service.account.enable" to "false" and
"mapred.bq.auth.client.id", "mapred.bq.auth.client.secret" and
"mapred.bq.auth.client.file" to appropriate values. Both PKCS12 and
installed app workflows require the credential file to exist on all
nodes and at the same path (see the configuration sketch after these notes).
11. Removed BigQueryInputFormat-related cleanup from the
BigQueryOutputCommitter so that the input and output formats operate
independently. The main class must now explicitly call
BigQueryInputFormat.cleanupJob(JobContext) at the end of a job.
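
A hedged sketch pulling together the 0.3.0 configuration knobs from items 6,
9 and 10 and the explicit cleanup call from item 11; the bucket, service
account email, and key-file path are placeholders, and the newer mapreduce
Job API plus the cleanupJob(JobContext) entry point named in item 11 are
assumed.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.mapreduce.Job;
  import com.google.cloud.hadoop.io.bigquery.BigQueryInputFormat;

  public class ShardedExportDriver {
    public static void run(Configuration conf) throws Exception {
      // Item 6: only the bucket is required; the temp path is derived from the JobID.
      conf.set("mapred.bq.gcs.bucket", "my-bucket");  // placeholder
      // Item 9: sharded export is the default; the flag makes the choice explicit.
      conf.setBoolean("mapred.bq.input.sharded.export.enable", true);
      // Item 10: PKCS12 credentials; the key file must exist at the same path
      // on every node.
      conf.set("mapred.bq.auth.service.account.email", "svc-account@example.com");  // placeholder
      conf.set("mapred.bq.auth.service.account.keyfile", "/etc/secrets/bq-key.p12");  // placeholder

      Job job = Job.getInstance(conf, "bigquery-sharded-export");
      // ... input/output formats, mapper, reducer, output table settings ...
      boolean ok = job.waitForCompletion(true);

      // Item 11: input-side cleanup is no longer done by the OutputCommitter,
      // so the driver calls it explicitly (Job is a JobContext).
      BigQueryInputFormat.cleanupJob(job);
      if (!ok) {
        throw new RuntimeException("BigQuery MapReduce job failed");
      }
    }
  }
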
0.2.0 - 2014-02-12
1. Compiled connector examples and scripts for running the sample mapreduce
are now available in bq-toolkit-0.2.0.tar.gz.
2. Added low-level retries for transient server errors.
3. Improved debug logging.
4. Fixed a bug where the connector failed to perform the BigQuery export if
DEFAULT_FS was set to 'hdfs'.
5. GCS export paths no longer contain the jobIdentifier, and instead are
written to gs:///hadoop/tmp/bigquery/data-/data-*.json.