package com.github.mjakubowski84.parquet4s
/** Holds factories of Akka Streams / Pekko Streams sources and sinks that allow reading from and writing to Parquet
* files.
*/
object ParquetStreams {
/** Creates a [[com.github.mjakubowski84.parquet4s.ScalaCompat.stream.scaladsl.Source]] that reads Parquet data from
* the specified path. If there are multiple files at the path, the order in which the files are loaded is determined
* by the underlying filesystem.
*
* The path can refer to a local file, HDFS, AWS S3, Google Storage, Azure, etc. Please refer to the Hadoop client
* documentation or your data provider to learn how to configure the connection.
*
* Can also read partitioned directories. The filter applies to partition values as well. Partition values are set as
* fields in read entities at the path defined by the partition name. The path can be a simple column name or a
* dot-separated path to a nested field. Missing intermediate fields are created automatically for each read record.
*
* Allows turning on a projection over the original file schema in order to boost read performance when not all
* columns are required to be read.
*
* Provides an explicit API for both custom data types and generic records.
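* @example
*   A minimal read sketch, assuming the Pekko flavour of the API. The `ParquetStreams` builder calls are from
*   parquet4s; the `User` class and the path are hypothetical illustrations:
*   {{{
*   import com.github.mjakubowski84.parquet4s.{ParquetStreams, Path}
*   import org.apache.pekko.actor.ActorSystem
*   import org.apache.pekko.stream.scaladsl.Sink
*
*   // Hypothetical record type matching the file schema.
*   case class User(name: String, age: Int)
*
*   implicit val system: ActorSystem = ActorSystem()
*
*   // Read all User records found under the given path and print them.
*   ParquetStreams.fromParquet
*     .as[User]
*     .read(Path("/data/users"))
*     .runWith(Sink.foreach(println))
*   }}}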
* @return
* Builder of the source.
*/
def fromParquet: ParquetSource.FromParquet = ParquetSource.FromParquetImpl
/** Creates a [[com.github.mjakubowski84.parquet4s.ScalaCompat.stream.scaladsl.Sink]] that writes Parquet data to a
* single file at the specified path (including the file name).
*
* The path can refer to a local file, HDFS, AWS S3, Google Storage, Azure, etc. Please refer to the Hadoop client
* documentation or your data provider to learn how to configure the connection.
*
* Provides an explicit API for both custom data types and generic records.
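* @example
*   A minimal write sketch, assuming the Pekko flavour of the API. The `ParquetStreams` builder calls are from
*   parquet4s; the `User` class, the sample data, and the path are hypothetical illustrations:
*   {{{
*   import com.github.mjakubowski84.parquet4s.{ParquetStreams, Path}
*   import org.apache.pekko.actor.ActorSystem
*   import org.apache.pekko.stream.scaladsl.Source
*
*   case class User(name: String, age: Int)
*
*   implicit val system: ActorSystem = ActorSystem()
*
*   // Write two records to a single Parquet file.
*   Source(List(User("Alice", 33), User("Bob", 42)))
*     .runWith(ParquetStreams.toParquetSingleFile.of[User].write(Path("/data/users.parquet")))
*   }}}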
* @return
* Builder of a sink that writes a single Parquet file.
*/
def toParquetSingleFile: SingleFileParquetSink.ToParquet = SingleFileParquetSink.ToParquetImpl
/** Builds a flow that:
*   - is designed to write Parquet files indefinitely,
*   - is able to (optionally) partition data by a list of provided fields,
*   - flushes and rotates files after a given number of rows is written to the partition or after a given time
*     period elapses,
*   - outputs each incoming message after it is written, but can instead write an effect of a provided message
*     transformation.
*
* Provides an explicit API for both custom data types and generic records.
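* @example
*   A minimal sketch of an indefinite, partitioned write, assuming the Pekko flavour of the API. The
*   `ParquetStreams` and `Col` builder calls are from parquet4s; the `Event` class, the sample data, the path, and
*   the rotation thresholds are hypothetical illustrations:
*   {{{
*   import com.github.mjakubowski84.parquet4s.{Col, ParquetStreams, Path}
*   import org.apache.pekko.actor.ActorSystem
*   import org.apache.pekko.stream.scaladsl.{Sink, Source}
*   import scala.concurrent.duration._
*
*   case class Event(topic: String, payload: String)
*
*   implicit val system: ActorSystem = ActorSystem()
*
*   // Partition by the "topic" column; rotate a file once 1M rows are
*   // written to its partition or 30 seconds have elapsed.
*   val writeFlow = ParquetStreams.viaParquet
*     .of[Event]
*     .partitionBy(Col("topic"))
*     .maxCount(1024 * 1024)
*     .maxDuration(30.seconds)
*     .write(Path("/data/events"))
*
*   Source(List(Event("news", "hello"))).via(writeFlow).runWith(Sink.ignore)
*   }}}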
* @return
* Builder of the flow.
*/
def viaParquet: ParquetPartitioningFlow.ViaParquet = ParquetPartitioningFlow.ViaParquetImpl
}