
com.github.mjakubowski84.parquet4s.ParquetStreams.scala

package com.github.mjakubowski84.parquet4s

/** Holds factory of Akka Streams / Pekko Streams sources and sinks that allow reading from and writing to Parquet
  * files.
  */
object ParquetStreams {

  /** Creates a [[com.github.mjakubowski84.parquet4s.ScalaCompat.stream.scaladsl.Source]] that reads Parquet data from
    * the specified path. If there are multiple files at the path, the order in which they are loaded is determined by
    * the underlying filesystem.
    *
    * The path can refer to a local file, HDFS, AWS S3, Google Storage, Azure, etc. Please refer to the Hadoop client
    * documentation or your data provider to learn how to configure the connection.
    *
    * Can also read partitioned directories. The filter applies to partition values as well. Partition values are set
    * as fields in read entities at the path defined by the partition name. The path can be a simple column name or a
    * dot-separated path to a nested field. Missing intermediate fields are created automatically for each read record.
    *
    * Allows turning on a projection over the original file schema in order to boost read performance when not all
    * columns need to be read.
    *
    * Provides an explicit API for both custom data types and generic records.
    * @return
    *   Builder of the source.
    */
  def fromParquet: ParquetSource.FromParquet = ParquetSource.FromParquetImpl

  /** Creates a [[com.github.mjakubowski84.parquet4s.ScalaCompat.stream.scaladsl.Sink]] that writes Parquet data to a
    * single file at the specified path (including the file name).
    *
    * The path can refer to a local file, HDFS, AWS S3, Google Storage, Azure, etc. Please refer to the Hadoop client
    * documentation or your data provider to learn how to configure the connection.
    *
    * Provides an explicit API for both custom data types and generic records.
    * @return
    *   Builder of a sink that writes a Parquet file.
    */
  def toParquetSingleFile: SingleFileParquetSink.ToParquet = SingleFileParquetSink.ToParquetImpl

  /** Builds a flow that:
    *   1. Is designed to write Parquet files indefinitely.
    *   2. Is able to (optionally) partition data by a list of provided fields.
    *   3. Flushes and rotates files after a given number of rows is written to the partition or a given time period
    *      elapses.
    *   4. Outputs each incoming message after it is written, but can write the result of a provided message
    *      transformation instead of the message itself.
    *
    * Provides an explicit API for both custom data types and generic records.
    * @return
    *   Builder of the flow.
    */
  def viaParquet: ParquetPartitioningFlow.ViaParquet = ParquetPartitioningFlow.ViaParquetImpl
}
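A minimal read-and-write sketch against the API above (not part of the source file): it assumes the Pekko flavour of the library (the parquet4s-pekko and pekko-stream artifacts on the classpath); the User case class and the /tmp paths are hypothetical.

// Sketch only: parquet4s-pekko and pekko-stream assumed on the classpath.
// `User` and the paths below are made up for illustration.
import com.github.mjakubowski84.parquet4s.{ParquetStreams, Path}
import org.apache.pekko.actor.ActorSystem
import org.apache.pekko.stream.scaladsl.{Sink, Source}

case class User(id: Long, name: String)

object ReadWriteExample extends App {
  implicit val system: ActorSystem = ActorSystem("example")
  import system.dispatcher

  // toParquetSingleFile: write a fixed sequence of records to one file.
  val written = Source(List(User(1L, "Alice"), User(2L, "Bob")))
    .runWith(
      ParquetStreams.toParquetSingleFile
        .of[User]
        .write(Path("/tmp/users/part-0.parquet"))
    )

  // fromParquet: read every record in the directory back as User.
  val done = written.flatMap { _ =>
    ParquetStreams.fromParquet
      .as[User]
      .read(Path("/tmp/users"))
      .runWith(Sink.foreach(println))
  }

  done.onComplete(_ => system.terminate())
}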
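And a sketch of the indefinite, partitioning flow that viaParquet describes; the Event case class, the ticking source, and the maxCount/maxDuration values are illustrative assumptions, not prescriptions.

// Sketch only: writes an unbounded stream, partitioned by a String field and
// rotated by row count or elapsed time. All names and values are illustrative.
import com.github.mjakubowski84.parquet4s.{Col, ParquetStreams, Path}
import org.apache.pekko.actor.ActorSystem
import org.apache.pekko.stream.scaladsl.{Sink, Source}
import scala.concurrent.duration._

case class Event(topic: String, payload: String)

object ViaParquetExample extends App {
  implicit val system: ActorSystem = ActorSystem("example")

  Source
    .tick(0.seconds, 100.millis, ())
    .zipWithIndex
    .map { case (_, i) => Event(topic = s"t${i % 2}", payload = s"msg-$i") }
    .via(
      ParquetStreams.viaParquet
        .of[Event]
        .partitionBy(Col("topic"))  // one directory per distinct topic value
        .maxCount(1024)             // rotate the file after 1024 rows...
        .maxDuration(30.seconds)    // ...or after 30 seconds, whichever is first
        .write(Path("/tmp/events"))
    )
    .runWith(Sink.ignore)           // the flow emits each message after it is written
}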



