
com.github.mjakubowski84.parquet4s.ParquetStreams.scala

package com.github.mjakubowski84.parquet4s

/** Holds factory of Akka Streams / Pekko Streams sources and sinks that allow reading from and writing to Parquet
  * files.
  */
object ParquetStreams {

  /** Creates a [[com.github.mjakubowski84.parquet4s.ScalaCompat.stream.scaladsl.Source]] that reads Parquet data from
    * the specified path. If there are multiple files at the path, the order in which they are loaded is determined by
    * the underlying filesystem.
    *
    * The path can refer to a local file, HDFS, AWS S3, Google Storage, Azure, etc. Please refer to the Hadoop client
    * documentation or your data provider to learn how to configure the connection.
    *
    * Can also read partitioned directories. The filter applies to partition values as well. Partition values are set
    * as fields in read entities at the path defined by the partition name. The path can be a simple column name or a
    * dot-separated path to a nested field. Missing intermediate fields are created automatically for each read record.
    *
    * Allows turning on a projection over the original file schema in order to boost read performance when not all
    * columns need to be read.
    *
    * Provides an explicit API for both custom data types and generic records.
    * @return
    *   Builder of the source.
    */
  def fromParquet: ParquetSource.FromParquet = ParquetSource.FromParquetImpl

  /** Creates a [[com.github.mjakubowski84.parquet4s.ScalaCompat.stream.scaladsl.Sink]] that writes Parquet data to a
    * single file at the specified path (including the file name).
    *
    * The path can refer to a local file, HDFS, AWS S3, Google Storage, Azure, etc. Please refer to the Hadoop client
    * documentation or your data provider to learn how to configure the connection.
    *
    * Provides an explicit API for both custom data types and generic records.
    * @return
    *   Builder of a sink that writes a Parquet file.
    */
  def toParquetSingleFile: SingleFileParquetSink.ToParquet = SingleFileParquetSink.ToParquetImpl

  /** Builds a flow that:
    *   1. Is designed to write Parquet files indefinitely.
    *   2. Is able to (optionally) partition data by a list of provided fields.
    *   3. Flushes and rotates files after a given number of rows is written to the partition or a given time period
    *      elapses.
    *   4. Outputs each incoming message after it is written, but can write the result of a provided message
    *      transformation instead of the message itself.
    *
    * Provides an explicit API for both custom data types and generic records.
    * @return
    *   Builder of the flow.
    */
  def viaParquet: ParquetPartitioningFlow.ViaParquet = ParquetPartitioningFlow.ViaParquetImpl
}
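A minimal read-and-write sketch against the API above (not part of the source file): it assumes the Pekko flavour of the library (the parquet4s-pekko and pekko-stream artifacts on the classpath); the User case class and the /tmp paths are hypothetical.

// Sketch only: parquet4s-pekko and pekko-stream assumed on the classpath.
// `User` and the paths below are made up for illustration.
import com.github.mjakubowski84.parquet4s.{ParquetStreams, Path}
import org.apache.pekko.actor.ActorSystem
import org.apache.pekko.stream.scaladsl.{Sink, Source}

case class User(id: Long, name: String)

object ReadWriteExample extends App {
  implicit val system: ActorSystem = ActorSystem("example")
  import system.dispatcher

  // toParquetSingleFile: write a fixed sequence of records to one file.
  val written = Source(List(User(1L, "Alice"), User(2L, "Bob")))
    .runWith(
      ParquetStreams.toParquetSingleFile
        .of[User]
        .write(Path("/tmp/users/part-0.parquet"))
    )

  // fromParquet: read every record in the directory back as User.
  val done = written.flatMap { _ =>
    ParquetStreams.fromParquet
      .as[User]
      .read(Path("/tmp/users"))
      .runWith(Sink.foreach(println))
  }

  done.onComplete(_ => system.terminate())
}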
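And a sketch of the indefinite, partitioning flow that viaParquet describes; the Event case class, the ticking source, and the maxCount/maxDuration values are illustrative assumptions, not prescriptions.

// Sketch only: writes an unbounded stream, partitioned by a String field and
// rotated by row count or elapsed time. All names and values are illustrative.
import com.github.mjakubowski84.parquet4s.{Col, ParquetStreams, Path}
import org.apache.pekko.actor.ActorSystem
import org.apache.pekko.stream.scaladsl.{Sink, Source}
import scala.concurrent.duration._

case class Event(topic: String, payload: String)

object ViaParquetExample extends App {
  implicit val system: ActorSystem = ActorSystem("example")

  Source
    .tick(0.seconds, 100.millis, ())
    .zipWithIndex
    .map { case (_, i) => Event(topic = s"t${i % 2}", payload = s"msg-$i") }
    .via(
      ParquetStreams.viaParquet
        .of[Event]
        .partitionBy(Col("topic"))  // one directory per distinct topic value
        .maxCount(1024)             // rotate the file after 1024 rows...
        .maxDuration(30.seconds)    // ...or after 30 seconds, whichever is first
        .write(Path("/tmp/events"))
    )
    .runWith(Sink.ignore)           // the flow emits each message after it is written
}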



