# tpch-spark
TPC-H queries implemented in Spark using the DataFrames API running with BigDL PPML.
## Generating tables
Go to the [TPC Download](https://www.tpc.org/tpc_documents_current_versions/current_specifications5.asp) site, choose the `TPC-H` source code, and download the TPC-H toolkit.
After downloading the TPC-H tools zip file, uncompress it, go to the `dbgen` directory, create a makefile based on `makefile.suite`, and run `make`.
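As a rough sketch of the makefile step, the helper below copies `makefile.suite` to `makefile` while filling in the four build variables. It assumes your toolkit version leaves `CC`, `DATABASE`, `MACHINE`, and `WORKLOAD` blank in `makefile.suite`; the function name and the values `gcc`, `ORACLE`, and `LINUX` are illustrative choices, not requirements of this example.

```shell
# Sketch: derive a makefile from makefile.suite by filling in the build variables.
# The values gcc, ORACLE, and LINUX are assumptions; adjust them for your system.
configure_dbgen_makefile() {
  dbgen_dir=$1
  sed -E \
    -e 's/^CC[[:space:]]*=.*/CC = gcc/' \
    -e 's/^DATABASE[[:space:]]*=.*/DATABASE = ORACLE/' \
    -e 's/^MACHINE[[:space:]]*=.*/MACHINE = LINUX/' \
    -e 's/^WORKLOAD[[:space:]]*=.*/WORKLOAD = TPCH/' \
    "$dbgen_dir/makefile.suite" > "$dbgen_dir/makefile"
}

# Usage (the directory is a placeholder for your unpacked toolkit):
#   configure_dbgen_makefile ./dbgen && make -C ./dbgen
```

Editing the makefile by hand works just as well; the script form is only convenient when the build has to be repeated on several machines.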
The build should produce an executable called `dbgen`. Running
```
./dbgen -h
```
gives you the various options for generating the tables. The simplest case is running:
```
./dbgen
```
which generates tables with the extension `.tbl` at scale factor 1 (the default), for a total of roughly 1GB across all tables. For tables of a different size, use the `-s` option:
```
./dbgen -s 10
```
will generate roughly 10GB of input data.
You can then either upload the data to a remote file system or read it locally.
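Before uploading or encrypting, it can help to confirm that `dbgen` produced all eight TPC-H tables and to gather them in one directory. The sketch below does that; the function name and both directory arguments are hypothetical and not part of the toolkit.

```shell
# Sketch: verify that all eight TPC-H .tbl files exist, then move them together.
# src_dir and dest_dir are placeholders supplied by the caller.
collect_tbl_files() {
  src_dir=$1
  dest_dir=$2
  tables="customer lineitem nation orders part partsupp region supplier"

  # First pass: fail early, before moving anything, if a table is missing.
  for table in $tables; do
    if [ ! -f "$src_dir/$table.tbl" ]; then
      echo "missing table file: $src_dir/$table.tbl" >&2
      return 1
    fi
  done

  # Second pass: move the files into the destination directory.
  mkdir -p "$dest_dir" || return 1
  for table in $tables; do
    mv "$src_dir/$table.tbl" "$dest_dir/" || return 1
  done
}

# Usage (paths are placeholders):
#   (cd dbgen && ./dbgen -s 10) && collect_tbl_files dbgen /path/to/tpch-data
```

Checking for every file before moving any of them keeps a partial `dbgen` run from leaving the tables split across two directories.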
## Encrypt Data
Encrypt the data with one of the supported key management services: `SimpleKeyManagementService`, `EHSMKeyManagementService`, or `AzureKeyManagementService`.
The following example encrypts the data with `SimpleKeyManagementService`:
```
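# Point BIGDL_HOME at your BigDL installation (XXX is a placeholder).
export BIGDL_HOME=XXX
# Double quotes let the shell expand $BIGDL_HOME; replace VERSION with your BigDL version.
java -cp "$BIGDL_HOME/lib/bigdl-ppml-VERSION-jar-with-dependencies.jar" \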
-Xmx10g \
com.intel.analytics.bigdl.ppml.examples.tpch.EncryptFiles \
--kmsType SimpleKeyManagementService \
--simpleAPPID xxxxxxxxxxxx \
--simpleAPIKEY xxxxxxxxxxxx \
--inputPath xxx/dbgen \
--outputPath xxx/dbgen-encrypted
```
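The `VERSION` part of the jar name depends on your installation. As a minimal sketch, assuming the jar lives under `$BIGDL_HOME/lib` and follows the naming pattern shown above, a hypothetical helper such as `find_ppml_jar` can resolve the actual file name and fail early when nothing matches:

```shell
# Sketch: locate the PPML jar-with-dependencies under a BigDL home directory.
# The helper name is hypothetical; the file name pattern is taken from the command above.
find_ppml_jar() {
  bigdl_home=$1
  for jar in "$bigdl_home"/lib/bigdl-ppml-*-jar-with-dependencies.jar; do
    # If the glob matched nothing it stays literal, and the -f test fails.
    if [ -f "$jar" ]; then
      printf '%s\n' "$jar"
      return 0
    fi
  done
  echo "no bigdl-ppml jar found under $bigdl_home/lib" >&2
  return 1
}

# Usage: pass the result to -cp instead of spelling out the version by hand, e.g.
#   java -cp "$(find_ppml_jar "$BIGDL_HOME")" -Xmx10g <main class and options as above>
```

If several versions are installed, this returns the first match in glob order, so remove stale jars or pass the path explicitly in that case.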
## Running
Make sure you set `INPUT_DIR` and `OUTPUT_DIR` in the `TpchQuery` class before compiling, so that they point to the
location of the input data and to where the output should be saved.
An example script to run a query looks like this:
```
secure_password=`openssl rsautl -inkey /ppml/trusted-big-data-ml/work/password/key.txt -decrypt