Building Robust ETL
Pipelines with Apache Spark
Xiao Li
Spark Summit | SF | Jun 2017
2
About Databricks
TEAM: Started Spark project (now Apache Spark) at UC Berkeley in 2009
PRODUCT: Unified Analytics Platform
MISSION: Making Big Data Simple
3
About Me
• Apache Spark Committer
• Software Engineer at Databricks
• Ph.D. from the University of Florida
• Previously, IBM Master Inventor, QRep, GDPS A/A and STC
• Spark SQL, Database Replication, Information Integration
• Github: gatorsmile
4
Overview
1. What’s an ETL Pipeline?
2. Using Spark SQL for ETL
- Extract: Dealing with Dirty Data (Bad Records or Files)
- Extract: Multi-line JSON/CSV Support
- Transformation: Higher-order functions in SQL
- Load: Unified write paths and interfaces
3. New Features in Spark 2.3
- Performance (Data Source API v2, Python UDF)
5
What is a Data Pipeline?
1. Sequence of transformations on data
2. Source data is typically semi-structured/unstructured
(JSON, CSV, etc.) and structured (JDBC, Parquet, ORC, and
other Hive-serde tables)
3. Output data is integrated, structured and curated.
– Ready for further data processing, analysis and reporting
6
Example of a Data Pipeline
[Diagram of a data pipeline: Kafka/log sources and a database feed a cloud warehouse, which serves aggregate reporting applications, an ML model, and ad-hoc queries.]
7
ETL is the First Step in a Data Pipeline
1. ETL stands for EXTRACT, TRANSFORM and LOAD
2. Goal is to clean or curate the data
- Retrieve data from sources (EXTRACT)
- Transform data into a consumable format (TRANSFORM)
- Transmit data to downstream consumers (LOAD)
8
An ETL Query in Apache Spark
spark.read.json("/source/path")   // EXTRACT
  .filter(...)                    // TRANSFORM
  .agg(...)                       // TRANSFORM
  .write.mode("append")           // LOAD
  .parquet("/output/path")        // LOAD
9
An ETL Query in Apache Spark
Extract
val csvTable = spark.read.csv("/source/path")    // EXTRACT
val jdbcTable = spark.read.format("jdbc")        // EXTRACT
  .option("url", "jdbc:postgresql:...")
  .option("dbtable", "TEST.PEOPLE")
  .load()
csvTable
  .join(jdbcTable, Seq("name"), "outer")         // TRANSFORM
  .filter("id <= 2999")                          // TRANSFORM
  .write                                         // LOAD
  .mode("overwrite")
  .format("parquet")
  .saveAsTable("outputTableName")
10
What’s so hard about ETL Queries?
11
Why is ETL Hard?
The symptoms:
1. Too complex
2. Error-prone
3. Too slow
4. Too expensive
The causes:
1. Various sources/formats
2. Schema mismatch
3. Different representations
4. Corrupted files and data
5. Scalability
6. Schema evolution
7. Continuous ETL
12
This is why ETL is important
Consumers of this data don’t want to deal with this
messiness and complexity
13
Using Spark SQL for ETL
14
Structured Streaming
Spark SQL's flexible APIs, support for a wide variety of data sources, built-in support for structured streaming, and its state-of-the-art Catalyst optimizer and Tungsten execution engine make it a great framework for building end-to-end ETL pipelines.
15
Data Source Support
1. Built-in connectors in Spark:
– JSON, CSV, Text, Hive, Parquet, ORC, JDBC
2. Third-party data source connectors:
– https://spark-packages.org
3. Define your own data source connectors with the
Data Source APIs
– Ref: https://youtu.be/uxuLRiNoDio
16
{"a":1, "b":2, "c":3}
{"e":2, "c":3, "b":5}
{"a":5, "d":7}
spark.read
.json("/source/path”)
.printSchema()
Schema Inference – semi-structured files
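With these three records, every value is an integer, so Spark infers each field as a nullable long. The printSchema() output should look roughly like this (a sketch, assuming Spark 2.x defaults):
root
 |-- a: long (nullable = true)
 |-- b: long (nullable = true)
 |-- c: long (nullable = true)
 |-- d: long (nullable = true)
 |-- e: long (nullable = true)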
17
{"a":1, "b":2, "c":3.1}
{"e":2, "c":3, "b":5}
{"a":"5", "d":7}
spark.read
.json("/source/path”)
.printSchema()
Schema Inference – semi-structured files
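Here the types conflict across records, so inference widens them: "a" mixes an integer and a string and becomes string; "c" mixes an integer and a double and becomes double. Roughly the expected printSchema() output (a sketch):
root
 |-- a: string (nullable = true)
 |-- b: long (nullable = true)
 |-- c: double (nullable = true)
 |-- d: long (nullable = true)
 |-- e: long (nullable = true)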
18
{"a":1, "b":2, "c":3}
{"e":2, "c":3, "b":5}
{"a":5, "d":7}
val schema = new StructType()
.add("a", "int")
.add("b", "int")
spark.read
.json("/source/path")
.schema(schema)
.show()
User-specified Schema
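With a user-specified schema, inference is skipped and only the declared columns are read; fields absent from a record come back as null. The show() output should look roughly like this (a sketch):
+----+----+
|   a|   b|
+----+----+
|   1|   2|
|null|   5|
|   5|null|
+----+----+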
19
{"a":1, "b":2, "c":3}
{"e":2, "c":3, "b":5}
{"a":5, "d":7}
Availability: Apache Spark 2.2
spark.read
.json("/source/path")
.schema("a INT, b INT")
.show()
User-specified DDL-format Schema
20
Corrupt Files
java.io.IOException. For example:
java.io.EOFException: Unexpected end of input stream at org.apache.hadoop.io.compress.DecompressorStream.decompress
java.lang.RuntimeException: file:/temp/path/c000.json is not a Parquet file (too small)
spark.sql.files.ignoreCorruptFiles = true
[SPARK-17850] If true, Spark jobs will continue to run even when they encounter corrupt files. The contents that have been read will still be returned.
Dealing with Bad Data: Skip Corrupt Files
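The flag can also be set per session on the SparkSession conf. A minimal sketch in Scala, assuming the usual spark handle:
// skip unreadable files instead of failing the whole job
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")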
21
Missing or Corrupt Records
[SPARK-12833][SPARK-13764] Text file formats (JSON and CSV) support three different ParseModes while reading data:
1. PERMISSIVE
2. DROPMALFORMED
3. FAILFAST
Dealing with Bad Data: Skip Corrupt Records
22
{"a":1, "b":2, "c":3}
{"a":{, b:3}
{"a":5, "b":6, "c":7}
spark.read
.option("mode", "PERMISSIVE")
.option("columnNameOfCorruptRecord", "_corrupt_record")
.json(corruptRecords)
.show()
The default column name can be configured via spark.sql.columnNameOfCorruptRecord.
Json: Dealing with Corrupt Records
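In PERMISSIVE mode every record is kept: parsable fields are populated and the raw text of the malformed record lands in the corrupt-record column (nulls everywhere else). A sketch of the expected show() output:
+---------------+----+----+----+
|_corrupt_record|   a|   b|   c|
+---------------+----+----+----+
|           null|   1|   2|   3|
|   {"a":{, b:3}|null|null|null|
|           null|   5|   6|   7|
+---------------+----+----+----+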
23
{"a":1, "b":2, "c":3}
{"a":{, b:3}
{"a":5, "b":6, "c":7}
spark.read
.option("mode", "DROPMALFORMED")
.json(corruptRecords)
.show()
Json: Dealing with Corrupt Records
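DROPMALFORMED silently discards the unparsable record, so only the two well-formed rows survive. Roughly the expected show() output:
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  2|  3|
|  5|  6|  7|
+---+---+---+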
24
{"a":1, "b":2, "c":3}
{"a":{, b:3}
{"a":5, "b":6, "c":7}
spark.read
.option("mode", "FAILFAST")
.json(corruptRecords)
.show()
org.apache.spark.sql.catalyst.json
.SparkSQLJsonProcessingException:
Malformed line in FAILFAST mode:
{"a":{, b:3}
Json: Dealing with Corrupt Records
25
spark.read
.option("mode", "FAILFAST")
.csv(corruptRecords)
.show()
java.lang.RuntimeException:
Malformed line in FAILFAST mode:
2015,Chevy,Volt
CSV: Dealing with Corrupt Records
year,make,model,comment,blank
"2012","Tesla","S","No comment",
1997,Ford,E350,"Go get one now they",
2015,Chevy,Volt
26
spark.read
.option("mode", "PERMISSIVE")
.csv(corruptRecords)
.show()
CSV: Dealing with Corrupt Records
year,make,model,comment,blank
"2012","Tesla","S","No comment",
1997,Ford,E350,"Go get one now they",
2015,Chevy,Volt
27
year,make,model,comment,blank
"2012","Tesla","S","No comment",
1997,Ford,E350,"Go get one now they",
2015,Chevy,Volt
spark.read
.option("header", true)
.option("mode", "PERMISSIVE")
.csv(corruptRecords)
.show()
CSV: Dealing with Corrupt Records
28
val schema = "col1 INT, col2 STRING, col3 STRING, col4 STRING, " +
"col5 STRING, __corrupted_column_name STRING"
spark.read
.schema(schema) // apply the user-specified schema, including the corrupt-record column
.option("header", true)
.option("mode", "PERMISSIVE")
.option("columnNameOfCorruptRecord", "__corrupted_column_name") // name the column that captures malformed rows
.csv(corruptRecords)
.show()
CSV: Dealing with Corrupt Records
29
year,make,model,comment,blank
"2012","Tesla","S","No comment",
1997,Ford,E350,"Go get one now they",
2015,Chevy,Volt
spark.read
.option("mode", ”DROPMALFORMED")
.csv(corruptRecords)
.show()
CSV: Dealing with Corrupt Records
30
Functionality: Better Corruption Handling
badRecordsPath: a user-specified path where Spark stores exception files that record information about bad records/files.
- A unified interface for both corrupt records and files
- Enabling multi-phase data cleaning
- DROPMALFORMED + Exception files
- No need for an extra column for corrupt records
- Recording the exception data, reasons and time.
Availability: Databricks Runtime 3.0
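A minimal sketch of how this looks in code (Databricks Runtime 3.0+ only; the path below is hypothetical):
spark.read
  .option("badRecordsPath", "/tmp/badRecordsPath") // exception files describing the bad records land here
  .json("/source/path")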
31
Functionality: Better JSON and CSV Support
[SPARK-18352] [SPARK-19610] Multi-line JSON and CSV Support
- Spark SQL currently reads JSON/CSV one line at a time
- Before 2.2, parsing multi-line records required custom ETL
spark.read
.option("multiLine", true)
.json(path)
Availability: Apache Spark 2.2
32
Transformation: Higher-order Functions in SQL
Transformation on complex objects like arrays, maps and
structures inside of columns.
Why not a UDF? Expensive data serialization.
tbl_nested
|-- key: long (nullable = false)
|-- values: array (nullable = false)
| |-- element: long (containsNull = false)
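The SQL examples on the next slides assume a table with this shape. A hypothetical Scala setup (key and values invented for illustration):
// create a one-row table matching the tbl_nested schema above
spark.sql("CREATE TABLE tbl_nested AS SELECT 1L AS key, array(10L, 20L, 40L) AS `values`")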
33
Transformation: Higher-order Functions in SQL
Transformation on complex objects like arrays, maps and structures inside of columns.
tbl_nested
|-- key: long (nullable = false)
|-- values: array (nullable = false)
| |-- element: long (containsNull = false)
1) Check for element existence
SELECT EXISTS(values, e -> e > 30) AS v
FROM tbl_nested;
2) Transform an array
SELECT TRANSFORM(values, e -> e * e) AS v
FROM tbl_nested;
34
4) Aggregate an array
SELECT REDUCE(values, 0, (value, acc) -> value + acc) AS sum
FROM tbl_nested;
Ref Databricks Blog: http://dbricks.co/2rUKQ1A
More cool features available in DB Runtime 3.0: http://dbricks.co/2rhPM4c
Availability: Databricks Runtime 3.0
3) Filter an array
SELECT FILTER(values, e -> e > 30) AS v
FROM tbl_nested;
Transformation: Higher order function in SQL
tbl_nested
|-- key: long (nullable = false)
|-- values: array (nullable = false)
| |-- element: long (containsNull = false)
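Given the hypothetical one-row table sketched earlier (values = [10, 20, 40]), these queries would return: EXISTS → true, TRANSFORM → [100, 400, 1600], FILTER → [40], and REDUCE → 70.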
35
Users can create Hive-serde tables using the
DataFrameWriter APIs
Availability: Apache Spark 2.2
New Format in DataFrameWriter API
CREATE data source tables:
df.write.format("parquet")
.saveAsTable("tab")
CREATE Hive-serde tables:
df.write.format("hive")
.option("fileFormat", "avro")
.saveAsTable("tab")
36
Availability: Apache Spark 2.2
Unified CREATE TABLE [AS SELECT]
CREATE data source tables:
CREATE TABLE t1(a INT, b INT)
USING ORC
CREATE Hive-serde tables:
CREATE TABLE t1(a INT, b INT)
STORED AS ORC
CREATE TABLE t1(a INT, b INT)
USING hive
OPTIONS(fileFormat 'ORC')
37
Unified CREATE TABLE [AS SELECT]
Apache Spark preferred syntax:
CREATE [TEMPORARY] TABLE [IF NOT EXISTS]
[db_name.]table_name
USING table_provider
[OPTIONS table_property_list]
[PARTITIONED BY (col_name, col_name, ...)]
[CLUSTERED BY (col_name, col_name, ...)
[SORTED BY (col_name [ASC|DESC], ...)]
INTO num_buckets BUCKETS]
[LOCATION path]
[COMMENT table_comment]
[AS select_statement];
Availability: Apache Spark 2.2
38
Apache Spark 2.3+
Massive focus on building ETL-friendly pipelines
39
[SPARK-15689] Data Source API v2
1. [SPARK-20960] An efficient column batch interface for data
exchanges between Spark and external systems.
o Cost for conversion to and from RDD[Row]
o Cost for serialization/deserialization
o Publish the columnar binary formats
2. Filter pushdown and column pruning
3. Additional pushdown: limit, sampling and so on.
Target: Apache Spark 2.3
40
Performance: Python UDFs
1. Python is the most popular language for ETL
2. Python UDFs are often used to express elaborate data
conversions/transformations
3. Any improvements to Python UDF processing will ultimately
improve ETL.
4. Improve data exchange between Python and JVM
5. Block-level UDFs
o Block-level arguments and return types
Target: Apache Spark 2.3
41
Recap
1. What’s an ETL Pipeline?
2. Using Spark SQL for ETL
- Extract: Dealing with Dirty Data (Bad Records or Files)
- Extract: Multi-line JSON/CSV Support
- Transformation: Higher-order functions in SQL
- Load: Unified write paths and interfaces
3. New Features in Spark 2.3
- Performance (Data Source API v2, Python UDF)
42
UNIFIED ANALYTICS PLATFORM
Try Apache Spark in Databricks!
• Collaborative cloud environment
• Free version (community edition)
DATABRICKS RUNTIME 3.0
• Apache Spark - optimized for the cloud
• Caching and optimization layer - DBIO
• Enterprise security - DBES
Try for free today.
databricks.com
Questions?
Xiao Li (lixiao@databricks.com)