2. A typical pipeline
Automate everything
• Deploy to production efficiently and reliably
• Allow everyone in the team to do so
• Smaller increments
• Roll forward, don't roll back
[Pipeline diagram] Trigger (version control) → Code (test) → Build (artifact) → Deploy to Dev (integration tests) → Deploy to Prod (user facing) → Measure (capture performance)
4. Overall project structure
• src, containing the library
• input, data used while testing
• notebook, containing the application
• tests, containing the unit tests (layout sketched below)
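A sketch of that layout; the two config files are the ones referenced later in the deck, and exact file names may differ:

.
├── src/                      # the library, packaged as a wheel
├── input/                    # data used by the tests
├── notebook/                 # the application, including version.py
├── tests/                    # unit tests
├── setup.py
└── .pre-commit-config.yaml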
5. Testing
Our approach
• Use Pre-Commit
• Apply Black, Flake8
• Run PySpark tests in a Docker container (an example test is sketched below)
• Checkout code
• Install requirements
• Apply linters
• Run unit tests
• Publish test and coverage results
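To give an idea of what the pre-push hook and the CI job execute, here is a minimal sketch of a PySpark unit test; the DataFrame logic is purely illustrative, the real tests exercise the library under src/.

# tests/test_example.py
import pytest
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

@pytest.fixture(scope="session")
def spark():
    # A local Spark session is enough for unit tests; no cluster is required
    return SparkSession.builder.master("local[1]").appName("unit-tests").getOrCreate()

def test_name_length(spark):
    df = spark.createDataFrame([("Alice",), ("Bob",)], ["name"])
    result = df.withColumn("name_length", F.length("name"))
    assert result.filter(F.col("name_length") > 3).count() == 1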
6. Pre-Commit
E.g., getting rid of the “Fixing lint issues” commit
• A framework for creating Git hooks
• E.g., scripts that run on each commit
• Think of it as a local CI
7. Pre-Commit
.pre-commit-config.yaml
• In our case
• run Black and Flake8 on each commit
• run pytest on each push
repos:
  - repo: https://github.com/psf/black
    rev: 19.3b0
    hooks:
      - id: black
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v2.3.0
    hooks:
      - id: flake8
      - id: check-merge-conflict
  - repo: https://github.com/godatadriven/pre-commit-docker-pyspark
    rev: master
    hooks:
      - id: pyspark-docker
        name: Run tests
        entry: /entrypoint.sh python setup.py test
        language: docker
        pass_filenames: false
        stages: [push]
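Developers activate the hooks once per clone with pre-commit install; because the test hook runs in the push stage, it also needs pre-commit install --hook-type pre-push.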
10. Test output
Integrates with Azure DevOps
• Which tests fail frequently
• Full stack traces of a failed test
• Code coverage
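To feed these views, the test job can have pytest write a JUnit XML report and an XML coverage report (for example pytest --junitxml=test-results.xml --cov=src --cov-report=xml with pytest-cov), which the Azure DevOps publish tasks then pick up.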
11. Building
Our approach
• Build a Python wheel of the library
• Modify notebook/version.py
• Create a build artifact of the notebook folder
• Checkout code
• Build wheel (setup.py sketched below)
• Authenticate with Azure DevOps Artifacts
• Push wheel
• Publish the notebook folder as a Build Artifact
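A minimal sketch of the setup.py behind the “Build wheel” step, assuming setuptools; the package name is illustrative and the version is the value the pipeline rewrites:

# setup.py
from setuptools import setup, find_packages

setup(
    name="our-library",                 # illustrative name
    version="1.0.100",                  # rewritten by the pipeline for every build
    package_dir={"": "src"},
    packages=find_packages("src"),
    setup_requires=["pytest-runner"],   # lets `python setup.py test` (the pre-commit entry) run pytest
    tests_require=["pytest"],
)

The wheel itself can then be produced with python setup.py bdist_wheel and pushed to the Artifacts feed, for example with twine after authenticating against the feed.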
12. Deployment
Our approach
• Copy version.py to the DEV workspace
• After a manual step
• Copy notebook/* to the PROD workspace
• Authenticate with the Databricks CLI
• Copy notebook/version.py to the DEV workspace
• Authenticate with the Databricks CLI
• Copy notebook/* to the PROD workspace
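As a sketch of these two jobs with the Databricks CLI: authenticate by setting DATABRICKS_HOST and DATABRICKS_TOKEN (or running databricks configure --token), then use databricks workspace import -l PYTHON for the single version.py file and databricks workspace import_dir for the whole notebook folder; the target workspace paths are environment specific.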
13. version.py
• A successful change to master results in a new version of the library
• Deploy that version to DEV
• and maybe at a later time to PROD
[Diagram] Azure DevOps Pipeline → Azure DevOps Artifacts → Azure Databricks, with the Dev notebook and the Prod notebook each pinned to a library version (e.g. 1.0.100 and 1.0.200)
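A sketch of what notebook/version.py contains; the pipeline rewrites this single value on every successful build of master:

# notebook/version.py
__version__ = "1.0.100"  # illustrative; updated by the CI/CD pipeline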
14.
• On DEV, only version.py is deployed by our CI/CD
• On PROD, the whole notebook folder
• i.e. our application
• Using dbutils and version.py
• We can install a specific version of our library (sketched below)
dbutils.library
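A sketch of that install step inside the notebook, assuming the wheel is published to an Azure DevOps Artifacts feed reachable as a pip index and that version.py (deployed next to the notebook) provides __version__, e.g. via %run ./version; the package name and feed URL are illustrative, and authentication against the feed is omitted:

dbutils.library.installPyPI(
    "our-library",
    version=__version__,
    repo="https://pkgs.dev.azure.com/<organisation>/_packaging/<feed>/pypi/simple/",
)
dbutils.library.restartPython()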
15. The complete pipeline
• Run black, flake8 and pytest using pre-commit
• Upload the wheel to DevOps Artifacts, export the notebook folder with a modified version.py
• Copy version.py to the DEV workspace
• Copy the notebook folder to the PROD workspace