SlideShare una empresa de Scribd logo
1 de 19
Descargar para leer sin conexión
© 2020, Amazon Web Services, Inc. or its Affiliates.
Chris Fregly
Developer Advocate AI/ML
Amazon Web Services
Data Science on AWS
© 2020, Amazon Web Services, Inc. or its Affiliates.
Who Am I?
Based in San Francisco
Meetup Organizer,Advanced SageMaker and Kubeflow Meetup:
https://meetup.com/Advanced-Kubeflow/ (12,000+ Members)
O’Reilly Author, Data Science on AWS: https://datascienceonaws.com
Former Engineer at Netflix (Video Streaming) and Databricks (Spark Streaming)
© 2020, Amazon Web Services, Inc. or its Affiliates.
Agenda
Why Choose AWS for Data Science?
Amazon Managed Services for Data Science
DEMOs!
Analyze Amazon Customer Reviews Dataset
Ingest S3 Data with Amazon Athena and Redshift
Data Analysis with Pandas, Matplotlib, and Amazon SageMaker Notebooks
Data Quality Checks with Apache Spark and Amazon SageMaker Processing Jobs
© 2020, Amazon Web Services, Inc. or its Affiliates.
Why Choose AWS for Data
Science?
© 2020, Amazon Web Services, Inc. or its Affiliates.
Most secure
infrastructure and
certifications
Most scalable and
cost effective
options
Easiest to build
data science
solutions
Most
comprehensive and
open
Why Choose AWS for Data Science?
© 2020, Amazon Web Services, Inc. or its Affiliates.
Build secure data lakes in days (vs. months or years)
A single storage layer (S3) for all analytics and ML
Deep integration across all AWS analytics and machine learning services
including federated queries across different services
The fastest way to go from Zero to Business Insights,
covering all data for all users
Easiest to Build Data Science Solutions
© 2020, Amazon Web Services, Inc. or its Affiliates.
Compliance
AWS Artifact
Amazon Inspector
Amazon Cloud HSM
Amazon Cognito
AWS CloudTrail
Security
Amazon GuardDuty
AWS Shield
AWSWAF
Amazon Macie
VPC
Encryption
AWS Certification Manager
AWS Key Management Service
Encryption at rest
Encryption in transit
Bring your own keys,
HSM support
Identity
AWS IAM
AWS SSO
Amazon Cloud Directory
AWS Directory Service
AWS Organizations
Customers need to have multiple levels of security, identity and access management, encryption, and
compliance to secure their data lake
Most Secure Infrastructure
© 2020, Amazon Web Services, Inc. or its Affiliates.
CSA
Cloud Security
Alliance Controls
ISO 9001
Global Quality
Standard
ISO 27001
Security Management
Controls
ISO 27017
Cloud Specific
Controls
ISO 27018
Personal Data
Protection
PCI DSS Level 1
Payment Card
Standards
SOC 1
Audit Controls
Report
SOC 2
Security, Availability, &
Confidentiality Report
SOC 3
General Controls
Report
Global United States
CJIS
Criminal Justice
Information Services
DoD SRG
DoD Data
Processing
FedRAMP
Government Data
Standards
FERPA
Educational
Privacy Act
FIPS
Government Security
Standards
FISMA
Federal Information
Security Management
GxP
Quality Guidelines
and Regulations
ISO FFIEC
Financial Institutions
Regulation
HIPPA
Protected Health
Information
ITAR
International Arms
Regulations
MPAA
Protected Media
Content
NIST
National Institute of
Standards and Technology
SEC Rule 17a-4(f)
Financial Data
Standards
VPAT/Section 508
Accountability
Standards
Asia Pacific
FISC [Japan]
Financial Industry
Information Systems
IRAP [Australia]
Australian Security
Standards
K-ISMS [Korea]
Korean Information
Security
MTCS Tier 3 [Singapore]
Multi-Tier Cloud
Security Standard
My Number Act [Japan]
Personal Information
Protection
Europe
C5 [Germany]
Operational Security
Attestation
Cyber Essentials
Plus [UK]
Cyber Threat
Protection
G-Cloud [UK]
UK Government
Standards
IT-Grundschutz
[Germany]
Baseline Protection
Methodology
X P
G
Most Certifications
© 2020, Amazon Web Services, Inc. or its Affiliates.
Migration & Streaming Services
Infrastructure Data Catalog
& ETL
Security &
Management
Data
Warehousing
Big Data
Processing
Interactive
Query
Operational
Analytics
Real time
Analytics
Serverless
Data processing
Data movement
Analytics
Data lake infrastructure & management
Dashboards Predictive Analytics
Data, visualization, engagement, & machine learning
Digital User EngagementData
Most Comprehensive and Open
© 2020, Amazon Web Services, Inc. or its Affiliates.
Five highly
available storage
tiers including
intelligent tiering
Industry leading choice
of 200+ instance types
to meet workload
needs
On-demand,
Reserved, and
Spot instances to
reduce costs
100 Gbps
bandwidth
network interfaces
for performance
Most Scalable, Cost-Effective, Performant Infrastructure
© 2020, Amazon Web Services, Inc. or its Affiliates.
Amazon Managed Services
for Data Science
© 2020, Amazon Web Services, Inc. or its Affiliates.
Amazon SageMaker Notebooks –Web-Based Environment
• Compatible with Jupyter and JupyterLab Notebooks
Access your notebooks in
seconds
Administrators manage access
and permissions
Share notebooks
with a single click
Dial up or down
compute resources
Start your notebooks
without spinning up
compute resources
© 2020, Amazon Web Services, Inc. or its Affiliates.
Amazon S3 – Data Lake
• Massively Scalable Object Storage
• 99.99999999999% Durability (11 9’s)
• Global Replication
• Cost-Effective Storage Options
• Many Partner Integrations
© 2020, Amazon Web Services, Inc. or its Affiliates.
Amazon Athena – Data Queries
• Serverless, Interactive Query Service
• Dynamically Scalable to Large Workloads
Pay per query
Pay only for queries run
Save 30–90% on per-query costs
through compression
Use S3 storage
ANSI SQL
JDBC/ODBC drivers
Multiple formats, compression
types, and complex joins and
data types
SQL
Serverless: zero infrastructure,
zero administration
Integrated with QuickSight
EasyQuery instantly
Zero setup cost
Point to S3 and start querying
© 2020, Amazon Web Services, Inc. or its Affiliates.
Best performance,
most scalable
3x faster with RA3*
10x faster with AQUA*
Adds unlimited compute capacity on-
demand to meet unlimited concurrent
access
Lowest cost
Cost-optimized workloads
by paying compute and
storage separately
1/10th cost ofTraditional
DW at $1000/TB/year
Up to 75% less than other cloud
data warehouses & predictable
costs
Data lake &
AWS integration
Analyze exabytes of data across data
warehouse, data lakes, and
operational database
Query data across various analytics
services
Most secure
& compliant
AWS-grade security (eg.VPC,
encryption with KMS, CloudTrail)
All major certifications such
as SOC, PCI, DSS, ISO,
FedRAMP, HIPPA
• Most Popular Cloud Data Warehouse
*vs other cloud DWs
Amazon Redshift – DataWarehouse
© 2020, Amazon Web Services, Inc. or its Affiliates.
Amazon SageMaker Processing Jobs – Data Processing
• Large-Scale Data Processing
• Supports Apache Spark
Use SageMaker’s built-in
containers or bring your own
Bring your own script for
feature engineering
Custom processing
Achieve distributed
processing for clusters
Your resources are created,
configured, & terminated
automatically
Leverage SageMaker’s
security & compliance
features
© 2020, Amazon Web Services, Inc. or its Affiliates.
AWS Open Source Libraries
AWS DataWrangler
Simplify data querying and processing in Python and Pandas
• https://github.com/awslabs/aws-data-wrangler
• https://aws-data-wrangler.readthedocs.io/en/latest/
AWS Deequ
Data Quality Checks for Your Pipelines
• https://github.com/awslabs/aws-data-wrangler
• https://aws-data-wrangler.readthedocs.io/en/latest/
© 2020, Amazon Web Services, Inc. or its Affiliates.
DEMOs!
https://github.com/data-science-on-aws/workshop
Amazon Customer Reviews Dataset
(150+ Million Reviews)
https://s3.amazonaws.com/amazon-reviews-pds/readme.html
© 2020, Amazon Web Services, Inc. or its Affiliates.
Thank you!
Chris Fregly @cfregly
https://data-science-on-aws/workshop
https://linkedin.com/in/cfregly/

Más contenido relacionado

Más de Chris Fregly

KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...
KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...
KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...Chris Fregly
 
Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...
Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...
Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...Chris Fregly
 
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...Chris Fregly
 
Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T...
Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T...Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T...
Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T...Chris Fregly
 
PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -...
PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -...PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -...
PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -...Chris Fregly
 
PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer...
PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer...PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer...
PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer...Chris Fregly
 
Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...
Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...
Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...Chris Fregly
 
PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to...
PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to...PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to...
PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to...Chris Fregly
 
Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern...
Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern...Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern...
Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern...Chris Fregly
 
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...Chris Fregly
 
PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +...
PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +...PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +...
PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +...Chris Fregly
 
PipelineAI + AWS SageMaker + Distributed TensorFlow + AI Model Training and S...
PipelineAI + AWS SageMaker + Distributed TensorFlow + AI Model Training and S...PipelineAI + AWS SageMaker + Distributed TensorFlow + AI Model Training and S...
PipelineAI + AWS SageMaker + Distributed TensorFlow + AI Model Training and S...Chris Fregly
 
High Performance TensorFlow in Production - Big Data Spain - Madrid - Nov 15 ...
High Performance TensorFlow in Production - Big Data Spain - Madrid - Nov 15 ...High Performance TensorFlow in Production - Big Data Spain - Madrid - Nov 15 ...
High Performance TensorFlow in Production - Big Data Spain - Madrid - Nov 15 ...Chris Fregly
 
Optimizing, Profiling, and Deploying TensorFlow AI Models with GPUs - San Fra...
Optimizing, Profiling, and Deploying TensorFlow AI Models with GPUs - San Fra...Optimizing, Profiling, and Deploying TensorFlow AI Models with GPUs - San Fra...
Optimizing, Profiling, and Deploying TensorFlow AI Models with GPUs - San Fra...Chris Fregly
 
Building Google's ML Engine from Scratch on AWS with GPUs, Kubernetes, Istio,...
Building Google's ML Engine from Scratch on AWS with GPUs, Kubernetes, Istio,...Building Google's ML Engine from Scratch on AWS with GPUs, Kubernetes, Istio,...
Building Google's ML Engine from Scratch on AWS with GPUs, Kubernetes, Istio,...Chris Fregly
 
Nvidia GPU Tech Conference - Optimizing, Profiling, and Deploying TensorFlow...
Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow...Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow...
Nvidia GPU Tech Conference - Optimizing, Profiling, and Deploying TensorFlow...Chris Fregly
 
Building Google Cloud ML Engine From Scratch on AWS with PipelineAI - ODSC Lo...
Building Google Cloud ML Engine From Scratch on AWS with PipelineAI - ODSC Lo...Building Google Cloud ML Engine From Scratch on AWS with PipelineAI - ODSC Lo...
Building Google Cloud ML Engine From Scratch on AWS with PipelineAI - ODSC Lo...Chris Fregly
 
Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with ...
Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with ...Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with ...
Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with ...Chris Fregly
 
High Performance TensorFlow in Production -- Sydney ML / AI Train Workshop @ ...
High Performance TensorFlow in Production -- Sydney ML / AI Train Workshop @ ...High Performance TensorFlow in Production -- Sydney ML / AI Train Workshop @ ...
High Performance TensorFlow in Production -- Sydney ML / AI Train Workshop @ ...Chris Fregly
 
High Performance Distributed TensorFlow with GPUs - NYC Workshop - July 9 2017
High Performance Distributed TensorFlow with GPUs - NYC Workshop - July 9 2017High Performance Distributed TensorFlow with GPUs - NYC Workshop - July 9 2017
High Performance Distributed TensorFlow with GPUs - NYC Workshop - July 9 2017Chris Fregly
 

Más de Chris Fregly (20)

KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...
KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...
KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...
 
Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...
Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...
Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...
 
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
 
Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T...
Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T...Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T...
Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T...
 
PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -...
PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -...PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -...
PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -...
 
PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer...
PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer...PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer...
PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer...
 
Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...
Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...
Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...
 
PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to...
PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to...PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to...
PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to...
 
Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern...
Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern...Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern...
Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern...
 
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...
 
PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +...
PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +...PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +...
PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +...
 
PipelineAI + AWS SageMaker + Distributed TensorFlow + AI Model Training and S...
PipelineAI + AWS SageMaker + Distributed TensorFlow + AI Model Training and S...PipelineAI + AWS SageMaker + Distributed TensorFlow + AI Model Training and S...
PipelineAI + AWS SageMaker + Distributed TensorFlow + AI Model Training and S...
 
High Performance TensorFlow in Production - Big Data Spain - Madrid - Nov 15 ...
High Performance TensorFlow in Production - Big Data Spain - Madrid - Nov 15 ...High Performance TensorFlow in Production - Big Data Spain - Madrid - Nov 15 ...
High Performance TensorFlow in Production - Big Data Spain - Madrid - Nov 15 ...
 
Optimizing, Profiling, and Deploying TensorFlow AI Models with GPUs - San Fra...
Optimizing, Profiling, and Deploying TensorFlow AI Models with GPUs - San Fra...Optimizing, Profiling, and Deploying TensorFlow AI Models with GPUs - San Fra...
Optimizing, Profiling, and Deploying TensorFlow AI Models with GPUs - San Fra...
 
Building Google's ML Engine from Scratch on AWS with GPUs, Kubernetes, Istio,...
Building Google's ML Engine from Scratch on AWS with GPUs, Kubernetes, Istio,...Building Google's ML Engine from Scratch on AWS with GPUs, Kubernetes, Istio,...
Building Google's ML Engine from Scratch on AWS with GPUs, Kubernetes, Istio,...
 
Nvidia GPU Tech Conference - Optimizing, Profiling, and Deploying TensorFlow...
Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow...Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow...
Nvidia GPU Tech Conference - Optimizing, Profiling, and Deploying TensorFlow...
 
Building Google Cloud ML Engine From Scratch on AWS with PipelineAI - ODSC Lo...
Building Google Cloud ML Engine From Scratch on AWS with PipelineAI - ODSC Lo...Building Google Cloud ML Engine From Scratch on AWS with PipelineAI - ODSC Lo...
Building Google Cloud ML Engine From Scratch on AWS with PipelineAI - ODSC Lo...
 
Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with ...
Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with ...Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with ...
Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with ...
 
High Performance TensorFlow in Production -- Sydney ML / AI Train Workshop @ ...
High Performance TensorFlow in Production -- Sydney ML / AI Train Workshop @ ...High Performance TensorFlow in Production -- Sydney ML / AI Train Workshop @ ...
High Performance TensorFlow in Production -- Sydney ML / AI Train Workshop @ ...
 
High Performance Distributed TensorFlow with GPUs - NYC Workshop - July 9 2017
High Performance Distributed TensorFlow with GPUs - NYC Workshop - July 9 2017High Performance Distributed TensorFlow with GPUs - NYC Workshop - July 9 2017
High Performance Distributed TensorFlow with GPUs - NYC Workshop - July 9 2017
 

Último

New ThousandEyes Product Features and Release Highlights: March 2024
New ThousandEyes Product Features and Release Highlights: March 2024New ThousandEyes Product Features and Release Highlights: March 2024
New ThousandEyes Product Features and Release Highlights: March 2024ThousandEyes
 
Enterprise Document Management System - Qualityze Inc
Enterprise Document Management System - Qualityze IncEnterprise Document Management System - Qualityze Inc
Enterprise Document Management System - Qualityze Incrobinwilliams8624
 
Webinar_050417_LeClair12345666777889.ppt
Webinar_050417_LeClair12345666777889.pptWebinar_050417_LeClair12345666777889.ppt
Webinar_050417_LeClair12345666777889.pptkinjal48
 
OpenChain Webinar: Universal CVSS Calculator
OpenChain Webinar: Universal CVSS CalculatorOpenChain Webinar: Universal CVSS Calculator
OpenChain Webinar: Universal CVSS CalculatorShane Coughlan
 
Top Software Development Trends in 2024
Top Software Development Trends in  2024Top Software Development Trends in  2024
Top Software Development Trends in 2024Mind IT Systems
 
React 19: Revolutionizing Web Development
React 19: Revolutionizing Web DevelopmentReact 19: Revolutionizing Web Development
React 19: Revolutionizing Web DevelopmentBOSC Tech Labs
 
Sales Territory Management: A Definitive Guide to Expand Sales Coverage
Sales Territory Management: A Definitive Guide to Expand Sales CoverageSales Territory Management: A Definitive Guide to Expand Sales Coverage
Sales Territory Management: A Definitive Guide to Expand Sales CoverageDista
 
Generative AI for Cybersecurity - EC-Council
Generative AI for Cybersecurity - EC-CouncilGenerative AI for Cybersecurity - EC-Council
Generative AI for Cybersecurity - EC-CouncilVICTOR MAESTRE RAMIREZ
 
Cybersecurity Challenges with Generative AI - for Good and Bad
Cybersecurity Challenges with Generative AI - for Good and BadCybersecurity Challenges with Generative AI - for Good and Bad
Cybersecurity Challenges with Generative AI - for Good and BadIvo Andreev
 
Transforming PMO Success with AI - Discover OnePlan Strategic Portfolio Work ...
Transforming PMO Success with AI - Discover OnePlan Strategic Portfolio Work ...Transforming PMO Success with AI - Discover OnePlan Strategic Portfolio Work ...
Transforming PMO Success with AI - Discover OnePlan Strategic Portfolio Work ...OnePlan Solutions
 
Deep Learning for Images with PyTorch - Datacamp
Deep Learning for Images with PyTorch - DatacampDeep Learning for Images with PyTorch - Datacamp
Deep Learning for Images with PyTorch - DatacampVICTOR MAESTRE RAMIREZ
 
Vectors are the new JSON in PostgreSQL (SCaLE 21x)
Vectors are the new JSON in PostgreSQL (SCaLE 21x)Vectors are the new JSON in PostgreSQL (SCaLE 21x)
Vectors are the new JSON in PostgreSQL (SCaLE 21x)Jonathan Katz
 
Fields in Java and Kotlin and what to expect.pptx
Fields in Java and Kotlin and what to expect.pptxFields in Java and Kotlin and what to expect.pptx
Fields in Java and Kotlin and what to expect.pptxJoão Esperancinha
 
How Does the Epitome of Spyware Differ from Other Malicious Software?
How Does the Epitome of Spyware Differ from Other Malicious Software?How Does the Epitome of Spyware Differ from Other Malicious Software?
How Does the Epitome of Spyware Differ from Other Malicious Software?AmeliaSmith90
 
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/MLBig Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/MLAlluxio, Inc.
 
Optimizing Business Potential: A Guide to Outsourcing Engineering Services in...
Optimizing Business Potential: A Guide to Outsourcing Engineering Services in...Optimizing Business Potential: A Guide to Outsourcing Engineering Services in...
Optimizing Business Potential: A Guide to Outsourcing Engineering Services in...Jaydeep Chhasatia
 
20240319 Car Simulator Plan.pptx . Plan for a JavaScript Car Driving Simulator.
20240319 Car Simulator Plan.pptx . Plan for a JavaScript Car Driving Simulator.20240319 Car Simulator Plan.pptx . Plan for a JavaScript Car Driving Simulator.
20240319 Car Simulator Plan.pptx . Plan for a JavaScript Car Driving Simulator.Sharon Liu
 
Kawika Technologies pvt ltd Software Development Company in Trivandrum
Kawika Technologies pvt ltd Software Development Company in TrivandrumKawika Technologies pvt ltd Software Development Company in Trivandrum
Kawika Technologies pvt ltd Software Development Company in TrivandrumKawika Technologies
 
Watermarking in Source Code: Applications and Security Challenges
Watermarking in Source Code: Applications and Security ChallengesWatermarking in Source Code: Applications and Security Challenges
Watermarking in Source Code: Applications and Security ChallengesShyamsundar Das
 
eAuditor Audits & Inspections - conduct field inspections
eAuditor Audits & Inspections - conduct field inspectionseAuditor Audits & Inspections - conduct field inspections
eAuditor Audits & Inspections - conduct field inspectionsNirav Modi
 

Último (20)

New ThousandEyes Product Features and Release Highlights: March 2024
New ThousandEyes Product Features and Release Highlights: March 2024New ThousandEyes Product Features and Release Highlights: March 2024
New ThousandEyes Product Features and Release Highlights: March 2024
 
Enterprise Document Management System - Qualityze Inc
Enterprise Document Management System - Qualityze IncEnterprise Document Management System - Qualityze Inc
Enterprise Document Management System - Qualityze Inc
 
Webinar_050417_LeClair12345666777889.ppt
Webinar_050417_LeClair12345666777889.pptWebinar_050417_LeClair12345666777889.ppt
Webinar_050417_LeClair12345666777889.ppt
 
OpenChain Webinar: Universal CVSS Calculator
OpenChain Webinar: Universal CVSS CalculatorOpenChain Webinar: Universal CVSS Calculator
OpenChain Webinar: Universal CVSS Calculator
 
Top Software Development Trends in 2024
Top Software Development Trends in  2024Top Software Development Trends in  2024
Top Software Development Trends in 2024
 
React 19: Revolutionizing Web Development
React 19: Revolutionizing Web DevelopmentReact 19: Revolutionizing Web Development
React 19: Revolutionizing Web Development
 
Sales Territory Management: A Definitive Guide to Expand Sales Coverage
Sales Territory Management: A Definitive Guide to Expand Sales CoverageSales Territory Management: A Definitive Guide to Expand Sales Coverage
Sales Territory Management: A Definitive Guide to Expand Sales Coverage
 
Generative AI for Cybersecurity - EC-Council
Generative AI for Cybersecurity - EC-CouncilGenerative AI for Cybersecurity - EC-Council
Generative AI for Cybersecurity - EC-Council
 
Cybersecurity Challenges with Generative AI - for Good and Bad
Cybersecurity Challenges with Generative AI - for Good and BadCybersecurity Challenges with Generative AI - for Good and Bad
Cybersecurity Challenges with Generative AI - for Good and Bad
 
Transforming PMO Success with AI - Discover OnePlan Strategic Portfolio Work ...
Transforming PMO Success with AI - Discover OnePlan Strategic Portfolio Work ...Transforming PMO Success with AI - Discover OnePlan Strategic Portfolio Work ...
Transforming PMO Success with AI - Discover OnePlan Strategic Portfolio Work ...
 
Deep Learning for Images with PyTorch - Datacamp
Deep Learning for Images with PyTorch - DatacampDeep Learning for Images with PyTorch - Datacamp
Deep Learning for Images with PyTorch - Datacamp
 
Vectors are the new JSON in PostgreSQL (SCaLE 21x)
Vectors are the new JSON in PostgreSQL (SCaLE 21x)Vectors are the new JSON in PostgreSQL (SCaLE 21x)
Vectors are the new JSON in PostgreSQL (SCaLE 21x)
 
Fields in Java and Kotlin and what to expect.pptx
Fields in Java and Kotlin and what to expect.pptxFields in Java and Kotlin and what to expect.pptx
Fields in Java and Kotlin and what to expect.pptx
 
How Does the Epitome of Spyware Differ from Other Malicious Software?
How Does the Epitome of Spyware Differ from Other Malicious Software?How Does the Epitome of Spyware Differ from Other Malicious Software?
How Does the Epitome of Spyware Differ from Other Malicious Software?
 
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/MLBig Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
 
Optimizing Business Potential: A Guide to Outsourcing Engineering Services in...
Optimizing Business Potential: A Guide to Outsourcing Engineering Services in...Optimizing Business Potential: A Guide to Outsourcing Engineering Services in...
Optimizing Business Potential: A Guide to Outsourcing Engineering Services in...
 
20240319 Car Simulator Plan.pptx . Plan for a JavaScript Car Driving Simulator.
20240319 Car Simulator Plan.pptx . Plan for a JavaScript Car Driving Simulator.20240319 Car Simulator Plan.pptx . Plan for a JavaScript Car Driving Simulator.
20240319 Car Simulator Plan.pptx . Plan for a JavaScript Car Driving Simulator.
 
Kawika Technologies pvt ltd Software Development Company in Trivandrum
Kawika Technologies pvt ltd Software Development Company in TrivandrumKawika Technologies pvt ltd Software Development Company in Trivandrum
Kawika Technologies pvt ltd Software Development Company in Trivandrum
 
Watermarking in Source Code: Applications and Security Challenges
Watermarking in Source Code: Applications and Security ChallengesWatermarking in Source Code: Applications and Security Challenges
Watermarking in Source Code: Applications and Security Challenges
 
eAuditor Audits & Inspections - conduct field inspections
eAuditor Audits & Inspections - conduct field inspectionseAuditor Audits & Inspections - conduct field inspections
eAuditor Audits & Inspections - conduct field inspections
 

Data Science on AWS - Collision Conference - June 2020

  • 1. © 2020, Amazon Web Services, Inc. or its Affiliates. Chris Fregly Developer Advocate AI/ML Amazon Web Services Data Science on AWS
  • 2. © 2020, Amazon Web Services, Inc. or its Affiliates. Who Am I? Based in San Francisco Meetup Organizer,Advanced SageMaker and Kubeflow Meetup: https://meetup.com/Advanced-Kubeflow/ (12,000+ Members) O’Reilly Author, Data Science on AWS: https://datascienceonaws.com Former Engineer at Netflix (Video Streaming) and Databricks (Spark Streaming)
  • 3. © 2020, Amazon Web Services, Inc. or its Affiliates. Agenda Why Choose AWS for Data Science? Amazon Managed Services for Data Science DEMOs! Analyze Amazon Customer Reviews Dataset Ingest S3 Data with Amazon Athena and Redshift Data Analysis with Pandas, Matplotlib, and Amazon SageMaker Notebooks Data Quality Checks with Apache Spark and Amazon SageMaker Processing Jobs
  • 4. © 2020, Amazon Web Services, Inc. or its Affiliates. Why Choose AWS for Data Science?
  • 5. © 2020, Amazon Web Services, Inc. or its Affiliates. Most secure infrastructure and certifications Most scalable and cost effective options Easiest to build data science solutions Most comprehensive and open Why Choose AWS for Data Science?
  • 6. © 2020, Amazon Web Services, Inc. or its Affiliates. Build secure data lakes in days (vs. months or years) A single storage layer (S3) for all analytics and ML Deep integration across all AWS analytics and machine learning services including federated queries across different services The fastest way to go from Zero to Business Insights, covering all data for all users Easiest to Build Data Science Solutions
  • 7. © 2020, Amazon Web Services, Inc. or its Affiliates. Compliance AWS Artifact Amazon Inspector Amazon Cloud HSM Amazon Cognito AWS CloudTrail Security Amazon GuardDuty AWS Shield AWSWAF Amazon Macie VPC Encryption AWS Certification Manager AWS Key Management Service Encryption at rest Encryption in transit Bring your own keys, HSM support Identity AWS IAM AWS SSO Amazon Cloud Directory AWS Directory Service AWS Organizations Customers need to have multiple levels of security, identity and access management, encryption, and compliance to secure their data lake Most Secure Infrastructure
  • 8. © 2020, Amazon Web Services, Inc. or its Affiliates. CSA Cloud Security Alliance Controls ISO 9001 Global Quality Standard ISO 27001 Security Management Controls ISO 27017 Cloud Specific Controls ISO 27018 Personal Data Protection PCI DSS Level 1 Payment Card Standards SOC 1 Audit Controls Report SOC 2 Security, Availability, & Confidentiality Report SOC 3 General Controls Report Global United States CJIS Criminal Justice Information Services DoD SRG DoD Data Processing FedRAMP Government Data Standards FERPA Educational Privacy Act FIPS Government Security Standards FISMA Federal Information Security Management GxP Quality Guidelines and Regulations ISO FFIEC Financial Institutions Regulation HIPPA Protected Health Information ITAR International Arms Regulations MPAA Protected Media Content NIST National Institute of Standards and Technology SEC Rule 17a-4(f) Financial Data Standards VPAT/Section 508 Accountability Standards Asia Pacific FISC [Japan] Financial Industry Information Systems IRAP [Australia] Australian Security Standards K-ISMS [Korea] Korean Information Security MTCS Tier 3 [Singapore] Multi-Tier Cloud Security Standard My Number Act [Japan] Personal Information Protection Europe C5 [Germany] Operational Security Attestation Cyber Essentials Plus [UK] Cyber Threat Protection G-Cloud [UK] UK Government Standards IT-Grundschutz [Germany] Baseline Protection Methodology X P G Most Certifications
  • 9. © 2020, Amazon Web Services, Inc. or its Affiliates. Migration & Streaming Services Infrastructure Data Catalog & ETL Security & Management Data Warehousing Big Data Processing Interactive Query Operational Analytics Real time Analytics Serverless Data processing Data movement Analytics Data lake infrastructure & management Dashboards Predictive Analytics Data, visualization, engagement, & machine learning Digital User EngagementData Most Comprehensive and Open
  • 10. © 2020, Amazon Web Services, Inc. or its Affiliates. Five highly available storage tiers including intelligent tiering Industry leading choice of 200+ instance types to meet workload needs On-demand, Reserved, and Spot instances to reduce costs 100 Gbps bandwidth network interfaces for performance Most Scalable, Cost-Effective, Performant Infrastructure
  • 11. © 2020, Amazon Web Services, Inc. or its Affiliates. Amazon Managed Services for Data Science
  • 12. © 2020, Amazon Web Services, Inc. or its Affiliates. Amazon SageMaker Notebooks –Web-Based Environment • Compatible with Jupyter and JupyterLab Notebooks Access your notebooks in seconds Administrators manage access and permissions Share notebooks with a single click Dial up or down compute resources Start your notebooks without spinning up compute resources
  • 13. © 2020, Amazon Web Services, Inc. or its Affiliates. Amazon S3 – Data Lake • Massively Scalable Object Storage • 99.99999999999% Durability (11 9’s) • Global Replication • Cost-Effective Storage Options • Many Partner Integrations
  • 14. © 2020, Amazon Web Services, Inc. or its Affiliates. Amazon Athena – Data Queries • Serverless, Interactive Query Service • Dynamically Scalable to Large Workloads Pay per query Pay only for queries run Save 30–90% on per-query costs through compression Use S3 storage ANSI SQL JDBC/ODBC drivers Multiple formats, compression types, and complex joins and data types SQL Serverless: zero infrastructure, zero administration Integrated with QuickSight EasyQuery instantly Zero setup cost Point to S3 and start querying
  • 15. © 2020, Amazon Web Services, Inc. or its Affiliates. Best performance, most scalable 3x faster with RA3* 10x faster with AQUA* Adds unlimited compute capacity on- demand to meet unlimited concurrent access Lowest cost Cost-optimized workloads by paying compute and storage separately 1/10th cost ofTraditional DW at $1000/TB/year Up to 75% less than other cloud data warehouses & predictable costs Data lake & AWS integration Analyze exabytes of data across data warehouse, data lakes, and operational database Query data across various analytics services Most secure & compliant AWS-grade security (eg.VPC, encryption with KMS, CloudTrail) All major certifications such as SOC, PCI, DSS, ISO, FedRAMP, HIPPA • Most Popular Cloud Data Warehouse *vs other cloud DWs Amazon Redshift – DataWarehouse
  • 16. © 2020, Amazon Web Services, Inc. or its Affiliates. Amazon SageMaker Processing Jobs – Data Processing • Large-Scale Data Processing • Supports Apache Spark Use SageMaker’s built-in containers or bring your own Bring your own script for feature engineering Custom processing Achieve distributed processing for clusters Your resources are created, configured, & terminated automatically Leverage SageMaker’s security & compliance features
  • 17. © 2020, Amazon Web Services, Inc. or its Affiliates. AWS Open Source Libraries AWS DataWrangler Simplify data querying and processing in Python and Pandas • https://github.com/awslabs/aws-data-wrangler • https://aws-data-wrangler.readthedocs.io/en/latest/ AWS Deequ Data Quality Checks for Your Pipelines • https://github.com/awslabs/aws-data-wrangler • https://aws-data-wrangler.readthedocs.io/en/latest/
  • 18. © 2020, Amazon Web Services, Inc. or its Affiliates. DEMOs! https://github.com/data-science-on-aws/workshop Amazon Customer Reviews Dataset (150+ Million Reviews) https://s3.amazonaws.com/amazon-reviews-pds/readme.html
  • 19. © 2020, Amazon Web Services, Inc. or its Affiliates. Thank you! Chris Fregly @cfregly https://data-science-on-aws/workshop https://linkedin.com/in/cfregly/