Open for Business…
WHO AM I
• Big Data / Analytics / BI & Cloud Solutions Specialist

• http://www.linkedin.com/in/JulioPhilippe
• Skills: Architecture, Business Intelligence, IT Transformation, Cloud Computing, IT Solutions, Management, Mentoring, Big Data, Analytics, Business Development, Hadoop, Datacenter Optimization, Data Warehousing

Big Data with Not Only SQL
BIG DATA MANAGEMENT INSIGHT

« Data are not born relevant; they become so! »

DATA-DRIVEN ON-LINE WEBSITES
• To run the apps: messages, posts, blog entries, video clips, maps, web graph...
• To give the data context: friend networks, social networks, collaborative filtering...
• To keep the applications running: web logs, system logs, system metrics, database query logs...

BIG DATA – NOT ONLY DATA VOLUME
• Improve analytics and statistics models
• Extract business value by analyzing large volumes of multi-structured data from various sources such as databases, websites, blogs, social media, smart sensors...
• Have efficient architectures, massively parallel, highly scalable and available, to handle very large data volumes up to several petabytes

Thematics
• Web Technologies
• Database Scale-out
• Relational Data Analytics
• Distributed Data Analytics
• Distributed File Systems
• Real Time Analytics

BIG DATA APPLICATIONS DOMAINS
• Digital marketing optimization (e.g., web analytics, attribution, golden
path analysis)
• Data exploration and discovery (e.g., identifying new data-driven
products, new markets)
• Fraud detection and prevention (e.g., revenue protection, site integrity
& uptime)
• Social network and relationship analysis (e.g., influencer marketing,
outsourcing, attrition prediction)
• Machine-generated data analytics (e.g., remote device insight, remote
sensing, location-based intelligence)
• Data retention (e.g., long-term conservation, data archiving)

SOME BIG DATA USE CASES BY INDUSTRY
Energy
• Smart meter analytics
• Distribution load forecasting & scheduling
• Condition-based maintenance
• Customer relationship management

Telecommunications
• Network performance
• New products & services creation
• Call Detail Records (CDRs) analysis
• Customer relationship management

Retail
• Dynamic price optimization
• Localized assortment
• Supply-chain management

Manufacturing
• Supply chain management
• Customer Care Call Centers
• Preventive Maintenance and Repairs
• Customer relationship management

Banking
• Fraud detection
• Trade surveillance
• Compliance and regulatory
• Customer relationship management

Insurance
• Catastrophe modeling
• Claims fraud
• Reputation management
• Customer relationship management

Public
• Fraud detection
• Fighting criminality
• Threats detection
• Cyber security

Media
• Large-scale clickstream analytics
• Abuse and click-fraud prevention
• Social graph analysis and profile segmentation
• Campaign management and loyalty programs

Healthcare
• Clinical trials data analysis
• Patient care quality and program analysis
• Supply chain management
• Drug discovery and development analysis

TOP 10 BIG DATA SOURCES
1. Social network profiles
2. Social influencers
3. Activity-generated data
4. SaaS & Cloud Apps
5. Public web information
6. MapReduce results
7. Data warehouse appliances
8. Columnar/NoSQL databases
9. Network and in-stream monitoring technologies
10. Legacy documents

NEW DATA AND MANAGEMENT ECONOMICS
[Figure: Compute Trends and Storage Trends]
• New Analytics (Massively Parallel Processing, Algorithms…)
• New Data Structure (Distributed File Systems, NoSQL Database, NewSQL…)
• Data warehouse: OLTP is the data warehouse; Proprietary and dedicated data warehouse; General purpose data warehouse; Enterprise data warehouse; Logical Data Warehouse
• Data structure and storage: Master/Slave; Master/Master; Federated/Sharded; Distributed File Systems; Objects storage; Multi-Structured Data
• Master Data Management, Data Quality, Data Integration

MOVING COMPUTATION TO STORAGE
General Purpose Storage Servers
• Combine servers with disks & networking to reduce latency
• Specialized software enables general-purpose system designs to provide high-performance data services

Moving Data Processing to Storage
[Figure: three stacks labelled Legacy, Emerging and Next Gen., each built from Application, Data Processing, Metadata Mgmt and Storage layers, with a Network and a Storage Array (SAN, NAS) at one end and Servers at the other]

BIG DATA ARCHITECTURE
BI & DWH Architecture - Conventional
• SQL based
• High availability
• Enterprise database
• Right design for structured data
• Current storage hardware (SAN, NAS, DAS)

Analytics Architecture – Next Generation
• Not only SQL based
• High scalability, availability and flexibility
• Compute and storage in the same box for reducing the network latency
• Right design for semi-structured and unstructured data

[Figure: App Servers, Network Switches, Database Servers, SAN Switch and Storage Array (conventional); Edge Nodes, Network Switches and Data Nodes (next generation)]

DATA WAREHOUSE

• Data Warehouse appliances
– EMC Greenplum
– Microsoft Parallel Data Warehouse
– IBM Netezza
– Oracle Exadata
– SAP HANA
– ParAccel Analytic Database
– Teradata
– HP Vertica

• SQL Database
• Massively Parallel Processing
• Hadoop Connectivity
• Column-Oriented database
• In-Memory database

MAPREDUCE ALGORITHMS
MapReduce
• MapReduce is the programming paradigm popularized by Google researchers
• Hadoop is an open-source implementation of MapReduce, developed largely at Yahoo
• Open source software framework for distributed computation
• Parallel computation (Map) on each block (Split) of data in a DFS file, outputting a stream of (Key, Value) pairs to the local file system
• JobTracker schedules and manages jobs
• TaskTracker executes individual map() and reduce() tasks on each cluster node

Algorithms
• Association Rule Learning Algorithms
• Genetic Algorithms
• Neural Network Algorithms
• Statistical Algorithms (Pandas)
• Machine Learning Algorithms (Mahout, Weka, scikit-learn)
• Natural Language Processing Algorithms
• Trading Algorithms
• Clinical design Algorithms
• Searching Algorithms (Lucene, Solr, Katta, Elasticsearch, OpenSearchServer…)

Languages
• PHP
• Erlang
• Python
• Ruby
• R
• Java
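
The map, copy/sort and reduce steps named on this slide can be sketched in a few lines of plain Python. This is only an in-memory illustration under stated assumptions: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `run_job`) and the word-count task are hypothetical and are not part of Hadoop's API, and nothing here is distributed.

```python
from collections import defaultdict
from itertools import chain


def map_phase(split):
    """Map: emit a (key, value) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def shuffle_phase(mapped_pairs):
    """Copy/Sort: group values by key and order the keys before reducing."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(key, values):
    """Reduce: collapse the list of values for one key into a single result."""
    return key, sum(values)


def run_job(splits):
    """Run map on every split, then shuffle, then reduce each key."""
    mapped = chain.from_iterable(map_phase(split) for split in splits)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_job(["big data with not only sql", "not only data volume"]))
```

In a real cluster each call to `map_phase` would run on the node that holds the split, which is the point of moving computation to the data.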
DISTRIBUTED FILE SYSTEMS
• System that permanently stores data
• Divided into logical units (files, shards, chunks, blocks…)
• A file path joins file and directory names into a relative or absolute address to identify a file
• Supports access to files and remote servers
• Supports concurrency
• Supports distribution
• Supports replication
• NFS, GPFS, Hadoop HDFS, GlusterFS, MogileFS, MooseFS…

[Figure: an App accessing a Master and three Slaves]

NOSQL DATABASES CATEGORIES
NoSQL = Not only SQL
• Popular name for a subset of structured storage software that is designed with the intention of delivering increased optimization for high-performance operations on large datasets
• Basically available, scalable, eventually consistent
• Easy to use
• Tolerant of scale by way of horizontal distribution

Column: BigTable (Google), HBase, Cassandra (DataStax), Hypertable…
Key-Value: Redis, Riak (Basho), CouchBase, Voldemort (LinkedIn), MemcacheDB…
Graph: Neo4j (Neo Technology), Jena, InfiniteGraph (Objectivity), FlockDB (Twitter)…
Document: MongoDB (10Gen), CouchDB, Terrastore, SimpleDB (AWS)…

NOSQL DATABASES CATEGORIES
Key-Value
• Store items as alphanumeric identifier (Key)
• Associate values in simple standalone tables
• Values must be (string, list, set)
• Data search based on key
• Fast and highly scalable to retrieve a value
• Domains: managing user profiles, retrieving product name…

Column
• BigTable-style database
• Column-oriented data structure that accommodates multiple attributes per key
• Petabyte scale
• Domains: distributed data storage, versioning with timestamp, sorting, parsing, data exploration

Document
• Documents (objects) map nicely to programming language data types
• Value = Collection > Document > Field
• Embedded documents and arrays reduce need for joins
• Dynamically-typed for easy schema evolution
• No joins and no multi-document transactions for high performance and easy scalability

Graph
• Structured relational graphs of interconnected key-value pairings
• Object-oriented network of nodes (Node), node relationships (Edge), properties (node attributes expressed as key-value pairs)
• Relation between data
• Domains: social networks, recommendations, investigations, relationships…

Key-Value example
Key      Value
User001  Peter
User002  Paul
User003  Rick

Column example
Key  Timestamp  Type   Size
E1   12         Zebra  Medium
     11         Lion   Big
E2   13         Bird   Small

Document example (Collection)
Document  Name     Age
Doc001    Paul     30
Doc002    Jacques  35

Graph example
Node  Name  Age
X     John  30
Y     Bob   50

Edge  a  b
      X  Y
      Y  X

NoSQL Data Modeling Techniques
Geo hashing, Index table, Composite keys aggregation, Materialized paths…
http://highlyscalable.wordpress.com/2012/03/01/nosql-data-modeling-techniques/

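
To make the four categories concrete, the small example tables on this slide can be written as plain Python structures. This is a sketch, not any product's storage format: the variable and helper names are hypothetical, and reading each Edge row as one connection between endpoints `a` and `b` is an assumption about the slide's table.

```python
# Key-value: one opaque value per key.
key_value = {"User001": "Peter", "User002": "Paul"}

# Column (BigTable-style): row key -> column -> {timestamp: value}.
column = {
    "E1": {"Type": {12: "Zebra", 11: "Lion"}, "Size": {12: "Medium", 11: "Big"}},
    "E2": {"Type": {13: "Bird"}, "Size": {13: "Small"}},
}

# Document: collection -> document -> field.
collection = {
    "Doc001": {"Name": "Paul", "Age": 30},
    "Doc002": {"Name": "Jacques", "Age": 35},
}

# Graph: nodes carry key-value properties, edges connect two nodes.
nodes = {"X": {"Name": "John", "Age": 30}, "Y": {"Name": "Bob", "Age": 50}}
edges = [("X", "Y"), ("Y", "X")]  # assumed reading: (endpoint a, endpoint b)


def latest(row_key, column_name):
    """Return the most recent version of one column for one row key."""
    versions = column[row_key][column_name]
    return versions[max(versions)]


def neighbours(node):
    """Return the nodes reachable from `node` over one edge."""
    return [b for a, b in edges if a == node]


if __name__ == "__main__":
    print(key_value["User001"], latest("E1", "Type"),
          collection["Doc002"]["Age"], neighbours("X"))
```

The access pattern differs per model: a single key lookup, a key plus column plus version, a path into a nested document, or a traversal along edges.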
NEW SQL
• Relational database with horizontal scalability
• MySQL Ecosystem

• Distributed database with MySQL compliance: Cubrid
• Analytic database: InfiniDB
• In-Memory database with MySQL compliance: VoltDB

BIG DATA ARCHITECTURE OVERVIEW
[Figure: layered architecture, from users down to data sources]
• Users: Administrator, Engineers, Analysts, Business Users, Data Scientists, Mobile Clients
• Activities: Development, Data Management, Data Modeling, BI / Analytics, Activity Reporting, Data Quality, Master Data Management, Mobile Apps
• Data Analysis & Visualization
• NoSQL: Unstructured and structured Data Warehouse, MPP, NoSQL Engine, Distributed File Systems, Share-Nothing Architecture, Algorithms
• SQL: Structured Data Warehouse and OLAP Cubes, MPP, In-Memory, Column Database, SQL Engine, Share-Nothing Architecture
• Data Transfer, Data Integration
• Data sources: Files, Web Data, RDBMS

HDFS & MAPREDUCE
• Hadoop Distributed File System
- A scalable, fault-tolerant, high-performance distributed file system
- Asynchronous replication
- Write-once and read-many (WORM)
- Hadoop cluster with 3 DataNodes minimum
- Data divided into blocks, each block replicated 3 times (default)
- No RAID required for DataNode
- Interfaces: Java, Thrift, C Library, FUSE, WebDAV, HTTP, FTP
- NameNode holds filesystem metadata
- Files are broken up and spread over the DataNodes

• Hadoop MapReduce
- Software framework for distributed computation
- Input | Map() | Copy/Sort | Reduce() | Output
- JobTracker schedules and manages jobs
- TaskTracker executes individual map() and reduce() tasks on each cluster node

[Figure: Clients, a Master Node and Worker Nodes]

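
As a rough illustration of "data divided into blocks, each block replicated 3 times", the sketch below splits a file into blocks and assigns each block to three DataNodes. The 64 MB block size, the round-robin placement and the names `plan_blocks` and `raw_storage_mb` are assumptions made for the example; real HDFS placement is decided by the NameNode with its own policy.

```python
import math

BLOCK_SIZE_MB = 64   # assumed block size for the example
REPLICATION = 3      # default replication factor quoted on the slide


def plan_blocks(file_size_mb, datanodes,
                block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return {block_index: [datanode, ...]} using simple round-robin placement."""
    if len(datanodes) < replication:
        raise ValueError("need at least as many DataNodes as replicas")
    n_blocks = math.ceil(file_size_mb / block_size_mb)
    placement = {}
    for block in range(n_blocks):
        # Consecutive positions modulo the node count are always distinct nodes.
        placement[block] = [datanodes[(block + r) % len(datanodes)]
                            for r in range(replication)]
    return placement


def raw_storage_mb(file_size_mb, replication=REPLICATION):
    """Raw disk consumed once every block has all of its replicas."""
    return file_size_mb * replication


if __name__ == "__main__":
    print(plan_blocks(200, ["dn1", "dn2", "dn3", "dn4"]))
    print(raw_storage_mb(200))
```

The check on the number of DataNodes mirrors the slide's minimum of three nodes: with fewer nodes than replicas, the copies of a block could not all sit on different machines.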
HBASE
• Clone of BigTable (Google)
• Implemented in Java (Clients: Java, C++, Ruby...)
• Data is stored "column-oriented"
• Distributed over many servers
• Tolerant of machine failure
• Layered over HDFS
• Strong consistency
• It's not a relational database (no joins)
• Sparse data – nulls are stored for free
• Semi-structured or unstructured data
• Data changes through time
• Versioned data
• Scalable – goal of billions of rows x millions of columns

Table (figure labels: Row, Key, Region, Family, Column, Cell)
Row Key     Timestamp  Animal:Type  Animal:Size  Repair:Cost
Enclosure1  12         Zebra        Medium       1000€
            11         Lion         Big
Enclosure2  13         Monkey       Small        1500€

(Table, Row_Key, Family, Column, Timestamp) = Cell (Value)

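
The cell coordinates above, (Table, Row_Key, Family, Column, Timestamp) = Cell (Value), can be modelled as nested dictionaries. The class below is a toy sketch with hypothetical names (`TinyTable`, `put`, `get`); it is not the HBase client API and only mimics sparse, versioned cells in memory.

```python
from collections import defaultdict


class TinyTable:
    """Toy model of one table: (row_key, family, column, timestamp) -> value."""

    def __init__(self):
        # row_key -> "family:column" -> {timestamp: value}
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, column, timestamp, value):
        """Store one version of one cell."""
        self._rows[row_key][f"{family}:{column}"][timestamp] = value

    def get(self, row_key, family, column, timestamp=None):
        """Return the requested version, or the newest one if no timestamp is given."""
        versions = self._rows.get(row_key, {}).get(f"{family}:{column}", {})
        if not versions:
            return None  # sparse: an absent cell is simply not stored
        if timestamp is None:
            timestamp = max(versions)
        return versions.get(timestamp)


if __name__ == "__main__":
    table = TinyTable()
    table.put("Enclosure1", "Animal", "Type", 12, "Zebra")
    table.put("Enclosure1", "Animal", "Type", 11, "Lion")
    table.put("Enclosure1", "Repair", "Cost", 12, "1000€")
    print(table.get("Enclosure1", "Animal", "Type"))
    print(table.get("Enclosure1", "Animal", "Type", timestamp=11))
    print(table.get("Enclosure2", "Repair", "Cost"))
```

Because missing cells are never written, a row with millions of possible columns costs only as much as the cells it actually holds.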
HBASE
• Table
- Regions for scalability, defined by row [start-key, end-key)
- Store for efficiency, 1 per Family
- 1..n StoreFiles (HFile format on HDFS)
• Everything is stored as bytes
• Rows are ordered sequentially by key
• Special tables -ROOT-, .META.
- Tell clients where to find user data

http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html

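
Because rows are ordered by key and each region covers a half-open range [start-key, end-key), finding the region for a row is a binary search over the sorted start keys. The sketch below assumes a hypothetical region map (`START_KEYS`, `SERVERS`) and a hypothetical helper `locate_region`; it only mirrors the kind of lookup the special tables make possible and is not HBase code.

```python
import bisect

# Hypothetical region map: sorted start keys and the server hosting each region.
# Region i covers [START_KEYS[i], START_KEYS[i + 1]); the last region is open-ended.
START_KEYS = [b"", b"g", b"n", b"t"]
SERVERS = ["rs1", "rs2", "rs3", "rs4"]


def locate_region(row_key: bytes) -> str:
    """Return the server whose region contains row_key (start key inclusive)."""
    index = bisect.bisect_right(START_KEYS, row_key) - 1
    return SERVERS[index]


if __name__ == "__main__":
    for key in (b"apple", b"g", b"mzz", b"n", b"zebra"):
        print(key, locate_region(key))
```

Keys are compared as raw bytes, which matches the slide's point that everything is stored as bytes.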
HADOOP INFRASTRUCTURE

Network Switches

2 x Apps Server
• 2 CPU 6 core
• 96 GB RAM
• 6 x HDD 600GB 15K RAID 1

2 x NameNode/BackupNode/Admin
• 2 CPU 6 core
• 96 GB RAM
• 6 x HDD 600GB 15K RAID 1

3 to n x DataNode
• 2 CPU 6 core
• 48 GB RAM
• 12 x HDD

MOGILEFS OVERVIEW
• A scalable, fault-tolerant, high-performance distributed file system
• Asynchronous Replication
• No Single Point of Failure
• Automatic file replication (3 replications recommended)
• Better than RAID
• Flat NameSpace
• Share-Nothing
• No RAID required
• Local filesystem agnostic
• Tracker (mogilefsd): client transfer, Replication, Deletion, Query, Reaper, Monitor
• DBNode: MySQL stores the MogileFS metadata (the namespace, and which files are where)
• Storage Node (mogstored): HTTP and WebDAV server; files are broken up and spread over the Storage Nodes
• Client Library: Ruby, Perl, Java, Python, PHP…

[Figure: Clients connecting to Trackers (Host1, Host4), Storage Nodes (Host2, Host5, Host6) and a DBNode (Host3)]

MOGILEFS ARCHITECTURE
[Figure: a Client Library talking to two Trackers, which use the Database and the Storage Nodes]

MOGILEFS INFRASTRUCTURE

Network Switches

2 x Apps Server
• 2 CPU 6 core
• 48 GB RAM
• 6 x HDD 600GB 15K RAID 1

2 x DB Node + 2 to n x Tracker
• 2 CPU 6 core
• 32 GB RAM
• 6 x HDD 600GB 15K RAID 1

3 to n x Storage Node
• 2 CPU 6 core
• 32 GB RAM
• 12 x HDD

GLUSTERFS OVERVIEW
• A scalable, fault-tolerant, high-performance distributed and replicated file system
• No Single Point of Failure
• Synchronous replication of volumes across storage servers
• Asynchronous replication across geographically distributed clusters
• Easily accessible usage quotas
• No Meta-Data Server (fully distributed architecture - Elastic Hash)
• Distributed / Distributed Replicated / Distributed Striped
• POSIX compliant
• FUSE (Standard)
• GlusterFS native, NFS, CIFS, HTTP, FTP, WebDAV, ZFS, EXT4…
• No proprietary format to store files on disk
• NameSpace: The unified global namespace aggregates disk and memory resources into a single pool, virtualizing the underlying hardware
• Data Store: Data is stored in logical volumes that are abstracted from the hardware and logically partitioned from each other
• Development: API, Command Line Interface, Python, Ruby, PHP languages

[Figure: Clients connecting to GlusterFS Servers on Host1 to Host6]

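
The "no metadata server" idea can be illustrated with hash-based placement: every client computes the location of a file from its path alone, so no central lookup is needed. The sketch below is a deliberately simplified stand-in, assuming a hypothetical brick list and a plain modulo over a SHA-256 digest; it is not GlusterFS's actual elastic hashing algorithm.

```python
import hashlib

# Hypothetical storage bricks; names are for illustration only.
BRICKS = ["host1:/brick", "host2:/brick", "host3:/brick"]


def brick_for(path: str, bricks=BRICKS) -> str:
    """Pick a brick from the file path alone, with no central metadata lookup."""
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % len(bricks)
    return bricks[bucket]


if __name__ == "__main__":
    for name in ("/logs/web-01.log", "/logs/web-02.log", "/video/clip.mp4"):
        print(name, "->", brick_for(name))
```

Any two clients that agree on the brick list compute the same answer, which is what removes the metadata server from the read and write path in this simplified picture.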
GLUSTERFS ARCHITECTURE

GLUSTERFS INFRASTRUCTURE

Network Switches

2 x Apps Server
• 2 CPU 6 core
• 48 GB RAM
• 6 x HDD 600GB 15K RAID 1

2 x Backup Node / Admin
• 2 CPU 6 core
• 32 GB RAM
• 6 x HDD 600GB 15K RAID 1

3 to n x GlusterFS Server
• 2 CPU 6 core
• 32 GB RAM
• 12 x HDD

MOOSEFS OVERVIEW
• A scalable, fault-tolerant, high-performance distributed and replicated file system
• Spreads data over several physical servers, which are visible to the user as one resource
• No Single Point of Failure
• Distribution of data across data servers via chunks
• Maximum chunk size = 64MB
• File duplication (1 to 3 and more if necessary)
• POSIX compliant
• FUSE Interface
• No proprietary format to store files on disk
• Master Server: a single machine managing the whole filesystem, storing metadata for every file (information on size, attributes and file location(s)), including all information about non-regular files, i.e. directories, sockets, pipes and devices. Metadata is stored in memory
• Metalogger Server: any number of servers, all of which store metadata changelogs and periodically download the main metadata file, so that one of them can be promoted to the role of managing server when the primary master stops working
• Data Server: any number of commodity servers storing file data and synchronizing it among themselves

[Figure: Clients connecting to a Master Server (Host1), a Metalogger Server (Host4) and Data Servers (Host2, Host3, Host5, Host6)]

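
The chunking rule above (maximum chunk size of 64MB, with a configurable number of copies) reduces to simple arithmetic. The sketch below uses hypothetical helper names (`split_into_chunks`, `stored_bytes`) and a `copies` parameter standing in for the duplication level; it is an illustration, not MooseFS code.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # maximum chunk size quoted on the slide


def split_into_chunks(file_size: int, chunk_size: int = CHUNK_SIZE):
    """Return (offset, length) pairs covering a file of file_size bytes."""
    chunks = []
    offset = 0
    while offset < file_size:
        length = min(chunk_size, file_size - offset)
        chunks.append((offset, length))
        offset += length
    return chunks


def stored_bytes(file_size: int, copies: int = 2) -> int:
    """Raw bytes kept on the data servers when each chunk has `copies` copies."""
    return file_size * copies


if __name__ == "__main__":
    size = 150 * 1024 * 1024
    print(len(split_into_chunks(size)), stored_bytes(size, copies=3))
```

Only the last chunk of a file is smaller than the maximum, so a 150MB file occupies three chunks: two full ones and one of 22MB.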
MOOSEFS READ PROCESS

Read Process
1. Where is the data?
2. The data is on x chunk servers
3. Send me the data
4. The data

http://www.moosefs.org/

MOOSEFS WRITE PROCESS

Write Process
1. Where to write the data?
2. Create new chunk on x chunk server
3. Success
4. Write the data
5. Synchronize the data
6. Success
7. Success
8. Send write session end signal

http://www.moosefs.org/

MOOSEFS INFRASTRUCTURE

Network Switches

2 x Apps Server
• 2 CPU 6 core
• 48 GB RAM
• 6 x HDD 600GB 15K RAID 1

2 x Master / Metalogger / Admin Server
• 2 CPU 6 core
• 96 GB RAM
• 6 x HDD 600GB 15K RAID 1

3 to n x Data Server
• 2 CPU 6 core
• 32 GB RAM
• 12 x HDD

CASSANDRA OVERVIEW
• Every node plays the same role
• Highly Available
• Really fast reads, really fast writes
• Flexible schemas
• Distributed, Replicated
• No Master, no Slaves
• No Single Point of Failure
• Client can talk to any node
• Written in Java

[Figure: node architecture with Cassandra API and Tools, Storage Layer, Partitioner, Replicator, Failure Detector, Cluster Membership and Messaging Layer]

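
The Partitioner and Replicator boxes can be pictured as a hash ring: every node owns a token, a row key is hashed onto the ring, and the next nodes clockwise hold the replicas. The sketch below is a minimal illustration under assumptions: the `Ring` class is hypothetical, and SHA-256 is a stand-in hash, not Cassandra's own partitioner.

```python
import bisect
import hashlib


class Ring:
    """Toy token ring: every node plays the same role, keys map to successor nodes."""

    def __init__(self, nodes, replication_factor=3):
        self.replication_factor = min(replication_factor, len(nodes))
        self.tokens = sorted((self._token(node), node) for node in nodes)

    @staticmethod
    def _token(value: str) -> int:
        # Stand-in hash for the example only.
        return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")

    def replicas(self, row_key: str):
        """Return the nodes responsible for row_key, walking clockwise on the ring."""
        positions = [token for token, _ in self.tokens]
        start = bisect.bisect_right(positions, self._token(row_key)) % len(self.tokens)
        return [self.tokens[(start + i) % len(self.tokens)][1]
                for i in range(self.replication_factor)]


if __name__ == "__main__":
    ring = Ring(["node1", "node2", "node3", "node4"])
    print(ring.replicas("User001"))
```

Since any node can compute the same replica list, a client can send its request to whichever node it likes, and that node can forward it.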
CASSANDRA – COLUMN-ORIENTED
[Figure: a Key maps to SuperColumns, each holding Columns; a Column has +Name, +Value, +Timestamp]

• Column Family
– Think of it as a DB table
• Column
– Key-Value pair (not just a value, like a DB column)
– Timestamp
• SuperColumn
– Columns inside a column
– The values are columns
– No timestamp
• Keyspace – like a namespace, generally 1 per app
• Indexes
• Queries

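
The nesting described on this slide (keyspace, column family, row key, then named columns carrying a value and a timestamp) can be sketched as nested dictionaries. The `Keyspace` and `Column` classes below are hypothetical, and the rule that the newest timestamp wins on conflicting writes is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    value: str
    timestamp: int


class Keyspace:
    """Toy keyspace: column family -> row key -> column name -> Column."""

    def __init__(self, name):
        self.name = name
        self.column_families = {}

    def insert(self, column_family, row_key, name, value, timestamp):
        """Write one column; an older timestamp never replaces a newer one."""
        row = self.column_families.setdefault(column_family, {}).setdefault(row_key, {})
        current = row.get(name)
        if current is None or timestamp >= current.timestamp:
            row[name] = Column(name, value, timestamp)

    def get(self, column_family, row_key, name):
        """Return the stored value, or None if the column does not exist."""
        column = self.column_families.get(column_family, {}).get(row_key, {}).get(name)
        return None if column is None else column.value


if __name__ == "__main__":
    ks = Keyspace("app")
    ks.insert("Users", "User001", "name", "Peter", timestamp=10)
    ks.insert("Users", "User001", "name", "Paul", timestamp=5)
    print(ks.get("Users", "User001", "name"))
```

A SuperColumn would add one more level of nesting, with a dictionary of columns where this sketch stores a single `Column`.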
CASSANDRA INFRASTRUCTURE

Network Switches

Cassandra Nodes
• 2 CPU 6 core
• 32 GB RAM
• 12 x HDD RAID 0

MONGODB OVERVIEW
• Document-oriented database with high performance, scalability and availability
• Supports MapReduce
• Shard: holds a portion of the total data. Reads and writes are automatically routed to the appropriate shard(s). Each shard is backed by a replica set, which holds just the data for that shard
• Replica set: one or more servers, each holding copies of the same data. At any given time one is primary and the rest are secondaries. If the primary goes down, one of the secondaries takes over automatically as primary. All writes and consistent reads go to the primary, and all eventually consistent reads are distributed amongst the secondaries. A replica set is an asynchronous cluster replication technology
• Config: multiple config servers, each holding a copy of the metadata indicating which data lives on which shard
• Router: one or more routers, each acting as a server for one or more clients. Clients issue queries/updates to a router, and the router routes them to the appropriate shard while consulting the config servers
• Client: one or more clients, each of which is (part of) the user's application and issues commands to a router via the mongo client library (driver) for its language

[Figure: Clients connecting to Router (mongos servers), Config (mongod servers) and Shard (mongod servers)]

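
The routing role described above can be sketched as a lookup in range metadata followed by a forward to the owning shard. Everything named here is hypothetical (`CHUNK_BOUNDS`, `CHUNK_SHARDS`, `route`, `insert`, `find`, the `user_id` shard key); the code mimics the idea in memory and does not use the MongoDB driver.

```python
import bisect

# Hypothetical config metadata: lower bounds of shard-key ranges and their owners.
# Range i covers [CHUNK_BOUNDS[i], CHUNK_BOUNDS[i + 1]); the last one is open-ended.
CHUNK_BOUNDS = [0, 1000, 5000]
CHUNK_SHARDS = ["shardA", "shardB", "shardC"]

# Each shard is modelled as a plain dictionary keyed by the shard key.
SHARDS = {name: {} for name in CHUNK_SHARDS}


def route(shard_key: int) -> str:
    """Router step: consult the range metadata and return the owning shard."""
    index = bisect.bisect_right(CHUNK_BOUNDS, shard_key) - 1
    if index < 0:
        raise KeyError(f"no range covers shard key {shard_key}")
    return CHUNK_SHARDS[index]


def insert(doc: dict) -> None:
    """Send a write to the shard that owns the document's shard key."""
    SHARDS[route(doc["user_id"])][doc["user_id"]] = doc


def find(user_id: int):
    """Send a read to the one shard that can hold this key."""
    return SHARDS[route(user_id)].get(user_id)


if __name__ == "__main__":
    insert({"user_id": 42, "name": "Paul"})
    insert({"user_id": 4200, "name": "Jacques"})
    print(route(42), route(4200), find(42))
```

A query that includes the shard key touches one shard only; a query without it would have to be sent to every shard, which this sketch does not model.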
MONGODB DEPLOYMENT
[Figure: an App talks to Router processes (mongos), which use Config servers (mongod) and several Shards; each Shard is a replica set of mongod processes with one Primary and Secondaries]

MONGODB INFRASTRUCTURE

Network Switches

1 to n Router servers
• 2 CPU 6 core
• 96 GB RAM
• 6 x HDD 600GB 15K RAID 10

1 to n Config servers
• 2 CPU 6 core
• 96 GB RAM
• 6 x HDD 600GB 15K RAID 10

1 to n Shard servers
• 2 CPU 6 core
• 48 GB RAM
• 12 x HDD 1TB 7.2K

COUCHDB OVERVIEW

• Open Source Distributed Database
• RESTful API
• Schema-less document store (documents in JSON format)
• Multi-Version-Concurrency-Control model
• User-defined queries structured as map/reduce
• Incremental Index Update mechanism
• Multi-Master Replication model
• Written in Erlang
• Supports MapReduce
• Easy to use data storage
• Easy to integrate with web applications: JavaScript, JSON
• Scalability for large web applications: incremental replication, bi-directional conflict detection and management
• Query-able and index-able
• Offline by default

[Figure: Clients connecting to two groups of CouchDB Servers, each with one Master and two Slaves]
• Master → Slave replication
• Master ↔ Master replication
• Filtered Replication
• Incremental and bi-directional replication
• Conflict management

COUCHDB FUNCTIONALITIES
• Document storage
– A CouchDB server hosts named databases, which store documents

• ACID Properties
– CouchDB never overwrites committed data or associated structures, ensuring the database file is always in a consistent state

• Compaction
– On schedule, or when the database file exceeds a certain amount of wasted space, the compaction process clones all the active data to a new file and then discards the old file

• Views (Model, Function, Index)
– The view model is the method of aggregating and reporting on the documents in a database. Views are built on demand to aggregate, join and report on database documents
– A view function takes a CouchDB document as an argument and then does whatever computation it needs to determine the data that is to be made available through the view, if any. It can add multiple rows to the view based on a single document, or it can add no rows at all
– A view index is a dynamic representation of the actual document contents of a database, and CouchDB makes it easy to create useful views of data. Generating a view of a database with hundreds of thousands or millions of documents is time and resource consuming, so it is not something the system should do from scratch each time

• Security
– To control who can read and update documents, CouchDB has a simple reader access and update validation model that can be extended to implement custom security models

• Distributed update and replication
– CouchDB is a peer-based distributed database system. It allows users and servers to access and update the same shared data while disconnected, and then bi-directionally replicate those changes later

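
The view function described above (one document in, zero or more rows out) can be emulated in plain Python. This is a sketch with hypothetical names (`map_by_tag`, `build_view`, `reduce_count`) and an invented `tags` field; it does not call CouchDB and does not reproduce incremental index updates.

```python
def map_by_tag(doc):
    """View function: yield zero, one or many (key, value) rows for one document."""
    for tag in doc.get("tags", []):
        yield tag, 1


def build_view(docs, map_fn):
    """Run the view function over every document and keep the rows sorted by key."""
    rows = []
    for doc in docs:
        rows.extend(map_fn(doc))
    return sorted(rows)


def reduce_count(rows):
    """Reduce step: total the values for each key."""
    totals = {}
    for key, value in rows:
        totals[key] = totals.get(key, 0) + value
    return totals


if __name__ == "__main__":
    docs = [
        {"_id": "a", "tags": ["nosql", "hadoop"]},
        {"_id": "b", "tags": ["nosql"]},
        {"_id": "c"},  # contributes no rows at all
    ]
    rows = build_view(docs, map_by_tag)
    print(rows)
    print(reduce_count(rows))
```

Document `a` contributes two rows and document `c` contributes none, which is the behaviour the slide attributes to view functions.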
COUCHDB INFRASTRUCTURE

Network Switches

1 to n Router servers
• 2 CPU 6 core
• 96 GB RAM
• 6 x HDD 600GB 15K RAID 10

1 to n Master servers
• 2 CPU 6 core
• 96 GB RAM
• 6 x HDD 600GB 15K RAID 10

1 to n Slave servers
• 2 CPU 6 core
• 48 GB RAM
• 12 x HDD 1TB 7.2K

THANK YOU

More Related Content

What's hot

The modern analytics architecture
The modern analytics architectureThe modern analytics architecture
The modern analytics architectureJoseph D'Antoni
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lakeJames Serra
 
Big Data Use Cases
Big Data Use CasesBig Data Use Cases
Big Data Use Casesboorad
 
Big Data Architecture
Big Data ArchitectureBig Data Architecture
Big Data ArchitectureGuido Schmutz
 
Big Data Day LA 2015 - Data Lake - Re Birth of Enterprise Data Thinking by Ra...
Big Data Day LA 2015 - Data Lake - Re Birth of Enterprise Data Thinking by Ra...Big Data Day LA 2015 - Data Lake - Re Birth of Enterprise Data Thinking by Ra...
Big Data Day LA 2015 - Data Lake - Re Birth of Enterprise Data Thinking by Ra...Data Con LA
 
Operational Analytics Using Spark and NoSQL Data Stores
Operational Analytics Using Spark and NoSQL Data StoresOperational Analytics Using Spark and NoSQL Data Stores
Operational Analytics Using Spark and NoSQL Data StoresDATAVERSITY
 
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...Mark Rittman
 
Enterprise Search: Addressing the First Problem of Big Data & Analytics - Sta...
Enterprise Search: Addressing the First Problem of Big Data & Analytics - Sta...Enterprise Search: Addressing the First Problem of Big Data & Analytics - Sta...
Enterprise Search: Addressing the First Problem of Big Data & Analytics - Sta...StampedeCon
 
Top Big data Analytics tools: Emerging trends and Best practices
Top Big data Analytics tools: Emerging trends and Best practicesTop Big data Analytics tools: Emerging trends and Best practices
Top Big data Analytics tools: Emerging trends and Best practicesSpringPeople
 
The Big Data Journey – How Companies Adopt Hadoop - StampedeCon 2016
The Big Data Journey – How Companies Adopt Hadoop - StampedeCon 2016The Big Data Journey – How Companies Adopt Hadoop - StampedeCon 2016
The Big Data Journey – How Companies Adopt Hadoop - StampedeCon 2016StampedeCon
 
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...Databricks
 
Building a Big Data Solution
Building a Big Data SolutionBuilding a Big Data Solution
Building a Big Data SolutionJames Serra
 
Using Machine Learning to Capture Data Meaning and Wrangle it to Liberate its...
Using Machine Learning to Capture Data Meaning and Wrangle it to Liberate its...Using Machine Learning to Capture Data Meaning and Wrangle it to Liberate its...
Using Machine Learning to Capture Data Meaning and Wrangle it to Liberate its...DataWorks Summit/Hadoop Summit
 
Seeing Redshift: How Amazon Changed Data Warehousing Forever
Seeing Redshift: How Amazon Changed Data Warehousing ForeverSeeing Redshift: How Amazon Changed Data Warehousing Forever
Seeing Redshift: How Amazon Changed Data Warehousing ForeverInside Analysis
 
Владимир Слободянюк «DWH & BigData – architecture approaches»
Владимир Слободянюк «DWH & BigData – architecture approaches»Владимир Слободянюк «DWH & BigData – architecture approaches»
Владимир Слободянюк «DWH & BigData – architecture approaches»Anna Shymchenko
 
Data Mining and Data Warehousing
Data Mining and Data WarehousingData Mining and Data Warehousing
Data Mining and Data WarehousingAmdocs
 
Hadoop Big Data Lakes Keynote
Hadoop Big Data Lakes KeynoteHadoop Big Data Lakes Keynote
Hadoop Big Data Lakes KeynoteMark van Rijmenam
 
Big-Data Server Farm Architecture
Big-Data Server Farm Architecture Big-Data Server Farm Architecture
Big-Data Server Farm Architecture Jordan Chung
 
Extending Data Lake using the Lambda Architecture June 2015
Extending Data Lake using the Lambda Architecture June 2015Extending Data Lake using the Lambda Architecture June 2015
Extending Data Lake using the Lambda Architecture June 2015DataWorks Summit
 

What's hot (20)

Taming Big Data With Modern Software Architecture
Taming Big Data  With Modern Software ArchitectureTaming Big Data  With Modern Software Architecture
Taming Big Data With Modern Software Architecture
 
The modern analytics architecture
The modern analytics architectureThe modern analytics architecture
The modern analytics architecture
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lake
 
Big Data Use Cases
Big Data Use CasesBig Data Use Cases
Big Data Use Cases
 
Big Data Architecture
Big Data ArchitectureBig Data Architecture
Big Data Architecture
 
Big Data Day LA 2015 - Data Lake - Re Birth of Enterprise Data Thinking by Ra...
Big Data Day LA 2015 - Data Lake - Re Birth of Enterprise Data Thinking by Ra...Big Data Day LA 2015 - Data Lake - Re Birth of Enterprise Data Thinking by Ra...
Big Data Day LA 2015 - Data Lake - Re Birth of Enterprise Data Thinking by Ra...
 
Operational Analytics Using Spark and NoSQL Data Stores
Operational Analytics Using Spark and NoSQL Data StoresOperational Analytics Using Spark and NoSQL Data Stores
Operational Analytics Using Spark and NoSQL Data Stores
 
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...
 
Enterprise Search: Addressing the First Problem of Big Data & Analytics - Sta...
Enterprise Search: Addressing the First Problem of Big Data & Analytics - Sta...Enterprise Search: Addressing the First Problem of Big Data & Analytics - Sta...
Enterprise Search: Addressing the First Problem of Big Data & Analytics - Sta...
 
Top Big data Analytics tools: Emerging trends and Best practices
Top Big data Analytics tools: Emerging trends and Best practicesTop Big data Analytics tools: Emerging trends and Best practices
Top Big data Analytics tools: Emerging trends and Best practices
 
The Big Data Journey – How Companies Adopt Hadoop - StampedeCon 2016
The Big Data Journey – How Companies Adopt Hadoop - StampedeCon 2016The Big Data Journey – How Companies Adopt Hadoop - StampedeCon 2016
The Big Data Journey – How Companies Adopt Hadoop - StampedeCon 2016
 
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
 
Building a Big Data Solution
Building a Big Data SolutionBuilding a Big Data Solution
Building a Big Data Solution
 
Using Machine Learning to Capture Data Meaning and Wrangle it to Liberate its...
Using Machine Learning to Capture Data Meaning and Wrangle it to Liberate its...Using Machine Learning to Capture Data Meaning and Wrangle it to Liberate its...
Using Machine Learning to Capture Data Meaning and Wrangle it to Liberate its...
 
Seeing Redshift: How Amazon Changed Data Warehousing Forever
Seeing Redshift: How Amazon Changed Data Warehousing ForeverSeeing Redshift: How Amazon Changed Data Warehousing Forever
Seeing Redshift: How Amazon Changed Data Warehousing Forever
 
Владимир Слободянюк «DWH & BigData – architecture approaches»
Владимир Слободянюк «DWH & BigData – architecture approaches»Владимир Слободянюк «DWH & BigData – architecture approaches»
Владимир Слободянюк «DWH & BigData – architecture approaches»
 
Data Mining and Data Warehousing
Data Mining and Data WarehousingData Mining and Data Warehousing
Data Mining and Data Warehousing
 
Hadoop Big Data Lakes Keynote
Hadoop Big Data Lakes KeynoteHadoop Big Data Lakes Keynote
Hadoop Big Data Lakes Keynote
 
Big-Data Server Farm Architecture
Big-Data Server Farm Architecture Big-Data Server Farm Architecture
Big-Data Server Farm Architecture
 
Extending Data Lake using the Lambda Architecture June 2015
Extending Data Lake using the Lambda Architecture June 2015Extending Data Lake using the Lambda Architecture June 2015
Extending Data Lake using the Lambda Architecture June 2015
 

Big Data with Not Only SQL

  • 2. WHO AM I • Big Data / Analytics / BI & Cloud Solutions Specialist • http://www.linkedin.com/in/JulioPhilippe • Skills Architecture Business Intelligence IT Transformation Cloud Computing IT Solutions Management Mentoring Big Data Analytics Business Development Hadoop Datacenter Optimization Data Warehousing
  • 3. BIG DATA MANAGEMENT INSIGHT « Data don’t spring relevant, they become though ! »
  • 4. DATA-DRIVEN ON-LINE WEBSITES • To run the apps: messages, posts, blog entries, video clips, maps, web graph... • To give the data context: friend networks, social networks, collaborative filtering... • To keep the applications running: web logs, system logs, system metrics, database query logs...
  • 5. BIG DATA – NOT ONLY DATA VOLUME • Improve analytics and statistical models • Extract business value by analyzing large volumes of multi-structured data from various sources such as databases, websites, blogs, social media, smart sensors... • Have efficient, massively parallel, highly scalable and available architectures to handle very large data volumes, up to several petabytes • Thematics: Web Technologies, Database Scale-out, Relational Data Analytics, Distributed Data Analytics, Distributed File Systems, Real Time Analytics
  • 6. BIG DATA APPLICATION DOMAINS • Digital marketing optimization (e.g., web analytics, attribution, golden path analysis) • Data exploration and discovery (e.g., identifying new data-driven products, new markets) • Fraud detection and prevention (e.g., revenue protection, site integrity & uptime) • Social network and relationship analysis (e.g., influencer marketing, outsourcing, attrition prediction) • Machine-generated data analytics (e.g., remote device insight, remote sensing, location-based intelligence) • Data retention (e.g., long-term conservation, data archiving)
  • 7. SOME BIG DATA USE CASES BY INDUSTRY Energy Telecommunications Retail • Smart meter analytics • Network performance • Dynamic price optimization • Distribution load forecasting & scheduling • New products & services creation • Localized assortment • Condition-based maintenance • Call Detail Records (CDRs) analysis • Supply-chain management • Customer relationship management • Customer relationship management Manufacturing Banking Insurance • Supply chain management • Fraud detection • Catastrophe modeling • Customer Care Call Centers • Trade surveillance • Claims fraud • Preventive Maintenance and Repairs • Compliance and regulatory • Reputation management • Customer relationship management • Customer relationship management • Customer relationship management Public Media Healthcare • Fraud detection • Large-scale clickstream analytics • Clinical trials data analysis • Fighting criminality • Abuse and click-fraud prevention • Patient care quality and program analysis • Threats detection • Social graph analysis and profile segmentation • Supply chain management • Cyber security • Campaign management and loyalty programs • Drug discovery and development analysis
  • 8. TOP 10 BIG DATA SOURCES 1. Social network profiles 2. Social influencers 3. Activity-generated data 4. SaaS & Cloud Apps 5. Public web information 6. MapReduce results 7. Data warehouse appliances 8. Columnar/NoSQL databases 9. Network and in-stream monitoring technologies 10. Legacy documents
  • 9. NEW DATA AND MANAGEMENT ECONOMICS • Compute Trends: New Analytics (Massively Parallel Processing, Algorithms…) • Storage Trends: New Data Structure (Distributed File Systems, NoSQL Database, NewSQL…) • Diagram labels: Logical Data Warehouse, Enterprise data warehouse, General purpose data warehouse, Proprietary and dedicated data warehouse, OLTP is the data warehouse; Master/Slave, Master/Master, Federated/Sharded; Objects storage, Multi-Structured Data, Distributed File Systems; Master Data Management, Data Quality, Data Integration
  • 10. MOVING COMPUTATION TO STORAGE General Purpose Storage Servers • Combine servers with disks & networking to reduce latency • Specialized software enables general-purpose system designs to provide high-performance data services • Diagram: Moving Data Processing to Storage, comparing Legacy, Emerging and Next Gen. stacks (layers: Application, Data Processing, Metadata Mgmt, Network, Storage; Storage Array (SAN, NAS) versus Servers)
  • 11. BIG DATA ARCHITECTURE • BI & DWH Architecture – Conventional: SQL based; high availability; enterprise database; right design for structured data; current storage hardware (SAN, NAS, DAS) • Analytics Architecture – Next Generation: not only SQL based; high scalability, availability and flexibility; compute and storage in the same box to reduce network latency; right design for semi-structured and unstructured data • Diagram labels: App Servers, Edge Nodes, Network Switches, Database Servers, Storage Array, SAN Switch, Data Nodes
  • 12. DATA WAREHOUSE • Data Warehouse appliances: EMC Greenplum, Microsoft Parallel Data Warehouse, IBM Netezza, Oracle Exadata, SAP HANA, ParAccel Analytic Database, Teradata, HP Vertica • SQL Database • Massively Parallel Processing • Hadoop Connectivity • Column-Oriented database • In-Memory database
  • 13. MAPREDUCE ALGORITHMS MapReduce • MapReduce is the programming paradigm popularized by Google researchers • Open-source Hadoop implementation of MapReduce by Yahoo • Open source software framework for distributed computation • Parallel computation (Map) on each block (Split) of data in a DFS file, outputting a stream of (Key, Value) pairs to the local file system • JobTracker schedules and manages jobs • TaskTracker executes individual map() and reduce() tasks on each cluster node Algorithms • Association Rule Learning Algorithms • Genetic Algorithms • Neural Network Algorithms • Statistical Algorithms (Pandas) • Machine Learning Algorithms (Mahout, Weka, scikit-learn) • Natural Language Processing Algorithms • Trading Algorithms • Clinical design Algorithms • Searching Algorithms (Lucene, Solr, Katta, Elasticsearch, OpenSearchServer…) Languages • PHP • Erlang • Python • Ruby • R • Java
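The Map, Copy/Sort and Reduce steps named on this slide can be illustrated with a minimal single-process sketch. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the word-count task are assumptions chosen for illustration, and the splits are processed in a loop rather than on separate cluster nodes.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(split: str) -> Iterator[tuple[str, int]]:
    """Map: emit one (key, value) pair per word found in an input split."""
    for word in split.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Copy/sort: group every value under its key, keys in sorted order."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: combine all values collected for one key."""
    return key, sum(values)


def word_count(splits: list[str]) -> dict[str, int]:
    # On a cluster each split would be mapped on a different node;
    # this sketch simply iterates over them in one process.
    mapped = (pair for split in splits for pair in map_phase(split))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count(["big data", "big clusters", "data nodes"]))
```

Keeping the three phases as separate functions mirrors the Input | Map() | Copy/Sort | Reduce() | Output pipeline; only the map and reduce functions are specific to the job.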
  • 14. DISTRIBUTED FILE SYSTEMS • Systems that permanently store data • Divided into logical units (files, shards, chunks, blocks…) • A file path joins file and directory names into a relative or absolute address to identify a file • Support access to files and remote servers • Support concurrency • Support distribution • Support replication • NFS, GPFS, Hadoop HDFS, GlusterFS, MogileFS, MooseFS… • Diagram labels: App, Master, Slave, Slave, Slave
  • 15. NOSQL DATABASES CATEGORIES NoSQL = Not only SQL • Popular name for a subset of structured storage software that is designed with the intention of delivering increased optimization for high-performance operations on large datasets • Basically available, scalable, eventually consistent • Easy to use • Tolerant of scale by way of horizontal distribution • Column: BigTable (Google), HBase, Cassandra (DataStax), Hypertable… • Key-Value: Redis, Riak (Basho), CouchBase, Voldemort (LinkedIn), MemcacheDB… • Graph: Neo4j (Neo Technology), Jena, InfiniteGraph (Objectivity), FlockDB (Twitter)… • Document: MongoDB (10Gen), CouchDB, Terrastore, SimpleDB (AWS)…
  • 16. NOSQL DATABASES CATEGORIES • Key-Value: store items as alphanumeric identifier (Key); associate values in simple standalone tables; values must be (string, list, set); data search based on key; fast and highly scalable to retrieve a value; domains: managing user profiles, retrieving product name… • Column: BigTable-style database; column-oriented data structure that accommodates multiple attributes per key; petabyte scale; domains: distributed data storage, versioning with timestamp, sorting, parsing, data exploration • Document: documents (objects) map nicely to programming language data types; Value = Collection>Document>Field; embedded documents and arrays reduce need for joins; dynamically typed for easy schema evolution; no joins and no multi-document transactions for high performance and easy scalability • Graph: structured relational graphs of interconnected key-value pairings; object-oriented network of nodes (Node), node relationships (Edge), properties (node attributes expressed as key-value pairs); relation between data; domains: social networks, recommendations, investigations, relationships… • Example tables (partially recoverable): Key-Value with columns Key, Value (User001 Peter, User002 Paul, User003); Column with columns Key, Timestamp, Type, Size (12 Zebra Medium, 11 Lion Big, 13 Bird Small); Document collection with columns Document, Name, Age (Doc001 Paul 30, Doc002 Jacques 35); Graph with columns Node, Name, Age (X John 30, Y Rick, Bob 50) and edges E1, E2 between nodes X and Y • NoSQL Data Modeling Techniques: Geo hashing, Index table, Composite keys aggregation, Materialized paths… http://highlyscalable.wordpress.com/2012/03/01/nosql-data-modeling-techniques/
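As a rough illustration of how the four categories differ in shape, the sketch below models each one with plain Python dictionaries. This is not how any of the listed products store data. The sample values echo the slide's example tables, but the row key `row1`, the collection name `people`, the edge directions and the helper `neighbours` are assumptions added for illustration.

```python
# Key-value: one opaque value per key, looked up only by that key.
key_value = {"User001": "Peter", "User002": "Paul"}

# Column (BigTable style): row key -> column -> {timestamp: value}.
# "row1" is a hypothetical row key.
column_store = {
    "row1": {"Type": {12: "Zebra"}, "Size": {12: "Medium"}},
}

# Document: collection -> document id -> fields (Collection > Document > Field).
# "people" is a hypothetical collection name.
documents = {
    "people": {
        "Doc001": {"Name": "Paul", "Age": 30},
        "Doc002": {"Name": "Jacques", "Age": 35},
    }
}

# Graph: nodes carry properties, edges connect two node ids.
# The edge directions are assumed for this sketch.
nodes = {"X": {"Name": "John", "Age": 30}, "Y": {"Name": "Rick"}}
edges = {"E1": ("X", "Y"), "E2": ("Y", "X")}


def neighbours(node_id: str) -> list[str]:
    """Return the ids of nodes reachable over one outgoing edge."""
    return sorted(target for source, target in edges.values() if source == node_id)


if __name__ == "__main__":
    print(key_value["User001"])
    print(column_store["row1"]["Type"][12])
    print(documents["people"]["Doc002"]["Age"])
    print(neighbours("X"))
```

The nesting depth is the point of the comparison: a key-value lookup needs one key, a column lookup needs row, column and timestamp, a document lookup walks named fields, and a graph query follows edges rather than keys.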
  • 17. NEW SQL
    - Relational database with horizontal scalability
    - MySQL Ecosystem
      - Distributed database with MySQL compliance: Cubrid
      - Analytic database: InfiniDB
      - In-Memory database with MySQL compliance: VoltDB
  • 18. BIG DATA ARCHITECTURE OVERVIEW
    - Users: Administrator, Engineers, Analysts, Business Users, Data Scientists, Mobile Clients
    - Functions: Development, Data Management, Data Modeling, BI / Analytics, Activity Reporting, Data Quality, Master Data Management, Mobile Apps, Data Analysis & Visualization
    - NoSQL: unstructured and structured Data Warehouse, MPP, NoSQL Engine, Distributed File Systems, Shared-Nothing Architecture, Algorithms
    - SQL: structured Data Warehouse and OLAP Cubes, MPP, In-Memory, Column Database, SQL Engine, Shared-Nothing Architecture
    - Data Transfer, Data Integration
    - Data sources: Files, Web Data, RDBMS
  • 19. HDFS & MAPREDUCE
    - Hadoop Distributed File System
      - A scalable, fault-tolerant, high-performance distributed file system
      - Asynchronous replication
      - Write-once and read-many (WORM)
      - Hadoop cluster with 3 DataNodes minimum
      - Data divided into blocks, each block replicated 3 times (default)
      - No RAID required for DataNode
      - Interfaces: Java, Thrift, C Library, FUSE, WebDAV, HTTP, FTP
      - NameNode holds filesystem metadata
      - Files are broken up and spread over the DataNodes
    - Hadoop MapReduce
      - Software framework for distributed computation
      - Input | Map() | Copy/Sort | Reduce() | Output
      - JobTracker schedules and manages jobs
      - TaskTracker executes individual map() and reduce() tasks on each cluster node
    - Diagram: Clients, Master Node, Worker Nodes
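The Input | Map() | Copy/Sort | Reduce() | Output pipeline can be sketched in a single process with a word count. This is a minimal illustration of the data flow only, assuming everything fits in memory; it does not use the Hadoop API, and the function names are made up for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Map(): emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Copy/Sort: group emitted values by key, with keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce(): collapse all values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Run the whole pipeline over a list of input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs))


counts = run_job(["big data", "Big Data with Not Only SQL"])
```

On a real cluster the map and reduce calls run as separate tasks on the worker nodes, and the shuffle step moves data between them over the network.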
  • 20. HBASE
    - Clone of Big Table (Google)
    - Implemented in Java (Clients: Java, C++, Ruby...)
    - Data is stored "column-oriented"
    - Distributed over many servers
    - Tolerant of machine failure
    - Layered over HDFS
    - Strong consistency
    - It's not a relational database (no joins)
    - Sparse data: nulls are stored for free
    - Semi-structured or unstructured data
    - Data changes through time
    - Versioned data
    - Scalable: goal of billions of rows x millions of columns
    - Example table (figure): row keys Enclosure1 and Enclosure2; column family Animal with columns Type and Size; column family Repair with column Cost; timestamped cells (12: Zebra, Medium, 1000€), (11: Lion, Big), (13: Monkey, Small, 1500€); labels Table, Region, Row Key, Timestamp, Family, Column, Cell
    - (Table, Row_Key, Family, Column, Timestamp) = Cell (Value)
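The addressing rule (Table, Row_Key, Family, Column, Timestamp) = Cell (Value) can be mimicked with a small in-memory table. This is a toy sketch under the assumption that one Python object stands for one table; it is not the HBase client API, and the class name and sample values are illustrative.

```python
from collections import defaultdict


class VersionedTable:
    """Toy in-memory table: (row, family, column) -> {timestamp: value}."""

    def __init__(self):
        self._cells = defaultdict(dict)

    def put(self, row, family, column, timestamp, value):
        """Store one version of a cell; older versions are kept."""
        self._cells[(row, family, column)][timestamp] = value

    def get(self, row, family, column, timestamp=None):
        """Return the newest value at or before timestamp, or None."""
        versions = self._cells.get((row, family, column), {})
        eligible = [ts for ts in versions if timestamp is None or ts <= timestamp]
        return versions[max(eligible)] if eligible else None


table = VersionedTable()
table.put("Enclosure1", "Animal", "Type", 11, "Lion")
table.put("Enclosure1", "Animal", "Type", 12, "Zebra")
```

A cell that was never written simply has no entry, which is why sparse data costs nothing to store in this layout.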
  • 21. HBASE
    - Table
      - Regions for scalability, defined by row range [start-key, end-key)
      - Store for efficiency, 1 per Family
      - 1..n StoreFiles (HFile format on HDFS)
    - Everything is stored as bytes
    - Rows are ordered sequentially by key
    - Special tables -ROOT- and .META.
      - Tell clients where to find user data
    - http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
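Because rows are ordered by key and each region covers a half-open range [start-key, end-key), finding the region for a row reduces to a binary search over the region start keys. The sketch below illustrates that idea only; the class, the split keys and the server names are hypothetical and do not reflect how the -ROOT- and .META. tables are actually laid out.

```python
import bisect


class RegionMap:
    """Maps a row key to the region whose [start_key, end_key) contains it."""

    def __init__(self, split_keys, servers):
        # n split keys define n + 1 regions; the first region starts at b"".
        if len(servers) != len(split_keys) + 1:
            raise ValueError("need exactly one server per region")
        self.starts = [b""] + sorted(split_keys)
        self.servers = servers

    def locate(self, row_key):
        """Return the server holding the region that contains row_key."""
        # Index of the last start key that is <= row_key.
        index = bisect.bisect_right(self.starts, row_key) - 1
        return self.servers[index]


regions = RegionMap([b"g", b"p"], ["server-a", "server-b", "server-c"])
```

Keys are bytes here to match the slide's point that everything is stored as bytes; byte-wise comparison gives the sort order.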
  • 22. HADOOP INFRASTRUCTURE
    - Network Switches
    - 2 x Apps Server: 2 CPU 6 core, 96 GB RAM, 6 x HDD 600GB 15K Raid1
    - 2 x NameNode/BackupNode/Admin: 2 CPU 6 core, 96 GB RAM, 6 x HDD 600GB 15K Raid1
    - 3 to n x DataNode: 2 CPU 6 core, 48 GB RAM, 12 x HDD
  • 23. MOGILEFS OVERVIEW
    - A scalable, fault-tolerant, high-performance distributed file system
    - Asynchronous replication
    - No Single Point of Failure
    - Automatic file replication (3 replications recommended)
    - Better than RAID
    - Flat namespace
    - Share-Nothing
    - No RAID required
    - Local filesystem agnostic
    - Tracker (mogilefsd): client transfer, replication, deletion, query, reaper, monitor
    - DBNode: MySQL stores the MogileFS metadata (the namespace, and which files are where)
    - Storage Node (mogstored): HTTP and WebDAV server; files are broken up and spread over the Storage Nodes
    - Client Library: Ruby, Perl, Java, Python, PHP…
    - Diagram: Clients; Trackers, DBNodes and Storage Nodes spread over Host1 to Host6
  • 24. MOGILEFS ARCHITECTURE
    - Diagram: Client Library, Trackers, Database, Storage Nodes
  • 25. MOGILEFS INFRASTRUCTURE
    - Network Switches
    - 2 x Apps Server: 2 CPU 6 core, 48 GB RAM, 6 x HDD 600GB 15K Raid1
    - 2 x DB Node + 2 to n x Tracker: 2 CPU 6 core, 32 GB RAM, 6 x HDD 600GB 15K Raid1
    - 3 to n x Storage Node: 2 CPU 6 core, 32 GB RAM, 12 x HDD
  • 26. GLUSTERFS OVERVIEW
    - A scalable, fault-tolerant, high-performance distributed and replicated file system
    - No Single Point of Failure
    - Synchronous replication of volumes across storage servers
    - Asynchronous replication across geographically distributed clusters
    - Easily accessible usage quotas
    - No Meta-Data Server (fully distributed architecture - Elastic Hash)
    - Distributed / Distributed Replicated / Distributed Striped
    - POSIX compliant
    - FUSE (Standard)
    - GlusterFS native, NFS, CIFS, HTTP, FTP, WebDAV, ZFS, EXT4…
    - No proprietary format to store files on disk
    - NameSpace: the unified global namespace aggregates disk and memory resources into a single pool, virtualizing the underlying hardware
    - Data Store: data is stored in logical volumes that are abstracted from the hardware and logically partitioned from each other
    - Development: API, Command Line Interface, Python, Ruby, PHP languages
    - Diagram: Clients; GlusterFS Servers on Host1 to Host6
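The "no metadata server" point rests on every client being able to compute a file's location from its name. The sketch below shows that general idea with equal hash ranges per brick. It is a simplification under stated assumptions: CRC32 stands in for the real hash function, the brick names are hypothetical, and GlusterFS's actual elastic hashing assigns ranges per directory.

```python
import zlib

HASH_SPACE = 2 ** 32  # CRC32 yields values in [0, 2**32)


def brick_for(filename, bricks):
    """Pick a brick by hashing the file name into equal hash ranges."""
    file_hash = zlib.crc32(filename.encode("utf-8"))
    range_size = HASH_SPACE // len(bricks)
    # Clamp so hashes in the leftover tail of the space map to the last brick.
    return bricks[min(file_hash // range_size, len(bricks) - 1)]


bricks = ["host1:/brick", "host2:/brick", "host3:/brick"]
location = brick_for("report.csv", bricks)
```

Any client running the same function with the same brick list reaches the same answer, so no central lookup is needed.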
  • 28. GLUSTERFS INFRASTRUCTURE
    - Network Switches
    - 2 x Apps Server: 2 CPU 6 core, 48 GB RAM, 6 x HDD 600GB 15K Raid1
    - 2 x Backup Node / Admin: 2 CPU 6 core, 32 GB RAM, 6 x HDD 600GB 15K Raid1
    - 3 to n x GlusterFS Server: 2 CPU 6 core, 32 GB RAM, 12 x HDD
  • 29. MOOSEFS OVERVIEW
    - A scalable, fault-tolerant, high-performance distributed and replicated file system
    - Spreads data over several physical servers which are visible to the user as one resource
    - No Single Point of Failure
    - Distribution of data across data servers via chunks
    - Maximum chunk size = 64MB
    - File duplication (1 to 3 and more if necessary)
    - POSIX compliant
    - FUSE Interface
    - No proprietary format to store files on disk
    - Master Server: a single machine managing the whole filesystem and storing metadata for every file (size, attributes and file locations, including all information about non-regular files, i.e. directories, sockets, pipes and devices). Metadata is stored in memory
    - Metalogger Server: any number of servers, all of which store metadata changelogs and periodically download the main metadata file, so that one of them can be promoted to the managing server role when the primary master stops working
    - Data Server: any number of commodity servers storing file data and synchronizing it among themselves
    - Diagram: Clients; Master Server, Metalogger Server and Data Servers spread over Host1 to Host6
  • 30. MOOSEFS READ PROCESS
    - 1. Where is the data?
    - 2. The data is on x chunk servers
    - 3. Send me the data
    - 4. The data
    - http://www.moosefs.org/
  • 31. MOOSEFS WRITE PROCESS
    - 1. Where to write the data?
    - 2. Create new chunk on x chunk server
    - 3. Success
    - 4. Write the data
    - 5. Synchronize the data
    - 6. Success
    - 7. Success
    - 8. Send write session end signal
    - http://www.moosefs.org/
  • 32. MOOSEFS INFRASTRUCTURE
    - Network Switches
    - 2 x Apps Server: 2 CPU 6 core, 48 GB RAM, 6 x HDD 600GB 15K Raid1
    - 2 x Master / Metalogger / Admin Server: 2 CPU 6 core, 96 GB RAM, 6 x HDD 600GB 15K Raid1
    - 3 to n x Data Server: 2 CPU 6 core, 32 GB RAM, 12 x HDD
  • 33. CASSANDRA OVERVIEW
    - Every node plays the same role
    - Highly available
    - Really fast reads, really fast writes
    - Flexible schemas
    - Distributed, replicated
    - No master, no slaves
    - No Single Point of Failure
    - Client can talk to any node
    - Written in Java
    - Diagram layers: Cassandra API, Tools, Storage Layer, Partitioner, Replicator, Failure Detector, Cluster Membership, Messaging Layer
  • 34. CASSANDRA – COLUMN-ORIENTED
    - Column Family
      - Think of it as a DB table
    - Column (+Name, +Value, +Timestamp)
      - Key-value pair (not just a value, like a DB column)
      - Timestamp
    - SuperColumn
      - Columns inside a column
      - The values are columns
      - No timestamp
    - Keyspace: like a namespace, generally 1 per app
    - Indexes
    - Queries
    - Diagram: Key, SuperColumn, Column
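The keyspace, column family and column hierarchy can be modelled with a few small classes. This is a sketch of the data shapes only, not the Cassandra API. The "newest timestamp wins" rule in insert() reflects how Cassandra uses column timestamps to resolve conflicting writes; the class names, the "Users" family and the sample values are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Column:
    """A name/value pair stamped with the time it was written."""
    name: str
    value: str
    timestamp: int


@dataclass
class ColumnFamily:
    """Roughly a table: row key -> {column name -> Column}."""
    rows: dict = field(default_factory=dict)

    def insert(self, key, column):
        """Store a column, keeping whichever version has the newest timestamp."""
        row = self.rows.setdefault(key, {})
        current = row.get(column.name)
        if current is None or column.timestamp >= current.timestamp:
            row[column.name] = column


# Keyspace: a namespace holding the column families of one application.
keyspace = {"Users": ColumnFamily()}
users = keyspace["Users"]
users.insert("user001", Column("name", "Peter", timestamp=1))
users.insert("user001", Column("name", "Pete", timestamp=2))
users.insert("user001", Column("name", "stale", timestamp=0))
```

A SuperColumn would add one more level of nesting: its value is itself a dictionary of columns, and it carries no timestamp of its own.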
  • 35. CASSANDRA INFRASTRUCTURE
    - Network Switches
    - Cassandra Nodes: 2 CPU 6 core, 32 GB RAM, 12 x HDD Raid0
  • 36. MONGODB OVERVIEW
    - Document-oriented database with high performance, scalability and availability
    - Supports MapReduce
    - Shard: holds a portion of the total data. Reads and writes are automatically routed to the appropriate shard(s). Each shard is backed by a replica set, which holds only the data for that shard
    - Replica set: one or more servers, each holding copies of the same data. At any given time one is primary and the rest are secondaries. If the primary goes down, one of the secondaries takes over automatically as primary. All writes and consistent reads go to the primary, and all eventually consistent reads are distributed amongst the secondaries. A replica set is an asynchronous cluster replication technology
    - Config: multiple config servers, each holding a copy of the metadata indicating which data lives on which shard
    - Router: one or more routers, each acting as a server for one or more clients. Clients issue queries/updates to a router, and the router routes them to the appropriate shard while consulting the config servers
    - Client: one or more clients, each of which is (part of) the user's application and issues commands to a router via the mongo client library (driver) for its language
    - Diagram: Clients; mongos servers (Router), mongod servers (Config), mongod servers (Shard)
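The replica set behaviour described above (writes to the primary, asynchronous copies, possibly stale reads from secondaries, automatic takeover) can be sketched with a toy class. This is a simplified illustration, not the MongoDB driver API: real replica sets elect a new primary by voting, whereas this sketch just promotes the first healthy member by name, and the member names are hypothetical.

```python
class ReplicaSet:
    """Toy replica set: one primary, asynchronous copies on secondaries."""

    def __init__(self, names):
        self.data = {name: {} for name in names}
        self.healthy = set(names)
        self.primary = names[0]
        self.pending = []  # writes not yet copied to the secondaries

    def write(self, key, value):
        """All writes go to the primary and are queued for replication."""
        self.data[self.primary][key] = value
        self.pending.append((key, value))

    def replicate(self):
        """Apply the queued writes to every healthy secondary."""
        for name in self.healthy - {self.primary}:
            self.data[name].update(self.pending)
        self.pending.clear()

    def read(self, key, member=None):
        """Read from the primary by default, or from a named member."""
        return self.data[member or self.primary].get(key)

    def fail_primary(self):
        """Mark the primary as down and promote a healthy secondary."""
        self.healthy.discard(self.primary)
        self.primary = sorted(self.healthy)[0]


rs = ReplicaSet(["node-a", "node-b", "node-c"])
rs.write("user", "Paul")
stale = rs.read("user", member="node-b")  # secondary has not caught up yet
rs.replicate()
fresh = rs.read("user", member="node-b")
```

The gap between `stale` and `fresh` is what "eventually consistent reads" means in practice: a secondary may lag until replication catches up.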
  • 38. MONGODB INFRASTRUCTURE
    - Network Switches
    - 1 to n Router servers: 2 CPU 6 core, 96 GB RAM, 6 x HDD 600GB 15K Raid10
    - 1 to n Config servers: 2 CPU 6 core, 96 GB RAM, 6 x HDD 600GB 15K Raid10
    - 1 to n Shard servers: 2 CPU 6 core, 48 GB RAM, 12 x HDD 1TB 7.2K
  • 39. COUCHDB OVERVIEW
    - Open source distributed database
    - RESTful API
    - Schema-less document store (documents in JSON format)
    - Multi-Version Concurrency Control model
    - User-defined queries structured as map/reduce
    - Incremental index update mechanism
    - Multi-master replication model
    - Written in Erlang
    - Supports MapReduce
    - Easy-to-use data storage
    - Easy to integrate with web applications: JavaScript, JSON
    - Scalability for large web applications: incremental replication, bi-directional conflict detection and management
    - Query-able and index-able
    - Offline by default
    - Replication modes
      - Master → Slave replication
      - Master ↔ Master replication
      - Filtered replication
      - Incremental and bi-directional replication
      - Conflict management
    - Diagram: Clients; two groups of CouchDB servers, each with one Master and two Slaves
  • 40. COUCHDB FUNCTIONALITIES
    - Document storage: a CouchDB server hosts named databases, which store documents
    - ACID properties: CouchDB never overwrites committed data or associated structures, ensuring the database file is always in a consistent state
    - Compaction: on schedule, or when the database file exceeds a certain amount of wasted space, the compaction process clones all the active data to a new file and then discards the old file
    - Views (Model, Function, Index)
      - View model: views are the method of aggregating and reporting on the documents in a database; they are built on demand to aggregate, join and report on database documents
      - View function: takes a CouchDB document as an argument and does whatever computation it needs to determine the data, if any, to be made available through the view. It can add multiple rows to the view based on a single document, or no rows at all
      - View index: a dynamic representation of the actual document contents of a database. Generating a view of a database with hundreds of thousands or millions of documents is time- and resource-consuming, so it is not something the system should do from scratch each time
    - Security: to control who can read and update documents, CouchDB has a simple reader access and update validation model that can be extended to implement custom security models
    - Distributed update and replication: CouchDB is a peer-based distributed database system; it allows users and servers to access and update the same shared data while disconnected and then bi-directionally replicate those changes later
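The view function and view index ideas can be sketched as follows: a map function emits zero or more rows per document, and the index recomputes rows only for the document that changed. This is a Python illustration of the concept under stated assumptions; CouchDB view functions are normally written in JavaScript, and the class, the "tags" field and the document ids here are hypothetical.

```python
def map_by_tag(doc):
    """View function: emit zero or more (key, value) rows for one document."""
    for tag in doc.get("tags", []):
        yield tag, 1


class ViewIndex:
    """Keeps rows per document so one change never forces a full rebuild."""

    def __init__(self, map_fn):
        self.map_fn = map_fn
        self.rows_by_doc = {}

    def update(self, doc_id, doc):
        """Incremental update: recompute rows for the changed document only."""
        self.rows_by_doc[doc_id] = list(self.map_fn(doc))

    def query(self, key):
        """Reduce step: sum the values emitted under one key."""
        return sum(
            value
            for rows in self.rows_by_doc.values()
            for row_key, value in rows
            if row_key == key
        )


index = ViewIndex(map_by_tag)
index.update("doc1", {"tags": ["nosql", "couchdb"]})
index.update("doc2", {"tags": ["nosql"]})
index.update("doc3", {"title": "no tags, so no rows"})
```

Calling `index.update("doc2", {"tags": []})` afterwards touches only the rows of doc2, which is the point of keeping the index incremental.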
  • 41. COUCHDB INFRASTRUCTURE
    - Network Switches
    - 1 to n Router servers: 2 CPU 6 core, 96 GB RAM, 6 x HDD 600GB 15K Raid10
    - 1 to n Master servers: 2 CPU 6 core, 96 GB RAM, 6 x HDD 600GB 15K Raid10
    - 1 to n Slave servers: 2 CPU 6 core, 48 GB RAM, 12 x HDD 1TB 7.2K