10 Reasons to Start Your Analytics Project with PostgreSQL

10 Reasons To Start
Your Analytics Project
with PostgreSQL
Satoshi Nagayasu
@snaga
HKOSCon 2016

Agenda
• Collecting Data / Database Federation
• Building Data Warehouse and Data Mart
• Writing Queries / SQL Features
• Performance
• In-Database Analytics

Collecting Data / Database
Federation
Foreign Data Wrapper
Unlogged Table

Foreign Data Wrapper
• Connects external data sources (RDBMS, NoSQL, files,
etc) to the PostgreSQL executor.
• Allows SELECT/INSERT/UPDATE/DELETE operations
for external tables.
PostgreSQL
Oracle
MySQL
HDFS
https://wiki.postgresql.org/wiki/Foreign_data_wrappers

Unlogged Table
• Does not record XLOG.
• Has better performance compared to regular
table.
• Will be truncated after crash
recovery.
http://pgsnaga.blogspot.jp/2011/10/data-loading-into-unlogged-tables-and.html

Building Data Warehouse
and Data Mart
Materialized Views
Transactional DDLs

Materialized View
• Defines a view with caching records.
• Allows to avoid running complicated queries and
aggregations every time.
• Requires updating cache by the users.
Table
View
Table Table
Materialized
View
Table
Query Query
Cache

Transactional DDLs
• Most of DDLs can be performed in transaction in
PostgreSQL.
• Schema can be modified with keeping atomicity
even online. (commit or rollback)
• Transactional DDLs would help DBAs manage
their schema easier.

Writing Queries / SQL
Features
Rich SQL features
Compatibility with SQL standard

Writing Queries / SQL Features
• Rich SQL features
– Subqueries
– WITH clauses (Common Table Expressions, CTEs)
– Many aggregation functions
– Window functions
• JSON support
• Compatibility with the SQL standard

WITH clause
• Defines a temporary table for a query.
• May make a better performance compared to
using the same subquery more than once.
WITH foo AS (
SELECT ... FROM ... GROUP BY ...
)
SELECT ... FROM foo WHERE ...
UNION ALL
SELECT ... FROM foo WHERE ...;
https://www.postgresql.org/docs/9.5/static/queries-with.html

Many Aggregations
• New in 9.4
– percentile_cont()
– percentile_disc()
– mode()
– rank()
– dense_rank()
– percent_rank()
– cume_dist()
• New in 9.5
– ROLLUP()
– CUBE()
– GROUPING SETS()
https://www.postgresql.org/docs/9.5/static/functions-aggregate.html

ROLLUP
• Calculates total/subtotal values

CUBE
• Calculates for all combinations of the
specified columns

GROUPING SETS
• Runs multiple GROUP BY queries at once
Two GROUP BYs
at once.

JSON data type
testdb=# create table t1 ( j jsonb );
CREATE TABLE
testdb=# insert into t1 values ('{ "key1": "value1", "key2":
"value2" }');
INSERT 0 1
testdb=# select * from t1;
j
--------------------------------------
{"key1": "value1", "key2": "value2"}
(1 row)
testdb=# select j->>'key2' key2 from t1;
key2
--------
value2
(1 row)

JSON data type
testdb=# select n_nationkey,n_name from nation where
n_nationkey = 12;
n_nationkey | n_name
-------------+---------------------------
12 | JAPAN
(1 row)
testdb=# select jsonb_build_object('n_nationkey', n_nationkey,
'n_name', n_name) from nation where n_nationkey = 12;
jsonb_build_object
------------------------------------------------------------
{"n_name": "JAPAN ", "n_nationkey": 12}
(1 row)

JSON data type
Operator Description
9.4
-> Get an element by key as a JSON object
->> Get an element by key as a text object
#> Get an element by path as a JSON object
#>> Get an element by path as a text object
<@, @> Evaluate whether a JSON object contains a key/value pair
? Evaluate whether a JSON object contains a key or a value
?| Evaluate whether a JSON object contains ANY of keys or values
?& Evaluate whether a JSON object contains ALL of keys or values
9.5
|| Insert or Update an element to a JSON object
- Delete an element by key from a JSON object
#- Delete an element by path from a JSON object
http://www.postgresql.org/docs/9.5/static/functions-json.html

JSON data type
• Allows to collect data without defining schema.
• “Schema-less”, “Schema on Read” or “Schema-
later”.
• Still accessible with SQL.
JSON
Data Type
Fluentd
pg-Json plugin
View
(Schema)
App
App
Fluentd

Performance
3 types of Join
Full text search (n-gram)
Table Partition
BRIN Index
Table Sample
Parallel Queries

3 types of Join
• Nested Loop (NL) Join
– Works good when joining small number of records
between tables with indexes.
• Merge Join
• Hash Join
– Works better than NL when joining large number of
records between large tables.

Full-text search (n-gram)
• Splits a text into N-char tokens and build an index.
– Pg_trgm: Tri-gram (3-char)
– Pg_bigm: Bi-gram (2-char)
• CJK has lots of 2-char words, so Bi-gram may be
useful rather than Tri-gram.
– CJK: Chinese, Japanese and Korean.
Pg_trgm: https://www.postgresql.org/docs/9.5/static/pgtrgm.html
Pg_bigm: http://pgbigm.osdn.jp/index_en.html

Pg_bigm performance
• Wikipedia title data (2,789,266 records)
– https://dumps.wikimedia.org/zhwiki/20160601/
– zhwiki-20160601-pages-articles-multistream-index.txt.bz2
zhwikidb=> select * from zhwiki_index where title like '%香港%';
id1 | id2 | title
----------+-------+----------------------------------------
5693863 | 2087 | 香港特別行政區基本法第二十三條
11393231 | 4323 | 香港特别行政区
12830042 | 5085 | 香港大学列表
14349335 | 6088 | 香港行政区划
14349335 | 6090 | 香港行政區劃
14349335 | 6091 | 香港十八区
14349335 | 6092 | 香港十八區
16084672 | 7168 | 香港兒童文學作家
18110426 | 8206 | 北區 (香港)
18110426 | 8236 | 東區 (香港)
19537078 | 9528 | 香港專業教育學院
19537078 | 9567 | 香港中文大學

Pg_bigm performance
Aggregate (actual time=481.512..481.541 rows=1 loops=1)
-> Seq Scan on zhwiki_index (actual time=1.458..478.326 rows=317 loops=1)
Filter: (title ~~ '%香港電影%'::text)
Rows Removed by Filter: 2788949
Planning time: 0.125 ms
Execution time: 481.654 ms
(6 rows)
select count(*) from zhwiki_index
where title like '%香港電影%';

Pg_bigm performance
Aggregate (actual time=1.790..1.792 rows=1 loops=1)
-> Bitmap Heap Scan on zhwiki_index (actual time=0.299..1.225 rows=317
loops=1)
Recheck Cond: (title ~~ '%香港電影%'::text)
Rows Removed by Index Recheck: 1
Heap Blocks: exact=191
-> Bitmap Index Scan on zhwiki_index_title_idx (actual
time=0.258..0.258 rows=318 loops=1)
Index Cond: (title ~~ '%香港電影%'::text)
Planning time: 0.103 ms
Execution time: 1.833 ms
(9 rows)
select count(*) from zhwiki_index
where title like '%香港電影%';
481.6ms → 1.8ms.
200x faster than a regular LIKE.

Table Partition
• Table Partitioning by Range or List
– Called “Constraint Exclusion”
• Does not scan unnecessary partitions
– Determined by the “constraints”.
• Is able to eliminate “full table scan” for large tables
entirely.
https://www.postgresql.org/docs/9.5/static/ddl-partitioning.html

BRIN Index
• Block Range INdex (New in 9.5)
– Holds "summary“ data, instead of raw data.
– Reduces index size tremendously.
– Also reduces creation/maintenance cost.
– Needs extra tuple fetch to get the exact record.
0
50,000
100,000
150,000
200,000
250,000
300,000
Btree BRIN
Elapsedtime(ms)
Index Creation
0
50,000
100,000
150,000
200,000
250,000
300,000
Btree BRIN
NumberofBlocks
Index Size
0
2
4
6
8
10
12
14
16
18
Btree BRIN
Elapsedtime(ms)
Select 1 record
https://gist.github.com/snaga/82173bd49749ccf0fa6c

BRIN Index
• Structure of BRIN Index
Table File
Block Range 1 (128 Blocks)
Block Range 2
Block Range 3
Block
Range
Min. Value Max. Value
1 1992-01-02 1992-01-28
2 1992-01-27 1992-02-08
3 1992-02-08 1992-02-16
… … …
Holds only min/max values
for “Block Ranges”,
128 blocks each.
(in case a date
column)

TABLESAMPLE
• Allows to get approximate results for aggregations by
sampling.
• BERNOULLI
– Accurate
– Sample by Tuple
• SYSTEM
– Performance
– Sample by Block
http://blog.2ndquadrant.com/tablesample-in-postgresql-9-5-2/

TABLESAMPLE
• Calculating the average of total price.
– The actual value and the approximate ones

TABLESAMPLE
Without TABLESAMPLE
1787ms
SYSTEM Sampl.
22ms
BERNOULLI Sampl.
405ms

Parallel Queries
• The leader process cooperates with those worker
processes for:
– Sequential scan
– Joins (Nested Loop & Hash)
– Aggregations
• Will be shipped with 9.6
– 9.6 is beta2 as of today
Leader
Worker Worker
Client
Data
Read &
Examine
Query
Result
Launch & Gather

Parallel Aggregation
Performance & Scalability
• count(*) on 30M rows
– Shows a good parallel scalability

In-Database Analytics
User Defined Functions
Apache MADlib

• In-Database Analytics?
– Performs analytics workload in the database
without pulling the data out of the server.
• Advantages of In-Database Analytics
– No need to move “BigData” between server and
client for analytics.
– Higher performance hardware resources (CPU,
memory, storage) compared to client PCs.

• User defined functions
– PL/Python, PL/R, PL/v8, ... or C lang.
– Allow you to run (almost) any logics within the
database.
• Apache MADlib
– Machine Learning Library for PostgreSQL

UDF by Python
CREATE OR REPLACE FUNCTION dumpenv(OUT text, OUT text)
RETURNS SETOF record
AS $$
import os
for e in os.environ:
plpy.notice(str(e) + ": " + os.environ[e])
yield(e, os.environ[e])
$$ LANGUAGE plpythonu;

UDF by Python
CREATE OR REPLACE FUNCTION dumpenv(OUT text, OUT text)
RETURNS SETOF record
AS $$
import os
for e in os.environ:
plpy.notice(str(e) + ": " + os.environ[e])
yield(e, os.environ[e])
$$ LANGUAGE plpythonu;
testdb=# select * from dumpenv() order by 1 limit 10;
column1 | column2
--------------------+-----------------------
G_BROKEN_FILENAMES | 1
HISTCONTROL | ignoredups
HISTSIZE | 1000
HOME | /home/snaga
HOSTNAME | localhost.localdomain
LANG | ja_JP.UTF-8
LC_COLLATE | C
LC_CTYPE | C
LC_MESSAGES | C
LC_MONETARY | C
(10 rows)

Apache MADlib
• An Open Source Machine Learning Library
– Can run in PostgreSQL, Greenplum Database and
Apache HAWQ.
– Supports many ML algorithms.
http://madlib.incubator.apache.org/

Others
Strict type checking and
constraints.
Industry Standard Interface (for
BI tools)

Others
• Strict type checking and constraints.
– Avoid “Garbage in, garbage out.”
• Industry Standard Interface (for BI tools)
– ODBC, JDBC

Summary
• PostgreSQL has already had lots of features that
help your analytics project
– In terms of productivity and performance.
• And more “BigData” features are coming in the
future release.
– Parallel query must be a big-shot.
• Let’s start your analytic project with PostgreSQL
and join our community. 
– PostgreSQL 9.6 beta2 is available now!

Resources
• http://www.postgresql.org
• http://wiki.postgresql.org
• http://planet.postgresql.org
• http://pgcon.org

pgDay Asia 2016
• pgDay Asia 2016 / FOSSASIA 2016
– March 17-19 in Singapore
• Speakers:
– 19+ speakers from 9 countries
• Sessions:
– 19 Regular Sessions.
– Plus, lightning talks
• Attendees:
– Around 100 attendees

pgDay Asia 2017
• FOSSASIA 2017 (March, 2017)
– Probably, the same format, in the same season, in the
same region.
• Do not miss the next one!
– Will be better and bigger. 
• Join us at:
– http://pgday.asia
– https://www.facebook.com/pgdayasia

10 Reasons to Start Your Analytics Project with PostgreSQL

Recomendados

Recomendados

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

Destacado

Destacado (6)

Similar a 10 Reasons to Start Your Analytics Project with PostgreSQL

Similar a 10 Reasons to Start Your Analytics Project with PostgreSQL (20)

Más de Satoshi Nagayasu

Más de Satoshi Nagayasu (20)

Último

Último (20)

10 Reasons to Start Your Analytics Project with PostgreSQL