Makoto Yui, Jun Miyazaki, Shunsuke Uemura and Hayato Yamana: "Nb-GCLOCK: A Non-blocking Buffer Management based on the Generalized CLOCK", In Proc. ICDE, March 2010.
1. Nb-GCLOCK:
A Non-blocking Buffer Management
based on the Generalized CLOCK
Makoto YUI 1, Jun MIYAZAKI 2, Shunsuke UEMURA 3 and Hayato YAMANA 4
1. Research Fellow, JSPS (Japan Society for the Promotion of Science) / Visiting Postdoc at Waseda University, Japan and CWI, Netherlands
2. Nara Institute of Science and Technology
3. Nara Sangyo University
4. Waseda University / National Institute of Informatics
4. Background – Recent trends in CPU development
The # of CPU cores in a chip is doubling in two-year cycles.
[Chart: CPU core counts from 1990 to the 2000s: Single-Core CPUs (Pentium, Power4), Multi-Core CPUs (Core2, Nehalem), Many-Core CPUs (UltraSparc T2, Azul Vega, Larrabee?).]
The many-core era is coming:
- Niagara T2 – 8 cores x 8 SMT = 64 processors
- Azul Vega3 – 54 cores x 16 chips = 864 processors
9. Background – CPU Scalability of open source DBs
Open source DBs have faced CPU scalability problems.
Ryan Johnson et al.: "Shore-MT: A Scalable Storage Manager for the Multicore Era", In Proc. EDBT, 2009.
[Chart: throughput (normalized) vs. concurrent threads (1, 4, 8, 12, 16, 24, 32) for PostgreSQL, MySQL, and BDB; microbenchmark on UltraSparc T1 (32 procs). The gain after 16 threads is less than 5%.]
You might think… what about TPC-C?
14. CPU scalability of PostgreSQL
TPC-C benchmark result on a high-end Linux machine of Unisys (Xeon-SMP 32 CPUs, 16 GB memory, EMC RAID10 storage).
Doug Tolbert, David Strong, Johney Tsai (Unisys): "Scaling PostgreSQL on SMP Architectures", PGCon 2007.
[Chart: TPS vs. CPU cores for PostgreSQL versions 8.0, 8.1, and 8.2. The gain after 16 CPU cores is less than 5%.]
Q. What did the PostgreSQL community do?
A. They revised the synchronization mechanisms in the buffer management module.
19. Synchronization in Buffer Management Module
Several empirical studies have revealed that the largest bottleneck is synchronization in the buffer management module.
[1] Ryan Johnson, Ippokratis Pandis, Anastassia Ailamaki: "Critical Sections: Re-emerging Scalability Concerns for Database Storage Engines", In Proc. DaMoN, 2008.
[2] Stavros Harizopoulos, Daniel J. Abadi, Samuel Madden, and Michael Stonebraker: "OLTP Through the Looking Glass, and What We Found There", In Proc. SIGMOD, 2008.
[Diagram: the buffer manager sits between the CPU and the database files on disk, reducing disk access by caching database pages in memory. A page request first (1) looks up a hash table; on a hit the cached page is returned, and on a miss (2) the page replacement algorithm picks a frame and the page is read from the database files.]
34. Core idea of our approach
Previous approaches:
○ Reduce disk I/Os
× Locks are contended: there are enough processors, yet disk bandwidth is not utilized
Our optimistic approach:
△ # of I/Os slightly increases
○ No contention on locks
Intuition: reduce the lock granularity to one CPU instruction and remove the bottleneck.
[Diagram: in previous approaches, page requests from the CPU funnel through a lock-guarded buffer manager in front of the database files; in our approach, requests proceed through the buffer manager concurrently.]
39. Major Difference to Previous Approaches
Previous approaches: ○ reduce disk I/Os; × locks are contended.
Our optimistic approach: △ # of I/Os slightly increases; ○ no contention on locks.
Their goal: improve buffer hit rates to reduce I/Os. This has been the single goal for many decades, but is it still valid in the many-core era? There are also SSDs.
Our goal: improve throughput by utilizing (many) CPUs. Use non-blocking synchronization instead of acquiring locks!
45. What’s non-blocking and lock-free?
Formally:
Stopping one thread will not prevent global progress.
Individual threads make progress without waiting.
Less formally:
No thread 'locks' any resource.
No 'critical sections', locks, mutexes, spin-locks, etc.
Lock-free: every successful step makes global progress and completes within finite time (ensuring liveness).
Wait-free: every step makes global progress and completes within finite time (ensuring fairness).
49. Non-blocking synchronization
A synchronization method that does not acquire any lock, enabling concurrent accesses to shared resources.
Utilizes atomic CPU primitives: CAS (compare-and-swap), cmpxchg on x86.
Utilizes memory barriers.
Blocking:
acquire_lock(lock);
counter++;
release_lock(lock);
Non-blocking:
int old;
do {
    old = *counter;
} while (!CAS(counter, old, old + 1));
The counter is incremented only if its value still equals old; otherwise the loop retries.
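As a runnable illustration (our own, not from the slides), the same counter in C11 atomics; atomic_compare_exchange_weak is the portable CAS, and atomic_fetch_add shows the wait-free alternative:

#include <stdatomic.h>

static atomic_int counter;

/* Lock-free increment: retry the CAS until no other thread slips in
 * between our read and our update (liveness: some thread's CAS
 * always succeeds). */
void increment_lockfree(void) {
    int old = atomic_load(&counter);
    while (!atomic_compare_exchange_weak(&counter, &old, old + 1)) {
        /* on failure, `old` was refreshed with the current value; retry */
    }
}

/* Wait-free increment: a single fetch-and-add always completes,
 * so every thread makes progress (fairness). */
void increment_waitfree(void) {
    atomic_fetch_add(&counter, 1);
}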
56. Making the buffer manager non-blocking
3. Need to keep consistency between the lookup hash table and GCLOCK (in the right half of fig. 3): a reference in the buffer lookup table still holds a different page identifier immediately after the page allocation of a buffer frame changes.
4. Avoided locks on I/Os by utilizing pread, CAS, and memory barriers (in fig. 5).
[Diagram: page requests go to hash buckets; hits return directly, while misses go to the page replacement algorithm (GCLOCK), which previously performed "lock; lseek; read; unlock" against the database files.]
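A minimal sketch of the optimistic-I/O idea under our own assumptions (the frame layout, the FRAME_FREE marker, and the function name are ours, not the paper's): pread is positional, so no shared file offset needs locking, and a CAS on the frame's page identifier detects a concurrent re-assignment:

#include <stdatomic.h>
#include <stdbool.h>
#include <unistd.h>

#define PAGE_SIZE 8192
#define FRAME_FREE (-1L)   /* assumed marker for an unassigned frame */

typedef struct {
    _Atomic long page_id;  /* page currently held by this frame */
    char data[PAGE_SIZE];
} frame_t;

/* Read `page_id` into a frame without holding any lock. pread does
 * not touch the shared file offset, so no lseek (and no lock around
 * it) is needed. The final CAS publishes the frame only if no other
 * thread re-assigned it during the read; otherwise the caller retries. */
bool read_page_optimistic(int fd, frame_t *f, long page_id) {
    if (pread(fd, f->data, PAGE_SIZE, (off_t)page_id * PAGE_SIZE) != PAGE_SIZE)
        return false;                  /* I/O error or short read */
    long expected = FRAME_FREE;
    return atomic_compare_exchange_strong(&f->page_id, &expected, page_id);
}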
67. State Machine-based Reasoning for selecting a replacement victim
Construct the algorithm from many small 'steps': build a state machine to ensure global progress.
[State diagram; "E:" marks an entry action. Recoverable states and transitions: "Select a frame" (the start of finding a replacement victim; on null, continue with E: try next entry) leads, when an entry is found (!null), to "Check whether evicted". If not evicted, "Try to decrement the refcount" (E: decrement the refcount, i.e., decrement the weight count of a buffer page); while --refcount > 0, E: move the clock hand advances the CLOCK hand to check the next candidate. When --refcount <= 0, "Try to evict" (E: evict) ends in Pinned (pinned) or Evicted (!pinned). Once evicted, "Check whether swapped" (E: CAS value) on success reaches "Fix in pool" and returns a replacement victim; on !swapped, the search continues.]
Two threads may race on the same candidate: Thread B can intercept the frame Thread A selected ("Oops! The candidate is intercepted."), in which case Thread A simply continues with the next candidate instead of blocking.
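A compact C11 sketch of the sweep under our own assumptions (the field names, the pin-as-claim step, and the helper are ours; the paper's state machine has more states than this loop shows):

#include <stdatomic.h>
#include <stddef.h>

typedef struct {
    _Atomic int weight;   /* GCLOCK reference weight of the cached page */
    _Atomic int pinned;   /* > 0 while some thread is using the frame */
} frame_t;

/* Sweep the clock until a frame's weight reaches zero and we win the
 * claim CAS; losing a CAS means another thread intercepted the
 * candidate, so we just move the hand on instead of blocking. */
size_t select_victim(frame_t *frames, size_t nframes,
                     _Atomic size_t *clock_hand) {
    for (;;) {
        size_t i = atomic_fetch_add(clock_hand, (size_t)1) % nframes;
        frame_t *f = &frames[i];
        if (atomic_load(&f->pinned) > 0)
            continue;                          /* in use: try next entry */
        int w = atomic_load(&f->weight);
        if (w > 0) {
            /* the only "locking" is one CAS: decrement the weight */
            atomic_compare_exchange_weak(&f->weight, &w, w - 1);
            continue;
        }
        int expected = 0;                      /* weight == 0: try to evict */
        if (atomic_compare_exchange_strong(&f->pinned, &expected, 1))
            return i;                          /* claimed: evict this frame */
        /* intercepted by another thread: keep sweeping */
    }
}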
70. Experimental settings
Workload: Zipf 80/20 distribution (a famous power law), containing 20% sequential scans; the dataset is 32 GB in total.
Machine used: UltraSPARC T2 (64 processors).
We also performed evaluations on various x86 settings in the paper.
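For intuition, a tiny C stand-in for such a hot/cold access pattern (our own simplification; the paper's generator and the scan mix are not shown here): 80% of accesses hit the hottest 20% of pages:

#include <stdlib.h>

/* Draw a page id with an 80/20 skew: 80% of draws land in the
 * hottest 20% of the page space. A rough stand-in for the Zipf-like
 * workload above (sequential scans omitted). */
long draw_page(long npages) {
    long hot = npages / 5;                 /* hottest 20% of pages */
    if (rand() % 100 < 80)                 /* 80% of accesses */
        return rand() % hot;
    return hot + rand() % (npages - hot);  /* remaining cold pages */
}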
73. Performance comparison on moderate I/Os (of fig. 9)
[Chart: throughput normalized by LRU (0.0 to 6.0) vs. processors (8, 16, 32, 64) for LRU, GCLOCK, and Nb-GCLOCK.]
CPU utilization: the previous approaches are low, about 20%; Nb-GCLOCK is high, more than 95%.
A larger difference in CPU time can be expected as the # of CPUs increases ➜ we expect even more throughput.
77. Maximum throughput to processors
Scalability to processors when pages are resident in memory, intending to see the scalability limit expected by each algorithm.
Throughput (operations/sec; plotted on a log scale):

Processors (cores) | 8 (1)     | 16 (2)    | 32 (4)     | 64 (8)
2Q                 | 890,992   | 819,975   | 866,009    | 662,782
GCLOCK             | 1,758,605 | 1,912,000 | 1,931,268  | 1,817,748
Nb-GCLOCK          | 3,409,819 | 7,331,722 | 14,245,524 | 25,834,449

Nb-GCLOCK achieved almost linear scalability, at least up to 64 processors! This is the first attempt that removed locks in buffer management.
Interesting here: GCLOCK hits a CPU scalability limit at around 16 processors, so caching solutions using GCLOCK share that limit.
82. Max throughput (operations/sec) evaluation
Workload is Zipf 80/20, evaluated on UltraSparc T2 (64 procs).
Accesses were issued from 64 threads for 60 seconds; thus, ideally 64 x 60 = 3,840 seconds of CPU time can be used.
[Chart annotations: the locking approaches use only about 10-20% of the CPU time, while Nb-GCLOCK uses most of the CPU time because it is non-blocking.]
The gap in CPU utilization would widen further as the # of processors grows, since lock contention increases.
85. TPC-C evaluation using Apache Derby
[Chart: transactions per minute (tpmC, roughly 800 to 1400) vs. # of terminals (threads): 8, 16, 32, 64, 128, comparing stock Derby and Derby with Nb-GCLOCK.]
The original scheme of Derby (CLOCK) decreased in throughput as terminals increased; our scheme showed a better result.
The throughput gain from the buffer management module is limited by a latch on the root page of the B+-tree ➜ we would also require a concurrent B+-tree (see OLFIT).
Sang Kyun Cha et al.: "Cache-Conscious Concurrency Control of Main-Memory Indexes on Shared-Memory Multiprocessor Systems", In Proc. VLDB, 2001.
89. Bp-Wrapper
Xiaoning Ding, Song Jiang, and Xiaodong Zhang: "BP-Wrapper: A System Framework Making Any Replacement Algorithms (Almost) Lock Contention Free", In Proc. ICDE, 2009.
Eliminates lock contention on buffer hits by using a batching and prefetching technique: it postpones the physical work (adjusting the buffer replacement list) and returns immediately after the logical operation. This is called lazy synchronization in the literature.
[Diagram: page requests go to hash buckets; on hits, the access is merely recorded; misses go to the page replacement algorithm (any).]
Pros:
- Works with any page replacement algorithm.
Cons:
- Does not increase the throughput of CLOCK variants, because CLOCK does not require locks on buffer hits.
- Cache misses involve batching; a larger lock holding time makes more contention.
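For concreteness, a minimal sketch of the batching idea in C (our own illustration, not BP-Wrapper's code; the names and batch size are assumptions): hits are recorded thread-locally and flushed to the replacement list under one lock acquisition:

#include <pthread.h>
#include <stddef.h>

#define BATCH 64  /* assumed batch size */

static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;
static _Thread_local long batch[BATCH];
static _Thread_local size_t nbatched;

/* Stub standing in for the real list adjustment (LRU, ARC, ...). */
static void touch_replacement_list(long page_id) { (void)page_id; }

/* Record a buffer hit without touching the shared lock; amortize
 * one lock acquisition over BATCH recorded accesses. */
void record_hit(long page_id) {
    batch[nbatched++] = page_id;
    if (nbatched < BATCH)
        return;                        /* postpone the physical work */
    pthread_mutex_lock(&list_lock);
    for (size_t i = 0; i < nbatched; i++)
        touch_replacement_list(batch[i]);
    pthread_mutex_unlock(&list_lock);
    nbatched = 0;
}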
93. Conclusions
Proposed a lock-free variant of the GCLOCK page replacement algorithm, named Nb-GCLOCK:
- Almost linear scalability up to 64 processors, while existing locking-based schemes do not scale beyond 16 processors.
- The first attempt to introduce non-blocking synchronization to database buffer management.
- Optimistic I/Os using pread, CAS, and memory barriers.
- Linearizability and lock-freedom are proven in the paper.
Lock-freedom guarantees a certain throughput: any active thread taking a bounded number of steps ensures global progress.
This work is also useful for any caching solution that requires high throughput (e.g., C10K accesses).