Modern Database Systems
Lecture 1
Aristides Gionis
Michael Mathioudakis
T.A.: Orestis Kostakis
Spring 2016
logistics
assignment will be up by Monday
(you will receive email)
due Feb 12th
if you’re not registered...
I will post material (slides and assignments) also at
http://michalis.co/moderndb/
in this lecture...
review past material
relational model and sql
storage and indexing
access cost analysis
hash index
b+ tree
relational model and SQL
what is the relational model?
tabular representation of data
why do we study it?
supports simple and intuitive querying
good for educational purposes
most widely used
definitions
relational database
a set of relations
relation
made up of a schema and an instance (example on the next slide)
schema
name of relation + name and type of each field
instance
a table with rows and columns (fields as columns)
example relation: students
schema
students(sid: integer, name: string, username: string, age: integer, gpa: real)
instance
sid    name         username  age  gpa
53666  Sam Jones    jones     22   3.4
53688  Alice Smith  smith     22   3.8
53650  Jon Edwards  jon       23   2.4
cardinality (number of rows) = 3,
degree (number of fields/columns) = 5
> can we have the same value twice in the same column?
querying
major strength of relational model
simple, intuitive, precise querying of data
the DBMS is responsible for efficient evaluation
Structured Query Language (SQL)
the standard language for relational queries
developed by IBM in the 1970s
first standardized in 1986
latest standard at the time of this lecture: SQL:2011
example!
example SQL query
sid    name         username  age  gpa
53666  Sam Jones    jones     22   3.4
53688  Alice Smith  smith     22   3.8
53650  Jon Edwards  jon       23   2.4
to find student records of age 23
SELECT *
FROM students
WHERE age=23
result
sid    name         username  age  gpa
53650  Jon Edwards  jon       23   2.4
to find just names and usernames
SELECT name, username
FROM students
WHERE age=23
result
name         username
Jon Edwards  jon
creating, altering, and destroying relations in SQL
CREATE TABLE students
(sid CHAR(20), name CHAR(20),
username CHAR(10), age INTEGER,
gpa REAL);
the type of each column is enforced by the DBMS
CREATE TABLE course
(sid CHAR(20), points INTEGER,
grade CHAR(2));
ALTER TABLE students
ADD COLUMN firstYear INTEGER;
every tuple in the current instance is extended with a null value in the new column
DROP TABLE students;
destroys relation students (schema and instance)
adding and deleting tuples
> what do the following statements do?
INSERT INTO students(sid, name, username, age, gpa)
VALUES (12345, 'Kate Doe', 'kate', 23, 4.0);
DELETE
FROM students
WHERE name = 'Jane Smith';
candidate keys
a set of fields is a candidate key (aka ‘key’) for a relation if...
1)  distinct tuples cannot have same values in all key fields, and
2)  this is not true for any subset of the key
if only part (1) from above is true... we have a superkey
possibly many candidate keys for a relation
DBMS admin chooses one (1) of them as primary key
an integrity constraint
condition must be true for any instance of the database
other integrity constraints?
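The two conditions in the definition can be checked mechanically on a given instance. The sketch below does so in Python for the students instance shown earlier; the function names are illustrative, not part of the lecture. Keep in mind that a key is a constraint on every legal instance, so a check on one instance can only refute a key, never prove it.

```python
from itertools import combinations

# The students instance from the example slide (sid, name, username, age, gpa).
FIELDS = ("sid", "name", "username", "age", "gpa")
STUDENTS = [
    (53666, "Sam Jones", "jones", 22, 3.4),
    (53688, "Alice Smith", "smith", 22, 3.8),
    (53650, "Jon Edwards", "jon", 23, 2.4),
]

def is_superkey(rows, fields, key):
    """Condition (1): no two distinct rows agree on all key fields."""
    positions = [fields.index(f) for f in key]
    projected = [tuple(row[p] for p in positions) for row in rows]
    return len(set(projected)) == len(projected)

def is_candidate_key(rows, fields, key):
    """Conditions (1) and (2): a superkey none of whose proper subsets is a superkey."""
    if not is_superkey(rows, fields, key):
        return False
    for size in range(1, len(key)):
        for subset in combinations(key, size):
            if is_superkey(rows, fields, subset):
                return False
    return True
```

On this instance, ("sid",) passes both conditions, ("sid", "name") passes only condition (1) and is therefore a superkey, and ("age",) fails condition (1) because two students are 22.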
candidate keys
in SQL, use
PRIMARY KEY to specify primary key
UNIQUE to specify candidate keys
example
relation enrolled holds information about student enrollment in courses
compare the following ‘create table’ statements
CREATE TABLE Enrolled
(sid CHAR(20),
cid CHAR(20),
grade CHAR(2),
PRIMARY KEY (sid,cid));
CREATE TABLE Enrolled
(sid CHAR(20),
cid CHAR(20),
grade CHAR(2),
PRIMARY KEY (sid),
UNIQUE (cid, grade));
use ICs carefully - they might forbid database instances that could arise in practice
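To see how the two declarations differ in practice, one can load the same rows under each and observe which are rejected. The sketch below uses Python's built-in sqlite3 module as a stand-in DBMS, which is an assumption (the lecture does not name a system), and the enrollment rows are hypothetical.

```python
import sqlite3

FIRST = """CREATE TABLE Enrolled
  (sid CHAR(20), cid CHAR(20), grade CHAR(2),
   PRIMARY KEY (sid, cid))"""

SECOND = """CREATE TABLE Enrolled
  (sid CHAR(20), cid CHAR(20), grade CHAR(2),
   PRIMARY KEY (sid),
   UNIQUE (cid, grade))"""

# Hypothetical enrollment data, not from the slides.
ROWS = [
    ("53666", "db101", "A"),
    ("53666", "os201", "B"),  # same student, second course
    ("53688", "db101", "A"),  # second student, same course and same grade
]

def try_inserts(create_statement, rows):
    """Create a fresh in-memory table and return the rows the DBMS accepts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(create_statement)
    accepted = []
    for row in rows:
        try:
            conn.execute("INSERT INTO Enrolled VALUES (?, ?, ?)", row)
            accepted.append(row)
        except sqlite3.IntegrityError:
            pass  # the row violates PRIMARY KEY or UNIQUE
    conn.close()
    return accepted
```

Under the first declaration all three rows are accepted. Under the second, a student can take only one course and no two students can receive the same grade in the same course, so only the first row survives.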
storage and indexing
storage
setting
the DBMS uses disks as external storage to store relations into files of records
disks retrieve random page at fixed cost
cheaper to retrieve several consecutive pages than each by random access
why?
file organization
method of arranging a file of records on external storage
record: one row of a relation
record is internally assigned a record id (rid)
rid is sufficient to physically locate record
(address)
alternative file organizations
heap files
random order
suitable when typical access is a file scan to retrieve all records
sorted files
records are sorted - typically by column value(s)
suitable if records must be retrieved by same order
indexes
data structures that allow organized access to records…
... via search keys - typically column value(s)
updates are faster than in sorted files -- why?
indexes
data structures that allow us to find rids of records with specified column values
any subset of the columns of a relation can be the search key for an index
search key is not the same as primary / candidate key
an index contains a collection of data entries
supports efficient retrieval of data entries k* with a given key value k
[figure: index file (index entries leading to data entries) pointing to data records in the data file]
types of data entries
three alternatives
1. data record with key value k
2. (k, rid of data record with search key k)
3. (k, list of rids of data records with search key k)
type of data entries is orthogonal to index structure
examples of index structures: B+ trees, hash tables
data entries of type 1
index structure is a file organization for data records
we just have an ‘index file’
[figure: index file in which index entries lead directly to data records]
> how many indexes of a relation can be of type 1?
types of data entries - types 2 & 3
data entries typically much smaller than data records
> why?
type 3 is more compact than type 2
> why?
[figure: index file (index entries, data entries) pointing to data records in the data file]
index classes
primary vs secondary
primary: if search key contains a primary key
secondary: otherwise
unique index: search key contains a candidate key
clustered vs unclustered
clustered: if order of data records is the same as that of data entries
makes big difference for some queries!
> can alternative 1 indexes be unclustered?
[figure: unclustered vs clustered index]
hash-based indexes
retrieve records with exactly specified search-key values
suitable for equality queries
index is collection of buckets
bucket = 1 or more disk pages
hashing function h
h(r) = bucket where record r belongs, based on its column values
data entries are …
... type 1: the buckets contain data records
... type 2 or 3: the buckets contain (key, rid) or (key, rids) pairs
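hash-based indexes
relation employees(name CHAR(100), age INTEGER, salary INTEGER)
data records (name, age, salary)
Smith, 44, 3000
Jones, 40, 6003
Tracy, 44, 5004
Ashby, 25, 3000
Basu, 33, 4003
Kate, 29, 2007
Cass, 50, 5004
Basu, 33, 6003
[figure: hashing function h1 on age places the data records into buckets; hashing function h2 on salary places data entries 3000, 3000, 5004, 5004, 4003, 2007, 6003, 6003 into buckets that point to the records]
clustered (type 1) hash index on age; unclustered (type 2) hash index on salary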
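A minimal sketch of an unclustered hash index with type 2 data entries, built over the employee records of this example. The class name, the number of buckets, and the hashing function are assumptions made for illustration; a rid is modelled simply as the position of the record in a list standing in for the heap file.

```python
class HashIndex:
    """Static hash index with type 2 data entries: (key, rid) pairs in buckets."""

    def __init__(self, num_buckets=4):
        self.num_buckets = num_buckets
        self.buckets = {}  # bucket number -> list of (key, rid) data entries

    def _h(self, key):
        # Hypothetical hashing function; a real DBMS picks its own.
        return hash(key) % self.num_buckets

    def insert(self, key, rid):
        self.buckets.setdefault(self._h(key), []).append((key, rid))

    def lookup(self, key):
        """Equality search: inspect a single bucket and return the matching rids."""
        return [rid for k, rid in self.buckets.get(self._h(key), []) if k == key]

# Records from the employees example; the rid is the position in this list.
EMPLOYEES = [
    ("Smith", 44, 3000), ("Jones", 40, 6003), ("Tracy", 44, 5004),
    ("Ashby", 25, 3000), ("Basu", 33, 4003), ("Kate", 29, 2007),
    ("Cass", 50, 5004), ("Basu", 33, 6003),
]

salary_index = HashIndex()
for rid, (_, _, salary) in enumerate(EMPLOYEES):
    salary_index.insert(salary, rid)
```

A call such as `salary_index.lookup(3000)` touches one bucket and returns the rids of Smith and Ashby; a range condition on salary gets no help from this structure, because neighbouring salaries land in unrelated buckets.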
b+ tree indexes
leaf pages contain data entries (sorted by search key), and are chained (prev & next)
non-leaf pages have index entries; only used to direct searches
index entry: P0 K1 P1 K2 P2 ... Km Pm
[figure: non-leaf pages above a chained level of leaf pages]
example b+ tree
root: 17
entries < 17: non-leaf page with keys 5, 13
leaf pages: 2* 3* | 5* 7* 8* | 14* 16*
entries >= 17: non-leaf page with keys 27, 30
leaf pages: 22* 24* | 27* 29* | 33* 34* 38* 39*
note that data entries in leaf level are sorted
find 28*? 29*? all > 15* and < 30*?
insert/delete
find data entry in leaf, then update it
need to adjust parent sometimes
change sometimes bubbles up the tree
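The three searches asked about on this slide can be traced with a small in-memory sketch. It assumes the example tree is read as root key 17, non-leaf pages with keys (5, 13) and (27, 30), and six chained leaves; the class and function names are illustrative, and insertion and deletion are left out.

```python
from bisect import bisect_right

class Leaf:
    """Leaf page: sorted data entries plus a pointer to the next leaf."""
    def __init__(self, entries):
        self.entries = entries
        self.next = None

class Node:
    """Non-leaf page: m keys and m + 1 child pointers (P0 K1 P1 ... Km Pm)."""
    def __init__(self, keys, children):
        self.keys = keys
        self.children = children

def find_leaf(node, key):
    """Follow index entries from the root to the leaf that may hold key."""
    while isinstance(node, Node):
        # entries < K are reached through the pointer left of K, entries >= K right of K
        node = node.children[bisect_right(node.keys, key)]
    return node

def search(root, key):
    return key in find_leaf(root, key).entries

def range_search(root, low, high):
    """Return all data entries k with low < k < high by scanning chained leaves."""
    leaf, result = find_leaf(root, low), []
    while leaf is not None:
        for k in leaf.entries:
            if k >= high:
                return result
            if k > low:
                result.append(k)
        leaf = leaf.next
    return result

# The example tree, as read from the slide (an assumption about the figure).
leaves = [Leaf(e) for e in ([2, 3], [5, 7, 8], [14, 16], [22, 24], [27, 29], [33, 34, 38, 39])]
for left, right in zip(leaves, leaves[1:]):
    left.next = right
root = Node([17], [Node([5, 13], leaves[0:3]), Node([27, 30], leaves[3:6])])
```

Every search reads one page per level, and the range query then walks the leaf chain instead of returning to the root, which is why the leaves are linked.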
access-cost analysis
access-cost model
● relation students
○  B: number of data pages, R: number of records per page
● execute typical select-from-where query
SELECT *
FROM students
WHERE <...>
● estimate running time of query
○  D: (average) time to read or write one disk page
○  ignore cpu costs
○  number of disk accesses (read/writes) is the bottleneck
file organizations
heap file (random order; inserts at eof)
sorted file, sorted on <age, gpa>
clustered B+ tree file (type 1 data entries) on search key <age, gpa>
heap file with unclustered B+ tree index on search key <age, gpa>
heap file with unclustered hash index on search key <age, gpa>
queries to compare
scan - fetch all records
SELECT * FROM students
equality search
SELECT * FROM students
WHERE age = 22 AND gpa = 4.0
range search
SELECT * FROM students
WHERE age >= 20
insert record
INSERT INTO
students (sid, name, username, age, gpa)
VALUES (12345, 'Michael', 'mike', 32, 2.6)
cost analysis
what is the estimated time for each query to run?
under simplified model
how many disk pages are accessed?
time = #disk-accesses x D
cost analysis
(fill in the cost of each operation for each file organization)
columns: scan, equality, range, insert
rows: heap, sorted, clustered, unclustered b+ tree, unclustered hash
heap file
operation: cost and explanation
scan: B; simply retrieve all pages
equality search: B in worst case; if we know that exactly one such record exists, the cost is 0.5B in expectation
range search: B; must retrieve all records
insert: 2; fetch and store back the last page of the file
sorted file
operation: cost and explanation
scan: B; simply retrieve all pages
equality search: log_2(B) + #qualifying-pages; since the condition matches the sort order, we can find the page of the record with a binary search that retrieves log_2(B) pages; if more than one record qualifies, retrieve sequentially the #qualifying-pages after the first
range search: log_2(B) + #qualifying-pages; as above, log_2(B) pages are retrieved to find the first matching record, followed possibly by a number (#qualifying-pages) of pages with qualifying records
insert: log_2(B) + B; find the position of the record in the file (log_2(B)); then read the second half of the file, insert the record, and write the second half back (0.5B + 0.5B in expectation)
clustered b+ tree
operation: cost and explanation
scan: 1.5B; simply retrieve all record pages
equality search: log_F(1.5B) + #qualifying-pages; find the first qualifying record and retrieve consecutive qualifying ones
range search: log_F(1.5B) + #qualifying-pages; find the first qualifying record and retrieve consecutive qualifying ones
insert: log_F(1.5B) + 1; search for record page (log_F(1.5B)) and add record to it (1)
assumptions: 2/3 = 67% occupancy of record pages, i.e. 1.5B record pages; fanout F
unclustered b+ tree
operation: cost and explanation
scan: B(R+0.15); scan the leaf level of the index (0.15B); for each data entry, fetch the page with the corresponding data record (6.7R x 0.15B ≈ BR)
equality search: log_F(0.15B) + #qualifying-records; locate the first data entry (log_F(0.15B)) and do one disk access for every qualifying record (#qualifying-records)
range search: log_F(0.15B) + #qualifying-records; locate the first data entry (log_F(0.15B)) and do one disk access for every qualifying record (#qualifying-records)
insert: 3 + log_F(0.15B); insert at end of heap file (2), find page for data entry (log_F(0.15B)) and update it (1)
assumptions: the size of one data entry is 10% of the size of one record; also, index pages have 2/3 = 67% occupancy; therefore, the number of index leaf pages is 0.1 * 1.5B = 0.15B and the number of data entries in one page is 10 * 0.67R = 6.7R
unclustered hash index
operation: cost and explanation
scan: B(R+0.125); retrieve pages that contain data entries (0.125B); for each data entry, fetch the page with the corresponding data record (8R x 0.125B = BR)
equality search: 2; retrieve page with data entry (1) and page with data record (1)
range search: 0.125B + #qualifying-records; the hash index offers no help - scan index (0.125B) and retrieve pages of matching records; typically it’s better to scan the entire heap file (B)
insert: 4; insert record into heap file (1 read + 1 write); insert data entry into hash index (1 read + 1 write)
assumptions: the size of one data entry is 10% of the size of one record; static hashing, no overflow pages (one bucket is one page); 4/5 = 80% occupancy; therefore, 0.1 * 1.25B = 0.125B pages for data entries and the number of data entries in a page is 10 * 0.8R = 8R
cost analysis
                     | scan       | equality                           | range                              | insert
heap                 | B          | B                                  | B                                  | 2
sorted               | B          | log_2(B) + #qualifying-pages       | log_2(B) + #qualifying-pages       | log_2(B) + B
clustered            | 1.5B       | log_F(1.5B) + #qualifying-pages    | log_F(1.5B) + #qualifying-pages    | log_F(1.5B) + 1
unclustered b+ tree  | B(R+0.15)  | log_F(0.15B) + #qualifying-records | log_F(0.15B) + #qualifying-records | 3 + log_F(0.15B)
unclustered hash     | B(R+0.125) | 2                                  | 0.125B + #qualifying-records       | 4
note we made several assumptions to obtain these numbers
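The formulas in this table are easy to evaluate for concrete parameters. The sketch below is one way to do that in Python; the function names and the sample values of B, R and F are assumptions chosen for illustration, and the result inherits every simplification of the model (occupancy figures, data entry size, no CPU cost).

```python
import math

def page_accesses(B, R, F, qualifying_pages=1, qualifying_records=1):
    """Estimated page accesses per operation, following the table's formulas.

    B: number of data pages, R: records per page, F: B+ tree fanout.
    """
    log2_B = math.log2(B)
    logF_clustered = math.log(1.5 * B, F)     # 1.5B record pages at 67% occupancy
    logF_unclustered = math.log(0.15 * B, F)  # 0.15B index leaf pages
    return {
        "heap": {"scan": B, "equality": B, "range": B, "insert": 2},
        "sorted": {
            "scan": B,
            "equality": log2_B + qualifying_pages,
            "range": log2_B + qualifying_pages,
            "insert": log2_B + B,
        },
        "clustered": {
            "scan": 1.5 * B,
            "equality": logF_clustered + qualifying_pages,
            "range": logF_clustered + qualifying_pages,
            "insert": logF_clustered + 1,
        },
        "unclustered b+ tree": {
            "scan": B * (R + 0.15),
            "equality": logF_unclustered + qualifying_records,
            "range": logF_unclustered + qualifying_records,
            "insert": 3 + logF_unclustered,
        },
        "unclustered hash": {
            "scan": B * (R + 0.125),
            "equality": 2,
            "range": 0.125 * B + qualifying_records,
            "insert": 4,
        },
    }

def estimated_time(accesses, D):
    """time = #disk-accesses x D, with D the time to read or write one page."""
    return accesses * D

# Hypothetical workload parameters, not taken from the lecture.
costs = page_accesses(B=1000, R=100, F=100)
```

With these sample values a full scan through an unclustered index costs roughly 100 times as many page accesses as a heap scan, while an equality search through the hash index costs 2 instead of up to 1000.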
the moral
different queries have different costs
for different file organizations
> how would you use this analysis as a db admin?
discuss
the moral
know your workload
what queries? how often?
on what relations? what file organizations?
what indexes would speed up response times for your workload?
hint: see WHERE clause for index key candidates
why?
what trade-offs will you face?
hint: queries are faster but updates take time, index takes space
we’ll see more complex cases in ‘query optimization’
indexes with composite search keys
composite search keys
search on a combination of fields
equality query
every field value is equal to a constant
e.g., age=20 and sal=75, wrt <sal,age> index
range query
some field value is not a constant
e.g., age=20; or age=20 and sal > 10, wrt <sal,age> index
data entries in index sorted by search key to support range queries (e.g., b+ trees)
examples of composite key indexes
data records sorted by name
name  age  sal
bob   12   10
cal   11   80
joe   12   20
sue   13   75
data entries sorted by <age,sal>: 11,80  12,10  12,20  13,75
data entries sorted by <sal,age>: 10,12  20,12  75,13  80,11
data entries sorted by <age>: 11  12  12  13
data entries sorted by <sal>: 10  20  75  80
remember also
composite indexes are larger, updated more often
composite search keys
if condition is: 3000<sal<5000:
<age,sal> index does not help! why?
because the index does not match the selection condition
index matches selection (condition ∧ ... ∧ ... ∧ condition) when:
for hash index: the selection has an equality condition on every field of the search key
for tree index: the selection has an equality or range condition on a prefix of the search key
composite search keys
to retrieve employee records with age=30 AND sal=4000,
an index on <age,sal> or <sal,age> would be better than
an index on <age> or an index on <sal>
if condition is: age=30 AND 3000<sal<5000:
<age,sal> index much better than <sal,age> index! why?
hint: allows us to locate the answer with contiguous data entries
if condition is: 20<age<30 AND 3000<sal<5000:
tree index on <age,sal> or <sal,age> makes no difference
if selectivity of each condition is the same
order can make a difference depending on the selectivity of each condition
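The hint about contiguous data entries can be made concrete by modelling the leaf level of a tree index as a sorted list of composite keys. The sketch below does this with Python's bisect module for the condition age=30 AND 3000<sal<5000; the employee values and function names are hypothetical.

```python
from bisect import bisect_left, bisect_right

# Hypothetical employee data as (age, sal) pairs; not from the slides.
EMPLOYEES = [(30, 2500), (30, 3500), (30, 4500), (25, 4000),
             (41, 3200), (30, 6000), (35, 4800)]

age_sal = sorted(EMPLOYEES)                     # data entries of an <age,sal> index
sal_age = sorted((s, a) for a, s in EMPLOYEES)  # data entries of a <sal,age> index

def scan_age_sal(entries, age, sal_low, sal_high):
    """On <age,sal> the answer is one contiguous run of data entries."""
    start = bisect_right(entries, (age, sal_low))  # first entry with sal > sal_low
    end = bisect_left(entries, (age, sal_high))    # first entry with sal >= sal_high
    return entries[start:end], end - start         # (answers, entries examined)

def scan_sal_age(entries, age, sal_low, sal_high):
    """On <sal,age> we must scan the whole sal range and filter on age."""
    start = bisect_right(entries, (sal_low, float("inf")))
    end = bisect_left(entries, (sal_high, float("-inf")))
    run = entries[start:end]
    return [(s, a) for s, a in run if a == age], len(run)
```

Both functions find the same two employees, but on this data the <age,sal> scan examines 2 data entries and the <sal,age> scan examines 5, because entries for other ages are interleaved within the salary range.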
index-only plans
some queries can be answered without retrieving any data records if a suitable index is available
example
employees
(name CHAR(100), depnum INTEGER,
age INTEGER, salary INTEGER)
with an index on <depnum>
SELECT depnum, COUNT(*)
FROM employees
GROUP BY depnum
with a b+ tree index on <age,salary>
SELECT AVG(salary)
FROM employees
WHERE age=25 AND
salary BETWEEN 3000 AND 5000
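One way to check whether a DBMS actually picks an index-only plan is to ask for its query plan. The sketch below uses SQLite through Python's sqlite3 module, which is an assumption (the lecture does not name a system); SQLite calls an index that answers a query without touching the table a covering index. The index names are made up, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE employees
  (name CHAR(100), depnum INTEGER, age INTEGER, salary INTEGER)""")
# Hypothetical index names for the two indexes discussed above.
conn.execute("CREATE INDEX idx_depnum ON employees (depnum)")
conn.execute("CREATE INDEX idx_age_salary ON employees (age, salary)")

def plan(sql):
    """Return the 'detail' column of SQLite's EXPLAIN QUERY PLAN as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return "; ".join(row[-1] for row in rows)

plan_count = plan("SELECT depnum, COUNT(*) FROM employees GROUP BY depnum")
plan_avg = plan("""SELECT AVG(salary) FROM employees
                   WHERE age = 25 AND salary BETWEEN 3000 AND 5000""")
```

Printing `plan_count` and `plan_avg` shows which index, if any, each query uses; a plan that mentions a covering index corresponds to an index-only plan in the sense of this slide.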
index-only plans
SELECT E.dno, COUNT(*)
FROM Emp E
WHERE E.age=30
GROUP BY E.dno
index-only plans are possible with both
<dno,age> and <age,dno> tree indexes
<age,dno> is better
why?
summary
●  relational model and SQL
○  tabular representation
■  one record per row
■  schema determines names and types of columns
○  simple, intuitive querying language
■  statements to select records that satisfy a condition
■  specify columns to project
■  statements to insert and delete tuples
●  storage
○  a DBMS might use different file organizations to store relations
○  heap file, sorted file, index
○  different queries have different access costs
for different file organizations
○  having the right index can make a big difference in execution time
●  commonly used indexes
○  B+ tree and hash-based index
next
b+ trees and hash-based index
external sorting
joins
query optimization
references
●  “cowbook”: Database Management Systems, by Ramakrishnan and Gehrke
●  “elmasri”: Fundamentals of Database Systems, by Elmasri and Navathe
●  other database textbooks
●  disk access analysis
○  cowbook, chapter 8
●  b+ tree and hashing algorithms
○  elmasri
■  section 18.2: hash indexes
■  section 18.3.2: b+ trees
○  cowbook
■  chapters 10 and 11
credits
slides based on material from
database management systems, by ramakrishnan and gehrke
joins
students
sid    name         username  age  gpa
53666  Sam Jones    jones     22   3.4
53688  Alice Smith  smith     22   3.8
53650  Jon Edwards  jon       23   2.4
course
sid    points  grade
53666  92      A
53688  35      D
53650  65      C
what does this compute?
SELECT S.name, C.grade
FROM Students S, Course C
WHERE S.sid = C.sid AND
C.points > 60
result
S.name       C.grade
Sam Jones    A
Jon Edwards  C
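The join above can be reproduced with Python's built-in sqlite3 module. This is a sketch: SQLite and the column types in the CREATE TABLE statements are assumptions, while the rows and the query are the ones shown on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Column types are assumed; the slide only shows the column names.
conn.execute("CREATE TABLE Students (sid INTEGER, name TEXT, username TEXT, age INTEGER, gpa REAL)")
conn.execute("CREATE TABLE Course (sid INTEGER, points INTEGER, grade TEXT)")
conn.executemany("INSERT INTO Students VALUES (?, ?, ?, ?, ?)", [
    (53666, "Sam Jones", "jones", 22, 3.4),
    (53688, "Alice Smith", "smith", 22, 3.8),
    (53650, "Jon Edwards", "jon", 23, 2.4),
])
conn.executemany("INSERT INTO Course VALUES (?, ?, ?)", [
    (53666, 92, "A"),
    (53688, 35, "D"),
    (53650, 65, "C"),
])

# Pair every student with their course row, then keep rows with more than 60 points.
join_result = conn.execute("""
    SELECT S.name, C.grade
    FROM Students S, Course C
    WHERE S.sid = C.sid AND C.points > 60
""").fetchall()
```

The query returns the name and grade of every student with more than 60 points. Without an ORDER BY the order of the rows is not guaranteed, so compare the result as a set or sort it first.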
index-only plans
what if we consider the second query?
SELECT E.dno, COUNT(*)
FROM Emp E
WHERE E.age>30
GROUP BY E.dno
we’ll come back to this after external sorting

M-2- General Reactions of amino acids.pptxDr. Santhosh Kumar. N
 

Último (20)

5 charts on South Africa as a source country for international student recrui...
5 charts on South Africa as a source country for international student recrui...5 charts on South Africa as a source country for international student recrui...
5 charts on South Africa as a source country for international student recrui...
 
Practical Research 1 Lesson 9 Scope and delimitation.pptx
Practical Research 1 Lesson 9 Scope and delimitation.pptxPractical Research 1 Lesson 9 Scope and delimitation.pptx
Practical Research 1 Lesson 9 Scope and delimitation.pptx
 
In - Vivo and In - Vitro Correlation.pptx
In - Vivo and In - Vitro Correlation.pptxIn - Vivo and In - Vitro Correlation.pptx
In - Vivo and In - Vitro Correlation.pptx
 
3.21.24 The Origins of Black Power.pptx
3.21.24  The Origins of Black Power.pptx3.21.24  The Origins of Black Power.pptx
3.21.24 The Origins of Black Power.pptx
 
The Stolen Bacillus by Herbert George Wells
The Stolen Bacillus by Herbert George WellsThe Stolen Bacillus by Herbert George Wells
The Stolen Bacillus by Herbert George Wells
 
Presentation on the Basics of Writing. Writing a Paragraph
Presentation on the Basics of Writing. Writing a ParagraphPresentation on the Basics of Writing. Writing a Paragraph
Presentation on the Basics of Writing. Writing a Paragraph
 
How to Make a Field read-only in Odoo 17
How to Make a Field read-only in Odoo 17How to Make a Field read-only in Odoo 17
How to Make a Field read-only in Odoo 17
 
How to Solve Singleton Error in the Odoo 17
How to Solve Singleton Error in the  Odoo 17How to Solve Singleton Error in the  Odoo 17
How to Solve Singleton Error in the Odoo 17
 
PISA-VET launch_El Iza Mohamedou_19 March 2024.pptx
PISA-VET launch_El Iza Mohamedou_19 March 2024.pptxPISA-VET launch_El Iza Mohamedou_19 March 2024.pptx
PISA-VET launch_El Iza Mohamedou_19 March 2024.pptx
 
CAULIFLOWER BREEDING 1 Parmar pptx
CAULIFLOWER BREEDING 1 Parmar pptxCAULIFLOWER BREEDING 1 Parmar pptx
CAULIFLOWER BREEDING 1 Parmar pptx
 
Diploma in Nursing Admission Test Question Solution 2023.pdf
Diploma in Nursing Admission Test Question Solution 2023.pdfDiploma in Nursing Admission Test Question Solution 2023.pdf
Diploma in Nursing Admission Test Question Solution 2023.pdf
 
Practical Research 1: Lesson 8 Writing the Thesis Statement.pptx
Practical Research 1: Lesson 8 Writing the Thesis Statement.pptxPractical Research 1: Lesson 8 Writing the Thesis Statement.pptx
Practical Research 1: Lesson 8 Writing the Thesis Statement.pptx
 
Maximizing Impact_ Nonprofit Website Planning, Budgeting, and Design.pdf
Maximizing Impact_ Nonprofit Website Planning, Budgeting, and Design.pdfMaximizing Impact_ Nonprofit Website Planning, Budgeting, and Design.pdf
Maximizing Impact_ Nonprofit Website Planning, Budgeting, and Design.pdf
 
P4C x ELT = P4ELT: Its Theoretical Background (Kanazawa, 2024 March).pdf
P4C x ELT = P4ELT: Its Theoretical Background (Kanazawa, 2024 March).pdfP4C x ELT = P4ELT: Its Theoretical Background (Kanazawa, 2024 March).pdf
P4C x ELT = P4ELT: Its Theoretical Background (Kanazawa, 2024 March).pdf
 
Philosophy of Education and Educational Philosophy
Philosophy of Education  and Educational PhilosophyPhilosophy of Education  and Educational Philosophy
Philosophy of Education and Educational Philosophy
 
CapTechU Doctoral Presentation -March 2024 slides.pptx
CapTechU Doctoral Presentation -March 2024 slides.pptxCapTechU Doctoral Presentation -March 2024 slides.pptx
CapTechU Doctoral Presentation -March 2024 slides.pptx
 
Quality Assurance_GOOD LABORATORY PRACTICE
Quality Assurance_GOOD LABORATORY PRACTICEQuality Assurance_GOOD LABORATORY PRACTICE
Quality Assurance_GOOD LABORATORY PRACTICE
 
Easter in the USA presentation by Chloe.
Easter in the USA presentation by Chloe.Easter in the USA presentation by Chloe.
Easter in the USA presentation by Chloe.
 
HED Office Sohayok Exam Question Solution 2023.pdf
HED Office Sohayok Exam Question Solution 2023.pdfHED Office Sohayok Exam Question Solution 2023.pdf
HED Office Sohayok Exam Question Solution 2023.pdf
 
M-2- General Reactions of amino acids.pptx
M-2- General Reactions of amino acids.pptxM-2- General Reactions of amino acids.pptx
M-2- General Reactions of amino acids.pptx
 

Modern Database Systems - Lecture 01

  • 1. Modern Database Systems Lecture 1 Aristides Gionis Michael Mathioudakis T.A.: Orestis Kostakis Spring 2016
  • 2. logistics: the assignment will be up by Monday (you will receive an email); it is due Feb 12th. If you’re not registered: I will also post material (slides and assignments) at http://michalis.co/moderndb/
  • 3. in this lecture: review of past material; relational model and SQL; storage and indexing; access cost analysis; hash index; b+ tree
  • 5. relational model and SQL. What is the relational model? A tabular representation of data. Why do we study it? It supports simple and intuitive querying, is good for educational purposes, and is the most widely used model.
  • 6. definitions. relational database: a set of relations. relation: consists of a schema and an instance (see the example on the next slide). schema: name of the relation + name and type of each field (fields as columns). instance: a table with rows and columns.
  • 7. example relation: students
      schema: students(sid: integer, name: string, username: string, age: integer, gpa: real)
      instance:
        sid   | name        | username | age | gpa
        53666 | Sam Jones   | jones    | 22  | 3.4
        53688 | Alice Smith | smith    | 22  | 3.8
        53650 | Jon Edwards | jon      | 23  | 2.4
      cardinality (number of rows) = 3, degree (number of fields/columns) = 5
      > can we have the same value twice in the same column?
  • 8. querying: a major strength of the relational model is simple, intuitive, precise querying of data; the DBMS is responsible for efficient evaluation. Structured Query Language (SQL): the standard language for relational queries; developed by IBM in the 1970s; first standardized in 1986; latest standard at the time of this lecture: 2011 (see the example on the next slide).
  • 9. example SQL query
      students:
        sid   | name        | username | age | gpa
        53666 | Sam Jones   | jones    | 22  | 3.4
        53688 | Alice Smith | smith    | 22  | 3.8
        53650 | Jon Edwards | jon      | 23  | 2.4
      to find student records of age 23:
        SELECT * FROM students WHERE age=23
        result:
        sid   | name        | username | age | gpa
        53650 | Jon Edwards | jon      | 23  | 2.4
      to find just names and usernames:
        SELECT name, username FROM students WHERE age=23
        result:
        name        | username
        Jon Edwards | jon
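The two queries on this slide can be tried out with Python's built-in sqlite3 module. This is a minimal sketch, assuming an in-memory SQLite database loaded with the three-row students instance from slide 7; the column types are SQLite's closest equivalents of the schema, and the variable names are illustrative.

```python
import sqlite3

# In-memory database holding the example instance of the students relation.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students ("
    "sid INTEGER, name TEXT, username TEXT, age INTEGER, gpa REAL)"
)
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?, ?, ?)",
    [
        (53666, "Sam Jones", "jones", 22, 3.4),
        (53688, "Alice Smith", "smith", 22, 3.8),
        (53650, "Jon Edwards", "jon", 23, 2.4),
    ],
)

# Selection only: all columns of the records with age 23.
full_records = conn.execute("SELECT * FROM students WHERE age = 23").fetchall()

# Selection plus projection: only name and username.
names = conn.execute(
    "SELECT name, username FROM students WHERE age = 23"
).fetchall()

print(full_records)
print(names)
```

The first query returns whole rows, while the second returns only the two projected columns of the same qualifying row.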
  • 10. creating, altering, and destroying relations in SQL
      CREATE TABLE students (sid CHAR(20), name CHAR(20), username CHAR(10), age INTEGER, gpa REAL);
        the type of each column is enforced by the DBMS
      CREATE TABLE course (sid CHAR(20), points INTEGER, grade CHAR(2));
      ALTER TABLE students ADD COLUMN firstYear INTEGER;
        every tuple in the current instance is extended with a null value in the new column
      DROP TABLE students;
        destroys the relation students (schema and instance)
  • 11. adding and deleting tuples
      > what do the following statements do?
      INSERT INTO students (sid, name, username, age, gpa) VALUES (12345, 'Kate Doe', 'kate', 23, 4.0);
      DELETE FROM students WHERE name = 'Jane Smith';
  • 12. candidate keys: a set of fields is a candidate key (aka ‘key’) for a relation if (1) distinct tuples cannot have the same values in all key fields, and (2) this is not true for any proper subset of the key. If only part (1) holds, we have a superkey. A relation may have many candidate keys; the DBMS admin chooses one of them as the primary key. A key is an integrity constraint: a condition that must be true for any instance of the database. > other integrity constraints?
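The two conditions of the definition can be checked mechanically against a single instance. The sketch below uses hypothetical helper names (`is_superkey`, `is_candidate_key`) and the students instance from slide 7. Note the limitation: a key is a constraint on every legal instance, so a check on one instance can only refute a proposed key, never prove it.

```python
from itertools import combinations

# The students instance from slide 7, one dict per tuple.
students = [
    {"sid": 53666, "name": "Sam Jones", "username": "jones", "age": 22, "gpa": 3.4},
    {"sid": 53688, "name": "Alice Smith", "username": "smith", "age": 22, "gpa": 3.8},
    {"sid": 53650, "name": "Jon Edwards", "username": "jon", "age": 23, "gpa": 2.4},
]


def is_superkey(rows, fields):
    """Condition (1): no two rows agree on all of the given fields."""
    seen = set()
    for row in rows:
        value = tuple(row[f] for f in fields)
        if value in seen:
            return False
        seen.add(value)
    return True


def is_candidate_key(rows, fields):
    """Conditions (1) and (2): a superkey with no proper subset that is one."""
    if not is_superkey(rows, fields):
        return False
    for size in range(len(fields)):
        for subset in combinations(fields, size):
            if is_superkey(rows, subset):
                return False
    return True


print(is_candidate_key(students, ["sid"]))          # minimal and unique here
print(is_superkey(students, ["sid", "name"]))       # unique, but not minimal
print(is_candidate_key(students, ["sid", "name"]))
print(is_superkey(students, ["age"]))               # 22 appears twice
```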
  • 13. candidate keys in SQL: use PRIMARY KEY to specify the primary key and UNIQUE to specify other candidate keys
      example: relation enrolled holds information about student enrollment in courses
      compare the following ‘create table’ statements:
        CREATE TABLE Enrolled (sid CHAR(20), cid CHAR(20), grade CHAR(2), PRIMARY KEY (sid, cid))
        CREATE TABLE Enrolled (sid CHAR(20), cid CHAR(20), grade CHAR(2), PRIMARY KEY (sid), UNIQUE (cid, grade))
      use ICs carefully - they might forbid database instances that could arise in practice
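To compare the two ‘create table’ statements, one can load the same rows into both variants and see which rows each set of constraints rejects. This is a sketch using Python's sqlite3 module; the table names `enrolled_a` and `enrolled_b`, the course ids, and the helper `rejected_rows` are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Variant 1: a student may take many courses, with one grade per (sid, cid).
conn.execute(
    "CREATE TABLE enrolled_a ("
    "sid CHAR(20), cid CHAR(20), grade CHAR(2), "
    "PRIMARY KEY (sid, cid))"
)
# Variant 2: each student appears at most once, and no two students
# may receive the same grade in the same course.
conn.execute(
    "CREATE TABLE enrolled_b ("
    "sid CHAR(20), cid CHAR(20), grade CHAR(2), "
    "PRIMARY KEY (sid), UNIQUE (cid, grade))"
)

rows = [
    ("53666", "db101", "A"),
    ("53666", "os201", "B"),  # same student, second course
    ("53688", "db101", "A"),  # different student, same course and grade
]


def rejected_rows(table):
    """Insert the rows one by one; return those the constraints reject."""
    rejected = []
    for row in rows:
        try:
            conn.execute(f"INSERT INTO {table} VALUES (?, ?, ?)", row)
        except sqlite3.IntegrityError:
            rejected.append(row)
    return rejected


rejected_a = rejected_rows("enrolled_a")
rejected_b = rejected_rows("enrolled_b")
print(rejected_a)
print(rejected_b)
```

The second variant refuses two rows that look perfectly plausible in practice, which is the point of the slide's warning about using integrity constraints carefully.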
  • 15. storage. setting: the DBMS uses disks as external storage to store relations in files of records. disks: retrieve a random page at fixed cost; it is cheaper to retrieve several consecutive pages than to retrieve each by random access (> why?). file organization: method of arranging a file of records on external storage. record: one row of a relation; each record is internally assigned a record id (rid), which is sufficient to physically locate the record (address).
  • 16. alternative file organizations. heap files: random order; suitable when typical access is a file scan to retrieve all records. sorted files: records are sorted, typically by column value(s); suitable if records must be retrieved in the same order. indexes: data structures that allow organized access to records via search keys, typically column value(s); updates are faster than in sorted files (> why?).
  • 17. indexes: data structures that allow us to find rids of records with specified column values. Any subset of the columns of a relation can be the search key for an index; a search key is not the same as a primary / candidate key. An index contains a collection of data entries and supports efficient retrieval of the data entries k* with a given key value k. (figure: index file with index entries and data entries; data file with data records)
  • 18. types of data entries: three alternatives: 1. data record with key value k; 2. (k, rid of data record with search key k); 3. (k, list of rids of data records with search key k). The type of data entries is orthogonal to the index structure; examples of index structures: B+ trees or hash tables.
  • 19. data entries of type 1: the index structure is a file organization for data records; we just have an ‘index file’. (figure: index file with index entries and data records) > how many indexes of a relation can be of type 1?
  • 20. types of data entries - types 2 & 3: data entries are typically much smaller than data records (> why?); type 3 is more compact than type 2 (> why?). (figure: index file with index entries and data entries; data file with data records)
  • 21. index classes. primary vs secondary: primary if the search key contains a primary key; unique index: the search key contains a candidate key. clustered vs unclustered: clustered if the order of data records is the same as that of data entries; makes a big difference for some queries! > can alternative 1 indexes be unclustered? (figure: unclustered vs clustered index)
  • 22. hash-based indexes: retrieve records with exactly specified search-key values; suitable for equality queries. The index is a collection of buckets; bucket = 1 or more disk pages. Hashing function h: h(r) = bucket where record r belongs, based on its column values. Data entries are: type 1: the buckets contain data records; type 2 or 3: the buckets contain (key, rid) or (key, rids) pairs.
  • 23. hash-based indexes: example
      relation employees(name CHAR(100), age INTEGER, salary INTEGER)
      data records: Smith, 44, 3000; Jones, 40, 6003; Tracy, 44, 5004; Ashby, 25, 3000; Basu, 33, 4003; Kate, 29, 2007; Cass, 50, 5004; Basu, 33, 6003
      clustered (type 1) hash index on age: hash function h1 maps age to a bucket holding the data records
      unclustered (type 2) hash index on salary: hash function h2 maps salary to a bucket holding data entries for the salary values 3000, 3000, 5004, 5004, 4003, 2007, 6003, 6003
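The unclustered (type 2) hash index on salary can be sketched in a few lines of Python. The number of buckets, the modulo hash function, and the convention that a record's rid is its position in the list are all assumptions made for illustration; the slide fixes none of them.

```python
# Heap file: a record's rid is assumed to be its position in this list.
employees = [
    ("Smith", 44, 3000),
    ("Jones", 40, 6003),
    ("Tracy", 44, 5004),
    ("Ashby", 25, 3000),
    ("Basu", 33, 4003),
    ("Kate", 29, 2007),
    ("Cass", 50, 5004),
    ("Basu", 33, 6003),
]

NUM_BUCKETS = 4  # assumed; the slide does not fix the number of buckets


def h2(salary):
    """Toy hash function on the search key (assumed, not from the slide)."""
    return salary % NUM_BUCKETS


def build_hash_index(records, key_position):
    """Build type 2 data entries (key, rid), grouped into buckets."""
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for rid, record in enumerate(records):
        key = record[key_position]
        buckets[h2(key)].append((key, rid))
    return buckets


def equality_search(buckets, records, key):
    """Look in one bucket only, then follow the rids to the data records."""
    return [records[rid] for k, rid in buckets[h2(key)] if k == key]


salary_index = build_hash_index(employees, key_position=2)
print(equality_search(salary_index, employees, 3000))
```

An equality search touches a single bucket regardless of file size, which is why this structure suits equality queries and offers nothing for range queries.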
  • 24. b+ tree indexes
      non-leaf pages have index entries; they are only used to direct searches
      a non-leaf page has the form P0 K1 P1 K2 P2 ... Km Pm (each pair Ki, Pi is an index entry)
      leaf pages contain data entries, are sorted by search key, and are chained (prev & next)
  • 25. example b+ tree
      root: 17
      non-leaf pages: [5 13] (entries < 17), [27 30] (entries >= 17)
      leaf pages: 2* 3* | 5* 7* 8* | 14* 16* | 22* 24* | 27* 29* | 33* 34* 38* 39*
      note that data entries in the leaf level are sorted
      find 28*? 29*? all > 15* and < 30*?
      insert/delete: find the data entry in its leaf, then update it; the parent sometimes needs to be adjusted, and the change sometimes bubbles up the tree
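The three searches asked on this slide can be traced with a small Python sketch. The tree shape below (root 17, non-leaf pages [5, 13] and [27, 30]) is reconstructed from the flattened figure and should be read as an assumption; the nested-tuple representation and the function names are illustrative, not a real B+ tree implementation.

```python
from bisect import bisect_right

# Leaf pages, left to right; each holds sorted data-entry keys.
leaves = [[2, 3], [5, 7, 8], [14, 16], [22, 24], [27, 29], [33, 34, 38, 39]]

# Non-leaf pages are (keys, children). A child is either another
# non-leaf page or an integer position in `leaves`.
left = ([5, 13], [0, 1, 2])
right = ([27, 30], [3, 4, 5])
root = ([17], [left, right])


def find_leaf(node, key):
    """Descend from the root to the leaf that may contain `key`."""
    while not isinstance(node, int):
        keys, children = node
        # Follow pointer Pi, where i is the number of keys <= key.
        node = children[bisect_right(keys, key)]
    return node


def contains(key):
    """Equality search: inspect a single leaf page."""
    return key in leaves[find_leaf(root, key)]


def range_search(low, high):
    """All keys k with low < k < high, walking the chained leaves."""
    position = find_leaf(root, low)
    result = []
    while position < len(leaves):
        for k in leaves[position]:
            if k >= high:
                return result
            if k > low:
                result.append(k)
        position += 1
    return result


print(contains(28), contains(29))
print(range_search(15, 30))
```

The range search descends the tree once and then only follows the leaf chain, which is what the sorted, chained leaf level is for.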
  • 27. access-cost model ● relation students ○  B: number of data pages, R: number of records per page ● execute typical select-from-where query ○  D: (average) time to read or write one disk page SELECT * FROM students WHERE <...> ● estimate running time of query ○  ignore cpu costs ○  number of disk accesses (read/writes) is the bottleneck 27
  • 28. file organizations: heap file (random order; inserts at eof); sorted file, sorted on <age, gpa>; clustered B+ tree file (type 1 data entries) on search key <age, gpa>; heap file with unclustered B+ tree index on search key <age, gpa>; heap file with unclustered hash index on search key <age, gpa>
  • 29. queries to compare
      scan - fetch all records: SELECT * FROM students
      equality search: SELECT * FROM students WHERE age = 22 AND gpa = 4.0
      range search: SELECT * FROM students WHERE age >= 20
      insert record: INSERT INTO students (sid, name, username, age, gpa) VALUES (12345, 'Michael', 'mike', 32, 2.6)
  • 30. cost analysis what is the estimated time for each query to run? under simplified model how many disk pages are accessed? time = #disk-accesses x D 30
  • 31. cost analysis (table to fill in): one column per operation (scan, equality, range, insert) and one row per file organization (heap, sorted, clustered, unclustered b+ tree, unclustered hash)
  • 32. heap file: cost and explanation per operation
      scan: B; simply retrieve all pages
      equality search: B in the worst case; if we know that exactly one such record exists, the cost is 0.5B in expectation
      range search: B; must retrieve all records
      insert: 2; fetch and store back the last page of the file
  • 33. sorted file: cost and explanation per operation
      scan: B; simply retrieve all pages
      equality search: log_2(B) + #qualifying-pages; since the condition matches the sort order, binary search finds the page of the record by retrieving log_2(B) pages; if more than one record qualifies, retrieve the #qualifying-pages sequentially after the first
      range search: log_2(B) + #qualifying-pages; as above, log_2(B) pages are retrieved to find the first matching record, followed possibly by a number (#qualifying-pages) of pages with qualifying records
      insert: log_2(B) + B; find the position of the record in the file (log_2(B)); then read the second half of the file, insert the record, and write the second half back (0.5B + 0.5B in expectation)
  • 34. clustered b+ tree: cost and explanation per operation
      scan: 1.5B; simply retrieve all record pages
      equality search: log_F(1.5B) + #qualifying-pages; find the first qualifying record and retrieve consecutive qualifying ones
      range search: log_F(1.5B) + #qualifying-pages; find the first qualifying record and retrieve consecutive qualifying ones
      insert: log_F(1.5B) + 1; search for the record page (log_F(1.5B)) and add the record to it (1)
      assumptions: 2/3 = 67% occupancy of record pages, i.e. 1.5B record pages; fanout F
  • 35. unclustered b+ tree: cost and explanation per operation
      scan: B(R+0.15); scan the leaf level of the index (0.15B); for each data entry, fetch the page with the corresponding data record (6.7R x 0.15B ≈ BR)
      equality search: log_F(0.15B) + #qualifying-records; locate the first data entry (log_F(0.15B)) and do one disk access for every qualifying record (#qualifying-records)
      range search: log_F(0.15B) + #qualifying-records; locate the first data entry (log_F(0.15B)) and do one disk access for every qualifying record (#qualifying-records)
      insert: 3 + log_F(0.15B); insert at the end of the heap file (2), find the page for the data entry (log_F(0.15B)) and update it (1)
      assumptions: the size of one data entry is 10% of the size of one record; index pages have 2/3 = 67% occupancy; therefore, the number of index leaf pages is 0.1 x 1.5B = 0.15B and the number of data entries in one page is 10 x 0.67R = 6.7R
  • 36. unclustered hash index: cost and explanation per operation
      scan: B(R+0.125); retrieve the pages that contain data entries (0.125B); for each data entry, fetch the page with the corresponding data record
      equality search: 2; retrieve the page with the data entry (1) and the page with the data record (1)
      range search: 0.125B + #qualifying-records; the hash index offers no help - scan the index (0.125B) and retrieve the pages of matching records; typically it’s better to scan the entire heap file (B)
      insert: 4; insert the record into the heap file (1 read + 1 write); insert the data entry into the hash index (1 read + 1 write)
      assumptions: the size of one data entry is 10% of the size of one record; static hashing, no overflow pages (one bucket is one page); 4/5 = 80% occupancy; therefore, 0.1 x 1.25B = 0.125B pages for data entries, and the number of data entries in a page is 10 x 0.8R = 8R
  • 37. cost analysis
      file organization   | scan       | equality                           | range                              | insert
      heap                | B          | B                                  | B                                  | 2
      sorted              | B          | log_2(B) + #qualifying-pages       | log_2(B) + #qualifying-pages       | log_2(B) + B
      clustered b+ tree   | 1.5B       | log_F(1.5B) + #qualifying-pages    | log_F(1.5B) + #qualifying-pages    | log_F(1.5B) + 1
      unclustered b+ tree | B(R+0.15)  | log_F(0.15B) + #qualifying-records | log_F(0.15B) + #qualifying-records | 3 + log_F(0.15B)
      unclustered hash    | B(R+0.125) | 2                                  | 0.125B + #qualifying-records       | 4
      note: we made several assumptions to obtain these numbers
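The summary table can be turned into a small calculator to compare file organizations for concrete sizes. This is a sketch that simply transcribes the formulas and occupancy assumptions of slides 32 to 36; the function names and the sample values of B, R, F and D are made up, and logarithms are left unrounded as on the slides.

```python
from math import log, log2


def access_costs(B, R, F, qualifying_pages=1, qualifying_records=1):
    """Estimated number of page accesses per operation and file organization.

    B: data pages, R: records per page, F: B+ tree fanout.
    """
    sorted_search = log2(B)
    clustered_search = log(1.5 * B, F)
    unclustered_search = log(0.15 * B, F)
    return {
        "heap": {"scan": B, "equality": B, "range": B, "insert": 2},
        "sorted": {
            "scan": B,
            "equality": sorted_search + qualifying_pages,
            "range": sorted_search + qualifying_pages,
            "insert": sorted_search + B,
        },
        "clustered b+ tree": {
            "scan": 1.5 * B,
            "equality": clustered_search + qualifying_pages,
            "range": clustered_search + qualifying_pages,
            "insert": clustered_search + 1,
        },
        "unclustered b+ tree": {
            "scan": B * (R + 0.15),
            "equality": unclustered_search + qualifying_records,
            "range": unclustered_search + qualifying_records,
            "insert": 3 + unclustered_search,
        },
        "unclustered hash": {
            "scan": B * (R + 0.125),
            "equality": 2,
            "range": 0.125 * B + qualifying_records,
            "insert": 4,
        },
    }


def estimated_time(accesses, D):
    """time = #disk-accesses x D, with D the time per page access."""
    return accesses * D


# Hypothetical sizes: 1024 pages, 100 records per page, fanout 100.
costs = access_costs(B=1024, R=100, F=100)
for organization, per_operation in costs.items():
    print(organization, {op: round(c, 2) for op, c in per_operation.items()})
```

Plugging in the expected number of qualifying pages or records for a given workload shows how quickly an unclustered index loses its advantage on range queries.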
  • 38. the moral: different queries have different costs for different file organizations. > how would you use this analysis as a db admin? discuss
  • 39. the moral: know your workload. What queries? How often? On what relations? What file organizations? What indexes would speed up response times for your workload? (hint: see the WHERE clause for index key candidates; why?) What trade-offs will you face? (hint: queries are faster, but updates take time and the index takes space) We’ll see more complex cases in ‘query optimization’.
  • 40. indexes with composite search keys
      composite search keys: search on a combination of fields
      equality query: every field value is equal to a constant, e.g., age=20 and sal=75, wrt <sal,age> index
      range query: some field value is not a constant, e.g., age=20; or age=20 and sal > 10, wrt <sal,age> index
      data entries in the index are sorted by search key to support range queries (e.g., b+ trees)
      examples of composite key indexes:
        data records, sorted by name (name, age, sal): bob 12 10; cal 11 80; joe 12 20; sue 13 75
        data entries of <age,sal> index: 11,80 12,10 12,20 13,75
        data entries of <sal,age> index: 10,12 20,12 75,13 80,11
        data entries of <age> index: 11 12 12 13
        data entries of <sal> index: 10 20 75 80
      remember also: composite indexes are larger and updated more often
  • 41. composite search keys: if the condition is 3000<sal<5000, an <age,sal> index does not help! why? because the index does not match the selection condition. An index matches a selection (condition ∧ ... ∧ ... ∧ condition) when: for a hash index, the selection has only equality conditions, on all fields of the search key; for a tree index, the selection includes an equality or range condition for a prefix of the search key.
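The matching rule can be written down as a small predicate. In this sketch a selection is a conjunction represented as a dict from field name to "eq" or "range"; that representation and the names `matched_prefix` and `index_matches` are assumptions made for illustration.

```python
def matched_prefix(search_key, conditions):
    """Longest prefix of the search key whose fields all have a condition."""
    prefix = []
    for field in search_key:
        if field not in conditions:
            break
        prefix.append(field)
    return prefix


def index_matches(kind, search_key, conditions):
    """Does an index of the given kind match a conjunctive selection?"""
    if kind == "hash":
        # Every field of the search key needs an equality condition.
        return all(conditions.get(field) == "eq" for field in search_key)
    if kind == "tree":
        # Conditions must cover a non-empty prefix of the search key.
        return len(matched_prefix(search_key, conditions)) > 0
    raise ValueError(f"unknown index kind: {kind}")


# 3000 < sal < 5000: an <age,sal> tree index does not match, <sal,age> does.
print(index_matches("tree", ["age", "sal"], {"sal": "range"}))
print(index_matches("tree", ["sal", "age"], {"sal": "range"}))
# age = 30 AND 3000 < sal < 5000: a hash index on <age,sal> does not match.
print(index_matches("hash", ["age", "sal"], {"age": "eq", "sal": "range"}))
```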
  • 42. composite search keys: to retrieve employee records with age=30 AND sal=4000, an index on <age,sal> or <sal,age> would be better than an index on <age> or an index on <sal>. If the condition is age=30 AND 3000<sal<5000, an <age,sal> index is much better than a <sal,age> index! why? hint: it allows us to locate the answer in contiguous data entries. Order can make a difference depending on the selectivity of each condition: if the condition is 20<age<30 AND 3000<sal<5000, a tree index on <age,sal> or <sal,age> makes no difference if the selectivity of each condition is the same.
  • 43. index-only plans: some queries can be answered without retrieving any data records if a suitable index is available
      example: employees(name CHAR(100), depnum INTEGER, age INTEGER, salary INTEGER)
      with an index on <depnum>:
        SELECT depnum, COUNT(*) FROM employees GROUP BY depnum
      with a b+ tree index on <age,salary>:
        SELECT AVG(salary) FROM employees WHERE age=25 AND salary BETWEEN 3000 AND 5000
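The second query shows why no data records are needed: the data entries of the <age,salary> index already contain every value the query reads. The sketch below models the sorted leaf level as a Python list of (age, salary) pairs with made-up values; rids are left out because this plan never follows them.

```python
from bisect import bisect_left, bisect_right

# Data entries of a tree index on <age, salary>, sorted by search key.
# Values are hypothetical; rids are omitted since they are never used here.
entries = sorted(
    [(25, 2000), (25, 3000), (25, 4500), (25, 5200), (30, 4000), (22, 3500)]
)


def avg_salary(age, low, high):
    """AVG(salary) WHERE age = ? AND salary BETWEEN low AND high,
    computed from the index entries alone."""
    start = bisect_left(entries, (age, low))
    end = bisect_right(entries, (age, high))
    salaries = [salary for _, salary in entries[start:end]]
    return sum(salaries) / len(salaries) if salaries else None


print(avg_salary(25, 3000, 5000))
```

Because the entries are sorted on <age,salary>, the qualifying ones form one contiguous run, so the plan reads a few index pages and nothing else.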
  • 44. index-only plans
      SELECT E.dno, COUNT(*) FROM Emp E WHERE E.age=30 GROUP BY E.dno
      index-only plans are possible with a tree index on either <dno,age> or <age,dno>; <age,dno> is better (> why?)
  • 46. summary ●  relational model and SQL ○  tabular representation ■  one record per row ■  schema determines names and types of columns ○  simple, intuitive querying language ■  statements to select records that satisfy a condition ■  specify columns to project ■  statements to insert and delete tuples ●  storage ○  a DBMS might use different file organizations to store relations ○  heap file, sorted file, index ○  different queries have different access costs for different file organizations ○  having the right index can make a big difference in execution time ●  commonly used indexes ○  B+ tree and hash-based index
  • 47. next b+ trees and hash-based index external sorting joins query optimization 47
  • 48. references ●  “cowbook”, database management systems, by ramakrishnan and gehrke ●  “elmasri”, fundamentals of database systems, elmasri and navathe ●  other database textbooks ●  disk access analysis ○  cowbook, chapter 8 ●  b+ tree and hashing algorithms ○  elmasri ■  section 18.2: hash indexes ■  section 18.3.2: b+ trees ○  cowbook ■  chapters 10 and 11 48
  • 49. credits slides based on material from database management systems, by ramakrishnan and gehrke 49
  • 50. joins
      students:
        sid   | name        | username | age | gpa
        53666 | Sam Jones   | jones    | 22  | 3.4
        53688 | Alice Smith | smith    | 22  | 3.8
        53650 | Jon Edwards | jon      | 23  | 2.4
      course:
        sid   | points | grade
        53666 | 92     | A
        53688 | 35     | D
        53650 | 65     | C
      > what does this compute?
        SELECT S.name, C.grade FROM Students S, Course C WHERE S.sid = C.sid AND C.points > 60
      result:
        S.name      | C.grade
        Sam Jones   | A
        Jon Edwards | C
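The join on this slide can be reproduced with Python's sqlite3 module. This is a minimal sketch, assuming an in-memory SQLite database loaded with the two instances shown; the result is sorted in Python because the query has no ORDER BY, so SQL leaves the row order unspecified.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students ("
    "sid INTEGER, name TEXT, username TEXT, age INTEGER, gpa REAL)"
)
conn.execute("CREATE TABLE course (sid INTEGER, points INTEGER, grade TEXT)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?, ?, ?)",
    [
        (53666, "Sam Jones", "jones", 22, 3.4),
        (53688, "Alice Smith", "smith", 22, 3.8),
        (53650, "Jon Edwards", "jon", 23, 2.4),
    ],
)
conn.executemany(
    "INSERT INTO course VALUES (?, ?, ?)",
    [(53666, 92, "A"), (53688, 35, "D"), (53650, 65, "C")],
)

# Pair each student with their course row, keeping rows with > 60 points.
query = (
    "SELECT S.name, C.grade "
    "FROM students S, course C "
    "WHERE S.sid = C.sid AND C.points > 60"
)
result = sorted(conn.execute(query).fetchall())
print(result)
```

Alice Smith is missing from the result because her only course row has 35 points, which fails the second condition.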
  • 51. index-only plans SELECT E.dno, COUNT (*) FROM Emp E WHERE E.age>30 GROUP BY E.dno what if we consider the second query? we’ll come back to this after external sorting