Big Data Profiling
Fribourg
May 2014
Felix Naumann
The Hasso Plattner Institute
■ Founded in 1998 as a Public Private Partnership
■ Hasso Plattner, co-founder of SAP, endowed over 200 million euros.
■ Adjoined with the University of Potsdam
■ 500 students
□ BA, MA, PhD
■ Enterprise Platform and
Integration Concepts
■ Internet Technologies and
Systems
■ Human Computer Interaction
■ Computer Graphics Systems
■ Operating Systems and
Middleware
■ Business Process Technology
■ Software Architecture
■ Information Systems
■ System Engineering and
Modeling
■ School of Design Thinking
Felix Naumann | Data Profiling | CUSO 2014
Research Topics
■ Data Profiling and Analytics
■ Data Quality and Data Cleansing
■ Similarity Search and ETL Management
■ Knowledge Discovery and Text Extraction
■ (Linked) Open Data Integration
■ For more information on research topics and on teaching, please
see http://www.hpi.uni-potsdam.de/naumann/home.html
Profiling in Spreadsheets
Many interesting questions remain
■ What are possible keys and foreign keys?
□ Phone
□ firstname, lastname, street
■ Are there any functional dependencies?
□ zip -> city
□ race -> voting behavior
■ Which columns correlate?
□ county and first name
□ DoB and last name
■ What are frequent patterns in a column?
□ ddddd
□ dd aaaa St
Definition Data Profiling
■ Data profiling is the process of examining the data available in an
existing data source [...] and collecting statistics and information
about that data.
Wikipedia 09/2013
■ Data profiling refers to the activity of creating small but
informative summaries of a database.
Ted Johnson, Encyclopedia of Database Systems
■ A fixed set of data profiling tasks / results
„Big“ Data Profiling
or How big is „Big“?
Data profiling = measuring the „Vs“
■ Volume
□ Row counts, etc.
■ Velocity
□ Temporal profiling
■ Variability
□ How difficult to
integrate and analyse
■ Veracity
□ How good is it?
■ …
[Figure: the “Vs” of Big Data: Volume, Velocity, Variety, Veracity, Viscosity, Virality]
Use Cases for Profiling
■ Query optimization
□ Counts and histograms
■ Data cleansing
□ Patterns, rules, and violations
■ Data integration
□ Cross-DB inclusion dependencies
■ Scientific data management
□ Handle new datasets
■ Data inspection, analytics, and mining
□ Profiling as preparation to decide on models and questions
■ Database reverse engineering
■ Data profiling as preparation for any other data management task
Classification of Traditional
Profiling Tasks
Data profiling
■ Single column
□ Cardinalities
□ Patterns and data types
□ Value distributions
■ Multiple columns
□ Uniqueness
◊ Key discovery
◊ Conditional
◊ Partial
□ Inclusion dependencies
◊ Foreign key discovery
◊ Conditional
◊ Partial
□ Functional dependencies
◊ Conditional
◊ Partial
Single-column vs. multi-column
■ Single column profiling
□ Most basic form of data profiling
□ Often part of the basic statistics gathered by DBMS
□ Discovery complexity: Number of values/rows
■ Multicolumn profiling
□ Discover joint properties
□ Discover dependencies
□ Discovery complexity: Number of columns and number of
values
Scalable profiling
■ Scalability in number of rows
■ Scalability in number of columns
□ “Small” table with 100 columns:
2^100 − 1 = 1,267,650,600,228,229,401,496,703,205,375
≈ 1.3 nonillion column combinations
◊ Impossible to check or even enumerate
■ Possible solutions
□ Scale up: More RAM, faster CPUs
◊ Expensive
□ Scale in: More cores
◊ More complex (threading)
□ Scale out: More machines
◊ Communication overhead
□ Intelligent enumeration and aggressive pruning
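The size of this search space is easy to check. The following is a minimal sketch; the function name `num_column_combinations` is made up for this illustration and does not come from the slides.

```python
def num_column_combinations(num_columns: int) -> int:
    """Number of non-empty column combinations of a table with num_columns columns."""
    return 2 ** num_columns - 1


# The count doubles with every additional column.
for n in (10, 20, 50, 100):
    print(f"{n:>3} columns: {num_column_combinations(n):,} combinations")
```

Even at one billion checks per second, enumerating the combinations of a 100-column table would take far longer than any practical time budget, which is why pruning matters more than raw hardware.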
Challenges of (Big) Data Profiling
■ Computational complexity
□ Number of rows
□ Number of columns (and column combinations)
■ Large solution space
■ New data types (beyond strings and numbers)
■ New data models (beyond relational): RDF, XML, etc.
■ New requirements
□ User-oriented
□ Interactive
□ Streaming data
Agenda
■ Basic statistics
■ Functional dependencies
■ Keys and foreign keys
■ Data profiling tools
■ Advanced profiling
Cardinalities, distributions, and patterns
Tasks by category (task: description)
■ Cardinalities
□ num-rows: Number of rows
□ value length: Measurements of value lengths (min, max, median, and average)
□ null values: Number or percentage of null values
□ distinct: Number of distinct values; aka “cardinality”
□ uniqueness: Number of distinct values divided by number of rows
■ Value distributions
□ histogram: Frequency histograms (equi-width, equi-depth, etc.)
□ constancy: Frequency of most frequent value divided by number of rows
□ quartiles: Three points that divide the (numeric) values into four equal groups
□ soundex: Distribution of soundex codes
□ first digit: Distribution of first digit in numeric values; to check Benford's law
■ Patterns, data types, and domains
□ basic type: Generic data type: numeric, alphabetic, date, time
□ data type: Concrete DBMS-specific data type: varchar, timestamp, etc.
□ decimals: Maximum number of decimal places in numeric values
□ precision: Maximum number of digits in numeric values
□ patterns: Histogram of value patterns (Aa9…)
□ data class: Semantic, generic data type: code, indicator, text, date/time, quantity, identifier, etc.
□ domain: Classification of semantic domain: credit card, first name, city, phenotype, etc.
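Several of the cardinality and distribution tasks above can be computed in a single pass over a column. The sketch below is illustrative only: the function name `profile_column` and the convention that `None` stands for a null value are assumptions made for this example.

```python
from collections import Counter


def profile_column(values):
    """Basic single-column statistics; None is treated as a null value."""
    num_rows = len(values)
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    lengths = sorted(len(str(v)) for v in non_null)
    distinct = len(counts)
    return {
        "num_rows": num_rows,
        "null_values": num_rows - len(non_null),
        "distinct": distinct,
        # Number of distinct values divided by number of rows.
        "uniqueness": distinct / num_rows if num_rows else 0.0,
        # Frequency of the most frequent value divided by number of rows.
        "constancy": counts.most_common(1)[0][1] / num_rows if counts else 0.0,
        "min_length": lengths[0] if lengths else None,
        "max_length": lengths[-1] if lengths else None,
    }


stats = profile_column(["Potsdam", "Berlin", "Berlin", None, "Fribourg"])
print(stats)
```

All of these measures scale with the number of rows only, which is why single-column profiling is the cheapest class of tasks.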
Data types and value patterns
■ String vs. number
■ String vs. number vs. date
■ Categorical vs. continuous
■ SQL data types
□ CHAR, INT, DECIMAL, TIMESTAMP, BIT, CLOB, …
■ Domains
□ VARCHAR(12) vs. VARCHAR (13)
■ XML data types
□ More fine grained
■ Regular expressions (\d{3})-(\d{3})-(\d{4})-(\d+)
■ Semantic domains
□ Address, phone, email, first name
(Increasing semantics from top to bottom)
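Value patterns such as "Aa9" can be derived by mapping each character to its character class and then counting the resulting patterns. This is a small sketch under that assumption; `value_pattern` and `pattern_histogram` are hypothetical helper names, and real tools typically also compress repeated symbols.

```python
import re
from collections import Counter


def value_pattern(value: str) -> str:
    """Map uppercase letters to 'A', lowercase letters to 'a', and digits to '9'."""
    pattern = re.sub(r"[A-Z]", "A", value)
    pattern = re.sub(r"[a-z]", "a", pattern)
    return re.sub(r"[0-9]", "9", pattern)


def pattern_histogram(values):
    """Count how often each pattern occurs in a column."""
    return Counter(value_pattern(v) for v in values)


print(pattern_histogram(["14482", "10115", "12 Main St", "1700"]))

# The regular expression from the slide, applied to a made-up value.
print(bool(re.fullmatch(r"(\d{3})-(\d{3})-(\d{4})-(\d+)", "123-456-7890-12")))
```

Rare patterns in the histogram are good candidates for data errors, for example a four-digit value in a column of five-digit zip codes.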
An Aside: Benford Law Frequency
(“first digit law”)
■ Statement about the distribution of first digits d in (many)
naturally occurring numbers:
□ $P(d) = \log_{10}(d+1) - \log_{10}(d) = \log_{10}\left(1 + \frac{1}{d}\right)$
□ Holds if log(x) is uniformly distributed
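The formula can be compared against observed first digits. The sketch below uses the powers of two as test data, which is an assumption chosen for illustration (they are a classic sequence that follows the law closely); the function names are made up.

```python
import math
from collections import Counter


def benford_expected(d: int) -> float:
    """Expected relative frequency of first digit d (1..9) under Benford's law."""
    return math.log10(1 + 1 / d)


def first_digit_distribution(numbers):
    """Observed relative frequency of each first digit for positive integers."""
    digits = [int(str(n)[0]) for n in numbers if n > 0]
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}


observed = first_digit_distribution([2 ** n for n in range(100)])
for d in range(1, 10):
    print(d, f"expected {benford_expected(d):.3f}", f"observed {observed[d]:.3f}")
```

A column whose first-digit distribution deviates strongly from the expected values is not necessarily wrong, but it is worth a closer look.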
[Figure: bar chart of first-digit frequencies in percent (axis 0 to 40) for the digits 1 to 9]
Examples for Benford's Law
■ Surface areas of 335 rivers
■ Sizes of 3259 US populations
■ 104 physical constants
■ 1800 molecular weights
■ 5000 entries from a mathematical handbook
■ 308 numbers contained in an issue of Reader's Digest
■ Street addresses of the first 342 persons listed in American Men of Science
Heights of the 60 tallest structures
http://en.wikipedia.org/wiki/List_of_tallest_buildings_and_structures_in_the_world#Tallest_structure_by_category
Agenda
■ Basic statistics
■ Functional dependencies
■ Keys and foreign keys
■ Data profiling tools
■ Advanced profiling
Naive Discovery Approach
■ Functional dependency “X → A”: whenever two records have the
same X values, they also have the same A values.
■ Given relation R, detect all minimal, non-trivial FDs X → A.
■ For each column combination X
□ For each pair of tuples (t1,t2)
◊ If t1[X\A] = t2[X\A] and t1[A] ≠ t2[A]: Break
■ Complexity
□ Exponential in number of attributes
□ times number of rows squared
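The naive approach can be written down directly. The following sketch enumerates left-hand sides by increasing size and compares all pairs of tuples; the function names and the tiny example relation are assumptions made for illustration, and FDs with an empty left-hand side (constant columns) are not considered.

```python
from itertools import combinations


def fd_holds(rows, lhs, rhs):
    """Check lhs -> rhs by comparing every pair of tuples (quadratic in the rows)."""
    for t1, t2 in combinations(rows, 2):
        if all(t1[c] == t2[c] for c in lhs) and t1[rhs] != t2[rhs]:
            return False  # counter-example found: break
    return True


def naive_fds(rows, columns):
    """Return all minimal, non-trivial FDs as (lhs, rhs) pairs."""
    found = []
    for rhs in columns:
        others = [c for c in columns if c != rhs]
        for size in range(1, len(others) + 1):  # exponential in the columns
            for lhs in combinations(others, size):
                # Skip non-minimal candidates: a subset of lhs already determines rhs.
                if any(r == rhs and set(l) <= set(lhs) for l, r in found):
                    continue
                if fd_holds(rows, lhs, rhs):
                    found.append((lhs, rhs))
    return found


rows = [
    {"zip": "14482", "city": "Potsdam", "name": "Ann"},
    {"zip": "14482", "city": "Potsdam", "name": "Bob"},
    {"zip": "10115", "city": "Berlin", "name": "Ann"},
]
print(naive_fds(rows, ["zip", "city", "name"]))
```

On this instance the sketch reports zip → city and city → zip; the latter is an artifact of the small sample, which illustrates that discovered dependencies hold for the instance, not necessarily for the schema.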
Tane – General Idea [HKPT99]
■ Two elements of approach
1. Reduce column combinations through pruning
◊ Reasoning over FDs
2. Reduce tuple sets through partitioning
◊ Partition tuple IDs according to attribute values
◊ Level-wise increase of size of attribute set
● Consider sets of tuples whose values agree on that set
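The partitioning idea can be sketched as follows: X → A holds exactly when adding A to X does not split any group of tuples that agree on X. This is a simplified illustration, not the Tane implementation; Tane works with stripped partitions and derives the partitions of one level from those of the previous level, whereas this sketch recomputes them from scratch. All names are made up.

```python
from collections import defaultdict


def partition(rows, attrs):
    """Group tuple IDs by their values on attrs; returns the equivalence classes."""
    groups = defaultdict(list)
    for tid, row in enumerate(rows):
        groups[tuple(row[a] for a in attrs)].append(tid)
    return {frozenset(g) for g in groups.values()}


def fd_holds_by_partition(rows, lhs, rhs):
    """lhs -> rhs holds iff the partition by lhs has as many classes as by lhs + rhs."""
    return len(partition(rows, lhs)) == len(partition(rows, list(lhs) + [rhs]))


rows = [
    {"zip": "14482", "city": "Potsdam", "name": "Ann"},
    {"zip": "14482", "city": "Potsdam", "name": "Bob"},
    {"zip": "10115", "city": "Berlin", "name": "Ann"},
]
print(partition(rows, ["zip"]))
print(fd_holds_by_partition(rows, ["zip"], "city"))
print(fd_holds_by_partition(rows, ["zip"], "name"))
```

Compared with the pairwise test, one pass over the rows per candidate is enough, which removes the quadratic factor in the number of rows.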
Discovery strategy
■ Bottom up traversal through lattice
□ ⇒ only minimal dependencies
□ Pruning
□ Re-use results from previous level
■ For a set X, test all X\A → A, A ∈ X
□ ⇒ only non-trivial dependencies
□ Test on efficient data structure
[Figure: lattice of attribute sets over A, B, C, D]
A B C D
AB AC AD BC BD CD
ABC ABD ACD BCD
ABCD
Functional Dependencies:
State of the Art
Partial and conditional dependencies
■ Partial dependency: dependencies that do not perfectly hold
□ For all but 10 of the tuples
□ Only for 90% of the tuples
□ Only for 1% of the tuples
■ Partiality also for patterns, types, uniques, and other constraints
■ Given a partial dependency: For which part does it hold?
□ Expressed as a condition over the attributes of the relation
■ Problems:
□ Infinite possibilities of conditions
□ Interestingness:
◊ Many distinct values: less interesting
◊ Few distinct values: surprising condition – high coverage
■ Useful for
□ Integration: cross-source conditional inclusion dependency
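One common way to quantify how well a dependency holds is to count the minimum number of tuples that would have to be removed for it to hold exactly. The slides do not fix a particular measure, so the choice of measure, the function names, and the example data below are assumptions.

```python
from collections import Counter, defaultdict


def fd_violations(rows, lhs, rhs):
    """Minimum number of tuples to remove so that lhs -> rhs holds exactly."""
    groups = defaultdict(Counter)
    for row in rows:
        groups[tuple(row[c] for c in lhs)][row[rhs]] += 1
    # Within each lhs group, keep the most frequent rhs value and drop the rest.
    return sum(sum(c.values()) - max(c.values()) for c in groups.values())


def fd_coverage(rows, lhs, rhs):
    """Fraction of tuples for which the dependency can be kept."""
    return 1 - fd_violations(rows, lhs, rhs) / len(rows)


# Nine consistent tuples and one typo in the city name.
rows = [{"zip": "14482", "city": "Potsdam"}] * 9 + [{"zip": "14482", "city": "Potsdm"}]
print(fd_violations(rows, ["zip"], "city"), fd_coverage(rows, ["zip"], "city"))
```

Here zip → city holds for all but one tuple, that is, for 90% of the tuples; the violating tuple is a likely data error.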
Agenda
■ Basic statistics
■ Functional dependencies
■ Keys and foreign keys
■ Data profiling tools
■ Advanced profiling
Uniqueness, keys, and foreign keys
■ Uniqueness and keys
□ Unique column: Only unique values
□ Unique column combination: Only unique value combinations
◊ Minimality: No subset is unique
□ Key candidate: No null values
◊ Uniqueness and non-null in one instance does not imply key:
Only a human can specify keys (and foreign keys)
■ Inclusion dependencies and foreign keys
□ A ⊆ B: All values in A are also present in B
□ A1,…,Ai ⊆ B1,…,Bi: All value combinations in A1,…,Ai are also present in
B1,…,Bi
□ Prerequisite for foreign key
◊ Across relations and across databases
◊ Again: Discovery on a given instance, only user can specify
for schema
Uniqueness and keys
■ Unique column
□ Only unique values
■ Unique column combination
□ Only unique value combinations
□ Minimality: No subset is unique
■ Uniques: {A, AB, AC, BC, ABC}
■ Minimal uniques: {A, BC}
■ (Maximal) Non-uniques: {B, C}
A B C
a 1 x
b 2 x
c 2 y
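The minimal uniques of the example table can be reproduced with a level-wise search that skips supersets of already known uniques. This is a brute-force sketch for illustration; the function names are made up.

```python
from itertools import combinations


def is_unique(rows, cols):
    """True if no two rows share the same value combination on cols."""
    projected = [tuple(row[c] for c in cols) for row in rows]
    return len(set(projected)) == len(projected)


def minimal_uniques(rows, columns):
    """Level-wise search; supersets of known uniques are pruned without checking."""
    minimal = []
    for size in range(1, len(columns) + 1):
        for combo in combinations(columns, size):
            if any(set(m) <= set(combo) for m in minimal):
                continue  # superset of a unique: unique, but not minimal
            if is_unique(rows, combo):
                minimal.append(combo)
    return minimal


rows = [
    {"A": "a", "B": 1, "C": "x"},
    {"A": "b", "B": 2, "C": "x"},
    {"A": "c", "B": 2, "C": "y"},
]
print(minimal_uniques(rows, ["A", "B", "C"]))
```

The result, {A} and {B, C}, matches the minimal uniques listed above; only A, B, C, and BC are actually checked against the data.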
Null values
■ Null values have a wide range of interpretations.
□ Unknown (date of birth)
□ Non-applicable (driver license number for kids)
□ Undefined (result of integration/outer join)
■ What are minimal uniques for the following data set?
■ Primary key {A}; Some unusual uniques: {C} and {CD}
■ Distinct: {A, BC} but not {CD}
A B C D
a 1 x 1
b 2 y 2
c 3 z 5
d 3 ⊥ 5
e ⊥ ⊥ 5
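The two answers can be reproduced by switching the null semantics of the uniqueness check. The sketch assumes the following reading of the slide: under the semantics of SQL's UNIQUE constraint (as implemented in many systems) a value combination containing a null never collides with another one, while under DISTINCT semantics nulls are treated as equal. `None` stands for ⊥, and the function name is made up.

```python
def is_unique(rows, cols, nulls_equal):
    """Uniqueness check with configurable null semantics."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in cols)
        if not nulls_equal and any(v is None for v in key):
            continue  # UNIQUE-style semantics: nulls never cause a duplicate
        if key in seen:
            return False
        seen.add(key)
    return True


rows = [
    {"A": "a", "B": 1, "C": "x", "D": 1},
    {"A": "b", "B": 2, "C": "y", "D": 2},
    {"A": "c", "B": 3, "C": "z", "D": 5},
    {"A": "d", "B": 3, "C": None, "D": 5},
    {"A": "e", "B": None, "C": None, "D": 5},
]
for cols in (["C"], ["C", "D"], ["B", "C"]):
    print(cols, is_unique(rows, cols, nulls_equal=False), is_unique(rows, cols, nulls_equal=True))
```

With nulls treated as distinct, {C} and {C, D} come out unique; with nulls treated as equal, they do not, while {B, C} remains unique.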
Pruning effect of a pair
[Figure: lattice of column combinations over A, B, C, D, E]
A B C D E
AB AC AD AE BC BD BE CD CE DE
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
ABCD ABCE ABDE ACDE BCDE
ABCDE
Legend: minimal unique, unique
Pruning with uniques
■ Pruning: inferring the type of a combination without actual
verification
■ If A is unique, supersets must be unique
■ Finding a unique column prunes half of the lattice
□ Remove column from initial data set and restart
■ Finding a unique column pair removes a quarter of the lattice
□ In general, the lattice over the combination is removed
■ The pruning power of a combination is reduced by prior findings
□ AB prunes a quarter
□ BC additionally prunes only one eighth
□ ABC was already pruned (one eighth)
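The pruning fractions can be verified by counting lattice nodes. The sketch below counts over the full power set of five columns, including the empty set, so that the fractions come out exactly; this counting convention and the function names are assumptions for illustration.

```python
from itertools import combinations


def lattice(columns):
    """All subsets of the columns, including the empty set."""
    return [frozenset(c) for k in range(len(columns) + 1) for c in combinations(columns, k)]


def newly_pruned_fractions(columns, uniques):
    """Fraction of the lattice that each unique prunes in addition to earlier ones."""
    nodes = lattice(columns)
    pruned = set()
    fractions = []
    for unique in uniques:
        new = {n for n in nodes if set(unique) <= n} - pruned
        fractions.append(len(new) / len(nodes))
        pruned |= new
    return fractions


print(newly_pruned_fractions("ABCDE", ["A"]))         # single unique column
print(newly_pruned_fractions("ABCDE", ["AB", "BC"]))  # two unique pairs in sequence
```

The second call shows the diminishing effect: AB prunes a quarter, and BC adds only one eighth because the supersets of ABC were already covered.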
Pruning both ways
[Figure: lattice of column combinations over A, B, C, D, E]
A B C D E
AB AC AD AE BC BD BE CD CE DE
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
ABCD ABCE ABDE ACDE BCDE
ABCDE
Legend: minimal unique, unique, maximal non-unique, non-unique
TPCH – Uniques and Non-Uniques
[Figure: distribution of non-uniques and uniques in the lattice for 8, 9, and 10 columns]
Unique Column Combination Discovery
■ DUCC
□ Basic idea: random walk through lattice
□ Pick random superset if current combination is non-unique
□ Pick random subset otherwise
□ Lazy prune with previously visited nodes
Approaches (row-based, column-based, hybrid): Gordian [SBHR06], Apriori [GW99], HCA [AN11], DUCC [HQA+14], SWAN [AQN14]
[Figure: random walk of DUCC through the lattice over A, B, C, D, E]
A B C D E
AB AC AD AE BC BD BE CD CE DE
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
ABCD ABCE ABDE ACDE BCDE
ABCDE
Nodes highlighted in the figure: ABCD, ABC, ABCE, ABD, ABDE, AB, ACD, CD, BCD, CDE
Legend: minimum unique column combination candidate, minimum unique column combination, maximum non-unique column combination candidate, maximum non-unique column combination, pruned
Visited nodes: 10 out of 26
Scaling the number of columns
■ NCVoter, 100k rows
Scaling the number of rows
■ NCVoter, 15 columns
Analysis of DUCC
■ Runtime mainly depends on size of solution set
■ Worst case: solution set in the middle
Uniques and non-uniques
in NC-voter data
■ A minimal unique: voter_reg_num, zip_code, race_code
■ A maximal non-unique: voter_reg_num, status_cd,
voter_status_desc, reason_cd, voter_status_reason_desc, absent_ind,
name_prefx_cd, name_sufx_cd, half_code, street_dir, street_type_cd,
street_sufx_cd, unit_designator, unit_num, state_cd, mail_addr2,
mail_addr3, mail_addr4, mail_state, area_cd, phone_num,
full_phone_number, drivers_lic, race_code, race_desc, ethnic_code,
ethnic_desc, party_cd, party_desc, sex_code, sex, birth_place,
precinct_abbrv, precinct_desc, municipality_abbrv, municipality_desc,
ward_abbrv, ward_desc, cong_dist_abbrv, cong_dist_desc,
super_court_abbrv, super_court_desc, judic_dist_abbrv,
judic_dist_desc, nc_senate_abbrv, nc_senate_desc, nc_house_abbrv,
nc_house_desc, county_commiss_abbrv, county_commiss_desc,
township_abbrv, township_desc, school_dist_abbrv, school_dist_desc,
fire_dist_abbrv, fire_dist_desc, water_dist_abbrv, water_dist_desc,
sewer_dist_abbrv, sewer_dist_desc, sanit_dist_abbrv,
sanit_dist_desc, rescue_dist_abbrv, rescue_dist_desc,
munic_dist_abbrv, munic_dist_desc, dist_1_abbrv, dist_1_desc,
dist_2_abbrv, dist_2_desc, confidential_ind, age, vtd_abbrv, vtd_desc
Dynamic Data: Challenges
■ Inserts may create new duplicate combinations
□ Minimal uniques (mUCs) might become non-unique
□ Maximal non-uniques (mNUCs) might lose maximality
■ Deletes remove duplicate value combinations
□ NUCs might become unique
□ mUCs might lose minimality
■ Idea
□ Leverage the knowledge of previously discovered mUCs and
mNUCs
□ Create appropriate indices
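The idea of reusing known minimal uniques with an index can be sketched for inserts as follows. This is only an illustration of the principle and not the data structures of any published system: the class name `UniqueMonitor` is hypothetical, one hash index per minimal unique is kept in memory, and finding the new minimal uniques after a violation is left out.

```python
class UniqueMonitor:
    """Keeps one value index per known minimal unique column combination."""

    def __init__(self, rows, minimal_uniques):
        self.index = {
            tuple(cols): {tuple(row[c] for c in cols) for row in rows}
            for cols in minimal_uniques
        }

    def insert(self, row):
        """Register a new row; return the minimal uniques it invalidates."""
        broken = []
        for cols, values in self.index.items():
            key = tuple(row[c] for c in cols)
            if key in values:
                broken.append(cols)  # duplicate value combination: no longer unique
            values.add(key)
        return broken


rows = [
    {"A": "a", "B": 1, "C": "x"},
    {"A": "b", "B": 2, "C": "x"},
    {"A": "c", "B": 2, "C": "y"},
]
monitor = UniqueMonitor(rows, [("A",), ("B", "C")])
print(monitor.insert({"A": "d", "B": 2, "C": "x"}))
```

Only the indexed combinations are touched per insert, so the cost does not depend on re-profiling the whole table; a complete solution would additionally search the supersets of a broken combination for new minimal uniques.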
SWAN Architecture [AQN14]
[Figure: SWAN architecture. Components: database (input dataset), repository (MUCS and MNUCS), Inserts Handler, Uniqueness Checker, Deletes Handler, Duplicate Checker, MUCS-index, data-index, duplicate-index. Arrow labels: inserts, deletes, inserts/deletes, update.]
Scaling the Number of Columns
■ 100k rows and 10k inserts
[Figure: execution time in seconds (log scale, 1 to 100,000) over the number of columns (10 to 60) for Ducc, Gordian-Inc, and Swan; two data points are labeled 0.2 and 0.9]
■ TPCH with 16 columns and 5 million rows
■ Swan/Ducc combination is able to process larger datasets than
Ducc on a static dataset
Stressing the Number of Inserts
[Figure: execution time in seconds (0 to 12,000) over the insert size relative to the initial dataset size (10% to 100%) for Ducc and Swan]
Next steps
■ Finding primary keys
□ Uniqueness is a necessary criterion
□ No null values
□ Include other features
◊ Name includes “id”, number of columns
■ Partial uniques
□ 99.9% of the data unique
□ Useful to detect data errors
□ Gordian, HCA, and DUCC can be easily modified
■ Incremental discovery
Inclusion Dependencies: Definition
■ INDs involve more than one relation.
■ Let D be a relational schema and let I be an instance of D.
■ R[A1, …, An] denotes projection of I on attributes A1, …, An of
relation R: R[A1, …, An] = π_{A1, …, An}(R)
■ IND: R[A1, …, An] ⊆ S[B1, …, Bn], where R, S are (possibly
identical) relations of D.
□ Projection on R and S must have same number of attributes.
■ An instance I of D satisfies the IND if I(R)[A1, …, An] ⊆ I(S)[B1, …, Bn]
■ Values of R: “dependent values”
■ Values of S: “referenced values”
IND types
■ Unary INDs
□ INDs on single attributes: R[A] ⊆ S[B]
■ n-ary INDs
□ INDs on multiple attributes: R[X] ⊆ S[Y]
■ Partial INDs
□ IND R[A] ⊆ S[B] is satisfied for x% of all tuples in R
□ IND R[A] ⊆ S[B] is satisfied for all but x tuples in R
■ Approximate INDs
□ IND R[A] ⊆ S[B] is satisfied with probability p.
□ Based on sampling or other heuristics
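Exact and partial INDs can be checked with a set lookup per dependent tuple. The sketch below is illustrative: the function names, the choice to ignore null values on the dependent side, and the example columns are assumptions. For n-ary INDs the same code works if each value is a tuple of attribute values.

```python
def ind_coverage(dependent, referenced):
    """Fraction of non-null dependent tuples whose value occurs among the referenced values."""
    ref = set(referenced)
    dep = [v for v in dependent if v is not None]
    if not dep:
        return 1.0  # an empty dependent side is trivially included
    return sum(v in ref for v in dep) / len(dep)


def satisfies_ind(dependent, referenced):
    """Exact IND: every dependent value is also a referenced value."""
    return ind_coverage(dependent, referenced) == 1.0


order_customer_ids = [1, 2, 2, 3, 7]  # hypothetical dependent column
customer_ids = [1, 2, 3, 4]           # hypothetical referenced column
print(satisfies_ind(order_customer_ids, customer_ids))
print(ind_coverage(order_customer_ids, customer_ids))
```

In this example the IND is violated by one tuple, so it holds only partially, for 80% of the dependent tuples.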
Motivation for IND discovery
■ General insight into data
■ Detect unknown foreign keys
■ Example
□ PDB: Protein Data Bank
□ OpenMMS provides relational schema
◊ Parses protein and nucleic acid
macromolecular structure data
from the standard mmCIF format.
□ 175 tables with primary key
constraints
□ 2705 attributes
□ But: Not a single foreign key
constraint!
Motivation for IND discovery
■ Ensembl – genome database
□ shipped as MySQL dump files
□ more than 200 tables
□ Not a single foreign key constraint!
■ Why are FKs missing?
□ Lack of support for checking foreign key constraints in the
host system
◊ Example: Oracle did not support FKs up to v6
□ Fear that checking such constraints would impede database
performance
□ Lack of database knowledge within the development team
SPIDER: Single Pass Inclusion DEpendency
Recognition [BLNT07]
■ Main ideas
□ Test all IND-candidate pairs in parallel.
□ Read attribute values only once.
□ Stop test of an IND-candidate after first counter-example.
□ Reduce number of value comparisons by specialized data structure.
□ No need to build inverted index.
■ Two steps:
□ Sort the distinct values of each attribute and write them to disk
◊ For each attribute: SELECT DISTINCT A FROM R ORDER BY A
□ Test all IND candidate pairs in parallel
SPIDER by example
■ In each step: Intersect „attributes to process“ with each refs list of
previous step
Sorted distinct values of attributes A, B, C:
A   B   C
s       s
t   t   t
x
y   y   y
        z

Step     Attributes to process   refs of dep A   refs of dep B   refs of dep C
Init                             B,C             A,C             A,B
Step 1   A,C                     C               A,C             A
Step 2   A,B,C                   C               A,C             A
Step 3   A                       ∅               A,C             A
Step 4   A,B,C                   ∅               A,C             A
Step 5   C                       ∅               A,C             ∅
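The merge step of the example can be sketched in memory with a heap over the sorted distinct value lists. This is a simplified illustration, not the published implementation: the original reads the sorted lists from disk and stops testing a candidate after the first counter-example, and the function name `spider` as used here is just a label for the sketch.

```python
import heapq


def spider(columns):
    """columns: dict of attribute name -> values. Returns unary INDs as (dep, ref) pairs."""
    sorted_values = {a: sorted(set(v)) for a, v in columns.items()}
    # Initially every other attribute is a candidate reference.
    refs = {a: set(columns) - {a} for a in columns}
    # Heap entries: (current value, attribute, position in its sorted list).
    heap = [(vals[0], a, 0) for a, vals in sorted_values.items() if vals]
    heapq.heapify(heap)
    while heap:
        value = heap[0][0]
        group = []  # the "attributes to process": all attributes containing this value
        while heap and heap[0][0] == value:
            _, attr, pos = heapq.heappop(heap)
            group.append(attr)
            if pos + 1 < len(sorted_values[attr]):
                heapq.heappush(heap, (sorted_values[attr][pos + 1], attr, pos + 1))
        for attr in group:
            refs[attr] &= set(group)  # intersect with the refs list of the previous step
    return {(dep, ref) for dep, candidates in refs.items() for ref in candidates}


columns = {
    "A": ["s", "t", "x", "y"],
    "B": ["t", "y"],
    "C": ["s", "t", "y", "z"],
}
print(sorted(spider(columns)))
```

For the example data the sketch ends with the same refs lists as Step 5: A and C reference nothing, and B ⊆ A and B ⊆ C remain.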
Problem: Automatic Determination of
Foreign Keys
■ Given
□ Relational schema
□ Database instance of that schema
□ Complete set of (observed) inclusion dependencies
◊ Attributes A and B with R[A] ⊆ S[B] (in short A ⊆ B)
■ Find
□ All foreign key constraints: attributes A and B where A references B
■ Difficulty
□ Foreign keys are not intrinsic to data, but defined by humans
□ Discover semantics
■ Machine learning approach based on syntactic features [RAB+09]
Features
■ DependentAndReferenced
□ Counts how often the dependent attribute A
appears as referenced attribute in the set of
all INDs.
□ Usually, a foreign key is not also a primary
key that is referenced as foreign key by other
tables.
■ MultiDependent
□ Counts how often A appears as dependent
attribute in the set of all INDs.
□ If s(A) is contained in the set of values of
many other attributes, the likelihood for each
of these INDs being a FK is decreased.
■ MultiReferenced
□ Counts how often B appears as referenced
attribute in the set of all INDs.
□ Often, primary keys are referenced by more
than one foreign key.
[Figure, repeated for each of the three features: example columns A (a), B (a, b), C (a), D (a), with one relationship marked “?”]
Features
■ DistinctDependentValues
□ The cardinality of s(A).
□ Usually, attributes that are foreign keys
contain at least some different values.
■ ValueLengthDiff
□ Difference between the average value length
(as string) in s(A) and s(B).
□ Usually, average length of the values is similar
whenever foreign keys reference a non-biased
sample of the primary keys.
[Figure for DistinctDependentValues: A = (a, a, a, a, a), B = (a, b, c, d, e)]
[Figure for ValueLengthDiff: A = (abab, abab, abab, c, d), B = (abab, b, c, d, e)]
Features
■ Coverage
□ The ratio of values in s(B) that are covered by s(A)
compared to all values in s(B).
□ Usually, foreign keys cover a considerable number of
primary key values.
◊ 60% of FK-attribute values cover all ref-values
◊ Each covers at least 10%
■ OutOfRange
□ Percentage of values in s(B) that are not within
[ min(s(A)), max(s(A)) ].
□ Usually, the dependent values should be evenly
distributed over the referenced values.
□ Mostly, less than 5% of values outside of range
■ TableSizeRatio
□ Ratio of number of tuples in A and number of tuples in B.
□ Usually in life sciences databases, table sizes do not
differ wildly
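Most of these features are simple set computations over the two columns. The sketch below covers Coverage, OutOfRange, and TableSizeRatio under the assumption that s(A) and s(B) denote the sets of distinct values of the dependent and the referenced attribute; the function names and example values are made up.

```python
def coverage(dependent, referenced):
    """Share of distinct referenced values that also occur among the dependent values."""
    s_dep, s_ref = set(dependent), set(referenced)
    return len(s_dep & s_ref) / len(s_ref)


def out_of_range(dependent, referenced):
    """Share of distinct referenced values outside [min(s(A)), max(s(A))]."""
    low, high = min(dependent), max(dependent)
    s_ref = set(referenced)
    return sum(1 for v in s_ref if v < low or v > high) / len(s_ref)


def table_size_ratio(num_rows_dependent, num_rows_referenced):
    """Ratio of the number of tuples of the two tables."""
    return num_rows_dependent / num_rows_referenced


a = ["b", "c", "b", "c"]                 # dependent attribute A
b = ["a", "b", "c", "d", "e", "f", "g"]  # referenced attribute B
print(coverage(a, b), out_of_range(a, b), table_size_ratio(len(a), len(b)))
```

For these values, A covers only two of seven referenced values and five referenced values lie outside the range of A, so both features argue against A being a foreign key on B.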
[Figure: A = (b, c, b, c), B = (a, b, c, d, e, f, g)]
Features
■ ColumnName
□ Similarity between name(A) and
name(B), also considering the
name of the table of which B is
an attribute.
■ TypicalNameSuffix
□ Checks whether name(A) ends
with a substring that indicates a
foreign key.
□ „id“, „key“, and „nr“
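The suffix check is a one-liner. The function name and the case-insensitive comparison are assumptions of this sketch; the column names are taken from the examples on this slide.

```python
def typical_name_suffix(column_name, suffixes=("id", "key", "nr")):
    """True if the column name ends with a substring that hints at a foreign key."""
    return column_name.lower().endswith(suffixes)


for name in ("FILMTEXTTYPNR", "C_NATIONKEY", "ENT_OID", "STUDENT"):
    print(name, typical_name_suffix(name))
```

Such a feature is purely syntactic and therefore only a weak signal on its own, which is why it is combined with the value-based features.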
FILMTEXTE.FILMTEXTTYPNR ⊆ FILMTEXTTYPEN.FILMTEXTTYPNR
CUSTOMER.C_NATIONKEY ⊆ NATION.N_NATIONKEY
SG_SEQFEATURE.ENT_OID ⊆ SG_COMMENT.ENT_OID
COURSE.STUDENT ⊆ STUDENT.ID
SG_BIOENTRY.TAX_OID ⊆ SG_TAXON.OID
Agenda
■ Basic statistics
■ Functional dependencies
■ Keys and foreign keys
■ Data profiling tools
■ Advanced profiling
Tools have very long feature lists
■ Num rows
■ Min value length
■ Median value length
■ Max value length
■ Avg value length
■ Precision of numeric values
■ Scale of numeric values
■ Quartiles
■ Basic data types
■ Num distinct values ("cardinality")
■ Percentage null values
■ Data class and data type
■ Uniqueness and constancy
■ Single-column frequency histogram
■ Multi-column frequency histogram
■ Pattern discovery (Aa9)
■ Soundex frequencies
■ Benford Law Frequency
■ Single column primary key discovery
■ Multi-column primary key discovery
■ Single column IND discovery
■ Inclusion percentage
■ Single-column FK discovery
■ Multi-column IND discovery
■ Multi-column FK discovery
■ Value overlap (cross domain analysis)
■ Single-column FD discovery
■ Multi-column FD discovery
■ Text profiling
Oracle Data Profiling and Quality
Control Center
Screenshots from IBM Information Analyzer
Typical Shortcomings of Tools
(and methods from research)
■ Usability
□ Complex to configure
□ Results complex to view and interpret
■ Scalability
□ Main-memory based
□ SQL based
■ Efficiency
□ Coffee, Lunch, Overnight
■ Functionality
□ Restricted to simplest tasks
□ Restricted to individual columns or small column sets
◊ “Realistic” key candidates vs. further use-cases
□ „Checking“ vs. „discovery“
■ Interpretation of profiling results
That's the big one
Metanome – Profiling your Datanome
[Figure: Metanome architecture. Functions: algorithm configuration, algorithm execution, result management, result presentation. Algorithms shown as jar files: SPIDER, DUCC, SWAN. Data sources shown: txt, xml, csv, DB2, MySQL. Further labels: Configuration, Measurements, Results.]
Agenda
■ Basic statistics
■ Functional dependencies
■ Keys and foreign keys
■ Data profiling tools
■ Advanced profiling
Online Profiling
■ Profiling is a long procedure
□ Boring for developers
□ Expensive for machines (I/O and CPU)
■ Challenge: Display intermediate results
□ … of improving/converging accuracy
□ Allows early abort of profiling run
■ Gear algorithms toward that goal
□ Allow intermediate output
□ Enable early output: “progressive” profiling
Incremental Profiling
■ Data is dynamic
□ Insert (batch or tuple-based)
□ Updates
□ Deletes
■ Problem: Keep profiling results up-to-date…
□ … without re-profiling the entire data set.
□ Easy examples: SUM, MIN, MAX, COUNT, AVG
□ Difficult examples: MEDIAN, uniqueness (see earlier slides),
dependencies
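The easy cases can be maintained with constant work per inserted value. The class below is a minimal sketch that handles inserts only; the class name `RunningStats` is made up for this illustration.

```python
class RunningStats:
    """Statistics that can be kept up to date per inserted value without rescanning."""

    def __init__(self):
        self.count = 0
        self.total = 0
        self.minimum = None
        self.maximum = None

    def insert(self, value):
        self.count += 1
        self.total += value
        self.minimum = value if self.minimum is None else min(self.minimum, value)
        self.maximum = value if self.maximum is None else max(self.maximum, value)

    @property
    def average(self):
        return self.total / self.count if self.count else None


stats = RunningStats()
for v in [4, 8, 15, 16, 23, 42]:
    stats.insert(v)
print(stats.count, stats.total, stats.minimum, stats.maximum, stats.average)
```

Deletes are less uniform: COUNT, SUM, and AVG can simply be decremented, whereas MIN and MAX need the remaining values (or an auxiliary structure) once the current extreme is deleted. MEDIAN cannot be maintained from a constant-size summary at all, which is what makes it a difficult example.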
Piggyback Profiling
■ Goal: Determine metadata for query results
■ Challenge: With as little query processing overhead as possible
□ Baseline: Run second SQL query
□ Piggybacking: profile along query plan (using base statistics)
Profiling for Integration
■ Profile multiple sources simultaneously
■ Schema matching/mapping
□ What constitutes the “difficulty” of matching/mapping?
■ Duplicate detection
□ Estimate data overlap
□ Estimate fusion effort
■ Create measures to estimate
integration (and cleansing) effort
□ Schema and data overlap
□ Severity of heterogeneity
Profiling new Types of Data
■ Traditional data profiling: Single table or multiple tables
■ More and more data in other models
□ XML / nested relational / JSON
□ RDF triples
□ Textual data: Blogs, Tweets, News
□ Multimedia data
■ Different models offer new dimensions to profile
□ XML: Nestedness, measures at different nesting levels
□ RDF: Graph structure, in- and outdegrees
□ Multimedia: Color, video-length, volume, etc.
□ Text: Sentiment, sentence structure, complexity, and other
linguistic measures
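Two simple text measures, average sentence length and hapax legomena (words that occur exactly once), can be sketched with regular expressions. The tokenization rules below (sentences end at ".", "!", or "?"; words are runs of word characters) are simplifying assumptions, and the function names are made up.

```python
import re
from collections import Counter


def average_sentence_length(text):
    """Average number of words per sentence."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    return sum(len(re.findall(r"\w+", s)) for s in sentences) / len(sentences)


def hapax_legomena(text):
    """Words that occur exactly once in the text (case-insensitive), sorted."""
    counts = Counter(w.lower() for w in re.findall(r"\w+", text))
    return sorted(w for w, c in counts.items() if c == 1)


text = "Data profiling is fun. Profiling big data is hard!"
print(average_sentence_length(text))
print(hapax_legomena(text))
```

Both measures are per-document statistics, so they play the same role for text collections that cardinalities and value distributions play for relational columns.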
Average Sentence Length
„Literature Fingerprinting: A New Method for Visual
Literary Analysis” by Daniel A. Keim and Daniela Oelke
Hapax Legomena
„Literature Fingerprinting: A New Method for Visual
Literary Analysis” by Daniel A. Keim and Daniela Oelke
News Statistics
Master thesis Matthias Kohnen
Summary
■ Basic statistics
■ Functional dependencies
■ Keys and foreign keys
■ Data profiling tools
■ Advanced profiling
Summary
Data Profiling
■ Single source
□ Single column
◊ Cardinalities
◊ Uniqueness and keys
◊ Patterns and data types
◊ Distributions
□ Multiple columns
◊ Uniqueness and keys
◊ Inclusion and foreign key dep.
◊ Functional dependencies
◊ Conditional and approximate dep.
■ Multiple sources
□ Topical overlap
◊ Topic discovery
◊ Topical clustering
□ Schematic overlap
◊ Schema matching
◊ Cross-schema dependencies
□ Data overlap
◊ Duplicate detection
◊ Record linkage

Más contenido relacionado

La actualidad más candente (20)

Data Quality & Data Governance
Data Quality & Data GovernanceData Quality & Data Governance
Data Quality & Data Governance
 
Data analytics
Data analyticsData analytics
Data analytics
 
Data
DataData
Data
 
Exploratory data analysis
Exploratory data analysis Exploratory data analysis
Exploratory data analysis
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Data Quality
Data QualityData Quality
Data Quality
 
Data Quality Services in SQL Server 2012
Data Quality Services in SQL Server 2012Data Quality Services in SQL Server 2012
Data Quality Services in SQL Server 2012
 
Database
DatabaseDatabase
Database
 
Data warehouse architecture
Data warehouse architectureData warehouse architecture
Data warehouse architecture
 
Introduction to Tableau
Introduction to Tableau Introduction to Tableau
Introduction to Tableau
 
Ethics in Data Management.pptx
Ethics in Data Management.pptxEthics in Data Management.pptx
Ethics in Data Management.pptx
 
Data Preprocessing
Data PreprocessingData Preprocessing
Data Preprocessing
 
Introduction to database
Introduction to databaseIntroduction to database
Introduction to database
 
Data quality and bi
Data quality and biData quality and bi
Data quality and bi
 
Data Mining : Concepts
Data Mining : ConceptsData Mining : Concepts
Data Mining : Concepts
 
Data Quality
Data QualityData Quality
Data Quality
 
Data analysis
Data analysisData analysis
Data analysis
 
Relational Database Design
Relational Database DesignRelational Database Design
Relational Database Design
 
The data quality challenge
The data quality challengeThe data quality challenge
The data quality challenge
 
Introduction to Relational Databases
Introduction to Relational DatabasesIntroduction to Relational Databases
Introduction to Relational Databases
 

Similar a Big Data Profiling

Henning agt talk-caise-semnet
Henning agt   talk-caise-semnetHenning agt   talk-caise-semnet
Henning agt talk-caise-semnetcaise2013vlc
 
Stefano romanazzi terrorist network mining.pptx
Stefano romanazzi terrorist network mining.pptxStefano romanazzi terrorist network mining.pptx
Stefano romanazzi terrorist network mining.pptxStefano Romanazzi
 
Dia sds2015 web version
Dia sds2015 web versionDia sds2015 web version
Dia sds2015 web versionMichael Brodie
 
Analyst’s Nightmare or Laundering Massive Spreadsheets
Analyst’s Nightmare or Laundering Massive SpreadsheetsAnalyst’s Nightmare or Laundering Massive Spreadsheets
Analyst’s Nightmare or Laundering Massive SpreadsheetsPyData
 
2016 04-19 machine learning
2016 04-19 machine learning2016 04-19 machine learning
2016 04-19 machine learningMark Reynolds
 
Data Discovery and Visualization
Data Discovery and VisualizationData Discovery and Visualization
Data Discovery and VisualizationDr. Neil Brittliff
 
Big Data Analytics for connected home
Big Data Analytics for connected homeBig Data Analytics for connected home
Big Data Analytics for connected homeHéloïse Nonne
 
Module 2_ Introduction to Data Mining, Data Exploration and Data Pre-processi...
Module 2_ Introduction to Data Mining, Data Exploration and Data Pre-processi...Module 2_ Introduction to Data Mining, Data Exploration and Data Pre-processi...
Module 2_ Introduction to Data Mining, Data Exploration and Data Pre-processi...nikshaikh786
 
Introduction to Data Mining - A Beginner's Guide
Introduction to Data Mining - A Beginner's GuideIntroduction to Data Mining - A Beginner's Guide
Introduction to Data Mining - A Beginner's Guidegokulprasath06
 
Spsshelp 100608163328-phpapp01
Spsshelp 100608163328-phpapp01Spsshelp 100608163328-phpapp01
Spsshelp 100608163328-phpapp01Henock Beyene
 
Semantic Search and Result Presentation with Entity Cards
Semantic Search and Result Presentation with Entity CardsSemantic Search and Result Presentation with Entity Cards
Semantic Search and Result Presentation with Entity CardsFaegheh Hasibi
 
Get Started with Data Science by Analyzing Traffic Data from California Highways
Get Started with Data Science by Analyzing Traffic Data from California HighwaysGet Started with Data Science by Analyzing Traffic Data from California Highways
Get Started with Data Science by Analyzing Traffic Data from California HighwaysAerospike, Inc.
 
An experimental comparison of globally-optimal data de-identification algorithms
An experimental comparison of globally-optimal data de-identification algorithmsAn experimental comparison of globally-optimal data de-identification algorithms
An experimental comparison of globally-optimal data de-identification algorithmsarx-deidentifier
 
Oxford Lectures Part 1
Oxford Lectures Part 1Oxford Lectures Part 1
Oxford Lectures Part 1Andrea Pasqua
 
2 introductory slides
2 introductory slides2 introductory slides
2 introductory slidestafosepsdfasg
 
finding nobel prize window by PageRank
finding nobel prize window by PageRankfinding nobel prize window by PageRank
finding nobel prize window by PageRankYuji Fujita
 

Big Data Profiling

  • 1. Big Data Profiling Fribourg May 2014 Felix Naumann
  • 2. The Hasso Plattner Institute ■ Founded in 1998 as a public-private partnership ■ Hasso Plattner, co-founder of SAP, endowed over 200 million euros ■ Affiliated with the University of Potsdam ■ 500 students □ BA, MA, PhD ■ Enterprise Platform and Integration Concepts ■ Internet Technologies and Systems ■ Human Computer Interaction ■ Computer Graphics Systems ■ Operating Systems and Middleware ■ Business Process Technology ■ Software Architecture ■ Information Systems ■ System Engineering and Modeling ■ School of Design Thinking Felix Naumann | Data Profiling | CUSO 2014
  • 3. Research Topics ■ Data Profiling and Analytics ■ Data Quality and Data Cleansing ■ Similarity Search and ETL Management ■ Knowledge Discovery and Text Extraction ■ (Linked) Open Data Integration ■ For more information on research topics and on teaching, please see http://www.hpi.uni-potsdam.de/naumann/home.html
  • 4. Profiling in Spreadsheets
  • 11. Many interesting questions remain ■ What are possible keys and foreign keys? □ Phone □ firstname, lastname, street ■ Are there any functional dependencies? □ zip -> city □ race -> voting behavior ■ Which columns correlate? □ county and first name □ DoB and last name ■ What are frequent patterns in a column? □ ddddd □ dd aaaa St
  • 12. Definition Data Profiling ■ “Data profiling is the process of examining the data available in an existing data source [...] and collecting statistics and information about that data.” (Wikipedia, 09/2013) ■ “Data profiling refers to the activity of creating small but informative summaries of a database.” (Ted Johnson, Encyclopedia of Database Systems) ■ A fixed set of data profiling tasks / results
  • 13. “Big” Data Profiling, or: How big is “Big”? Data profiling = measuring the “Vs” ■ Volume □ Row counts, etc. ■ Velocity □ Temporal profiling ■ Variability □ How difficult to integrate and analyse ■ Veracity □ How good is it? ■ … [Figure: the Vs of Big Data: Volume, Velocity, Variety, Veracity, Viscosity, Virality]
  • 14. Use Cases for Profiling ■ Query optimization □ Counts and histograms ■ Data cleansing □ Patterns, rules, and violations ■ Data integration □ Cross-DB inclusion dependencies ■ Scientific data management □ Handle new datasets ■ Data inspection, analytics, and mining □ Profiling as preparation to decide on models and questions ■ Database reverse engineering ■ Data profiling as preparation for any other data management task
  • 15. Classification of Traditional Profiling Tasks
      Data profiling
      ■ Single column
        □ Cardinalities
        □ Patterns and data types
        □ Value distributions
      ■ Multiple columns
        □ Uniqueness ◊ Key discovery ◊ Conditional ◊ Partial
        □ Inclusion dependencies ◊ Foreign key discovery ◊ Conditional ◊ Partial
        □ Functional dependencies ◊ Conditional ◊ Partial
  • 16. Single-column vs. multi-column ■ Single column profiling □ Most basic form of data profiling □ Often part of the basic statistics gathered by DBMS □ Discovery complexity: Number of values/rows ■ Multicolumn profiling □ Discover joint properties □ Discover dependencies □ Discovery complexity: Number of columns and number of values
  • 17. Scalable profiling ■ Scalability in number of rows ■ Scalability in number of columns □ “Small” table with 100 columns: 2^100 − 1 = 1,267,650,600,228,229,401,496,703,205,375 ≈ 1.3 nonillion column combinations ◊ Impossible to check or even enumerate ■ Possible solutions □ Scale up: More RAM, faster CPUs ◊ Expensive □ Scale in: More cores ◊ More complex (threading) □ Scale out: More machines ◊ Communication overhead □ Intelligent enumeration and aggressive pruning
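The count of column combinations on slide 17 is the number of non-empty subsets of the column set. A minimal Python check (the function name is illustrative and not from the slides):

```python
def num_column_combinations(num_columns: int) -> int:
    """Number of non-empty column subsets of a table with num_columns columns."""
    return 2 ** num_columns - 1

# Python integers are arbitrary precision, so the exact value is printed.
print(f"{num_column_combinations(100):,}")
```

Each additional column doubles the number of combinations (plus one), which is why more hardware alone is not enough and pruning is needed.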
  • 18. Challenges of (Big) Data Profiling ■ Computational complexity □ Number of rows □ Number of columns (and column combinations) ■ Large solution space ■ New data types (beyond strings and numbers) ■ New data models (beyond relational): RDF, XML, etc. ■ New requirements □ User-oriented □ Interactive □ Streaming data
  • 19. Agenda ■ Basic statistics ■ Functional dependencies ■ Keys and foreign keys ■ Data profiling tools ■ Advanced profiling
  • 20. Cardinalities, distributions, and patterns
      ■ Cardinalities
        □ num-rows: Number of rows
        □ value length: Measurements of value lengths (min, max, median, and average)
        □ null values: Number or percentage of null values
        □ distinct: Number of distinct values; aka “cardinality”
        □ uniqueness: Number of distinct values divided by number of rows
      ■ Value distributions
        □ histogram: Frequency histograms (equi-width, equi-depth, etc.)
        □ constancy: Frequency of most frequent value divided by number of rows
        □ quartiles: Three points that divide the (numeric) values into four equal groups
        □ soundex: Distribution of soundex codes
        □ first digit: Distribution of first digit in numeric values; to check Benford's law
      ■ Patterns, data types, and domains
        □ basic type: Generic data type: numeric, alphabetic, date, time
        □ data type: Concrete DBMS-specific data type: varchar, timestamp, etc.
        □ decimals: Maximum number of decimal places in numeric values
        □ precision: Maximum number of digits in numeric values
        □ patterns: Histogram of value patterns (Aa9…)
        □ data class: Semantic, generic data type: code, indicator, text, date/time, quantity, identifier, etc.
        □ domain: Classification of semantic domain: credit card, first name, city, phenotype, etc.
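Several of the single-column tasks listed on slide 20 reduce to a single pass over the values. The following Python sketch computes a few of them; the function name, the dictionary keys, and the use of `None` for null values are assumptions made for illustration.

```python
from collections import Counter
from statistics import median

def profile_column(values):
    """Basic single-column statistics; None stands for a null value."""
    num_rows = len(values)
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)                  # value -> frequency
    lengths = [len(str(v)) for v in non_null]   # value lengths as strings
    return {
        "num_rows": num_rows,
        "null_values": num_rows - len(non_null),
        "distinct": len(counts),
        # distinct values divided by number of rows
        "uniqueness": len(counts) / num_rows if num_rows else 0.0,
        # frequency of the most frequent value divided by number of rows
        "constancy": counts.most_common(1)[0][1] / num_rows if counts else 0.0,
        "min_length": min(lengths) if lengths else None,
        "max_length": max(lengths) if lengths else None,
        "median_length": median(lengths) if lengths else None,
    }

zips = ["14482", "14482", "10115", None, "1700"]
print(profile_column(zips))
```

Whether nulls count towards `distinct` and `uniqueness` is a design choice; this sketch excludes them from the distinct count but keeps them in the row count.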
  • 21. Data types and value patterns ■ String vs. number ■ String vs. number vs. date ■ Categorical vs. continuous ■ SQL data types □ CHAR, INT, DECIMAL, TIMESTAMP, BIT, CLOB, … ■ Domains □ VARCHAR(12) vs. VARCHAR(13) ■ XML data types □ More fine-grained ■ Regular expressions □ (\d{3})-(\d{3})-(\d{4})-(\d+) ■ Semantic domains □ Address, phone, email, first name [Figure label alongside the list: Increasing semantics]
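A value-pattern histogram of the kind mentioned on the slides (patterns such as `Aa9`) can be sketched by generalizing every character to its class. The mapping below (uppercase to `A`, lowercase to `a`, digit to `9`) and all names are assumptions for illustration; the regular expression is the one from slide 21.

```python
import re
from collections import Counter

def value_pattern(value: str) -> str:
    """Generalize a value: uppercase -> A, lowercase -> a, digit -> 9."""
    pattern = re.sub(r"[A-Z]", "A", value)
    pattern = re.sub(r"[a-z]", "a", pattern)
    return re.sub(r"\d", "9", pattern)

def pattern_histogram(values):
    """Count how often each generalized pattern occurs in a column."""
    return Counter(value_pattern(v) for v in values)

print(pattern_histogram(["14482", "10115", "12 Main St"]))

# The four-group expression from the slide, applied to a made-up value.
PHONE_LIKE = re.compile(r"(\d{3})-(\d{3})-(\d{4})-(\d+)")
match = PHONE_LIKE.fullmatch("555-123-4567-89")
print(match.groups() if match else None)
```

Rare patterns in the histogram are candidates for data errors, while a dominant pattern suggests a format rule for the column.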
  • 22. An Aside: Benford's Law (“first digit law”) ■ Statement about the distribution of first digits d in (many) naturally occurring numbers: □ $P(d) = \log_{10}(d+1) - \log_{10}(d) = \log_{10}\left(1 + \frac{1}{d}\right)$ □ Holds if log(x) is uniformly distributed [Figure: bar chart of the frequency of first digits 1 to 9; frequency axis from 0 to 40]
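The first-digit check can be sketched in a few lines of Python. The function names are illustrative, the input is assumed to consist of positive integers, and powers of two serve only as a convenient sequence that is known to follow the law closely.

```python
import math
from collections import Counter

def benford_expected(d: int) -> float:
    """Expected relative frequency of leading digit d (1..9) under Benford's law."""
    return math.log10(1 + 1 / d)

def first_digit_distribution(numbers):
    """Observed relative frequency of leading digits; assumes positive integers."""
    digits = Counter(int(str(n)[0]) for n in numbers)
    total = sum(digits.values())
    return {d: digits.get(d, 0) / total for d in range(1, 10)}

powers_of_two = [2 ** k for k in range(200)]
observed = first_digit_distribution(powers_of_two)
for d in range(1, 10):
    print(d, f"expected {benford_expected(d):.3f}", f"observed {observed[d]:.3f}")
```

A column whose observed distribution deviates strongly from the expected one is not necessarily wrong, but it may deserve a closer look.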
  • 23. Examples for Benford's Law ■ Surface areas of 335 rivers ■ Sizes of 3259 US populations ■ 104 physical constants ■ 1800 molecular weights ■ 5000 entries from a mathematical handbook ■ 308 numbers contained in an issue of Reader's Digest ■ Street addresses of the first 342 persons listed in American Men of Science [Figure: Heights of the 60 tallest structures, http://en.wikipedia.org/wiki/List_of_tallest_buildings_and_structures_in_the_world#Tallest_structure_by_category]
  • 24. Agenda ■ Basic statistics ■ Functional dependencies ■ Keys and foreign keys ■ Data profiling tools ■ Advanced profiling
  • 25. Naive Discovery Approach ■ Functional dependency “X → A”: whenever two records have the same X values, they also have the same A values. ■ Given relation R, detect all minimal, non-trivial FDs X → A. ■ For each column combination X and each attribute A ∈ X □ For each pair of tuples (t1, t2) ◊ If t1[X \ A] = t2[X \ A] and t1[A] ≠ t2[A]: Break ■ Complexity □ Exponential in number of attributes □ times number of rows squared
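The naive approach can be written down directly. The sketch below is illustrative only: rows are assumed to be dictionaries, only non-empty left-hand sides are considered, and the function names and sample data are made up.

```python
from itertools import combinations

def fd_holds(rows, lhs, rhs):
    """Check lhs -> rhs by comparing every pair of rows (quadratic in rows)."""
    for t1, t2 in combinations(rows, 2):
        if all(t1[c] == t2[c] for c in lhs) and t1[rhs] != t2[rhs]:
            return False  # same lhs values but different rhs value: violation
    return True

def naive_fds(rows, columns):
    """All minimal, non-trivial FDs with non-empty lhs (exponential in columns)."""
    found = []
    for rhs in columns:
        others = [c for c in columns if c != rhs]   # non-trivial: rhs not in lhs
        for size in range(1, len(others) + 1):      # small lhs first
            for lhs in combinations(others, size):
                # Skip lhs that contain a smaller lhs already determining rhs.
                if any(r == rhs and set(l) < set(lhs) for l, r in found):
                    continue
                if fd_holds(rows, lhs, rhs):
                    found.append((lhs, rhs))
    return found

rows = [
    {"name": "Ann", "zip": "14482", "city": "Potsdam"},
    {"name": "Bob", "zip": "14482", "city": "Potsdam"},
    {"name": "Ann", "zip": "10115", "city": "Berlin"},
    {"name": "Cid", "zip": "10117", "city": "Berlin"},
]
print(naive_fds(rows, ["name", "zip", "city"]))
```

Both nested loops are visible in the code: the enumeration of left-hand sides grows exponentially with the number of columns, and each check compares all pairs of rows.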
  • 26. Tane – General Idea [HKPT99] ■ Two elements of approach 1. Reduce column combinations through pruning ◊ Reasoning over FDs 2. Reduce tuple sets through partitioning ◊ Partition tuple IDs according to attribute values ◊ Level-wise increase of size of attribute set ● Consider sets of tuples whose values agree on that set
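The partitioning idea can be illustrated as follows: group tuple IDs by their values on an attribute set, then compare partition sizes. X → A holds exactly when adding A to X does not split any group, i.e. when both partitions have the same number of groups. This is a simplified sketch with made-up names and data, not the implementation from [HKPT99], which works with more compact partition representations.

```python
from collections import defaultdict

def partition(rows, columns):
    """Group row ids by their values on the given columns."""
    groups = defaultdict(set)
    for row_id, row in enumerate(rows):
        groups[tuple(row[c] for c in columns)].add(row_id)
    return list(groups.values())

def fd_holds_by_partition(rows, lhs, rhs):
    """lhs -> rhs holds iff refining the lhs partition by rhs splits no group."""
    return len(partition(rows, list(lhs))) == len(partition(rows, list(lhs) + [rhs]))

rows = [
    {"zip": "14482", "city": "Potsdam"},
    {"zip": "14482", "city": "Potsdam"},
    {"zip": "10115", "city": "Berlin"},
    {"zip": "10117", "city": "Berlin"},
]
print(partition(rows, ["zip"]))
print(fd_holds_by_partition(rows, ["zip"], "city"))
print(fd_holds_by_partition(rows, ["city"], "zip"))
```

Compared with the pairwise check, each test needs only one grouping pass over the rows instead of a comparison of all tuple pairs.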
  • 27. Discovery strategy ■ Bottom-up traversal through lattice □ → Only minimal dependencies □ Pruning □ Re-use results from previous level ■ For a set X, test all X \ A → A, A ∈ X □ → Only non-trivial dependencies □ Test on efficient data structure [Figure: lattice of column combinations over A, B, C, D, from the single columns up to ABCD]
  • 28. Functional Dependencies: State of the Art
  • 29. Partial and conditional dependencies ■ Partial dependency: a dependency that does not hold perfectly □ For all but 10 of the tuples □ Only for 90% of the tuples □ Only for 1% of the tuples ■ Partiality also for patterns, types, uniques, and other constraints ■ Given a partial dependency: For which part does it hold? □ Expressed as a condition over the attributes of the relation ■ Problems: □ Infinite possibilities of conditions □ Interestingness: ◊ Many distinct values: less interesting ◊ Few distinct values: surprising condition, high coverage ■ Useful for □ Integration: cross-source conditional inclusion dependencies
  • 30. Agenda ■ Basic statistics ■ Functional dependencies ■ Keys and foreign keys ■ Data profiling tools ■ Advanced profiling
  • 31. Uniqueness, keys, and foreign keys ■ Uniqueness and keys □ Unique column: Only unique values □ Unique column combination: Only unique value combinations ◊ Minimality: No subset is unique □ Key candidate: No null values ◊ Uniqueness and non-null in one instance does not imply key: Only a human can specify keys (and foreign keys) ■ Inclusion dependencies and foreign keys □ A ⊆ B: All values in A are also present in B □ A1,…,Ai ⊆ B1,…,Bi: All value combinations in A1,…,Ai are also present in B1,…,Bi □ Prerequisite for foreign key ◊ Across relations and across databases ◊ Again: Discovery on a given instance, only the user can specify for the schema
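Checking a single inclusion dependency candidate on a given instance amounts to a subset test. A minimal Python sketch, with hypothetical table and column names and without any special treatment of null values:

```python
def ind_holds(dependent, referenced):
    """Unary inclusion dependency A ⊆ B: every value of A also occurs in B."""
    return set(dependent) <= set(referenced)

def nary_ind_holds(dep_rows, dep_cols, ref_rows, ref_cols):
    """n-ary IND: every value combination of dep_cols occurs in ref_cols."""
    dep = {tuple(r[c] for c in dep_cols) for r in dep_rows}
    ref = {tuple(r[c] for c in ref_cols) for r in ref_rows}
    return dep <= ref

customers = [{"id": 1, "country": "CH"}, {"id": 2, "country": "DE"}]
orders = [
    {"customer_id": 1, "ship_country": "CH"},
    {"customer_id": 2, "ship_country": "CH"},
]

print(ind_holds([o["customer_id"] for o in orders], [c["id"] for c in customers]))
print(nary_ind_holds(orders, ["customer_id", "ship_country"],
                     customers, ["id", "country"]))
```

As the slide stresses, a satisfied inclusion dependency is only a prerequisite: whether `customer_id` is meant to be a foreign key remains a modeling decision.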
  • 32. Uniqueness and keys ■ Unique column □ Only unique values ■ Unique column combination □ Only unique value combinations □ Minimality: No subset is unique ■ Uniques: {A, AB, AC, BC, ABC} ■ Minimal uniques: {A, BC} ■ (Maximal) Non-uniques: {B, C}
      Example table:
      A  B  C
      a  1  x
      b  2  x
      c  2  y
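The minimal uniques of the three-row example table can be reproduced with a level-wise enumeration. The sketch is illustrative (function names and the dictionary representation of rows are assumptions) and enumerates the whole lattice, so it is only practical for a handful of columns.

```python
from itertools import combinations

def is_unique(rows, cols):
    """True if no value combination on cols occurs in more than one row."""
    projected = [tuple(r[c] for c in cols) for r in rows]
    return len(set(projected)) == len(projected)

def minimal_uniques(rows, columns):
    """Level-wise search; supersets of known uniques are skipped, not checked."""
    found = []
    for size in range(1, len(columns) + 1):
        for cols in combinations(columns, size):
            if any(set(u) <= set(cols) for u in found):
                continue  # unique by inference, but not minimal
            if is_unique(rows, cols):
                found.append(cols)
    return found

rows = [
    {"A": "a", "B": 1, "C": "x"},
    {"A": "b", "B": 2, "C": "x"},
    {"A": "c", "B": 2, "C": "y"},
]
print(minimal_uniques(rows, ["A", "B", "C"]))  # [('A',), ('B', 'C')]
```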
  • 33. Null values ■ Null values have a wide range of interpretations. □ Unknown (date of birth) □ Non-applicable (driver license number for kids) □ Undefined (result of integration/outer join) ■ What are minimal uniques for the following data set? ■ Primary key {A}; Some unusual uniques: {C} and {CD} ■ Distinct: {A, BC} but not {CD}
      Data set (⊥ denotes a null value):
      A  B  C  D
      a  1  x  1
      b  2  y  2
      c  3  z  5
      d  3  ⊥  5
      e  ⊥  ⊥  5
  • 34. Pruning effect of a pair [Figure: lattice of column combinations over A, B, C, D, E, from the single columns up to ABCDE; legend: minimal unique, unique]
  • 35. Pruning with uniques ■ Pruning: inferring the type of a combination without actual verification ■ If A is unique, all supersets of A must be unique ■ Finding a unique column prunes half of the lattice □ Remove column from initial data set and restart ■ Finding a unique column pair removes a quarter of the lattice □ In general, the lattice over the combination is removed ■ The pruning power of a combination is reduced by prior findings □ AB prunes a quarter □ BC additionally prunes only one eighth □ The supersets of ABC (one eighth) are already pruned
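The fractions on this slide can be verified by counting lattice nodes for a small example. The sketch below uses five columns A to E and made-up helper names; it counts the unique combination itself together with its supersets, and takes fractions relative to all 2^5 = 32 subsets including the empty set.

```python
from itertools import combinations

def lattice(columns):
    """All non-empty column combinations as frozensets."""
    return [frozenset(c)
            for k in range(1, len(columns) + 1)
            for c in combinations(columns, k)]

def pruned_by(unique, nodes):
    """Nodes that contain the unique combination (the combination and its supersets)."""
    return {n for n in nodes if unique <= n}

nodes = lattice("ABCDE")
by_ab = pruned_by(frozenset("AB"), nodes)
by_bc = pruned_by(frozenset("BC"), nodes) - by_ab   # only what AB did not cover
already = pruned_by(frozenset("ABC"), nodes)        # overlap of AB and BC

total = len(nodes) + 1  # include the empty set: 2^5 = 32
print(len(by_ab) / total, len(by_bc) / total, len(already) / total)
```

The overlap is exactly the set of supersets of ABC, which is why the second finding contributes only half as much as the first.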
  • 36. Pruning both ways [Figure: lattice of column combinations over A, B, C, D, E, from the single columns up to ABCDE; legend: minimal unique, unique, maximal non-unique, non-unique]
  • 37. TPCH – Uniques and Non-Uniques [Figure: uniques and non-uniques for 8, 9, and 10 columns; legend: unique, non-unique]
  • 38. Unique Column Combination Discovery ■ DUCC □ Basic idea: random walk through lattice □ Pick random superset if current combination is non-unique □ Pick random subset otherwise □ Lazy pruning with previously visited nodes [Figure: classification of discovery algorithms as row-based, column-based, or hybrid: Gordian [SBHR06], Apriori [GW99], HCA [AN11], DUCC [HQA+14], SWAN [AQN14]]
  • 39. [Figure: traversal of the lattice of column combinations over A, B, C, D, E, from the single columns up to ABCDE. Legend: minimum unique column combination candidate; minimum unique column combination; maximum non-unique column combination candidate; maximum non-unique column combination; pruned. Visited nodes: 10 out of 26]
  • 40. Scaling the number of columns ■ NCVoter, 100k rows
  • 41. Scaling the number of rows ■ NCVoter, 15 columns
  • 42. Analysis of DUCC ■ Runtime mainly depends on size of solution set ■ Worst case: solution set in the middle
  • 43. Uniques and non-uniques in NC-voter data ■ A minimal unique: voter_reg_num, zip_code, race_code ■ A maximal non-unique: voter_reg_num, status_cd, voter_status_desc, reason_cd, voter_status_reason_desc, absent_ind, name_prefx_cd, name_sufx_cd, half_code, street_dir, street_type_cd, street_sufx_cd, unit_designator, unit_num, state_cd, mail_addr2, mail_addr3, mail_addr4, mail_state, area_cd, phone_num, full_phone_number, drivers_lic, race_code, race_desc, ethnic_code, ethnic_desc, party_cd, party_desc, sex_code, sex, birth_place, precinct_abbrv, precinct_desc, municipality_abbrv, municipality_desc, ward_abbrv, ward_desc, cong_dist_abbrv, cong_dist_desc, super_court_abbrv, super_court_desc, judic_dist_abbrv, judic_dist_desc, nc_senate_abbrv, nc_senate_desc, nc_house_abbrv, nc_house_desc, county_commiss_abbrv, county_commiss_desc, township_abbrv, township_desc, school_dist_abbrv, school_dist_desc, fire_dist_abbrv, fire_dist_desc, water_dist_abbrv, water_dist_desc, sewer_dist_abbrv, sewer_dist_desc, sanit_dist_abbrv, sanit_dist_desc, rescue_dist_abbrv, rescue_dist_desc, munic_dist_abbrv, munic_dist_desc, dist_1_abbrv, dist_1_desc, dist_2_abbrv, dist_2_desc, confidential_ind, age, vtd_abbrv, vtd_desc
  • 44. Dynamic Data: Challenges ■ Inserts may create new duplicate combinations □ Minimal uniques (mUCs) might become non-unique □ Maximal non-uniques (mNUCs) might lose maximality ■ Deletes remove duplicate value combinations □ Non-uniques (NUCs) might become unique □ mUCs might lose minimality ■ Idea □ Leverage the knowledge of previously discovered mUCs and mNUCs □ Create appropriate indices
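The insert case can be sketched with one hash index per known minimal unique. This is an illustrative sketch only, not the SWAN implementation: the class name `UniqueIndex`, the in-memory sets, and the representation of an mUC as a tuple of column positions are all assumptions.

```python
class UniqueIndex:
    """Keeps one set of projected values per known minimal unique (mUC)."""

    def __init__(self, rows, mucs):
        # mucs: iterable of tuples of column positions, e.g. [(0,), (1, 2)]
        self.index = {muc: {self._project(r, muc) for r in rows}
                      for muc in mucs}

    @staticmethod
    def _project(row, cols):
        return tuple(row[c] for c in cols)

    def insert(self, row):
        """Add a row and return the mUCs that it turns non-unique."""
        violated = []
        for muc, seen in self.index.items():
            key = self._project(row, muc)
            if key in seen:          # the new row duplicates an old projection
                violated.append(muc)
            seen.add(key)
        return violated


idx = UniqueIndex([(1, "a", "x"), (2, "a", "y")], mucs=[(0,), (2,)])
first = idx.insert((3, "b", "x"))    # duplicates value "x" in column 2
second = idx.insert((3, "c", "z"))   # duplicates value 3 in column 0
```

Only the mUCs returned by `insert` need further work (searching their supersets for new minimal uniques); all other previously discovered mUCs remain valid without rescanning the data.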
  • 45. SWAN Architecture [AQN14] [Figure: SWAN components: database (input dataset), repository (MUCS and MNUCS), inserts handler, uniqueness checker, deletes handler, duplicate checker; indexes: MUCS-index, data-index, duplicate-index; flows: inserts, deletes, update]
  • 46. Scaling the Number of Columns ■ 100k rows and 10k inserts [Figure: execution time (s, logarithmic axis) vs. number of columns (10 to 60) for Ducc, Gordian-Inc, and Swan]
  • 47. Stressing the Number of Inserts ■ TPCH with 16 columns and 5 million rows ■ Swan/Ducc combination is able to process larger datasets than Ducc on a static dataset [Figure: execution time (s) vs. insert size relative to the initial dataset size (10% to 100%) for Ducc and Swan]
  • 48. Next steps ■ Finding primary keys □ Uniqueness is a necessary criterion □ No null values □ Include other features ◊ Name includes “id”, number of columns ■ Partial uniques □ 99.9% of the data unique □ Useful to detect data errors □ Gordian, HCA, and DUCC can be easily modified ■ Incremental discovery
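A partial unique can be tested with a simple ratio. The slides do not define the measure precisely, so this sketch assumes one possible reading: the number of distinct projections divided by the number of rows. The function names and the default threshold of 0.999 (taken from the "99.9%" bullet) are illustrative.

```python
def uniqueness_ratio(rows, cols):
    """Distinct projections on `cols` divided by the number of rows."""
    if not rows:
        return 1.0
    distinct = {tuple(row[c] for c in cols) for row in rows}
    return len(distinct) / len(rows)


def is_partial_unique(rows, cols, threshold=0.999):
    """True if the column combination is unique for at least `threshold`
    of the data under the ratio assumed above."""
    return uniqueness_ratio(rows, cols) >= threshold


# 1000 rows with a single duplicated value in column 0.
sample = [(i,) for i in range(999)] + [(0,)]
ratio = uniqueness_ratio(sample, (0,))
```

A column that fails the strict uniqueness test but passes this one is a candidate for a key with a few erroneous entries, which is why the slide lists partial uniques as useful for detecting data errors.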
  • 49. Inclusion Dependencies: Definition ■ INDs involve more than one relation. ■ Let D be a relational schema and let I be an instance of D. ■ R[A1, …, An] denotes the projection of I on attributes A1, …, An of relation R: R[A1, …, An] = π_{A1, …, An}(R) ■ An IND has the form R[A1, …, An] ⊆ S[B1, …, Bn], where R, S are (possibly identical) relations of D. □ Projections on R and S must have the same number of attributes. ■ An instance I of D satisfies the IND if I(R)[A1, …, An] ⊆ I(S)[B1, …, Bn] ■ Values of R: “dependent values” ■ Values of S: “referenced values”
  • 50. IND types ■ Unary INDs □ INDs on single attributes: R[A] ⊆ S[B] ■ n-ary INDs □ INDs on multiple attributes: R[X] ⊆ S[Y] ■ Partial INDs □ IND R[A] ⊆ S[B] is satisfied for x% of all tuples in R □ IND R[A] ⊆ S[B] is satisfied for all but x tuples in R ■ Approximate INDs □ IND R[A] ⊆ S[B] is satisfied with probability p. □ Based on sampling or other heuristics
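The exact and partial unary cases translate directly into set operations. The following is a minimal sketch over in-memory value lists; the function names are illustrative and not from the slides.

```python
def satisfies_ind(dep_values, ref_values):
    """Exact unary IND: every dependent value occurs among the referenced values."""
    return set(dep_values) <= set(ref_values)


def ind_tuple_coverage(dep_values, ref_values):
    """Fraction of dependent tuples whose value occurs in the referenced column
    (the "x% of all tuples in R" reading of a partial IND)."""
    dep = list(dep_values)
    if not dep:
        return 1.0
    ref = set(ref_values)
    return sum(v in ref for v in dep) / len(dep)


def violating_tuples(dep_values, ref_values):
    """Number of dependent tuples without a match
    (the "all but x tuples" reading of a partial IND)."""
    ref = set(ref_values)
    return sum(v not in ref for v in dep_values)


orders_customer = [1, 2, 2, 3, 9]   # hypothetical dependent column
customer_id = [1, 2, 3, 4]          # hypothetical referenced column
```

For the hypothetical columns above, the exact IND fails because of the value 9, while the partial IND holds for 80% of the tuples, or equivalently for all but one tuple.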
  • 51. Motivation for IND discovery ■ General insight into data ■ Detect unknown foreign keys ■ Example □ PDB: Protein Data Bank □ OpenMMS provides relational schema ◊ Parses protein and nucleic acid macromolecular structure data from the standard mmCIF format. □ 175 tables with primary key constraints □ 2705 attributes □ But: Not a single foreign key constraint!
  • 52. Motivation for IND discovery ■ Ensembl – genome database □ shipped as MySQL dump files □ more than 200 tables □ Not a single foreign key constraint! ■ Why are FKs missing? □ Lack of support for checking foreign key constraints in the host system ◊ Example: Oracle did not support FKs up to v6 □ Fear that checking such constraints would impede database performance □ Lack of database knowledge within the development team
  • 53. SPIDER: Single Pass Inclusion DEpendency Recognition [BLNT07] ■ Main ideas □ Test all IND-candidate pairs in parallel. □ Read attribute values only once. □ Stop test of an IND-candidate after first counter-example. □ Reduce number of value comparisons by specialized data structure. □ No need to build inverted index. ■ Two steps: □ Sort and deduplicate each attribute's values and write them to disk ◊ For each attribute: SELECT DISTINCT A FROM R ORDER BY A □ Test all IND candidate pairs in parallel
  • 54. SPIDER by example ■ In each step: Intersect “attributes to process” with each refs list of previous step ■ Sorted attribute values: A = (s, t, x, y), B = (t, y), C = (s, t, y, z) ■ Trace (attributes to process | refs of dep A | refs of dep B | refs of dep C) □ Init: (none) | B,C | A,C | A,B □ Step 1: A,C | C | A,C | A □ Step 2: A,B,C | C | A,C | A □ Step 3: A | ∅ | A,C | A □ Step 4: A,B,C | ∅ | A,C | A □ Step 5: C | ∅ | A,C | ∅
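The merge step of the example can be sketched in memory as follows. This is a simplification under stated assumptions, not the SPIDER implementation from [BLNT07]: sorted value lists live in Python lists instead of files on disk, a heap stands in for the specialized data structure, and the early stop for attributes without remaining candidates is omitted. The function name `spider` and its interface are illustrative.

```python
import heapq


def spider(columns):
    """columns: dict mapping attribute name -> iterable of values.
    Returns a dict mapping each dependent attribute to the set of
    attributes that contain all of its values."""
    # Step 1: sort and deduplicate every attribute's values.
    sorted_cols = {name: sorted(set(vals)) for name, vals in columns.items()}
    # Initially, every other attribute is a candidate reference.
    refs = {name: set(columns) - {name} for name in columns}

    # Step 2: merge all sorted lists, reading each value only once.
    heap = [(vals[0], name, 0) for name, vals in sorted_cols.items() if vals]
    heapq.heapify(heap)
    while heap:
        value = heap[0][0]
        group = set()                 # attributes that contain `value`
        while heap and heap[0][0] == value:
            _, name, pos = heapq.heappop(heap)
            group.add(name)
            if pos + 1 < len(sorted_cols[name]):
                heapq.heappush(heap, (sorted_cols[name][pos + 1], name, pos + 1))
        for dep in group:             # intersect refs with "attributes to process"
            refs[dep] &= group
    return refs


result = spider({"A": "stxy", "B": "ty", "C": "styz"})
```

Run on the slide's example values, the sketch ends with empty reference sets for A and C and with {A, C} for B, which corresponds to the last row of the trace.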
  • 55. Problem: Automatic Determination of Foreign Keys ■ Given □ Relational schema □ Database instance of that schema □ Complete set of (observed) inclusion dependencies ◊ Attributes A and B with R[A] ⊆ S[B] (in short A ⊆ B) ■ Find □ All foreign key constraints: attribute pairs A and B where A is a foreign key referencing B ■ Difficulty □ Foreign keys are not intrinsic to data, but defined by humans □ Discover semantics ■ Machine learning approach based on syntactic features [RAB+09]
  • 56. Features ■ DependentAndReferenced □ Counts how often the dependent attribute A appears as referenced attribute in the set of all INDs. □ Usually, a foreign key is not also a primary key that is referenced as foreign key by other tables. ■ MultiDependent □ Counts how often A appears as dependent attribute in the set of all INDs. □ If s(A) is contained in the set of values of many other attributes, the likelihood for each of these INDs being a FK is decreased. ■ MultiReferenced □ Counts how often B appears as referenced attribute in the set of all INDs. □ Often, primary keys are referenced by more than one foreign key. [Figure: three example diagrams with attributes A, B, C, D, one per feature]
  • 57. Features ■ DistinctDependentValues □ The cardinality of s(A). □ Usually, attributes that are foreign keys contain at least some different values. □ Example: A = (a, a, a, a, a), B = (a, b, c, d, e) ■ ValueLengthDiff □ Difference between the average value length (as string) in s(A) and s(B). □ Usually, average length of the values is similar whenever foreign keys reference a non-biased sample of the primary keys. □ Example: A = (abab, abab, abab, c, d), B = (abab, b, c, d, e)
  • 58. Features ■ Coverage □ The ratio of values in s(B) that are covered by s(A) compared to all values in s(B). □ Usually, foreign keys cover a considerable number of primary key values. ◊ 60% of FK-attribute values cover all ref-values ◊ Each covers at least 10% ■ OutOfRange □ Percentage of values in s(B) that are not within [ min(s(A)), max(s(A)) ]. □ Usually, the dependent values should be evenly distributed over the referenced values. □ Mostly, less than 5% of values outside of range ■ TableSizeRatio □ Ratio of number of tuples in A and number of tuples in B. □ Usually in life sciences databases, table sizes do not differ wildly ■ Example: A = (b, c, b, c), B = (a, b, c, d, e, f, g)
  • 59. Features ■ ColumnName □ Similarity between name(A) and name(B), also considering the name of the table of which B is an attribute. ■ TypicalNameSuffix □ Checks whether name(A) ends with a substring that indicates a foreign key. □ “id”, “key”, and “nr” ■ Examples □ FILMTEXTE.FILMTEXTTYPNR ⊆ FILMTEXTTYPEN.FILMTEXTTYPNR □ CUSTOMER.C_NATIONKEY ⊆ NATION.N_NATIONKEY □ SG_SEQFEATURE.ENT_OID ⊆ SG_COMMENT.ENT_OID □ COURSE.STUDENT ⊆ STUDENT.ID □ SG_BIOENTRY.TAX_OID ⊆ SG_TAXON.OID
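Several of the value-based and name-based features can be computed with a few lines each. The sketch below follows the slide definitions for s(A) and s(B) as sets of distinct values; the function names are illustrative, the absolute value in `value_length_diff` is an assumption (the slides only say "difference"), and this is not the feature code of [RAB+09].

```python
def distinct_dependent_values(dep):
    """DistinctDependentValues: cardinality of s(A)."""
    return len(set(dep))


def coverage(dep, ref):
    """Coverage: share of values in s(B) that also occur in s(A)."""
    s_a, s_b = set(dep), set(ref)
    return len(s_a & s_b) / len(s_b)


def out_of_range(dep, ref):
    """OutOfRange: share of values in s(B) outside [min(s(A)), max(s(A))]."""
    s_a, s_b = set(dep), set(ref)
    lo, hi = min(s_a), max(s_a)
    return sum(1 for v in s_b if not lo <= v <= hi) / len(s_b)


def value_length_diff(dep, ref):
    """ValueLengthDiff: absolute difference of average string lengths
    over s(A) and s(B) (absolute value is an assumption)."""
    def avg_len(values):
        return sum(len(str(v)) for v in values) / len(values)
    return abs(avg_len(set(dep)) - avg_len(set(ref)))


def typical_name_suffix(name, suffixes=("id", "key", "nr")):
    """TypicalNameSuffix: does the attribute name end with a key-like suffix?"""
    return name.lower().endswith(suffixes)


dep_a = ["b", "c", "b", "c"]
ref_b = ["a", "b", "c", "d", "e", "f", "g"]
```

For the slide's example columns, coverage is 2/7 and 5/7 of the referenced values fall outside the dependent range, which would count against this IND being a foreign key.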
  • 60. Agenda ■ Basic statistics ■ Functional dependencies ■ Keys and foreign keys ■ Data profiling tools ■ Advanced profiling
  • 61. Tools have very long feature lists ■ Num rows ■ Min value length ■ Median value length ■ Max value length ■ Avg value length ■ Precision of numeric values ■ Scale of numeric values ■ Quartiles ■ Basic data types ■ Num distinct values ("cardinality") ■ Percentage null values ■ Data class and data type ■ Uniqueness and constancy ■ Single-column frequency histogram ■ Multi-column frequency histogram ■ Pattern discovery (Aa9) ■ Soundex frequencies ■ Benford Law Frequency ■ Single column primary key discovery ■ Multi-column primary key discovery ■ Single column IND discovery ■ Inclusion percentage ■ Single-column FK discovery ■ Multi-column IND discovery ■ Multi-column FK discovery ■ Value overlap (cross domain analysis) ■ Single-column FD discovery ■ Multi-column FD discovery ■ Text profiling
  • 62. Oracle Data Profiling and Quality Control Center
  • 63. Screenshots from IBM Information Analyzer
  • 64. Typical Shortcomings of Tools (and methods from research) ■ Usability □ Complex to configure □ Results complex to view and interpret ■ Scalability □ Main-memory based □ SQL based ■ Efficiency □ Coffee, Lunch, Overnight ■ Functionality □ Restricted to simplest tasks □ Restricted to individual columns or small column sets ◊ “Realistic” key candidates vs. further use-cases □ “Checking” vs. “discovery” ■ Interpretation of profiling results ■ Slide annotation: “That's the big one”
  • 65. Metanome – Profiling your Datanome ■ Algorithm execution ■ Result management ■ Algorithm configuration ■ Result presentation [Figure: Metanome architecture with configuration and measurements; algorithm jars: SPIDER, DUCC, SWAN; inputs: txt, xml, csv, DB2, MySQL; output: results]
  • 66. Agenda ■ Basic statistics ■ Functional dependencies ■ Keys and foreign keys ■ Data profiling tools ■ Advanced profiling
  • 67. Online Profiling ■ Profiling is a long procedure □ Boring for developers □ Expensive for machines (I/O and CPU) ■ Challenge: Display intermediate results □ … of improving/converging accuracy □ Allows early abort of profiling run ■ Gear algorithms toward that goal □ Allow intermediate output □ Enable early output: “progressive” profiling
  • 68. Incremental Profiling ■ Data is dynamic □ Insert (batch or tuple-based) □ Updates □ Deletes ■ Problem: Keep profiling results up-to-date… □ … without re-profiling the entire data set. □ Easy examples: SUM, MIN, MAX, COUNT, AVG □ Difficult examples: MEDIAN, uniqueness (see earlier slides), dependencies
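The "easy examples" can be maintained in constant time per inserted value, as the following sketch shows. It covers inserts only; the class name `RunningStats` is illustrative. Deletes are deliberately left out, since MIN and MAX cannot be repaired from these four counters alone once the current extreme value is removed.

```python
class RunningStats:
    """Maintains COUNT, SUM, MIN, MAX, and AVG under inserts without rescanning."""

    def __init__(self):
        self.count = 0
        self.total = 0
        self.minimum = None
        self.maximum = None

    def insert(self, value):
        self.count += 1
        self.total += value
        if self.minimum is None or value < self.minimum:
            self.minimum = value
        if self.maximum is None or value > self.maximum:
            self.maximum = value

    @property
    def average(self):
        # AVG is derived from SUM and COUNT, so it needs no state of its own.
        return self.total / self.count if self.count else None


stats = RunningStats()
for v in [4, 8, 15, 16, 23, 42]:
    stats.insert(v)
```

MEDIAN does not decompose this way: a new value can shift the median to an element that no fixed-size summary has kept, which is why the slide lists it among the difficult examples.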
  • 69. Piggyback Profiling ■ Goal: Determine metadata for query results ■ Challenge: With as little query processing overhead as possible □ Baseline: Run second SQL query □ Piggybacking: profile along query plan (using base statistics)
  • 70. Profiling for Integration ■ Profile multiple sources simultaneously ■ Schema matching/mapping □ What constitutes the “difficulty” of matching/mapping? ■ Duplicate detection □ Estimate data overlap □ Estimate fusion effort ■ Create measures to estimate integration (and cleansing) effort □ Schema and data overlap □ Severity of heterogeneity
  • 71. Profiling new Types of Data ■ Traditional data profiling: Single table or multiple tables ■ More and more data in other models □ XML / nested relational / JSON □ RDF triples □ Textual data: Blogs, Tweets, News □ Multimedia data ■ Different models offer new dimensions to profile □ XML: Nestedness, measures at different nesting levels □ RDF: Graph structure, in- and outdegrees □ Multimedia: Color, video-length, volume, etc. □ Text: Sentiment, sentence structure, complexity, and other linguistic measures
  • 72. Average Sentence Length ■ Source: “Literature Fingerprinting: A New Method for Visual Literary Analysis” by Daniel A. Keim and Daniela Oelke
  • 73. Hapax Legomena ■ Source: “Literature Fingerprinting: A New Method for Visual Literary Analysis” by Daniel A. Keim and Daniela Oelke
  • 74. News Statistics ■ Source: Master thesis Matthias Kohnen
  • 75. Summary ■ Basic statistics ■ Functional dependencies ■ Keys and foreign keys ■ Data profiling tools ■ Advanced profiling
  • 76. Summary: Data Profiling ■ Single source □ Single column ◊ Cardinalities ◊ Uniqueness and keys ◊ Patterns and data types ◊ Distributions □ Multiple columns ◊ Uniqueness and keys ◊ Inclusion and foreign key dep. ◊ Functional dependencies ◊ Conditional and approximate dep. ■ Multiple sources □ Topical overlap ◊ Topic discovery ◊ Topical clustering □ Schematic overlap ◊ Schema matching ◊ Cross-schema dependencies □ Data overlap ◊ Duplicate detection ◊ Record linkage