The presentation describes a tool for validating and previewing instances of Schema.org JobPosting expressed in structured data markup embedded in web pages. The validator and preview were developed to help Schema.org users produce better-quality data, and thereby to improve the usability of the part of Schema.org that covers job postings. The paper discusses the implementation of the tool and the design of its validation rules, which are based on SPARQL 1.1. Results of an experimental validation of a job posting corpus harvested from the Web are presented. Among other findings, the results indicate that publishers of Schema.org JobPosting data often misunderstand the precedence rules employed by markup parsers and ignore the case-sensitivity of vocabulary names.
EC-WEB: Validator and Preview for the JobPosting Data Model of Schema.org
1. Validator and preview for the JobPosting data model of Schema.org
Jindřich Mynarz
Department of Information and Knowledge Engineering,
University of Economics, Prague
EC-WEB 2014, September 2, 2014
2. Motivation
● Improve the usability of vocabularies
● Provide feedback on the use of vocabularies
● Make vocabulary specifications executable
● Help ensure a basic level of data quality
● Capture application-specific data requirements in validation rules
3. DámePráci.eu project
“Matching jobs with unemployed through semantic data”
Data model using Schema.org with an extension for the job market.
Application for searching through job postings aggregated from distinct sources: www.damepraci.cz (in Czech)
4. Validation method
● Rule-based, schema-aware validation
● Operates on the RDF data model
● Focuses on semantic errors, beyond well-formed markup
● Partial open world assumption
● Implemented as SPARQL 1.1 CONSTRUCT queries
● Error reporting via the SPIN RDF vocabulary
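To illustrate the shape of such rules, here is a minimal sketch in plain Python rather than SPARQL. It is not the tool's code: the `Triple` and `Violation` types and the `validate` function are hypothetical, and the violation fields only loosely mirror the SPIN properties `spin:violationRoot` and `spin:violationPath`.

```python
from typing import Callable, List, NamedTuple, Optional

RDF_TYPE = "rdf:type"


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


class Violation(NamedTuple):
    root: Optional[str]  # loosely mirrors spin:violationRoot
    path: Optional[str]  # loosely mirrors spin:violationPath
    label: str           # human-readable message


# A rule maps the data to zero or more violations, much like a CONSTRUCT query
# that returns violation triples only when its pattern matches.
Rule = Callable[[List[Triple]], List[Violation]]


def missing_job_posting(triples: List[Triple]) -> List[Violation]:
    """Report a violation when no resource is typed as schema:JobPosting."""
    found = any(
        t.predicate == RDF_TYPE and t.obj == "schema:JobPosting" for t in triples
    )
    if found:
        return []
    return [Violation(None, RDF_TYPE, "No schema:JobPosting instance found")]


def validate(triples: List[Triple], rules: List[Rule]) -> List[Violation]:
    """Run every rule over the same data and collect all violations."""
    return [violation for rule in rules for violation in rule(triples)]
```

Keeping each rule independent means a page can be reported for several problems at once, and each rule can be tested in isolation.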
5. Background knowledge
schema.org
+ extension for job market (RDFS)
+ external enumerations:
● ISO 4217 currency codes (SKOS)
● ISO 639-1 language codes (SKOS)
Loaded in separate named graphs that the validation rules can reference.
6. Validation rules
● Data completeness
● Distinction between datatype and object properties
● Conflicting data
● Datatype violations
● Invalid codes
7. Data completeness
● At least 1 instance of schema:JobPosting
● Other type information (class membership, datatypes) left optional
● Empty literals
● Conditionally required data (e.g., compensation + currency)
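The last two checks can be sketched in Python as follows. This is an illustration under assumptions, not the tool's SPARQL rules: triples are plain `(subject, property, object)` tuples, `Literal` is a hypothetical wrapper for literal values, and the choice of `schema:baseSalary` and `schema:salaryCurrency` as the compensation and currency properties is assumed.

```python
from typing import NamedTuple


class Literal(NamedTuple):
    """Hypothetical wrapper marking an object as a literal (not a URI or blank node)."""
    value: str


def empty_literals(triples):
    """Return (subject, property) pairs whose literal object is empty or whitespace."""
    return [
        (s, p)
        for s, p, o in triples
        if isinstance(o, Literal) and not o.value.strip()
    ]


def missing_currency(
    triples,
    amount="schema:baseSalary",        # assumed compensation property
    currency="schema:salaryCurrency",  # assumed currency property
):
    """Return subjects that state a compensation but no currency."""
    with_amount = {s for s, p, _ in triples if p == amount}
    with_currency = {s for s, p, _ in triples if p == currency}
    return sorted(with_amount - with_currency)
```

Currency is only demanded when a compensation is present, which is what makes the requirement conditional.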
8. Distinction between datatype and object properties
● Object properties with literal objects instead of URIs or blank nodes (and vice versa for datatype properties)
● Simpler syntax of datatype properties
○ Avoiding nested objects or difficulties with finding an object's URI
● May be a symptom of incorrectly nested HTML elements
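A minimal Python sketch of this check, assuming the property kinds are known in advance. The two sets below are a tiny hypothetical excerpt, and `Literal` is a hypothetical wrapper for literal objects; the tool itself derives this from its background knowledge rather than from hard-coded sets.

```python
from typing import NamedTuple


class Literal(NamedTuple):
    """Hypothetical wrapper marking an object as a literal."""
    value: str


# Hypothetical excerpt of property kinds, for illustration only.
DATATYPE_PROPERTIES = {"schema:title"}
OBJECT_PROPERTIES = {"schema:jobLocation"}


def property_kind_errors(triples):
    """Flag literals on object properties and resources on datatype properties."""
    errors = []
    for s, p, o in triples:
        is_literal = isinstance(o, Literal)
        if p in OBJECT_PROPERTIES and is_literal:
            errors.append((s, p, "object property used as datatype property"))
        elif p in DATATYPE_PROPERTIES and not is_literal:
            errors.append((s, p, "datatype property used as object property"))
    return errors
```

Properties outside both sets are ignored, in line with leaving unknown data unreported rather than treating it as an error.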
9. Conflicting data
● Mutually exclusive properties
○ schema:jobLocation + schema:isRemoteWork true
● Cardinality violation for functional properties with > 1 object
○ schema:startDate, schema:currency, schema:availableVacancies
● Incompatible class membership inferences
○ schema:domainIncludes, schema:rangeIncludes
○ Class membership is incompatible when a resource is inferred to be an instance of 2+ distinct classes that are not related by rdfs:subClassOf.
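The first two kinds of conflict can be sketched in Python as below. This is a simplified illustration, not the tool's rules: triples are `(subject, property, object)` tuples of strings, and the boolean is assumed to arrive as the string `"true"`.

```python
from collections import defaultdict

# Functional properties named on the slide: at most one object per subject.
FUNCTIONAL = {"schema:startDate", "schema:currency", "schema:availableVacancies"}


def cardinality_violations(triples):
    """Return (subject, property) pairs where a functional property has > 1 distinct object."""
    objects = defaultdict(set)
    for s, p, o in triples:
        if p in FUNCTIONAL:
            objects[(s, p)].add(o)
    return sorted(key for key, values in objects.items() if len(values) > 1)


def remote_with_location(triples):
    """Return subjects that combine schema:jobLocation with schema:isRemoteWork true."""
    located = {s for s, p, _ in triples if p == "schema:jobLocation"}
    remote = {
        s for s, p, o in triples if p == "schema:isRemoteWork" and o == "true"
    }
    return sorted(located & remote)
```

Collecting objects into a set means that a value repeated verbatim is not counted as a cardinality violation.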
10. Datatype violations
● Regular expressions, casting errors of XPath datatype constructor functions
● Date and time formats (xsd:date, xsd:duration)
○ Not conforming to regular expressions
○ Non-existent dates
○ Dates from the future
● Interval limits
○ Positive integers for schema:availableVacancies
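A Python sketch of the same layered checks, under simplifying assumptions: the pattern accepts only `YYYY-MM-DD` (real `xsd:date` also allows time zones), and Python's `date` constructor stands in for the XPath constructor function whose casting error signals a non-existent date. The function names are hypothetical.

```python
import re
from datetime import date

# Simplified lexical form of xsd:date: no time zone, exactly four-digit year.
DATE_PATTERN = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def check_date(value, today=None):
    """Return an error message for an invalid date string, or None if it passes."""
    if not DATE_PATTERN.fullmatch(value):
        return "does not match YYYY-MM-DD"
    year, month, day = map(int, value.split("-"))
    try:
        parsed = date(year, month, day)  # fails for dates such as 2014-02-30
    except ValueError:
        return "non-existent date"
    if parsed > (today or date.today()):
        return "date from the future"
    return None


def check_vacancies(value):
    """Return an error message unless the value is a positive integer."""
    if not re.fullmatch(r"[0-9]+", value) or int(value) < 1:
        return "not a positive integer"
    return None
```

For example, `check_date("2014-02-30")` passes the regular expression but is rejected as a non-existent date, which is why the pattern alone is not enough.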
11. Invalid codes
● Based on lookup in code lists enumerating every valid code
● Includes language codes (ISO 639-1) and currency codes (ISO 4217)
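The lookup itself is simple, as this Python sketch suggests. The two sets are tiny excerpts for illustration, whereas the real code lists enumerate every valid code; exact, case-sensitive matching is an assumption here.

```python
# Tiny excerpts for illustration only; the full lists enumerate every valid code.
CURRENCY_CODES = frozenset({"CZK", "EUR", "GBP", "USD"})  # ISO 4217
LANGUAGE_CODES = frozenset({"cs", "de", "en", "sk"})      # ISO 639-1


def invalid_codes(values, code_list):
    """Return the values that do not appear in the given code list (exact match)."""
    return [value for value in values if value not in code_list]
```

With these excerpts, `invalid_codes(["EUR", "Kč", "usd"], CURRENCY_CODES)` reports `"Kč"` and `"usd"`, since neither matches a listed code exactly.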
12. Implementation
Ruby on Rails web application backed by a Jena Fuseki SPARQL 1.1 endpoint.
● Validates both RDFa and HTML5 Microdata
● Czech and English localization
● Validation results in HTML or JSON-LD
● RSpec tests for each validation rule
● Open source: https://github.com/OPLZZ/job-posting-validator
15. Experimental validation of a JobPosting corpus
● 1332 seed URLs from 752 distinct pay-level domains (PLDs) obtained via Google Custom Search Engine restricted to schema:JobPosting
● Sample of 42 872 web pages obtained by crawling the seed URLs
● Each page validated; validation results in JSON-LD loaded into Elasticsearch for exploration
17. Datatype property used as object property
Most common path to error: schema:title
Possible cause: incorrect understanding of markup precedence rules:
<a property="title" href="#title">SEO guru</a>
# Extracted by an RDFa parser:
[] schema:title <#title> .
# Presumably intended:
[] schema:title "SEO guru" .
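The precedence at play can be sketched in Python. This is a simplified reading of how RDFa 1.1 picks the object for an element carrying `property`. It ignores `typeof`, `about`, XML literals and other details; the function name is hypothetical.

```python
def property_object(attrs, text):
    """Simplified RDFa 1.1 choice of the object for an element with @property.

    `attrs` is a dict of the element's attributes and `text` its text content.
    Returns ("literal", value) or ("iri", value).
    """
    # @datatype or @content force a literal; @content overrides the text content.
    if "datatype" in attrs or "content" in attrs:
        return ("literal", attrs.get("content", text))
    # Without @rel/@rev, a resource attribute takes precedence over the text.
    if "rel" not in attrs and "rev" not in attrs:
        for name in ("resource", "href", "src"):
            if name in attrs:
                return ("iri", attrs[name])
    # Otherwise the text content becomes a plain literal.
    return ("literal", text)
```

Applied to the markup above, `property_object({"property": "title", "href": "#title"}, "SEO guru")` yields the IRI `#title`, not the literal. Under this reading, supplying the title in a `content` attribute, or moving `property` to an element without `href`, would produce the literal.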
18. Empty literal value
Most common path to error: schema:addressRegion
Possible cause: incomplete data used to generate HTML from fixed templates
Less common in manually marked-up HTML
19. Incorrect character case in schema:Postaladdress
The correct name is schema:PostalAddress; both RDFa and HTML5 Microdata are case-sensitive.
Found on 116 unique pay-level domains (PLDs).
“The default mode of authoring [Schema.org
markup] is copy and edit.” — R.V. Guha
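One way a validator could give helpful feedback on such errors is a case-insensitive lookup that suggests the correctly cased term. The Python sketch below is a hypothetical illustration, not a description of the tool; `KNOWN_TERMS` is a small excerpt of Schema.org names.

```python
# Small excerpt of Schema.org names, for illustration only.
KNOWN_TERMS = {"JobPosting", "PostalAddress", "Place", "addressRegion", "jobLocation"}
_BY_LOWERCASE = {term.lower(): term for term in KNOWN_TERMS}


def suggest_casing(term):
    """Return the correctly cased term if `term` differs only by case, else None."""
    if term in KNOWN_TERMS:
        return None  # already correct, nothing to suggest
    return _BY_LOWERCASE.get(term.lower())
```

Here `suggest_casing("Postaladdress")` returns `"PostalAddress"`, while unknown names return `None` and would be left to other rules.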
20. Object property used as datatype property
Most common path to error: schema:jobLocation
Common cause: simpler markup without intermediate resources
Simpler markup, with a literal value:
<p property="jobLocation">
  Munich
</p>
Markup with intermediate resources:
<p rel="jobLocation">
  <p rel="address">
    <p property="addressLocality">
      Munich
    </p>
  </p>
</p>
21. Unsuccessful experiments
Web Data Commons
● Errors smoothed by extraction to RDF
● Not suitable as a source of seed URLs: job postings disappear quickly
Veterans Job Bank
● Data from few PLDs, lacks variety
● Severe restrictions on automated downloads through its API
22. Questions?
Acknowledgements:
The presented research was partially supported by the project of Operational Programme Human Resources and Employment no. CZ.1.04/5.1.01/77.00440.
Image credits:
Check List designed by Arthur Shlain from thenounproject.com
Puzzle designed by John from thenounproject.com