Structured Strategy:
How to Supercharge Your Content
Analysis with XML and XPath
Josh Anderson
Who I am
• Information Architect at Precision
Content
• Certified Professional Technical
Communicator (CPTC) Foundation
• Co-Organized World Information
Architecture Day 2023 Toronto
• Master of Information from the
University of Toronto
We are experts in structured content.
We’re a full-service, end-to-end technical
communications consultancy, technology
innovator, and systems integrator offering
professional services, training, and technology.
Areas of expertise
• structured authoring methods
• content lifecycle management
• DITA/XML design and
implementation
• information architecture
• content strategy, and
• structured content delivery.
Who is this presentation for?
People who…
• strategize, plan for, and otherwise work with text
content
• understand the benefits of structured content
• are familiar with XML but perhaps not with XPath
• want to learn how to take their content analysis
skills to the next level
Structured
content
Content is easier to use and
understand when organized in a
predictable way.
Content is written to fit a model:
• Title
• Presenter
• Description
• Speaker Bio
Structure makes content FAIR
Findable
Accessible
Interoperable
Reusable
“The FAIR Guiding Principles for scientific data management and stewardship” was published in Scientific Data in 2016
An example of structure: HTML
Content is contained
inside opening and
closing tags.
Sometimes elements
contain other
elements.
All elements are
contained within a
single root.
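A minimal sketch of those three properties, using Python's standard xml.etree.ElementTree module. The HTML fragment is invented for illustration and is assumed to be well-formed, which real-world HTML often is not.

```python
import xml.etree.ElementTree as ET

# A hypothetical, well-formed HTML fragment (not taken from the slides).
html = (
    "<html><body>"
    "<h1>Hours of work</h1>"
    "<p>Lunch is <b>one hour</b>.</p>"
    "</body></html>"
)

root = ET.fromstring(html)

# All elements are contained within a single root element.
root_tag = root.tag

# Elements can contain other elements: here <p> contains <b>.
p = root.find("./body/p")
p_children = [child.tag for child in p]

# Content sits between an opening tag and a closing tag.
heading_text = root.find("./body/h1").text

print(root_tag, p_children, heading_text)
```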
There’s just one problem…
These elements don’t tell me anything about what the content is about!
XML
• XML is a way to store information
• XML stands for “eXtensible Markup Language”
• “Extensible” means that you define your own structure
HTML: Pre-defined tags
XML: Define your own tags
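As a rough illustration of defining your own tags, the sketch below builds a small XML document in Python. The element names (session, title, presenter, description, speaker-bio) are assumptions chosen to mirror the Title, Presenter, Description, and Speaker Bio model mentioned earlier; they are not from any standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical element names that mirror the session model
# (Title, Presenter, Description, Speaker Bio).
session = ET.Element("session")
ET.SubElement(session, "title").text = "Structured Strategy"
ET.SubElement(session, "presenter").text = "Josh Anderson"
ET.SubElement(session, "description").text = "Content analysis with XML and XPath."
ET.SubElement(session, "speaker-bio").text = "Information Architect at Precision Content."

# Serialize the tree to a string of XML markup.
xml_text = ET.tostring(session, encoding="unicode")
child_tags = [child.tag for child in session]

print(xml_text)
```

Unlike the HTML tags, each tag name here says what its content is about.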
How do you define your own structure?
You define your structure, or your content model,
in a Document Type Definition (DTD).
Defining your structure
What you can define with a DTD
• Elements
• Attributes
• If an element can contain text, another element, or both
• Order of elements
• If something is required or optional
What you can’t define with a DTD
• Length of content
• Occurrence constraints
• What text can go inside elements
Structure helps you analyze your content
• Structure is a prerequisite to performing content
analysis at scale
• You want a way to tell if your content is valid or invalid
• Semantic structures can be understood by both people
and computers
• Using a widely adopted standard like XML lets us take
advantage of specialized tools
• Oftentimes you can adopt a standard structure rather
than inventing your own
XML-based standards
Some extensions of XML have become standards in their own right:
• Scalable Vector Graphics (SVG)
• Resource Description Framework (RDF)
• Darwin Information Typing Architecture (DITA)
Finding structure
• What if your content is unstructured?
• Look for patterns in
• attributes
• classes
• common parent/sibling elements, and
• common text strings.
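One way to look for such patterns is to count them. The sketch below assumes a small, well-formed HTML fragment invented for this example. It tallies class values and finds paragraphs that start with a common text string.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical unstructured HTML: the class values and the "Note:" prefix
# hint at a structure that the markup itself does not declare.
html = """<div>
  <p class="note">Save your work first.</p>
  <p class="step">Open the file.</p>
  <p class="step">Select Export.</p>
  <p>Note: exports can be slow.</p>
</div>"""

root = ET.fromstring(html)

# Pattern in classes: how often does each class value occur?
class_counts = Counter(el.get("class") for el in root.iter() if el.get("class"))

# Pattern in text strings: paragraphs that begin with "Note:".
note_like = [p for p in root.iter("p") if (p.text or "").startswith("Note:")]

print(class_counts)
print(len(note_like))
```

Here the unclassed "Note:" paragraph is a candidate for the same structure as the paragraph with class "note".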
Creating structure
• Break your content down into microcontent
• about one primary idea, fact, or concept
• easily scannable
• labelled for clear identification
and meaning, and
• appropriately written and formatted
for use anywhere and anytime it is needed.
Microcontent structure
Source: The DITA Style Guide – Best Practices for Authors. Tony Self. www.ditastyle.com
• You do not need code to have structure
• Structure means
• systematic labelling
• modular, topic-based architecture
• constrained writing environments, and
• separation of content and form.
Focus
Information about hours of work
Requirement for unplanned absences
Information about lunch breaks
Requirement for planned absences
Function
Reference information
Principle information
Unstructured to structured
What is XPath?
• XPath is a language that lets you identify particular parts
of XML documents
• In XPath, we write “location paths”
• Example of an XPath location path: //bookstore/book/@id
• XPath can help you answer queries like…
• “Show me every element called ‘book’.”
• “Show me the parent element of the element called
‘price’.”
• “Show me all the elements that have the attribute
‘language’ set to ‘English’.”
• … and much more
• XPath is used in other XML-related languages like
XQuery and XSLT
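The three sample queries can be tried with Python's standard xml.etree.ElementTree module. This is a sketch under some assumptions: the sample document and its names are invented, ElementTree supports only a subset of XPath, and its paths are evaluated relative to an element, so they start with "." instead of "/". It also returns elements only, so a path ending in an attribute such as @id is not available.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; element and attribute names are assumptions.
xml = """<bookstore>
  <book id="b1"><title language="English">Harry Potter</title><price>29.99</price></book>
  <book id="b2"><title language="French">Le Petit Prince</title><price>12.50</price></book>
</bookstore>"""
root = ET.fromstring(xml)

# "Show me every element called 'book'."
books = root.findall(".//book")

# "Show me the parent element of the element called 'price'."
price_parents = root.findall(".//price/..")

# "Show me all the elements that have the attribute 'language' set to 'English'."
english = root.findall(".//*[@language='English']")

print([b.get("id") for b in books])
print([p.tag for p in price_parents])
print([e.text for e in english])
```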
Image source: https://www.researchgate.net/figure/Example-of-XML-document-and-XML-tree-representation_fig1_315998361
XML is structured like a tree
Seven kinds of XML nodes
1. The root node
2. Element nodes
3. Text nodes
4. Attribute nodes
5. Comment nodes
6. Processing instruction nodes
7. Namespace nodes
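To see most of these node kinds in one place, the sketch below walks a small document with Python's standard xml.dom.minidom module. The sample document is invented. The DOM's document node plays the role of the root node here, and the DOM has no separate node type for namespace nodes, so only six of the seven kinds show up.

```python
from xml.dom import minidom, Node

# Hypothetical document that contains several kinds of nodes.
xml = """<?xml-stylesheet href="style.css"?>
<bookstore>
  <!-- bestseller -->
  <book lang="en">Harry Potter</book>
</bookstore>"""

doc = minidom.parseString(xml)

# Map DOM node type constants to the names used in the list above.
names = {
    Node.DOCUMENT_NODE: "root",
    Node.ELEMENT_NODE: "element",
    Node.TEXT_NODE: "text",
    Node.ATTRIBUTE_NODE: "attribute",
    Node.COMMENT_NODE: "comment",
    Node.PROCESSING_INSTRUCTION_NODE: "processing instruction",
}

def walk(node, found):
    """Record the kind of this node, of its attributes, and of its descendants."""
    found.add(names[node.nodeType])
    attributes = getattr(node, "attributes", None)
    if attributes is not None:
        for i in range(attributes.length):
            found.add(names[attributes.item(i).nodeType])
    for child in node.childNodes:
        walk(child, found)
    return found

kinds = walk(doc, set())
print(sorted(kinds))
```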
Node selectors
Expression   Description
/            Selects the document root node
//           Selects from all descendants of the context node and the context node itself
.            Selects the current node
..           Selects the parent of the current node
@            Selects attribute nodes
*            Selects any element node, regardless of name
How to select nodes in XPath
Select the bookstore element node
• /bookstore
• //bookstore
Select all book element nodes
• /bookstore/book
• //book
Select all price element nodes
• /bookstore/book/price
• //price
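These element selections can be sketched with Python's xml.etree.ElementTree. The sample XML is a hypothetical stand-in for the bookstore document pictured on the slides. ElementTree paths are relative to the element they are called on, so /bookstore/book becomes ./book when starting from the bookstore element, and //book becomes .//book.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the bookstore sample pictured on the slides.
xml = """<bookstore>
  <book><title>Harry Potter</title><price>29.99</price></book>
  <book><title>Learning XML</title><price>39.95</price></book>
</bookstore>"""
root = ET.fromstring(xml)  # root is the <bookstore> element

# /bookstore/book: step from the bookstore element to its book children.
books_by_path = root.findall("./book")

# //book: search every level below the starting point.
books_anywhere = root.findall(".//book")

# /bookstore/book/price and //price.
prices_by_path = root.findall("./book/price")
prices_anywhere = root.findall(".//price")

print(len(books_by_path), len(books_anywhere))
print([p.text for p in prices_anywhere])
```

In this document both forms return the same elements. They would differ if a book or price element also appeared somewhere else in the tree.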
Select all lang attribute nodes
• //@lang
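ElementTree cannot return attribute nodes, so a Python sketch of this selection takes a slightly different route: select every element that has a lang attribute, then read the value. The sample document is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the third title deliberately has no lang attribute.
xml = """<bookstore>
  <book><title lang="eng">Harry Potter</title></book>
  <book><title lang="fra">Le Petit Prince</title></book>
  <book><title>Untitled draft</title></book>
</bookstore>"""
root = ET.fromstring(xml)

# Select the elements that carry a lang attribute, then read each value.
lang_values = [el.get("lang") for el in root.findall(".//*[@lang]")]

print(lang_values)
```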
Select all comment nodes
• //comment()
Select the parent element nodes of the title element nodes
• //title/..
Select the comment nodes that are children of book elements
• /bookstore/book/comment()
• //book/comment()
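ElementTree's XPath subset has no comment() test, so the sketch below approximates the comment selections in plain Python instead. It assumes Python 3.8 or later, where TreeBuilder accepts insert_comments=True. The sample document is invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with one comment under bookstore and one under a book.
xml = """<bookstore>
  <!-- inventory checked in March -->
  <book><!-- bestseller --><title>Harry Potter</title></book>
  <book><title>Learning XML</title></book>
</bookstore>"""

# By default the parser drops comments; insert_comments=True keeps them in the tree.
parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
root = ET.fromstring(xml, parser=parser)

# Rough equivalent of //comment(): every comment below the root element.
all_comments = [el.text.strip() for el in root.iter() if el.tag is ET.Comment]

# Rough equivalent of //book/comment(): comments that are direct children of a book.
book_comments = [
    child.text.strip()
    for book in root.iter("book")
    for child in book
    if child.tag is ET.Comment
]

print(all_comments)
print(book_comments)
```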
Axes
An axis is a direction that we travel along to get to different parts of an XML document.
Every step in an XPath location path has an axis. So far, we have used “abbreviated location paths.”
Unabbreviated, a step names its axis, followed by a double colon and the node test. It looks like this:
//child::bookstore
Image source: https://jrebecchi.github.io/xpath-helper/xpath-axes.html
Selecting with axes in XPath
Select any comment that is a descendant of the book element
• //book/descendant::comment()
Select the parent element of the price element
• //price/..
• //price/parent::element()
(element() is a kind test from XPath 2.0 and later; in XPath 1.0, write parent::* instead.)
Select the sibling elements following the title element
• //title/following-sibling::element()
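Python's ElementTree does not support named axes such as following-sibling, so the sketch below emulates that axis by walking each parent's children. The helper name following_siblings and the sample document are hypothetical. A full XPath engine would evaluate the expression directly.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: title has one preceding and two following siblings.
xml = """<bookstore>
  <book><author>J. K. Rowling</author><title>Harry Potter</title><year>2005</year><price>29.99</price></book>
</bookstore>"""
root = ET.fromstring(xml)

def following_siblings(start, tag):
    """Yield the elements that come after each <tag> element under the same parent.

    Simplification: if a parent holds several <tag> children, later siblings
    are yielded more than once, whereas an XPath node-set has no duplicates.
    """
    for parent in start.iter():
        children = list(parent)
        for i, child in enumerate(children):
            if child.tag == tag:
                yield from children[i + 1:]

after_title = [el.tag for el in following_siblings(root, "title")]
print(after_title)
```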
Predicates
• Predicates are like a filter on your results
• Predicates appear inside [square brackets]
• Predicates are evaluated as Boolean expressions (a bare number such as [2] is shorthand for a position test)
• The full syntax of a step in an XPath location path is axis::node[predicate]
• The node test is required. The predicate is optional.
• If you do not specify an axis, it is assumed to be “child::”
Selecting with predicates in XPath
Select the book with the title “Harry Potter”
• //book[title="Harry Potter"]
Select the titles that are in English
• //title[@lang="eng"]
Select the textbooks
• //book[@category="textbook"]
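The value predicates above fall within the XPath subset that Python's xml.etree.ElementTree supports, so they can be tried almost as written. The sample document is a hypothetical stand-in for the slide's bookstore. Note that the quotation marks inside a predicate must be straight quotes.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; categories and lang codes are assumptions.
xml = """<bookstore>
  <book category="children"><title lang="eng">Harry Potter</title></book>
  <book category="textbook"><title lang="eng">Learning XML</title></book>
  <book category="textbook"><title lang="fra">Apprendre XML</title></book>
</bookstore>"""
root = ET.fromstring(xml)

# //book[title="Harry Potter"]: filter books by the text of a child element.
harry = root.findall(".//book[title='Harry Potter']")

# //title[@lang="eng"]: filter titles by an attribute value.
english_titles = root.findall(".//title[@lang='eng']")

# //book[@category="textbook"]: filter books by an attribute value.
textbooks = root.findall(".//book[@category='textbook']")

print([b.get("category") for b in harry])
print([t.text for t in english_titles])
print(len(textbooks))
```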
Select the second book
• //book[2]
Select the second textbook
• //book[@category="textbook"][2]
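Position predicates count from 1, not 0. The sketch below shows both selections in Python. ElementTree's documentation says a position predicate must directly follow a tag name, so for the second textbook this sketch filters with a path and then picks the second match with an ordinary list index. The sample document is invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one children's book followed by two textbooks.
xml = """<bookstore>
  <book category="children"><title>Harry Potter</title></book>
  <book category="textbook"><title>Learning XML</title></book>
  <book category="textbook"><title>XPath Basics</title></book>
</bookstore>"""
root = ET.fromstring(xml)

# //book[2]: the second book child of the bookstore element.
second_book = root.find("./book[2]")

# //book[@category="textbook"][2]: filter first, then take the second match.
# Python lists count from 0, so index 1 is the second item.
textbooks = root.findall("./book[@category='textbook']")
second_textbook = textbooks[1]

print(second_book.find("title").text)
print(second_textbook.find("title").text)
```

The two results differ because the second predicate counts only the books that passed the first one.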
Real-world content analysis with XPath
• From experience, I know that tables inside tables often have unpredictable issues. I want to check on them.
• //table//table
• I need to change all the section titles called “Introduction” to “Overview.” Did I miss any?
• //section[title="Introduction"]
• The client wants a disclaimer paragraph at the very end of the topic. Are there any disclaimers that are in the wrong place?
• //p[@outputclass="disclaimer"]/following-sibling::element()
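A rough Python version of these three checks, run against an invented DITA-like topic. The first two use ElementTree's XPath subset. The disclaimer check is done in plain Python because ElementTree has no following-sibling axis, and it reports the misplaced disclaimers themselves rather than the elements that follow them.

```python
import xml.etree.ElementTree as ET

# Hypothetical DITA-like topic, invented for this sketch.
xml = """<topic>
  <section><title>Introduction</title>
    <p outputclass="disclaimer">Subject to change.</p>
    <p>Body text.</p>
  </section>
  <section><title>Overview</title>
    <table><row><entry><table><row><entry>Nested</entry></row></table></entry></row></table>
  </section>
</topic>"""
root = ET.fromstring(xml)

# Tables nested at any depth inside another table.
nested_tables = root.findall(".//table//table")

# Sections still titled "Introduction" after the rename to "Overview".
missed = root.findall(".//section[title='Introduction']")

# Disclaimers followed by at least one sibling element, so not at the very end.
misplaced = []
for parent in root.iter():
    children = list(parent)
    for i, child in enumerate(children):
        is_disclaimer = child.tag == "p" and child.get("outputclass") == "disclaimer"
        if is_disclaimer and children[i + 1:]:
            misplaced.append(child)

print(len(nested_tables), len(missed), len(misplaced))
```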
Ideas for content analysis with XPath
• Look for outliers
• Ensure that elements are used for their intended
purpose (not just for some formatting shortcut)
• Check consistency across different types of elements
• Track down unnecessary child elements
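Two of these ideas, sketched in Python on an invented topic: counting element names to surface outliers, and checking that every section has a title. The rule that "occurs only once" means outlier is an assumption for this small sample; real content would need a threshold that suits its size.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical topic with one stray element and one untitled section.
xml = """<topic>
  <section><title>Overview</title><p>Text.</p><p>More <b>text</b>.</p></section>
  <section><title>Setup</title><p>Text.</p></section>
  <section><p>Untitled section.</p><blink>Old markup</blink></section>
</topic>"""
root = ET.fromstring(xml)

# Look for outliers: element names that occur only once in the document.
tag_counts = Counter(el.tag for el in root.iter())
rare_tags = sorted(tag for tag, n in tag_counts.items() if n == 1)

# Check consistency: sections that lack a title child.
untitled = [s for s in root.findall(".//section") if s.find("title") is None]

print(rare_tags)
print(len(untitled))
```

The root element always shows up in the rare list, so in practice it would be filtered out before review.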
XML and XPath tools
XML and XPath resources
• W3Schools tutorials
• https://www.w3schools.com/xml/
• XPath cheat sheet
• https://devhints.io/xpath
Thank You!
Are you ready to upgrade, transform, and future-enable your content?
Contact us and we’ll show you what’s possible.
precisioncontent.com | more-info@precisioncontent.com | 1(647)265-8500

More Related Content

Similar to Structured Strategy: How to Supercharge Your Content Analysis with XML and XPath

CPP18 - String Parsing
CPP18 - String ParsingCPP18 - String Parsing
CPP18 - String ParsingMichael Heron
 
Text analysis
Text analysisText analysis
Text analysisshahidzac
 
ATLAS.ti Training - Covering the Basics (Mac edition)
ATLAS.ti Training - Covering the Basics (Mac edition)ATLAS.ti Training - Covering the Basics (Mac edition)
ATLAS.ti Training - Covering the Basics (Mac edition)Arun Verma
 
ATLAS.ti training presentation: Covering the basics
ATLAS.ti training presentation: Covering the basics ATLAS.ti training presentation: Covering the basics
ATLAS.ti training presentation: Covering the basics Arun Verma
 
DITA Quick Start for Authors - Part I
DITA Quick Start for Authors - Part IDITA Quick Start for Authors - Part I
DITA Quick Start for Authors - Part ISuite Solutions
 
Post conference workshop (xml and structure)
Post conference workshop (xml and structure)Post conference workshop (xml and structure)
Post conference workshop (xml and structure)Scriptorium Publishing
 
Decoding and developing the online finding aid
Decoding and developing the online finding aidDecoding and developing the online finding aid
Decoding and developing the online finding aidkgerber
 
Introduction to Python and Django
Introduction to Python and DjangoIntroduction to Python and Django
Introduction to Python and Djangosolutionstreet
 
Chapter 1 Getting Started with HTML5
Chapter 1 Getting Started with HTML5Chapter 1 Getting Started with HTML5
Chapter 1 Getting Started with HTML5Dr. Ahmed Al Zaidy
 
Xml basics
Xml basicsXml basics
Xml basicsKumar
 
Preliminary committee presentation
Preliminary committee presentationPreliminary committee presentation
Preliminary committee presentationRichard Drake
 
What's With The 1S And 0S? Making Sense Of Binary Data At Scale With Tika And...
What's With The 1S And 0S? Making Sense Of Binary Data At Scale With Tika And...What's With The 1S And 0S? Making Sense Of Binary Data At Scale With Tika And...
What's With The 1S And 0S? Making Sense Of Binary Data At Scale With Tika And...gagravarr
 
ProjectsSummary.pptx
ProjectsSummary.pptxProjectsSummary.pptx
ProjectsSummary.pptxJamesKirk79
 
What's new with Apache Tika?
What's new with Apache Tika?What's new with Apache Tika?
What's new with Apache Tika?gagravarr
 
DC SPUG Feb 2015 The Secret Sauce to Information Architecture
DC SPUG Feb 2015 The Secret Sauce to Information ArchitectureDC SPUG Feb 2015 The Secret Sauce to Information Architecture
DC SPUG Feb 2015 The Secret Sauce to Information ArchitectureJill Hannemann
 
Using the Archivists' Toolkit: Hands-on practice and related tools
Using the Archivists' Toolkit: Hands-on practice and related toolsUsing the Archivists' Toolkit: Hands-on practice and related tools
Using the Archivists' Toolkit: Hands-on practice and related toolsAudra Eagle Yun
 

Similar to Structured Strategy: How to Supercharge Your Content Analysis with XML and XPath (20)

CPP18 - String Parsing
CPP18 - String ParsingCPP18 - String Parsing
CPP18 - String Parsing
 
Text analysis
Text analysisText analysis
Text analysis
 
ATLAS.ti Training - Covering the Basics (Mac edition)
ATLAS.ti Training - Covering the Basics (Mac edition)ATLAS.ti Training - Covering the Basics (Mac edition)
ATLAS.ti Training - Covering the Basics (Mac edition)
 
ATLAS.ti training presentation: Covering the basics
ATLAS.ti training presentation: Covering the basics ATLAS.ti training presentation: Covering the basics
ATLAS.ti training presentation: Covering the basics
 
DITA Quick Start for Authors - Part I
DITA Quick Start for Authors - Part IDITA Quick Start for Authors - Part I
DITA Quick Start for Authors - Part I
 
XML Technologies
XML TechnologiesXML Technologies
XML Technologies
 
Post conference workshop (xml and structure)
Post conference workshop (xml and structure)Post conference workshop (xml and structure)
Post conference workshop (xml and structure)
 
Decoding and developing the online finding aid
Decoding and developing the online finding aidDecoding and developing the online finding aid
Decoding and developing the online finding aid
 
Introduction to Python and Django
Introduction to Python and DjangoIntroduction to Python and Django
Introduction to Python and Django
 
Chapter 1 Getting Started with HTML5
Chapter 1 Getting Started with HTML5Chapter 1 Getting Started with HTML5
Chapter 1 Getting Started with HTML5
 
Xml basics
Xml basicsXml basics
Xml basics
 
Preliminary committee presentation
Preliminary committee presentationPreliminary committee presentation
Preliminary committee presentation
 
What's With The 1S And 0S? Making Sense Of Binary Data At Scale With Tika And...
What's With The 1S And 0S? Making Sense Of Binary Data At Scale With Tika And...What's With The 1S And 0S? Making Sense Of Binary Data At Scale With Tika And...
What's With The 1S And 0S? Making Sense Of Binary Data At Scale With Tika And...
 
Solved Big Data and Data Science Projects pdf.pdf
Solved Big Data and Data Science Projects pdf.pdfSolved Big Data and Data Science Projects pdf.pdf
Solved Big Data and Data Science Projects pdf.pdf
 
E-publishing
E-publishingE-publishing
E-publishing
 
ProjectsSummary.pptx
ProjectsSummary.pptxProjectsSummary.pptx
ProjectsSummary.pptx
 
What's new with Apache Tika?
What's new with Apache Tika?What's new with Apache Tika?
What's new with Apache Tika?
 
DC SPUG Feb 2015 The Secret Sauce to Information Architecture
DC SPUG Feb 2015 The Secret Sauce to Information ArchitectureDC SPUG Feb 2015 The Secret Sauce to Information Architecture
DC SPUG Feb 2015 The Secret Sauce to Information Architecture
 
Using the Archivists' Toolkit: Hands-on practice and related tools
Using the Archivists' Toolkit: Hands-on practice and related toolsUsing the Archivists' Toolkit: Hands-on practice and related tools
Using the Archivists' Toolkit: Hands-on practice and related tools
 
Apex code (Salesforce)
Apex code (Salesforce)Apex code (Salesforce)
Apex code (Salesforce)
 

Recently uploaded

WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 

Recently uploaded (20)

WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 

Structured Strategy: How to Supercharge Your Content Analysis with XML and XPath

  • 1. Structured Strategy: How to Supercharge Your Content Analysis with XML and XPath Josh Anderson
  • 2. Who I am • Information Architect at Precision Content • Certified Professional Technical Communicator (CPTC) Foundation • Co-Organized World Information Architecture Day 2023 Toronto • Master of Information from the University of Toronto
  • 3. 3 We are experts in structured content. We’re a full-service, end-to-end technical communications consultancy, technology innovator, and systems integrator offering professional services, training, and technology. Areas of expertise • structured authoring methods • content lifecycle management • DITA/XML design and implementation • information architecture • content strategy, • and structured content delivery.
  • 4. 4 Who is this presentation for? People who… • strategize, plan for, and otherwise work with text content • understand the benefits of structured content • are familiar with XML but perhaps not with XPath • want to learn how to take their content analysis skills to the next level
  • 5. 5 Structured content Content is easier to use and understand when organized in a predictable way. Content is written to fit a model: • Title • Presenter • Description • Speaker Bio
  • 6. 6 Structure makes content FAIR Findable Accessible Interoperable Reusable “The FAIR Guiding Principles for scientific data management and stewardship” was published in Scientific Data in 2016
  • 7. 7 An example of structure: HTML Content is contained inside opening and closing tags. Sometimes elements contain other elements. All elements are contained within a single root.
  • 8. 8 There’s just one problem… These elements don’t tell me anything about what the content is about!
  • 9. 9 XML • XML is a way to store information • XML stands for “eXtensible Markup Language” • “Extensible” means that you define your own structure HTML: Pre-defined tags XML: Define your own tags
  • 10. 10 How do you define your own structure? You define your structure, or your content model, in a Document Type Definition (DTD).
  • 11. 11 Defining your structure What you can define with a DTD What you can’t define with a DTD • Elements • Attributes • If an element can contain text, another element, or both • Order of elements • If something is required or optional • Length of content • Occurrence constraints • What text can go inside elements
  • 12. 12 Structure helps you analyze your content • Structure is a prerequisite to performing content analysis at scale • You want a way to tell if your content is valid or invalid • Semantic structures can be understood by both people and computers • Using a widely adopted standard like XML lets us take advantage of specialized tools • Oftentimes you can adopt a standard structure rather than inventing your own
  • 13. 13 XML-based standards Some extensions of XML have become standards in their own right Scalable Vector Graphics Resource Description Framework Darwin Information Typing Architecture (DITA)
  • 14. 14 Finding structure • What if your content is unstructured? • Look for patterns in • attributes • classes • common parent/sibling elements, and • common text strings.
  • 15. 15 Creating structure • Break your content down into microcontent • about one primary idea, fact, or concept • easily scannable • labelled for clear identification and meaning, and • appropriately written and formatted for use anywhere and anytime it is needed.
  • 16. 16 Microcontent structure Source: The DITA Style Guide – Best Practices for Authors. Tony Self. www.ditastyle.com • You do not need code to have structure • Structure means • systematic labelling • modular, topic-based architecture • constrained writing environments, and • separation of content and form.
  • 17. 17 Focus Information about hours of work Requirement for unplanned absences Information about lunch breaks Requirement for planned absences
  • 21. 21 What is XPath? • XPath is a language that lets you identify particular parts of XML documents • In XPath, we write “location paths” • Example of an XPath location path: //bookstore/book/@id • XPath can help you answer queries like… • “Show me every element called ‘book’.” • “Show me the parent element of the element called ‘price’.” • “Show me all the elements that have the attribute ‘language’ set to ‘English’.” • … and much more • XPath is used in other XML-related languages like XQuery and XSLT
  • 23. 23 1. The root node 2. Element nodes 3. Text nodes 4. Attribute nodes 5. Comment nodes 6. Processing instruction nodes 7. Namespace nodes Seven kinds of XML nodes
  • 24. 24 1. The root node 2. Element nodes 3. Text nodes 4. Attribute nodes 5. Comment nodes 6. Processing instruction nodes 7. Namespace nodes Seven kinds of XML nodes
  • 25. 25 1. The root node 2. Element nodes 3. Text nodes 4. Attribute nodes 5. Comment nodes 6. Processing instruction nodes 7. Namespace nodes Seven kinds of XML nodes
  • 26. 26 1. The root node 2. Element nodes 3. Text nodes 4. Attribute nodes 5. Comment nodes 6. Processing instruction nodes 7. Namespace nodes Seven kinds of XML nodes
  • 27. 27 1. The root node 2. Element nodes 3. Text nodes 4. Attribute nodes 5. Comment nodes 6. Processing instruction nodes 7. Namespace nodes Seven kinds of XML nodes
  • 28. 28 1. The root node 2. Element nodes 3. Text nodes 4. Attribute nodes 5. Comment nodes 6. Processing instruction nodes 7. Namespace nodes Seven kinds of XML nodes
  • 29. 29 Node selectors Expression Description / Selects the document root node // Selects from all descendants of the context node and the context node itself . Selects the current node .. Selects the parent of the current node @ Selects attribute nodes * Selects any element node, regardless of type.
  • 30. 30 Select the bookstore element node How to select nodes in XPath
  • 31. 31 Select the bookstore element node • /bookstore • //bookstore How to select nodes in XPath
  • 32. 32 Select all book element nodes How to select nodes in XPath
  • 33. 33 Select all book element nodes • /bookstore/book • //book How to select nodes in XPath
  • 34. 34 Select all price element nodes How to select nodes in XPath
  • 35. 35 Select all price element nodes • /bookstore/book/price • //price How to select nodes in XPath
  • 37. 37 Select all lang attribute nodes • //@lang How to select nodes in XPath
  • 39. 39 Select all comment nodes • //comment() How to select nodes in XPath
  • 41. 41 Select the parent element nodes of the title element nodes • //title/.. How to select nodes in XPath
  • 43. 43 Select the comment nodes that are children of book elements • /bookstore/book/comment() • //book/comment() How to select nodes in XPath
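The node selections walked through on these slides can be approximated in Python. This is a rough sketch under stated assumptions: the sample document is made up, and the standard-library `xml.etree.ElementTree` module handles only part of XPath. It cannot return attribute nodes or evaluate `comment()`, so those two selections are emulated with ordinary Python and labeled as such in the comments.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, for illustration only.
SAMPLE = """<bookstore>
  <!-- inventory as of last count -->
  <book category="children">
    <!-- bestseller -->
    <title lang="eng">Harry Potter</title>
    <price>29.99</price>
  </book>
  <book category="textbook">
    <title lang="eng">Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>"""

# The default parser discards comments; this builder keeps them (Python 3.8+).
parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
root = ET.fromstring(SAMPLE, parser=parser)

# //book and //price: supported directly (relative form, so ".//").
books = root.findall(".//book")
prices = [p.text for p in root.findall(".//price")]

# //title/.. : the parent elements of the title elements.
title_parents = [p.tag for p in root.findall(".//title/..")]

# //@lang : emulated, because ElementTree returns elements, not attribute nodes.
langs = [e.get("lang") for e in root.iter() if "lang" in e.attrib]

# //comment() : emulated; comment nodes carry the special tag ET.Comment.
all_comments = [c.text.strip() for c in root.iter() if c.tag is ET.Comment]

# //book/comment() : only comments that are direct children of a book.
book_comments = [c.text.strip() for b in books for c in b if c.tag is ET.Comment]

print(prices, title_parents, langs)
print(all_comments, book_comments)
```

The difference between the last two results mirrors the difference between `//comment()` and `//book/comment()`: the first finds comments anywhere in the document, the second only those sitting directly inside a book.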
  • 44. 44 Axes An axis is a direction that we travel along to get to different parts of an XML document. All XPath location paths have an axis. So far, we have used “abbreviated location paths.” Unabbreviated, they use a double colon before the node test. It looks like this: //child::bookstore Image source: https://jrebecchi.github.io/xpath-helper/xpath-axes.html
  • 46. 46 Select any comment that is a descendant of the book element • //book/descendant::comment() Selecting with axes in XPath
  • 48. 48 Select the parent element of the price element • //price/.. • //price/parent::element() Selecting with axes in XPath
  • 50. 50 Select the sibling elements following the title element • //title/following-sibling::element() Selecting with axes in XPath
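To make the idea of "travelling along an axis" concrete, here is a small Python sketch that walks the parent, following-sibling, and descendant directions by hand. It is an emulation, not an XPath engine: the standard-library `xml.etree.ElementTree` module does not implement named axes. The sample document and the helper name `following_siblings` are assumptions made for illustration, and the descendant example collects elements rather than comments to keep the sketch short.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, for illustration only.
SAMPLE = """<bookstore>
  <book>
    <title>Harry Potter</title>
    <author>J. K. Rowling</author>
    <price>29.99</price>
  </book>
</bookstore>"""

root = ET.fromstring(SAMPLE)

# ElementTree elements do not store a link to their parent,
# so build a child -> parent lookup once.
parent_of = {child: parent for parent in root.iter() for child in parent}


def following_siblings(elem):
    """Emulate following-sibling::element() for a single element."""
    siblings = list(parent_of[elem])
    return siblings[siblings.index(elem) + 1:]


book = root.find(".//book")
title = root.find(".//title")
price = root.find(".//price")

# //price/parent::element()
price_parent = parent_of[price].tag

# //title/following-sibling::element()
after_title = [s.tag for s in following_siblings(title)]

# //book/descendant::element()  (descendants only, not the book itself)
book_descendants = [d.tag for d in book.iter() if d is not book]

print(price_parent, after_title, book_descendants)
```

Each result starts from a context node and moves in one direction through the tree, which is what an axis expresses in a location path.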
  • 51. 51 Predicates • Predicates are like a filter on your results • Predicates appear inside [square brackets] • Predicates are Boolean expressions (a bare number such as [2] is shorthand for [position()=2]) • The full syntax of each step in an XPath location path is axis::node-test[predicate] • The axis and node test are required. The predicate is optional. • If you do not specify an axis, it is assumed to be “child::”
  • 53. 53 Selecting with predicates in XPath Select the book with the title “Harry Potter” • //book[title="Harry Potter"]
  • 55. 55 Selecting with predicates in XPath Select the titles that are in English • //title[@lang="eng"]
  • 57. 57 Selecting with predicates in XPath Select the textbooks • //book[@category="textbook"]
  • 59. 59 Selecting with predicates in XPath Select the second book • //book[2]
  • 61. 61 Selecting with predicates in XPath Select the second textbook • //book[@category="textbook"][2]
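The predicate examples can be sketched in Python as follows. The sample document (including the `category` and `lang` values) and the helper `titles` are made up for illustration. The code relies on the subset of predicates that the standard-library `xml.etree.ElementTree` module supports. For "the second textbook", the sketch filters first and then indexes the resulting Python list instead of chaining `[2]` onto the attribute predicate, because ElementTree's positional predicates may not behave like those of a full XPath engine when chained.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, for illustration only.
SAMPLE = """<bookstore>
  <book category="children"><title lang="eng">Harry Potter</title></book>
  <book category="textbook"><title lang="eng">Learning XML</title></book>
  <book category="textbook"><title lang="fra">Apprendre XML</title></book>
</bookstore>"""

root = ET.fromstring(SAMPLE)


def titles(books):
    """Return the title text of each book element (helper for display)."""
    return [b.findtext("title") for b in books]


# //book[title="Harry Potter"]
harry = root.findall(".//book[title='Harry Potter']")

# //title[@lang="eng"]
english_titles = [t.text for t in root.findall(".//title[@lang='eng']")]

# //book[@category="textbook"]
textbooks = root.findall(".//book[@category='textbook']")

# //book[2]  (XPath positions start at 1)
second_book = root.findall(".//book[2]")

# Second textbook: filter first, then index (Python lists start at 0).
second_textbook = textbooks[1]

print(titles(harry), english_titles, titles(textbooks))
print(titles(second_book), second_textbook.findtext("title"))
```

Note that "the second book" and "the second textbook" are different elements in this sample, which is the point of applying the position after the category filter.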
  • 62. 64 Real-world content analysis with XPath • From experience, I know that tables inside tables often have unpredictable issues. I want to check on them. • //table//table • I need to change all the section titles called “Introduction” to “Overview.” Did I miss any? • //section[title="Introduction"] • The client wants a disclaimer paragraph at the very end of the topic. Are there any disclaimers that are in the wrong place? • //p[@outputclass="disclaimer"]/following-sibling::element()
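These three checks could be scripted along the following lines. This is a sketch under loud assumptions: the sample topic is a simplified, made-up structure (not valid DITA), and the checks use the standard-library `xml.etree.ElementTree` module, with the sibling test emulated in plain Python. The nested-table check searches at any depth, so tables nested through rows and cells are caught too. The rename check looks for sections still titled "Introduction", since those are the ones a rename would have missed.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified topic for illustration only (not valid DITA).
SAMPLE = """<topic>
  <title>Sample topic</title>
  <body>
    <section><title>Introduction</title><p>Heading not yet renamed.</p></section>
    <section><title>Overview</title><p>Heading already renamed.</p></section>
    <table><row><entry><table><row><entry>Nested</entry></row></table></entry></row></table>
    <p outputclass="disclaimer">Terms and conditions apply.</p>
    <p>This paragraph should not follow the disclaimer.</p>
  </body>
</topic>"""

root = ET.fromstring(SAMPLE)

# Check 1: tables nested at any depth inside another table.
nested_tables = root.findall(".//table//table")

# Check 2: sections still titled "Introduction" (missed by the rename).
missed_sections = root.findall(".//section[title='Introduction']")

# Check 3: disclaimer paragraphs that have at least one following sibling,
# which means they are not the last thing in their parent.
misplaced = []
for parent in root.iter():
    children = list(parent)
    for i, child in enumerate(children):
        is_disclaimer = child.tag == "p" and child.get("outputclass") == "disclaimer"
        if is_disclaimer and i < len(children) - 1:
            misplaced.append(child)

print(len(nested_tables), "nested table(s)")
print([s.findtext("title") for s in missed_sections])
print([p.text for p in misplaced])
```

An empty result for each check is the desired outcome: no nested tables, no leftover "Introduction" headings, and no disclaimer with content after it.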
  • 64. 66 Ideas for content analysis with XPath • Look for outliers • Ensure that elements are used for their intended purpose (not just for some formatting shortcut) • Check consistency across different types of elements • Track down unnecessary child elements
  • 66. 68 XML and XPath resources • W3Schools tutorials • https://www.w3schools.com/xml/ • XPath cheat sheet • https://devhints.io/xpath
  • 67. Thank You! Are you ready to upgrade, transform, and future-enable your content? Contact us and we’ll show you what’s possible. precisioncontent.com | more-info@precisioncontent.com | 1(647)265-8500

Editor's Notes

  1. Precision Content is a consultancy specializing in end-to-end services for technical communications. We provide services in writer training, content strategy, information architecture, content lifecycle management, systems integration, and content publishing. We use our expertise in microcontent and structured authoring with DITA/XML to empower our clients across a variety of industries to modernize their content. [click]
  2. https://www.w3schools.com/xml/xml_whatis.asp
  3. Everything not permitted is forbidden
  4. Image source: https://www.geeksforgeeks.org/xsd-full-form/
  5. https://www.w3schools.com/xml/xml_whatis.asp
  6. https://commons.wikimedia.org/wiki/File:SVG_Logo.svg https://www.w3.org/RDF/icons/
  7. https://www.w3schools.com/xml/xml_whatis.asp
  8. https://www.w3schools.com/xml/xml_whatis.asp
  9. [Image – “Hours of Work” section from the old handbook] [Image – The series of briefer microcontent topics in the updated handbook. “Work Hour Limits,” “Time Tracking Requirement,” “Your Work Environment,” etc.]
  10. [Image – highlight both reference and principle information in the original employee handbook topic “Hours of Work”] [Image – show two separate topics (with type info, if possible) that were broken out of the single mixed-function topic “Hours of Work”]
  11. (Maybe what I can do for this is go on Heretto, find a topic, then delete the headings and paragraph breaks and such and use that as my example of “unstructured” content)  Maybe “Hours of work” from the old employee handbook compared to the rewritten passage in the new one Link to old employee handbook: https://ascan.sharepoint.com/CorpCommunications/Forms/AllItems.aspx?id=%2FCorpCommunications%2FPrecision%20Content%20Employee%20Handbook%2Epdf&parent=%2FCorpCommunications Look at some of the other PCAS microcontent presentations for some stuff about what we mean by structure. In fact, use material from those presentations throughout your talk.
  12. https://www.w3schools.com/xml/xml_whatis.asp
  13. https://www.w3schools.com/xml/xml_whatis.asp
  14. https://www.w3schools.com/xml/xml_whatis.asp
  15. https://www.w3schools.com/xml/xml_whatis.asp
  16. https://www.w3schools.com/xml/xml_whatis.asp