AI and investigative journalism
Josh Nicholas
Data journalist
The Guardian
Agenda
● Introduction
● What is AI
○ Different forms
○ More than a black box
● Case studies
○ Extracting useful info from text
○ Fuzzy matching between datasets
○ Finding a needle in a haystack
● Homework
● Q + A
Code for all examples is on my GitHub
More resources in HANDOUT
After the session:
● Recording
● Handout
● Homework in our LinkedIn Group
● LINK to join
What is AI?
1
Artificial intelligence is a catch-all term
● Many AI terms are used
interchangeably
● We are going to focus on machine
learning models
● These are algorithms that can learn
their own rules from data
This graphic was adapted from Build a Large Language Model by Sebastian Raschka
What are ‘rules’?
Learning from the data
● Machines are great at identifying patterns that aren’t obvious to humans
● Given some examples to learn from, an algorithm can find more
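As a rough illustration of "learning rules from data", the sketch below contrasts a hand-written rule with a tiny classifier that derives a weight for each word from labelled examples. The training sentences, function names and smoothing choice are all invented for illustration; real models are far larger, but the idea of inferring the rules rather than writing them is the same.

```python
import math
from collections import Counter

# Toy labelled examples (invented for illustration).
TRAINING = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("claim your prize today", "spam"),
    ("meeting moved to noon", "ham"),
    ("draft story attached for review", "ham"),
    ("lunch at noon today", "ham"),
]

def hand_written_rule(text):
    """An explicit rule: someone has to think of every keyword in advance."""
    return "spam" if "free" in text.split() else "ham"

def train(examples):
    """Learn one weight per word: positive leans spam, negative leans ham."""
    counts = {"spam": Counter(), "ham": Counter()}
    for text, label in examples:
        counts[label].update(text.split())
    vocab = set(counts["spam"]) | set(counts["ham"])
    totals = {label: sum(c.values()) for label, c in counts.items()}
    weights = {}
    for word in vocab:
        # Add-one smoothing so a word seen in only one class never divides by zero.
        p_spam = (counts["spam"][word] + 1) / (totals["spam"] + len(vocab))
        p_ham = (counts["ham"][word] + 1) / (totals["ham"] + len(vocab))
        weights[word] = math.log(p_spam / p_ham)
    return weights

def classify(text, weights):
    """Sum the learned weights; words never seen in training count as 0."""
    score = sum(weights.get(word, 0.0) for word in text.split())
    return "spam" if score > 0 else "ham"

weights = train(TRAINING)
print(hand_written_rule("claim a prize"))   # the fixed rule only knows "free"
print(classify("claim a prize", weights))   # the learned weights generalise
print(classify("story review at noon", weights))
```

The hand-written rule only catches the one keyword its author thought of, while the learned weights pick up other words that happened to co-occur with spam in the examples.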
AI and newsgathering
● Machine-learning algorithms are trained on large datasets
○ They can be fine-tuned on smaller datasets
● They are useful for “fuzzy” problems, when it’s hard to write explicit
rules/instructions
● You can access many pre-trained algorithms for free e.g.
○ Huggingface.co
○ Google, OpenAI, Mistral, Facebook etc.
and…
● If we can’t find an algorithm that fits our purpose, we can fine-tune an existing one
Examples we can borrow from
• Email spam filters
• Recommendation systems (Netflix, Spotify etc.)
• Language translation
• Audio transcription
• Facial recognition
• Object detection
• Predictive text
• Search engines
■ Google BERT etc.
Case studies
2
1) Extraction
The problem:
● Extracting names, locations and
dollar amounts from thousands of
text documents:
○ 34k+ Facebook posts
○ 2.4k media releases
● What if we don’t know the names
they’ll use?
● What if they say something vague
like “a million for x”?
● We scraped thousands of Facebook
posts and media releases from official
websites
● We used a pre-trained model from
spaCy, a common Python library
● The model identified names, locations
and references to money in the texts
● Since 2022 these tools have become even easier to use
● You can also achieve similar results with GenAI tools like ChatGPT
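A minimal sketch of this kind of extraction, assuming spaCy and its small English model `en_core_web_sm` are installed. The slides do not say which spaCy model was used, so the model name, the set of labels kept and the helper names here are assumptions rather than the original code.

```python
from collections import defaultdict

# spaCy entity labels of interest: people, places and money amounts.
WANTED_LABELS = {"PERSON", "GPE", "LOC", "MONEY"}

def group_entities(pairs, wanted=WANTED_LABELS):
    """Group (text, label) pairs by label, keeping only the wanted labels."""
    grouped = defaultdict(list)
    for text, label in pairs:
        if label in wanted:
            grouped[label].append(text)
    return dict(grouped)

def extract_entities(texts, model_name="en_core_web_sm"):
    """Run a pre-trained spaCy pipeline over many documents.

    Requires: pip install spacy
              python -m spacy download en_core_web_sm
    """
    import spacy  # imported lazily so group_entities works without spaCy

    nlp = spacy.load(model_name)  # load once, reuse for every document
    return [
        group_entities((ent.text, ent.label_) for ent in doc.ents)
        for doc in nlp.pipe(texts)
    ]

# Example call (needs the model download above):
# extract_entities(["Jane Citizen announced $1 million for a pool in Ballarat."])
```

Because the model decides what counts as a person or a money amount, the results are worth spot-checking by hand, especially for vague phrasings such as "a million for x".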
2) Fuzzy matching
The problem:
● We need to connect datasets that are
slightly different
○ Josh Nicholas vs Joshua Nicholas
● Previously we used a method called
Levenshtein Distance
○ Matching every name against every
other name
○ It took ages!!
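For reference, Levenshtein distance counts the single-character insertions, deletions and substitutions needed to turn one string into another. The sketch below is a plain-Python version written for illustration, not the code used in the original project; `best_matches` is a hypothetical helper showing why comparing every name against every other name gets slow.

```python
from itertools import product

def levenshtein(a, b):
    """Minimum number of single-character edits that turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]

def best_matches(left, right):
    """Compare every name in left with every name in right.

    The work grows with len(left) * len(right), which is why this is slow
    on large datasets.
    """
    best = {}
    for name_a, name_b in product(left, right):
        distance = levenshtein(name_a.lower(), name_b.lower())
        if name_a not in best or distance < best[name_a][1]:
            best[name_a] = (name_b, distance)
    return best

print(levenshtein("Josh Nicholas", "Joshua Nicholas"))
print(best_matches(["Josh Nicholas"], ["Joshua Nicholas", "John Nichols"]))
```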
Making use of the AI ecosystem
● When you input text into a chatbot it
turns the text into a series of numbers
● We can use this same technique to
match names
• Find the numbers that are most
similar
● This same technique can be scaled to
full sentences or even entire documents
● It can also be run in reverse: which things
are least similar?
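The "series of numbers" idea can be sketched without any model at all. In the example below, `toy_embedding` is a deliberately crude stand-in (counts of character trigrams) for a learned embedding, so the code runs with only the standard library; in practice you would swap in vectors from an embedding model. The names and helper functions are hypothetical.

```python
import math
from collections import Counter

def toy_embedding(text, n=3):
    """Stand-in for a learned embedding: counts of character n-grams.

    A real pipeline would call an embedding model here instead.
    """
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors (1.0 = same direction)."""
    dot = sum(u[key] * v[key] for key in u.keys() & v.keys())
    norm_u = math.sqrt(sum(value * value for value in u.values()))
    norm_v = math.sqrt(sum(value * value for value in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_by_similarity(query, candidates):
    """Return (score, candidate) pairs sorted from most to least similar."""
    query_vec = toy_embedding(query)
    scored = [(cosine_similarity(query_vec, toy_embedding(c)), c) for c in candidates]
    return sorted(scored, reverse=True)

names = ["Joshua Nicholas", "John Nichols", "Jane Citizen"]
ranked = rank_by_similarity("Josh Nicholas", names)
print(ranked[0][1])   # most similar candidate
print(ranked[-1][1])  # least similar: the same scores, read in reverse
```

With real embeddings the ranking step is usually done with vectorised maths or a nearest-neighbour index rather than a Python loop, which is typically where the speed-up over pairwise edit distance comes from.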
3) Finding a needle in a haystack
The problem:
● Who poses most with dogs, babies,
hi vis etc.?
● We need to search through
thousands of images, many of them
not captioned
Training a detection model
● There are loads of models that are
immediately useful
• E.g. ones for workplace safety, that can
identify hard hats etc.
• Also lots of free datasets online
● We manually created a training dataset
with novelty cheques and hi vis vests
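Training itself depends on the framework chosen and is not shown here. The sketch below covers the step after a detection model exists: running it over images and counting who appears with what. It assumes the detections arrive as a list of `{"label", "score", ...}` dictionaries per image, which is the shape returned by the Hugging Face `transformers` object-detection pipeline; the model name, file names, labels and the 0.8 threshold are placeholders, not values from the original project.

```python
from collections import Counter, defaultdict

def tally_detections(detections_by_image, image_owner, min_score=0.8):
    """Count, per person, how many of their images contain each label.

    detections_by_image: {image_path: [{"label": str, "score": float}, ...]}
    image_owner:         {image_path: name of the account that posted it}
    """
    tallies = defaultdict(Counter)
    for image_path, detections in detections_by_image.items():
        # A set, so three dogs in one photo still count as one "dog photo".
        labels = {d["label"] for d in detections if d["score"] >= min_score}
        tallies[image_owner[image_path]].update(labels)
    return tallies

def detect(image_paths, model_name):
    """Run an object-detection model over images.

    Needs the third-party transformers package plus a backend such as PyTorch.
    """
    from transformers import pipeline  # imported lazily; optional dependency

    detector = pipeline("object-detection", model=model_name)
    return {path: detector(path) for path in image_paths}

# Hand-made results in the same shape, so the tally can be tried offline.
fake_results = {
    "a1.jpg": [{"label": "dog", "score": 0.97}, {"label": "dog", "score": 0.91}],
    "a2.jpg": [{"label": "hi-vis vest", "score": 0.88}],
    "b1.jpg": [{"label": "dog", "score": 0.42}],  # below threshold, ignored
}
owners = {"a1.jpg": "MP A", "a2.jpg": "MP A", "b1.jpg": "MP B"}
tallies = tally_detections(fake_results, owners)
print(tallies["MP A"].most_common())
```

The score threshold is a judgment call: set it too low and false positives inflate the counts, too high and real examples are missed, so a manual check of a sample is advisable before publishing any ranking.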
Quick summary
● Machine learning models can learn their own rules from the patterns in
data
● This helps us when we need to work with fuzzier/unlabelled data
○ Images, entire documents etc.
● There are thousands of models available for free online
● We can fine-tune them for specific tasks if necessary
● They can be run directly or built into interfaces for common problems
● GenAI tools can often do the same tasks, but they are harder to scale
Homework
● Homework 1 (if you can code):
○ Open the Hugging Face MODELS tab and choose a model that
would solve an editorial problem for you
○ Try out the tool and share your results in the LinkedIn Group
■ Why/what did you choose?
● Homework 2 (if you can't code yet):
○ Open the Hugging Face SPACES tab and choose one of the tools
○ Give it a prompt and share your results in the LinkedIn Group
■ Why/what did you choose?
● How would this help in a journalism context?
How homework works
1. Join the Closed LinkedIn Group
2. Post your work for trainer feedback within 4 weeks
3. Leave constructive feedback on at least one other
person’s post - within 2 weeks
4. Follow the Group Rules!
Any questions?
Josh Nicholas
Data journalist
The Guardian
josh.nicholas@theguardian.com
Thank you!

Webinar 3 - AI & Investigative Journalism - Training Slidedeck
