First steps at parsing and analyzing
web server log files at scale
Elias Dabbas
@eliasdabbas
Raw log file
Harvard Dataverse, ecommerce site (zanbil.ir)
3.3 GB, ~1.3M lines
Zaker, Farzin, 2019, "Online Shopping Store - Web Server Logs", https://doi.org/10.7910/DVN/3QBYB5, Harvard Dataverse, V1
Parse and convert to DataFrame/Table
• Loading and parsing the whole file in memory probably won't work (or scale)
• Log files are usually not big, they're huge
• Sequentially parse chunks of lines, save each chunk to a more efficient format (parquet), then combine the chunks
• Reading gets even faster after saving the combined DataFrame to a single optimized file, which is also more convenient to store
• Convert columns to more efficient data types
• Faster writing and reading times
• Magic provided by:
  • Pandas
  • Apache Arrow Project
  • Apache Parquet Project
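The "combine" and "convert to more efficient data types" steps above could be sketched roughly as follows. This is a minimal illustration, not code from the slides: the helper names `combine_chunks` and `optimize_dtypes` are hypothetical, the chunk files are assumed to be named `file_<N>.parquet`, and the column names are assumed to match the named groups of the regex used for parsing. Reading and writing parquet with pandas needs a parquet engine such as pyarrow installed.

```python
from pathlib import Path

import pandas as pd


def combine_chunks(chunks_dir, combined_file):
    """Read every chunk file in chunks_dir and save them as one parquet file."""
    # Sort by the number in the file name so chunks keep their original order
    # (assumes names like file_250000.parquet)
    chunk_files = sorted(
        Path(chunks_dir).glob('file_*.parquet'),
        key=lambda p: int(p.stem.split('_')[1]),
    )
    df = pd.concat((pd.read_parquet(f) for f in chunk_files), ignore_index=True)
    df.to_parquet(combined_file)
    return df


def optimize_dtypes(df):
    """Return a copy of df with more compact, analysis-friendly dtypes."""
    df = df.copy()
    # Few distinct values repeated many times: category saves memory
    for col in ['method', 'status']:
        df[col] = df[col].astype('category')
    # '-' means no size was reported; coerce it to a missing value
    df['size'] = pd.to_numeric(df['size'], errors='coerce').astype('Int64')
    # Timestamps look like 22/Jan/2019:03:56:14 +0330; normalize them to UTC
    df['datetime'] = pd.to_datetime(
        df['datetime'], format='%d/%b/%Y:%H:%M:%S %z', utc=True)
    return df
```

Running `optimize_dtypes` on the combined DataFrame before writing the single file means the compact types are stored in the parquet file and come back as-is on later reads.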
Model Name: MacBook Pro
Model Identifier: MacBookPro16,4
Processor Name: 8-Core Intel Core i9
Processor Speed: 2.4 GHz
Number of Processors: 1
Total Number of Cores: 8
L2 Cache (per Core): 256 KB
L3 Cache: 16 MB
Hyper-Threading Technology: Enabled
Memory: 32 GB
logs_to_df function
Assumes common (or combined) log format
Can be extended to other formats
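import re

import pandas as pd

# Assumed (not shown on the slide): one column per named group in combined_regex
columns = ['client', 'userid', 'datetime', 'method', 'request',
           'status', 'size', 'referrer', 'useragent']


def logs_to_df(logfile, output_dir, errors_file):
    """Parse logfile in chunks and save each chunk as a parquet file in output_dir."""
    with open(logfile) as source_file:
        linenumber = 0  # counts successfully parsed lines only
        parsed_lines = []
        for line in source_file:
            try:
                log_line = re.findall(combined_regex, line)[0]
                parsed_lines.append(log_line)
            except Exception as e:
                # Lines that don't match the regex are logged and skipped
                with open(errors_file, 'at') as errfile:
                    print((line, str(e)), file=errfile)
                continue
            linenumber += 1
            if linenumber % 250_000 == 0:
                # Flush a full chunk to disk and free the memory
                df = pd.DataFrame(parsed_lines, columns=columns)
                df.to_parquet(f'{output_dir}/file_{linenumber}.parquet')
                parsed_lines.clear()
        # Save what is left after the last full chunk; skipping an empty
        # remainder avoids overwriting the last chunk file with an empty one
        if parsed_lines:
            df = pd.DataFrame(parsed_lines, columns=columns)
            df.to_parquet(f'{output_dir}/file_{linenumber}.parquet')
            parsed_lines.clear()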
combined_regex = (
    r'^(?P<client>\S+) \S+ (?P<userid>\S+) \[(?P<datetime>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<request>[^ "]+)? HTTP/[0-9.]+" '
    r'(?P<status>[0-9]{3}) (?P<size>[0-9]+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<useragent>[^"]*)'
)
Regular Expressions Cookbook
by Jan Goyvaerts, Steven Levithan
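To see what the regular expression extracts, here is a small sketch that applies it to a single line. The sample line is made up for illustration (it is not taken from the dataset), and the regex is repeated here so the snippet runs on its own.

```python
import re

combined_regex = (
    r'^(?P<client>\S+) \S+ (?P<userid>\S+) \[(?P<datetime>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<request>[^ "]+)? HTTP/[0-9.]+" '
    r'(?P<status>[0-9]{3}) (?P<size>[0-9]+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<useragent>[^"]*)'
)

# Hypothetical line in combined log format
sample_line = (
    '192.0.2.1 - - [22/Jan/2019:03:56:14 +0330] '
    '"GET /product/123?color=red HTTP/1.1" 200 30577 "-" '
    '"Mozilla/5.0 (X11; Linux x86_64)"'
)

# Named groups give one field per column
fields = re.search(combined_regex, sample_line).groupdict()
for name, value in fields.items():
    print(f'{name:>10}: {value}')

# re.findall returns the same values as a tuple, in group order
as_tuple = re.findall(combined_regex, sample_line)[0]

# A line that doesn't match yields an empty list, so indexing [0] raises
# IndexError; a caller can catch that and record the offending line
no_match = re.findall(combined_regex, 'not a log line')
```

Because `findall` returns plain tuples in group order, a list of them can be passed directly to `pd.DataFrame` with one column name per group.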
Thank you

Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
MK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxMK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docx
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 

Log File Analysis

  • 1. First steps at parsing and analyzing web server log files at scale Elias Dabbas @eliasdabbas
  • 2. Raw log file: Harvard Dataverse, ecommerce site (zanbil.ir), 3.3GB, ~1.3M lines. Zaker, Farzin, 2019, "Online Shopping Store - Web Server Logs", https://doi.org/10.7910/DVN/3QBYB5, Harvard Dataverse, V1
  • 3. Parse and convert to DataFrame/Table • Loading and parsing the whole file into memory probably won’t work (or scale) • Log files are usually not big, they’re huge • Sequentially parse chunks of lines, save to another efficient format (parquet), combine
  • 4.
  • 5. • File ingestion gets even faster after saving the DataFrame to a single optimized file, also more convenient to store as a single file
  • 6.
  • 7.
  • 8. • Convert to more efficient data types • Faster writing and reading time
  • 9.
  • 10. • Magic provided by: Pandas, Apache Arrow Project, Apache Parquet Project. Model Name: MacBook Pro; Model Identifier: MacBookPro16,4; Processor Name: 8-Core Intel Core i9; Processor Speed: 2.4 GHz; Number of Processors: 1; Total Number of Cores: 8; L2 Cache (per Core): 256 KB; L3 Cache: 16 MB; Hyper-Threading Technology: Enabled; Memory: 32 GB
  • 11. logs_to_df function. Assumes common (or combined) log format; can be extended to other formats. Regex from Regular Expressions Cookbook by Jan Goyvaerts, Steven Levithan.

        import re
        import pandas as pd

        # Raw string so that \S, \[ and \] reach the regex engine unchanged.
        combined_regex = (
            r'^(?P<client>\S+) \S+ (?P<userid>\S+) \[(?P<datetime>[^\]]+)\] '
            r'"(?P<method>[A-Z]+) (?P<request>[^ "]+)? HTTP/[0-9.]+" '
            r'(?P<status>[0-9]{3}) (?P<size>[0-9]+|-) '
            r'"(?P<referrer>[^"]*)" "(?P<useragent>[^"]*)'
        )
        # `columns` is not shown on the slide; assumed here to be the group names.
        columns = list(re.compile(combined_regex).groupindex)

        def logs_to_df(logfile, output_dir, errors_file):
            with open(logfile) as source_file:
                linenumber = 0
                parsed_lines = []
                for line in source_file:
                    try:
                        log_line = re.findall(combined_regex, line)[0]
                        parsed_lines.append(log_line)
                    except Exception as e:
                        # Lines that do not match are logged and skipped.
                        with open(errors_file, 'at') as errfile:
                            print((line, str(e)), file=errfile)
                        continue
                    linenumber += 1
                    # Flush every 250,000 parsed lines to its own parquet file.
                    if linenumber % 250_000 == 0:
                        df = pd.DataFrame(parsed_lines, columns=columns)
                        df.to_parquet(f'{output_dir}/file_{linenumber}.parquet')
                        parsed_lines.clear()
                # Write the remaining lines; skipped when empty so the last
                # full chunk is not overwritten by an empty file.
                if parsed_lines:
                    df = pd.DataFrame(parsed_lines, columns=columns)
                    df.to_parquet(f'{output_dir}/file_{linenumber}.parquet')
                    parsed_lines.clear()
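
A minimal sketch of what the slide 11 regex extracts from a line, and of the kind of type conversion slide 8 refers to. The pattern is the combined-format regex from slide 11, written as a raw string. The two sample log lines are made up for illustration, and the choice of target types (datetime, category, nullable integer) is an assumption, not necessarily what the talk used.

```python
import re

import pandas as pd

# Combined log format pattern from slide 11, as a raw string.
combined_regex = (
    r'^(?P<client>\S+) \S+ (?P<userid>\S+) \[(?P<datetime>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<request>[^ "]+)? HTTP/[0-9.]+" '
    r'(?P<status>[0-9]{3}) (?P<size>[0-9]+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<useragent>[^"]*)'
)
# Assumption: column names are taken from the regex group names.
columns = list(re.compile(combined_regex).groupindex)

# Hypothetical log lines, not taken from the dataset.
sample_lines = [
    '203.0.113.7 - - [22/Jan/2019:03:56:14 +0330] "GET /image/1.jpg HTTP/1.1" '
    '200 5667 "https://example.com/" "Mozilla/5.0"',
    '198.51.100.9 - - [22/Jan/2019:03:56:16 +0330] "GET /missing HTTP/1.1" '
    '404 - "-" "Mozilla/5.0"',
]

# Each match is a tuple with one string per named group.
parsed = [re.findall(combined_regex, line)[0] for line in sample_lines]
df = pd.DataFrame(parsed, columns=columns)

# Everything is a string at this point; convert to more efficient types.
df['datetime'] = pd.to_datetime(df['datetime'], format='%d/%b/%Y:%H:%M:%S %z')
df['status'] = df['status'].astype('category')
df['method'] = df['method'].astype('category')
# "-" means no size was reported; coerce it to a missing value.
df['size'] = pd.to_numeric(df['size'], errors='coerce').astype('Int64')

print(df.dtypes)
```

Low-cardinality columns such as status and method repeat a handful of values across millions of rows, which is why a categorical type is a plausible fit for them.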