24. …
• 84% of phishing sites exist for less than 24 hours, and some are online for less than 15 minutes.
• Almost all phishing sites are hidden within legitimate domains.
Figure labels: Changing Fast, Cross-platform, Hacked Server
25. 1. Establishment and maintenance of infrastructure
– Collection of public phishing data
– Updating of blacklist
– Streaming analysis with Storm
– The analysis of duplication
2. The evolution of detection and prevention technology
– List-based
– Visual-based
– Feature-based
– Ensemble model
26. Phishing site → Crawler → Feature Extraction → Detection Model → Analysis & Report
• HTML, CSS, JavaScript, Fonts, …
• HTTP response header, DNS record, WHOIS record, IP address, URL, SSL record, Screenshot, …
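As an illustration of the feature extraction stage, the sketch below derives a few features from the URL alone. The slides only name "URL" as a data source; the specific features and the function name `url_features` are assumptions made for this example, not the feature set used in the talk.

```python
import ipaddress
from urllib.parse import urlparse


def url_features(url: str) -> dict:
    """Return a few simple URL-derived features (illustrative choices only)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    # Check whether the host is a raw IP address instead of a domain name.
    try:
        ipaddress.ip_address(host)
        host_is_ip = True
    except ValueError:
        host_is_ip = False
    return {
        "url_length": len(url),
        "uses_https": parsed.scheme == "https",
        "host_is_ip": host_is_ip,
        "host_dots": host.count("."),
        "has_at_symbol": "@" in url,
    }


features = url_features("https://login.example.com/a")
```

Other sources listed on the slide (DNS, WHOIS, SSL, screenshots) would need network lookups and are left out of this sketch.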
30. • Multi-threaded crawlers are deployed with container technology on Google Cloud Platform (GCP).
• Features and screenshots are extracted and stored in file storage, an image server, and a MongoDB database.
Figure components: Data sources (Legitimate sites, Phishing sites, …), URL Fetcher, URL Pool, Web Crawlers (Crawler 1, Crawler 2, Crawler 3, … Crawler N), Feature Extractor, File Storage, Image Server, MongoDB, Analysis
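The URL pool and the N crawlers can be sketched with a shared queue and worker threads. This is a minimal single-process sketch, not the containerized GCP deployment described on the slide; `crawl_pool` and the injected `fetch` callable are hypothetical names, and storage is reduced to an in-memory dictionary.

```python
import queue
import threading
from typing import Callable, Dict, Iterable


def crawl_pool(urls: Iterable[str],
               fetch: Callable[[str], str],
               n_crawlers: int = 4) -> Dict[str, str]:
    """Drain a URL pool with n_crawlers threads and collect fetched pages."""
    pool: "queue.Queue[str]" = queue.Queue()
    for url in urls:
        pool.put(url)

    results: Dict[str, str] = {}
    lock = threading.Lock()

    def crawler() -> None:
        while True:
            try:
                url = pool.get_nowait()
            except queue.Empty:
                return  # pool is empty, this crawler stops
            try:
                page = fetch(url)
            except Exception as exc:  # keep crawling if one site fails
                page = f"ERROR: {exc}"
            with lock:
                results[url] = page

    threads = [threading.Thread(target=crawler) for _ in range(n_crawlers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# Usage with a stand-in fetcher (no network access).
pages = crawl_pool(["a", "b", "c"], lambda u: u.upper(), n_crawlers=2)
```

Passing `fetch` as a parameter keeps the sketch testable without network access; a real crawler would plug in an HTTP client and write to the storage back ends instead of a dictionary.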
39. …
• Domain-based features
– Age of Domain
– DNS Record
– Website Traffic
– PageRank
– Google Index
– Number of Links Pointing to Page
– Statistical-Reports Based Feature
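As one small example from the list above, "Age of Domain" can be computed from a domain's creation date. The sketch assumes the creation date has already been read from the WHOIS record; `domain_age_days` is a hypothetical helper, and no threshold for what counts as suspicious is implied.

```python
from datetime import date


def domain_age_days(created: date, observed: date) -> int:
    """Number of days between domain creation and the observation date."""
    return (observed - created).days


age = domain_age_days(date(2024, 1, 1), date(2024, 1, 31))
```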
54. Two-stage phishing detection model
• The two-stage phishing detection model combines a validation model with a detection model
– Non-phish = invalid + legitimate
• Build the validation model from manually labeled data
– Apply a supervised learning algorithm
– Apply active learning
• Improve the performance of the detection model with the validated phishing data
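The slide does not say which active learning strategy is used. A common choice is uncertainty sampling: send the pages the current validation model is least sure about to manual labeling. The sketch below assumes the model outputs a probability that a page is valid; `select_for_labeling` and its parameters are hypothetical.

```python
from typing import Dict, List


def select_for_labeling(valid_prob: Dict[str, float], budget: int) -> List[str]:
    """Pick the `budget` URLs whose predicted probability is closest to 0.5."""
    ranked = sorted(valid_prob, key=lambda url: abs(valid_prob[url] - 0.5))
    return ranked[:budget]


to_label = select_for_labeling(
    {"a": 0.95, "b": 0.52, "c": 0.10, "d": 0.40}, budget=2
)
```

After the selected pages are labeled, they are added to the training set and the validation model is retrained, and the cycle repeats.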
Figure: Two-stage model. The validation model separates Invalid from Valid pages; the detection model separates Legitimate from Phish. Target labels: Non-Phish and Phish.
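The control flow of the two-stage model can be sketched as follows. The two models are passed in as plain callables; `two_stage_label`, `is_valid`, and `is_phish` are hypothetical names, and how each model is trained is outside this sketch.

```python
from typing import Any, Callable


def two_stage_label(page: Any,
                    is_valid: Callable[[Any], bool],
                    is_phish: Callable[[Any], bool]) -> str:
    """Map a page to the target label: 'phish' or 'non-phish'."""
    # Stage 1: validation model. Invalid pages count as non-phish.
    if not is_valid(page):
        return "non-phish"
    # Stage 2: detection model. Valid pages are legitimate or phish.
    return "phish" if is_phish(page) else "non-phish"


label = two_stage_label(
    {"valid": True, "phish": True},
    is_valid=lambda p: p["valid"],
    is_phish=lambda p: p["phish"],
)
```

Running the validation model first means the detection model only sees pages that are still live, which matches the rule Non-phish = invalid + legitimate.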
55. Phishing data validation
• A page is called invalid when it meets any of the following conditions:
– Offline: the website is not reachable, e.g. status code 404.
– Redirection: the page redirects to a legitimate page.
– Invalid content: the content of the page has changed and contains an invalid keyword such as “this account has been suspended” or “the page is forbidden”.
Examples: [Invalid content] The account has been suspended by the host provider. [Redirection] Redirect to the Google homepage.
Construct a validation classifier!
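Before training a classifier, the three conditions above can be written down as a rule-based baseline. This is a sketch under assumptions, not the validation classifier from the talk: treating every status code of 400 or above as offline, the `legitimate_hosts` set, and the function name `invalid_reason` are all choices made for this example. Only the two keywords quoted on the slide are used.

```python
from typing import Optional, Set
from urllib.parse import urlparse

# Keywords quoted on the slide; a real list would be longer.
INVALID_KEYWORDS = ("this account has been suspended", "the page is forbidden")


def invalid_reason(status_code: Optional[int],
                   requested_url: str,
                   final_url: str,
                   body: str,
                   legitimate_hosts: Set[str]) -> Optional[str]:
    """Return why a page is invalid, or None if it looks valid."""
    # Offline: no response at all, or an error status such as 404.
    if status_code is None or status_code >= 400:
        return "offline"
    # Redirection: the page ended up on a known legitimate host.
    requested_host = urlparse(requested_url).hostname
    final_host = urlparse(final_url).hostname
    if final_host != requested_host and final_host in legitimate_hosts:
        return "redirection"
    # Invalid content: the page shows a takedown or suspension message.
    text = body.lower()
    if any(keyword in text for keyword in INVALID_KEYWORDS):
        return "invalid content"
    return None


reason = invalid_reason(200, "http://a.example/x", "https://www.google.com/",
                        "", {"www.google.com"})
```

Rules like these can also supply initial labels or features for the learned validation model.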
56. Examples of invalid pages
• The page has been removed
• Blocked by host provider
• Domain parking
• Redirect to homepage
• Error message from host provider
• Redirect to legitimate site
61. The rules of manual labeling
Figure labels: the screenshot on PhishTank, the screenshot we took, URL and host information, label area
1. Check the screenshots to confirm whether the page is invalid
2. Check the URL and WHOIS record to confirm whether it is invalid
3. Look up the website in a search engine to confirm whether it is invalid