Detecting Visually Similar Web Pages: Application to Phishing Detection. TEH-CHUNG CHEN, SCOTT DICK, JAMES MILLER, University of Alberta. Presented By: Rutvij Shah 2534739.
Main Concept • The principal themes are the formulation of a question (how can the difference between Web pages be measured?) and the implementation of an approach that answers it.
Why Do We Need Similarity Detection? • Similarity detection techniques are used in: • Web search engines • Automated categorization systems • Phishing/spam filtering mechanisms • The aim is to prevent users from becoming victims of malicious activities by filtering out suspicious Web pages with embedded similarity identification technology aimed at detecting malicious pages.
SIMILARITY SIGNATURE • Feature-Based Similarity Measures • What Can’t We Count On for Visual Similarity Identification? • Two pages considered “identical” by users would exhibit vastly different “fingerprints” when feature-based techniques are employed.
THEORETICAL FOUNDATION • Gestalt Theory • The theoretical basis for the approach. • Gestalt visual psychology is based around a number of simple laws: figure/ground, proximity, closure, similarity, and continuation.
Inattentional Blindness • IB can be summed up as the phenomenon of “looking without seeing.” • When IB happens, even though an individual’s eyes are wide open and various objects are imaged on their retinas, individuals seem to perceive nothing.
Supersignals • Supersignals can be thought of as trying to provide an explanation of an individual’s behavior when they encounter a complex, but familiar, situation.
OBJECTIFICATION OF THE SIMILARITY METRIC • Kolmogorov complexity can be viewed as the limiting case for compression technology. • Claim: the Normalized Information Distance (NID) can “discover all similarities between two arbitrary entities; and represents object similarity according to the dominating shared features between two objects.”
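For reference, the standard definition of NID from the information-distance literature (it is not spelled out on the slide) is given below, where K(x) denotes the Kolmogorov complexity of x and K(x|y) the conditional complexity of x given y:

```latex
\mathrm{NID}(x, y) = \frac{\max\{K(x \mid y),\; K(y \mid x)\}}{\max\{K(x),\; K(y)\}}
```

Because K is not computable, NID cannot be evaluated exactly and has to be approximated in practice.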
Normalized Compression Distance (NCD) • NCD is described as a parameter-free distance metric. • Compression Algorithms and Supersignals • Gzip: Its reliability, speed, and simplicity make it the most popular compressor. • Bzip2: A fast compressor that uses the block-sorting algorithm.
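A minimal sketch of how NCD can be computed with the two compressors named above. It uses the commonly cited formula NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length. The function names are illustrative, and Python's `zlib` module (DEFLATE, the algorithm behind Gzip) stands in for Gzip itself; neither choice comes from the slides.

```python
import bz2
import zlib


def compressed_size(data: bytes, compressor: str = "zlib") -> int:
    """Return the length in bytes of `data` after compression."""
    if compressor == "zlib":   # DEFLATE, the algorithm behind Gzip
        return len(zlib.compress(data, 9))
    if compressor == "bz2":    # block-sorting compressor (Bzip2)
        return len(bz2.compress(data, 9))
    raise ValueError(f"unknown compressor: {compressor!r}")


def ncd(x: bytes, y: bytes, compressor: str = "zlib") -> float:
    """NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx = compressed_size(x, compressor)
    cy = compressed_size(y, compressor)
    cxy = compressed_size(x + y, compressor)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Values near 0 suggest that the two inputs share most of their content, while values near 1 suggest that they share little. Real compressors only approximate the ideal, so NCD(x, x) is usually small but not exactly 0.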
APPLICATION TO ANTI-PHISHING TECHNOLOGIES • What is Phishing? • Phishing is a type of online identity theft in which sensitive information is obtained by misleading people into accessing a malicious Web page. • Motivation • Existing Anti-Phishing Solutions
Existing Anti-Phishing Solutions • They are closely related to anti-spam solutions. • Anti-phishing toolbars are the most popular. • A toolbar determines the currently viewed URL and sends it to a blacklist or whitelist database for filtering. • The result, either an assurance or an alert warning, is delivered back to the user.
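The toolbar lookup described above can be sketched roughly as follows. This is a simplified illustration, not any real toolbar's logic: the function name `check_url`, the host-level matching, and the extra "unknown" outcome for URLs found in neither list are all assumptions.

```python
from typing import Set
from urllib.parse import urlsplit


def check_url(url: str, blacklist: Set[str], whitelist: Set[str]) -> str:
    """Classify a URL by looking up its host name in two sets of known hosts."""
    host = (urlsplit(url).hostname or "").lower()
    if host in blacklist:
        return "alert"       # known phishing host: warn the user
    if host in whitelist:
        return "assurance"   # known legitimate host
    return "unknown"         # assumption: the lists have no verdict for this host
```

A lookup of this kind only recognizes hosts that are already listed.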
Another trick is known as “DNS/URL redirection” or “domain forwarding.” • It fools the blacklist/whitelist (B/W) databases by rapidly changing the DNS/URL-to-IP address mapping in a dynamic DNS domain server. Mutual authentication: • The client can make sure they are browsing the legitimate Web site by setting up a secure connection with the server.
A Key Characteristic of Phishing Web Sites • The phisher’s goal is to make the phishing Web site resemble the legitimate Web site. • PhishTank is used to provide “accurate and actionable” information to the anti-phishing community.
EMPIRICAL EVALUATION • The Twelve-Pairs Experiment • The objective of this experiment is to see whether twelve legitimate Web pages and twelve phishing pages, each targeting one of the legitimate pages, can be grouped together in pairs. • Design and Methodology • Each page is compared with all the sample Web sites. • Lower NCD values indicate greater similarity.
Interpretation of Results • The “-L” in this table refers to the legitimate Web site of that brand, while “-P” denotes a phishing Web page targeting that brand. • Here, RBC-L is most similar to RBC-P in this group of Web pages.
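Reading such a table amounts to finding, for each page, the other page with the lowest NCD. A minimal sketch under stated assumptions: pages are available as raw byte strings, `zlib` is the compressor, and the names `ncd` and `most_similar` are illustrative rather than taken from the study.

```python
import zlib
from typing import Dict


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance, with zlib as the compressor."""
    c = lambda d: len(zlib.compress(d, 9))
    cx, cy = c(x), c(y)
    return (c(x + y) - min(cx, cy)) / max(cx, cy)


def most_similar(pages: Dict[str, bytes]) -> Dict[str, str]:
    """For each page, return the name of the other page with the lowest NCD.

    Requires at least two pages; min() raises ValueError otherwise.
    """
    result = {}
    for name, data in pages.items():
        others = [n for n in pages if n != name]
        result[name] = min(others, key=lambda n: ncd(data, pages[n]))
    return result
```

With keys that follow the "-L"/"-P" convention, the pairing succeeds for a brand when its "-L" page maps to its "-P" page and vice versa.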
Design and Methodology • Figure: Quartet tree visualization for the twelve-pairs experiment.
The Clustering Experiment • This experiment examines the performance of the NCD similarity technique when the groups of highly similar Web sites are not balanced in size.
Design and Methodology: This is similar to the Twelve-Pairs Experiment’s Design and Methodology. • Interpretation of Results:
The Large-Scale Experiment • Objective: To put the similarity-based anti-phishing technique to a realistic test. • Expected result: A statistically significant difference in the means of the two populations, specifically with the mean of the latter group being lower.
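One way to check for a difference in means between two samples of NCD values is Welch's unequal-variance t-test. The slides do not say which test was used, so the choice of Welch's test and the function name `welch_t` are assumptions made for illustration only.

```python
import math
import statistics
from typing import Sequence, Tuple


def welch_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return (t statistic, degrees of freedom) for Welch's t-test.

    Each sample needs at least two values and a non-zero combined variance.
    """
    va = statistics.variance(a) / len(a)   # squared standard error of mean(a)
    vb = statistics.variance(b) / len(b)   # squared standard error of mean(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df
```

Calling `welch_t(latter, former)` gives a negative t statistic when the latter group has the lower mean; the p-value then follows from a t distribution with `df` degrees of freedom.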
Design • Goal: To examine how the NCD similarity technique would perform in a realistic, browser-level anti-phishing scenario. • When we visit a Web site, we automatically execute an image capture, followed by a comparison (using the NCD similarity technique) against all Web sites in the whitelist. If there is a strong similarity to one of the whitelisted sites (i.e., the NCD is unusually low), we signal an alert.
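The check described above can be sketched as a comparison of the captured image against every whitelisted image. This is only an outline under stated assumptions: images are raw byte strings, `zlib` is the compressor, and the function name `check_page` and its `threshold` parameter are hypothetical. The slides do not give a cut-off for "unusually low".

```python
import zlib
from typing import Dict, Optional


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance, with zlib as the compressor."""
    c = lambda d: len(zlib.compress(d, 9))
    cx, cy = c(x), c(y)
    return (c(x + y) - min(cx, cy)) / max(cx, cy)


def check_page(snapshot: bytes, whitelist: Dict[str, bytes],
               threshold: float) -> Optional[str]:
    """Return the whitelisted site the snapshot resembles, or None.

    A non-None result corresponds to signalling an alert.
    """
    if not whitelist:
        return None
    best = min(whitelist, key=lambda name: ncd(snapshot, whitelist[name]))
    if ncd(snapshot, whitelist[best]) < threshold:
        return best
    return None
```

A practical version would presumably skip the check when the visited URL itself belongs to a whitelisted site, since the legitimate page is naturally similar to its own whitelist entry.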
Methodology • Interpretation of Results
Robustness against Countermeasures • Figure: The effects of local noise on NCD values.
Nonstructural Distortions • Figure: (a) Phish before 40% of the pixels have been changed. (b) Phish after 40% of the pixels have been changed.
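A distortion of this kind can be simulated by changing a fixed fraction of the values in a pixel buffer and recomputing the NCD against the original. The sketch below rests on assumptions: a flat byte string stands in for the page image, `zlib` is the compressor, and `add_noise` is a hypothetical helper, not the procedure used in the study.

```python
import random
import zlib


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance, with zlib as the compressor."""
    c = lambda d: len(zlib.compress(d, 9))
    cx, cy = c(x), c(y)
    return (c(x + y) - min(cx, cy)) / max(cx, cy)


def add_noise(pixels: bytes, fraction: float, seed: int = 0) -> bytes:
    """Return a copy of `pixels` with `fraction` (0 to 1) of its bytes changed."""
    rng = random.Random(seed)              # seeded so runs are repeatable
    noisy = bytearray(pixels)
    count = round(fraction * len(noisy))
    for i in rng.sample(range(len(noisy)), count):
        # adding 1..255 modulo 256 guarantees the new value differs
        noisy[i] = (noisy[i] + rng.randrange(1, 256)) % 256
    return bytes(noisy)
```

Comparing `ncd(original, add_noise(original, f))` for increasing `f` gives a rough picture of how quickly the similarity signal degrades as more pixels are altered.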
Conclusion • The concepts of Gestalt theory and supersignals provide us with a theoretical rationale for the conjecture that Web pages must be treated as indivisible entities (i.e., a whole) to be congruent to human perceptions. • We use the domain of anti-phishing technology to derive test scenarios for our experiments, as visual similarity between a phishing page and its target is an essential part of the phishing scam.