
How Machine Learning Transforms Cyber

source link: https://towardsdatascience.com/how-machine-learning-transforms-cyber-fb7aca17a1cc?gi=b68232c2aebd

Cybersecurity spans a variety of missions: prediction, prevention, detection, and response (Gartner PPDR); a variety of ‘places’: network (intrusion detection), endpoint (anti-malware), application (firewalls), user (behavior analytics), and process (anti-fraud); and a variety of times: in real time (transit), at rest (monitoring), and during investigation (retrospect).

On the face of it, Machine Learning has huge potential to add value in Cyber: from its innate ability to detect patterns in big data that elude the human eye, through its fit for fast-scaling, constantly changing real-time data, to its aptitude for hyper-personalization. Yet the use of ML in Cyber does not match its popularity in vision or language. Why?

Challenges of Cyber ML

The abundance of training data in vision and language, coupled with the relative ease of labeling, stands in stark contrast to the scarcity of well-labeled data in cyber. It is one thing to ask Mechanical Turkers to label objects in images and another to detect an attack in log files. Not to mention organizations’ reluctance to share data that exposes their vulnerabilities and shortcomings.

Signature-matching algorithms that detect known attacks (misuse) are inefficient at detecting novel (zero-day) attacks. Anomaly detection approaches that flag deviations from the norm are susceptible to false positives, stemming from the difficulty of differentiating between normal and abnormal behaviors, and from continuously changing malicious behaviors.
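
To make the false-positive problem concrete, here is a minimal sketch (my own toy illustration with synthetic data, using scikit-learn's IsolationForest): a detector trained only on benign traffic still flags a slice of that same benign traffic as anomalous, which is exactly the alert-fatigue tax discussed below.

```python
# Toy sketch: an anomaly detector trained on "normal" traffic features still
# flags a fraction of normal behavior as anomalous (false positives).
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(10_000, 8))  # benign traffic features
novel = rng.normal(loc=4.0, scale=1.0, size=(50, 8))       # a zero-day-like shift

detector = IsolationForest(contamination=0.01, random_state=0).fit(normal)

# predict() returns -1 for anomalies, 1 for normal points.
fp_rate = (detector.predict(normal) == -1).mean()   # false positives on benign data
recall = (detector.predict(novel) == -1).mean()     # caught novel behavior

print(f"false-positive rate on normal traffic: {fp_rate:.2%}")
print(f"detection rate on shifted behavior:    {recall:.2%}")
```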

While language, vision, and even recommendation models are often considered quasi-stationary, cyber models need to be retrained daily or upon detection of new attack or malicious data. This strain on the ML workflow can become a limiting constraint on model training time and deployment processes, especially across multiple places (edge, endpoints, applications, networks…).

These algorithmic challenges, exacerbated by imbalanced data sets and frequent retraining requirements, lead to an overflow of false positives. It is no wonder that security personnel suffering from alert fatigue become ML skeptics, so new ML product initiatives often have to battle a bad reputation.

Change is coming…

What is changing now

1. Data

Data is being accumulated like at no other time in history. The realization is starting to sink in that even mundane data, which does not seem to have any value today, may be extremely valuable in the future for training models and determining normative behaviors.

The shift of many organizations to the cloud, coupled with the ever-growing ease of sharing data, is starting to show results. Data sets for Cyber ML are accumulating, making it easier for data scientists to test their hypotheses.

Even companies that are reluctant to share cyber-related data publicly (or with academia) are starting to reach the critical mass needed to apply unsupervised ML algorithms to their own data.

2. Tooling

The ML Rush attracts many builders, students, and adjacent professionals to data science. In recent years this has manifested not only in the growth of educational programs (online and offline) but also in a flood of novice data scientists. Many ML and cloud vendors saw this coming and built automation and semi-automation tools that make it easy for a novice data scientist to get started with ML.

These tools range from pre-trained models available out of the box via API to automation of the modeling process (in part or in whole). Open-source algorithms, notebooks, and the arXiv ocean enable the novice data scientist to focus more on finding similar examples and reproducing workflows than on inventing ML from scratch.

Similarly, experienced data scientists take advantage of these fast-evolving tools to shed the time-consuming MLOps work and focus on modeling. By automating certain processes (e.g., professional labeling jobs) and giving data scientists more visibility into the model and the training process (monitoring and retraining), these tools let the experienced data scientist be a lot more effective. And given the scarcity of experienced data scientists, that is worth a lot!

3. Maturity

The ML-can-fix-it-all hype is over, and not just in cyber. Builders, and even more so users and product managers, have come to the realization that ML is a tool with great potential but just as significant limitations. Methodologies for evaluating use-case fit for ML attest to the understanding that ML works when applied to the right problem.

The plethora of meetups on ML PoCs and on product-manager collaboration with data scientists indicates that maturity has grown beyond use-case fit. The desire to fail fast and apply lean methodology to ML has led to AutoML, pre-trained models, and open-source algorithms that can be applied nearly out of the box to reach a PoC in a matter of hours or a few days.

Even the maturity of the more popular siblings, vision and language, comes into play. Facial recognition tools eliminate most of the heavy lifting of biometric validation. NLU engines and word2vec algorithms make the art of feature engineering a lot easier. Neural nets that excel at learning features from raw images can do wonders learning from raw binary malware, leaving cyber professionals to deal with the domain rather than the ML iceberg underneath.

What’s ahead for Cyber ML

Anomalies can be detected using unsupervised neural-network approaches that analyze time-series data. Individual behaviors can be learned, and LSTM RNNs trained to better detect changes in a time series that could indicate abnormality, whether at a specific point, in context, or in combination with other points. This is analogous to tokenizing conversations between computers as if they were a language, with words and sentences that have typical and atypical demeanor.
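
A minimal sketch of the idea, with hypothetical window shapes and placeholder data (Keras): an LSTM autoencoder is trained to reconstruct benign sequences, and windows whose reconstruction error exceeds a benign-derived threshold are flagged as anomalous.

```python
# Sketch: LSTM autoencoder for sequence anomaly detection (assumed shapes/data).
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

timesteps, n_features = 50, 4            # e.g. 50 events per window, 4 features each
X_benign = np.random.rand(1000, timesteps, n_features).astype("float32")  # placeholder

model = keras.Sequential([
    keras.Input(shape=(timesteps, n_features)),
    layers.LSTM(64),                      # encode the window into a single vector
    layers.RepeatVector(timesteps),       # repeat it for the decoder
    layers.LSTM(64, return_sequences=True),
    layers.TimeDistributed(layers.Dense(n_features)),  # reconstruct each timestep
])
model.compile(optimizer="adam", loss="mse")
model.fit(X_benign, X_benign, epochs=10, batch_size=64, verbose=0)

# Windows whose reconstruction error exceeds a benign-derived threshold are flagged.
recon = model.predict(X_benign, verbose=0)
errors = np.mean((recon - X_benign) ** 2, axis=(1, 2))
threshold = np.percentile(errors, 99)
```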

Advances in feature extraction and feature engineering come from analyzing deep interactions between variables. In Intrusion Detection Systems (IDS), advanced features can be extracted from network headers (e.g., source IP, destination IP, IP length, source port), TCP connections (duration, protocol type, number of data bytes), 2-second time or 100-connection windows (number of connections to the same host, SYN/REJ error rates, percentage of connections with the same service), and domain knowledge (number of failed login attempts, compromised conditions, root accesses, shell prompts, etc.).
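
As a toy illustration (hypothetical connection schema and hand-made records), here is one way such window features could be computed:

```python
# Sketch: KDD-style window features over connection records (hypothetical schema).
import pandas as pd

conns = pd.DataFrame([
    # columns: timestamp (seconds), source, destination, service, TCP flag
    {"ts": 0.1, "src": "10.0.0.5", "dst": "10.0.0.9", "service": "http", "flag": "SF"},
    {"ts": 0.4, "src": "10.0.0.5", "dst": "10.0.0.9", "service": "http", "flag": "S0"},
    {"ts": 1.2, "src": "10.0.0.5", "dst": "10.0.0.7", "service": "ssh",  "flag": "REJ"},
])

def window_features(df: pd.DataFrame, ts: float, dst: str, window: float = 2.0) -> dict:
    """Features over connections to `dst` in the last `window` seconds."""
    recent = df[(df["ts"] > ts - window) & (df["ts"] <= ts)]
    same_host = recent[recent["dst"] == dst]
    count = len(same_host)
    return {
        "count_same_host": count,
        "syn_error_rate": (same_host["flag"] == "S0").mean() if count else 0.0,
        "rej_error_rate": (same_host["flag"] == "REJ").mean() if count else 0.0,
        "same_service_pct": same_host["service"].value_counts(normalize=True).max() if count else 0.0,
    }

print(window_features(conns, ts=1.2, dst="10.0.0.9"))
```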

Contextual, conditional, and semantic patterns can be recognized using fuzzy association rules. Multidimensional rules can surface new signatures for inclusion in misuse detection systems. Latent relationships can be unveiled using graph algorithms and Bayesian networks. By constantly growing and improving relationship identification in graph databases, companies can accelerate response time and contain related attacks more accurately.
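
Here is a minimal sketch of the graph-relationship half of this idea, with networkx on toy entities (hosts, users, files); the connected component around a flagged artifact approximates the blast radius to contain and investigate together.

```python
# Sketch: unveiling latent relationships with a graph of hosts, users, and
# artifacts (toy data); the component around a flagged file scopes containment.
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("host:10.0.0.5", "user:alice"), ("user:alice", "host:10.0.0.9"),
    ("host:10.0.0.9", "file:dropper.exe"), ("file:dropper.exe", "host:10.0.0.12"),
    ("host:10.0.0.33", "user:bob"),   # unrelated island
])

flagged = "file:dropper.exe"
blast_radius = nx.node_connected_component(G, flagged)   # all related entities
print(sorted(blast_radius))

# Central entities are good candidates for prioritized response.
print(sorted(nx.degree_centrality(G).items(), key=lambda kv: -kv[1])[:3])
```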

Interventional scale can be achieved using pre-identified density-based clusters. By knowing which cluster of similar behaviors an attacker belongs to, orchestrations can be implemented and latent attacks prevented. Clusters can also be combined with decision trees, allowing parallel evaluation of features.
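
A small sketch of that combination, using scikit-learn on synthetic behavior features (DBSCAN parameters would need tuning on real data): the decision tree turns each discovered cluster into readable splits that can be evaluated quickly and in parallel.

```python
# Sketch: density-based clustering of behavior profiles, then a shallow tree
# that describes each cluster with simple feature splits.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier, export_text

X, _ = make_blobs(n_samples=500, centers=4, n_features=6, random_state=0)  # stand-in features
labels = DBSCAN(eps=2.0, min_samples=5).fit_predict(X)   # -1 marks noise/outliers

# Fit a shallow tree on the clustered points so each cluster gets readable rules.
mask = labels != -1
tree = DecisionTreeClassifier(max_depth=3).fit(X[mask], labels[mask])
print(export_text(tree, feature_names=[f"f{i}" for i in range(X.shape[1])]))

# New behavior can now be routed to its cluster's playbook/orchestration.
```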

ML can be integrated into digital forensic investigation to automate work and empower non-experts to help with some of the forensic backlog. Malware detection and classification can be carried out using static code features as well as dynamically executed code.
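
A toy sketch of the static side (illustrative features only, on random placeholder binaries; real systems use far richer static and dynamic features than byte statistics):

```python
# Sketch: simple static features from raw binaries feeding a classifier.
import math
from collections import Counter
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def static_features(raw: bytes) -> list:
    counts = Counter(raw)
    probs = [c / len(raw) for c in counts.values()]
    entropy = -sum(p * math.log2(p) for p in probs)            # byte entropy
    hist = [counts.get(b, 0) / len(raw) for b in range(256)]   # byte histogram
    return [len(raw), entropy] + hist

# Hypothetical labeled corpus: (file_bytes, is_malicious) pairs.
corpus = [(np.random.randint(0, 256, 1024, dtype=np.uint8).tobytes(), i % 2)
          for i in range(100)]
X = np.array([static_features(b) for b, _ in corpus])
y = np.array([label for _, label in corpus])

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
```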

The number of open data sets, open-source algorithms, and collaborative expert communities is growing (even if not yet optimal), enabling faster time to model and new hybrid versions that integrate open-source methods with proprietary ones to achieve better accuracy.

Data overflows and static code obfuscations that inhibit real-time response can be overcome with conjunctive rule extraction, dimensionality reduction using data categorization techniques (based on content, time, source, and destination), and an ensemble of recurrent neural networks that predicts whether an executable is malicious within the first 5 seconds of its execution.
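
A sketch of the early-prediction idea with an ensemble of small recurrent networks (Keras, hypothetical event shapes and placeholder labels): each member is trained on a bootstrap sample of truncated execution traces, and their scores are averaged at inference time.

```python
# Sketch: ensemble of small GRU classifiers scoring an executable from its
# first seconds of behavioral events (assumed shapes, placeholder data).
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

steps, n_features = 100, 16   # e.g. the first ~5 seconds of telemetry events
X = np.random.rand(2000, steps, n_features).astype("float32")
y = np.random.randint(0, 2, 2000)

def make_model():
    m = keras.Sequential([
        keras.Input(shape=(steps, n_features)),
        layers.GRU(32),
        layers.Dense(1, activation="sigmoid"),
    ])
    m.compile(optimizer="adam", loss="binary_crossentropy")
    return m

# Train each member on a bootstrap sample; average scores at inference time.
ensemble = []
for seed in range(3):
    idx = np.random.default_rng(seed).integers(0, len(X), len(X))
    m = make_model()
    m.fit(X[idx], y[idx], epochs=3, batch_size=128, verbose=0)
    ensemble.append(m)

score = np.mean([m.predict(X[:1], verbose=0) for m in ensemble])  # P(malicious)
```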

Evolutionary Computation (genetic algorithms) applies survival-of-the-fittest principles by evolving a set of initial (known) rules into new rules using four genetic operators: reproduction, crossover, mutation, and dropping. The fitness function can be the support and confidence of a newly created rule.
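
A compact sketch of such a loop on a toy labeled connection log, where fitness multiplies a rule's support by its confidence (all data and field names are made up for illustration):

```python
# Sketch: evolving detection rules with a genetic algorithm.
import random

records = [  # toy labeled events
    {"port": 445, "proto": "tcp", "flag": "S0", "malicious": True},
    {"port": 445, "proto": "tcp", "flag": "SF", "malicious": True},
    {"port": 80,  "proto": "tcp", "flag": "SF", "malicious": False},
    {"port": 53,  "proto": "udp", "flag": "SF", "malicious": False},
] * 25

FIELDS = {"port": [445, 80, 53], "proto": ["tcp", "udp"], "flag": ["S0", "SF"]}

def fitness(rule):  # a rule is a dict of conditions implying "malicious"
    matched = [r for r in records if all(r[k] == v for k, v in rule.items())]
    if not rule or not matched:
        return 0.0
    support = len(matched) / len(records)
    confidence = sum(r["malicious"] for r in matched) / len(matched)
    return support * confidence

def mutate(rule):                      # change or add one condition
    rule = dict(rule)
    k = random.choice(list(FIELDS))
    rule[k] = random.choice(FIELDS[k])
    return rule

def crossover(a, b):                   # inherit each condition from either parent
    child = {}
    for k in set(a) | set(b):
        pick = random.choice([a, b])
        if k in pick:
            child[k] = pick[k]
    return child

def drop(rule):                        # remove one condition (generalize)
    rule = dict(rule)
    if len(rule) > 1:
        rule.pop(random.choice(list(rule)))
    return rule

population = [mutate({}) for _ in range(20)]     # initial (known/seed) rules
for _ in range(30):
    population.sort(key=fitness, reverse=True)
    parents = population[:10]                    # reproduction: keep the fittest
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(8)]
    children += [drop(random.choice(parents)) for _ in range(2)]
    population = parents + children

best = max(population, key=fitness)
print(best, fitness(best))
```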

Summary

ML adoption in Cyber lags behind vision and language, challenged by the difficulty of obtaining labeled data, the fast-changing, creative nature of zero-day attacks and malicious actions, and the need for frequent retraining.

However, accumulating data sets (both normative and abnormal); new tools that accelerate labeling (professional), training (distributed), deployment (hybrid), and monitoring (real-time), all the way to out-of-the-box models for the novice data scientist; and the growing maturity of research and of organizational understanding of ML all lead to a new era of Cyber ML.

Going forward, we see a variety of innovative approaches assisting in the various aspects of cyberwarfare: from RNNs, time-series and sequence anomalies, deep feature relationships, fuzzy rules, clustering, graphs, and ensemble learning, all the way to evolutionary computing.

These are exciting times to be in Cyber ML!

— — — —

This post is my opinion and does not represent my current or past employers. It was first published on OrenSteinberg.com

