<p><strong>Deep Learning at Hyperfine Research</strong></p>
<p>Michal Sofka, 2020-07-28</p>
<p>As soon as Hyperfine was getting the first scans, I started building out the machine learning competency. We now have a strong team and are building algorithms that improve image quality and run on the scanner and tools for image interpretation that run in the cloud. I am passionate to work on hard and complex problems and there is no lack of those at Hyperfine.</p><!--more-->
<div class="row">
<div class="medium-4 medium-push-8 columns">
<div class="panel radius">
<p id="toc"><strong>Table of Contents</strong></p>
<ul id="markdown-toc">
<li><a href="#ms--101----michals-tech-journey" id="markdown-toc-ms--101----michals-tech-journey">MS 1:01 - Michal’s tech journey</a></li>
<li><a href="#ms--334----hyperfines-mission-to-make-mris-more-accessible" id="markdown-toc-ms--334----hyperfines-mission-to-make-mris-more-accessible">MS 3:34 - Hyperfine’s mission to make MRIs more accessible</a></li>
<li><a href="#ms--559----the-role-of-deep-learning-in-building-smarter-mri-systems" id="markdown-toc-ms--559----the-role-of-deep-learning-in-building-smarter-mri-systems">MS 5:59 - The role of Deep Learning in building smarter MRI systems</a></li>
<li><a href="#ms--919----challenges-the-team-faces-when-getting-the-product-to-market" id="markdown-toc-ms--919----challenges-the-team-faces-when-getting-the-product-to-market">MS 9:19 - Challenges the team faces when getting the product to market</a></li>
<li><a href="#ms--1207----building-and-positioning-the-ai-team-for-success" id="markdown-toc-ms--1207----building-and-positioning-the-ai-team-for-success">MS 12:07 - Building and positioning the AI team for success</a></li>
<li><a href="#ms--1400----developing-software-in-a-regulated-environment" id="markdown-toc-ms--1400----developing-software-in-a-regulated-environment">MS 14:00 - Developing software in a regulated environment</a></li>
<li><a href="#ms--1609----projects-i-am-the-most-excited-about" id="markdown-toc-ms--1609----projects-i-am-the-most-excited-about">MS 16:09 - Projects I am the most excited about</a></li>
<li><a href="#ms--1802----what-am-i-most-enjoying-about-my-current-role" id="markdown-toc-ms--1802----what-am-i-most-enjoying-about-my-current-role">MS 18:02 - What am I most enjoying about my current role</a></li>
<li><a href="#ms--2020----what-future-looks-like-at-hyperfine" id="markdown-toc-ms--2020----what-future-looks-like-at-hyperfine">MS 20:20 - What future looks like at Hyperfine</a></li>
</ul>
</div>
</div><!-- /.medium-4.columns -->
<div class="medium-8 medium-pull-4 columns">
<iframe src="https://anchor.fm/alldus-international/embed/episodes/E111-Michal-Sofka--Deep-Learning-Team-Lead-at-Hyperfine-eg2nq3/a-a2j54jg" height="102px" width="400px" frameborder="0" scrolling="no"></iframe>
<p>Transcript of the AI in Action podcast with <b>host JP Valentine (JP)</b> and <b>guest Michal Sofka (MS)</b>. <!--more--></p>
<h3 class="no_toc" id="jp--031">JP 0:31</h3>
<p>You’re listening to AI in Action. I’m your host, JP Valentine. Our guest today is Michal Sofka. Michal leads the deep learning team at Hyperfine Research. Michal, welcome to the show.</p>
<h3 class="no_toc" id="ms--044">MS 0:44</h3>
<p>Thank you, JP, happy to be here.</p>
<h3 class="no_toc" id="jp--046">JP 0:46</h3>
<p>That’s our pleasure. Michal, let’s start with the background of yourself, how you first got involved in technology, what your interests were, then talk us through some of the roles you’ve held along the way leading us up to your current position with Hyperfine.</p>
<h3 id="ms--101----michals-tech-journey">MS 1:01 - Michal’s tech journey</h3>
<p>I grew up in the Czech Republic and came to the US for grad school. My PhD at the Rensselaer Polytechnic Institute focused on machine learning for various retinal image analysis tasks, and a few problems in computer-aided detection, for example, comparing tumors in lung CT scans. I then went to work at Siemens Corporate Research based in Princeton, New Jersey. It houses about 220 scientists and engineers who are focused on researching and developing emerging technologies, with applications ranging from health care and communications to automation and security. And I personally joined a team that was pioneering machine learning tools for diagnosis and treatment planning. And over there I worked on many many projects including automated measurements in fetal ultrasound, detecting and finding outlines of anatomical structures in CT scans and building software tools for total knee replacement surgery. And after that, when I was looking for my next adventure, many were advising me to do something orthogonal. So I joined a newly acquired network security startup team and worked at Cisco for two years. And my main projects were about machine learning tools for threat defense. And I then found out about the collection of startups in 4Catalyzer, and I was immediately hooked. I joined about four years ago and initially worked on Butterfly Network projects for handheld ultrasound, to improve image acquisition and interpretation. And as soon as Hyperfine was getting the first scans, I started building out machine learning competency in the company. And we now have a strong team. And we’re building algorithms that improve images and run on the scanner, and also tools for image interpretation that run in the cloud. I’m really passionate about working on hard and complex problems, and there’s definitely no lack of those at Hyperfine.</p>
<h3 class="no_toc" id="jp--316">JP 3:16</h3>
<p>Excellent. Well, thank you for that overview. It’s really helpful to understand your journey. So leading us now to your current role at Hyperfine, if you could give us an overview of who Hyperfine are and then give us some insight into the sets of technologies that you’re currently working with.</p>
<h3 id="ms--334----hyperfines-mission-to-make-mris-more-accessible">MS 3:34 - Hyperfine’s mission to make MRIs more accessible</h3>
<p>Hyperfine is a privately held company founded by Jonathan Rothberg in 2014. And the company is on a mission to make MRI accessible to every patient, regardless of income or resources, simply anywhere and anytime. MRI is really truly a technological marvel, but remains broadly inaccessible. Nearly 90 percent of the world has no access to it at all. Let me give you some examples. Japan has 52 scanners per million population. The USA has 37. But Canada has only nine and Israel five, and when we go to India it’s 0.1 scanners per million population. Considering the developing world, it gets even worse. Uganda has four MRI scanners and a population of 43 million, so that’s roughly one MRI for 10 million people. Hyperfine’s point-of-care MRI, which represents multiple innovations in the MRI design, architecture and the workflow, is covered by more than a hundred patents issued or currently pending. And the system itself is highly portable and wheels directly to the patient’s bedside. It plugs into an electrical wall outlet and is controlled via a wireless tablet such as an Apple iPad. It is a big deal since current systems require complicated installations and are lifted with a crane into a specially designed hospital section. Our AI algorithms generate high quality images; they make up for the losses caused by the simplified design, the smaller magnet and the absence of the shielded room. And AI cloud software processes the images for faster diagnosis, decision making and treatment planning.</p>
<div class="row t30">
<div class="col-xs-12">
<img class="cust-padd" src="/images/MRI_not_accessible.png" />
</div>
</div>
<h3 class="no_toc" id="jp--536">JP 5:36</h3>
<p>That’s amazing. So how do you guys do what you’re doing? I mean, this seems like such a massive advancement in technology. It clearly could have huge implications for the medical field, but how are you guys able to make such strides in innovation and more importantly for your role? How does AI and data play a part in this?</p>
<h3 id="ms--559----the-role-of-deep-learning-in-building-smarter-mri-systems">MS 5:59 - The role of Deep Learning in building smarter MRI systems</h3>
<p>Our ongoing effort is to use deep learning in two different workflows. First, we’re looking at deep learning-based image reconstruction. This is the process of producing images of internal organs from physical measurements using sensors. By way of quick introduction, an MRI works by measuring the response of atomic nuclei of body tissues to high frequency radio waves when placed in a strong magnetic field. Put simply, it measures how atoms orient themselves when placed in a magnet. And the speed at which the data can be collected depends on physiological properties of tissues and hardware constraints. Typically, it takes time to do a single scan, and the entire exam of multiple scans might take 30 minutes or more. That’s a long time for this to be practical, but with deep learning, we can shorten this time. Or, alternatively, we can produce higher quality images using the same fixed scanning time. It’s all about this trade-off. One powerful idea that we rely on is to capture fewer measurements, which reduces the scanning time, and reconstruct the image with the same image quality as if it were reconstructed with the full set of measurements. So this is one area. In the second area we’re focused on, the scanner uploads data to the cloud, where it is processed by the deep learning algorithms, powering clinical applications for diagnosis and treatment planning. Our first anatomical target is the brain, and we build tools to automatically measure various structures in the brain and to measure and outline abnormalities. These tools are very critical for accurate diagnosis. Many of the steps would otherwise have to be done in a manual and tedious way, which is amplified by the fact that the data is 3D. So, in a nutshell, this streamlined clinical workflow has utmost importance, especially in emergency departments, one of many environments in need of our machines. One example of where time matters is stroke.
Perhaps you’ve heard the slogan “time is brain”: in a brain with stroke, 1.9 million neurons and 7.5 miles of myelinated fibers are destroyed every minute. To put it in perspective, a brain with stroke ages three weeks every minute. So now AI tools can help outline damaged tissues and provide quantitative information to the stroke teams in a very timely manner.</p>
<h3 class="no_toc" id="jp--844">JP 8:44</h3>
<p>Great to hear about some of the potential use cases and, you know, it really allows us to imagine the scope of impact this could have when it becomes more broadly adopted. What are the main challenges that you and your team face in getting this new product to market, whether it’s from the deep learning algorithms or getting the hardware to every hospital?</p>
<h3 id="ms--919----challenges-the-team-faces-when-getting-the-product-to-market">MS 9:19 - Challenges the team faces when getting the product to market</h3>
<p>It’s really about how to carefully coordinate both hardware and software teams so that we can work in a synchronized way to build these products. Hyperfine is really built around three technological areas: cloud, deep learning, and the MRI device itself. And we were fortunate to attract experts from top universities in the world, and from the best engineering teams. Our mechanical, electrical and device software teams are based in Connecticut and our cloud and deep learning teams are in New York. And just to give you an idea how this works, mechanical and electrical engineers take care of the hardware components, including off-the-shelf and custom manufactured parts. Device software engineers take care of the platforms that run the scanner itself, and medical physicists design instructions to highlight different tissues and abnormalities. And then deep learning scientists and engineers reconstruct the highest quality image and build applications for clinical decision making. Cloud software engineers build our viewer and back end systems for storing, archiving, and analyzing the scanner data. So there’s a great advantage of having all teams work together on the final product, and our machine learning algorithms that improve image quality have access to the entire imaging pipeline. We can modify the way the measurements are obtained using the hardware, and we can use various software and hardware tricks to help reconstruct better images when the patient moves. And we know what kind of interference we can expect in the hospital so that we can address it. The scanner data is stored in the cloud, where it is available immediately for training new systems to further improve the algorithms for image quality improvement and for providing clinical insights. And since the scanner is used differently than traditional MRI, this type of data really paves the way for new clinical applications that have not been really possible to envision so far.</p>
<h3 class="no_toc" id="jp--1140">JP 11:40</h3>
<p>So it’s great to learn about the structure of the team because clearly it’s such a complex project, combining software, hardware, and medical expertise, so it’s good insight to learn how you guys approach it in such a collaborative manner. Speaking specifically about your AI team, what have you learned in your role as the leader of this team, and what’s most important to you when building a successful AI team that innovates and delivers products?</p>
<h3 id="ms--1207----building-and-positioning-the-ai-team-for-success">MS 12:07 - Building and positioning the AI team for success</h3>
<p>That’s a good question. There are a number of roles needed in a highly innovative AI startup, just to make sure that the startup has cutting edge technology and competitive advantage, but also can deliver the products to its customers. Specifically, you will need smart scientists who can think out of the box, design new algorithms for previously unsolved problems and quickly prototype them and test them. They need to know how to address complex challenges in the computational pipeline. And you cannot really find these solutions in available publications. Then you need skilled software engineers who know the latest computing services, developer tools, and cloud platforms. They know how to efficiently implement complicated pipelines that can handle large amounts of data, that can scale adaptively and are flexible to accommodate new features. And then you need subject matter experts who would work with the product manager to ensure that you’re building the right tools. In healthcare, this would be a visionary clinician, who can imagine new workflows, solutions and approaches. And again, they can see how they can be applied to the current needs. This can be hard, since in some situations, your customers cannot really articulate what they need.</p>
<h3 class="no_toc" id="jp--1345">JP 13:45</h3>
<p>So as you guys build the next generation of AI products and, you know, particularly software products in a highly regulated healthcare environment, it’s especially challenging. Can you speak to how you guys are handling these constraints at Hyperfine?</p>
<h3 id="ms--1400----developing-software-in-a-regulated-environment">MS 14:00 - Developing software in a regulated environment</h3>
<p>Yes, this is our day-to-day discussion. There’s a lot of scrutiny around filing AI and machine learning software for FDA clearance, which seems to have intensified, although similar tools existed years ago. So let me clarify. The truth is that previous algorithms were locked prior to marketing and any changes likely require FDA review. However, not all algorithms are locked. Some of those systems being developed today can adapt over time. Even if there is extensive testing and documentation before every release, for example, after retraining the system, it would not be practical to go through another round of the 510(k) clearance process. So the agency, the FDA, is adapting and developing guidance such that this kind of retrain and release cycle is possible without incurring additional risks. And risk is really the key word. With regulatory bodies, it’s all about keeping risks under control. The developers need to ensure that any changes to the released software will not introduce additional risks, or modify existing risks, that could result in significant harm to the patient. And this is the reason why it is so challenging to introduce new self-learning tools that would be adapting to the environment and the user. But this will come eventually.</p>
<h3 class="no_toc" id="jp--1541">JP 15:41</h3>
<p>So there’s certainly a lot that you guys have already accomplished. And I encourage anyone listening to go and look at the Hyperfine product to give you a sense of the advancement, comparing the hardware, costs and mobility to traditional MRI machines which would have taken up, you know, your average New York City one-bedroom apartment. So it’s amazing to see the journey. What are your upcoming projects that you’re most excited about?</p>
<h3 id="ms--1609----projects-i-am-the-most-excited-about">MS 16:09 - Projects I am the most excited about</h3>
<p>I am most excited about the opportunities that this new imaging device will bring. So for the first time, we are able to do quick imaging easily in an emergency department, whenever there is a suspicion of a problem in the patient’s head. And we will be able to learn about diseases such as stroke in order to identify what exactly happened and when, detect what is happening at a particular time, and predict the best possible treatment. So many many interesting and impactful problems for AI. And another example, we can do imaging more frequently than before. This makes it possible to monitor patients in the ICU, for example, which is important when we want to know the progress of a head injury. Is the patient getting better or worse? And how quickly can we find out? Again, smart AI tools will make it easier to quantify and report these changes. These are the things I’m excited about and many more.</p>
<div class="row t30">
<div class="col-xs-12">
<video style="width:80%;" autoplay="" muted="" loop="">
<source src="/images/T1_Linear_vs_DL.mp4" type="video/mp4" />
Your browser does not support HTML5 video.
</video>
</div>
<div class="col-xs-12">
<p><i>T1 scan reconstructed using linear and deep learning algorithm (work in progress).</i></p>
</div>
</div>
<h3 class="no_toc" id="jp--1717">JP 17:17</h3>
<p>Excellent, excellent. Well, we are too. I mean, looking at the impact that Hyperfine could have on the medical industry as a whole is incredible. We’re excited to see what else is coming. I want to get your take on the startup environment, particularly your thoughts on graduates and on people who are starting their career in technology. There’s a lot going on at Hyperfine. What specifically are you enjoying most about your role? And then, you’ve got a lot of experience in the AI tech community in general. How can we tell your story and at least give some insight into what’s possible within the startup environment, not just Hyperfine? What are you most excited about?</p>
<h3 id="ms--1802----what-am-i-most-enjoying-about-my-current-role">MS 18:02 - What am I most enjoying about my current role</h3>
<p>There are a few things I’m really excited about every day. I’m surrounded by smart, very smart people, with whom I share the journey and from whom I learn. There’s really something special about this deep intellectual debate when you’re trying to get to the bottom of a difficult issue. For example, our scanner got disassembled to the bare bones a few times. And we occasionally scrutinize our algorithms and examine them line by line. So we really go into the very detail of the design. And the second thing is that we are on a very important mission to make a significant contribution to the health care of humankind. And this is a risky project that corporations typically would not undertake. More than 90% of the world does not have access to MRI. Imaging is very important for diagnosing various conditions. And for example, when you have a stroke, which I mentioned a few times, a clogged vein inside the brain, and you get treated within a few hours of that happening, you may get a blood thinner and may be on a path to full recovery. And yet many strokes are missed in the emergency department, and having access to imaging and diagnosis tools might improve that. So people are actually dying because the strokes are missed. And the third thing I love about my job is working on super challenging problems. I have always been fascinated by scientific achievements, and positive impact and progress in technology and human lives. And really pushing the boundary of what is possible with AI today and working on really, really hard problems is very fulfilling for me.</p>
<h3 class="no_toc" id="jp--1950">JP 19:50</h3>
<p>Excellent, well, final question for you, Michal. Clearly, as you guys continue to be successful, the organization is going to grow and we’re all very much looking forward to seeing Hyperfine’s equipment in every hospital around the world. As the organization grows, how will your data team grow? And what opportunities are there going to be for, you know, individuals listening to this, whether it’s on the machine learning side, data science, or overall within the data team?</p>
<h3 id="ms--2020----what-future-looks-like-at-hyperfine">MS 20:20 - What future looks like at Hyperfine</h3>
<p>We have a lot to do at Hyperfine. And although we have a list of tremendous accomplishments, the path ahead of us is incredibly exciting. As we scale the company, deliver a lot of scanners to our customers and grow the team, we’re going to expand the offering both in terms of hardware as well as software. What I’m personally excited about are new machine learning and cloud services that will be driven by the device and the data we’re managing. I’m looking forward to building out this competency and seeing the impact on many different areas of healthcare. Access to frequent MRI imaging will make it possible to build databases for various patient conditions, and hopefully lead to a better understanding of the diseases and new discoveries in treatment. This is the impact I’m really passionate about.</p>
<h3 class="no_toc" id="jp--2121">JP 21:21</h3>
<p>Absolutely. Well, this has been an absolute pleasure. I really enjoyed learning about what you guys are doing at Hyperfine. I’m sure everyone listening will wish the company and yourself every success, given how much of an impact you can have on the medical field. So thank you very much, Michal, this has been a great learning experience.</p>
<h3 class="no_toc" id="ms--2140">MS 21:40</h3>
<p>Thank you, JP.</p>
</div><!-- /.medium-8.columns -->
</div>
<!-- /.row -->
<p><strong>Learning Detectors of Malicious Network Traffic</strong></p>
<p>Michal Sofka, 2015-09-03</p>
<p>Malware is constantly evolving and changing. One way to identify malware is by analyzing the communication that the malware performs on the network. Using machine learning, these traffic patterns can be utilized to identify malicious software. Machine learning faces two obstacles: obtaining a sufficient training set of malicious and normal traffic and retraining the system as malware evolves. This post will analyze an approach that overcomes these obstacles by developing a detector that utilizes domains (easily obtained from domain black lists, security reports, and sandboxing analysis) to train the system which can then be used to analyze more detailed proxy logs using statistical and machine learning techniques.</p><!--more-->
<div class="row">
<div class="medium-4 medium-push-8 columns">
<div class="panel radius">
<p id="toc"><strong>Table of Contents</strong></p>
<ul id="markdown-toc">
<li><a href="#multiple-instance-learning" id="markdown-toc-multiple-instance-learning">Multiple Instance Learning</a> <ul>
<li><a href="#evaluation" id="markdown-toc-evaluation">Evaluation</a></li>
</ul>
</li>
<li><a href="#adapting-to-malware-behavior-changes" id="markdown-toc-adapting-to-malware-behavior-changes">Adapting to Malware Behavior Changes</a> <ul>
<li><a href="#experimental-evaluation" id="markdown-toc-experimental-evaluation">Experimental Evaluation</a></li>
</ul>
</li>
<li><a href="#conclusion" id="markdown-toc-conclusion">Conclusion</a></li>
</ul>
</div>
</div><!-- /.medium-4.columns -->
<div class="medium-8 medium-pull-4 columns">
<p>The network traffic analysis relies on extracting communication patterns from HTTP proxy logs (flows) that are distinctive for malware. Behavioral techniques compute features from the proxy log fields and build a detector that generalizes to the particular malware family exhibiting the targeted behavior. <!--more--></p>
<p>The statistical features calculated from flows of malware samples are used to train a classifier of malicious traffic. This way, the classifier generalizes the information present in the flows and features and learns to recognize a malware behavior. We use features describing URL structures (such as URL length, decomposition, or character distribution), number of bytes transferred from server to client and vice versa, user agent, HTTP status, MIME type, port, etc. In our experimental evaluation, we used 305 features in total for each flow.</p>
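<p>As a rough illustration of the URL-structure features mentioned above, the following sketch computes a handful of them for a single flow. The function name <code>url_features</code> and the specific features chosen are assumptions made for this example; they are not the actual 305 features used in the evaluation.</p>

```python
from collections import Counter
from math import log2
from urllib.parse import urlsplit


def url_features(url: str) -> dict:
    """Compute a few illustrative URL-structure features for one flow.

    Hypothetical feature set for illustration only.
    """
    parts = urlsplit(url)
    # Decompose the path into its non-empty segments.
    path_segments = [s for s in parts.path.split("/") if s]
    total = len(url)
    counts = Counter(url)
    # Shannon entropy of the character distribution (0.0 for an empty URL).
    entropy = (
        -sum((c / total) * log2(c / total) for c in counts.values())
        if total
        else 0.0
    )
    return {
        "url_length": total,
        "host_length": len(parts.hostname or ""),
        "path_depth": len(path_segments),
        "query_length": len(parts.query),
        "digit_ratio": sum(ch.isdigit() for ch in url) / total if total else 0.0,
        "char_entropy": entropy,
    }


features = url_features("http://example.com/a/b?x=1")
```

<p>In a real system these values would be combined with the other flow fields (bytes transferred, user agent, HTTP status, MIME type, port) into one feature vector per flow.</p>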
<p>The first conceptual problem in using the standard supervised machine learning methods is the lack of a sufficiently representative training set containing examples of malicious and legitimate communication. Providing security intelligence on individual proxy logs is expensive and does not scale with constantly evolving malware. The second problem is that the trained classifier is heavily dependent on the samples used in the training. Once malware changes its behavior, the system needs to be retrained. With the continuously rising number of malware variants, this becomes a major bottleneck in modern malware detection systems.</p>
<p>Both problems are addressed by considering groups of flows (also called bags). The bags are constructed for each user (or source IP) and contain all network communication with a particular hostname for a specific period of time.</p>
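<p>The bag construction described above can be sketched as a simple grouping step. The field names (<code>user</code>, <code>hostname</code>, <code>timestamp</code>) and the one-hour window are assumptions for this example, since the post does not specify the log schema or the time period.</p>

```python
from collections import defaultdict


def build_bags(flows, window_seconds=3600):
    """Group flows into bags keyed by (user, hostname, time window).

    Each flow is a dict with hypothetical keys 'user', 'hostname'
    and 'timestamp' (seconds); the window length is an assumed parameter.
    """
    bags = defaultdict(list)
    for flow in flows:
        # Flows from the same user to the same hostname share a bag
        # as long as they fall into the same time window.
        window = int(flow["timestamp"] // window_seconds)
        bags[(flow["user"], flow["hostname"], window)].append(flow)
    return dict(bags)


flows = [
    {"user": "u1", "hostname": "a.example", "timestamp": 10},
    {"user": "u1", "hostname": "a.example", "timestamp": 20},
    {"user": "u1", "hostname": "b.example", "timestamp": 30},
    {"user": "u2", "hostname": "a.example", "timestamp": 4000},
]
bags = build_bags(flows)
```

<p>Here the two flows from <code>u1</code> to <code>a.example</code> end up in one bag, while the other two flows each form their own bag.</p>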
<h2 id="multiple-instance-learning">Multiple Instance Learning</h2>
<p>The robustness of the learned malicious flow detector directly depends on using a representative training set. Labeling individual flows in large quantities is difficult but the labels of domains can be easily obtained by leveraging internet domain black lists, security reports, and sandboxing analysis. Assigning labels based on the domains instead of the richer proxy logs with full target website URLs results in weak supervision in training: it is not known which flows in a positive bag are malicious and which are legitimate. The key advantage of this approach is that the requirements on the labeled samples (and their accuracy) are lower. This way, the system can train a detector that operates on individual proxy-logs while the training uses only domains to indicate malicious or legitimate traffic. Since the labeling is at the level of domains while the system trains a proxy log classifier, it can happen that some proxy logs in the positive bags (labeled positive based on the domain) can be negative (legitimate). The training algorithm correctly handles such cases.</p>
<p>The problem is formulated as weakly supervised learning since the bag labels are used to train a classifier of individual flows. We propose an algorithm based on Multiple Instance Learning (MIL) that seeks a Neyman-Pearson detector with a very low false positive rate, which is necessary for deployment of the system. The approach is illustrated in Figure 1.</p>
<div class="row t30">
<div class="col-xs-12">
<img class="cust-padd" src="/images/ML-detectors/ML-Figure1-550x314.png" />
</div>
<div class="col-xs-12">
<p><i>Figure 1: (1) Flows from the training set are associated with either malicious or legitimate traffic. This fact is illustrated by a plus or a minus sign, for a malicious or a legitimate flow respectively. Unfortunately, such information is hard to obtain and is often not available for training. Therefore, third-party feeds or blacklists are used to label the training data. These lists are mostly domain-based and introduce mistakes in labeling (2), resulting in poor performance of classifiers trained on such mislabeled data, as shown in (3). Our solution uses blacklists and feeds to create weak labels of bags (4). A bag is labeled as positive if at least one flow included in the bag is labeled as positive. Otherwise, the bag is labeled as negative. An example of a bag is a set of flows with the same user and domain. The MIL classifier learns a flow-level model based on weak labels from the bags and optimizes the decision boundary, which results in better separation of malicious and legitimate flows (5) and thus higher efficacy.</i></p>
</div>
</div>
<p>Learning of the Neyman-Pearson detector is formulated as an optimization problem with two terms: false negatives are minimized while choosing a detector with a prescribed and guaranteed (very low) false positive rate. False negatives and false positives are approximated by empirical estimates computed from the weakly annotated data. The hypothesis space of the detector is composed of linear decision rules parameterized by a weight vector and an offset. The described Neyman-Pearson learning is a modification of the Multi-Instance Support Vector Machines (mi-SVM) algorithm. The mi-SVM treats the flow labels as unobserved hidden variables subject to constraints defined by their bag labels. The goal is to maximize the instance margin jointly over the unknown instance labels and a linear discriminant function.</p>
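<p>For reference, the two ingredients described above can be written down as follows. The first expression is only a schematic reading of the Neyman-Pearson criterion in the text, with an assumed symbol β for the prescribed false positive rate. The second is the standard mi-SVM formulation, not the modified objective used here, whose exact form is not given in this post. The symbols <b>w</b>, b, ξ, C and the bag index B are notation introduced for this sketch.</p>

```latex
% Neyman-Pearson criterion (schematic): minimize false negatives
% subject to a prescribed bound \beta on the false positive rate.
\min_{\mathbf{w},\, b} \; \mathrm{FN}(\mathbf{w}, b)
\quad \text{s.t.} \quad \mathrm{FP}(\mathbf{w}, b) \le \beta

% Standard mi-SVM: instance labels y_i are hidden variables
% constrained by the labels of the bags they belong to.
\min_{\{y_i\}} \; \min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \;
  \tfrac{1}{2}\,\lVert \mathbf{w} \rVert^2 + C \sum_i \xi_i
\quad \text{s.t.} \quad
  y_i \left( \langle \mathbf{w}, \mathbf{x}_i \rangle + b \right) \ge 1 - \xi_i,
  \quad \xi_i \ge 0, \quad y_i \in \{-1, +1\},

\sum_{i \in B} \frac{y_i + 1}{2} \ge 1 \;\; \text{for every positive bag } B,
\qquad
y_i = -1 \;\; \text{for every } i \text{ in a negative bag.}
```

<p>The bag constraints encode the weak supervision: a positive bag must contain at least one positive flow, while every flow in a negative bag is negative.</p>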
<h3 id="evaluation">Evaluation</h3>
<p>Our evaluation of the detectors uses datasets that represent 14 days of real network traffic of a large international company (80,000 seats). The MIL detector is compared to the SVM detector learned by considering all instances in the malicious bags to be positive and instances in the legitimate bags to be negative. Figure 2 presents results obtained on the first 150 test flows with the highest decision score computed by both detectors. The flows were automatically selected from a dataset of 10M test flows.</p>
<div class="row t30">
<div class="col-xs-12">
<img class="cust-padd" src="/images/ML-detectors/ML-New-Figure2-550x200.png" />
</div>
<div class="col-xs-12">
<p><i>Figure 2: The left figure shows the number of true positives and the right figure the precision of the detectors as a function of the number of detected flows. We also show results for a baseline detector selecting the flows randomly.</i></p>
</div>
</div>
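<p>The quantities plotted in Figure 2 (true positives and precision among the highest-scoring flows) can be computed along the following lines. This is a minimal sketch with made-up scores and labels; the function name <code>precision_at_k</code> is an assumption and the numbers are not taken from the evaluation.</p>

```python
def precision_at_k(scores, labels, k):
    """Return (true positives, precision) among the k highest-scoring flows.

    scores: detector decision scores, one per flow
    labels: ground-truth labels, 1 for malicious and 0 for legitimate
    """
    if k <= 0 or not scores:
        raise ValueError("k must be positive and scores must be non-empty")
    # Rank flows by decision score, highest first.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top = ranked[:k]
    true_positives = sum(1 for _, label in top if label == 1)
    return true_positives, true_positives / len(top)


# Toy example with invented values.
tp, precision = precision_at_k([0.9, 0.8, 0.7, 0.1], [1, 0, 1, 0], k=3)
```

<p>Sweeping <code>k</code> from 1 up to the number of detected flows gives curves of the kind shown in the two panels.</p>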
<p>The MIL detector takes advantage of large databases of weak annotations (such as security feeds). Since the databases are updated frequently, the detectors are also retrained to maintain the highest accuracy. The training procedure relies on generic features and therefore generalizes the malware behavior from the training samples. As such the detectors find malicious traffic not present in the intelligence databases (marked by the feeds). The algorithm results in a general system that can recognize malicious traffic by learning from weak annotations.</p>
<h2 id="adapting-to-malware-behavior-changes">Adapting to Malware Behavior Changes</h2>
<p>Next, we focus on the problem of detecting variants of malicious behaviors. The detector uses a new representation of bags computed from sample feature values. The representation is designed to be invariant under shifting and scaling of the feature values and under permutation and size changes of the bags. In the context of malware, it means that any change in the number of flows of an attack (size invariance) or in the ordering of flows (permutation invariance) will not help evade the detection. Shift and scale invariance ensures that any internal variations of malware behavior as described by a predefined set of features will not change the representation. This means that new and unseen malware variants are represented with similar feature vectors as existing known malware, which greatly facilitates the detection of new or modified malicious behaviors. The ability to detect malware variants directly improves the system efficacy. The steps for creating the representation are described in Figure 3.</p>
<div class="row t30">
<div class="col-xs-12">
<img class="cust-padd" src="/images/ML-detectors/ML-Figure3-550x392.png" />
</div>
<div class="col-xs-12">
<p><i>Figure 3: (1) Each bag is initially represented as a set of flow-based feature vectors. Bags with fewer than 5 flows are not processed. The representation is then transformed to be invariant against specific malware variations. (2) Shift invariance is ensured by computing a self-similarity matrix for each feature over all flows in a bag. The element (i,j) of this symmetric positive semi-definite matrix corresponds to the distance between the feature values of flows i and j. This transforms each bag into a set of self-similarity matrices, one for each feature. Scale invariance is achieved by normalizing all values in each self-similarity matrix onto the interval (0,1). (3) Size and permutation invariance are ensured by creating a histogram of all elements in each normalized self-similarity matrix. (4) All histograms for each bag are concatenated to form the final bag representation.</i></p>
</div>
</div>
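<p>The four steps in Figure 3 can be sketched in a few lines. This is a minimal illustration, not the implementation from the paper. It assumes the absolute difference as the per-feature distance, normalization by the largest distance in each matrix, four equal-width histogram bins, and histograms expressed as relative frequencies so that bags of different sizes stay comparable. The function name <code>bag_representation</code> and the toy data are hypothetical.</p>

```python
def bag_representation(bag, bins=4, min_flows=5):
    """Build an invariant representation of one bag of flows.

    bag: list of flows, each flow a list of numeric feature values.
    Returns a flat list of relative histogram frequencies, or None when
    the bag has fewer than min_flows flows.
    """
    if len(bag) < min_flows:
        return None  # Step 1: small bags are not processed.
    representation = []
    for feature in range(len(bag[0])):
        values = [flow[feature] for flow in bag]
        # Step 2a: self-similarity matrix (flattened). Pairwise differences
        # cancel any constant shift of the feature values.
        distances = [abs(a - b) for a in values for b in values]
        # Step 2b: dividing by the largest distance cancels any scaling.
        largest = max(distances)
        if largest > 0:
            distances = [d / largest for d in distances]
        # Step 3: a histogram ignores the order of flows; relative
        # frequencies (an assumption here) reduce the effect of bag size.
        counts = [0] * bins
        for d in distances:
            counts[min(int(d * bins), bins - 1)] += 1
        # Step 4: concatenate the per-feature histograms.
        representation.extend(c / len(distances) for c in counts)
    return representation


# Toy bag (hypothetical): five flows with two features each.
bag = [[1, 10], [2, 20], [4, 10], [7, 40], [3, 30]]
# A "variant": every value scaled by 3, shifted by 10, flows reordered.
variant = [[3 * a + 10, 3 * b + 10] for a, b in reversed(bag)]

print(bag_representation(bag) == bag_representation(variant))
```

<p>The comparison shows the intended effect: the shifted, scaled, and reordered bag maps to the same vector as the original one, so a classifier trained on the original would treat both alike.</p>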
<h3 id="experimental-evaluation">Experimental Evaluation</h3>
<p>We ran experiments on datasets containing five malware categories: malware with command & control channels (marked as C&C), malware with a domain generation algorithm (marked as DGA), DGA exfiltration, click fraud, and trojans. The remaining background traffic is considered legitimate. The number of flows and bags in each category is given in Table 1.</p>
<div class="row t30">
<div class="col-xs-12">
<img class="cust-padd" src="/images/ML-detectors/ML-Table1-300x173.png" />
</div>
<div class="col-xs-12">
<p><i>Table 1: Number of flows and bags of malware categories and background traffic.</i></p>
</div>
</div>
<p>The effectiveness of the self-similarity matrix in capturing malware variations is shown by comparing the results to the case where the histograms are obtained directly from the flow-based feature values (i.e. without computing the self-similarity matrices). A two-class SVM classifier was trained on each of the two representations. The training set consisted of click fraud positive bags and 5977 legitimate negative bags. The testing set consisted of bags from C&C and DGA malware, DGA exfiltration, trojans, and 8000 negative background bags. The results are summarized in Table 2 and compared with flow-level signature-based blocks in Figure 4.</p>
<div class="row t30">
<div class="col-xs-12">
<img class="cust-padd" src="/images/ML-detectors/ML-Table2-550x74.png" />
</div>
<div class="col-xs-12">
<p><i>Table 2: Summary of the SVM results for the baseline and the invariant representation. Both classifiers have comparable results on the training set; however, the SVM classifier using the new invariant self-similarity representation achieved better performance on the test data.</i></p>
</div>
</div>
<div class="row t30">
<div class="col-xs-12">
<img class="cust-padd" src="/images/ML-detectors/ML-Figure4-550x195.png" />
</div>
<div class="col-xs-12">
<p><i>Figure 4: Analysis of false negatives (number of missed malware samples) and true positives (number of detected malware samples) for flow-level blocks (e.g. Cloud Web Security) and for the SVM classifier based on two types of representations: histograms computed directly from feature vectors, and the new self-similarity histograms. Thanks to the self-similarity representation, the SVM classifier was able to correctly classify all DGA exfiltration and trojan bags and most of the DGA malware bags, with a small increase in false negatives for C&C. Overall, the new representation shows significant improvements when compared to flow-level blocks, and better robustness than the approach without the self-similarity.</i></p>
</div>
</div>
<p>In the next experiment, the representation is used in clustering to group malware belonging to the same category. This analysis shows how changing malware parameters influences the similarity of samples, i.e. whether a modified malware sample is still considered similar to other malware samples of the same category. Two malware categories were included in the training set (click fraud and C&C) together with 5000 negative bags. The result is shown in Figure 5.</p>
<div class="row t30">
<div class="col-xs-12">
<img class="cust-padd" src="/images/ML-detectors/ML-Figure5-550x727.png" />
</div>
<div class="col-xs-12">
<p><i>Figure 5: Graphical illustration of the clustering results, where the input bags were represented with the new invariant representation. Legitimate bags are concentrated in three large clusters on the top and in a group of non-clustered bags located in the center. Malicious bags were clustered into six clusters.</i></p>
</div>
</div>
<h2 id="conclusion">Conclusion</h2>
<p>We have shown how to use bags of flows to represent the communication of malware samples. The bags can be used to train a classifier of malicious flows by computing statistical feature vectors of the flows in a bag and labeling the bags using feeds and other security intelligence. The advantage is that labels of individual flows do not need to be provided, which makes the labeling process tractable. The MIL algorithm used in the detector training minimizes a weighted sum of errors made by the detector on the negative and the positive bags. The trained flow-based classifier performs better than a classifier trained from individual flows without forming the bags. Entire bags can also be classified by computing a new representation that leverages all flows in a bag to capture malware dynamics and behavior over time. The representation is robust to malware variations that attempt to evade detection (e.g. by changing the URL pattern, number of transferred bytes, user agent, etc.). The invariant representation is based on the idea that malicious flows in a bag will have different statistical properties than legitimate flows in another bag. This richer information makes it possible to improve the efficacy of learning-based detectors.</p>
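<p>The weighted sum of bag-level errors mentioned above can be illustrated with a small sketch. It only evaluates the error for fixed flow scores and does not perform the training itself. It assumes the common MIL convention that a bag is flagged when at least one of its flows scores above a threshold. The function names, the <code>weight</code> parameter, and the toy scores are hypothetical.</p>

```python
def bag_prediction(flow_scores, threshold):
    """Flag a bag when at least one of its flows scores above the threshold."""
    return max(flow_scores) > threshold


def weighted_bag_error(positive_bags, negative_bags, threshold, weight=0.5):
    """Weighted sum of the miss rate on positive bags and the
    false-alarm rate on negative bags (weight is an assumed trade-off)."""
    missed = sum(not bag_prediction(bag, threshold) for bag in positive_bags)
    false_alarms = sum(bag_prediction(bag, threshold) for bag in negative_bags)
    miss_rate = missed / len(positive_bags)
    false_alarm_rate = false_alarms / len(negative_bags)
    return weight * miss_rate + (1 - weight) * false_alarm_rate


# Toy detector scores (hypothetical), one inner list per bag.
positive_bags = [[0.1, 0.9], [0.2, 0.3]]
negative_bags = [[0.1, 0.2], [0.6, 0.1], [0.05]]

for threshold in (0.5, 0.25):
    print(threshold, round(weighted_bag_error(positive_bags, negative_bags, threshold), 3))
```

<p>Only bag labels enter the computation, which is why labels of individual flows are not required. Raising <code>weight</code> penalizes missed malicious bags more heavily than false alarms.</p>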
<p>The technology is integrated into the Cisco CWS Premium product (Cognitive Threat Analytics). The work will be presented in more detail at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD) on September 7-11, 2015. Further reading can be found in the articles published in the conference proceedings:</p>
<ol class="bibliography"><li>
<div class="cf">
<img src="/assets/img/franc-ecml15-1.jpg" class="thumb" />
<img src="/assets/img/franc-ecml15-2.jpg" class="thumb" />
<span id="franc:ecml15">Franc, V., Sofka, M., Bartos, K., 2015. Learning detector of malicious network traffic from weak labels. In: Proceedings of the European Conference on Machine Learning
and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD). Porto, Portugal, pp. 85–99.</span>
</div>
<div id="franc:ecml15-materials">
<button class="button0" onclick="$('#franc-ecml15-abstract').toggle();">abstract</button>
<a href="/pdfs/franc-ecml15.pdf"><input class="button4" type="button" value="pdf" /></a>
<button class="button1" onclick="$('#franc-ecml15-bibtex').toggle();">bibtex</button>
</div>
<div class="dispinline">
<p class="likepre" id="franc-ecml15-abstract" style="display: none;">We address the problem of learning a detector of malicious behavior in network
traffic. The malicious behavior is detected based on the analysis of network
proxy logs that capture malware communication between client and server
computers. The conceptual problem in using the standard supervised learning
methods is the lack of sufficiently representative training set containing
examples of malicious and legitimate communication. Annotation of individual
proxy logs is an expensive process involving security experts and does not
scale with constantly evolving malware. However, weak supervision can be
achieved on the level of properly defined bags of proxy logs by leveraging
internet domain black lists, security reports, and sandboxing analysis. We
demonstrate that an accurate detector can be obtained from the collected
security intelligence data by using a Multiple Instance Learning algorithm
tailored to the Neyman-Pearson problem. We provide a thorough experimental
evaluation on a large corpus of network communications collected from various
company network environments.</p>
<p>
<pre id="franc-ecml15-bibtex" style="display: none;"><small>@inproceedings{franc:ecml15,
author = {Franc, Vojtech and Sofka, Michal and Bartos, Karel},
title = {Learning detector of malicious network traffic from weak labels},
booktitle = {Proceedings of the European Conference on Machine Learning
and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD)},
year = {2015},
month = "7--11~" # sep,
pages = {85--99},
address = {Porto, Portugal}
}
</small></pre>
</p>
<br />
</div>
</li></ol>
<ol class="bibliography"><li>
<div class="cf">
<img src="/assets/img/bartos-ecml15-1.jpg" class="thumb" />
<img src="/assets/img/bartos-ecml15-2.jpg" class="thumb" />
<span id="bartos:ecml15">Bartos, K., Sofka, M., 2015. Robust representation of network traffic for detecting malware variations. In: Proceedings of the European Conference on Machine Learning
and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD). Porto, Portugal, pp. 116–132.</span>
</div>
<div id="bartos:ecml15-materials">
<button class="button0" onclick="$('#bartos-ecml15-abstract').toggle();">abstract</button>
<a href="/pdfs/bartos-ecml15.pdf"><input class="button4" type="button" value="pdf" /></a>
<button class="button1" onclick="$('#bartos-ecml15-bibtex').toggle();">bibtex</button>
</div>
<div class="dispinline">
<p class="likepre" id="bartos-ecml15-abstract" style="display: none;">The goal of domain adaptation is to solve the problem of different joint
distribution of observation and labels in the training and testing data sets.
This problem happens in many practical situations such as when a malware
detector is trained from labeled datasets at certain time point but later
evolves to evade detection. We solve the problem by introducing a new
representation which ensures that a conditional distribution of the observation
given labels is the same. The representation is computed for bags of samples
(network traffic logs) and is designed to be invariant under shifting and
scaling of the feature values extracted from the logs and under permutation and
size changes of the bags. The invariance of the representation is achieved by
relying on a self-similarity matrix computed for each bag. In our experiments,
we will show that the representation is effective for training detector of
malicious traffic in large corporate networks. Compared to the case without
domain adaptation, the recall of the detector improves from 0.81 to 0.88 and
precision from 0.998 to 0.999.</p>
<p>
<pre id="bartos-ecml15-bibtex" style="display: none;"><small>@inproceedings{bartos:ecml15,
author = {Bartos, Karel and Sofka, Michal},
title = {Robust representation of network traffic for detecting malware variations},
booktitle = {Proceedings of the European Conference on Machine Learning
and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD)},
year = {2015},
month = "7--11~" # sep,
pages = {116--132},
address = {Porto, Portugal}
}
</small></pre>
</p>
<br />
</div>
</li></ol>
<p>This post was authored by Karel Bartos, Vojtech Franc, & Michal Sofka.</p>
</div><!-- /.medium-8.columns -->
</div>
<!-- /.row -->
2015-09-03T00:00:00-04:00