The DARPA AI Colloquium and Expedition

Mr. Mattei’s talk on RFMLS at the AI Colloquium

DARPA’s Artificial Intelligence Colloquium (AIC) is taking place this week at the Hilton Alexandria Mark Center and aims to “highlight recent research results across the full breadth of DARPA’s investment in advancing the state of the art in AI”. As we’ve noted before, Expedition has been at the forefront of this work, designing novel deep learning architectures and applying them to both image and signal solutions for our country.

But you don’t need to take our word for it. This week our own Mr. Enrico Mattei, a research scientist at EXP, was asked to present a summary of the goals of DARPA’s RFMLS program. (UPDATE 26 March 2019: The video of his presentation is now available on YouTube and is embedded above.) RFMLS is aimed at mapping the internet of things with machine learning to improve security through spectrum awareness and emitter identification. Enrico, as the Principal Investigator of our RFMLS effort, is one of only a few non-Government attendees asked to give a talk at the Colloquium. We’re especially proud to have him represent the foundational and novel work the team has been doing on the RFMLS effort.

A Review of Focal Loss at Women in Data Science Blacksburg

Summary: I enjoyed listening to talks, meeting Virginia Tech students, and giving a tutorial on deep learning at the Women in Data Science (WiDS) Blacksburg conference. WiDS events all over the world are happening now to encourage and support current and future women in this field. Some of the material from my tutorial on focal loss, intended for people with a basic background in machine learning, is included below with context for an accompanying Jupyter notebook.

Last week I had the opportunity to attend and present at Women in Data Science Blacksburg, the first WiDS regional event at Virginia Tech, hosted by Dr. Eileen Martin, one of my classmates from grad school. For those of you unfamiliar with WiDS, the first Women in Data Science conference was organized and run at Stanford, led by Dr. Margot Gerritsen, who was the director of my department at the time. One of the things I love about WiDS is how from early on there were efforts to have it reach beyond Silicon Valley by encouraging people around the world to host their own WiDS events. At this point, I’ve attended WiDS conferences at Stanford; Cambridge, MA; Washington, D.C.; and now Blacksburg, VA. Video clips and images from various regional events are compiled and broadcast, so for me, there is a sense of this much broader community extending beyond the people in my locality. Speaking of video clips, the Virginia Tech College of Science has already put together a brief video about WiDS Blacksburg. I hope they continue to support this event in the future.

If you are a woman or male ally interested in data science and machine learning, WiDS Blacksburg was held earlier than most, so there may still be time to register for a WiDS event in your region or participate remotely via the livestream from Stanford.

The tutorial session that I presented at this WiDS focused on focal loss, a variant of the cross-entropy function commonly used by neural networks to perform classification. The paper Focal Loss for Dense Object Detection was published (pre-published?) on arXiv in mid-2017, so it has been around for a bit, but many people are still not familiar with this simple but effective technique. To prepare an interactive example that students could run easily, naturally the first thing I did was search GitHub, because while I could write my own from scratch, let’s be real — I have a day job and a life outside of work, and I strongly believe in minimizing duplication of effort. I found a great example Jupyter notebook and accompanying blog post by user Tony607, forked the repository, and started making changes. I ended up changing a fair amount in order to approach the problem in the way that made the most sense to me and to emphasize certain aspects of how focal loss works. My version of the notebook is available here, although I encourage you to read more of this post before trying it out. (Yes, it’s a toy example with a teeny tiny neural net, and it’s what made the most sense for a live demo.)

In my presentation, I tried to break down the main ideas from the focal loss paper to be more intuitive and digestible for people with less experience in deep learning. Read on for the full explanation, intended for people with a basic background in machine learning, or skip to the last paragraph for a couple sentences’ worth of closing thoughts.

First, let’s take a step back and ask “what problem are we trying to solve?” Say you want to classify each sample from a dataset as one of two classes, and to add a slight complication, the class distribution is imbalanced. (Don’t worry, we can extend focal loss to N classes, but I’m using two for simplicity.) To make this example more concrete, let’s say that the problem is detecting fraudulent financial transactions in a dataset with a large proportion of normal transactions and relatively few fraudulent transactions. In fact, this is the problem used in the Jupyter notebook. You build a neural network and train it for a bit, and it quickly attains the ability to distinguish between normal and fraudulent transactions at a basic level. From this point on, most of the training examples are not doing much to improve your performance because the model is already doing a decent job on them. We will call those “well-classified” or “easy” examples. There is a smaller subset of “hard” examples in the training dataset that are more informative to the model, and focal loss allows us to place more emphasis on those examples.

How does focal loss achieve this objective? Figure 1 from the paper illustrates this well. I’ve taken the original figure from the paper and added my own annotations below. This plot shows curves for the standard cross-entropy loss function and a few variations of the focal loss function, where the variations use different values for the hyperparameter gamma. On the x-axis is the input p_t, the predicted probability for the true class, and on the y-axis is the corresponding loss. Consider what happens with a well-classified example: say a training example with a true label of “normal” has a predicted “normal” score of 0.8. Looking at the cross-entropy function, the loss is small and, more importantly, the gradient is small. If we compare that to a hard example, such as a “normal” example with a score of 0.2, where we are not doing well on this example at all, the gradient for the hard example is larger. This is good: the standard cross-entropy loss function already has some built-in ability to place more emphasis on examples where the predictions are further from the truth labels.

However, if we go through the same thought exercise with one of the focal loss curves, we see that the gradient for the well-classified example is even smaller and the gradient for the hard example is even larger. We could interpret this difference in the shape of the loss functions by saying that a model trained with standard cross-entropy loss will continue trying to push scores for the well-classified examples further and further all the way to 1.0, where a model trained with focal loss will not care too much about the well-classified examples and instead work more towards improving on the hard examples. This effect is evident in the Jupyter notebook, and this is a good point to take a look at it and see for yourself.
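If you would rather check the numbers than read them off the plot, here is a minimal sketch (separate from the notebook) that evaluates both losses and their derivatives with respect to p_t at the two scores discussed above. It assumes the focal loss form from the paper without the optional alpha weighting, FL(p_t) = -(1 - p_t)^gamma * log(p_t), and gamma = 2; the function names are my own.

```python
import math

def cross_entropy(p_t):
    """Standard cross-entropy loss for the true-class probability p_t."""
    return -math.log(p_t)

def focal_loss(p_t, gamma=2.0):
    """Focal loss without alpha weighting: -(1 - p_t)^gamma * log(p_t)."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

def cross_entropy_grad(p_t):
    """Derivative of cross-entropy with respect to p_t."""
    return -1.0 / p_t

def focal_loss_grad(p_t, gamma=2.0):
    """Derivative of focal loss with respect to p_t (product rule)."""
    return (gamma * (1.0 - p_t) ** (gamma - 1.0) * math.log(p_t)
            - (1.0 - p_t) ** gamma / p_t)

for label, p_t in [("easy", 0.8), ("hard", 0.2)]:
    print(f"{label} example, p_t={p_t}: "
          f"CE loss={cross_entropy(p_t):.3f}, grad={cross_entropy_grad(p_t):.3f}; "
          f"FL loss={focal_loss(p_t):.3f}, grad={focal_loss_grad(p_t):.3f}")
```

With gamma = 2, the focal loss derivative is smaller in magnitude than the cross-entropy derivative at 0.8 and larger at 0.2. Note that these are derivatives with respect to p_t, matching the slopes in the plot, not gradients with respect to the network's logits.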

There are a couple more key parts of the focal loss paper that I want to discuss. First, the application we were just considering is a pure classification problem. Where does object detection fit in? A common standard design for deep learning object detection models uses a grid with several template boxes or “anchor boxes” with different aspect ratios at each cell within the grid, and the model learns to classify each cell as either having an object of interest (a ground truth box that roughly matches an anchor box) or having nothing of interest, i.e. belonging to the “background” class. In this example image from SSD: Single Shot Multibox Detector, only two anchor boxes match the cat, and one matches the dog. The vast majority of anchor boxes do not match a ground truth box, so we have a situation with potentially a large number of easy background examples and a smaller subset of hard examples. For a model like SSD, there is a regression component of the architecture where the model predicts the deltas between the truth bounding box and the anchor box, but focal loss does not directly impact that pathway, and that’s all we’ll say about it here.

Figure 1 from SSD: Single Shot MultiBox Detector
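To get a feel for why most anchors end up as background, here is a small sketch that tiles a unit image with anchor boxes and counts how many overlap a single hypothetical ground truth box. The grid size, anchor scale, aspect ratios, the ground truth box, and the 0.5 IoU matching threshold are all assumptions chosen for illustration, not the configuration of SSD or any particular detector.

```python
import math

def iou(a, b):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

def make_anchors(grid=8, scale=0.3, aspect_ratios=(1.0, 2.0, 0.5)):
    """One anchor per aspect ratio, centered on every cell of a grid x grid layout."""
    anchors = []
    for row in range(grid):
        for col in range(grid):
            cx, cy = (col + 0.5) / grid, (row + 0.5) / grid
            for ar in aspect_ratios:
                w, h = scale * math.sqrt(ar), scale / math.sqrt(ar)
                anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

anchors = make_anchors()
ground_truth = (0.10, 0.10, 0.45, 0.40)  # hypothetical object box in unit coordinates
matched = [a for a in anchors if iou(a, ground_truth) >= 0.5]  # assumed threshold
background = len(anchors) - len(matched)
print(f"{len(matched)} of {len(anchors)} anchors match; {background} are background")
```

Every unmatched anchor becomes a background example for the classifier, which is the imbalance that focal loss is meant to handle.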

Finally, the focal loss paper also mentions the use of a prior probability for the rare class. Based on my team’s experiments, I can say that adjusting the prior is not necessary (and it is not used in the Jupyter notebook example), but it can help improve performance. The general idea here is that if we know ahead of time that a certain class is very rare (or conversely that a certain class will be overwhelmingly represented) we can initialize the weights for the last layer leading up to classification such that the model is biased toward predicting the rare class with low probability (or predicting the common class with high probability) instead of predicting each class uniformly. The final layer will begin training already able to predict the correct label for most of the examples and just needs to learn to recognize the rare class(es).
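As a rough sketch of that initialization: the paper sets the bias of the final classification layer to b = -log((1 - pi) / pi), so that the initial sigmoid output is approximately the prior pi for the rare class. The value pi = 0.01 below is just an example, and the sketch assumes the final layer's weights start near zero so that the bias dominates the initial output.

```python
import math

def rare_class_bias(prior):
    """Bias b such that sigmoid(b) equals the assumed rare-class prior."""
    return -math.log((1.0 - prior) / prior)

def sigmoid(x):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

prior = 0.01  # assumed rarity of the positive class, for illustration only
bias = rare_class_bias(prior)
print(f"initial bias = {bias:.3f}, initial predicted probability = {sigmoid(bias):.4f}")
```

In a framework such as Keras or PyTorch, this value would be used to initialize the bias of the last layer before training starts.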

I think focal loss was a great topic for this setting because it is easy to implement, general enough to apply to many situations, and based on straightforward reasoning about gradients. Maybe I’m an idealist, but I think you or I could come up with an idea like this, too.

References

  1. Lin, T. Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2018). Focal loss for dense object detection. IEEE transactions on pattern analysis and machine intelligence.
  2. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C. Y., & Berg, A. C. (2016, October). SSD: Single shot multibox detector. In European conference on computer vision (pp. 21-37). Springer, Cham.

EXP at NAML

The Workshop on Naval Applications of Machine Learning

The 3rd Annual Naval Applications of Machine Learning (NAML) workshop was held the 11th through the 14th of February in San Diego. Hosted by SPAWAR Systems Center Pacific, it featured “oral and poster presentations on technical topics including autonomy, computer vision, and cybersecurity”. It has quickly grown from a great idea to a significant venue for showcasing applied machine learning solutions in our community.

Expedition Technology at NAML

Expedition (EXP) has been attending this conference since the first and is very happy to be participating this year as well. EXP’s CTO, Greg Harrison, presented our most recent work on object detection and tracking in Wide Area Motion Imagery, while Senior Scientist Enrico Mattei outlined our progress in developing state-of-the-art deep learning systems for analysis in the radio frequency realm on our RFMLS project. These topics were also discussed in an open forum at the GTC-DC conference late last year.

The presented projects aim to take the best approaches in this rapidly evolving field and develop them into deployed, modern solutions in support of the United States. They are premier examples of the work our team performs here at EXP and of why we are proud to be a part of it.

We are hiring!

If you think you would like to spend your days working on solutions to similar problems, apply to one of our positions or reach out to us here or to one of our folks on LinkedIn. We are happy to talk to you about life at EXP.

Expedition at the GPU Technology Conference

This week the team at Expedition Technology had the opportunity to publicly discuss a couple of the compelling projects we are working on here. At NVIDIA’s GPU Technology Conference in DC (GTC DC) we presented results on computer vision and on signal processing. The talks were:

The projects outlined in these talks are great examples of the type of work we tackle here at EXP and are also representative of the state-of-the-art algorithms and results we are developing. If taking on these kinds of big ideas and building solutions to address them is the sort of thing you would love to be doing, drop us a line or check out our current job postings!

Expedition Technology Wins DARPA Award to Map the IoT Via Machine Learning

(Dulles, VA) August 2, 2018 – Expedition Technology, Inc., is proud to announce the receipt of a three-year prime contract award worth up to $9.1 million from the Defense Advanced Research Projects Agency (DARPA) for the Radio Frequency Machine Learning System (RFMLS) program.

RFMLS is the first DARPA program to emphasize the application of machine learning to the RF spectrum. Machine learning is demonstrating considerable success when used in related fields including speech recognition and computer vision, but it has not yet been similarly applied to the crowded spectrum of signals that currently exists.

Through this contract, Expedition Technology and its partners will develop the foundations for applying modern data-driven Machine Learning to the RF Spectrum domain as well as develop practical applications in emerging spectrum problems which demand vastly improved discrimination performance over today’s hand-engineered RF systems. Ultimately, these innovations will result in a new generation of RF systems that are goal-driven and can learn from data rather than being hand-engineered by experts.

The four technical components of the program include: feature learning, attention and saliency, autonomous RF sensor configuration and waveform synthesis. A successful RFML system is intended to address the need for enhanced spectrum situational awareness. By discerning subtle differences in signals transmitted by mass-produced devices, RFMLS strives to identify signals intended to spoof or hack into devices in the Internet of Things (IoT). Additionally, RFMLS investigates new paradigms for the rapid evaluation of broad spectrum use to better support cognitive radio applications.

“The RFMLS program is the centerpiece of Expedition Technology’s rapidly growing portfolio of RF machine learning capabilities,” says Marc Harlacher, President and CEO. Harlacher continues, “Success in this endeavor will give our military the ability to discern and characterize signals in the increasingly-crowded RF spectrum, enhancing the ability to understand what is going on in the wireless domain.”

EXP is a prime contractor for DARPA’s RFMLS program, leading a team that includes the International Computer Science Institute (ICSI) and Leidos as partnering subcontractors.

About Expedition Technology
Expedition Technology (EXP) is a leading developer of machine learning algorithms and autonomous systems for defense and intelligence C4ISR applications including radar, lidar, imaging, full motion video, communications, navigation, signal intelligence, and data analytics. As a small business with extensive experience researching, engineering, developing and operating civil and military defense and aerospace systems, EXP is applying rapidly evolving machine learning capabilities to provide our U.S. Government customers with improved situational awareness and actionable intelligence.

Fighting GAN Mode Collapse by Randomly Sampling the Latent Space 

At Expedition Technology (EXP) we develop a broad set of deep learning solutions for our customers. Each deep learning development cycle typically starts with

  • Understanding the problem space
  • Getting acquainted with the research landscape
  • Tweaking an existing algorithm or developing entirely new architectures
  • Training on an army of GPUs

This is the standard process, but with a constraint: it requires very large, diverse data sets to get good results. As many of our customers’ problems grow more sophisticated, having that much data available is becoming an ever rarer occurrence. In these cases where data is scarce, there is a necessary additional step – amplifying the data that you have.

For help with this, we have been turning to Generative Adversarial Networks (GANs). Despite their wide-ranging success, deep generative methods are hindered by well-known drawbacks such as unstable minima and mode collapse. We have recently made progress regarding the latter and would like to share our methods with the rest of the deep learning community. In this post we will introduce GANs, describe mode collapse, and then explain how we’ve attempted to mitigate this problem while adding justifications and results to support our claims.

GANs

Generative Adversarial Networks [1] (GANs) are an incredible technology. Although classification and segmentation are necessary problems, they don’t have the catchy, easy-to-appreciate results GANs do. After all, you can’t become a great artist just by learning to distinguish Van Gogh from Monet. You have to actually pick up a paintbrush and try your hand at it. Similarly, if we strive to make intelligent systems, they must be able to not only discriminate, but to generate believable outputs. That’s where we cross the border from a passive to an active agent.

[6] – Architecture for a GAN generating MNIST digits

GANs operate by combining two networks – one that creates output, and one that provides feedback. The ‘generator’, as it’s called, is provided a random input and tries to return a correspondingly random output. The ‘discriminator’ then compares this generated sample to real world ones and gives a zero to one score of how believable it is. It’s really just a competition: the generator is trying to fool an ever-improving discriminator. If you let them duke it out a few million times, you end up with a discriminator that learns to tell real samples from fake ones, as well as a generator that does a pretty good job at making realistic looking samples.
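This competition can be written down as two loss functions. The sketch below uses the standard binary cross-entropy discriminator loss and the commonly used non-saturating generator loss from the GAN literature. It is an illustration with made-up discriminator scores, not the training code behind the results in this post.

```python
import math

def discriminator_loss(score_real, score_fake):
    """Binary cross-entropy: push real scores toward 1 and generated scores toward 0."""
    return -(math.log(score_real) + math.log(1.0 - score_fake))

def generator_loss(score_fake):
    """Non-saturating generator loss: low when the discriminator scores fakes highly."""
    return -math.log(score_fake)

# Hypothetical discriminator scores at two stages of training.
for stage, score_real, score_fake in [("early", 0.9, 0.1), ("later", 0.6, 0.45)]:
    print(f"{stage}: D loss={discriminator_loss(score_real, score_fake):.3f}, "
          f"G loss={generator_loss(score_fake):.3f}")
```

As the generator improves, its loss falls while the discriminator's rises, which is the tug-of-war described above.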

This is a powerful tool, as it theoretically allows for creating unlimited additional data. If the generated samples are within the set of all possible inputs, then we can turn 100 data points into 1000 by letting the generator hallucinate 900 new but plausible examples.

Mode collapse

There’s a problem, though. Let’s look at the following situation [2] as a GAN tries to make pictures of cars:

  1. After bumbling around for a bit, the generator learns to draw convincing Honda Civics
  2. The discriminator picks up on this and starts labeling most Honda Civics as generated
  3. In response to this, the generator tweaks its algorithm a bit and begins making a similar but separate class – Honda Accords
  4. Now the discriminator has to adjust, so it starts calling Honda Accords fake
  5. While the discriminator is distracted by Accords, the opportunity presents itself to start making convincing Civics again, which the generator happily reverts to
  6. Repeat steps 2-5

This infinite loop of similar outputs is termed mode collapse, and it is one of the things restricting GANs from being widely used as a data amplification tool. The consequence of mode collapse is that we cannot create an unlimited supply of unique samples, since our generator only flicks back and forth between a couple very similar outputs. This minimally satisfies the job of fooling the discriminator but is ultimately unhelpful if we are trying to stretch the effectiveness of our currently available data.

How to avoid mode collapse

To reconcile this, we decided to add a constraint: the generator outputs must be random, but in such a way that any such random output is believable. An intuitive way to enforce this is to find some compressed space X that is densely packed with examples, such that any point within that space corresponds to a true data sample. If we can also find a bijection f: X→Y from X, our densely packed space, to Y, our space of real examples, then we can randomly sample X and convert those points to plausible outputs.

Luckily for us, autoencoders are great at finding exactly such a space and such a function. The basic idea is that an autoencoder takes input, processes it to a lower dimensionality vector, then reconstructs the input from that vector. The bottleneck in the middle, then, contains the relevant information about the input with fewer variables, providing us a compressed space, referred to as the latent space. The decoder, given a point in that space, recreates the input that was encoded, which provides us with our bijection f. This relies on two assumptions that we will provide evidence for in the next section.

[5] – Architecture for an autoencoder that compresses MNIST digits

What does this all mean? If we set up an autoencoder to densely encode inputs to a latent space, then any randomly sampled point in that latent space should give a realistic, equally random output upon decoding. Somewhat surprisingly, with a small enough dimensionality of the latent space, this actually works.

Our architecture for the L-GAN

To employ this effectively, we make a small GAN that finds a sub-basis of this latent space, and then take random samples from this sub-basis. In practice, this means that we train a GAN to generate a batch of vectors, enforce that they are orthogonal using their dot product, and then take random linear combinations of these vectors. The discriminator then decides whether these linear combinations are convincing latent space encodings. Those that fool the discriminator get decoded into realistic samples. Due to the sampling being random and the decoder being a bijection, our results are random elements that are indiscernible from the true data. See the figure below for some examples of non-cherrypicked eights generated by the network.

Random 8’s generated by our GAN + Decoder
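The sampling step can be sketched in a few lines. The exact orthogonality term and the distribution of the mixing coefficients are not spelled out here, so the squared dot product penalty and the uniform coefficients below are assumptions made for illustration, and the hard-coded vectors stand in for generator output.

```python
import random

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def orthogonality_penalty(vectors):
    """Sum of squared dot products between distinct vectors; zero when all are orthogonal."""
    return sum(dot(vectors[i], vectors[j]) ** 2
               for i in range(len(vectors))
               for j in range(i + 1, len(vectors)))

def random_combination(vectors, rng):
    """Random linear combination of the basis vectors, one coefficient per vector."""
    coeffs = [rng.uniform(-1.0, 1.0) for _ in vectors]
    dim = len(vectors[0])
    return [sum(c * v[k] for c, v in zip(coeffs, vectors)) for k in range(dim)]

rng = random.Random(0)
# Stand-ins for generator output: three vectors in a 4-dimensional latent space.
basis = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
skewed = [[1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
print("penalty (orthogonal):", orthogonality_penalty(basis))
print("penalty (not orthogonal):", orthogonality_penalty(skewed))
print("latent sample:", random_combination(basis, rng))
```

In training, a penalty like this would be added to the generator's loss, and each sampled combination would go to the discriminator and then, if convincing, to the decoder.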

The reason for having the GAN find a sub-basis is that it is difficult to find a perfect dimensionality of the latent space. This means that not every one of the axes is guaranteed to be utilized evenly. Therefore, it is more sensible to choose a dimensionality that allows the autoencoder some leniency, and to then let the generator learn the necessary basis of ‘highest plausibility’.

This approach is reminiscent of variational autoencoders (VAEs) [4], which also encode the data samples for the purposes of generation. VAEs, however, sample the latent space differently, electing instead to add random normal noise to the encodings. In a VAE, the noise is shaped by a mean and standard deviation that are also produced by the encoder. In our approach, the encoder simply defines the latent space, which is then sampled by a wholly separate GAN.
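For contrast, VAE sampling can be sketched as follows. This is the standard reparameterized form from the VAE literature, z = mean + std * epsilon with epsilon drawn from a standard normal. The mean and standard deviation values are hypothetical encoder outputs.

```python
import random

def vae_sample(mean, std, rng):
    """Reparameterized VAE sample: mean + std * epsilon, with epsilon drawn from N(0, 1)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]

rng = random.Random(0)
# Hypothetical encoder outputs for one input, in a 3-dimensional latent space.
mean = [0.5, -1.2, 0.3]
std = [0.1, 0.2, 0.05]
print("VAE latent sample:", vae_sample(mean, std, rng))
```

Here the randomness is tied to each individual encoding, whereas in the approach described in this post the random draw happens in a separate network that only sees the latent space.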

Reasoning for why this works

There are two critical assumptions that substantiate our approach:

  1. The latent space is densely packed
  2. The decoder approaches a bijection

We provide two points of evidence to show that the latent space is densely packed. The first is a thought experiment. Given inputs that have 10 independent variables, and an encoded vector of length 5, we should expect that an autoencoder learns to utilize every degree of freedom to its fullest extent. If, instead, it only uses three axes of the five provided to it, the autoencoder will be further from representing the ten independent variables of the input space, implying that an easy lower minimum is available on the error landscape. This presents the caveat that our encodings need to be smaller in dimensionality than the number of independent variables in the input space. Such a requirement ensures that the optimal encoder takes advantage of every axis provided to it. Simply said, if you don’t give the encoder adequate dimensionality to represent the information, it must learn to take advantage of everything it has.

The second point is empirical, as seen by traveling through a latent space. It turns out, if we encode two handwritten MNIST digits to a latent space, the points between their encodings also represent plausible outputs, as seen in the figure [3] below. This implies that, given two known points in latent space, any point randomly between them is likely to also represent believable outputs. Our approach treats the latent representations differently by making a unique space for each digit, rather than a single latent space for all of them. In either case, the result should still hold.

[3] – Movement in the latent space from the encoding of a five to the encoding of a nine
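Moving through the latent space in this way amounts to linear interpolation between two encodings. The sketch below uses hypothetical 2-dimensional encodings; in practice each intermediate point would be passed through the decoder to produce an image.

```python
def interpolate(z_start, z_end, steps):
    """Evenly spaced points on the segment between two encodings (steps must be >= 2)."""
    path = []
    for i in range(steps):
        t = i / (steps - 1)
        path.append([(1.0 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return path

# Hypothetical 2-dimensional encodings of a five and a nine.
z_five = [0.2, -1.0]
z_nine = [1.4, 0.6]
for z in interpolate(z_five, z_nine, steps=5):
    print([round(v, 2) for v in z])  # each point would be passed to the decoder
```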

As for the second assumption, the decoder is not strictly a bijection, in part due to the discrete nature of the dataset. However, we can make a case that the decoder of a functional autoencoder will approach a bijection, as long as the encodings map to a densely packed space. We do this by showing that the encoder approaches a bijection from true inputs to a unique point in the latent space. The decoder then, as the inverse of the encoder, must learn the inverse bijection.

Before explaining the reasoning for the decoder being a bijection, we want to touch on why this is necessary. A bijection is a function f: X→Y that is both ‘onto’ and ‘one-to-one’. This means that any possible value O ∈ {Outputs} has exactly one corresponding input I for which f(I) = O. If both the encoder and the decoder are bijections, then any point randomly sampled in the latent space must have a unique, correspondingly random point in the true data space.

We can claim that the encoder is ‘onto’ as a consequence of our reasoning for the latent space being densely filled. In order to fill that dimensionality, the encoder must attempt to map the inputs into different locations within the latent space. As such, if the whole constrained-dimensionality latent space is filled, then the encoder is onto. We can also show that a working autoencoder’s encoder is ‘one-to-one’ by contradiction. If it were not one-to-one, then two different inputs could map to the same latent representation. Due to the assumption that the autoencoder is functional, this point in the latent space would be decoded back out to the two different inputs. This is not possible by the definition of a function. As such, an optimal encoder approaches a bijection, and therefore the decoder must do the same.

These assumptions come together for the logic of our generative approach. Autoencoders can find a latent space in which every point maps to plausible outputs, and simultaneously approximate the bijection between this latent space and the output space. Therefore, randomly sampling the dense latent space corresponds to randomly sampling the set of realistic data samples. The quality of decoded samples is then a direct result of how ‘bijective’ the encoding and decoding operations are.

Results

The ultimate goal is to amplify our existing data by generating new samples that are indiscernible from the original set. To this end, we set up an experiment where we trained a basic MNIST classifier on the full train set, on a tenth of the train set, and on a tenth of the train set along with generated samples. The GAN in this case was also trained on the same tenth.

We trained the GAN on each digit independently and created 5000 new samples for each. Upon training the classifier with GAN input, we split each batch as either 25, 50 or 75 percent composed of generated digits. The rest of each batch was taken from the tenth of the train set.

We found that the network trained on a tenth of the dataset plus generated samples is more accurate on the test set than the network trained without generated samples. Specifically, we see a relative decrease in the error rate of up to 17% after training on our amplified dataset.

| Train set | Test set accuracy |
|---|---|
| All train data | 96.85% |
| Tenth of train data | 94% |
| Tenth of train data and generated 75/25 | 94.3% |
| Tenth of train data and generated 50/50 | 95% |
| Tenth of train data and generated 25/75 | 92.6% |
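The error rate reduction can be recomputed from the accuracies in the table. The short sketch below assumes the comparison is between the tenth of the train data alone (94%) and the best mixed row, the 50/50 split (95%); the function name is only for illustration.

```python
def relative_error_reduction(baseline_accuracy, new_accuracy):
    """Fractional drop in error rate when accuracy moves from baseline to new."""
    baseline_error = 1.0 - baseline_accuracy
    new_error = 1.0 - new_accuracy
    return (baseline_error - new_error) / baseline_error

# Accuracies from the table: a tenth of the train data alone vs. the 50/50 mix.
reduction = relative_error_reduction(0.94, 0.95)
print(f"relative error reduction: {reduction:.1%}")
```

Going from a 6% error rate to a 5% error rate removes about one sixth of the errors, or roughly 17%.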

 

 

References:

  1. Goodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. “Generative adversarial nets.” In Advances in neural information processing systems, pp. 2672-2680. 2014
  2. Nibali, http://aiden.nibali.org/blog/2017-01-18-mode-collapse-gans/
  3. Despois, https://medium.com/@juliendespois/latent-space-visualization-deep-learning-bits-2-bd09a46920df
  4. Kingma, Welling. “Auto-Encoding Variational Bayes.” https://arxiv.org/pdf/1312.6114.pdf
  5. Chollet, “Building Autoencoders in Keras”, https://blog.keras.io/building-autoencoders-in-keras.html, 2016
  6. Chablani, “GAN – Introduction and Implementation”, https://towardsdatascience.com/gan-introduction-and-implementation-part1-implement-a-simple-gan-in-tf-for-mnist-handwritten-de00a759ae5c, 2017