Bats, machine learning & tequila

When you’re having a shot of tequila at your next fiesta, make sure to raise a glass to bats. Yes, bats, because they’re the animals responsible for pollinating the blue agave plant needed to make tequila.

These small winged animals, which make up a whopping fifth of all mammal species, do quite a lot for us. They pollinate many other plants – like wild bananas – and do quite a bit of insect control. Some species, like the little brown bat, can eat up to 1,000 mosquitoes in a single hour. In fact, bats are one of our best defenses against the spread of mosquito-borne diseases such as Zika.

Knowing where bats live, and how their populations are faring in response to our impact on the planet, is clearly an important task. But they’re small, largely nocturnal and like to hide – so how do we know where the bats are?

One answer is through sound. Around 80% of bats emit a series of acoustic pulses, which they use – along with the returning echoes – to navigate the nocturnal world. This is called echolocation. Although these pulses are beyond the frequencies that humans can hear, they’re clearly audible to a range of devices called ultrasonic detectors.

About a decade ago, a big project called the Indicator Bats (iBats) program was set up in Europe to gather lots of recordings of bats across several countries. Today, volunteers drive through the countryside of 22 countries with bat detectors attached to the roofs of their cars, collecting the acoustic information that bats leak to the world about their whereabouts. This has generated a vast amount of audio recordings – more than any human alone could ever listen to.

The big challenge is to develop automated ways of telling us just how many bats are in those recordings, and of which species. Though each species has its own signature echolocation ‘call’, telling a computer how to distinguish between them is complicated. Just take a look at some calls. We can’t hear them, so plotting them on a spectrogram (a plot of frequency against time) is one way of looking at them. For some species, the call ‘shapes’ look completely different, making them easy to tell apart, like the calls in the top image below. But within some groups of bats – like brown bats, shown in the bottom image – the calls of different species can look pretty alike. On top of this, there can be a lot of variation in calls within a species, so it’s difficult to tell a computer what to do.
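
To make the idea of a spectrogram concrete, here is a minimal sketch that turns a sound into a frequency-time grid using short windowed Fourier transforms. The "call" is a synthetic falling sweep, not a real recording, and the sample rate, window size and hop are assumptions chosen only for illustration.

```python
import numpy as np

def spectrogram(signal, sample_rate, window_size=256, hop=64):
    """Return (times_s, freqs_hz, power) for a 1-D signal via short-time FFTs."""
    window = np.hanning(window_size)
    starts = range(0, len(signal) - window_size + 1, hop)
    # One windowed chunk of audio per time slice.
    frames = np.array([signal[s:s + window_size] * window for s in starts])
    # Rows are frequencies, columns are time slices.
    power = np.abs(np.fft.rfft(frames, axis=1)).T ** 2
    freqs = np.fft.rfftfreq(window_size, d=1.0 / sample_rate)
    times = (np.array(starts) + window_size / 2) / sample_rate
    return times, freqs, power

# Synthetic stand-in for a bat call: a 5 ms sweep from 80 kHz down to 40 kHz.
sample_rate = 500_000  # Hz (assumed; fast enough to capture ultrasound)
t = np.arange(0, 0.005, 1.0 / sample_rate)
f_start, f_end = 80_000.0, 40_000.0
sweep_rate = (f_end - f_start) / t[-1]  # Hz per second
call = np.sin(2 * np.pi * (f_start * t + 0.5 * sweep_rate * t ** 2))

times, freqs, power = spectrogram(call, sample_rate)
peak_freqs = freqs[power.argmax(axis=0)]  # loudest frequency in each time slice
```

Plotting `power` as an image, with `times` and `freqs` as the axes, would give the kind of call ‘shape’ described above. Here `peak_freqs` simply falls from around 80 kHz towards 40 kHz.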

[Figure: spectrograms of calls of various bat species. Top: species whose calls look clearly different. Bottom: brown bats, whose calls look much alike.]

This is the type of problem that a new branch of computer science is cut out to solve: machine learning (ML). ML is a kind of artificial intelligence that allows algorithms to ‘learn’ from the data you give them without being explicitly programmed for the task. In discriminating between bat species, ML algorithms have proven to be more accurate than other computational methods and even well-trained experts.

Let’s build a simple classifier. We’ll pick a machine learning algorithm called Random Forest, which is easy to understand. To train the algorithm, we have a bunch of cleanly recorded calls from 33 European bat species.

Before we do, we’ll need to package the bat calls somewhat differently – Random Forest doesn’t know what to do with raw audio. Using basic programming tools, we can extract some simple parameters from the calls. For instance, we can take the mean frequency for each millisecond time slice across a call – which you can see represented as black dots in the spectrogram below.
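
As a rough sketch of that step, the mean frequency of a time slice can be computed as an average of the frequencies weighted by how much energy each one carries. Whether the original analysis weighted the mean in exactly this way is an assumption, and the small spectrogram below is made up for illustration.

```python
import numpy as np

def mean_frequency_per_slice(freqs_hz, power):
    """Power-weighted mean frequency of each time slice (column) of a spectrogram."""
    weights = power / power.sum(axis=0, keepdims=True)
    return (freqs_hz[:, None] * weights).sum(axis=0)

# Hypothetical spectrogram: 1 kHz frequency bins, five 1 ms time slices.
freqs_hz = np.arange(0, 121_000, 1_000, dtype=float)
ridge_hz = np.array([80_000, 70_000, 60_000, 50_000, 40_000], dtype=float)
# Energy concentrated in a narrow band around a falling ridge.
power = np.exp(-0.5 * ((freqs_hz[:, None] - ridge_hz[None, :]) / 2_000.0) ** 2)

# One value per millisecond: these play the role of the black dots.
mean_freqs = mean_frequency_per_slice(freqs_hz, power)
```

Because the energy in each slice is symmetric around the ridge, `mean_freqs` recovers the ridge frequencies. On a real, noisy recording the background would pull the mean away from the call unless it is filtered out first.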

[Figure: spectrogram of a bat call, with the mean frequency of each time slice shown as black dots.]

We can go a step further and fit a curve to that line of dots, and calculate some fancier parameters from the curve itself, like the slope, or the steepest slope – stuff you may have forgotten from high school math. Using our imagination, we came up with 34 parameters in total. Remarkably, they shared a high similarity score with the parameters produced by commercial software normally used by ecologists for this purpose.
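
Here is one way the curve-fitting idea could look in code. The frequency values are invented, and the choice of a cubic polynomial is an assumption made for the sketch; the text does not say which kind of curve was actually fitted.

```python
import numpy as np

# Hypothetical mean frequencies (kHz), one per millisecond of a call.
time_ms = np.arange(6, dtype=float)
freq_khz = np.array([90.0, 70.0, 58.0, 50.0, 45.0, 42.0])

# Fit a cubic polynomial to the frequency track (the degree is an arbitrary choice).
curve = np.poly1d(np.polyfit(time_ms, freq_khz, deg=3))
slope = curve.deriv()  # derivative of the fitted curve, in kHz per ms

# Evaluate the slope on a fine grid to find where the call sweeps fastest.
grid = np.linspace(time_ms[0], time_ms[-1], 200)
slopes = slope(grid)
features = {
    "mean_slope_khz_per_ms": float(slopes.mean()),
    "steepest_slope_khz_per_ms": float(slopes.min()),  # most negative = steepest fall
    "time_of_steepest_slope_ms": float(grid[slopes.argmin()]),
}
```

Each entry in `features` is one number describing the call, which is the kind of input Random Forest can work with.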

We’re not going to tell Random Forest how to use these parameters to discriminate between species. The beauty of machine learning is that the algorithms are capable of “learning” the best way to sort the data. We simply feed Random Forest the parameters; it learns from these and builds us a classifier, i.e. an algorithm that is capable of assigning a species identification to a call it hasn’t seen before.

We set aside 20% of our data to test the algorithm on, and feed it the parameters from the other 80% of the bat calls. Random Forest works on the basis of decision trees. For each decision tree, it takes a random sample of the data – in this case, bat calls – and uses the parameters we extracted to create a pathway of decisions that leads to a classification. For instance, a simple decision tree would classify every bat call with a start frequency above 100 kHz as a horseshoe bat. If it’s below that frequency, it’ll, say, look at the steepest slope of the call. If it’s steep, it could be a brown bat, and so on. The Random Forest algorithm builds many such decision trees and combines the predictions of all of them to create a final classifier.
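
The split-and-train procedure can be sketched with the scikit-learn library, which is an assumption here since the text does not name the software that was used. The data are synthetic: three made-up groups described by two invented parameters, rather than the 34 parameters and 33 species of the real dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 100  # synthetic calls per group

def fake_calls(start_khz, slope, spread):
    """Made-up calls: columns are start frequency (kHz) and steepest slope (kHz/ms)."""
    return np.column_stack([
        rng.normal(start_khz, spread, n),
        rng.normal(slope, 1.0, n),
    ])

X = np.vstack([
    fake_calls(110, -1, 3),   # stand-in for horseshoe bats
    fake_calls(80, -20, 5),   # stand-in for brown bats
    fake_calls(50, -5, 5),    # everything else
])
y = np.array(["horseshoe"] * n + ["brown"] * n + ["other"] * n)

# Hold out 20% of the calls for testing; train on the remaining 80%.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Grow 100 decision trees, each on a random resample of the training calls.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

accuracy = forest.score(X_test, y_test)  # fraction of held-out calls labelled correctly
```

These toy groups are far easier to separate than real bat calls, so the accuracy here says nothing about how the real classifier performs.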

After growing the classifier, we test it on the 20% of data we set aside initially. We see that it correctly predicted the species in nearly 70% of cases. It performed worse in some bat groups than in others: it did slightly worse in the group of brown bats, which – as we previously discussed – are tricky to distinguish.
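
To show how an overall score and a per-group breakdown relate, here is a small sketch that computes both from lists of true and predicted labels. The ten labels are invented for illustration and are not the results described above.

```python
from collections import Counter

def per_species_accuracy(true_labels, predicted_labels):
    """Fraction of calls of each true species that were predicted correctly."""
    totals = Counter(true_labels)
    correct = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    return {species: correct[species] / totals[species] for species in totals}

# Made-up labels for ten test calls ("brown A" and "brown B" are hypothetical species).
true_labels = ["horseshoe"] * 4 + ["brown A"] * 3 + ["brown B"] * 3
predicted = (
    ["horseshoe"] * 4
    + ["brown A", "brown B", "brown A"]
    + ["brown B", "brown A", "brown B"]
)

overall = sum(t == p for t, p in zip(true_labels, predicted)) / len(true_labels)
by_species = per_species_accuracy(true_labels, predicted)
```

In this toy case the overall accuracy is 80%, but the breakdown shows that every mistake is a mix-up between the two look-alike groups.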

70% is not bad for a start. When a research group used an artificial neural network, a different type of machine learning algorithm, for the same species of bats, they achieved 83.7% accuracy. The trouble with artificial neural networks is that they are very much a “black box” in terms of how they work. We can test whether they are accurate, in the same way we tested the accuracy of our Random Forest. But what happens inside the box, i.e. why the algorithm produces a certain prediction, we can’t say.

Random Forest, in contrast, can tell us which parameters it found most useful in discriminating between species (in our case, the lowest frequency of the call). We know exactly how it works. The downside is that, at least for this data, it wasn’t quite as accurate as the artificial neural network. When we tried our classifier on some real-world data collected on the island of Jersey a few years ago, it didn’t do well at all. For one, it kept assigning calls to species not known to occur in that area. The program we wrote to extract features from the calls seemed to falter, too: in most cases, the fitted curve failed to follow the bat calls. The reason was simply that these new, real-world recordings were particularly noisy, unlike the clean data we had trained our algorithm on.
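
Asking a Random Forest which parameters mattered most can be sketched as follows, again assuming scikit-learn, whose forests expose a `feature_importances_` attribute after training. The data are synthetic and built so that only one made-up parameter, the lowest frequency, differs between the two groups. The feature names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
n_calls = 300
feature_names = ["lowest_freq_khz", "duration_ms", "steepest_slope"]

# Two synthetic groups that differ only in their lowest frequency.
species = rng.integers(0, 2, n_calls)
X = np.column_stack([
    np.where(species == 0, 40.0, 50.0) + rng.normal(0, 2, n_calls),
    rng.normal(5, 1, n_calls),    # pure noise
    rng.normal(-10, 3, n_calls),  # pure noise
])

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, species)

# Importances sum to 1; a larger share means the trees relied on that parameter more.
ranking = sorted(zip(forest.feature_importances_, feature_names), reverse=True)
```

The top entry of `ranking` is the lowest frequency, as designed. On real data the ranking is only a guide, since correlated parameters can share or trade importance between them.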

When it comes to machine learning algorithms, there is often a trade-off between accuracy and transparency. Of course we want to know exactly how an algorithm works and why it makes the decisions it does, but sometimes we have to sacrifice transparency for accuracy.

The Jones group at the Centre for Biodiversity and Environment Research at University College London is making use of the latest trend in machine learning to classify bat calls: deep learning. These algorithms are highly complex, particularly opaque and extremely difficult to understand, but they’ve excelled in accuracy for certain tasks. The Jones group is working towards using deep learning techniques to create a new classifier for European bats.

Although deep learning algorithms are less transparent, they hold big promise for ecology, as long as they are rigorously tested to make sure that they’re drawing the right conclusion for each bat.

The more we know about bats, the better. It seems that artificial intelligence can get us a good part of the way there. And it might just help keep the tequila flowing.