Convolution: An Exploration of a Familiar Operator’s Deeper Roots

In the world of modern machine learning, the convolution operator occupies a strange position: it’s both trivially familiar to anyone who’s read a neural network paper since 2012, and simultaneously an object whose deeper mathematical foundations are often poorly understood. In audio processing contexts, you might hear it described as a signal smoother. In convolutional nets, it aggregates information from nearby pixels and is also described as a pattern-matching filter, activating more strongly in response to specific local pixel patterns. On the surface, it’s not immediately obvious how all of these different conceptual aspects of convolution are rooted in a shared mathematical reality.

The process of piecing all of these interpretations together into a shared web of understanding was reminiscent of finally really looking at a mural on a building you pass every day, only to find it deep with connotations you hadn’t previously seen. On a superficial level, much of this theory may seem irrelevant to the work of a practitioner, but I think that having a more solidly rooted understanding of the intellectual lineage of the network convolutions we use every day allows us to engage with literature on the subject from a footing of confidence, rather than confusion.

The Very Best Place to Start

Our story begins, not with the convolution’s various wanderings into areas of applied science, but here, with the simple mathematical definition underneath it all. I know that to those of you familiar with filter-based convolutions, this framing may be new and strange, but hang in there — I will do my best to connect all the dots eventually.
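The definition in question is the standard convolution integral:

$$(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau$$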

In the equation above, f and g are both functions which operate on the same domain of inputs. In the context of probability, you might frame these as density functions; in audio processing, they might be waveforms; but, fundamentally, we can just think of them as generic functions. When you convolve two functions together, the convolution itself acts as a function, insofar as we can calculate its value at a particular point t, just as we can with any other function.

To walk through what is happening here: to calculate the convolution between f and g at a point t, we take an integral over all values tau between negative and positive infinity, and, at each point, multiply the value of f at position tau by the value of g at t - tau, that is, at the difference between the point for which the convolution is being calculated and the tau at that point in the integral.
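To make the mechanics tangible, here is a minimal sketch of the discrete analogue of that calculation in Python, where the integral becomes a sum over sampled values of tau. The particular f, g, and sample grid below are illustrative choices of mine, not anything specific from the original post.

```python
import numpy as np

def convolve_at(f, g, t, taus):
    """Discrete approximation of (f * g)(t): sum f(tau) * g(t - tau) over sampled taus."""
    dtau = taus[1] - taus[0]  # spacing of the sample grid
    return np.sum(f(taus) * g(t - taus)) * dtau

# Illustrative choices: f is a boxcar signal, g is a narrow, normalized Gaussian kernel.
f = lambda x: ((x > -1) & (x < 1)).astype(float)
g = lambda x: np.exp(-x**2 / 0.5) / np.sqrt(0.5 * np.pi)

taus = np.linspace(-10, 10, 4001)
print(convolve_at(f, g, 0.0, taus))   # near 1: the kernel sits fully inside the boxcar
print(convolve_at(f, g, 3.0, taus))   # near 0: the kernel sees almost none of the boxcar
```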

At first glance, this calculation may seem arbitrary: it’s easy enough to understand mechanically what’s being calculated, but harder to intuit why this would be a useful operation to perform. To make its usefulness clearer, I’ll examine the operation through a few different, but interconnected, lenses.

Convolution as Function Smoothing

One way I find it useful to think of convolutions is by analogy to Kernel Density Estimation. In KDE, you pick some kernel — a Gaussian function, perhaps — and add up copies of that kernel centered at each point in your dataset. Where many data points lie close to one another, these function copies will add up; ultimately, this process results in an empirical approximation to a probability density function. The important thing to take away from this, for the purposes of this analogy, is the way that the kernel acts as a smoother around the data points. If we just used a simple histogram, and added one unit of mass directly at the position of each observed data point, our results would be very choppy and discontinuous, especially at small sample sizes. By using a kernel that spreads that mass around to nearby points, we express a preference for smooth functions, leveraging the intuition that many underlying functions in nature are smooth.
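To make the KDE analogy concrete, here is a minimal sketch of a Gaussian KDE built exactly this way: one kernel copy per data point, summed up. The data values and bandwidth are made up purely for illustration.

```python
import numpy as np

def gaussian_kernel(x, center, bandwidth=0.5):
    """A normalized Gaussian bump centered on a single data point."""
    return np.exp(-(x - center)**2 / (2 * bandwidth**2)) / (bandwidth * np.sqrt(2 * np.pi))

def kde(x, data, bandwidth=0.5):
    """Sum one kernel copy per observed data point, then normalize by the sample size."""
    return sum(gaussian_kernel(x, c, bandwidth) for c in data) / len(data)

data = np.array([1.1, 1.3, 1.4, 2.0, 4.7])   # made-up observations
grid = np.linspace(0, 6, 601)
density = kde(grid, data)
print(density.sum() * (grid[1] - grid[0]))   # close to 1.0: the summed copies form a proper density
```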

In a convolution, rather than smoothing the function created by the empirical distribution of datapoints, we take a more general view, which allows us to smooth any function f(x). But the mechanics are similar: we take some kernel function g(x), and at each point in the integral we place a copy of g(x), scaled up by (which is to say, multiplied by) the value of f(x) at that point.

A helpful visualization of this effect is shown above, pulled from this excellent summary of convolution.

The top plot shows an original f(x), the noisy signal we want to smooth. The bottom left shows the “kernel weighting” process — here a Gaussian function is centered at each point, and scaled up by the value of f(x) at that point. [As an aside: this is technically showing a finite convolution, where you sum up a finite set of scaled kernels, each centered at one of a finite set of sampled points along the function, but the intuition transfers over to the harder-to-visualize infinite case, where you’re taking a true integral over a function rather than just a sum over sampled points.] Finally, the plot on the bottom right shows what you get when you sum up all of those scaled kernels: the plot of the convolution (f*g)(x).
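As a rough sketch of that smoothing effect in code (my own construction, assuming a normalized Gaussian kernel as in the figure, with all specific values chosen arbitrarily):

```python
import numpy as np

# A noisy signal f sampled on a regular grid.
x = np.linspace(0, 10, 500)
f = np.sin(x) + 0.3 * np.random.randn(x.size)

# A normalized Gaussian kernel g: each sample of f contributes a scaled copy of this bump.
k = np.arange(-25, 26)                # kernel support, in samples
g = np.exp(-k**2 / (2 * 8.0**2))      # Gaussian weights, sigma = 8 samples
g /= g.sum()

# Discrete convolution: the sum of all the scaled kernel copies.
smoothed = np.convolve(f, g, mode="same")
print(f.std(), smoothed.std())        # the smoothed signal varies less than the noisy one
```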

Convolution as Localized Information Aggregation

A quite different lens — and one that gets us a step closer to the way convolutions are used in neural nets — is convolution as a kind of information aggregation, but an aggregation from the perspective of a specific point: the point at which the convolution is being evaluated. (For the sake of compactness, I’ll refer to this point as the “output point”, and the points over which the integral is being taken as “input points”.) The output point wants to summarize information about the value of f(x) across the function’s full domain, but it wants to do so according to some specific rule. Perhaps it only cares about the value of f at points one unit away from it, and doesn’t need to know about points outside that window. Perhaps it cares more about points closer to itself in space, but the extent to which it cares decays smoothly, rather than dropping off to zero outside of some fixed window, as in the first example. Or, perhaps it cares most about points far away from it. All of these different kinds of aggregations can fall under the framework of a convolution, and they are expressed through the function g(x).

From a mathematical viewpoint, convolutions can be seen as calculating a weighted sum of the values of f(x), and doing so in a way where the contribution from each x is determined by the distance between it and the output point t. More precisely, the weight is determined by asking: “if there were a copy of g(x) centered at each point x, what would the value of that copied function be if we evaluated it at the output point t?”

To visualize this, take a look at the plot below. In this example, we are using g(x) = x², and imagining how we’d calculate the weights for a weighted sum evaluated at x = -2.

The orange curve is g(x), but centered at x=1, and its value at x=-2 is 9. The green and purple curves represent g(x) centered at x=2 and x=3 respectively. If we were to imagine that f is a function that only takes non-zero values at 1, 2, and 3, we could calculate (f * g)(-2) as: 9*f(1) + 16*f(2) + 25*f(3).
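Here is a tiny sketch of that arithmetic, written straight from the convolution formula (the specific non-zero values of f below are made up purely for illustration):

```python
# g(x) = x^2, and f is non-zero only at x = 1, 2, 3 (illustrative values chosen arbitrarily).
g = lambda x: x**2
f = {1: 0.5, 2: 1.0, 3: 0.25}

t = -2  # the output point
# (f * g)(t) = sum over input points tau of f(tau) * g(t - tau)
result = sum(f_val * g(t - tau) for tau, f_val in f.items())

# The weights are g(-2 - 1) = 9, g(-2 - 2) = 16, g(-2 - 3) = 25
print(result)                               # 9*0.5 + 16*1.0 + 25*0.25 = 26.75
print(9 * f[1] + 16 * f[2] + 25 * f[3])     # the same number, written out by hand
```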

Looking at the image above, with its multiple overlaid copies of g(x), it’s hopefully easy to see the connection between the information aggregation frame and the KDE one. When we focus attention on a specific input point, each point’s copy of g(x) is answering the question “how will the f(x) information from this point be spread out?” When we focus on a specific output point, we’re asking “how strong is the contribution from each other point to the aggregation of f(x) at this output point?” Each vertical line on the multiple-copies graph from the prior section corresponds to the convolution at a given x, and reading off the intersections of that line with each g(x) copy gives you the weight of f(x) at the x where that copy is centered.

Centering On a Point of Confusion

Thinking of convolutions as information aggregators gets us quite a bit closer to the role they play in neural networks, where they summarize the behavior of the layer below in the local region around a point. However, two meaningful gaps still exist.

The first is the fact that, at an initial glance, the image convolution filter seems quite structurally different from the examples this post has so far used, insofar as the filters are 2D and discrete, whereas the examples have been 1D and continuous. But, in practice, these differences are more cosmetic than they are fundamental: it’s simply the case that, on images, f(x) and g(x) are defined and indexed over a 2D space (x_1, x_2), and that that space is chunked into discrete bins within which the function value doesn’t vary.
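To make that concrete, here is a minimal sketch of a 2D discrete convolution written directly from the formula, as a plain nested loop over the integral-turned-sum. This is my own illustrative code, not an excerpt from any framework, and the tiny “image” and kernel are arbitrary.

```python
import numpy as np

def conv2d_full(f, g):
    """Discrete 2D convolution, straight from the formula:
    out[t1, t2] = sum over (tau1, tau2) of f[tau1, tau2] * g[t1 - tau1, t2 - tau2]."""
    fh, fw = f.shape
    gh, gw = g.shape
    out = np.zeros((fh + gh - 1, fw + gw - 1))
    for t1 in range(out.shape[0]):
        for t2 in range(out.shape[1]):
            total = 0.0
            for tau1 in range(fh):
                for tau2 in range(fw):
                    i, j = t1 - tau1, t2 - tau2   # index into g at (t - tau)
                    if 0 <= i < gh and 0 <= j < gw:
                        total += f[tau1, tau2] * g[i, j]
            out[t1, t2] = total
    return out

# A tiny "image" and a tiny kernel, just to show the 2D discrete case uses the same formula.
f = np.arange(16, dtype=float).reshape(4, 4)
g = np.array([[0.0, 1.0], [2.0, 3.0]])
print(conv2d_full(f, g))   # matches scipy.signal.convolve2d(f, g), if SciPy is available
```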

What’s actually difficult and unintuitive about the jump from traditional convolution to network convolution is that, when we see localized aggregation happening in the neural network context, we typically frame it as happening through a weighting kernel centered at the aggregation point, which pulls in information from the surroundings according to that kernel. A convolution filter in a network is composed of the weights of all the input points surrounding the output point (as well as the output point itself).

You’ve probably seen this image before, or something very like it.

However, as you may recall from the discussion earlier in the post, traditional mathematical convolution doesn’t actually center its g(x) weight kernels around the output point. In the convolution formula, copies of g(x) get centered at each input point, and the value of that copy at a given output determines the weight of that input point there. So, confusingly, filter kernels as we know them in network convolution are not the same thing as g(x) in the convolution formula. If we wanted to take a network filter and apply it using the convolution formula, we would have to flip it (in the 1D case) or flip it along both axes, i.e. rotate it 180 degrees (in the 2D case). And, going in reverse, if we wanted to take a g(x) from the convolution integral and use it as an output-centered filter, that filter would be g(-x).
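As a quick numerical sanity check of that flip relationship in the 1D discrete case, here is a small sketch using NumPy, where cross-correlation (np.correlate) plays the role of the network-style, output-centered filter; the specific arrays are arbitrary.

```python
import numpy as np

f = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # the "signal" / layer input
k = np.array([0.5, 1.0, -1.0])            # a network-style filter (asymmetric on purpose)

# Network-style filtering is cross-correlation: slide k over f without flipping it.
filtered = np.correlate(f, k, mode="same")

# True mathematical convolution with the same k gives a different answer...
convolved = np.convolve(f, k, mode="same")

# ...but convolving with the flipped kernel k[::-1] reproduces the filter output.
convolved_flipped = np.convolve(f, k[::-1], mode="same")

print(np.allclose(filtered, convolved))          # False: convolution != correlation here
print(np.allclose(filtered, convolved_flipped))  # True: flipping the kernel bridges the two
```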