Bayesian Inference, Complex Measurement, and Model Fitting
References for today
- MPoL introduction
- Data Analysis: A Bayesian Tutorial, by Sivia and Skilling
- Data analysis recipes: Fitting a model to data, by Hogg et al.
- Data analysis recipes: Probability calculus for inference, by Hogg
- Machine Learning: A Probabilistic Perspective by Murphy, Chapter 10
- Pattern Recognition and Machine Learning by Bishop, Chapter 8
- Probabilistic Graphical Models by Sucar, especially Chapters 7, 8
Last time
- general interferometer with north-south and east-west baselines
- arrays with multiple antennas and Earth aperture synthesis
- point spread functions (PSFs) and their relationship to array configuration
Today’s lecture
Now that we’ve covered how interferometers work to observe a source and the Fourier transform theory behind that, we’re going to focus on the data products (called “visibilities”) and Bayesian inference techniques for analyzing the data in its natural space. In subsequent lectures we will talk about how we might use the visibilities and Fourier inversion techniques to synthesize images of the source, but this lecture occupies an important intermediate (and foundational) step, where we are treating the visibilities as the “raw” data product and thinking about how we bring our analysis techniques into that space.
The topics we will cover include:
- Bayesian inference
- Forward-modeling with a generative model
- Complex-valued noise and measurement (weights)
- Statistical weight and relationship to point source uncertainty
- Forward-modeling visibility data
- Missing spatial frequencies (model constraints)
Probability calculus and Bayesian Inference
- A good reference for this section is Data analysis recipes: Probability calculus for inference by Hogg
Generally, we write probability distributions like $p(h)$, where the argument denotes the quantity whose probability we are describing.
Probability functions are normalized: $\int p(h)\,\mathrm{d}h = 1$.
Say that $h$ is the height of an individual selected from the US population.
TODO: draw a bell curve
Probability functions have units. In this case, $p(h)$ has units of $1/\mathrm{m}$, i.e., probability per unit height, so that integrating over a range of heights yields a dimensionless probability.
If we selected an individual from the US population and wanted to know the probability that their height was between 1.7 and 1.9 meters, we could do an integral over this range: $P(1.7\,\mathrm{m} < h < 1.9\,\mathrm{m}) = \int_{1.7}^{1.9} p(h)\,\mathrm{d}h$.
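To make the units and normalization concrete, here is a small Python sketch (the mean and standard deviation below are made-up, illustrative numbers, not actual census values):

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

# assumed, illustrative parameters for p(h); not actual census values
mean_height = 1.75  # [m]
std_height = 0.10   # [m]
p_h = stats.norm(loc=mean_height, scale=std_height)

# p(h) has units of 1/m, so integrating over a range of heights (in m)
# yields a dimensionless probability
prob, _ = quad(p_h.pdf, 1.7, 1.9)
print(f"P(1.7 m < h < 1.9 m) = {prob:.3f}")

# cross-check using the cumulative distribution function
print(p_h.cdf(1.9) - p_h.cdf(1.7))

# the normalization integral over all plausible heights is 1
norm, _ = quad(p_h.pdf, 0.0, 4.0)
print(f"normalization = {norm:.3f}")
```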
Conditional probabilities
It’s very common that we will talk about multiple parameters at the same time. For example, we can continue our example and let a second quantity, say the person’s age $a$, describe the same individual. The conditional probability distribution $p(h \mid a)$ describes the probability of height $h$ given a particular value of $a$.
What are the units of this probability distribution? It’s actually the same as before, it’s $1/\mathrm{m}$: conditioning on $a$ does not change the units, because the distribution is still a density over height only.
You can’t do the integral over the conditioned variable, though: $\int p(h \mid a)\,\mathrm{d}a$ is not a normalized (or particularly meaningful) quantity. The normalization applies only to the variable in front of the bar, $\int p(h \mid a)\,\mathrm{d}h = 1$.
Factorizing probabilities
Now let’s consider the joint probability distribution $p(h, a)$ of height and age.
The same normalization rules apply, only now these need to be done over two dimensions: $\iint p(h, a)\,\mathrm{d}h\,\mathrm{d}a = 1$.
You can take any joint probability distribution and factor it into conditional distributions. So, we could write $p(h, a) = p(h \mid a)\, p(a)$.
As I said, this factorization can apply to any joint probability distribution; we could equally well have factored in the other order, $p(h, a) = p(a \mid h)\, p(h)$.
Side note that if $h$ and $a$ are independent, then $p(h \mid a) = p(h)$ and the joint distribution factors as $p(h, a) = p(h)\, p(a)$.
We can put the two factorization equations together to arrive at another relationship: $p(a \mid h) = \dfrac{p(h \mid a)\, p(a)}{p(h)}$, which is Bayes’s rule.
Marginalization
One amazing thing you can do with probability distributions is marginalization. Say that I told you the joint distribution $p(h, a)$, but you only cared about the distribution of heights. You can integrate (“marginalize”) over the variable you don’t care about: $p(h) = \int p(h, a)\,\mathrm{d}a$.
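A small numerical sketch of marginalization (with an entirely invented joint distribution) looks like:

```python
import numpy as np
from scipy import stats

# grids over height h (meters) and a second, made-up quantity y
h = np.linspace(1.2, 2.3, 200)
y = np.linspace(0.0, 10.0, 150)
dh = h[1] - h[0]
dy = y[1] - y[0]
H, Y = np.meshgrid(h, y, indexing="ij")

# an invented joint density p(h, y) = p(y) * p(h | y), for illustration only
p_y = stats.norm.pdf(Y, loc=5.0, scale=1.5)
p_h_given_y = stats.norm.pdf(H, loc=1.6 + 0.02 * Y, scale=0.08)
p_joint = p_y * p_h_given_y

# marginalize over y: p(h) = integral of p(h, y) dy (here, a simple Riemann sum)
p_h = p_joint.sum(axis=1) * dy

# the marginal should still be normalized: integral of p(h) dh is ~1
print((p_h * dh).sum())
```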
As we’ll see in a moment, this has huge implications when it comes time to do inference.
Likelihood functions
Now, let’s revisit Bayes’s rule and rewrite it like $p(\boldsymbol{\theta} \mid \boldsymbol{D}) = \dfrac{p(\boldsymbol{D} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})}{p(\boldsymbol{D})}$, where $\boldsymbol{\theta}$ is a set of model parameters and $\boldsymbol{D}$ is a collection of data.
The term on the left hand side is called a posterior distribution and is really a wonderful thing to report at the end of your analysis. Say you collected many years of measurements on the positions (orbits) of Jupiter, Saturn, and their moons, and then used those measurements to infer the mass of Saturn (as Laplace famously did). The posterior you would be most interested in would be a 1D distribution of the probability of the mass of Saturn and would represent your degree of belief that Saturn truly had that particular mass. This would be conditional on all of the measurements you made.
The term on the very right hand side is a prior probability distribution and expresses your belief about the mass of Saturn in the absence of data. For example, we might rightly say that the mass needed to be greater than zero, and less than the mass of the Sun. A simple prior would then ascribe equal probabilities to all values in between (or perhaps equal probabilities to the logarithm of the mass of Saturn).
The remaining term, $p(\boldsymbol{D} \mid \boldsymbol{\theta})$, is called the likelihood function: the probability of obtaining the data that we actually observed, given a particular set of model parameters. (The denominator, $p(\boldsymbol{D})$, is a normalizing constant that we can usually ignore for parameter inference.)
A quick note that likelihood functions show up in frequentist analysis all the time, too. However, the interpretations of probability are different.
Fitting a line
Let’s dive into a quick example with some real data to make these concepts clearer.
Typically, when astronomers fit a model to some dataset, such as a line $y = m x + b$, they are making assumptions (implicit or explicit) about the process that generated the data.
TODO: draw a bunch of points for putting a line through.
For most real-world datasets, we don’t measure the “true” $y$ value of the underlying relationship; instead, our measurement process adds noise, so that each datum is $y_i = m x_i + b + \epsilon_i$,
where $\epsilon_i$ is a noise realization drawn from a Gaussian with standard deviation $\sigma_i$, i.e., $\epsilon_i \sim \mathcal{N}(0, \sigma_i^2)$.
This information about the data and noise generating process means that we can write down a likelihood function to calculate the probability that we observe the data that we do, given a set of model parameters. The likelihood function is $\mathcal{L}(\boldsymbol{\theta}) = p(\boldsymbol{D} \mid \boldsymbol{\theta})$.
Let’s call $M(x_i \mid \boldsymbol{\theta}) = m x_i + b$ the model prediction for the $i$th data point, where $\boldsymbol{\theta} = \{m, b\}$ are the model parameters.
The probability of observing each datum is a Gaussian (normal distribution) centered on the model value, evaluated at the measured value $y_i$ with width $\sigma_i$:
$$p(y_i \mid \boldsymbol{\theta}) = \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left[-\frac{\left(y_i - M(x_i \mid \boldsymbol{\theta})\right)^2}{2\sigma_i^2}\right].$$
Assuming the noise on each datum is independent, the full likelihood is the product over all $N$ data points, $\mathcal{L}(\boldsymbol{\theta}) = \prod_{i=1}^{N} p(y_i \mid \boldsymbol{\theta})$.
The logarithm of the likelihood function is
$$\ln \mathcal{L}(\boldsymbol{\theta}) = -\frac{1}{2}\sum_{i=1}^{N}\left[\ln\left(2\pi\sigma_i^2\right) + \frac{\left(y_i - M(x_i \mid \boldsymbol{\theta})\right)^2}{\sigma_i^2}\right].$$
You may recognize that the right hand term looks similar to the $\chi^2$ statistic,
$$\chi^2(\boldsymbol{\theta}) = \sum_{i=1}^{N} \frac{\left(y_i - M(x_i \mid \boldsymbol{\theta})\right)^2}{\sigma_i^2}.$$
Assuming that the uncertainty ($\sigma_i$) on each data point does not depend on the model parameters, maximizing the likelihood function is equivalent to minimizing the $\chi^2$ function, and we can write $\ln \mathcal{L}(\boldsymbol{\theta}) = -\frac{1}{2}\chi^2(\boldsymbol{\theta}) + C$,
where $C$ is a constant that does not depend on the model parameters.
When it comes time to do parameter inference, however, it’s important to keep in mind:
- the simplifying assumptions we made about the noise uncertainties being constant with respect to the model parameters. If we were to “fit for the noise” in a hierarchical model, for example, we would need to use the full form of the log-likelihood function, including the $\ln\left(2\pi\sigma_i^2\right)$ term.
- that in order to maximize the likelihood function we want to minimize the $\chi^2$ function.
- that constants of proportionality (e.g., the $1/2$ in front of the $\chi^2$) can matter when combining likelihood functions with prior distributions for Bayesian parameter inference.
To be specific, the posterior is proportional to the product of the likelihood and the prior, $p(\boldsymbol{\theta} \mid \boldsymbol{D}) \propto \mathcal{L}(\boldsymbol{\theta})\, p(\boldsymbol{\theta})$, so $\ln p(\boldsymbol{\theta} \mid \boldsymbol{D}) = -\frac{1}{2}\chi^2(\boldsymbol{\theta}) + \ln p(\boldsymbol{\theta}) + C$; using $-\chi^2$ in place of $-\frac{1}{2}\chi^2$ would artificially narrow the posterior.
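Putting the pieces of this section together, a minimal sketch of the line-fitting log-likelihood, log-prior, and log-posterior might look like the following (the data, noise levels, and prior bounds are placeholders):

```python
import numpy as np

def log_likelihood(theta, x, y, sigma):
    """Gaussian log-likelihood for the line model y = m*x + b."""
    m, b = theta
    model = m * x + b
    # keep the normalization term in case we ever want to fit for the noise
    return -0.5 * np.sum(np.log(2 * np.pi * sigma**2) + (y - model) ** 2 / sigma**2)

def log_prior(theta):
    """A simple (assumed) uniform prior on the slope and intercept."""
    m, b = theta
    if -10.0 < m < 10.0 and -100.0 < b < 100.0:
        return 0.0
    return -np.inf

def log_posterior(theta, x, y, sigma):
    """Unnormalized log-posterior: ln L + ln prior (+ const)."""
    lp = log_prior(theta)
    if not np.isfinite(lp):
        return -np.inf
    return lp + log_likelihood(theta, x, y, sigma)

# example usage with made-up data
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 20)
sigma = 0.5 * np.ones_like(x)
y = 2.0 * x + 1.0 + rng.normal(0.0, sigma)
print(log_posterior([2.0, 1.0], x, y, sigma))
```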
Visibility Data
Now that we have reviewed likelihood functions, let’s turn back to radio astronomy and go into further detail about how the visibility function is sampled. Recall that the visibility domain is the Fourier transform of the image-plane sky brightness $I(l, m)$:
$$\mathcal{V}(u, v) = \iint I(l, m)\, e^{-2\pi i (u l + v m)}\,\mathrm{d}l\,\mathrm{d}m.$$
The visibility function is complex-valued, and each measurement of it (denoted by $V_i$) is made at a specific spatial frequency $(u_i, v_i)$ and is corrupted by noise:
$$V_i = \mathcal{V}(u_i, v_i) + \epsilon_i.$$
Here $\epsilon_i$ is a complex-valued noise realization,
where the real and imaginary components are each drawn independently from a zero-mean Gaussian with standard deviation $\sigma_i$: $\epsilon_\mathrm{Re} \sim \mathcal{N}(0, \sigma_i^2)$ and $\epsilon_\mathrm{Im} \sim \mathcal{N}(0, \sigma_i^2)$.
The full complex noise-draw is given by $\epsilon_i = \epsilon_\mathrm{Re} + i\,\epsilon_\mathrm{Im}$.
Radio interferometers will commonly represent the uncertainty on each visibility measurement by a “weight” $w_i = \dfrac{1}{\sigma_i^2}$.
Like $\sigma_i$, the weight describes the uncertainty of both the real and imaginary components of the visibility measurement.
A full interferometric dataset is a collection of visibility measurements, which we represent by $\boldsymbol{V} = \{V_1, V_2, \ldots, V_N\}$,
each one having a corresponding spatial frequency coordinate $(u_i, v_i)$ and weight $w_i$.
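As a hedged illustration of this data-generating process, the sketch below draws mock complex visibilities from an assumed “true” visibility function at a handful of $(u, v)$ points, using the weights to set the per-visibility noise (all values are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# assumed (u, v) sampling [kilolambda] and per-visibility weights [1/Jy^2]
u = rng.uniform(-500, 500, size=100)
v = rng.uniform(-500, 500, size=100)
weight = rng.uniform(0.5, 2.0, size=100)
sigma = 1.0 / np.sqrt(weight)  # w_i = 1 / sigma_i^2

def true_visibility(u, v):
    """A stand-in 'true' visibility function: a 1 Jy point source at the phase center."""
    return np.ones_like(u) + 0j

# V_i = V(u_i, v_i) + eps_i, with eps drawn independently for real and imaginary parts
eps = rng.normal(0.0, sigma) + 1j * rng.normal(0.0, sigma)
data = true_visibility(u, v) + eps
```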
Likelihood functions for Fourier data
Now that we’ve introduced likelihood functions in general and the specifics of Fourier data, let’s talk about likelihood functions for inference with Fourier data. As before, our statement about the data generating process,
$$V_i = \mathcal{V}(u_i, v_i) + \epsilon_i,$$
leads us to the formulation of the likelihood function.
First, let’s assume we have some model that we’d like to fit to our dataset. To be a forward model, it should be able to predict the value of the visibility function for any spatial frequency, i.e., we need to be able to calculate the model visibility $M_\mathcal{V}(u, v \mid \boldsymbol{\theta})$ at every sampled point $(u_i, v_i)$.
Following the discussion about how the complex noise realization $\epsilon_i$ is generated, the log-likelihood again takes the form
$$\ln \mathcal{L}(\boldsymbol{V} \mid \boldsymbol{\theta}) = -\frac{1}{2}\chi^2(\boldsymbol{V} \mid \boldsymbol{\theta}) + C.$$
Because the data and model are complex-valued, $\chi^2$ is evaluated using the complex modulus of the residuals,
$$\chi^2(\boldsymbol{V} \mid \boldsymbol{\theta}) = \sum_{i=1}^{N} w_i \left| V_i - M_\mathcal{V}(u_i, v_i \mid \boldsymbol{\theta}) \right|^2,$$
where the modulus-squared is equivalent to summing separate real and imaginary $\chi^2$ terms, each weighted by $w_i = 1/\sigma_i^2$.
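In code, a minimal sketch of this visibility-domain log-likelihood might look like the following, assuming `model_vis(u, v, theta)` is some function (ours, for illustration) that returns model visibilities at arbitrary spatial frequencies:

```python
import numpy as np

def log_likelihood_vis(theta, u, v, data, weight, model_vis):
    """Visibility-domain Gaussian log-likelihood (up to a constant).

    data      : complex visibility measurements V_i
    weight    : w_i = 1 / sigma_i^2, applied to real and imaginary parts
    model_vis : callable returning model visibilities M(u_i, v_i | theta)
    """
    model = model_vis(u, v, theta)
    resid = data - model
    # chi^2 with the complex modulus: sums the real and imaginary residuals
    chi2 = np.sum(weight * np.abs(resid) ** 2)
    return -0.5 * chi2
```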
Because images of the sky are real-valued, the real part of the visibility function must be even and the imaginary part odd; in other words, the visibility function is Hermitian. This means that
$$\mathcal{V}(-u, -v) = \mathcal{V}^{*}(u, v),$$
where $^{*}$ denotes the complex conjugate.
So, if you make a measurement of $V_i$ at $(u_i, v_i)$, you have also effectively measured the visibility at $(-u_i, -v_i)$: it is simply the complex conjugate $V_i^{*}$.
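In practice, this means each measured visibility can be “mirrored” to its conjugate baseline. A small sketch of augmenting a dataset this way (function and variable names are our own):

```python
import numpy as np

def augment_hermitian(u, v, data, weight):
    """Append the Hermitian-conjugate points V(-u, -v) = conj(V(u, v)) to a dataset."""
    u_full = np.concatenate([u, -u])
    v_full = np.concatenate([v, -v])
    data_full = np.concatenate([data, np.conj(data)])
    weight_full = np.concatenate([weight, weight])
    return u_full, v_full, data_full, weight_full
```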
Point source sensitivity
We’ll use our forward modeling formalism to fit for the flux of a point source.
See also Casassus and Cárcamo 2022, Appendix A4.
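As a brief sketch of how this works out (assuming the point source sits at the phase center, so that its model visibility is simply a constant, real-valued flux $F$): minimizing the $\chi^2$ from the previous section with respect to $F$ gives the maximum-likelihood flux and its uncertainty in terms of the summed weights,
$$\chi^2(F) = \sum_i w_i \left[\left(\mathrm{Re}\,V_i - F\right)^2 + \left(\mathrm{Im}\,V_i\right)^2\right]
\quad\Rightarrow\quad
\hat{F} = \frac{\sum_i w_i\,\mathrm{Re}\,V_i}{\sum_i w_i},
\qquad
\sigma_F = \frac{1}{\sqrt{\sum_i w_i}}.$$
This is the sense in which the statistical weights set the point source sensitivity of a dataset: the flux uncertainty shrinks as the total weight $\sum_i w_i$ grows.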
More complex visibility models
It’s difficult to reason about all but the simplest models directly in the Fourier plane, so usually models are constructed in the image plane as $I(l, m \mid \boldsymbol{\theta})$ and then Fourier transformed to the visibility plane for comparison with the data.
For marginally resolved sources, it’s common to fit simple models like a 2D Gaussian. We can write down an image plane model and then calculate its Fourier transform analytically.
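As a small sketch of that workflow (the function name and unit choices are ours): a circular 2D Gaussian of total flux $F$ and image-plane standard deviation $\sigma$ (in radians), centered at the phase center, has the analytic visibility $\mathcal{V}(u, v) = F \exp\left[-2\pi^2\sigma^2\left(u^2 + v^2\right)\right]$, which we can evaluate directly at the sampled spatial frequencies:

```python
import numpy as np

def gaussian_visibility(u, v, flux, sigma_rad):
    """Analytic visibility of a circular 2D Gaussian centered at the phase center.

    u, v      : spatial frequencies [wavelengths, i.e., cycles per radian]
    flux      : total flux of the Gaussian [Jy]
    sigma_rad : image-plane standard deviation of the Gaussian [radians]
    """
    # The Fourier transform of a Gaussian is a Gaussian; a symmetric source at
    # the phase center produces a purely real visibility.
    return flux * np.exp(-2.0 * np.pi**2 * sigma_rad**2 * (u**2 + v**2)) + 0j

# example: a 1 Jy Gaussian with sigma = 0.1 arcsec, evaluated at a few (u, v) points
arcsec = np.pi / 180.0 / 3600.0  # radians per arcsecond
u = np.array([0.0, 5e4, 2e5])    # [lambda]
v = np.array([0.0, 5e4, 2e5])
print(gaussian_visibility(u, v, flux=1.0, sigma_rad=0.1 * arcsec))
```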
But these could also be more complicated models. For example, they could be channel maps of carbon monoxide emission from a rotating protoplanetary disk (as in Czekala et al. 2015, where the model parameters described the disk structure and kinematics, such as the stellar mass and the disk inclination).
With the likelihood function specified, we can add prior probability distributions $p(\boldsymbol{\theta})$ and explore the posterior $p(\boldsymbol{\theta} \mid \boldsymbol{V}) \propto \mathcal{L}(\boldsymbol{V} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})$ with standard inference techniques, e.g., MCMC sampling, as sketched below.
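For instance, a hedged, self-contained sketch using the emcee package (with made-up data, priors, and tuning choices) might look like:

```python
import numpy as np
import emcee

rng = np.random.default_rng(1)

# mock dataset: a marginally resolved Gaussian source (all numbers are illustrative)
N = 200
u = rng.uniform(-2e5, 2e5, N)            # spatial frequencies [lambda]
v = rng.uniform(-2e5, 2e5, N)
weight = np.full(N, 25.0)                # w_i = 1 / sigma_i^2  [1/Jy^2]
sigma = 1.0 / np.sqrt(weight)

def model_vis(u, v, flux, sigma_rad):
    # analytic visibility of a circular Gaussian at the phase center (see sketch above)
    return flux * np.exp(-2.0 * np.pi**2 * sigma_rad**2 * (u**2 + v**2))

arcsec = np.pi / 180.0 / 3600.0
flux_true, size_true = 0.5, 0.2 * arcsec  # 0.5 Jy, sigma = 0.2 arcsec
data = (model_vis(u, v, flux_true, size_true)
        + rng.normal(0.0, sigma, N) + 1j * rng.normal(0.0, sigma, N))

def log_posterior(theta):
    flux, sigma_rad = theta
    if not (0.0 < flux < 10.0 and 0.0 < sigma_rad < 1e-4):
        return -np.inf                   # uniform priors on both parameters
    resid = data - model_vis(u, v, flux, sigma_rad)
    return -0.5 * np.sum(weight * np.abs(resid) ** 2)

ndim, nwalkers = 2, 32
p0 = np.array([flux_true, size_true]) * (1.0 + 1e-3 * rng.standard_normal((nwalkers, ndim)))
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_posterior)
sampler.run_mcmc(p0, 2000)
samples = sampler.get_chain(discard=500, flat=True)  # posterior samples of (flux, sigma_rad)
```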
All of these types of models would be called parametric models, because we can represent the model using a finite set of parameters. E.g., for the Gaussian model, we have the widths along the major and minor axes, a rotation angle, and a 2D position; these parameters fully specify the model. One thing you need to be concerned about when doing inference with parametric models is whether you have the right model! If your source is actually a ring rather than a Gaussian, the posterior distribution of your parameters can be rendered meaningless.
Forward modeling with (simple) parametric models can be very useful for understanding the basic properties of a source (e.g., its flux, size, and orientation), especially when the data only marginally resolve it.
Discussion about model mis-specification and unsampled visibilities (model constraints)
For example, a parametric model can still constrain a source’s width or size even when the $u$, $v$ sampling is sparse, because the model’s functional form supplies the information at unsampled spatial frequencies.