Ian Czekala's homepage on Czekala Group

Syllabus

Fri, 25 Aug 2023 00:00:00 +0000

Objective: radio astronomy and interferometric imaging

Interferometric astrophysical observations are a vast and deep topic. The goal of this module component is to help students develop a practical understanding of how radio interferometers (in particular arrays like the VLA or ALMA) observe an astrophysical source, build a mathematical foundation for working with complex-valued, Fourier-plane data, and survey some of the many approaches that are used to investigate astrophysical phenomena, focusing on forward-modeling and regularised maximum likelihood techniques.

Instructor

Dr. Ian Czekala (he/him/his)
Email: ic95@st-andrews.ac.uk

Office hours by appointment. J.F. Allen building, Rm. 308.

If you are in any way feeling ill or suspect you might have been in contact with an individual infected with COVID, please stay home and seek medical care if necessary. We plan on posting all lecture notes, and we will work with you to provide you with the course materials you need.

Course Grade

The course grade (100%) will be assessed via a 2-hour written examination.

Tutorials

This module component will have 2 tutorial sessions, as listed in the schedule. These in-class, interactive sessions will be used to build practical understanding of the module material and apply your knowledge to solving the types of problems frequently encountered in radio astronomy research.

Tutorial problems will be distributed via Moodle approximately one week in advance. You are encouraged to collaborate and work through the tutorial problems together, in advance, with other members of the module. However, each student should be prepared to discuss their answers on their own during the tutorial session.

Programming

Some tutorial problems may require a small amount of programming. Students are encouraged to use whatever programming language they are most comfortable with. IC is most familiar with Python and Julia, so he will be more able to assist you with those.

Reference Materials

There are many additional resources that will be helpful during this course (and beyond) and will be called out in the course at the appropriate juncture. Many of these resources are freely available online or through the University library.

Textbooks

Essential Radio Astronomy by James Condon and Scott Ransom (online resource)
Tools of Radio Astronomy by Rohlfs and Wilson (ebook)
Interferometry and Synthesis in Radio Astronomy by Thompson, Moran, and Swenson (ebook)
The Fourier Transform and its Applications by R. Bracewell

Courses

18 NRAO Synthesis Imaging School slides and lectures

Videos

Cells to Galaxies Speaker Series Archive, in particular talks by Urvashi Rao and Sanjay Bhatnagar (opening ~15 minutes)
Lectures by David Wilner Part I and Part II

Syllabus

Tue, 21 Jun 2022 00:00:00 +0000

Objective: radio astronomy and interferometric imaging

Interferometric astrophysical observations are a vast and deep topic. The goal of this course is to help students develop a practical understanding of how interferometers (in particular arrays like the VLA or ALMA) observe an astrophysical source, build a mathematical foundation for working with complex-valued, Fourier-plane data, and survey some of the many approaches that are used to investigate astrophysical phenomena, focusing on forward-modeling and regularized maximum likelihood techniques.

Instructor

Professor (Dr.) Ian Czekala (he/him/his)
Email: ipc5094@psu.edu or iczekala@psu.edu (alias).

Office hours by appointment.

Course Calendar and Closure Policies

ASTRO 589 meets once a week on Wednesdays from 10:10am to 11:00am ET (prompt) in Davey Lab Room 538.

For full information on lecture dates and topics, see the Course Schedule.

If campus should be closed (e.g. for a weather-related event or COVID precautions), the instructor will provide instructions via email on course lecture format (possibly remote, keeping the same schedule) and examinations/assignments (due dates to be rescheduled no earlier than 48 hours after closure announcement).

Assignments and Project

Course Grade

The course grade will be based on

problem sets (55%)
group project and presentation (45%)

Problem sets

Problem sets will be assigned approximately every 2-3 weeks.

Please turn in homeworks in .PDF format to Ian by email at ipc5094@psu.edu with the subject line ASTRO 589 problem set. Handwritten solutions are fine, but please digitize your submission using a department scanner or a smartphone scanner app (e.g., Adobe Scan).

Due dates will be set to correspond to the beginning of a class period (i.e., 10:10am).

Programming

For problem sets, students are encouraged to use whatever programming language they are most comfortable with.

Some project choices, especially those involving CASA, may require basic familiarity with the Python programming language.

Problem set late policy

The following percentages will be deducted from your score

one day late: 5%
two days late: 10%
three days late: 25%
more than 72 hours late: no credit

If extenuating circumstances arise such that you will be unable to complete the homework on time, please contact Ian before the homework deadline and we can most likely arrange an extension.

Problem set collaboration policy

You are encouraged to collaborate and work through the problem sets together. However, each student must complete the final write up on their own, i.e., no problem set should be duplicated verbatim between students.

Group project and presentation

Students will form groups of 2-3 for their course projects. Groups will propose a project concept around some aspect of interferometry and discuss in detail at least one major astrophysical application. The project idea may be based on the list of example projects (below) or may be an original idea proposed by the group.

The project will culminate in a presentation to the class. This presentation will be a comprehensive lecture on your topic, which should take 35 minutes of our class period, followed by 15 minutes of detailed questions and answers. Students may submit any additional materials beyond their presentation for review, if so desired; however, no final report is required.

Good presentations will:

Provide a cogent astrophysical and technical introduction
Connect to course material covered (if applicable)
Introduce the key supporting background material, including any relevant equations or seminal figures from related works
Provide copious citations to and explanations of relevant literature
Explain the key observational or theoretical instruments/methodologies (as applicable)
Identify at least one major astrophysical application and discuss in detail
Contain some new aspect of technical development or investigation by the group
Finish the presentation within the allotted time (35 +/- 5 minutes)
Adequately answer questions raised by the instructor and students

Each group should provide the instructor with their proposed group by Wednesday, September 21st. The instructor will then assign one presentation date to each group. You should notify the instructor of your project topic at least two weeks prior to your presentation date to confirm that it is an acceptable choice.

It is a wise idea to practice your presentation from start to finish “live” to make sure your timing is correct. Groups whose talks are wildly under/over time will find it difficult to achieve full credit.

Example Project Ideas

Cross-validation strategies for imaging (including for RML)
How do MRI imagers work, and what are their fundamental relationships to radio interferometers?
Optical interferometers and science results
Download a dataset from the ALMA archive, image with CLEAN or RML techniques
Calibration techniques for ALMA and EHT
Self-calibration theory and application with CASA
Spectral line capabilities of ALMA, image a molecular line from ALMA archive
Wavelets: what are they, how have they been used in radio astronomy applications
Design new ALMA observations using CASA/simobserve
Expanding MPoL to do model fitting with Pyro
Expanding MPoL for fake data generation
Expanding MPoL to use nufft for gridding
Expanding MPoL to do self-calibration
Expanding MPoL to do primary beam corrections
Wide-field imaging: what changes about the imaging assumptions, applications
Briggs weighting: (introduction, Briggs’s thesis, application w/ CASA)
Time and bandwidth smearing: theory, demonstrations with CASA and real ALMA data
Non-parametric techniques for protoplanetary surface brightness profiles, including frank
Beam-forming and applications

A note about scope

Radio astronomy is a vast and deep topic—some of these project topics are considerable in scope and could constitute Ph.D. theses (some have). The primary purpose of the group project is to educate yourself and your classmates on an advanced radio astronomy and/or imaging topic. To that end, I encourage you to focus most of your time on developing a solid understanding of the concept and designing a pedagogical presentation that clearly develops and introduces core practical and theoretical components.

For example, if you were to choose to cover time and bandwidth smearing, I would expect at least half and potentially 2/3 of your presentation to be pedagogical, i.e., extending the mathematical formalism we discussed in class to arrive at why time and bandwidth smearing occur in an interferometric system, highlighting and presenting classical treatments on the topic from our class textbooks, and presenting an astrophysical observation (i.e., a published journal article) whose authors discuss their approach to mitigating these effects. You may also choose to cover historical development of the topic, if relevant.

The “technical” component of the presentation is designed to be an opportunity for you to practically apply your knowledge, either by application to a real dataset or development of a new algorithm (e.g., for MPoL). It is meant as a supplement to the aforementioned pedagogical treatment, not a substitute for it. Keeping with the time/bandwidth smearing example, an appropriate technical component to the project would be to download a dataset from the ALMA archive and use CASA to synthesize a range of images after various amounts of time and/or bandwidth averaging have been applied, and discuss if/how time or bandwidth smearing has compromised the image quality.

Presentations that do not devote proper coverage to introductory/pedagogical material will find it difficult to achieve a satisfactory grade—even those that have an impressive technical application. Conversely, group projects that fail to write a single line of new code but deliver a quality pedagogical component would likely still receive a passing grade.

I strongly encourage all groups to keep in contact with me as you are developing your project ideas, so that I may provide feedback on scope.

Reference Materials

Textbooks

Essential Radio Astronomy by James Condon and Scott Ransom
Tools of Radio Astronomy by Rohlfs and Wilson
Interferometry and Synthesis in Radio Astronomy by Thompson, Moran, and Swenson
Fourier Analysis and Imaging by R. Bracewell
The Fourier Transform and its Applications by R. Bracewell

Courses

18 NRAO Synthesis Imaging School slides and lectures

Videos

Cells to Galaxies Speaker Series Archive, in particular talks by Urvashi Rao and Sanjay Bhatnagar (opening ~15 minutes)
Lectures by David Wilner Part I and Part II

Masking Policy

We encourage you to get COVID-19 vaccinated and follow University policies on masking, especially in indoor spaces. Depending on background rates in Centre County, masks may be required for part or all of the fall semester. Please consult PSU VirusInfo for the latest policy.

Academic Integrity

Academic integrity is the pursuit of scholarly activity in an open, honest and responsible manner. Academic integrity is a basic guiding principle for all academic activity at The Pennsylvania State University, and all members of the University community are expected to act in accordance with this principle. Consistent with this expectation, the University’s Code of Conduct states that all students should act with personal integrity, respect other students’ dignity, rights and property, and help create and maintain an environment in which all can succeed through the fruits of their efforts.

Academic integrity includes a commitment by all members of the University community not to engage in or tolerate acts of falsification, misrepresentation or deception. Such acts of dishonesty violate the fundamental ethical principles of the University community and compromise the worth of work completed by others.

Disability Accommodation

Penn State welcomes students with disabilities into the University’s educational programs. The Student Disability Resources (SDR) website provides contact information for every Penn State campus. For further information, please visit the Student Disability Resources website.

In order to receive consideration for reasonable accommodations, you must contact the appropriate disability services office at the campus where you are officially enrolled, participate in an intake interview, and provide documentation: See documentation guidelines. If the documentation supports your request for reasonable accommodations, your campus disability services office will provide you with an accommodation letter. Please share this letter with your instructors and discuss the accommodations with them as early as possible. You must follow this process for every semester that you request accommodations.

Counseling and Psychological Services

Many students at Penn State face personal challenges or have psychological needs that may interfere with their academic progress, social development, or emotional wellbeing. The university offers a variety of confidential services to help you through difficult times, including individual and group counseling, crisis intervention, consultations, online chats, and mental health screenings. These services are provided by staff who welcome all students and embrace a philosophy respectful of clients’ cultural and religious backgrounds, and sensitive to differences in race, ability, gender identity and sexual orientation.

Counseling and Psychological Services at University Park CAPS: 814-863-0395
Counseling and Psychological Services at Commonwealth Campuses
Penn State Crisis Line (24 hours/7 days/week): 877-229-6400
Crisis Text Line (24 hours/7 days/week): Text LIONS to 741741

Educational Equity and Reporting Bias

Penn State takes great pride in fostering a diverse and inclusive environment for students, faculty, and staff. Acts of intolerance, discrimination, or harassment due to age, ancestry, color, disability, gender, gender identity, national origin, race, religious belief, sexual orientation, or veteran status are not tolerated and can be reported through Educational Equity via the Report Bias webpage.

Whom should I contact if I need additional assistance? I encourage you to be in contact with your academic adviser for specific needs you might have outside this course. Academic adviser information and scheduling can be found at https://sites.psu.edu/starfishinfo/. There are also additional resources available at https://keeplearning.psu.edu/student-support/

Counseling & Psychological Services (CAPS)

https://studentaffairs.psu.edu/caps-contact-form
CAPS Phone: (814) 863-0395
Penn State Crisis Line 1-877-229-6400
Student Care and Advocacy: Email: StudentCare@psu.edu
Share a Concern: Share a Concern Form Phone: 814-863-2020 (voicemail)

Code of Mutual Respect and Cooperation

The Eberly College of Science (ECoS) Code of Mutual Respect and Cooperation embodies the values that we hope faculty, staff, and students possess and will endorse to make ECoS a place where every individual feels respected and valued, as well as challenged and rewarded. Please review these principles, linked here.

Syllabus

Mon, 21 Jun 2021 00:00:00 +0000

Instructor

Professor (Dr.) Ian Czekala (he/him/his)
Email: ipc5094@psu.edu or iczekala@psu.edu (alias).

Office hours by appointment (remote only).

Objectives

The interstellar medium (ISM) is the space between the stars. This space is far from empty, however, and consists of large reservoirs of atomic and molecular gas and solids in the form of “dust” that mediate or contribute to many important astrophysical processes such as star formation, planet formation, stellar feedback, and stellar death/supernovae. Considered in aggregate, these processes are important on galactic- (and even intergalactic-) scales, influencing the distribution of giant molecular clouds and stellar populations. As such, Astro 542 is at once a course about nothing and a course about everything. We will learn the astrophysical processes that govern the interactions between the ISM and its many astrophysical interfaces. Students will also learn telescope proposal writing strategies, prepare a mock ALMA proposal, and simulate the dual-anonymous distributed peer review process.

Format

ASTRO 542 meets three times a week on Monday, Wednesday, and Friday on Zoom in a remote synchronous format from 9:05am to 9:55am ET (prompt). If you are in any way feeling ill or suspect you might have been in contact with an individual infected with COVID, please stay home and seek medical care. We plan on recording and posting all lectures, and we will work with you to provide you with the course materials you need.

Masking Policy

Though this course is remote synchronous, we encourage you to get vaccinated and follow University policies on masking, especially in indoor spaces.

University policy: Penn State University requires everyone to wear a face mask in all university buildings, including classrooms, regardless of vaccination status. ALL STUDENTS MUST wear a mask appropriately (i.e., covering both your mouth and nose) while you are indoors on campus. This is to protect your health and safety as well as the health and safety of your classmates, instructor, and the university community. Anyone attending class without a mask will be asked to put one on or leave. Instructors may end class if anyone present refuses to appropriately wear a mask for the duration of class. Students who refuse to wear masks appropriately may face disciplinary action for Code of Conduct violations. If you feel you cannot wear a mask during class, please speak with your adviser immediately about your options for altering your schedule.

Textbook

This course has one required textbook:

Title: Physics of the Interstellar and Intergalactic Medium
Author: Bruce Draine
ISBN-13 978-0-691-12213-7

It is available through textbook sellers online and will be available in the PSU bookstore. We recommend that you secure access to this textbook for readings and reference during the course. The PAMS library (201 Davey Lab) also has a copy on course reserve.

Additional Reference Materials

Course materials and lecture notes

Notes on Star Formation by Mark Krumholz
Lecture notes on Radiative Transfer in Astrophysics by C.P. Dullemond.

Review Articles

Astrochemistry and Compositions of Planetary Systems by Karin Oberg and Ted Bergin, Physics Reports 2021
Stellar Multiplicity by Gaspard Duchene and Adam Kraus, ARA&A 2013
Observations of Protoplanetary Disk Structures by Sean Andrews, ARA&A 2020
Dynamics of Protoplanetary Disks by Phil Armitage, ARA&A 2011

Textbooks

Interstellar and Intergalactic Medium by Barbara Ryden and Richard W. Pogge, The Ohio State Astrophysics Series, 2021
Essential Radio Astronomy by James Condon and Scott Ransom
Physics and Chemistry of the Interstellar Medium by Sun Kwok, University Science Books, 2007
The Origin of Stars by Michael D. Smith, Imperial College Press, 2004
The Physics and Chemistry of the Interstellar Medium by A.G.G.M Tielens, Cambridge University Press, 2005
The Formation of Stars by Steven W. Stahler and Francesco Palla, Wiley-VCH, 2004
Protostars and Planets V, B. Reipurth, D. Jewitt, and K. Keil (eds.), University of Arizona Press, 2007. Chapters available here.

Assignments and Exams

Course Grade

The course grade will be based on

problem sets (20%)
a paper presentation (15%)
three mid-term exams (40%)
reviews from our mock ALMA TAC (5%)
an ALMA proposal (20%)

Problem sets

Problem sets will be assigned approximately every 3-4 weeks.

Please turn in homeworks in .PDF format to Ian by email at ipc5094@psu.edu with the subject line ASTRO 542 problem set. Handwritten solutions are fine, but please digitize your submission using a department scanner or a smartphone scanner app (e.g., Adobe Scan).

Due dates will be set to correspond to the beginning of a class period (i.e., 9:05am).

Problem set late policy

The following percentages will be deducted from your score

one day late: 5%
two days late: 10%
three days late: 25%
more than 72 hours late: no credit

If extenuating circumstances arise such that you will be unable to complete the homework on time, please contact Ian before the homework deadline and we can most likely arrange an extension.

Problem set collaboration policy

You are welcome to collaborate and work through the problem sets. However, each student must complete the final write up on their own, i.e., no problem set should be duplicated verbatim between students.

It should go without saying that students may not collaborate on exams.

Paper presentation

Each student will give one 15-minute presentation with 5 minutes of questions (equivalent to a journal club talk) on an article from the last 1-2 years. Please select an article submitted to the Astrophysical Journal, Astronomical Journal, MNRAS, or A&A that has completed the referee process (i.e., is available from the journal itself, or is labeled “accepted” on the arXiv). Chosen articles should fall under the subjects that have been covered by the intervening topics since the previous student talk. To select a paper, you can search ADS for terms (or combinations of them) appearing in titles or abstracts (e.g., extinction, molecular cloud).

Each student should provide the instructor with the two dates that they prefer by Monday, August 30th. The instructor will then assign one date to each student. You should notify the instructor of your paper selection at least one week prior to your presentation date to confirm that it is an acceptable choice.

Good talks will

Provide a cogent introduction to the subfield of the paper
Introduce the key supporting background material, including any relevant equations or seminal figures from related works
Explain the key observational or theoretical instruments/methodologies (as applicable)
Discuss the scientific findings of the paper, and their implications for the broader astrophysical subfield
Finish the presentation within the allotted time (15 +/- 2 minutes)
Adequately answer questions raised by the instructor and students

It is a wise idea to practice your presentation from start to finish “live” to make sure your timing is correct. Students whose talks are wildly under/over time will find it difficult to achieve full credit.

Midterm Exams

There will be three midterm exams throughout the semester. Each exam is designed to assess your understanding of the topics covered in the previous ~4 weeks of lectures. Each exam will be scheduled during a class period (see the Course Schedule for the precise date).

Exam makeup: If you are unable to attend class on the date of an exam, please contact the instructor before the exam date to schedule a makeup exam.

There will be no final exam (the ALMA proposal + review plays this role).

ALMA mock TAC and proposal

Throughout the course, students will learn general telescope and grant proposal writing strategies, as well as strategies specific to the ALMA observatory. We will spend a few class periods describing the capabilities of the ALMA observatory, exploring how to write an ALMA proposal, and engaging in a mock TAC panel discussion using previously submitted (successful and unsuccessful) ALMA proposals.

TAC reviews

Our class will simulate the dual-anonymous peer review process in a distributed fashion. In the actual ALMA review, for every proposal that the P.I. submits, they are sent (electronically) 10 proposals to review. Our review process will work as follows:

You will receive approximately 5 proposals related to course topics we have covered.
You are responsible for providing a written report describing (at minimum) the strengths and weaknesses of each proposal, at least two paragraphs in length. The ALMA guidelines for reviewers are available here. We also recommend reviewing the JWST cycle 1 review criteria, which are similar and provide another data point for “what makes a good review.”
You are responsible for evaluating the scientific merit of the proposal with a numerical score between 1 and 5 (1 being excellent, 5 being unsatisfactory).

The TAC reviews will be worth 5% of your final grade.

ALMA proposal

As a final course project, students will be responsible for preparing their own ALMA observing proposal using ALMA Cycle 8 capabilities and materials as a baseline (worth 20% of your final grade). Proposals will be due during the final examination period.

An overview of the ALMA proposal process is described in the ALMA Cycle 8 Proposer’s Guide. If you have any questions about preparing an ALMA proposal, answers can most likely be found in this document.

The following are some sections of the Proposer’s Guide that we recommend reading to understand this course project

The assignment will consist of preparing a Scientific Justification (Section 5.3) PDF only. You may need to install the observing tool (OT) for help with your resolution/sensitivity calculations, but your submitted assignment does not need a cover sheet, technical justification, or abstract.

We will grade your proposal following the spirit of the actual ALMA review criteria. Specifically, your Scientific Justification should explicitly include:

a title
a discussion of the overall scientific merit of the proposed investigations and their potential contribution to the advancement of scientific knowledge
a clear description of the proposed observations. I.e.,
- what targets will be observed
- at what angular resolution
- at what frequency (including continuum/spectral line)
- to what sensitivity (Jy for point sources, and Jy/beam or Jy/arcsec² for resolved sources)
a robust data analysis plan
a discussion of why the capabilities of ALMA are required to carry out this science
figures and tables supporting the proposal, as necessary
references
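For the sensitivity item in the list above, it can help to move between Jy/beam and Jy/arcsec². The following is a minimal sketch, assuming the synthesized beam is well approximated by an elliptical Gaussian with FWHM axes given in arcseconds, so that its solid angle is π θ_maj θ_min / (4 ln 2). The function names are hypothetical and are not part of the ALMA observing tool or CASA.

```python
import math


def gaussian_beam_area_arcsec2(bmaj_arcsec, bmin_arcsec):
    """Solid angle (arcsec^2) of an elliptical Gaussian beam with FWHM axes in arcsec."""
    return math.pi * bmaj_arcsec * bmin_arcsec / (4.0 * math.log(2.0))


def jy_per_beam_to_jy_per_arcsec2(sb_jy_per_beam, bmaj_arcsec, bmin_arcsec):
    """Convert a surface brightness from Jy/beam to Jy/arcsec^2 (Gaussian beam assumed)."""
    return sb_jy_per_beam / gaussian_beam_area_arcsec2(bmaj_arcsec, bmin_arcsec)


def jy_per_arcsec2_to_jy_per_beam(sb_jy_per_arcsec2, bmaj_arcsec, bmin_arcsec):
    """Inverse conversion, from Jy/arcsec^2 to Jy/beam (Gaussian beam assumed)."""
    return sb_jy_per_arcsec2 * gaussian_beam_area_arcsec2(bmaj_arcsec, bmin_arcsec)


# Illustrative numbers only: a 1 mJy/beam sensitivity with a 0.3" x 0.2" beam.
example = jy_per_beam_to_jy_per_arcsec2(1e-3, 0.3, 0.2)
print(f"{example:.4e} Jy/arcsec^2")
```

Because the beam area scales with the square of the angular resolution, the same Jy/beam sensitivity corresponds to very different surface brightness sensitivities at different resolutions, which is why the resolution and sensitivity requests should be justified together.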

Your Scientific Justification should follow all rules for page limits and fonts, especially the requirements for 4 single-spaced pages (A4 or US letter size) in 12pt font. We suggest that you allow two pages for the science case and two pages for figures, tables, and references; however, you may allocate space as best you see fit.

Your Scientific Justification should follow all guidelines for dual-anonymous peer review such that your PDF does not contain any information that could be easily used to identify the proposer.

Course Calendar and Closure Policies

For full information, see the Course Schedule.

Academic Integrity

Disability Accommodation

Counseling and Psychological Services

Counseling and Psychological Services at University Park CAPS: 814-863-0395
Counseling and Psychological Services at Commonwealth Campuses
Penn State Crisis Line (24 hours/7 days/week): 877-229-6400
Crisis Text Line (24 hours/7 days/week): Text LIONS to 741741

Educational Equity and Reporting Bias

Counseling & Psychological Services (CAPS)

https://studentaffairs.psu.edu/caps-contact-form
CAPS Phone: (814) 863-0395
Penn State Crisis Line 1-877-229-6400
Student Care and Advocacy: Email: StudentCare@psu.edu
Share a Concern: Share a Concern Form Phone: 814-863-2020 (voicemail)

Code of Mutual Respect and Cooperation

Class Schedule: Martinmas (Fall 2023)

Fri, 25 Aug 2023 00:00:00 +0000

AS5003 meets three times a week. See the Moodle for class times and meeting locations.

The following is the anticipated schedule for course lectures and activities. Lecture topics/dates may shift slightly depending on course progress.

Week	Date	Activity	Topic
1	T 12 Sep	Lecture	Intro and Overview
1	W 13 Sep	Lecture	The Fourier Transform: Analytical
1	F 15 Sep	Lecture	The Fourier Transform: Numerical
2	T 19 Sep	Tutorial	The Fourier Transform and Digital Signal Processing
2	W 20 Sep	Lecture	Interferometry in Practice
2	F 22 Sep	Lecture	Making Images: PSFs, Gridding, and Dirty Images
3	T 26 Sep	Lecture	Image Plane Deconvolution (CLEAN)
3	W 27 Sep	Lecture	Regularized Maximum Likelihood (RML)
3	F 29 Sep	Tutorial	Imaging in Practice, Concept Review
4	T 3 Oct	Peer Study

Fall 2022 Class Schedule

Tue, 21 Jun 2022 00:00:00 +0000

ASTRO 589 meets once a week on Wednesdays from 10:10am to 11:00am ET (prompt) in Davey Lab Room 538.

Please be aware that we will not be meeting the week of the Thanksgiving Holiday (Nov 21 - 25), following PSU’s Fall 2022 academic calendar.

The following is the anticipated schedule for course lectures and activities. Lecture topics/dates may shift slightly depending on course progress.

Date	Activity	Topic
Aug 24	Lecture	Intro and Overview
Aug 31	Lecture	The Fourier Transform I
Sep 7	Lecture	The Fourier Transform II
Sep 14	Lecture	The Fast Fourier Transform and Numerical Implementation
Sep 21	Lecture	Interferometry in Practice
Sep 28	Lecture	Making Images: PSFs, Gridding, and Dirty Images
Oct 5	Lecture	Bayesian Inference and Model Fitting
Oct 12	Lecture	Image Plane Deconvolution (CLEAN)
Oct 19	Lecture	Regularized Maximum Likelihood (RML) I
Oct 26	Lecture	Regularized Maximum Likelihood (RML) II
Nov 2	Project presentation	TBD
Nov 9	Project presentation	TBD
Nov 16	Project presentation	TBD
Nov 30	Project presentation	TBD
Dec 7	Lecture	Correlation in ALMA data

Fall 2021 Class Schedule

Mon, 21 Jun 2021 00:00:00 +0000

ASTRO 542 meets three times a week (9:05am to 9:55am) on Monday, Wednesday, and Friday in a remote synchronous format over Zoom. The Zoom link is provided to enrolled students via email for security purposes. If you would like to audit a course, please email the instructor.

Please be aware that we will not be meeting on Labor Day (Sep 6) or over Thanksgiving Holiday (Nov 22 - 26), following PSU’s Fall 2021 academic calendar.

The following is the anticipated schedule for course lectures and activities. Lecture topics/dates may shift slightly depending on course progress. Exam and presentation dates will remain fixed.

Date	Activity	Topic
Mon Aug 23	Lecture	Introduction and Course Overview
Wed Aug 25	Lecture	Collisional Processes
Fri Aug 27	Lecture	Statistical Mechanics and Thermodynamic Equilibrium
Mon Aug 30	Lecture	Energy Levels of Atoms, Ions, and Molecules
Wed Sep 1	Lecture	Spontaneous Emission, Stimulated Emission, and Absorption
Fri Sep 3	Lecture	Radiative Transfer
Wed Sep 8	Lecture	H I 21-cm Emission and Absorption
Fri Sep 10	Lecture	Absorption Lines and Curve of Growth
Mon Sep 13	Lecture	Emission and Absorption by a Thermal Plasma
Wed Sep 15	Lecture	Propagation of Radio Waves through the ISM
Fri Sep 17	Presentation by Laura Duffy (#1)	H I 21 cm mapping of the host galaxy of AT2018cow: a fast-evolving luminous transient within a ring of high column density gas
Mon Sep 20	Lecture	Introduction to Radio Astronomy and ALMA
Wed Sep 22	Exam 1	Lectures up to Sep 13 (incl)
Fri Sep 24	Lecture	Interstellar Radiation Fields, Ionization Processes, and Recombination of Ions with Electrons
Mon Sep 27	Lecture	Photoionized Gas (HII regions)
Wed Sep 29	Lecture	Ionization in Predominantly Neutral Regions (cool + warm HI regions)
Fri Oct 1	Lecture	Collisional Excitation
Mon Oct 4	Lecture	Nebular Diagnostics
Wed Oct 6	Lecture	Radiative Trapping
Fri Oct 8	Lecture	Interstellar Dust
Mon Oct 11	Lecture	Scattering and Absorption by Small Particles
Wed Oct 13	Presentation by Malinda Baer (#2)	Tracing PAH Size in Prominent Nearby Mid-Infrared Environments
Fri Oct 15	Lecture	Telescope Proposals and Time Allocation Committees
Mon Oct 18	Lecture	Dust Composition
Wed Oct 20	Exam 2	Lectures up to Oct 11 (inclusive)
Mon Oct 25	Lecture	Grain Temperatures, Physics, and Dynamics
Wed Oct 27	Lecture	Heating and Cooling of HII regions
Fri Oct 29	Lecture	HI Clouds: Observations, Heating and Cooling
Mon Nov 1	Lecture	Molecular Hydrogen
Wed Nov 3	Mock TAC discussion pt 1
Fri Nov 5	Mock TAC discussion pt 2
Mon Nov 8	Lecture	Molecular Clouds, Observations, Chemistry, and Ionization
Wed Nov 10	Lecture	Fluids and Shocks
Fri Nov 12	Lecture	Supernova Remnants and the Hot Ionized Medium
Mon Nov 15	Lecture	Gravitational Collapse and Star Formation: Theory
Wed Nov 17	Lecture	Star Formation: Observations
Fri Nov 19	Lecture	Protostars to PPDs
Mon Nov 29	Lecture	Circumstellar Disks
Wed Dec 1	Exam 3	Lectures up to TBD
Fri Dec 3	Lecture	Exoplanets and Planet Formation
Mon Dec 6	Presentation by Megan Delamer (#3)	TBD
Wed Dec 8	Lecture	Astrochemistry
Fri Dec 10	Workshop	ALMA proposals

Regularized Maximum Likelihood (RML)

Wed, 27 Sep 2023 00:00:00 +0000

References for today

MPoL introduction
Data Analysis: a Bayesian Tutorial by Sivia and Skilling
Data Analysis Recipes: Fitting a Model to Data, by Hogg et al.
Data analysis recipes: Probability calculus for inference by Hogg
Machine Learning: A Probabilistic Perspective by Murphy, Chapter 10
Pattern Recognition and Machine Learning by Bishop, Chapter 8
Probabilistic Graphical Models by Sucar, especially Chapters 7, 8

Last time

general interferometer with north-south and east-west baselines
arrays with multiple antennas and Earth aperture synthesis
point spread functions (PSFs) and their relationship to array configuration

Today’s lecture

Now that we’ve covered how interferometers work to observe a source and the Fourier transform theory behind that, we’re going to focus on the data products (called “visibilities”) and Bayesian inference techniques for analyzing the data in its natural space. In subsequent lectures we will talk about how we might use the visibilities and Fourier inversion techniques to synthesize images of the source, but this lecture occupies an important intermediate (and foundational) step, where we are treating the visibilities as the “raw” data product and thinking about how we bring our analysis techniques into that space.

The topics we will cover include:

Bayesian inference
Forward-modeling with a generative model
Complex-valued noise and measurement (weights)
Statistical weight and relationship to point source uncertainty
Forward-modeling visibility data
Missing spatial frequencies (model constraints)

Probability calculus and Bayesian Inference

A good reference for this section is Data analysis recipes: Probability calculus for inference by Hogg

Generally, we write probability distributions like $p(a)$. The probability distribution is a function describing the probability of the variable $a$ having some value.

Probability functions are normalized. $$ \int_{-\infty}^\infty p(a)\,\mathrm{d}a = 1 $$

Say that $a$ represents the height of an individual from the US population, measured in meters.

TODO: draw a bell curve

Probability functions have units. In this case, $p(a)$ has units of $a^{-1}$, or $\mathrm{m}^{-1}$.

If we selected an individual from the US population and wanted to know the probability that their height was between 1.7 and 1.9 meters, we could integrate over this range

$$ \int_{1.7\,\mathrm{m}}^{1.9\,\mathrm{m}} p(a)\,\mathrm{d}a $$
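As a rough numerical sketch of this integral, suppose (purely for illustration) that heights follow a Gaussian with a mean of 1.70 m and a standard deviation of 0.10 m. These numbers and the function names are assumptions, not survey values. The integral is then a difference of two Gaussian cumulative distribution functions:

```python
import math

# Assumed (illustrative) parameters of the height distribution, in meters.
MU_HEIGHT = 1.70
SIGMA_HEIGHT = 0.10


def gaussian_cdf(x, mu, sigma):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def prob_height_between(lo, hi, mu=MU_HEIGHT, sigma=SIGMA_HEIGHT):
    """Integrate p(a) from lo to hi; the result is a dimensionless probability."""
    return gaussian_cdf(hi, mu, sigma) - gaussian_cdf(lo, mu, sigma)


p = prob_height_between(1.7, 1.9)
print(f"P(1.7 m < a < 1.9 m) = {p:.4f}")
```

Note that $p(a)$ has units of $\mathrm{m}^{-1}$ and $\mathrm{d}a$ has units of m, so the integrated probability is dimensionless.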

Conditional probabilities

It’s very common that we will talk about multiple parameters at the same time. For example, we can continue our example and let $b$ be the age of an individual drawn from the US population. We can talk about the probability of an individual drawn from the US population having a certain height given that we know they are 20 years old

$$ p(a | b = 20\,\mathrm{yr}). $$

What are the units of this probability distribution? They are the same as before, $a^{-1}$, or $\mathrm{m}^{-1}$, because this probability distribution must obey the same normalization

$$ \int_{-\infty}^\infty p(a | b = 20\,\mathrm{yr}) \,\mathrm{d}a = 1 $$

You can’t do the integral $$ \int p(a | b ) \,\mathrm{d}b $$ because the result would have units of $a^{-1}b$, which is nonsensical.

Factorizing probabilities

Now let’s consider the probability distribution $$ p(a,b). $$ This is a two-dimensional probability distribution. It has units of $(ab)^{-1}$, or in our previous example $\mathrm{m}^{-1}\, \mathrm{yr}^{-1}$. You can read this as the probability of an individual having height $a$ and age $b$, i.e., this is a joint probability distribution.

The same normalization rules apply, only now these need to be done over two dimensions.

You can take any joint probability distribution and factor it into conditional distributions. So, we could write $p(a,b)$ in two different ways $$ p(a,b) = p(a) p(b | a) $$ or $$ p(a,b) = p(a|b) p(b). $$ So, in words, we can say that the probability of $a$ and $b$ (the left hand side) is equal to the probability of $a$ times the probability of $b$ given $a$.

As I said, this factorization can apply to any joint probability distribution.

Side note that if $$ p(b | a) = p(b) $$ we would say that $a$ and $b$ are independent variables, and so we would write $$ p(a, b) = p(a) p(b), $$ but this isn’t true in the general case, only if the variables are independent.

We can put the two factorization equations together to arrive at another relationship $$ p(a | b) = \frac{p(b|a) p(a)}{p(b)} $$ which is called Bayes’s theorem.

Marginalization

One amazing thing you can do with probability distributions is marginalization. Say that I told you the joint distribution $p(a, b)$. As we discussed, this distribution has units of $(ab)^{-1}$ or $\mathrm{m}^{-1}\, \mathrm{yr}^{-1}$. But let’s say you only cared about $p(a)$, the distribution of heights. We can marginalize away the variable we don’t want by integrating over it $$ p(a) = \int p(a, b)\,\mathrm{d}b. $$

As we’ll see in a moment, this has huge implications when it comes time to do inference.
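To make factorization and marginalization concrete, here is a small numerical sketch. The joint density is a made-up toy (a Gaussian age distribution, and a Gaussian height distribution whose mean drifts slightly with age). None of these numbers come from real population data. The joint is built as $p(a,b) = p(a|b)\,p(b)$ and then marginalized over age with a simple Riemann sum:

```python
import math


def normal_pdf(x, mu, sigma):
    """Normal probability density, with units of 1 / (units of x)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (math.sqrt(2.0 * math.pi) * sigma)


# Hypothetical toy distributions: age b in years, height a in meters.
def p_b(b):
    return normal_pdf(b, 40.0, 10.0)  # units: 1/yr


def p_a_given_b(a, b):
    mean_height = 1.60 + 0.002 * b  # toy dependence of height on age
    return normal_pdf(a, mean_height, 0.10)  # units: 1/m


def p_joint(a, b):
    """Factorized joint density p(a, b) = p(a | b) p(b); units: 1/(m yr)."""
    return p_a_given_b(a, b) * p_b(b)


# Grids wide enough to contain essentially all of the probability.
DB = 0.5
b_grid = [-20.0 + DB * j for j in range(241)]  # -20 to 100 yr
DA = 0.01
a_grid = [1.0 + DA * i for i in range(141)]  # 1.0 to 2.4 m


def p_a(a):
    """Marginal p(a): integrate the joint over b (not over a)."""
    return sum(p_joint(a, b) for b in b_grid) * DB


# The marginal should itself be normalized when integrated over a.
norm = sum(p_a(a) for a in a_grid) * DA
print(f"integral of p(a) da = {norm:.6f}")
```

The marginal keeps the units of $\mathrm{m}^{-1}$ because the integration over $b$ cancels the $\mathrm{yr}^{-1}$ in the joint density.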

Likelihood functions

Now, let’s revisit Bayes’s rule and rewrite it like $$ p(\mathrm{hypothesis} | \mathrm{data}) \propto p(\mathrm{data} | \mathrm{hypothesis}) \times p(\mathrm{hypothesis}). $$ We’re omitting a constant of proportionality, commonly called the Bayesian evidence.

The term on the left hand side is called a posterior distribution and is really a wonderful thing to report at the end of your analysis. Say you collected many years of measurements on the positions (orbits) of Jupiter, Saturn, and their moons, and then used those measurements to infer the mass of Saturn (as Laplace famously did). The posterior you would be most interested in would be a 1D distribution of the probability of the mass of Saturn and would represent your degree of belief that Saturn truly had that particular mass. This would be conditional on all of the measurements you made.

The term on the very right hand side is a prior probability distribution and expresses your belief about the mass of Saturn in the absence of data. For example, we might rightly say that the mass needed to be greater than zero, and less than the mass of the Sun. A simple prior would then ascribe equal probabilities to all values in between (or perhaps equal probabilities to the logarithm of the mass of Saturn).

The remaining term, $p(\mathrm{data} | \mathrm{hypothesis})$, is called a likelihood function, and it is really where the rubber meets the road in most statistical analyses. Simply put, the likelihood function is how probable the observed data is for a given setting of the hypothesis. For example, what is the probability of obtaining the observed positional measurements of Jupiter, Saturn, and their moons if the mass of Saturn were $5\times 10^{26}\,\mathrm{kg}$?

A quick note that likelihood functions show up in frequentist analysis all the time, too. However, the interpretations of probability are different.

Fitting a line

Let’s dive into a quick example with some real data to make these concepts clearer.

Typically, when astronomers fit a model to some dataset, such as a line $y = m x + b$ to a collection of $\boldsymbol{X} = \{x_1, x_2, \ldots, x_N\}$ and $\boldsymbol{Y} = \{y_1, y_2, \ldots, y_N\}$ points, we require a likelihood function. Simply put, the likelihood function specifies the probability of the data, given a model, and encapsulates our assumptions about the data and noise generating processes.

TODO: draw a bunch of points for putting a line through.

For most real-world datasets, we don’t measure the “true” $y$ value of the line (i.e., $mx + b$), but rather make a measurement which has been partially corrupted by some “noise.” In that case, we say that each $y_i$ data point is actually generated by

$$ y_i = m x_i + b + \epsilon $$

where $\epsilon$ is a noise realization from a normal distribution with zero mean and standard deviation $\sigma$, i.e.,

$$ \epsilon \sim \mathcal{N}(0, \sigma). $$

This information about the data and noise generating process means that we can write down a likelihood function to calculate the probability that we observe the data that we do, given a set of model parameters. The likelihood function is $p(\boldsymbol{Y} |\boldsymbol{\theta})$. Sometimes it is written as $\mathcal{L}(\boldsymbol{Y} |\boldsymbol{\theta})$, and frequently, when employed in computation, we’ll use the logarithm of the likelihood function, or “log-likelihood,” $\ln \mathcal{L}$ to avoid numerical under/overflow issues.

Let’s call $\boldsymbol{\theta} = \{m, b\}$ and $M(x_i |\, \boldsymbol{\theta}) = m x_i + b$. This is a very simple example here, but we would still call $M$ a forward or generative model. By that, we mean our model is sophisticated enough that we can use it (and some noise model) to fully replicate the dataset, or alternative sets of data indistinguishable from the measured data.

The probability of observing each datum is a Gaussian (normal distribution) centered on the model value, evaluated at the $y_i$ value. So, the full likelihood function for this line problem is just the multiplication of all of these probability distributions

$$ \mathcal{L}(\boldsymbol{Y} |\boldsymbol{\theta}) = \prod_i^N \frac{1}{\sqrt{2 \pi} \sigma} \exp \left [ - \frac{(y_i - M(x_i |\boldsymbol{\theta}))^2}{2 \sigma^2}\right ]. $$

The logarithm of the likelihood function is

$$ \ln \mathcal{L}(\boldsymbol{Y} |\,\boldsymbol{\theta}) = -N \ln(\sqrt{2 \pi} \sigma) - \frac{1}{2} \sum_i^N \frac{(y_i - M(x_i |\,\boldsymbol{\theta}))^2}{\sigma^2}. $$

You may recognize the right hand term looks similar to the $\chi^2$ metric,

$$ \chi^2(\boldsymbol{Y} |\boldsymbol{\theta}) = \sum_i^N \frac{(y_i - M(x_i |\boldsymbol{\theta}))^2}{\sigma^2} $$

Assuming that the uncertainty ($\sigma$) on each data point is known (and remains constant), the first term in the log likelihood expression remains constant, and we have

$$ \ln \mathcal{L}(\boldsymbol{Y} |\boldsymbol{\theta}) = - \frac{1}{2} \chi^2 (\boldsymbol{Y} |\boldsymbol{\theta}) + C $$

where $C$ is a constant with respect to the model parameters. It is common to use shorthand to say that “the likelihood function is $\chi^2$” to indicate situations where the data uncertainties are Gaussian. Very often, we (or others) are interested in the parameter values $\boldsymbol{\theta}_\mathrm{MLE}$ which maximize the likelihood function. Unsurprisingly, these parameters are called the maximum likelihood estimate (or MLE), and usually they represent something like a “best-fit” model.

When it comes time to do parameter inference, however, it’s important to keep in mind:

the simplifying assumptions we made about the noise uncertainties being constant with respect to the model parameters. If we were to “fit for the noise” in a hierarchical model, for example, we would need to use the full form of the log-likelihood function, including the $-N \ln \left (\sqrt{2 \pi} \sigma \right)$ term.
that in order to maximize the likelihood function we want to minimize the $\chi^2$ function.
that constants of proportionality (e.g., the 1/2 in front of the $\chi^2$) can matter when combining likelihood functions with prior distributions for Bayesian parameter inference.

To be specific, $\chi^2$ is not the end of the story when we’d like to perform Bayesian parameter inference. To do so, we need the posterior probability distribution of the model parameters given the dataset, $p(\boldsymbol{\theta}|\,\boldsymbol{Y})$.
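The line-fitting likelihood above can be sketched in a few lines of Python. The data here are simulated with made-up "true" parameters, and the function names are illustrative. The closed-form least-squares solution is used as the maximum likelihood estimate, which is valid only because $\sigma$ is assumed known and constant:

```python
import math
import random


def line_model(x, m, b):
    """Forward model M(x | theta) for a line."""
    return m * x + b


def log_likelihood(theta, xs, ys, sigma):
    """Full Gaussian log-likelihood, including the -N ln(sqrt(2 pi) sigma) term."""
    m, b = theta
    chi2 = sum((y - line_model(x, m, b)) ** 2 for x, y in zip(xs, ys)) / sigma**2
    return -len(xs) * math.log(math.sqrt(2.0 * math.pi) * sigma) - 0.5 * chi2


def mle_line(xs, ys):
    """Closed-form least-squares (maximum likelihood) slope and intercept."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    m = sxy / sxx
    return m, ybar - m * xbar


# Simulate a dataset with the generative model y_i = m x_i + b + epsilon.
rng = random.Random(42)
M_TRUE, B_TRUE, SIGMA = 2.0, -1.0, 0.5  # assumed values for the demo
xs = [0.1 * i for i in range(50)]
ys = [line_model(x, M_TRUE, B_TRUE) + rng.gauss(0.0, SIGMA) for x in xs]

theta_mle = mle_line(xs, ys)
print("MLE (m, b):", theta_mle)
print("ln L at MLE:  ", log_likelihood(theta_mle, xs, ys, SIGMA))
print("ln L at truth:", log_likelihood((M_TRUE, B_TRUE), xs, ys, SIGMA))
```

The log-likelihood at the MLE is never lower than at the true parameters, which is a reminder that "best fit" and "truth" are different things.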

Visibility Data

Now that we have reviewed likelihood functions, let’s turn back to radio astronomy and go into further detail about how the visibility function is sampled. Recall that the visibility function is the Fourier transform of the sky brightness distribution, $\mathcal{V} \leftrightharpoons I$.

The visibility function is complex-valued, and each measurement of it (denoted by $V_i$) is made in the presence of noise

$$ V_i = \mathcal{V}(u_i, v_i) + \epsilon. $$

Here $\epsilon$ represents a noise realization from a complex normal (Gaussian) distribution. Thankfully, most interferometric datasets do not exhibit significant covariance between the real and imaginary noise components and the distributions of the values are similar, so we could equivalently say that the real and imaginary components of the noise are separately generated by draws from normal distributions characterized by standard deviation $\sigma$

$$ \epsilon_\Re \sim \mathcal{N}(0, \sigma) $$

$$ \epsilon_\Im \sim \mathcal{N}(0, \sigma) $$

where $\sigma$ is a real-valued quantity. If the units of the visibility function are Janskys, then the units of $\sigma$ are also Janskys.

The full complex noise-draw is given by $$ \epsilon = \epsilon_\Re + i \epsilon_\Im. $$

Radio interferometers will commonly represent the uncertainty on each visibility measurement by a “weight” $w_i$, where

$$ w_i = \frac{1}{\sigma_i^2}. $$

Like $\sigma$, the weight itself is a real quantity, in this case having units of $1/\mathrm{Jy}^2$.
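A minimal sketch of this noise model, assuming an illustrative per-visibility uncertainty of 0.02 Jy (the value and the helper names are made up for this example):

```python
import random


def draw_complex_noise(sigma, rng):
    """One complex noise draw: independent normal real and imaginary parts."""
    return complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))


def weight_from_sigma(sigma):
    """Statistical weight w = 1 / sigma^2, in 1/Jy^2 if sigma is in Jy."""
    return 1.0 / sigma**2


def sigma_from_weight(weight):
    """Invert the weight back to a standard deviation."""
    return weight**-0.5


rng = random.Random(0)
SIGMA_JY = 0.02  # assumed uncertainty on the real and imaginary parts

noise = [draw_complex_noise(SIGMA_JY, rng) for _ in range(20000)]
rms_real = (sum(e.real**2 for e in noise) / len(noise)) ** 0.5
rms_imag = (sum(e.imag**2 for e in noise) / len(noise)) ** 0.5
w = weight_from_sigma(SIGMA_JY)

print(f"rms of real part: {rms_real:.4f} Jy, imaginary part: {rms_imag:.4f} Jy")
print(f"weight: {w:.1f} Jy^-2")
```

Both the real and imaginary scatter recover the input $\sigma$, since a single real-valued weight describes both components.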

A full interferometric dataset is a collection of visibility measurements, which we represent by

$$ \boldsymbol{V} = \{V_1, V_2, \ldots V_N\}_{i=1}^N $$

each one having a corresponding $u_i, v_i$ coordinate. For reference, a typical ALMA dataset might contain a half-million individual visibility samples, acquired over a range of spatial frequencies.

Likelihood functions for Fourier data

Now that we’ve introduced likelihood functions in general and the specifics of Fourier data, let’s talk about likelihood functions for inference with Fourier data. As before, our statement about the data generating process

$$ V_i = \mathcal{V}(u_i, v_i) + \epsilon $$

leads us to the formulation of the likelihood function.

First, let’s assume we have some model that we’d like to fit to our dataset. To be a forward model, it should be able to predict the value of the visibility function for any spatial frequency, i.e., we need to be able to calculate

$$ \mathcal{V}(u, v) = M_\mathcal{V}(u, v |\, \boldsymbol{\theta}). $$

Following the discussion about how the complex noise realization $\epsilon$ is generated, this leads to a log likelihood function

$$ \ln \mathcal{L}(\boldsymbol{V}|\,\boldsymbol{\theta}) = - \frac{1}{2} \chi^2(\boldsymbol{V}|\,\boldsymbol{\theta}) + C $$

Because the data and model are complex-valued, $\chi^2$ is evaluated as

$$ \chi^2(\boldsymbol{V}|\,\boldsymbol{\theta}) = \sum_i^N \frac{|V_i - M_\mathcal{V}(u_i, v_i |\,\boldsymbol{\theta})|^2}{\sigma_i^2} $$

where $|\cdot|$ denotes the complex modulus. Equivalently, the calculation can be broken up into sums over the real and imaginary components of the visibility data and model

$$ \chi^2(\boldsymbol{V}|\,\boldsymbol{\theta}) = \sum_i^N \frac{(V_{\Re,i} - M_{\mathcal{V},\Re}(u_i, v_i |\,\boldsymbol{\theta}))^2}{\sigma_i^2} + \sum_i^N \frac{(V_{\Im,i} - M_{\mathcal{V},\Im}(u_i, v_i |\,\boldsymbol{\theta}))^2}{\sigma_i^2}. $$
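The two forms of $\chi^2$ can be checked against each other with a short sketch. The data, model values, and weights below are made-up numbers, and the function names are illustrative:

```python
def chi_squared(data, model, weights):
    """chi^2 = sum_i w_i |V_i - M_i|^2 for complex data and model values."""
    return sum(w * abs(v - m) ** 2 for v, m, w in zip(data, model, weights))


def chi_squared_split(data, model, weights):
    """Same quantity, summed separately over real and imaginary residuals."""
    real_part = sum(w * (v.real - m.real) ** 2 for v, m, w in zip(data, model, weights))
    imag_part = sum(w * (v.imag - m.imag) ** 2 for v, m, w in zip(data, model, weights))
    return real_part + imag_part


def log_likelihood_vis(data, model, weights):
    """Visibility log-likelihood, up to an additive constant."""
    return -0.5 * chi_squared(data, model, weights)


# Toy visibilities (Jy), model predictions (Jy), and weights (1/Jy^2).
data = [1.0 + 0.5j, 0.8 - 0.2j, 0.3 + 0.1j]
model = [1.1 + 0.4j, 0.7 - 0.1j, 0.2 + 0.2j]
weights = [100.0, 50.0, 25.0]

chi2 = chi_squared(data, model, weights)
chi2_split = chi_squared_split(data, model, weights)
print(chi2, chi2_split, log_likelihood_vis(data, model, weights))
```

Here the weights $w_i = 1/\sigma_i^2$ stand in for the division by $\sigma_i^2$ in the equations above.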

Because images of the sky are real-valued, the real part of the visibility function must always be even and the imaginary part odd. In other words, the visibility function is Hermitian. This means that

$$ \mathcal{V}(u, v) = \mathcal{V}^{*}(-u, -v). $$

So, if you make a measurement of $\mathcal{V}(u, v)$, you have effectively also measured $\mathcal{V}(-u, -v) = \mathcal{V}^{*}(u, v)$. If you are doing forward-modeling of the visibilities as we just described, you only need to use one of the Hermitian pairs, otherwise you will double count your measurements (this only turns out to be a scale factor in the likelihood for most analyses, so it’s technically OK). If you are gridding the visibilities to then image them, however, you will certainly want to include the Hermitian pairs. Otherwise, your image will not turn out to be real!
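As a quick numerical check of the Hermitian property, the sketch below evaluates a direct Fourier sum of a small, made-up real image at $(u, v)$ and at $(-u, -v)$. The image values, the pixel-based coordinate units, and the sign convention in the exponent are all assumptions made for this toy example:

```python
import cmath


def visibility(image, u, v):
    """Direct Fourier sum of a real image on integer pixel coordinates (l, m).

    u and v are in cycles per pixel for this toy example.
    """
    total = 0j
    for l, row in enumerate(image):
        for m, intensity in enumerate(row):
            total += intensity * cmath.exp(-2j * cmath.pi * (u * l + v * m))
    return total


# A small real-valued "sky" with arbitrary values.
image = [
    [0.0, 1.0, 0.5],
    [2.0, 0.0, 0.3],
    [0.1, 0.7, 0.0],
]

u, v = 0.13, -0.27
vis = visibility(image, u, v)
vis_pair = visibility(image, -u, -v)

print("V(u, v)          =", vis)
print("conj(V(-u, -v))  =", vis_pair.conjugate())
```

The two printed values agree to floating-point precision, so the conjugate point carries no independent information.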

Point source sensitivity

We’ll use our forward modeling formalism to fit for the flux of a point source.
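One way this calculation might go is sketched below, under the simplifying assumption (made here for illustration) that the point source sits at the phase center, so that the visibility model is the real constant $M_\mathcal{V}(u, v |\, F) = F$, where $F$ is the flux in Jy. Using $w_i = 1/\sigma_i^2$,

$$ \chi^2(\boldsymbol{V} |\, F) = \sum_i^N w_i |V_i - F|^2 = \sum_i^N w_i \left[ (V_{\Re,i} - F)^2 + V_{\Im,i}^2 \right]. $$

Setting $\partial \chi^2 / \partial F = 0$ gives the maximum likelihood flux as the weighted mean of the real parts of the visibilities

$$ \hat{F} = \frac{\sum_i^N w_i V_{\Re,i}}{\sum_i^N w_i}. $$

Because each $V_{\Re,i}$ carries independent Gaussian noise with variance $1/w_i$, propagating the uncertainties through this weighted mean yields

$$ \sigma_{\hat{F}}^2 = \frac{\sum_i^N w_i^2 \, (1/w_i)}{\left(\sum_i^N w_i\right)^2} = \frac{1}{\sum_i^N w_i}. $$

If every visibility has the same uncertainty $\sigma$, this reduces to $\sigma_{\hat{F}} = \sigma/\sqrt{N}$, so under these assumptions the point source sensitivity is set by the sum of the statistical weights.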

See also Casassus and Cárcamo 2022, appendix A4.

More complex visibility models

It’s difficult to reason about all but the simplest models directly in the Fourier plane, so usually models are constructed in the image plane $M_I(l,m |\,\boldsymbol{\theta})$ and then Fourier transformed (either analytically, or via the FFT) to construct visibility models

$$ M_\mathcal{V}(u, v |\, \boldsymbol{\theta}) \leftrightharpoons M_I(l,m |\,\boldsymbol{\theta}) $$

For marginally resolved sources, it’s common to fit simple models like a 2D Gaussian. We can write down an image plane model and then calculate its Fourier transform analytically.
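As a sketch of what such an analytic model might look like, the function below implements the simplest case of a circular Gaussian (a single width, rather than separate major and minor axes). It assumes the Fourier convention $\mathcal{V}(u,v) = \iint I(l,m)\, e^{-2\pi i (ul + vm)}\,\mathrm{d}l\,\mathrm{d}m$. The function name, parameter names, and example values are illustrative:

```python
import cmath
import math

ARCSEC = math.pi / (180.0 * 3600.0)  # radians per arcsecond


def gaussian_visibility(u, v, flux, sigma, l0=0.0, m0=0.0):
    """Analytic visibility of a circular Gaussian source.

    Image-plane model:
        I(l, m) = flux / (2 pi sigma^2) * exp(-((l - l0)^2 + (m - m0)^2) / (2 sigma^2))
    u, v are in wavelengths; sigma, l0, m0 are in radians; flux is in Jy.
    """
    # A Gaussian in the image plane transforms to a Gaussian in the Fourier plane.
    amplitude = flux * math.exp(-2.0 * math.pi**2 * sigma**2 * (u**2 + v**2))
    # An offset from the phase center appears as a phase gradient (shift theorem).
    phase = cmath.exp(-2j * math.pi * (u * l0 + v * m0))
    return amplitude * phase


# Example: a 1 Jy source with sigma = 0.5 arcsec, offset 0.1 arcsec in l.
theta = dict(flux=1.0, sigma=0.5 * ARCSEC, l0=0.1 * ARCSEC, m0=0.0)
for baseline in (0.0, 5.0e4, 1.0e5, 2.0e5):  # u in wavelengths, v = 0
    vis = gaussian_visibility(baseline, 0.0, **theta)
    print(f"u = {baseline:9.1f} lambda: |V| = {abs(vis):.4f} Jy")
```

The amplitude falls off on longer baselines as the source becomes resolved, and at $u = v = 0$ the visibility equals the total flux.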

But, these could be more complicated models. For example, these models could be channel maps of carbon monoxide emission from a rotating protoplanetary disk (as in Czekala et al. 2015, where $\boldsymbol{\theta}$ contains parameters setting the structure of the disk), or rings of continuum emission from a protoplanetary disk (as in Guzmán et al. 2018, where $\boldsymbol{\theta}$ contains parameters setting the sizes and locations of the rings).

With the likelihood function specified, we can add prior probability distributions $p(\boldsymbol{\theta})$, and calculate and explore the posterior probability distribution of the model parameters using algorithms like Markov Chain Monte Carlo. In this type of Bayesian inference, we’re usually using forward models constructed with a small to medium number of parameters (e.g., 10 - 30), like in the protoplanetary disk examples of Czekala et al. 2015 or Guzmán et al. 2018.

All of these types of models would be called parametric models, because we can represent the model using a finite set of parameters. E.g., for the Gaussian model, we have width in the major and minor axes, rotation angle, and 2D position. So these parameters fully represent the model. One thing you need to be concerned about when doing inference with parametric models is whether you have the right model! If your source is actually a ring instead of a Gaussian, the posterior distribution of your parameters can be rendered meaningless.

Forward modeling with (simple) parametric models can be very useful for understanding

Discussion about model mis-specification and unsampled visibilities (model constraints)

Could use u / v undersampling to constrain width or size, say.

References

MPoL introduction
Machine Learning: A Probabilistic Perspective by Murphy, Chapter 10
Pattern Recognition and Machine Learning by Bishop, Chapter 8
The fourth paper in the 2019 Event Horizon Telescope Collaboration series describing the imaging principles
Maximum entropy image restoration in astronomy, ARA&A, by Narayan and Nityananda 1986
Multi-GPU maximum entropy image synthesis for radio astronomy by Cárcamo et al. 2018
Regularized Maximum Likelihood Techniques for ALMA Observations by Zawadzki, Czekala, et al.

Last time

Discussed $u,v$ coverage and sampling (weights)
Introduced the “dirty image” as the inverse Fourier transform of the visibility samples
Introduced the CLEAN image deconvolution procedure

This time

Review parametric vs. non-parametric models
Introduce Regularized Maximum Likelihood (RML) imaging

Parametric vs. Non-Parametric Models

Recall our discussion on Bayesian inference and what it means to forward-model a dataset and how to calculate a likelihood function.

Say we have a model that we are fitting to some data. In “Machine Learning: A Probabilistic Perspective,” Murphy defines a parametric model as one that has a fixed number of parameters, and a non-parametric one as one where the number of parameters grows with the size of the data.

Something like the line we discussed in our Bayesian Modeling lecture is a parametric model. It has two parameters, a slope and an intercept. If we were observing a source and we wanted to fit a 2D Gaussian to the visibility function, that visibility model would also be a parametric model. Its parameters would be the width of the Gaussian, the position of the source, and the amplitude of the source.

TODO: draw example of Gaussian function and label parameters

If you’ve ever fit a spline to a bunch of data, then you’ve used a non-parametric model. A Gaussian process would also be a non-parametric model. In these models there are in fact parameters (like the number or type of splines/GP kernels), but these are usually nuisances to the problem. You wouldn’t necessarily fit a spline model to determine the exact number of spline position parameters, but you are interested in the approximation to $f(x)$ that the spline has enabled.

TODO: draw points and a spline or GP drawn through them

I would consider a CLEAN model to be a type of non-parametric model too. Through the CLEANing process, you create a model of the source emission from a collection of $\delta$ functions. Each of these $\delta$ functions technically has parameters, but those are mostly nuisance parameters in pursuit of their aggregate representation of the model. In general, non-parametric models have the ability to be more expressive than parametric models, but sometimes at the expense of interpretability.

RML images as non-parametric models

Now, let me introduce a set of techniques that have been grouped under the banner “Regularized Maximum Likelihood Imaging” or RML Imaging for short. With RML imaging, we’re trying to come up with a model that will fit the dataset. But rather than using a parametric model like a protoplanetary disk structure model or a series of Gaussian rings, we’re using a non-parametric model of the image itself. This could be as simple as parameterizing the image using the intensity values of the pixels themselves, i.e.,

$$ \boldsymbol{\theta} = \{I_1, I_2, \ldots, I_{N^2} \} $$

assuming we have an $N \times N$ image.

A flexible image model for RML imaging is mostly analogous to using a spline or Gaussian process to fit a series of $\boldsymbol{X} = \{x_1, x_2, \ldots, x_N\}$ and $\boldsymbol{Y} = \{y_1, y_2, \ldots, y_N\}$ points; the model will nearly always have enough flexibility to capture the structure that exists in the dataset. The most straightforward formulation of a non-parametric image model is the pixel basis set, but we could also use more sophisticated basis sets like a set of wavelet coefficients, or even more exotic basis sets constructed from trained neural networks. These may have some serious advantages when it comes to the “regularizing” part of “regularized maximum likelihood” imaging. But first, let’s talk about the “maximum likelihood” part.

Given some image parameterization (e.g., a pixel basis set of $N \times N$ pixels, with each pixel cell_size in width), we would like to find the maximum likelihood image $\boldsymbol{\theta}_\mathrm{MLE}$. Fortunately, because the Fourier transform is a linear operation, we can analytically calculate the maximum likelihood solution (the same way we might find the best-fit slope and intercept for the line example). This maximum likelihood solution is called (in the radio astronomy world) the dirty image, and its associated point spread function is called the dirty beam.

In the construction of the dirty image, all unsampled spatial frequencies are set to zero power. This means that the dirty image will only contain spatial frequencies about which we have at least some data. This assumption, however, rarely translates into good image fidelity, especially if there are many unsampled spatial frequencies which carry significant power. It’s also important to recognize that the dirty image is only one out of a set of many images that could maximize the likelihood function. From the perspective of the likelihood calculation, we could modify the unsampled spatial frequencies of the dirty image to whatever power we might like, and, because they are unsampled, the value of the likelihood calculation won’t change, i.e., it will still remain maximal.

When synthesis imaging is described as an “ill-posed inverse problem,” this is what is meant. There is a (potentially infinite) range of images that could exactly fit the dataset, and without additional information we have no way of discriminating which is best. As you might suspect, this is now where the “regularization” part of “regularized maximum likelihood” imaging comes in.
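The degeneracy described above can be seen in a toy one-dimensional example. The sketch below uses a made-up 8-pixel "image", a direct discrete Fourier sum as the forward model, and noiseless "data" at only three spatial frequencies. All of these choices are assumptions for illustration, not a real imaging setup. Adding power at an unsampled frequency changes the image but not $\chi^2$:

```python
import cmath
import math

N_PIX = 8
SAMPLED = [0, 1, 2]  # the only spatial frequencies (DFT indices) with data
SIGMA = 0.1  # assumed noise level used in chi^2


def dft_at(image, k):
    """Discrete Fourier component of a 1D image at integer frequency k."""
    n_pix = len(image)
    return sum(val * cmath.exp(-2j * cmath.pi * k * n / n_pix) for n, val in enumerate(image))


def chi_squared(image, data, sigma):
    """chi^2 evaluated only at the sampled spatial frequencies."""
    return sum(abs(data[k] - dft_at(image, k)) ** 2 for k in SAMPLED) / sigma**2


true_image = [0.0, 0.2, 1.0, 0.6, 0.1, 0.0, 0.0, 0.0]
data = {k: dft_at(true_image, k) for k in SAMPLED}  # noiseless for clarity

# Add a ripple at frequency k = 3, which no measurement constrains.
alt_image = [
    val + 0.3 * math.cos(2.0 * math.pi * 3 * n / N_PIX)
    for n, val in enumerate(true_image)
]

chi2_true = chi_squared(true_image, data, SIGMA)
chi2_alt = chi_squared(alt_image, data, SIGMA)
max_pixel_change = max(abs(a - b) for a, b in zip(alt_image, true_image))

print(f"chi^2 (original image): {chi2_true:.3e}")
print(f"chi^2 (rippled image):  {chi2_alt:.3e}")
print(f"largest pixel difference: {max_pixel_change:.2f}")
```

Both images fit the data equally well, even though the rippled image contains unphysical negative pixels, so the likelihood alone cannot choose between them.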

Regularization

There are a number of different ways to talk about regularization. If one wants to be Bayesian about it, one would talk about specifying priors, i.e., we introduce terms like $p(\boldsymbol{\theta})$ such that we might calculate the maximum a posteriori (MAP) image $\boldsymbol{\theta}_\mathrm{MAP}$ using the posterior probability distribution

$$ p(\boldsymbol{\theta} |\, \boldsymbol{V}) \propto \mathcal{L}(\boldsymbol{V} |\, \boldsymbol{\theta}) \, p(\boldsymbol{\theta}). $$

For computational reasons related to numerical over/underflow, we would most likely use the logarithm of the posterior probability distribution

$$ \ln p(\boldsymbol{\theta} |\, \boldsymbol{V}) = \ln \mathcal{L}(\boldsymbol{V} |\, \boldsymbol{\theta}) + \ln p(\boldsymbol{\theta}) + \mathrm{const}. $$

One could accomplish the same goal without necessarily invoking the Bayesian language by simply talking about which parameters $\boldsymbol{\theta}$ optimize some objective function.

We’ll adopt the perspective that we have some objective “cost” function that we’d like to minimize to obtain the optimal parameters $\hat{\boldsymbol{\theta}}$. The machine learning community calls this a “loss” function $L(\boldsymbol{\theta})$, and so we’ll borrow that terminology here. For an unregularized fit, an acceptable loss function is just the negative log likelihood (“nll”) term,

$$ L(\boldsymbol{\theta}) = L_\mathrm{nll}(\boldsymbol{\theta}) = - \ln \mathcal{L}(\boldsymbol{V}|\,\boldsymbol{\theta}) = \frac{1}{2} \chi^2(\boldsymbol{V}|\,\boldsymbol{\theta}) $$

If we’re only interested in $\hat{\boldsymbol{\theta}}$, it doesn’t matter whether we include the 1/2 prefactor in front of $\chi^2$, the loss function will still have the same optimum. However, when it comes time to add additional terms to the loss function, these prefactors matter in controlling the relative strength of each term.

When phrased in the terminology of function optimization, additional terms can be described as regularization penalties. To be specific, let’s add a term that regularizes the sparsity of an image.

$$ L_\mathrm{sparsity}(\boldsymbol{\theta}) = \sum_i |I_i| $$

In short, the L1 norm promotes sparse solutions (solutions where many pixel values are zero). The combination of these two terms leads to a new loss function

$$ L(\boldsymbol{\theta}) = L_\mathrm{nll}(\boldsymbol{\theta}) + \lambda_\mathrm{sparsity} L_\mathrm{sparsity}(\boldsymbol{\theta}) $$

where we control the relative “strength” of the regularization via the scalar prefactor $\lambda_\mathrm{sparsity}$. If $\lambda_\mathrm{sparsity} = 0$, no sparsity regularization is applied. Non-zero values of $\lambda_\mathrm{sparsity}$ will add in regularization that penalizes non-sparse $\boldsymbol{\theta}$ values. How strong this penalization is depends on the strength relative to the other terms in the loss calculation.

We can equivalently specify this using Bayesian terminology, such that

$$ p(\boldsymbol{\theta} |\,\boldsymbol{V}) \propto \mathcal{L}(\boldsymbol{V}|\,\boldsymbol{\theta}) \, p(\boldsymbol{\theta}) $$

where

$$ p(\boldsymbol{\theta}) = C \exp \left (-\lambda_\mathrm{sparsity} \sum_i | I_i| \right) $$

and $C$ is a normalization factor. When working with the logarithm of the posterior, this constant term is irrelevant.
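The regularized loss can be sketched in plain Python for a toy one-dimensional "image". The forward model here is a direct discrete Fourier sum at a few sampled frequencies, and the data, weights, and $\lambda_\mathrm{sparsity}$ values are invented for illustration. This is not how MPoL implements it. Two images that fit the data equally well are separated only once the sparsity term is switched on:

```python
import cmath
import math


def forward_model(pixels, sampled_k):
    """Model visibilities: discrete Fourier components at the sampled frequencies."""
    n_pix = len(pixels)
    return [
        sum(p * cmath.exp(-2j * cmath.pi * k * j / n_pix) for j, p in enumerate(pixels))
        for k in sampled_k
    ]


def nll_loss(pixels, sampled_k, data_vis, weights):
    """Negative log-likelihood (up to a constant): chi^2 / 2."""
    model_vis = forward_model(pixels, sampled_k)
    return 0.5 * sum(w * abs(v - m) ** 2 for v, m, w in zip(data_vis, model_vis, weights))


def sparsity_loss(pixels):
    """L1 norm of the pixel values."""
    return sum(abs(p) for p in pixels)


def total_loss(pixels, sampled_k, data_vis, weights, lam_sparsity):
    """L = L_nll + lambda_sparsity * L_sparsity."""
    return nll_loss(pixels, sampled_k, data_vis, weights) + lam_sparsity * sparsity_loss(pixels)


SAMPLED_K = [0, 1, 2]
WEIGHTS = [100.0, 100.0, 100.0]  # assumed weights, 1/Jy^2

sparse_image = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
data_vis = forward_model(sparse_image, SAMPLED_K)  # noiseless toy data

# Same image plus a ripple at the unsampled frequency k = 3.
ripple_image = [
    p + 0.3 * math.cos(2.0 * math.pi * 3 * j / len(sparse_image))
    for j, p in enumerate(sparse_image)
]

for lam in (0.0, 1.0):
    loss_sparse = total_loss(sparse_image, SAMPLED_K, data_vis, WEIGHTS, lam)
    loss_ripple = total_loss(ripple_image, SAMPLED_K, data_vis, WEIGHTS, lam)
    print(f"lambda = {lam}: sparse {loss_sparse:.3f}, rippled {loss_ripple:.3f}")
```

With $\lambda_\mathrm{sparsity} = 0$ the two losses are indistinguishable. With $\lambda_\mathrm{sparsity} > 0$ the sparser image has the lower loss, which is the same preference the exponential prior expresses in the Bayesian formulation.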

The MPoL package for Regularized Maximum Likelihood imaging

Million Points of Light or “MPoL” is a Python package that is used to perform regularized maximum likelihood imaging. By that we mean that the package provides the building blocks to create flexible image models and optimize them to fit interferometric datasets. The package is developed completely in the open on GitHub.

We strive to

create an open, welcoming, and supportive community for new users and contributors (see our code of conduct at https://github.com/MPoL-dev/MPoL/blob/main/CODE_OF_CONDUCT.md and the developer documentation)
support well-tested and stable releases (i.e., pip install mpol) that run on all currently-supported Python versions, on Linux, MacOS, and Windows
maintain up-to-date API documentation
cultivate tutorials covering real-world applications

We also recommend checking out several other excellent packages for RML imaging:

There are a few things about MPoL that we believe make it an appealing platform for RML modeling.

Built on PyTorch: Many of MPoL’s exciting features stem from the fact that it is built on top of a rich computational library that supports autodifferentiation and construction of complex neural networks. Autodifferentiation libraries like Theano/Aesara, Tensorflow, PyTorch, and JAX have revolutionized the way we compute and optimize functions. For now, PyTorch is the library that best satisfies our needs, but we’re keeping a close eye on the Python autodifferentiation ecosystem should a more suitable framework arrive. If you are familiar with scientific computing with Python but haven’t yet tried any of these frameworks, don’t worry, the syntax is easy to pick up and quite similar to working with numpy arrays.
Autodifferentiation: PyTorch gives MPoL the capacity to autodifferentiate through a model. The gradient of the objective function is exceptionally useful for finding the “downhill” direction in a large parameter space (such as the set of image pixels). Traditionally, these gradients would have needed to be calculated analytically (by hand) or via finite-difference methods, which can be noisy in high dimensions. Leveraging the autodifferentiation capabilities allows us to rapidly formulate and implement complex prior distributions which would otherwise be difficult to differentiate by hand.
Optimization: PyTorch provides a full-featured suite of research-grade optimizers designed to train deep neural networks. These same optimizers can be employed to quickly find the optimum RML image.
GPU acceleration: PyTorch wraps CUDA libraries, making it seamless to take advantage of (multi-)GPU acceleration to optimize images. No need to use a single line of CUDA.
Model composability: Rather than being a monolithic program for single-click RML imaging, MPoL strives to be a flexible, composable, RML imaging library that provides primitives that can be used to easily solve your particular imaging challenge. One way we do this is by mimicking the PyTorch ecosystem and writing the RML imaging workflow using PyTorch modules. This makes it easy to mix and match modules to construct arbitrarily complex imaging workflows. We’re working on tutorials that describe these ideas in depth, but one example would be the ability to use a single latent space image model to simultaneously fit single dish and interferometric data.
A bridge to the machine learning/neural network community: MPoL will happily calculate RML images for you using “traditional” image priors, in case you are the kind of person that turns your nose up at the words “machine learning” or “neural network.” However, if you are the kind of person that sees opportunity in these tools, because MPoL is built on PyTorch, it is straightforward to take advantage of them for RML imaging. For example, if one were to train a variational autoencoder on protoplanetary disk emission morphologies, the latent space + decoder architecture could be easily plugged in to MPoL and serve as an imaging basis set.

Last time

Recap of (parametric) forward modeling in a Bayesian context
Recap of the CLEAN procedural image deconvolution algorithm
Introduction of RML process as a non-parametric model
Discussion of regularization, in the context of priors
Discussion of loss function space (defined by probability distribution) vs. the optimization engineering that helps you navigate it

Today

Overarching question—how do you assess whether something is good? Forays into Machine Learning
Deeper dive into future RML applications and opportunities

Model comparison

Last time we talked about the difference between parametric and non-parametric models, i.e., the difference between fitting a line with slope and intercept vs. fitting a spline or Gaussian process. And we made the simple distinction that a parametric model has a fixed number of parameters, whereas a non-parametric model generally has parameters that grow with your number of data points. The truth is that in several contexts these exist as part of a continuum.

Today we’re going to take a journey along this continuum and examine some of the failure modes that can come about. The discussion will first be general and applicable to many problems, but then we’ll zero in on the case of interferometric imaging (both CLEAN and RML) specifically.

Let’s think of a polynomial basis. Recall one of our first lectures, where I asked you to draw a function through a set of discrete samples. Let’s narrow in on the specific case of a polynomial basis set with $N$ terms (i.e., a polynomial of degree $N - 1$). I’m going to write it like this

$$ y = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + \ldots + a_{N-1} x^{N-1}, $$ but similar arguments apply to Legendre polynomials or Chebyshev polynomials, etc., and in practice it’s better to use one of those for your actual fitting problem.

But let’s say we have 10 data points. We can fit a 0th order polynomial, 1st order, etc., using, say, a $\chi^2$ likelihood function. And then we can get to a polynomial of degree ten, which has eleven coefficients. Who has heard the common critique “your model has more parameters than data, it’s so flexible”? When people utter this critique, I think this is the situation that they are referring to, and it is where the common advice against using a model with more parameters than data points comes from.

What are the criticisms of this model? On one hand, it has fit the data perfectly. It’s done what we’ve asked. On the other hand, it doesn’t seem to do what we want. If we were to get new data, our model probably wouldn’t be that useful.

What can we do? Well, the common wisdom would have us stick to models with fewer parameters. But, this is being a little shortsighted. Instead, we can add a regularization that discourages the fit from taking on large amplitudes in many of the terms. One type of regularization is “ridge regression,” also called Tikhonov or L2 regularization. It adds an extra term to the fit metric that says

$$ \lambda \sum_{i=0}^{N-1}|a_i|^2 $$

With this term included, we find that the amplitudes of those higher order terms (which might not be necessary) will be diminished. If you recall from very early in the semester, where we talked about the concept of band-limited signals, this regularization is related. In the limit $N\rightarrow \infty$, we arrive at a Gaussian process, and the autocovariance of the function (the Gaussian process kernel) is related to the power spectrum of the signal. If we say the signal is band-limited, then that puts a cap on the number of higher order terms that we can actually use.

This math is fully equivalent to the RML imaging problem we introduced last week, and it also raises the same problem: how do we set the regularizer strength? What is the best choice?
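
To make the effect of the penalty concrete, here is a minimal sketch of a ridge-regularized polynomial fit with NumPy. The function name `ridge_polyfit`, the synthetic sine-plus-noise data, and the two values of $\lambda$ are illustrative assumptions, not something from the lecture; a plain monomial basis is used only because it matches the equation above.

```python
import numpy as np

def ridge_polyfit(x, y, n_terms, lam):
    """Fit y = sum_i a_i x^i while penalizing lam * sum_i |a_i|^2."""
    # Design matrix with columns x^0, x^1, ..., x^(n_terms - 1)
    X = np.vander(x, n_terms, increasing=True)
    # Minimizing |y - X a|^2 + lam |a|^2 gives (X^T X + lam I) a = X^T y
    A = X.T @ X + lam * np.eye(n_terms)
    return np.linalg.solve(A, X.T @ y)

# Hypothetical dataset: 10 noisy samples of a smooth function
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 10)
y = np.sin(2.0 * x) + 0.1 * rng.normal(size=x.size)

a_weak = ridge_polyfit(x, y, n_terms=10, lam=1e-10)   # nearly unregularized
a_ridge = ridge_polyfit(x, y, n_terms=10, lam=1e-2)   # regularized

print("largest |a_i|, weak penalty  :", np.abs(a_weak).max())
print("largest |a_i|, strong penalty:", np.abs(a_ridge).max())
```

Increasing `lam` trades a slightly worse fit to the 10 points for smaller coefficient amplitudes, which is the same tradeoff we face when choosing the regularizer strength in RML imaging.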

Cross validation

Useful thoughts from https://biometry.github.io/APES/LectureNotes/2017-Resampling/CrossValidationLecture.html

The idea is to test the predictive power of your model. In this case, the model would be your setup of your. In the RML case, the model would be the settings of your image pixelization,

If we have the right model, we will generalize perfectly to new data. The problem is that our training data are always limited and will almost always have some noise.

A further problem is the non-independence of your random hold-outs. When the dataset is small, it is possible to overfit your cross validation. This is a hard place to be in, especially when getting new data is expensive.
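
As a rough sketch of how cross validation can be used to pick the regularizer strength, the snippet below runs $K$-fold cross validation over a grid of $\lambda$ values for the ridge-regularized polynomial fit. Everything here (the helper names `ridge_polyfit` and `kfold_cv_score`, the synthetic data, $K=5$, and the grid of $\lambda$) is an assumption made for illustration.

```python
import numpy as np

def ridge_polyfit(x, y, n_terms, lam):
    """Ridge-regularized polynomial fit in a monomial basis."""
    X = np.vander(x, n_terms, increasing=True)
    return np.linalg.solve(X.T @ X + lam * np.eye(n_terms), X.T @ y)

def kfold_cv_score(x, y, n_terms, lam, k=5, seed=0):
    """Mean squared prediction error on the held-out folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(x.size), k)
    errors = []
    for test_idx in folds:
        train = np.ones(x.size, dtype=bool)
        train[test_idx] = False
        # Train on k - 1 folds, then predict the points that were held out
        a = ridge_polyfit(x[train], y[train], n_terms, lam)
        pred = np.vander(x[test_idx], n_terms, increasing=True) @ a
        errors.append(np.mean((pred - y[test_idx]) ** 2))
    return float(np.mean(errors))

# Hypothetical dataset
rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 30)
y = np.sin(2.0 * x) + 0.1 * rng.normal(size=x.size)

lams = 10.0 ** np.arange(-8, 3)
scores = [kfold_cv_score(x, y, 10, lam) for lam in lams]
best_lam = lams[int(np.argmin(scores))]
print("lambda with the lowest held-out error:", best_lam)
```

The same random split is reused for every $\lambda$ so that the scores are directly comparable. With only 30 points, the preferred $\lambda$ can change with the random seed, which is one face of the small-data problem mentioned above.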

Image Plane Deconvolution (CLEAN)

Tue, 19 Sep 2023 00:00:00 +0000

References

Synthesis Imaging in Radio Astronomy II: Lecture 7: Imaging by Briggs, Schwab, and Sramek and Lecture 8: Deconvolution by Cornwell, Braun, and Briggs
Molecules with ALMA at Planet-forming Scales (MAPS). II. CLEAN Strategies for Synthesizing Images of Molecular Line Emission in Protoplanetary Disks by Czekala et al. 2021

Outline

Recap of visibility datasets and the sampling function
Image plane implications of sampling–the dirty image
Noise and “weighting”
The CLEAN image deconvolution algorithm

Visibility datasets

Recall from last time that the visibility function is the Fourier transform of the sky brightness distribution

$$ \mathcal{V}(u,v) \leftrightharpoons I(l,m) $$

and that each baseline (pair of antennas) of an interferometric array corresponds to a sample of the visibility function at a specific $u,v$ point. The $u,v$ point corresponds to the length of the projected baseline in multiples of the observing wavelength and is the spatial frequency of the image plane that is being sampled.

For a large array like ALMA, with around 50 antennas, you get on the order of 1000 unique instantaneous baselines, one from each pairwise combination of antennas in the array ($N(N-1)/2$ baselines for $N$ antennas, which is 1225 for $N = 50$). As the Earth rotates, you can quickly acquire new projected baseline samples.

Sampling function

If you do radio interferometry, you will very often see the baseline distributions plotted as a series of $\delta$ functions on the $u,v$ plane.

TODO: draw approximate plot with points

This is called the sampling function

$$ S(u,v) = \sum_{k=1}^M \delta(u - u_k, v - v_k). $$

And, we can write down the sampling of the visibility function as

$$ S(u, v) \times \mathcal{V}(u, v). $$

If you recall from our Fourier transform discussion, this sampling function is also called the transfer function. It allows certain spatial frequencies through the interferometric system. Both terminologies are used in the radio astronomy community.

Image plane implications

Now, let’s discuss the image-plane ramifications of the sampling operation. We started with $I(l,m)$, the “true” sky brightness, i.e., the one you would observe if you had a perfectly sensitive telescope with infinite resolving power.

$$ \mathcal{V}(u,v) \leftrightharpoons I(l,m) $$

but we’ve multiplied it by the sampling distribution

$$ S(u, v) \times \mathcal{V}(u, v). $$

Remember how we talked about interferometers as spatial filters? We just showed how this conceptually works in the Fourier domain: the $u,v$ coverage provided by the antenna spacings is the transfer function.

We can also show that interferometers are spatial filters by considering the image plane implications. This same operation implies that the true sky brightness is convolved by something in the image plane $$ I(l, m) * B_D(l, m) \leftrightharpoons S(u, v) \times \mathcal{V}(u, v). $$

The quantity $B_D(l,m)$ is called the dirty beam. We’ll soon talk about the CLEAN algorithm, so the dirty/clean terminology will soon make sense. But first, let’s just think about this for a second. This dirty beam is the same thing we showed in previous lectures, and can also be thought of as the sum of the fringe functions.

If you take an image, convolve it with the dirty beam, and then take its Fourier transform, you’ll see that you will have visibility samples only at the spatial frequencies corresponding to your baselines.

Another way to think of the dirty beam is as the impulse response of the interferometric system. Let’s assume we are observing a point source $I(l,m) = \delta(l,m)$. The visibility function corresponding to a point source is a constant, so we have

$$ \delta(l,m) * B_D(l,m) \leftrightharpoons S(u,v) \times \mathrm{constant} $$

$$ B_D(l,m) \leftrightharpoons S(u,v). $$

So, another way we can say the same thing is that the dirty beam is the point spread function (PSF) of the interferometer, and it is given by the Fourier transform of the sampling function, which is set by the configuration of the baselines within the array.
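
The relationship $B_D(l,m) \leftrightharpoons S(u,v)$ can be sketched numerically by summing the Fourier components of each $\delta$ function directly. This is only an illustration under stated assumptions: the baselines are random draws rather than a real array configuration, the helper name `dirty_beam` is made up, and the Hermitian-conjugate samples at $(-u_k, -v_k)$ are included so that the beam is real.

```python
import numpy as np

def dirty_beam(u, v, l, m):
    """Direct Fourier sum of delta-function samples at (u_k, v_k) and their conjugates."""
    # Phase of every sample at every pixel; shape (len(m), len(l), M)
    phase = 2.0 * np.pi * (u * l[None, :, None] + v * m[:, None, None])
    # exp(i phase) + exp(-i phase) = 2 cos(phase); the mean normalizes the peak to 1
    return np.cos(phase).mean(axis=-1)

rng = np.random.default_rng(0)
# Hypothetical baselines, in multiples of the observing wavelength
u = rng.normal(scale=200e3, size=200)
v = rng.normal(scale=200e3, size=200)

arcsec = np.pi / (180.0 * 3600.0)            # radians per arcsecond
l = np.linspace(-2.0, 2.0, 65) * arcsec
m = np.linspace(-2.0, 2.0, 65) * arcsec

beam = dirty_beam(u, v, l, m)
print("value at the center pixel:", beam[32, 32])
print("most negative sidelobe   :", beam.min())
```

The central peak is the main lobe of the PSF, and the ripples away from it are the sidelobes that come from the gaps in the $u,v$ coverage.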

The Dirty Image

In last week’s lecture, we talked a bit about the actual visibility data products, a discrete set of (noisy) visibility samples

$$ \mathbf{V} = \{V_1, V_2, \ldots, V_M\}. $$

The idea is that each was sampled with some (complex) noise draw

$$ V_i = \mathcal{V}(u_i, v_i) + \epsilon $$

such that $$ \epsilon_\Re \sim \mathcal{N}(0, \sigma) $$

$$ \epsilon_\Im \sim \mathcal{N}(0, \sigma) $$

$$ \epsilon = \epsilon_\Re + i \epsilon_\Im. $$

And then we said that radio interferometers commonly represent the uncertainty on each visibility measurement by a “weight” $w_i$, where

$$ w_i = \frac{1}{\sigma_i^2}. $$

More details about the sampling function

The fact that each visibility measurement is made in the presence of noise means that we should be taking this into account in our sampling function. So a more sophisticated sampling function looks like

$$ S(u, v) = \sum_{k=1}^M T_k D_k w_k \delta(u - u_k, v - v_k). $$

In addition to weighting each sample by its inverse variance $w_k$ (something you might do to take a statistical average), there are other factors you might fiddle with, like a “taper” $T_k$ and density weight $D_k$. For now, you can just think of them as equal to 1.0.

Now, let’s think about how we would take our sampled visibilities $\mathbf{V} = \{V_1, V_2, \ldots, V_M\}$ and make an image from them. Well, the way to do this looks like an inverse Fourier transform

$$ I_D(l, m) = C \sum_{k=1}^{N^\dagger} T_k D_k w_k V_k \exp \{2 \pi i (u_k l + v_k m) \}, $$ with normalization constant $$ C = 1 / \sum_{k=1}^{N^\dagger} T_k D_k w_k, $$

where $N^\dagger = 2 M$ is the number of visibilities once their complex conjugates are included, such that the sampling is Hermitian. This image that results is called the dirty image, and we denote it with a “D” subscript.

Continuing with the dirty beam concept, we also have

$$ I_D(l,m) = I(l, m) * B_D(l, m) \leftrightharpoons S(u, v) \times \mathcal{V}(u, v). $$
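
Here is a minimal sketch of the dirty-image sum for a single noisy point source, with $T_k = D_k = 1$ so that only the inverse-variance weights $w_k$ enter. The baselines, noise level, source position, and the helper name `dirty_image` are all assumptions chosen for illustration. The sign of the exponent in the simulated visibilities follows the forward transform $\mathcal{V} = \int I \, e^{-2 \pi i (ul + vm)}$.

```python
import numpy as np

def dirty_image(u, v, vis, weight, l, m):
    """Weighted direct Fourier sum over the samples and their complex conjugates."""
    # Phase at every pixel for every sample; shape (len(m), len(l), M)
    phase = 2.0 * np.pi * (u * l[None, :, None] + v * m[:, None, None])
    # A sample plus its conjugate partner at (-u_k, -v_k) contributes
    # 2 w_k Re[V_k exp(i phase)], and C = 1 / (2 sum_k w_k) when T_k = D_k = 1
    return np.sum(weight * np.real(vis * np.exp(1j * phase)), axis=-1) / np.sum(weight)

rng = np.random.default_rng(0)
n_vis = 200
u = rng.normal(scale=200e3, size=n_vis)      # hypothetical baselines (wavelengths)
v = rng.normal(scale=200e3, size=n_vis)

arcsec = np.pi / (180.0 * 3600.0)
l = np.linspace(-2.0, 2.0, 65) * arcsec
m = np.linspace(-2.0, 2.0, 65) * arcsec

# 1 Jy point source at image pixel (row 24, column 40), plus complex Gaussian noise
l0, m0 = l[40], m[24]
sigma = 0.05
noise = rng.normal(scale=sigma, size=n_vis) + 1j * rng.normal(scale=sigma, size=n_vis)
vis = 1.0 * np.exp(-2j * np.pi * (u * l0 + v * m0)) + noise
weight = np.full(n_vis, 1.0 / sigma**2)

image = dirty_image(u, v, vis, weight, l, m)
peak = np.unravel_index(np.argmax(image), image.shape)
print("peak pixel:", peak, "peak value (Jy/beam):", image[peak])
```

Because the source is a point, the resulting image is just the dirty beam shifted to the source position and scaled by its flux, plus noise.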

A quick note about the “ill-defined” imaging problem and weighting choices

Making images from Fourier samples is generally an ill-defined inverse process, which is only further complicated in the presence of noise. What do we mean by ill-defined?

In the forward process, we collect samples of the visibility function at specific $u,v$ values $S(u, v) \times \mathcal{V}(u, v)$. To make a dirty image, we take the inverse Fourier transform of those values to produce an image. This is the inverse process.

The ill-defined nature of imaging (even in the absence of noise): Think about what’s happening in the forward process. If the visibility function had non-zero amplitudes at some $u,v$ values that we didn’t sample, then these never enter our dataset (obviously). The inverse process itself doesn’t include these values (obviously), and by their omission, assumes that their amplitudes are equal to zero.

So, as far as our interferometer is concerned, it can’t distinguish between a set of degenerate image brightness distributions on the sky so long as they have the exact same $\mathcal{V}(u,v)$ values at the sampled $u,v$ points. The unsampled $\mathcal{V}(u,v)$ locations can take on arbitrary values and still result in the exact same dataset.

Another way of saying the same thing is to think of an interferometer as a spatial filter, i.e., its transfer function (the array configuration) only allows measurements of certain spatial frequencies to enter the dataset. But most images contain power at many spatial frequencies, including those that have been filtered out. So, if you just try to make an image with the spatial frequencies in your dataset, your image will most likely be missing some spatial frequencies that would be there in actuality.

Weighting choices

The whole discussion about the ill-defined inverse problem applies even if we sampled the visibility function perfectly with no noise, so long as there are still unsampled $u,v$ points that contain significant visibility “power.”

The problem gets even more complicated when we consider measurement noise and the fact that array configurations usually sample some parts of $u,v$ space better than others. In my opinion, this is really why various weighting schemes are as popular as they are. The common way to adjust the weighting is by tuning the $D_k$ and $T_k$ terms. As we’ll show later in the slides, tuning the $D_k$ terms provides a tradeoff between:

natural weighting: maximizes point source sensitivity at the cost of spatial resolution (broader beam)
uniform weighting: maximizes spatial resolution (narrow beam) at the cost of point source sensitivity (higher RMS noise floor in the image)
“Briggs” robust weighting: a tradeoff between these two regimes, controlled by a robust parameter ranging from -2 (close to uniform) to 2 (close to natural). The tradeoff is non-linear, so coming from natural weighting, for example, good spatial resolution can be gained with only modest sacrifices in point source sensitivity. Or, vice versa, coming from uniform weighting, good sensitivity to point sources can be gained with only modest losses in resolution.

Typically, a reasonable starting point with ALMA observations is to use some in-between value of Briggs weighting like -0.5, 0.0, or 0.5. Different weighting choices can change your sensitivities on different spatial scales.

The $T_k$ terms are for applying a taper, whereby one downweights longer baseline observations.

CLEAN

We’ve talked about how

$$ I_D(l,m) = I(l, m) * B_D(l, m) \leftrightharpoons S(u, v) \times \mathcal{V}(u, v). $$

Specifically, the dirty image is the convolution of the true sky brightness $I$ with the dirty beam $B_D$. We know what the dirty beam is to high precision.

First, it’s important to note that convolution is a lossy procedure: you (irrevocably) lose information. For example, consider applying a Gaussian blur to an image. The high resolution information in that image has been lost.

CLEAN is an image deconvolution algorithm. We just said that convolution is a lossy procedure, so, how does the algorithm get that information back? What follows are my own opinions about the CLEAN algorithm, its use cases I’m most familiar with in the protoplanetary disk community, and its limitations.

TODO: draw a 1D cut of the dirty beam

The short answer is that CLEAN can help restore an image up to a point. The thing that CLEAN is best at is removing the effect of those nasty sidelobes from a dirty beam, and replacing them with a more Gaussian beam response that is usually easier to work with. CLEAN will not give you “super-resolution” access to spatial frequencies that you have lost, but it can help you make better looking images, and ones that have better dynamic range.

Iterative processes

CLEAN is a procedure that iteratively builds up a model image. To carry this out on the whiteboard, I’m going to do things in 1D. In a moment we’ll show an example with 2D images in the slides.

TODO: draw in 1D a dirty image of a few point sources, a representation of the dirty beam, and a blank model image

Before we start, we’ll define a quantity called the CLEAN beam. It’s usually chosen to be a Gaussian fit to the main lobe of the dirty beam.

1. First, we identify the peak location in the dirty image.
2a. Then, we subtract some fraction of the peak flux times the dirty beam (centered on the peak) from this location. The dirty image now becomes a “residual image.”
2b. At the same time, we add a $\delta$ function at the corresponding location in the model image with the same amplitude as the flux we subtracted. So, if we subtracted 0.1 Jy of flux in the dirty beam, then we would add a $\delta$ function with amplitude of 0.1 Jy in the model image.
2c. You can think of these steps as equivalent, because a $\delta$ function convolved with the dirty beam gives you back the dirty beam. These steps are also equivalently carried out in the visibility domain.
3. Go back to step 1, and repeat with the next-highest peak location. Continue this loop until the peak flux in the residual image drops below some threshold.
4. Once this threshold is reached, the CLEANing is done. The final step is to put everything back together. The model image is convolved with the CLEAN beam to form the restored image. This “smooths out” the model image to some resolution limit, and hides imperfections on smaller scales.
5. The remainder of the residual image is added back to the restored image to give a sense of the “noise” in the image.
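
The loop above can be sketched in 1D with a few lines of NumPy. This is a simplified illustration, not production code: the function names, the toy sinc-shaped dirty beam, the loop gain of 0.1, the stopping threshold, and the Gaussian CLEAN beam width are all assumptions, and it does not reflect how any particular software package implements CLEAN.

```python
import numpy as np

def hogbom_clean_1d(dirty, beam, gain=0.1, threshold=1e-3, max_iter=10000):
    """1D CLEAN loop. `beam` has length 2n - 1 with unit peak at its center."""
    n = dirty.size
    center = beam.size // 2
    residual = dirty.astype(float).copy()
    model = np.zeros(n)
    for _ in range(max_iter):
        peak = int(np.argmax(np.abs(residual)))           # step 1: find the peak
        if np.abs(residual[peak]) < threshold:            # step 3: stopping rule
            break
        flux = gain * residual[peak]
        # step 2a: subtract the dirty beam, shifted to the peak and scaled
        residual -= flux * beam[center - peak : center - peak + n]
        model[peak] += flux                               # step 2b: delta function
    return model, residual

def restore(model, residual, clean_beam):
    """Steps 4 and 5: convolve the model with the CLEAN beam, add the residuals."""
    return np.convolve(model, clean_beam, mode="same") + residual

n = 64
offsets = np.arange(-(n - 1), n)
beam = np.sinc(offsets / 4.0)                 # toy dirty beam with sidelobes

truth = np.zeros(n)
truth[20], truth[45] = 1.0, 0.5               # two point sources (Jy)
# dirty[k] = sum_j truth[j] * beam(k - j)
dirty = np.convolve(truth, beam, mode="valid")

model, residual = hogbom_clean_1d(dirty, beam)

# Hypothetical CLEAN beam: a Gaussian roughly matching the main lobe
clean_beam = np.exp(-0.5 * (np.arange(-10, 11) / 2.0) ** 2)
restored = restore(model, residual, clean_beam)
print("model flux near the two sources:", model[18:23].sum(), model[43:48].sum())
```

At every iteration, the model convolved with the dirty beam plus the residual image adds up to the original dirty image, which is a useful sanity check on any implementation of this loop.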

Limitations

CLEAN is procedural. What this means is that you set parameters that guide the above process and then carry on until some termination criterion is reached. This could also be part of an interactive process. There is no guarantee that the CLEANed image is unique, either.

In my opinion, CLEAN is best at removing the sidelobe effects of the dirty beam, improving the dynamic range of your image, and possibly detecting fainter (point-like) sources that would have been hidden by the sidelobes of other brighter point sources.

In the above example, we said that we would use a $\delta$ function to build up a model image. You may have already identified this choice of basis set or “CLEAN component” as a potential limitation. This works great for fields of point sources, but what about extended sources? It turns out that it actually works OK for extended sources, so long as you use many $\delta$-function components. This adds to the computational time, though, and is why I think we’re now currently in an interesting place, approaching the limitations of CLEAN.

There are other extensions to CLEAN (called multi-scale), which use CLEAN components of varying sizes, like little Gaussian blobs. This can help substantially over using just $\delta$ functions. For very resolved structures, though, you still run into the problem that your components aren’t sufficiently like the morphology of the source you’re trying to deconvolve.

2D Interferometry, PSFs, Gridding, and Dirty Images

Tue, 19 Sep 2023 00:00:00 +0000

References for today

Interferometry and Synthesis in Radio Astronomy by Thompson, Moran, and Swenson, particularly Appendix 2.1
Essential Radio Astronomy by James Condon and Scott Ransom
The Fourier Transform and its Applications by R. Bracewell

Review of last time

We talked about R.A., Dec., and direction cosines as coordinates on the sky
We introduced a two-element interferometer with multiply and add correlator backend
We introduced the fringe pattern $F(l)$ of a two-antenna interferometer, which is a cosine wave on the sky
We discussed how the frequency of this fringe pattern (i.e., the spatial frequency) changes as a function of baseline length between the two antennas
We introduced the visibility function $\mathcal{V}(u)$ as the Fourier transform of the sky-brightness $I(l)$
And discussed how the output from the interferometer changes in response to a point source and an extended source

This time

Extend our discussion to a general interferometer with north-south and east-west baselines
Discuss arrays with multiple antennas
Earth-rotation aperture synthesis
$u,v$ sampling distributions for various arrays
point spread functions (PSFs) and their relationship to the sampling distribution

Wrapping up 1D

Last time, we had said the response $R(l)$ of the interferometer to some sky distribution $I(l)$ is to act as a convolution $$ R(l) = \cos(2 \pi u l) * I(l). $$ In this case, you can think of the fringe function as a point spread function (a terrible one), that is convolving out the true sky brightness distribution.

Another way to look at this, though, is through our old friend the multiplication/convolution theorem. Let $u_0$ be the (fixed, instantaneous) projected baseline distance between two antennas, measured in multiples of the observing wavelength $$ u_0 = \frac{D \cos \theta}{\lambda} $$ where $\theta$ is the angle from zenith.

Then, the Fourier pair of the fringe function is $$ \cos(2 \pi u_0 l) \leftrightharpoons \frac{1}{2} [\delta(u + u_0) + \delta(u - u_0)]. $$ These are two delta functions situated at $\pm u_0$.

The Fourier pair of the sky brightness distribution is called the visibility function $$ I(l) \leftrightharpoons \mathcal{V}(u). $$

In the Fourier plane, the response of the interferometer is a multiplication $$ \cos(2 \pi u_0 l) * I(l) \leftrightharpoons \frac{1}{2} [\delta(u + u_0) + \delta(u - u_0)] \times \mathcal{V}(u) $$ i.e., the interferometer has sampled the visibility function at locations $\pm u_0$ corresponding to the baseline distance of the two antennas.

Another way to think about this is that the interferometer acts as a spatial filter that only responds to the two spatial frequencies $\pm u_0$. $\mathcal{V}(u)$ represents the amplitude and phase of the sinusoidal component of the intensity distribution with spatial frequency $u$ cycles per radian. The negative spatial frequency doesn’t have a physical meaning but is a mathematical convenience. Because the intensity distribution on the sky is a real quantity, the visibility function itself is symmetric about the origin in a Hermitian sense, meaning its real part is even and its imaginary part is odd, $\mathcal{V}(-u) = \mathcal{V}^*(u)$.
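
The Hermitian symmetry is easy to check numerically. The sketch below approximates the 1D Fourier integral with a Riemann sum for a real, off-center Gaussian source; the source shape, the grid, the baseline length `u0`, and the helper name `visibility_1d` are all illustrative assumptions.

```python
import numpy as np

def visibility_1d(l, sky, u):
    """Riemann-sum approximation to V(u) = integral of I(l) exp(-2 pi i u l) dl."""
    dl = l[1] - l[0]
    return np.sum(sky * np.exp(-2j * np.pi * u * l)) * dl

l = np.linspace(-1e-4, 1e-4, 2001)               # direction cosine (~radians)
sky = np.exp(-0.5 * ((l - 2e-5) / 1e-5) ** 2)    # real-valued, off-center source
u0 = 5.0e3                                       # baseline in wavelengths

v_plus = visibility_1d(l, sky, u0)
v_minus = visibility_1d(l, sky, -u0)
print("V(+u0) =", v_plus)
print("V(-u0) =", v_minus)
```

Because the source is offset from the phase center, the visibility has a non-zero imaginary part (i.e., a non-zero phase), and the sample at $-u_0$ is the complex conjugate of the sample at $+u_0$.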

Moving on to the general case

We’re going to take our two-element interferometer and re-derive the same relationships using a general vector formalism
Then, we’re going to introduce a general 3D cartesian coordinate set to the problem, which is used by most interferometers

Coordinates

Let’s consider a generic situation of a two-element interferometer observing (tracking) a source on the sky with phase center $\mathbf{s}_0$.

TMS Fig 3.1

Antenna power pattern: An element of the source with solid angle $d\Omega$ at some position $\mathbf{s} = \mathbf{s}_0 + \mathbf{\sigma}$ will contribute an element of power $ \frac{1}{2} A(\sigma) I(\sigma) \Delta \nu d\Omega$, where $A$ is some (normalized) power pattern of a single antenna. For now, you can consider it to be a directionally smooth function that is effectively constant over the field of view of interest.

From what we just talked about for the 1D example, this component of the correlator output will be proportional to the received power and to the fringe term $\cos(2 \pi \nu \tau_g)$.

Let $\mathbf{D}_\lambda$ be a baseline vector which points from the central antenna to the other one, and specifies the baseline length in multiples of the observing wavelength. Then

$$ \nu \tau_g = \mathbf{D}_\lambda \cdot \mathbf{s} = \mathbf{D}_\lambda \cdot (\mathbf{s}_0 + \mathbf{\sigma}). $$

To calculate the output from the correlator, we need to integrate over the spatial distribution of the source

$$ r(\mathbf{D}_\lambda, \mathbf{s}_0) = \Delta \nu \int_{4 \pi} A(\mathbf{\sigma}) I(\mathbf{\sigma}) \cos [2 \pi \mathbf{D}_\lambda \cdot (\mathbf{s}_0 + \mathbf{\sigma})]\,\mathrm{d}\Omega $$

Here we see an opportunity to use our sine/cosine difference angle formulae again to split this up into sine and cosine components and then use Euler’s formula to put it back together.

Let’s define the complex visibility as $$ \mathcal{V} = \int_{4 \pi} A_N(\mathbf{\sigma}) I(\mathbf{\sigma}) e^{-i 2 \pi \mathbf{D}_\lambda \cdot \mathbf{\sigma}}\,\mathrm{d}\Omega $$ where $A_N$ denotes the normalized antenna power pattern. I hope you agree this looks suspiciously like a Fourier transform.

When an interferometer observes a source, it is sampling the visibility function at these points (corresponding to spatial frequencies of $\pm \mathbf{D}_\lambda$, as projected onto the plane perpendicular to $\mathbf{s}_0$). You can think of this measurement as just recording the real and imaginary values of the visibility function, or alternatively, some visibility amplitude and some visibility phase $$ \mathcal{V} = |\mathcal{V}|e^{i \phi}. $$

3D coordinates

Now that we’ve introduced how the visibility function comes about in a general vector formalism, let’s get concrete with respect to coordinates on Earth and in the sky.

Credit: TMS Fig 3.2. Note that the $l$ and $m$ coordinates technically index the flat image plane tangent to $\mathbf{s_0}$, not curved as they are shown here.

Let’s focus in on the $u,v, w$ coordinate system in the bottom of the figure. The plane is centered at the location of one of the antennas $u=0,v=0, w=0$. We can also draw our baseline vector $\mathbf{D}_\lambda$ in this coordinate system, pointing to the other antenna. The $w$ unit vector is pointing towards phase center $\mathbf{s}_0$ (i.e., the direction of the source).

Another way of saying this is $$ \mathbf{D}_\lambda \cdot \mathbf{s}_0 = w. $$ So, you can think of the $u,v$ plane $w = 0$ as oriented orthogonal to the vector pointing towards phase center. But, N.B. that the baseline vector itself does not necessarily live in the $w = 0$ plane, it can have a non-zero $w$ component.

Using this coordinate system, let’s focus on rewriting the $\mathbf{D}_\lambda \cdot \mathbf{s}$ term. Recall that the dot product between two vectors is

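$$ \mathbf{a} \cdot \mathbf{b} = ||\mathbf{a}|| \; ||\mathbf{b}|| \cos \theta = a_x b_x + a_y b_y + a_z b_z. $$ Writing $\mathbf{D}_\lambda = (u, v, w)$ and the unit vector $\mathbf{s} = (l, m, n)$ in this coordinate system, we therefore have $$ \mathbf{D}_\lambda \cdot \mathbf{s} = \left ( ul + vm + wn \right). $$ Wow, that was simple. Hopefully now you see why it was convenient to use $l, m$ as direction cosines!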

$n$ is the third direction cosine and is w.r.t. the $w$ axis. It is not independent of $l, m$ and can be written in terms of them as $$ n = \sqrt{1 - l^2 - m^2}. $$

So we would normally write $$ \mathbf{D}_\lambda \cdot \mathbf{s} = \left ( ul + vm + w\sqrt{1 - l^2 - m^2} \right). $$

This factor also appears in the solid angle differential. As we move from phase center, the solid angle is changed by a factor $$ d \Omega = \frac{\mathrm{d}l\; \mathrm{d} m}{\sqrt{1 - l^2 - m^2}}. $$ This is adjusting for the fact that the solid angle is something on the celestial sphere, but we are measuring it using the direction cosines on the tangent plane.

With these relationships in hand, we can rewrite the visibility function as $$ \mathcal{V}(u, v, w) = \int_{-\infty}^\infty \int_{-\infty}^\infty A_N(l,m) I(l,m) \exp \left \{ -i 2 \pi \left [ ul + vm + w \left ( \sqrt{1 - l^2 - m^2} - 1 \right )\right ] \right \} \;\frac{\mathrm{d}l\;\mathrm{d}m}{\sqrt{1 - l^2 - m^2}}. $$

The factor in the exponential comes about because angular position is measured with respect to phase center: $\mathbf{D}_\lambda \cdot \mathbf{s} - \mathbf{D}_\lambda \cdot \mathbf{s}_0 = ul + vm + w(n - 1)$, as we saw in the “general coordinates” example.

If all of the measurements could be made with the antennas in a plane normal to the $w$ direction such that $w=0$, then we would turn this equation into an exact 2D transform. But this isn’t usually the case and we need to make approximations.

So long as we are in the small-field regime and $l$ and $m$ are small enough such that the term $$ \left ( \sqrt{1 - l^2 - m^2} - 1 \right ) w $$ can be neglected ($\simeq - \frac{1}{2}(l^2 + m^2)w$ in this regime), then we have $$ \mathcal{V}(u, v, w) \simeq \mathcal{V}(u, v, 0) = \int_{-\infty}^\infty \int_{-\infty}^\infty \frac{A_N(l,m) I(l,m)}{\sqrt{1 - l^2 - m^2}} \exp \left \{ -i 2 \pi \left [ ul + vm \right ] \right \} \;\mathrm{d}l\;\mathrm{d}m. $$
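
To get a feel for when the $w$ term can be neglected, the sketch below evaluates the extra phase $2 \pi w \left( \sqrt{1 - l^2 - m^2} - 1 \right)$ and its small-field approximation $-\pi w (l^2 + m^2)$. The 10 arcsecond offset and the value of $w$ are hypothetical numbers picked only to illustrate the calculation.

```python
import numpy as np

def w_term_phase(l, m, w):
    """Phase (radians) contributed by the w term: exact and small-field forms."""
    exact = 2.0 * np.pi * w * (np.sqrt(1.0 - l**2 - m**2) - 1.0)
    approx = -np.pi * w * (l**2 + m**2)
    return exact, approx

arcsec = np.pi / (180.0 * 3600.0)   # radians per arcsecond
l = m = 10.0 * arcsec               # hypothetical offset from phase center
w = 1.0e5                           # hypothetical w coordinate, in wavelengths

exact, approx = w_term_phase(l, m, w)
print("exact phase (rad) :", exact)
print("approx phase (rad):", approx)
```

If this phase is much smaller than a radian across the whole field of interest, dropping the $w$ term and treating the relationship as a 2D Fourier transform is a reasonable approximation.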

OK! So we’ve arrived at the result that I told you about at the beginning of last week’s class, that the visibility function is the Fourier transform of the sky brightness (modified by the primary beam of each antenna, which we can mostly ignore for this discussion as a constant). The approximation we made for the $w$ term places a limit on the maximum size of the field that we can image (at once). There are approaches designed to overcome this limitation, but at least in the context of this course we will restrict our discussion to cases that don’t require them. This is generally the case for all images made with VLA or ALMA for a single pointing (i.e., imaging the full primary beam). If you use multiple pointings of the array (generally called “mosaicing”) to make an even larger image, you’ll need to take into account the effects of the $w$ term.

Revisiting the “slightly extended source”: A “slightly extended source” is something that is larger than the dirty beam but smaller than the primary beam of each telescope. The instantaneous field of view of an interferometer is the same as the primary beam of each telescope, treated as a single dish (see previous section). Each single dish antenna is still seeing the same thing as before, it’s just that we have a correlator backend that’s doing things with the signals, allowing us to create a synthesized beam that is considerably smaller than the size of the primary beam. For example, at 220 GHz (band 6), ALMA has a primary beam of about 20 arcseconds in diameter. However, it’s common to make synthesized beams with sizes of 0.1 arcseconds or smaller.

Let’s also briefly discuss what the $u,v$ coordinate plane looks like, now that we are in 2D:

Credit: TMS Fig 2.7

Units of $\mathcal{V}$

What are the units of $\mathcal{V}$ itself? We can get at this by looking at the units of $I_\nu(l,m)$ and how we carried out the Fourier transform integral (in its simplified form).

$$ \mathcal{V}(u, v) = \int_{-\infty}^\infty \int_{-\infty}^\infty I(l,m) \exp \left \{ -i 2 \pi \left [ ul + vm \right ] \right \} \;\mathrm{d}l\;\mathrm{d}m. $$

If we parameterized our image using $\mathrm{Jy} / \mathrm{arcsec}^2$ and we integrated over $ \mathrm{d}l\, \mathrm{d}m$ (both assuming they had units of arcsec), then $\mathcal{V}$ must have units of Jy. I.e., you can think of it sort of like the flux being observed at that angular scale. The visibility function is complex-valued, so if you want to discuss the “power” of an image at some angular scale then you should consider $|\mathcal{V}|^2$.
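
As a quick numerical check of the units argument, the sketch below builds a Gaussian image in $\mathrm{Jy}/\mathrm{arcsec}^2$ and evaluates the simplified Fourier integral as a Riemann sum, with $l, m$ in arcsec and $u, v$ in cycles per arcsec. The total flux, the Gaussian width, the grid, and the helper name `visibility` are illustrative assumptions.

```python
import numpy as np

flux = 2.0     # total flux in Jy (hypothetical)
sigma = 0.5    # Gaussian width in arcsec (hypothetical)

l = np.linspace(-5.0, 5.0, 501)    # arcsec
m = np.linspace(-5.0, 5.0, 501)    # arcsec
dl, dm = l[1] - l[0], m[1] - m[0]
ll, mm = np.meshgrid(l, m)

# Surface brightness in Jy / arcsec^2, normalized so it integrates to `flux`
image = flux / (2.0 * np.pi * sigma**2) * np.exp(-0.5 * (ll**2 + mm**2) / sigma**2)

def visibility(u, v):
    """Riemann-sum version of the 2D Fourier integral; u, v in cycles per arcsec."""
    return np.sum(image * np.exp(-2j * np.pi * (u * ll + v * mm))) * dl * dm

v0 = visibility(0.0, 0.0)    # Jy: the zero-spacing visibility
v1 = visibility(0.3, 0.0)    # Jy: the visibility at a finer angular scale
print("V(0, 0)   =", v0)
print("V(0.3, 0) =", v1)
```

The zero-spacing value $\mathcal{V}(0,0)$ recovers the total flux of the image in Jy, and the amplitude drops at larger $u$ because the source is partially resolved on that angular scale.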

Interferometry in Practice

Tue, 19 Sep 2023 00:00:00 +0000

References for today

The Fourier Transform and its Applications by R. Bracewell
Interferometry and Synthesis in Radio Astronomy by Thompson, Moran, and Swenson, particularly Appendix 2.1
Essential Radio Astronomy by James Condon and Scott Ransom

This time

The punchline of today is that the complex-valued visibility function $\mathcal{V}$ is the 2D Fourier transform of an image on the celestial sphere (with a small field of view)

$$ I(l, m) \leftrightharpoons \mathcal{V}(u, v) $$

and it is the visibility function that interferometers measure directly. The values of $u, v$ for which interferometers are able to measure the visibility function depend on how the array of antennas is laid out and the spacing between them.

Most of today’s lecture will follow Chapters 2 and 3 of Interferometry and Synthesis in Radio Astronomy by Thompson, Moran, and Swenson. First, we will introduce a two-element interferometer and its response to a point source. Then, we’ll complexify this a bit to talk about an extended source (but still in 1D). Then, we’ll move on to discuss intensity distributions and the visibility function in the general case and then derive the relationship between $I(l, m) \leftrightharpoons \mathcal{V}(u, v)$.

We’ll first develop the geometry and math of this relationship for small fields of view, so that you understand the result, at least in an abstract manner. Then we’ll spend the latter part of the course working through how a radio interferometer like the VLA or ALMA works to actually sample the visibility function.

R.A. and Dec

Let’s review our coordinates for images on the celestial sphere.

Declination’s the easier one, in my opinion. No matter where you are, if you move in declination, you move along a great circle (i.e., a circle that actually traces the circumference of the celestial sphere). We measure this in terms of 0 (celestial equator) to +90 degrees at the north celestial pole and -90 degrees at the southern celestial pole. You can split a degree into 60 arcminutes and an arcminute into 60 arcseconds.

Credit: Sky and Telescope

Right Ascension is the one that sometimes trips me up. Because the sky rotates, we have this system of marking it using sidereal time: 24 hours of right ascension, with each hour broken up into 60 minutes and each minute into 60 seconds. Although they are still in multiples of 60, these minutes and seconds we use for right ascension do not have the same angular size as arcminutes and arcseconds (even if we are on the celestial equator). For example, let’s say we have two points on the sky.

p1 = '00h42m00s', '+41d12m'
p2 = '00h42m01s', '+41d12m'

Same declination, but their R.A. values differ by one second. Can anyone guess how many arcseconds separate these points? The answer is about 11.3 arcseconds.
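As a quick check of that number, here is a minimal sketch in plain Python. The helper name `ra_seconds_to_arcsec` is made up for this illustration, and the calculation assumes a small offset at fixed declination, so that the separation is just $\Delta \alpha \cos \delta$.

```python
import math

def ra_seconds_to_arcsec(delta_ra_seconds, dec_degrees):
    """Angular separation (arcsec) for a small R.A. offset at fixed declination."""
    # 24 h of R.A. span 360 degrees, so 1 s of R.A. is 15 arcsec on the equator
    arcsec_on_equator = 15.0 * delta_ra_seconds
    # circles of constant declination shrink by cos(dec) away from the equator
    return arcsec_on_equator * math.cos(math.radians(dec_degrees))

dec = 41.0 + 12.0 / 60.0  # +41d12m in decimal degrees
separation = ra_seconds_to_arcsec(1.0, dec)
print(f"{separation:.1f} arcsec")
```

The factor of 15 comes from 360 degrees / 24 hours, and the $\cos \delta$ factor is the same one that appears in $\Delta \alpha \cos \delta$.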

Usually, when we are talking about observing a smaller source (like a protoplanetary disk, or galaxy), we point in some direction towards that object and then define a small little postage stamp, commonly in units of $\Delta \delta $ and $\Delta \alpha \cos \delta$, in which case, the units describing the image are arcseconds. We’re still talking about spherical astronomy though, so this isn’t necessarily limited to small fields of view. We could have $\Delta \delta$ be several degrees, for example.

TODO: Figure: point relative to center, and then Delta directions coming off of it.

Direction cosines

In a moment, we’re going to talk about the mechanics of interferometers observing the celestial sphere and how these relate to Fourier transforms. Before we talk about that, though, there’s a concept I want to introduce while we’re still talking about units for images.

Practically speaking, most images we might make with ALMA or the VLA will have a small field of view (< 1 arcminute). In this regime, it simplifies a lot to talk about “flat” images, i.e., image planes that are tangent to the field center. In 1D, it would look like this

The concept of the direction cosine. Credit: Fig 3.3 TMS

The direction cosines are $$ l = \sin(\Delta \alpha \cos \delta) $$ and $$ m = \sin(\Delta \delta) $$ relative to the phase center. You can see from the figure that it is the direction cosine that is actually tracking the position on the tangent plane, where we are defining our image. I know I just wrote down $\sin$ but I called these direction cosines. The term comes about from the way you’d set up the problem in two dimensions, where you might use cosine of the complementary angle instead, but it’s just a matter of convention.

Because they are outputs of trigonometric functions, $l$ and $m$ are technically unitless measures of linear distance on the tangent plane. For small angular extent, though, they can also be considered to have units of radians. So it will be common that we refer to $$ l = \sin(\Delta \alpha \cos \delta) \approx \Delta \alpha \cos \delta $$ and $$ m = \sin(\Delta \delta) \approx \Delta \delta. $$
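To get a feel for how good the small angle approximation is, here is a short sketch that compares $\sin$ of an offset to the offset itself. The 1 arcminute value is just the field-of-view scale mentioned above, used here as an illustrative offset.

```python
import math

field_of_view_arcsec = 60.0                       # 1 arcminute
angle_rad = math.radians(field_of_view_arcsec / 3600.0)

l = math.sin(angle_rad)                           # exact direction cosine
relative_error = abs(l - angle_rad) / angle_rad   # error made by using l ~ angle
print(relative_error)
```

The relative error is of order $\theta^2/6$, which is tiny for arcminute-scale offsets.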

This probably sounds pointless, since we just arrived back at the same units we started with. Hopefully the reasons why we might wish to use these units will become apparent after we cover more about the interferometer. And remember that you can always exactly convert from direction cosine back to angular units (e.g. $\Delta \alpha \cos \delta$) by doing $\sin^{-1}$; it’s just that for most small angles we’ll be dealing with, this operation is essentially an identity function. All of this goes out the window when we consider wide-field imaging (which we won’t have time to talk about in this course, unfortunately).

Introduction to a 2-element interferometer

Consider this geometric situation

Credit: TMS Fig 2.1

The antennas are spaced directly east-west, and they are observing a source in the far-field, i.e., the radiation from a distant cosmic source appears as a plane wave. First, we will consider the case of a point-source; later we will extend this formalism to spatially extended sources. We’ll assume the primary beam of each antenna is large, such that they can observe radiation from a source located over a wide range of $\theta$ angles. We’ll assume that we’re observing in a narrow slice of frequency around $\nu$, essentially monochromatic.

The wavefront from the source arrives at the right antenna some time $$ \tau_g = \frac{D}{c} \sin \theta $$ before it reaches the left one. This is called the geometric time delay.

Each antenna has its own signal voltage stream: $$ V_1 = \sin 2 \pi \nu t $$ and $$ V_2 = \sin 2 \pi \nu (t - \tau_g). $$

These streams are multiplied together in a correlator and then time-averaged over some interval. The output of the correlator is proportional to $$ F(t, \tau_g) = \sin (2 \pi \nu t) \sin 2 \pi \nu (t - \tau_g), $$ which we can expand using our trig sum identities for sine to $$ F(t, \tau_g) = \sin^2(2 \pi \nu t) \cos(2 \pi \nu \tau_g) - \sin(2 \pi \nu t)\cos(2 \pi \nu t) \sin(2 \pi \nu \tau_g). $$

We can simplify this equation based on our knowledge that the correlator multiplies and then adds (integrates, typically for a few seconds).

The central frequency $\nu$ is on the order of 10s of MHz to nearly a THz
$\theta$ (baked into $\tau_g$) changes at the Earth’s rotation rate, which is roughly $10^{-4}\;\mathrm{rad\,s}^{-1}$.
$D$ must be smaller than $10^7\;$m for terrestrial baselines

This means that $\nu \tau_g$ varies more slowly than $\nu t$ by several orders of magnitude.

So long as our averaging interval is $T \gg 1/\nu$ (which is satisfied by a typical multi-second integration), $$ \langle \sin^2 (2 \pi \nu t) \rangle = 1/2 $$ and $$ \langle \sin(2 \pi \nu t)\cos(2 \pi \nu t) \rangle = 0 $$ so we’re left with $$ F \propto \cos (2 \pi \nu \tau_g). $$ We can also define $l = \sin \theta$ and then we can write $$ F \propto \cos (2 \pi \nu \tau_g) = \cos \left (\frac{2 \pi D l}{\lambda} \right ). $$
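If you’d like to convince yourself of the time-averaging step numerically, here is a minimal NumPy sketch that multiplies the two voltage streams and averages. The frequency and delay values are hypothetical, chosen only so the numbers are easy to handle, and $\tau_g$ is held fixed over the averaging interval.

```python
import numpy as np

nu = 1.0e8        # observing frequency in Hz (hypothetical, 100 MHz)
tau_g = 3.2e-9    # geometric delay in seconds (hypothetical, held fixed)

# average over an integer number of periods, finely sampled
n_periods = 1000
t = np.linspace(0.0, n_periods / nu, 200_000, endpoint=False)

v1 = np.sin(2 * np.pi * nu * t)
v2 = np.sin(2 * np.pi * nu * (t - tau_g))

correlator_output = np.mean(v1 * v2)              # multiply, then average
expected = 0.5 * np.cos(2 * np.pi * nu * tau_g)   # (1/2) cos(2 pi nu tau_g)
print(correlator_output, expected)
```

The factor of 1/2 is the $\langle \sin^2 \rangle$ average, which is why the text writes the result as a proportionality.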

This is called the fringe function of a two-element interferometer.

TODO: draw as a linear relationship vs. $l$, i.e., an oscillating sine wave TODO: then include the fringe plot itself

The fringe function (plotted here as $|F|$) can be thought of as the directional power pattern of the interferometer in the case the antennas are isotropic. Credit: TMS Fig 2.2

So, you see that the 2-element interferometer has a sine/cosine sensitivity to the sky along the east-west axis. It has no sensitivity along the north-south direction.

2-element interferometer for a spatially resolved source

Now we’ll consider a slightly more complex interferometer.

The antennas are (somewhat) directional
They track the source as it moves across the sky, from the rotation of the Earth. This introduces an instrumental time delay

$$ \tau_i = \frac{D}{c} \sin \theta_0. $$ I.e., this keeps the waveforms in sync so long as we’re looking directly at $\theta_0$.

The direction the antennas are pointed is called the phase reference position, which we’ll denote as $\theta_0$ and tracks the source as it rotates across the sky.

TODO: make a figure showing $\Delta \theta$ offset.

And let us consider radiation from a direction $\theta_0 - \Delta \theta$, where $\Delta \theta$ is a small angle. As before, the fringe response term is $$ \cos (2 \pi \nu \tau) = \cos \left \{ 2 \pi \nu \left [ \frac{D}{c} \sin (\theta_0 - \Delta \theta) - \tau_i \right ] \right \}. $$ Using the sine difference formula, simplifying with $\cos \Delta \theta \simeq 1$ for small angles, and noting that cosine is an even function, we have $$ \cos (2 \pi \nu \tau) \simeq \cos \left [ 2 \pi \nu \frac{D}{c} \sin \Delta \theta \cos \theta_0 \right]. $$

Let’s stare at this equation a bit more. Assuming we’re holding observing frequency fixed, the angular resolution of the fringes is determined by the projected length of the baseline orthogonal to the direction of the source, which is $D \cos \theta_0$. This is a pretty ordinary physical measurement of a distance, i.e., we would measure it to be something like 50 meters.

Of course, observing frequency also makes a difference. If we’re observing at higher frequencies (shorter $\lambda$), the fringe resolution is going to be better (this is just another form of the $\lambda/D$ resolution relationship for telescopes showing up).

So, we can define a new variable for this projected baseline length $$ u = \frac{D \cos \theta_0}{\lambda} = \frac{\nu_0 D \cos \theta_0}{c}. $$ $u$ is the number of wavelengths (at that observing wavelength) that are needed to span the projected baseline length. It is measured in multiples of “$\lambda$,” i.e., you might see a baseline length described as $u = $ 300 kilolambda. $u$ is called a spatial frequency.

Now we will redefine our sky coordinate variable $l = \sin \Delta \theta$, and we find that we can write the fringe response as $$ F(l) \propto \cos (2 \pi \nu_0 \tau) \propto \cos (2 \pi u l). $$

If $u$ gets larger (either by moving the antennas further apart and increasing the projected baseline, or by observing at a higher frequency), then the spatial resolution (spacing of the fringes) will get better.

We see that the quantity $(ul)$ appears inside of a trigonometric function, so this means the quantity must be dimensionless.

This motivates the many different ways we can think about these variables in the image plane and the visibility domain.

Unitless

The first way is to recognize that

$l$ is technically unitless, since it is $\sin(\Delta \theta)$
$u$ itself is also technically unitless, it’s just the baseline length measured in a number of wavelengths (e.g., kilolambda)
Inside this $\cos$ term, though, $u$ plays the role of a frequency, i.e., “cycles per unit $l$”

Both of these variables do correspond to actual distances: the interferometer does have some baseline, which corresponds to its ability to resolve a source of some actual size on the sky. It’s just that with this way of thinking about it, we measured both of those sizes using dimensionless units (and that’s OK)!

Unitful

We can put a little bit more sense to this by bringing back the small angle approximation, and saying that because $\Delta \theta$ is small, then $l = \sin \Delta \theta \approx \Delta \theta$, and so it is as if we measured $l$ in some angular unit, like radians or arcseconds.

If $l$ is small, then we would say that it can be measured in radians (which can then be converted to arcseconds)
Then $u$ is measured in cycles/radian or cycles/arcsec

This small angle approximation for the interferometer geometry is what allows us to equate “multiples of lambda” to “cycles per angle” as a spatial frequency.

Let’s walk through an example, and say that we’re observing a

do small angle approximation to say l in units of radians
convert to u
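One way to sketch this kind of conversion in code, reusing the $u = $ 300 kilolambda figure quoted earlier purely as an illustrative number. The function names are made up for this example, and the small angle approximation $l \approx \Delta \theta$ is assumed throughout.

```python
import math

RAD_TO_ARCSEC = 180.0 / math.pi * 3600.0  # about 206265 arcsec per radian

def fringe_period_arcsec(u_lambda):
    """Angular period of the fringe cos(2 pi u l) for u in wavelengths."""
    period_rad = 1.0 / u_lambda      # one cycle in l, using l ~ angle in radians
    return period_rad * RAD_TO_ARCSEC

def u_cycles_per_arcsec(u_lambda):
    """Convert u from 'lambda' (cycles per radian) to cycles per arcsec."""
    return u_lambda / RAD_TO_ARCSEC

u = 300e3  # 300 kilolambda
print(fringe_period_arcsec(u))   # fringe spacing on the sky, in arcsec
print(u_cycles_per_arcsec(u))    # the same spatial frequency, in cycles per arcsec
```

So a 300 kilolambda baseline corresponds to fringes spaced a bit under 0.7 arcseconds apart on the sky.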

TODO: redraw waveform. This makes a lot of sense if you just draw the waveform up on the sky.

Compared to last time, we now assumed some directional sensitivity for each antenna, such as this power pattern

Credit: Tools of Radio Astronomy, Fig 7.1

The fringe pattern modified by the antenna power pattern. Credit: Rick Perley’s slides, NRAO summer synthesis imaging school 2022.

Recap: so now we’ve redefined the fringe function to talk about the response to a spatially resolved source, as a function of projected baseline length. And, we’ve introduced the concept of spatial frequency.

Fourier transform relationship

We just derived and drew the fringe function as the response of the interferometer on the sky and examined how it changes as we change the position of the antennas on the ground. The essential response $R(l)$ of the interferometer to some sky distribution $I(l)$ is to act as a convolution $$ R(l) = \cos(2 \pi u l) * I(l) $$

Another way to look at this is to consider the Fourier pair of the fringe function when the antennas are at a fixed (instantaneous) baseline distance $u_0$, $$ \cos(2 \pi u_0 l) \leftrightharpoons \frac{1}{2} [\delta(u + u_0) + \delta(u - u_0)]. $$

Let us define the visibility function $\mathcal{V}(u)$ as the Fourier transform of the sky brightness distribution (the true one)

$$ I(l) \leftrightharpoons \mathcal{V}(u). $$ $\mathcal{V}(u)$ represents the amplitude and phase of the sinusoidal component of the intensity distribution with spatial frequency $u$ cycles per radian.

Then, we can bring out our old friend the multiplication/convolution theorem. $$ \cos(2 \pi u_0 l) * I(l) \leftrightharpoons \frac{\mathcal{V}(u)}{2} [\delta(u + u_0) + \delta(u - u_0)]. $$

The Fourier Transform II

Wed, 13 Sep 2023 00:00:00 +0000

References for today

The Fourier Transform and its Applications by R. Bracewell
Interferometry and Synthesis in Radio Astronomy by Thompson, Moran, and Swenson, particularly Appendix 2.1
Fourier Analysis and Imaging by R. Bracewell
Wikipedia on Nyquist-Shannon sampling theorem

Review of last time

Defined Fourier transform, inverse $$ F(s) = \int_{-\infty}^{\infty} f(x) e^{-i 2 \pi x s}\,\mathrm{d}x $$
Note that because $x$ and $s$ appear in the argument of the exponential, their product must be dimensionless. This means that they will also have inverse units, e.g., “seconds” and “cycles per second.”
Introduced convolution, impulse symbol, and theorems

Where are we headed today?

Finish up Fourier transform theorems
Nyquist sampling theorem
Discrete Fourier Transform

Continuing from last time: Fourier transform theorems

Convolution/multiplication

The convolution of two functions corresponds to the multiplication of their Fourier transforms. If

$$ f(x) \leftrightharpoons F(s) $$

and

$$ g(x) \leftrightharpoons G(s) $$

then

$$ f(x) * g(x) \leftrightharpoons F(s)G(s). $$

This is an extremely useful theorem. At least in my career, this, and concepts related to sampling, have been the ones I have used the most often. You may have already used this theorem (numerically) if you’ve ever carried out a convolution operation using scipy.signal.fftconvolve in Python, which can be dramatically faster than directly implementing the convolution, at least for certain array sizes.
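Here is a minimal NumPy sketch of the theorem in action, comparing a direct convolution against the multiply-in-the-Fourier-domain route. The array sizes and random inputs are arbitrary choices for illustration; the zero-padding to length `len(f) + len(g) - 1` is what makes the (inherently circular) FFT product match the linear convolution.

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.normal(size=64)
g = rng.normal(size=16)

# direct (linear) convolution; output length is len(f) + len(g) - 1
direct = np.convolve(f, g)

# same result via the convolution theorem: zero-pad, multiply spectra, invert
n = len(f) + len(g) - 1
via_fft = np.fft.irfft(np.fft.rfft(f, n) * np.fft.rfft(g, n), n)

print(np.max(np.abs(direct - via_fft)))
```

The two results should agree to floating point precision.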

Rayleigh’s theorem (Parseval’s theorem for Fourier Series)

The amount of energy in a system is the same whether you calculate it in the time domain or in the frequency domain.

The integral of the mod-squared of a function is equal to the integral of the mod-squared of its spectrum

$$ \int_{-\infty}^\infty |f(x)|^2\,\mathrm{d}x = \int_{-\infty}^\infty |F(s)|^2\,\mathrm{d}s. $$

Autocorrelation theorem

The autocorrelation of a function is

$$ f(x) \star f(x) = \int_{-\infty}^\infty f^{*}(u) f(u + x)\,\mathrm{d}u $$

(written with $\star$ to distinguish it from the convolution $*$) and it has the Fourier transform $$ f(x) \star f(x) \leftrightharpoons |F(s)|^2. $$

Thus, the power spectrum is the Fourier transform of the autocorrelation function. It can also be computed directly by taking the “mod-squared” of $F(s)$,

$$ |F|^2 = F F^*. $$
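The discrete analogue of this relationship is easy to check numerically. The sketch below assumes a circular (wrap-around) autocorrelation, which is the version that pairs exactly with the DFT; the random complex test signal is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
f = rng.normal(size=32) + 1j * rng.normal(size=32)
N = len(f)

# circular autocorrelation: sum over u of conj(f[u]) * f[u + x], indices wrapping
autocorr = np.array([np.sum(np.conj(f) * np.roll(f, -x)) for x in range(N)])

# power spectrum computed directly as the mod-squared of the transform
power_spectrum = np.abs(np.fft.fft(f)) ** 2

print(np.allclose(np.fft.fft(autocorr), power_spectrum))
```

Both routes give the same real-valued power spectrum.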

If you’ve ever worked with (stationary) Gaussian processes (e.g., squared-exponential, Matérn, etc…), you might recognize this relationship between the autocorrelation (the kernel function) and the power spectrum of the Gaussian process.

The derivative theorem

If

$$ f(x) \leftrightharpoons F(s) $$

then

$$ f^\prime(x) \leftrightharpoons i 2 \pi s F(s). $$

Using the transform pairs

Now that we have a few basic transform pairs, and some of the transform theorems, you can mix and match these to build up a library of new transform pairs. You will explore this in the problem set.

Definite integral

The zero-frequency value of a Fourier transform is equal to the definite integral of the function over all space

$$ \int_{-\infty}^\infty f(x)\,\mathrm{d}x = F(0). $$

I.e., to compute the area under the curve, you can just read off the zero-frequency value of the Fourier transform.

Smoothness and compactness

In general, the smoother a function is in the time domain, the more compact its Fourier transform will be in the frequency domain.

Smoother functions will have a larger number of continuous derivatives. Something like the Gaussian envelope $\exp(-\pi x^2)$ is “as smooth as possible” and therefore its Fourier transform (also a Gaussian envelope) is as compact as possible.

Filters and transfer functions

Time domain

We can say that we have some (electrical) waveform $$ V_1(t) = A \cos (2 \pi f t) $$ which is a single-valued function of time. You can think of this as a voltage time-series or another physical quantity. By definition, the waveform is real.

Let’s put on our electrical/acoustical/mechanical engineering hats for a moment and consider that a filter is a physical system with an input and an output, e.g., something that is transmitting vibrations or oscillations, like our waveform.

How a filter changes the amplitude and phase of a waveform. Credit: Ian Czekala

If we feed our waveform into a linear filter, we get output $$ V_2(t) = B \cos (2 \pi f t + \phi). $$

The output is still a waveform, but its amplitude and its phase have changed. These changes are likely frequency dependent, too.

We can specify the filter by a frequency-dependent quantity $T(f)$ called the transfer function. It is a complex-valued function (having both an amplitude and a phase) and is given by $$ T(f) = \frac{B}{A}e^{i \phi}. $$

Interesting, perhaps, but maybe not immediately obviously useful. Let’s introduce the spectrum and then circle back to the transfer function.

Obtaining $V_2$ using the frequency domain

The spectrum of the waveform is the Fourier transform of $V(t)$, which we’ll call $S(f)$ in this section. We’ve broken slightly from our $f \leftrightharpoons F$ notation, but $V \leftrightharpoons S$ is a classic in the signal processing and electrical engineering fields, so we’ll at least build familiarity with it in this example.

The “spectrum” here is just the Fourier transform quantity, so it can definitely be complex-valued.

The “spectra” that we typically talk about in astrophysics are measurements of the electromagnetic spectrum, and you’ve probably never come across one that’s complex-valued, right? What’s going on here?

Consider the units of a flux measurement $F_\nu$ of the electromagnetic spectrum. In lecture 1, we covered that the cgs units of flux are $$ \mathrm{ergs}\;\mathrm{s}^{-1}\;\mathrm{cm}^{-2}\;\mathrm{Hz}^{-1}. $$

The clue is in the $\mathrm{ergs}\;\mathrm{s}^{-1}$ part, which we could also write in terms of “watts” if we wanted to be strictly S.I. about it. When we are measuring the electromagnetic spectrum, we are actually measuring the power spectral density, $|F(\nu)|^2$. The absolute squared means the quantity $|F(\nu)|^2$ is real-valued, and is the reason why you never hear about measurements of the electromagnetic spectrum containing imaginary values!

In this course, at least, we’ll try to be explicit about which spectrum we’re referring to. When we actually mean power spectrum, we’ll try to call it as such. Otherwise, “spectrum” will refer to a quantity like $S$.

Now, let’s revisit our filter example, where we had input and output $V_1(t)$ and $V_2(t)$ respectively. From our discussion, we also have

$$ V_1 \leftrightharpoons S_1 $$ and $$ V_2 \leftrightharpoons S_2. $$

The transfer function concept is especially useful when we think about the spectrum of the waveforms, because we have $$ S_2(f) = T(f) S_1(f). $$

I.e., the spectrum of the output waveform is simply the spectrum of the input waveform multiplied by the transfer function.

Once you have $S_2$, then you can get $V_2(t)$ from $$ V_2 \leftrightharpoons S_2. $$
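As a concrete sketch of this frequency-domain route, the NumPy snippet below pushes a two-tone waveform through an idealised low-pass transfer function. The sampling rate, tone frequencies, and 50 Hz cutoff are all hypothetical values picked so that the tones fall exactly on DFT frequency bins.

```python
import numpy as np

n, dt = 1000, 1e-3                 # 1 s of data sampled at 1 kHz (hypothetical)
t = np.arange(n) * dt
v1 = np.cos(2 * np.pi * 5 * t) + 0.5 * np.cos(2 * np.pi * 120 * t)

freqs = np.fft.rfftfreq(n, d=dt)   # frequencies in Hz
s1 = np.fft.rfft(v1)               # spectrum of the input

transfer = (freqs < 50.0).astype(float)   # idealised low-pass T(f), hypothetical cutoff
s2 = transfer * s1                        # S_2(f) = T(f) S_1(f)
v2 = np.fft.irfft(s2, n)                  # back to the time domain

print(np.max(np.abs(v2 - np.cos(2 * np.pi * 5 * t))))
```

The output retains the 5 Hz tone and drops the 120 Hz one. A real filter would have a smoother $T(f)$ and a frequency-dependent phase, but the recipe is the same.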

Two examples are low-pass and high-pass filters.

Examples of different filter transfer functions $T(f)$. Credit: Wikipedia/SpinningSpark

Obtaining $V_2$ using the time domain

Now let’s think of digital signal processing, where you want to practically apply a filter to some waveform to produce a new waveform. As we just outlined, you could acquire $V_1(t)$, Fourier transform it to access its spectrum $S_1(f)$, multiply by the transfer function $T(f)$, and then do the inverse Fourier transform to get $V_2(t)$. Is there a way to do this directly in the time domain? What if you don’t have the complete waveform all at once?

The answer is provided by the convolution theorem for Fourier transforms. Since the transfer function is applied via a multiplication in the Fourier domain, we could equivalently carry out the same operation by a convolution in the time domain.

The convolutional kernel would be $$ I(t) \leftrightharpoons T(f) $$ and we’d have $$ V_2(t) = I(t) * V_1(t). $$

To summarize,

Credit Bracewell, Chapter 9.

Determining $I(t)$

If you had some system already in place, and you wanted to determine $I(t)$ experimentally, what is one way you could do it? What waveform could you send the system?

One simple option would be to send $$ V_1(t) = \delta(t), $$ then $$ V_2(t) = I(t) * \delta(t) = I(t). $$
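In discrete form, the same experiment looks like the sketch below. The `black_box_filter` function is a made-up stand-in for the physical system (here a 3-point smoothing filter), and a single unit sample plays the role of $\delta(t)$.

```python
import numpy as np

def black_box_filter(v1):
    """Stand-in for a physical system; here a hypothetical 3-point smoothing filter."""
    kernel = np.array([0.25, 0.5, 0.25])
    return np.convolve(v1, kernel)

impulse = np.zeros(8)
impulse[0] = 1.0                     # discrete stand-in for delta(t)
response = black_box_filter(impulse)
print(response[:3])
```

The output of the system is just its own convolutional kernel, which is why $I(t)$ is usually called the impulse response.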

Nyquist-Shannon sampling

Youtube/SteveBrunton on The Sampling Theorem

Thus far we have been talking about continuous functions. As astrophysicists, though, we’re frequently dealing with discrete data points, which are presumed to be samples of some unknown function. Maybe you’re the one designing the experiment to capture these data points, or maybe you’ve just been handed some dataset.

Say you are given a set of (noiseless) samples that look like this. What do you think the function should look like in between the points? Credit: Ian Czekala

Concisely put, the sampling theorem states that under a certain condition, a function can be completely reconstructed from a set of discrete samples—without information loss. I.e., the set of discrete samples is fully equivalent to having access to the full set of function values. Today, this sampling theorem is known as the Nyquist-Shannon sampling theorem, the Whittaker–Nyquist–Shannon theorem, or simply “the sampling theorem.”

If we were to try to use these data points to actually reconstruct a function, then what sort of constraint would we need to impose on the function? We’d want to place some constraint on its spectrum, i.e., that there are no higher frequency components oscillating around faster than our sampling points.

Before we dive into the derivation of the sampling theorem, let’s first take another look at what can go wrong when you undersample a time series.

Aliasing

If the true signal is given by the solid line, but we undersample it, then the sine-wave we naively reconstruct from the samples would have the wrong frequency. Here we would say that the higher frequency signal has been aliased into the lower frequency range. Credit: Wikipedia/Pluke
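You can reproduce this effect in a few lines. In the sketch below, the sampling rate and signal frequencies are hypothetical: a 9 Hz sine sampled at 10 samples per second yields exactly the same samples as a (sign-flipped) 1 Hz sine.

```python
import numpy as np

sample_rate = 10.0                      # samples per second (hypothetical)
t = np.arange(50) / sample_rate

true_signal = np.sin(2 * np.pi * 9.0 * t)    # 9 Hz, above the 5 Hz Nyquist limit
alias = -np.sin(2 * np.pi * 1.0 * t)         # 1 Hz sine (sign-flipped)

print(np.allclose(true_signal, alias))
```

From the samples alone, there is no way to tell these two signals apart.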

This is also called the stroboscopic effect. There are some nice examples online:

Youtube/SmarterEveryDay.
Stationary helicopter.
Exoplanet transits! Dawson and Fabrycky 2010 find a shorter period for the exoplanet 55 Cnc e, previously confounded by the timing of the RV observations.

Derivation of sampling theorem

Now, let’s use our understanding of Fourier transforms and the sampling/replicating function to develop a precise formulation of the sampling theorem.

Recall that the “shah” function is an infinite series of delta functions spaced a unit dimension apart, and that it is its own Fourier transform $$ \mathrm{shah}(x) \leftrightharpoons \mathrm{shah}(s). $$

Via the similarity theorem, if the delta functions of the shah get closer in the $x$ domain, then they spread out in the Fourier domain, and vice-versa.

We can adjust the spacing of the samples by dilating or shrinking the shah function by some factor. Here, we’ll write this as the sampling interval $\Delta x = \tau$ or the sampling frequency $1/\tau$.

According to the similarity theorem, adjusting the sampling frequency in the $x$ domain has the following effect in the frequency domain $$ \mathrm{shah}(x/\tau) \leftrightharpoons \tau \mathrm{shah}(\tau s). $$

For example, if $\tau = 0.2$, then we have

The shah function is its own Fourier transform. Via the similarity theorem, if we compress the shah function in the time domain (left), we expand it in the Fourier domain (right). Credit: Bracewell, Fig 10.3

Now let’s consider a function and its Fourier transform

A generic function (left) and its Fourier transform (right). We say that this function is ‘band-limited’ because its Fourier transform is 0 for all frequencies above some cutoff frequency $|s| > s_c$. Credit: Bracewell Fig 10.2

As before, we will use multiplication by the shah function to represent sampling of the function.

Left: the sampled version of $f$, which has the Fourier transform on the right. So long as the sampling frequency exceeds twice the cutoff frequency, the Fourier transform ‘islands’ do not overlap (top two rows). Credit: Bracewell Fig 10.3

The non-overlappingness of the ‘islands’ is the key to properly sampling a function, and we’ll see why in a moment when we talk about reconstruction. But first, let’s make a quantitative statement of the sampling theorem (Bracewell):

If $s_c$ is the cutoff frequency defining the band-limited nature of the signal, then so long as the function is sampled at equal intervals not exceeding $\Delta x = 1/(2 s_c)$ then the function is properly sampled, i.e. $$ \frac{1}{\tau} \geq 2 s_c. $$

Restoration of signal kernels

Now let’s talk about how we would actually reconstruct the continuous function from a set of samples. Let’s re-examine our plot of the Fourier domain

Credit: Bracewell Fig 10.3

The “function” on the left is technically not the same (continuous) function that we started with; it is a discrete representation of it. We did just say, though, that if the function was band-limited, then these samples contained all of the same information as if we had access to the full function. So how do we go from these samples back to the full function?

Let’s look at the Fourier side of this plot and compare it to the original Fourier side. The main difference is that this Fourier plot has repeating ‘islands’ at progressively higher frequencies, essentially to infinity. How can we get rid of these higher frequency islands?

The answer is to multiply by a boxcar function in Fourier domain, completely truncating these higher order terms. Then, we can do the inverse Fourier transform and recover the original, continuous function.

What is the analogous operation for the time-domain? This is the same thing as we discussed with the transfer function. Since it was a multiplication in the Fourier domain, it is a convolution in the time domain. And the convolutional kernel is the Fourier transform of the boxcar, which is a sinc function.

So, to exactly reconstruct a band-limited function from a set of samples, we do sinc-interpolation.
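Here is a minimal sketch of sinc-interpolation with NumPy. The test signal, sampling interval, and the helper name `sinc_interpolate` are all made up for illustration. Because we only have a finite number of samples here (rather than the infinite train the theorem assumes), the reconstruction is only approximate, and it is evaluated far from the ends of the record where the truncation matters least.

```python
import numpy as np

def signal(t):
    """Band-limited test signal with highest frequency 2.5 Hz (hypothetical)."""
    return np.cos(2 * np.pi * 1.0 * t) + 0.5 * np.sin(2 * np.pi * 2.5 * t)

tau = 0.1                                  # sampling interval; 1/tau = 10 Hz > 2 * 2.5 Hz
t_samples = np.arange(400) * tau
samples = signal(t_samples)

def sinc_interpolate(t_new, t_samples, samples, tau):
    """Whittaker-Shannon interpolation: sum of sinc kernels centred on each sample."""
    # np.sinc(x) is the normalised sinc, sin(pi x) / (pi x)
    kernel = np.sinc((t_new[:, None] - t_samples[None, :]) / tau)
    return kernel @ samples

t_new = np.linspace(18.0, 22.0, 301)       # stay far from the ends of the finite record
reconstructed = sinc_interpolate(t_new, t_samples, samples, tau)
max_error = np.max(np.abs(reconstructed - signal(t_new)))
print(max_error)
```

The residual error comes from truncating the sinc sum, not from the sampling itself; it shrinks as the record gets longer.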

Credit: DSP related.

Undersampling and aliasing

If we didn’t sample the function at a sufficiently high rate, then we would have overlapping islands. Essentially, the higher frequency components of the Fourier islands are “folded-over” back into the range of frequencies we thought was band-limited, resulting in a corrupted signal.

In an alias, a higher frequency signal is masquerading as a lower-frequency signal.

Credit: Bracewell Fig 10.3

Compressed sensing

You may have heard of “compressed sensing,” which is one of the major signal processing results of the last few decades. The idea is that you can reconstruct a signal using far fewer samples than the Nyquist rate requires, using some dictionary of functional forms or knowledge that the signal is sparse. Under suitable conditions, you can indeed perfectly reconstruct the signal through optimization using the $L_1$ norm. If you don’t want to make the assumption that your signal is sparse, though, it’s a good idea to sample at the Nyquist rate.

Extra: Fourier series

You probably first encountered Fourier series as part of your calculus course and later on as part of a partial differential equations course. Say we have some periodic function $g(x)$ with period $T$ and fundamental frequency $f = 1/T$, then the Fourier series associated with this is $$ a_0 + \sum_{n=1}^\infty (a_n \cos 2 \pi n f x + b_n \sin 2 \pi n f x) $$ where the Fourier coefficients are determined by $$ a_0 = \frac{1}{T} \int_{-T/2}^{T/2} g(x) \,\mathrm{d}x $$

$$ a_n = \frac{2}{T} \int_{-T/2}^{T/2} g(x) \cos 2 \pi n f x \,\mathrm{d}x $$

$$ b_n = \frac{2}{T} \int_{-T/2}^{T/2} g(x) \sin 2 \pi n f x \,\mathrm{d}x $$

i.e., we’ve projected the function onto its basis set of sines and cosines.
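To see the projection at work, here is a small NumPy sketch that recovers the coefficients of a periodic function built from known ones. The period and the test function are hypothetical, and the integrals are approximated by means over a uniform grid spanning one period.

```python
import numpy as np

T = 2.0                      # period (hypothetical)
f = 1.0 / T                  # fundamental frequency
x = np.linspace(-T / 2, T / 2, 1024, endpoint=False)

# hypothetical periodic function with known coefficients
g = 3.0 + 2.0 * np.cos(2 * np.pi * f * x) + 0.5 * np.sin(2 * np.pi * 2 * f * x)

def coefficients(n):
    """Return (a_n, b_n) for n >= 1 by projecting g onto cos and sin at frequency n f."""
    # the mean over one period on a uniform grid approximates (1/T) * integral;
    # the factor of 2 accounts for cos^2 and sin^2 averaging to 1/2
    a_n = 2.0 * np.mean(g * np.cos(2 * np.pi * n * f * x))
    b_n = 2.0 * np.mean(g * np.sin(2 * np.pi * n * f * x))
    return a_n, b_n

a_0 = np.mean(g)
a_1, b_1 = coefficients(1)
a_2, b_2 = coefficients(2)
print(a_0, a_1, b_1, a_2, b_2)
```

The projection picks out each coefficient and returns (numerically) zero for the components that are absent.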

Already, I’m sure you are starting to see the close connection with what we’ve discussed of the Fourier transform. Traditionally, Fourier series are used as a jumping off point for the discussion of the Fourier transform.

In the last lecture, however, we signaled our intention to take the opposite approach, whereby we skipped over Fourier series and started with the idea that Fourier transforms exist because we observe physical systems which exhibit their behavior. Now, let’s unify the discussion and demonstrate the Fourier series as an extreme situation of the Fourier transform.

Fourier transform of sine and cosine

Let’s put together a number of the theorems that we’ve discussed to build up our understanding of what a Fourier series looks like in the Fourier domain.

In the previous lecture we introduced the Fourier transform theorem for a delta function located at the origin

$$ F(s) = \int_{-\infty}^{\infty} \delta(x) \exp (-i 2 \pi x s)\,\mathrm{d}x = 1 $$ which is a constant.

We also introduced the shift theorem, which says $$ f(x - a) \leftrightharpoons \exp(- 2 \pi i a s) F(s). $$

We can couple these together and write a relationship $$ \delta(x - a) \leftrightharpoons \exp(-2 \pi i a s). $$

Now, let’s see if we can use this Fourier pair to derive the Fourier pairs for cosine and sine. In your quantum physics or partial differential equations classes, you probably used the Euler identity to write cosine and sine like $$ \cos 2 \pi a s = \frac{e^{i 2 \pi a s} + e^{-i 2 \pi a s}}{2} $$ and $$ \sin 2 \pi a s = \frac{e^{i 2 \pi a s} - e^{-i 2 \pi a s}}{2i} $$

So we can do a bit of rearranging and arrive at

$$ \cos \pi x \leftrightharpoons \mathrm{even}(s) = \frac{1}{2}\delta \left (s + \frac{1}{2} \right) + \frac{1}{2}\delta \left (s - \frac{1}{2} \right) $$ and $$ \sin \pi x \leftrightharpoons i\,\mathrm{odd}(s) = i \frac{1}{2}\delta \left (s + \frac{1}{2}\right) - i \frac{1}{2}\delta \left (s - \frac{1}{2}\right), $$

where these symbols are the even and odd impulse pairs.

The Fourier transform pairs of cosine and sine as the even and odd impulse pairs, respectively. Credit: Bracewell Fig 6.1

Now we’re well on our way to determining the Fourier spectrum of a Fourier series. The Fourier series is just a sum of sines and cosines at different frequencies. The Fourier transform is a linear operator, so we can just add together the contributions from each component in the Fourier domain $$ a_0 + \sum_{n=1}^\infty (a_n \cos 2 \pi n f x + b_n \sin 2 \pi n f x) \leftrightharpoons a_0 \delta(s) + \sum_{n=1}^\infty \left \{ \frac{a_n}{2} [\delta(s + nf) + \delta(s - nf)] + \frac{i b_n}{2} [\delta(s + nf) - \delta(s - nf)] \right \}. $$

We arrive at the result that the spectrum of a Fourier series is a collection of delta functions whose locations and amplitudes correspond to the frequencies and values of the Fourier coefficients, respectively.

The amplitude of the line spectra corresponding to a Fourier series. Credit: TutorialsPoint

Now that we’ve derived the Fourier spectrum of a Fourier series, we can see at least two reasons why this represents an extreme situation of the Fourier transform:

The input waveform is strictly periodic
The input waveform is infinite in duration

As we talked about in the last lecture, these conditions are violated in the real world. Suppose we limit the sine wave to a finite duration, say by multiplying it by a window function such as a truncated Gaussian (truncated because the Gaussian itself is technically non-zero over an infinite domain). Then we see what happens in the Fourier domain: the delta functions are broadened by convolution with the Fourier transform of the window function.

Left: the dotted line represents a broad window function to eventually make the waveform finite in duration. Right: this has the effect of broadening the delta functions by convolution with the Fourier transform of the window function. Credit: Bracewell Fig 10.12
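To see the broadening numerically, here is a rough sketch (the sine frequency, sample spacing, and window widths are assumed values). It multiplies a sine wave by Gaussian windows of two different widths and measures the RMS width of the resulting peak in the power spectrum. For a Gaussian window of width $\sigma_t$, we would expect a width of about $1/(2 \sqrt{2} \pi \sigma_t)$ in $|F|^2$.

```python
import numpy as np

N = 4096
dt = 0.01                          # sample spacing [s] (assumed)
t = (np.arange(N) - N // 2) * dt   # time axis centred on zero
f_sig = 5.0                        # frequency of the sine wave [Hz] (assumed)
freqs = np.fft.rfftfreq(N, d=dt)

def peak_width(sigma_t):
    """RMS width [Hz] of the spectral peak for a Gaussian window of width sigma_t [s]."""
    windowed = np.sin(2 * np.pi * f_sig * t) * np.exp(-t**2 / (2 * sigma_t**2))
    power = np.abs(np.fft.rfft(windowed))**2
    mean = np.sum(freqs * power) / np.sum(power)
    return np.sqrt(np.sum((freqs - mean)**2 * power) / np.sum(power))

narrow_window = peak_width(1.0)    # short window -> broad spectral peak
broad_window = peak_width(2.0)     # long window  -> narrow spectral peak
print(narrow_window, broad_window, narrow_window / broad_window)
```

Doubling the duration of the window should halve the width of the peak, so a longer observation gives sharper spectral lines.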

The Discrete Fourier transform (DFT)

Now we’ll talk about how we deal with samples of data. We’ll stick with the same example of a function of time. But rather than $t$, having units of seconds, we’ll simply label each data point by an index $m$ which takes on non-negative, integer values like $m = 0, 1, \ldots, N - 1$.

Credit: Bracewell Fig 11.2

The forward discrete Fourier transform (DFT) is $$ F_k = \sum_{m=0}^{N-1} f_m \exp \left ( - 2 \pi i \frac{m k}{N} \right) $$ and we could compute $F_k$ for $k = 0, 1, \ldots, N-1$.

Here, the discrete index variable $k$ has replaced the continuous-frequency variable $s$, just like $m$ replaced the continuous-time variable $t$.

The inverse discrete Fourier transform is $$ f_m = \frac{1}{N} \sum_{k=0}^{N -1} F_k \exp \left ( 2 \pi i \frac{m k}{N}\right). $$ Like the continuous-Fourier transform, one of the differences from the forward is the $+i$ in the exponential. The other is the inclusion of the normalization pre-factor.

Note: depending on whom you talk to, you’ll see a wide variety of conventions as to where the normalization prefactor goes and where the $2 \pi$ lives. The convention presented here is the same one used by the Python/NumPy package and the Julia/AbstractFFTs.jl package, so it should be the one you encounter most frequently.

Like the continuous Fourier transform, if you take the DFT of a set of samples and then take the iDFT of that, you will end up with the original set of samples.
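The two sums can be written out directly as a sanity check. This is a deliberately slow sketch meant only to mirror the equations; the function names `dft` and `idft` and the random test samples are my own choices for illustration.

```python
import numpy as np

def dft(f):
    """Forward DFT written as the explicit sum over m (O(N^2), for illustration only)."""
    N = len(f)
    m = np.arange(N)
    k = m.reshape(-1, 1)          # one row per output frequency index k
    return np.sum(f * np.exp(-2j * np.pi * m * k / N), axis=1)

def idft(F):
    """Inverse DFT: plus sign in the exponent and a 1/N prefactor."""
    N = len(F)
    k = np.arange(N)
    m = k.reshape(-1, 1)          # one row per output sample index m
    return np.sum(F * np.exp(2j * np.pi * m * k / N), axis=1) / N

rng = np.random.default_rng(0)
samples = rng.normal(size=16)

print(np.allclose(dft(samples), np.fft.fft(samples)))   # same convention as NumPy
print(np.allclose(idft(dft(samples)), samples))         # round trip recovers the samples
```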

Units of the DFT

The DFT only knows/assumes that it was fed a set of equally spaced samples $$ f_m = f(x_m) $$ where $$ m = 0, 1, \ldots, N - 1. $$

So, at its most abstract, the DFT takes in a set of $N$ samples spaced $\Delta x$ apart and returns $N$ samples corresponding to the Fourier components. The frequency of each component is given by $k/N$ in units of “cycles per sampling interval.”

I.e., if we had $N = 8$ samples, then the $k=3$ frequency component returned from the DFT would have a frequency of $3/8$ cycles per “the interval between samples.”

On its own, the DFT doesn’t provide any information about what type of variable $x$ is or what the spacing is. But there is hope. We can make this concrete, we just have to be careful. Using our example of a time series with $N=8$ samples $\{f_m\}$, say we know that $\Delta x = 0.1$ seconds, $$ \Delta x = x_{m+1} -x_m $$

The spacing in the frequency domain will be 1/8 cycles per 0.1 seconds, or 1.25 Hz.
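In NumPy, this bookkeeping is handled by `np.fft.fftfreq`, which takes the number of samples and the sample spacing. A small sketch using the numbers from this example:

```python
import numpy as np

N = 8
dx = 0.1                           # sample spacing [s]

freqs = np.fft.fftfreq(N, d=dx)    # frequency of each DFT output element [Hz]
print(freqs)                       # [ 0.    1.25  2.5   3.75 -5.   -3.75 -2.5  -1.25]

df = freqs[1] - freqs[0]           # spacing in the frequency domain: 1 / (N * dx)
print(df)                          # 1.25
```

Note that NumPy reports the upper half of the output indices as negative frequencies, so the $k = 3$ component sits at $3/8$ cycles per $0.1$ seconds, or 3.75 Hz.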

DFT as a matrix operation

Thus far we have just been talking about a “set” of samples. We can also think of the collection of data as a vector $$ \mathbf{f} = \begin{bmatrix} f_0 \\ f_1 \\ \vdots \\ f_{N-1} \\ \end{bmatrix} $$

and the frequency samples as a vector too

$$ \mathbf{F} = \begin{bmatrix} F_0 \\ F_1 \\ \vdots \\ F_{N-1} \\ \end{bmatrix}. $$

We’ve now mentioned that the Fourier transform is a linear operator a few times. Another way of demonstrating this is that we could write the DFT as a matrix multiplication $$ \mathbf{F} = \mathbf{W} \mathbf{f}. $$

If we look back at the definition of the DFT $$ F_k = \sum_{m=0}^{N-1} f_m \exp \left ( - 2 \pi i \frac{m k}{N} \right) $$ hopefully we can see how this might be cast in matrix form. If we define the quantity $$ \omega = e^{- 2 \pi i / N} $$ then we can write $$ \mathbf{W} = \begin{bmatrix} 1 & 1 & 1 & \ldots & 1 \\ 1 & \omega & \omega^2 & \ldots & \omega^{N - 1} \\ 1 & \omega^2 & \omega^4 & \ldots & \omega^{2(N-1)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & \omega^{N-1} & \omega^{2(N-1)} & \ldots & \omega^{(N-1)(N-1)}\\ \end{bmatrix} $$

This formulation can be really useful if you’re into forward modeling using linear models. Because we can write the DFT as a matrix multiplication, it can essentially be just another linear transformation to your linear model. Determining the “best-fit” parameters can still be done analytically. We’ll talk about this a bit more when we get to the lecture on Bayesian inference.

The elements of the DFT matrix represented as samples of complex exponentials. Credit: Wikipedia/Glogger
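Here is a short sketch that builds $\mathbf{W}$ explicitly for a small $N$ and checks that the matrix product reproduces NumPy’s FFT. The choice of $N = 8$ and the test vector are arbitrary.

```python
import numpy as np

N = 8
omega = np.exp(-2j * np.pi / N)

# Element (k, m) of W is omega^(k * m).
exponents = np.outer(np.arange(N), np.arange(N))
W = omega ** exponents

f = np.arange(N, dtype=float) ** 2    # arbitrary test vector
F = W @ f                             # DFT as a matrix-vector product

# With this convention, the inverse matrix is the complex conjugate divided by N.
W_inv = np.conj(W) / N

print(np.allclose(F, np.fft.fft(f)))
print(np.allclose(W_inv @ F, f))
```

Building the full matrix costs $\mathcal{O}(N^2)$ memory and time, so this is useful for small linear models and for building intuition rather than for transforming large arrays.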

Fast Fourier Transform (FFT)

Youtube/SteveBrunton on the Fast Fourier Transform

The Fast Fourier Transform or FFT is an algorithm (class of algorithms) for computing the discrete Fourier transform. Practically speaking, if you’re going to perform a Fourier transform on discrete data with a computer, you will almost certainly use an FFT algorithm, so FFT and DFT end up being synonymous.

If we have a data array of length $N$, the complexity of the DFT is $\mathcal{O}(N^2)$, while the complexity of the FFT is only $\mathcal{O}(N \log N)$. This can make a huge difference in computational time if $N$ is large. Say you have an array of $N = 4096$ datapoints: the operation count drops by a factor of roughly $N / \log_2 N \approx 340$, so you could be looking at a speedup of several hundred.

The development and implementation of FFT packages turned the DFT from something that was too slow for many practical applications into a formidable analysis tool. This really enabled the widespread usage of the Fourier transform in signal manipulation and data analysis. I don’t think anyone would argue with you that strongly if you said that the FFT is the most important algorithm of the last century (see Cornell’s top ten algorithms of the 20th century).

There are many different FFT algorithms out there. The most popular one is the Cooley-Tukey algorithm, which relies on the factorization of a size $N = N_1 N_2$ DFT into $N_1$ smaller DFTs of size $N_2$ in a recursive manner. Therefore there is a (historical) preference towards array sizes that are powers of $2$. However, there are algorithms out there that still give $\mathcal{O}(N \log N)$ even for prime values of $N$, so it’s not much of a constraint in practice.

We don’t have time to go into much more detail about the FFT algorithm here, but hopefully you can see from the structure of the $\mathbf{W}$ matrix that there are plenty of opportunities to make the calculation more efficient by factoring and caching values.

Now, let’s look at some code examples of how we might actually use the FFT in Python.

Jupyter Notebook Slides

The Fourier Transform I

Wed, 13 Sep 2023 00:00:00 +0000

Reference Materials for this lecture:

The Fourier Transform and its Applications by R. Bracewell
Interferometry and Synthesis in Radio Astronomy by Thompson, Moran, and Swenson, particularly Appendix 2.1
Fourier Analysis and Imaging by R. Bracewell

Useful (and entertaining!) introductions to the topic:

Youtube: 3Blue1Brown

Complex Numbers

Wikipedia Reference

A complex number $z$ is one in the form of $z = a + b i$, where $i$ is the imaginary unit. The imaginary unit satisfies the equation $$ i^2 = -1. $$

So a complex number has both real ($a$) and imaginary ($b$) components to it. We can represent this as a plot on the Cartesian plane:

Representation of a complex number on the Cartesian plane. Credit: Wolfkeeper

Alternatively, we can also represent a complex number on the polar plane, using an amplitude $r$ and phase $\varphi$:

Representation of a complex number on the polar plane. Credit: Kan8eDie, based on Wolfkeeper

$$ z = r e^{i \varphi} = r (\cos \varphi + i \sin \varphi). $$

This is closely related to Euler’s formula $$ e^{i x} = \cos x + i \sin x $$ for a complex sinusoid. It’s useful to keep in mind Euler’s identity: $$ e^{i \pi} + 1 = 0. $$

It’s possible to convert from the Cartesian form to polar form and vice-versa: $$ r = |z| = \sqrt{a^2 + b^2} $$ and $$ \varphi = \arg(z) = \arg(a + bi) $$ which is most easily carried out using the arctan2 function, to avoid quadrant ambiguities. $$ \varphi = \mathrm{arctan2}(b, a). $$

You can go from polar back to Cartesian by writing $$ z = r e^{i \varphi} = r (\cos \varphi + i \sin \varphi) $$ and then doing $$ a = \Re(z) = r \cos \varphi $$ and $$ b = \Im(z) = r \sin \varphi. $$
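These conversions are available in the Python standard library. The sketch below uses an arbitrarily chosen number in the second quadrant to show why `atan2` is preferred over a plain arctangent of $b/a$.

```python
import cmath
import math

z = complex(-1.0, 1.0)                 # a = -1, b = 1 (arbitrary example, second quadrant)

r = abs(z)                             # sqrt(a^2 + b^2)
phi = math.atan2(z.imag, z.real)       # arctan2(b, a), quadrant-aware
naive = math.atan(z.imag / z.real)     # plain arctan(b / a) lands in the wrong quadrant

print(r, phi, naive)

# Polar back to Cartesian: a = r cos(phi), b = r sin(phi).
z_back = cmath.rect(r, phi)
print(z_back)

# cmath.polar does both steps at once and returns (r, phi).
print(cmath.polar(z))
```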

The Fourier Transform

References include a mix of

Ch 2 of The Fourier Transform and its Applications by R. Bracewell
Appendix 2.1 of Interferometry and Synthesis in Radio Astronomy by Thompson, Moran, and Swenson

First, we’ll introduce the equations and then explain what’s going on.

Forward transform: $$ F(s) = \int_{-\infty}^{\infty} f(x) e^{-i 2 \pi x s}\,\mathrm{d}x $$ also called the “minus-$i$” transform.

Inverse transform: $$ f(x) = \int_{-\infty}^{\infty} F(s) e^{i 2 \pi x s}\,\mathrm{d}s $$ also called the “plus-$i$” transform.

Note that there are alternate conventions for the Fourier transform pairs, which vary as to whether the $2 \pi$ factor appears in the exponent or as a pre-factor. We prefer the notation we’ve provided here because we find it much easier to keep track of $2\pi$ factors.

Successive transforms:

We can show that $$ f(x) = \int_{-\infty}^{\infty} \left [ \int_{-\infty}^{\infty} f(x^\prime) e^{-i 2 \pi x^\prime s}\,\mathrm{d}x^\prime \right ] e^{i 2 \pi x s}\,\mathrm{d}s, $$ i.e., successive transformations yield back the original function. We don’t have time to walk through the proof, but it’s available in TMS Section A2.1.

Therefore, we have $$ f(x) \leftrightharpoons F(s), $$ where the $\leftrightharpoons$ denotes a Fourier transform pair. Sometimes you might also see the notation $\leftrightarrow$. Generally, if we have functions like $f$ or $g$, then we denote their Fourier transform pairs with capital letters (e.g., $F$, $G$).

How to think about the domains

The Fourier transform maps domains from $x$ to $s$, where $s$ has units of “cycles per unit of $x$”. E.g., if $x$ is in units of time (seconds), then $s$ has units of cycles/second, commonly known as hertz. The FT can also be applied to spatial distance coordinates (e.g., $x$ in meters) and spatial coordinates on the sky (e.g., $x$ in arcseconds).

So let’s say someone is playing a chord on a musical instrument. If $f(x)$ represents the time-series of pressure values near your ear, then $F(s)$ represents the notes that are being played (we would probably be most interested in something like $|F(s)|^2$ for power as a function of frequency, like a spectrogram).

Musical Spectrogram link

Conditions on existence

Despite the strong evidence from physical systems that the Fourier transform of particular time-series exists (e.g., spectra of waveforms, antenna radiation patterns), there are actually some mathematical functions whose Fourier transforms do not exist.

Physical possibility (in physical systems) is a sufficient condition for the existence of its transform.

Sometimes, though, we consider waveforms like

$\sin(t)$: harmonic wave
$H(t)$: Heaviside step
$\delta(t)$: impulse

Strictly speaking, none of these functions has a Fourier transform. What do we mean that they do not have Fourier transforms? $$ F(s) = \int_{-\infty}^{\infty} f(x) e^{-i 2 \pi x s}\,\mathrm{d}x $$ does not converge for all values of $s$. Consider $\sin(t)$ integrated from $(-\infty, \infty)$… it’ll just keep oscillating about 0.

None of these waveforms are actually physically possible, though, because a waveform $\sin(t)$ would need to have been switched on an infinite time ago, a step function would need to be maintained for an infinite time, and an impulse would need to be infinitely large for an infinitely short time.

So, what are the conditions for the existence of Fourier transforms?

The integral of $|f(x)|$ from $-\infty$ to $\infty$ exists.
Any discontinuities in $f(x)$ are finite.

In a physical circumstance, these conditions would be violated when there is infinite energy.

Transforms in the limit

Though we just outlined functional forms whose Fourier transforms do not strictly exist, we can find a way to think such that the transforms of these functions do exist in a practical sense. This is by considering them in the limit.

Consider a periodic function whose transform would strictly not exist, such as $P(x) = \sin(x) $, since $$ \int_{-\infty}^\infty |P(x)|\,\mathrm{d}x $$ would not converge.

We could modify this function ever so slightly by multiplication with a very broad Gaussian envelope $\exp(-\alpha x^2)$, where $\alpha$ is a small positive number. This modified version does have a transform, because $$ \int_{-\infty}^\infty |e^{-\alpha x^2} P(x)|\,\mathrm{d}x $$ converges. As we let $\alpha \rightarrow 0$, we approach $P$ in the limit.

As $\alpha \rightarrow 0$ the transform may still not exist for all $s$, however, we can still be quite productive with the sequence of transforms that do exist as we approach this limit. Therefore, we can practically use the Fourier transform for all physical systems that we might consider. We’ll revisit this in more detail when we talk about Fourier series, line spectra, and the sampling theorem.

Oddness, Evenness, and Complex Conjugates

An even function $E(x)$ and an odd function $O(x)$, followed by their sum. Credit: The Fourier Transform and Its Applications, Bracewell, Figs 2.2 and 2.3.

Even functions have $$ E(-x) = E(x) $$ and are symmetrical.

Odd functions have $$ O(-x) = -O(x) $$ and are antisymmetrical.

The sum of even and odd functions is, in general, neither even nor odd.

Any function $f(x)$ can be split unambiguously into its odd and even parts, though, where $$ E(x) = \frac{1}{2} \left [ f(x) + f(-x) \right] $$ and $$ O(x) = \frac{1}{2} \left [ f(x) - f(-x) \right] $$ and so we have $$ f(x) = E(x) + O(x), $$ where both $E$ and $O$ are in general complex-valued.
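A minimal sketch of this decomposition on a sampled function follows. The test function (an off-centre Gaussian) and the grid are arbitrary choices; the grid is symmetric about zero so that $f(-x)$ is simply the reversed array.

```python
import numpy as np

x = np.linspace(-5, 5, 1001)       # symmetric grid, so f(-x) is f reversed
f = np.exp(-(x - 1.0)**2)          # assumed test function: neither even nor odd

f_flipped = f[::-1]                # samples of f(-x)
E = 0.5 * (f + f_flipped)          # even part
O = 0.5 * (f - f_flipped)          # odd part

print(np.allclose(E, E[::-1]))     # E(-x) = E(x)
print(np.allclose(O, -O[::-1]))    # O(-x) = -O(x)
print(np.allclose(E + O, f))       # the two parts sum back to f
```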

Evenness and oddness are very useful because we can use these definitions to re-write the Fourier transform as $$ F(s) = 2 \int_0^\infty E(x) \cos(2 \pi x s)\,\mathrm{d}x - 2 i \int_0^\infty O(x) \sin (2 \pi x s)\,\mathrm{d}x. $$

We see that

If a function is even, its transform is even
If a function is odd, its transform is odd

and other results summarized by Bracewell in Chapter 2:

If a function has the characteristics in the left column, then its Fourier transform has the characteristics in the right column. Credit: Bracewell Ch. 2

These relationships are extremely useful for quickly ascertaining the basic nature of the Fourier transform of any function you may encounter, as well as planning how to numerically compute transforms using fft or rfft packages.

Complex conjugates

If we have the complex conjugate of function $f(x)$, denoted by $f^*(x)$, then we have

$$ f^{*}(x) \leftrightharpoons F^{*}(-s) $$

Transforms of some simple functions

Let’s practice by taking the Fourier transforms of some functions.

Boxcar

The rectangle or “boxcar” function $\Pi(x)$ is equal to 1 for $|x| < 1/2$ and 0 for $|x| > 1/2$.

The rectangle, or boxcar function. Credit: Bracewell Ch. 3

Let’s calculate the Fourier transform. The function itself is simple, so this is mainly an exercise in choosing the right limits $$ F(s) = \int_{-\infty}^{\infty} f(x) e^{-i 2 \pi x s}\,\mathrm{d}x $$

$$ F(s) = \int_{-1/2}^{1/2} e^{-i 2 \pi x s}\,\mathrm{d}x $$

We can use Euler’s formula to write $$ F(s) = \int_{-1/2}^{1/2} \left [ \cos(2 \pi x s) - i \sin(2 \pi x s) \right ] \,\mathrm{d}x $$ and see visually that the $\sin$ term cancels itself out over the symmetric limits, or we could have relied upon the fact that $\Pi(x)$ is an even function ($O(x) = 0$) and used $$ F(s) = 2 \int_0^\infty E(x) \cos(2 \pi x s)\,\mathrm{d}x - 2 i \int_0^\infty O(x) \sin (2 \pi x s)\,\mathrm{d}x $$ to yield $$ F(s) = 2 \int_0^{1/2} \cos(2 \pi x s)\,\mathrm{d}x $$

$$ F(s) = 2 \left [ \frac{\sin 2 \pi x s}{2 \pi s} \right ]_0^{1/2} = \frac{\sin \pi s}{\pi s} = \mathrm{sinc}(s) $$

Note that we (and Bracewell) define $$ \mathrm{sinc}(s) = \frac{\sin \pi s}{\pi s}. $$ This is called the normalized sinc function, and (IMO) is the most useful because of its nice Fourier pair relationships. There is also the “unnormalized” sinc function, which doesn’t have the factors of $\pi$ in it, but we won’t use that in this course. The normalized sinc function has the properties that

its peak is equal to 1, i.e., $\mathrm{sinc}(0) = 1$
its “nulls” are located at non-zero integer values of $n$ for $\mathrm{sinc}(n)$
its integral from $-\infty$ to $\infty$ is equal to 1

So we have the Fourier pair: $$ \Pi(x) \leftrightharpoons \mathrm{sinc}(s) $$ The unit rectangle (or boxcar) function has the Fourier transform pair of a normalized sinc function.
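We can check this pair by evaluating the integral numerically. The sketch below uses a simple trapezoid rule over $[-1/2, 1/2]$ (the grid sizes are arbitrary choices) and compares against `np.sinc`, which implements the normalized sinc function.

```python
import numpy as np

x = np.linspace(-0.5, 0.5, 20001)      # the boxcar is 1 on this interval, 0 elsewhere
s = np.linspace(-4, 4, 41)             # frequencies at which to evaluate F(s)
dx = x[1] - x[0]

# Integrand exp(-i 2 pi x s) for every (s, x) combination; shape (len(s), len(x)).
integrand = np.exp(-2j * np.pi * np.outer(s, x))

# Trapezoid rule along the x axis.
F = dx * (integrand.sum(axis=1) - 0.5 * (integrand[:, 0] + integrand[:, -1]))

print(np.max(np.abs(F.real - np.sinc(s))))   # small: the real part follows sinc(s)
print(np.max(np.abs(F.imag)))                # small: the sine term cancels out
```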

This is also the same relationship that we introduced in the first lecture: the far field electric field pattern (sinc) is the Fourier transform of the electric field illuminating the aperture of the telescope (boxcar).

Gaussian

How about the Fourier transform of a Gaussian function? $$ f(x) = \exp \left ( -\frac{x^2}{2 a ^2} \right ) $$

$$ F(s) = \int_{-\infty}^{\infty} \exp \left ( -\frac{x^2}{2 a ^2} \right ) \exp (-i 2 \pi x s)\,\mathrm{d}x $$

$$ F(s) = \int_{-\infty}^{\infty} \exp \left ( -\frac{x^2}{2 a ^2} -i 2 \pi x s \right )\,\mathrm{d}x $$

Usually, when I see something like this, my standard approach is to start browsing books of integrals like Gradshteyn and Ryzhik for ideas about how I might rearrange the integrand and successfully evaluate the integral. I’d usually just go for this. But, in this case, we can actually do something using a trick you probably learned in jr. high school, called completing the square.

Ok, so we’ve got terms in the exponent with $x^2$ and $x$, and an equation that looks like $$ a x^2 + bx + c = 0 $$ which we want to turn into something that looks like $$ a(x + d)^2 + e = 0. $$

The answer is that $$ d = \frac{b}{2a} $$ and $$ e = c - \frac{b^2}{4 a} $$ such that our rearranged exponent becomes $$ -\left [ \frac{x^2}{2 a ^2} + i 2 \pi x s \right ] = -[(x + 2i \pi a^2 s)^2 + 4 \pi^2 a^4 s^2]/2a^2 $$ Why was this useful? Well, the integral is over $x$, so we can pull out all terms that do not depend on $x$, giving us a rearranged integral of $$ F(s) = \exp(-2 \pi^2 a^2 s^2 ) \int_{-\infty}^\infty \exp \left (-\frac{(x + 2 i \pi a^2 s)^2}{2 a^2} \right)\,\mathrm{d}x $$

Here it is helpful to remember your Gaussian integration formulas such that $$ \int_{-\infty}^\infty e^{-a(x + b)^2}\,\mathrm{d}x = \sqrt{\frac{\pi}{a}}. $$

Thus, the integral contributes another factor of $\sqrt{2 \pi} a$ and the final result is

$$ F(s) = \sqrt{2 \pi} a \exp(-2 \pi^2 a^2 s^2 ) $$

What functional form is this? This is another Gaussian, though the normalization and standard deviation are a bit different! So, we see that the Gaussian function is a Fourier transform pair with itself.
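As a numerical sanity check of this result, the sketch below evaluates the Fourier integral as a simple Riemann sum and compares it to the analytic expression. The width $a = 0.7$ and the grids are assumed values; the integration range is cut off at $\pm 10 a$, where the Gaussian is negligible.

```python
import numpy as np

a = 0.7                                    # Gaussian width (assumed value)
x = np.linspace(-10 * a, 10 * a, 4001)     # integration grid, truncated at +/- 10 a
dx = x[1] - x[0]
s = np.linspace(-1.5, 1.5, 61)             # frequencies at which to evaluate F(s)

f = np.exp(-x**2 / (2 * a**2))

# Riemann-sum approximation to the forward ("minus-i") transform.
F_numeric = (f * np.exp(-2j * np.pi * np.outer(s, x))).sum(axis=1) * dx

# Analytic result derived above.
F_analytic = np.sqrt(2 * np.pi) * a * np.exp(-2 * np.pi**2 * a**2 * s**2)

print(np.max(np.abs(F_numeric - F_analytic)))
```

The imaginary part of the numerical result is consistent with zero, as expected for a real and even function.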

My usual approach when dealing with Gaussians is to start with the Fourier duals (Bracewell) $$ e^{- \pi x^2} \leftrightharpoons e^{- \pi s^2} $$ and then use the similarity and shift theorems (discussed later) to morph it into the form needed, picking up any additional prefactors as necessary.

Hopefully now you have a taste of how to compute Fourier transforms. At its most basic, it’s just a matter of setting up and evaluating the integral. For many function forms, you can use integration strategies to arrive at analytic solutions. As a practical matter, in the next lecture we’ll see how we can evaluate this integral numerically.

Convolution

The convolution of two functions $f(x)$ and $g(x)$ is defined as $$ \int_{-\infty}^\infty f(u) g(x - u)\,\mathrm{d}u $$ and is frequently written using the $*$ symbol. The convolution produces a new function, so we have $$ h(x) = f(x) * g(x). $$

It is very useful to think of convolution graphically.

A graphical representation of the convolution of two functions $f(x)$ and $g(x)$. The $g$ function is flipped, shifted to $x$, and then multiplied against $f$. The value of the convolution $h(x)$ is given by the integral of the multiplied product. Credit: Bracewell Ch. 3

In general, convolution by most functions (e.g., boxcars, Gaussians, etc…) results in a smoothing out of high-frequency structure. Credit: Bracewell Ch. 3

Convolution can also be thought of as a superposition of characteristic contributions. I.e., the final function $h(x)$ has grabbed value from nearby regions of $f(x)$, modulated by the envelope of $g(x)$. This paradigm is very useful for understanding interpolation, smoothing, and kernel density estimation (KDE). Credit: Bracewell Ch. 3

Convolution is commutative

$$ f * g = g * f $$

and associative

$$ f * (g * h) = (f * g) * h $$

and distributive over addition

$$ f * (g + h) = f * g + f * h. $$

It is a linear operator, just like the Fourier transform.
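These properties carry over to discrete convolution, which we can use as a stand-in for the integral. A small sketch with `np.convolve` and arbitrary random test arrays:

```python
import numpy as np

rng = np.random.default_rng(42)
f, g, h = (rng.normal(size=20) for _ in range(3))   # arbitrary test sequences

commutative = np.allclose(np.convolve(f, g), np.convolve(g, f))
associative = np.allclose(np.convolve(f, np.convolve(g, h)),
                          np.convolve(np.convolve(f, g), h))
distributive = np.allclose(np.convolve(f, g + h),
                           np.convolve(f, g) + np.convolve(f, h))
print(commutative, associative, distributive)

# Classic example: a boxcar convolved with itself gives a triangle.
box = np.ones(5)
triangle = np.convolve(box, box)
print(triangle)
```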

The impulse symbol

We’ll develop a notation for an intense (unit-area) pulse so brief that the measuring equipment is unable to distinguish it from pulses yet briefer still. You may be quite familiar with this concept as a “delta-function,” especially in the context of quantum physics. Physically speaking, things like “point masses”, “point charges,” and (astrophysically speaking) “point sources” do not physically exist, but they are very useful concepts. The only important attribute of an impulse is how it reacts under integration

$\delta(x) = 0$ for $x \ne 0$
$\int_{-\infty}^\infty \delta(x)\,\mathrm{d}x = 1$

And, there is a close relationship between the impulse symbol and the unit step function $H(x)$ such that $$ \int_{-\infty}^x \delta(x^\prime)\,\mathrm{d}x^\prime = H(x). $$

Another very important property of the impulse function is its “sifting property” (TMS A2.11), such that $$ f(a) = \int_{-\infty}^{\infty} f(x) \delta(x - a)\,\mathrm{d}x, $$ i.e., the integral of function $f(x)$ times a delta-function located at $a$ will give the value of the function evaluated at $a$, $f(a)$.

The Fourier transform of a delta function (centered on 0) is

$$ F(s) = \int_{-\infty}^{\infty} \delta(x) \exp (-i 2 \pi x s)\,\mathrm{d}x = 1. $$ This is yet another important Fourier pair $$ \delta(x) \leftrightharpoons \mathrm{constant\;amplitude\,\forall\, s}. $$

The Sampling or Replicating Symbol “Shah function”

This is an infinite sequence of unit impulses, given by $$ \mathrm{shah}(x) = \sum_{n=-\infty}^\infty \delta(x - n) $$

The replicating function. Credit: Bracewell Fig 5.4

Also sometimes called a “Dirac Comb.” There is a generalization of the sifting property, such that if you multiply a function by a shah, you are effectively sampling it at unit intervals.

You can use it to sample $f(x)$ (by multiplication)

Sampling property of the shah function by multiplication. Credit: Bracewell Fig 5.5

And you can use it to replicate $f(x)$ (by convolution)

Replicating property of the shah function by convolution. Credit: Bracewell Fig 5.6

The unit shah function is also its own Fourier transform $$ \mathrm{shah}(x) \leftrightharpoons \mathrm{shah}(s) $$

Fourier transform theorems and properties: similarity, convolution, multiplication

There are several useful properties of the Fourier transform that you’ll want to familiarize yourself with. See Bracewell Ch. 6 or TMS A2.1.2.

Similarity

If $$ f(x) \leftrightharpoons F(s) $$

then

$$ f(ax) \leftrightharpoons \frac{1}{|a|} F \left (\frac{s}{a} \right). $$

I.e., applied to waveforms and spectra, a compression of the time scale corresponds to an expansion of the frequency scale.
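As a worked sketch of the similarity theorem, we can recover the Gaussian transform starting from the Bracewell pair $e^{-\pi x^2} \leftrightharpoons e^{-\pi s^2}$. To avoid a clash with the scale factor $a$ in the theorem, I’ll write the Gaussian width as $\sigma$ here (a notation choice for this example only). Let $g(x) = e^{-\pi x^2}$, so that $G(s) = e^{-\pi s^2}$, and choose $$ a = \frac{1}{\sqrt{2 \pi}\, \sigma} $$ so that $$ g(a x) = \exp \left ( - \frac{\pi x^2}{2 \pi \sigma^2} \right ) = \exp \left ( - \frac{x^2}{2 \sigma^2} \right ). $$ The similarity theorem then gives $$ \exp \left ( - \frac{x^2}{2 \sigma^2} \right ) \leftrightharpoons \frac{1}{|a|} G \left ( \frac{s}{a} \right ) = \sqrt{2 \pi}\, \sigma \exp \left ( - 2 \pi^2 \sigma^2 s^2 \right ), $$ which has the same form as the result we found by completing the square. A narrow Gaussian (small $\sigma$) has a broad transform, and vice versa.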

Fourier transforms and the Heisenberg uncertainty principle

In a signal-processing sense, this manifests as an inability to precisely specify a signal in both the time and frequency domains at the same time. As you decrease the variance of a function (i.e., make it more concentrated and thus localized) in one domain, you increase the variance of it in the other domain (i.e., make it more extended and thus dispersed).

In quantum mechanics, this same concept is at play in the Heisenberg Uncertainty principle, where the wavefunctions in position and momentum space (whose squared magnitudes give the probability distributions) are related by the Fourier transform. It’s impossible to know both position and momentum precisely.

Shift

If $$ f(x) \leftrightharpoons F(s) $$

then

$$ f(x - a) \leftrightharpoons \exp(- 2 \pi i a s) F(s). $$

If you shift a function, then there are no changes in the amplitude of the Fourier transform, but there are changes to its phase, dependent on $s$. The higher the frequency, the greater the change in phase angle. In radio astronomy, it’s common to hear this expressed as: a translation in the R.A./Dec. plane results in a phase shift in the visibility plane.
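The discrete analogue of the shift theorem is easy to check. The sketch below applies a circular shift with `np.roll` to an arbitrary Gaussian bump (the array length, bump parameters, and shift are assumed values) and compares the FFT of the shifted array to the predicted phase ramp.

```python
import numpy as np

N = 64
m = np.arange(N)
f = np.exp(-0.5 * ((m - 20) / 3.0)**2)    # arbitrary test signal
shift = 5                                 # shift in samples (the "a" in the theorem)

F = np.fft.fft(f)
F_shifted = np.fft.fft(np.roll(f, shift))   # np.roll gives g[m] = f[m - shift], circularly

s = np.fft.fftfreq(N)                       # frequencies in cycles per sample
predicted = np.exp(-2j * np.pi * shift * s) * F

print(np.allclose(F_shifted, predicted))            # phase ramp matches the theorem
print(np.allclose(np.abs(F_shifted), np.abs(F)))    # amplitudes are unchanged
```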

Introduction and Course Overview

Fri, 25 Aug 2023 00:00:00 +0000

References and Resources for this lecture

Full reference information can always be found in the syllabus, under “References Materials.”

Essential Radio Astronomy Ch 1: emission mechanisms, relevant astrophysical objects
Tools of Radio Astronomy: radio windows, units, flux densities
Interferometry and Synthesis in Radio Astronomy: single-dish observations, units, flux densities

Course Overview

Welcome to AS 5003: Contemporary Astrophysics, on radio astronomy and interferometric imaging!

Syllabus overview regarding format and tutorial schedule

Astrophysics at radio wavelengths

Emission mechanisms

Radio synchrotron (continuum), non-thermal emission from relativistic electrons in magnetic fields
Bremsstrahlung (a.k.a. free-free) emission (continuum), thermal emission from ionized gas (H II regions)
Thermal emission from cold (< 100 K) media, like dust (continuum)
Atomic hyperfine splitting (“21-cm” line corresponding to neutral hydrogen)
Molecular emission lines, primarily from rotational transitions (e.g., CO $J=2-1$)

Atmospheric windows for astronomy. Credit: ESA/Hubble (F. Granato) and Essential Radio Astronomy.

The earth’s atmosphere is very transparent in the radio region of the electromagnetic spectrum, especially compared to optical windows. Only towards the microwave region (wavelengths around 1 mm), does the atmospheric transmission start to decline.

Astrophysical objects

In truth, almost everything these days!

quasars
gamma ray burst (GRB) afterglows
fast radio bursts (FRBs)
pulsars
supernova remnants
cosmic microwave background (CMB)
galaxies, including molecules in high-redshift sources
dust/interstellar medium
the Sun
planets (e.g., Jupiter, Uranus)
protoplanetary disks
molecular clouds (molecular emission)
black hole accretion disks (EHT)

Radio jets from the elliptical galaxy Hercules A (overlaid with an optical image from Hubble). Karl Jansky VLA. Credit: NASA, ESA, S. Baum and C. O’Dea (RIT), R. Perley and W. Cotton (NRAO/AUI/NSF), and the Hubble Heritage Team (STScI/AURA).

The protoplanetary disk around HL Tau, imaged using the Atacama Large Millimeter Array. Credit: ALMA(ESO/NAOJ/NRAO); C. Brogan, B. Saxton (NRAO/AUI/NSF)

The rings of Uranus seen by ALMA (thermal emission from 77 K material). Credit: ALMA (ESO/NAOJ/NRAO); E. Molter and I. de Pater.

Single-dish radio telescopes

In this section, we’ll cover existing single-dish radio telescopes and refresh our memory of basic telescope performance.

The same $$ \theta \approx \frac{\lambda}{D} $$ applies for radio antennas. Take a $\lambda = 1\;\mathrm{cm}$ observation, for example. Compared to an optical $\lambda = 500\;\mathrm{nm}$ telescope the same size, the resolution will be a factor of $$ \frac{1\;\mathrm{cm}}{500\;\mathrm{nm}} = 20,000 $$ worse. Yikes!
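To put numbers on this, here is a back-of-the-envelope sketch. The 100 m dish diameter is an assumed value for illustration (it cancels in the ratio), and the helper function name is my own.

```python
import math

ARCSEC_PER_RAD = math.degrees(1.0) * 3600.0

def diffraction_limit_arcsec(wavelength_m, diameter_m):
    """Approximate resolution theta ~ lambda / D, converted to arcseconds."""
    return (wavelength_m / diameter_m) * ARCSEC_PER_RAD

D = 100.0                                               # dish diameter [m] (assumed)
theta_radio = diffraction_limit_arcsec(1e-2, D)         # lambda = 1 cm
theta_optical = diffraction_limit_arcsec(500e-9, D)     # lambda = 500 nm

print(theta_radio)                    # roughly 20.6 arcseconds
print(theta_radio / theta_optical)    # factor of 20,000
```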

Radio astronomers are constantly trying to find ways to increase angular (spatial) resolution.

One way is to build bigger telescopes, such as the Green Bank Telescope (100m diameter)

The Green Bank Telescope (100m diameter) operates at radio wavelengths. Credit: NRAO/AUI/NSF

Another way is to work at higher frequencies (shorter wavelengths), e.g. sub-mm radio astronomy (IRAM 30m telescope)

The IRAM 30m diameter telescope, which operates at sub-mm wavelengths. Credit: Wikipedia/IRAM-gre

A final way is to use interferometry, sometimes also at sub-mm wavelengths (ALMA), which will be the main focus of this course

It’s easier to build larger telescopes at longer wavelengths because the tolerances required for the reflecting surface are less strict than at optical wavelengths. Though typically one must keep surface tolerances to within $$ \sigma = \frac{\lambda_\mathrm{min}}{16}, $$ otherwise the efficiency of the antenna starts to decline substantially. For the 100 m diameter GBT operating at its highest frequency (100 GHz, or a wavelength of 3 mm), this translates to roughly $200\;\mu\mathrm{m}$, which is the thickness of two sheets of paper! That’s quite an engineering challenge, and is the reason why large, steerable dishes are difficult to build.
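The arithmetic behind that tolerance, as a short sketch (the function name is hypothetical, and the factor of 16 is the rule of thumb quoted above):

```python
C_M_PER_S = 299_792_458.0     # speed of light [m/s]

def surface_tolerance_m(max_freq_hz, fraction=16.0):
    """Surface tolerance sigma = lambda_min / 16 for a given maximum observing frequency."""
    wavelength_min = C_M_PER_S / max_freq_hz
    return wavelength_min / fraction

tolerance_um = surface_tolerance_m(100e9) * 1e6    # GBT at 100 GHz, in microns
print(tolerance_um)                                # a little under 200 microns
```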

Keeping telescopes fixed is one way to build a little bit bigger, such as FAST, which is a five hundred meter diameter fixed telescope in China. See also Arecibo, which unfortunately collapsed in December 2020. Eventually, though, the materials/engineering cost of building large single dish telescopes becomes prohibitive.

Single dish observations

The “beam” of a receiving antenna, or power pattern as a function of direction, can be calculated using the reciprocity theorems for transmitting and receiving antennas. These state that the far field electric field pattern $f(l)$ is the Fourier transform (much more in lectures 2 and 3!) of the electric field illuminating the aperture of the telescope $g(u)$.

A schematic illustration of (top): Uniformly illuminated aperture (middle): The electric field pattern of the antenna, as a function of direction (bottom): The power pattern of the antenna, as a function of direction. Credit: Essential Radio Astronomy

For large apertures, the nulls at $ l = \pm 1, 2, \ldots$ appear at the angles $\theta \approx \lambda/D, 2 \lambda/D, \ldots$. In two dimensions, for a circular aperture, this is an Airy pattern.

A beam power pattern plotted in polar coordinates, demonstrating that the antenna can pick up power from sidelobes at a range of angles. Credit: Tools of Radio Astronomy.

This is an idealized representation, but is still helpful. The beam can pick up power through sidelobes at a range of angles. Directional antennas help concentrate power in the main beam, but antennas with secondary stages (and thus supports, like most telescopes) create opportunities for ground radiation to reflect into the receiver.

You can think of single-dish telescopes (unless they have a sophisticated, multi-pixel receiver) essentially as single-pixel devices. So to make a map of the sky, you would need to raster scan the telescope across the region of interest, reading out antenna temperature as a function of RA, Dec. To make a good (i.e., scientifically accurate) map, you should focus on Nyquist sampling the sky to a uniform sensitivity, usually through a hexagonal pattern of dithering. More advanced instruments may have an array of “feeds” in a focal plane (mirroring a set of “pixels”), but this is still a small number of pixels compared to a typical CCD (e.g., 25 or 36 compared to $2048^2$).

Image Units

Radiative transfer recap

What are the units of the sky, and the images we make as representations of it?

One of the most useful quantities from radiative transfer is $I_\nu$.

$I_\nu $ is the specific intensity of radiation, you can think of it as the energy carried along by an infinitesimal “bundle” of rays.

It has dimensions of: $$ \mathrm{energy}\; (\mathrm{time})^{-1} \;(\mathrm{area})^{-1} \;(\mathrm{solid\,angle})^{-1} \; (\mathrm{frequency})^{-1} $$ in CGS units, we would write $$ \mathrm{ergs}\;\mathrm{s}^{-1}\;\mathrm{cm}^{-2}\;\mathrm{ster}^{-1}\;\mathrm{Hz}^{-1} $$ In astronomical settings, I’ve always seen $I_\nu$ referred to as the “specific intensity.” In non-astronomy settings, I’ve seen “spectral intensity.” If $I_\nu$ is integrated over all frequencies, it’s called the radiant intensity $I$.

$I_\nu $ can be a little mind-bending to think about… it can be a function of

3D space $\vec{x}$
direction $\vec{\Omega}$
frequency

Intensity itself is not a vector quantity; rather it is a scalar field that is a function of position and direction $I_\nu(\vec{x}, \vec{\Omega})$. Rybicki and Lightman write the angular direction vector as $\vec{\Omega}$ and the solid angle surrounding that vector as $\mathrm{d}\Omega$.

The geometry surrounding the concept of specific intensity. The normal vector is $\vec{\Omega}$, the position $\vec{x}$ in 3D space corresponds to the location of the $dA$ patch. Credit: Radiative Processes, Figure 1.2

If we have a defined reference frame, we would probably write $\vec{\Omega}$ as a vector in spherical coordinates and define the components along $\hat{\phi}$ and $\hat{\theta}$ and let $\mathrm{d}\Omega = d\phi \sin \theta d\theta$.

When we are making astrophysical observations from the Earth, though, we are acquiring images of regions of the celestial sky. So we generally talk of $I_\nu(\alpha, \delta)$, where $\alpha$ and $\delta$ are R.A. and declination offsets from some reference direction, respectively. We’re always looking from the same place (at least compared to the size of the universe), so we don’t worry about specifying position within 3D space. If we went to the Andromeda Galaxy and started mapping the celestial sky from there, though, we would need to. In the end, images have the same units either way because they represent specific intensity. It’s very common to refer to $I_\nu(\alpha, \delta)$ as the surface brightness.

When we’re discussing images of astronomical sources, we’re usually using RA $\alpha$ and Dec $\delta$. A solid angle simply describes an area on the unit sphere (e.g., the sky); the area itself need not be circular.

Flux

Once you’ve defined $I_\nu$, then it’s relatively easy to calculate quantities like energy density, flux, momentum, etc, as integrals of the specific intensity field.

Flux is where you integrate out the angular dependence:

$$ F_\nu = \int I_\nu \cos \theta d \Omega $$ (intensity passing through some differential area $dA$, lowered by the effective angle).

$F_\nu$ has units of $$ \mathrm{ergs}\;\mathrm{s}^{-1}\;\mathrm{cm}^{-2}\;\mathrm{Hz}^{-1} $$ (i.e., angular dependence has been integrated out).
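As a quick numerical check of the flux integral, here is a minimal sketch for a uniform-intensity disk of angular radius $\theta_c$ centred on the surface normal, for which the integral has the closed form $F_\nu = \pi I_\nu \sin^2 \theta_c$. The function name and the numerical values of `I_nu` and `theta_c` are illustrative assumptions, not values from the lecture.

```python
import numpy as np

def flux_uniform_disk(intensity, theta_c, n=100_000):
    """Integrate F = int I cos(theta) dOmega over a uniform disk.

    The disk has angular radius theta_c (radians) and is centred on the
    surface normal, so dOmega = sin(theta) dtheta dphi and the phi
    integral contributes a factor of 2 pi.
    """
    dtheta = theta_c / n
    theta = (np.arange(n) + 0.5) * dtheta  # midpoint rule in theta
    integrand = intensity * np.cos(theta) * np.sin(theta)
    return 2.0 * np.pi * np.sum(integrand) * dtheta

I_nu = 1.0e-15              # erg s^-1 cm^-2 sr^-1 Hz^-1 (arbitrary value)
theta_c = np.deg2rad(0.25)  # angular radius of the disk (arbitrary value)

F_num = flux_uniform_disk(I_nu, theta_c)
F_exact = np.pi * I_nu * np.sin(theta_c) ** 2
F_small = I_nu * np.pi * theta_c**2  # small-angle limit: F ~ I * Omega_S

print(F_num, F_exact, F_small)
```

For sources that are small on the sky, $\cos \theta \approx 1$ and the flux is very nearly the intensity multiplied by the solid angle of the source, which is why the last two numbers agree so closely.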

Most astrophysical sources produce significantly less energy in radio waves compared to higher frequency bands, and so the raw CGS unit can be quite cumbersome. To make this easier, astronomers use a unit called the “Jansky,” which is defined as

$$ 1\,\mathrm{Jy} = 10^{-23}~\mathrm{ergs}\;\mathrm{s}^{-1}\;\mathrm{cm}^{-2}\;\mathrm{Hz}^{-1} $$

The Jansky is a unit of flux.

How do we report specific intensities/surface brightnesses for radio sources, then? We can reintroduce the “per solid angle” to the Jansky, for example $$ \mathrm{Jy}\;\mathrm{arcsec}^{-2}. $$

Later in the course, we’ll talk about $\mathrm{Jy}\;\mathrm{beam}^{-1}$, which is another unit of surface brightness/specific intensity.

Other units of surface brightness that you might encounter at other wavelengths include $\mathrm{mag}\;\mathrm{arcsec}^{-2}$ (optical) and $\mathrm{MJy}\;\mathrm{sr}^{-1}$ (infrared).
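Converting between these surface brightness units only requires the definition of the Jansky and the number of square arcseconds in a steradian. The following is a small sketch using plain Python; the helper names are hypothetical and chosen only for this illustration.

```python
import math

JY_CGS = 1.0e-23  # 1 Jy in erg s^-1 cm^-2 Hz^-1
ARCSEC_PER_RAD = 180.0 * 3600.0 / math.pi
ARCSEC2_PER_SR = ARCSEC_PER_RAD**2  # square arcseconds in one steradian

def jy_arcsec2_to_mjy_sr(sb):
    """Surface brightness: Jy arcsec^-2 -> MJy sr^-1."""
    return sb * ARCSEC2_PER_SR / 1.0e6

def jy_arcsec2_to_cgs(sb):
    """Surface brightness: Jy arcsec^-2 -> erg s^-1 cm^-2 sr^-1 Hz^-1."""
    return sb * JY_CGS * ARCSEC2_PER_SR

def flux_from_uniform_sb(sb, area_arcsec2):
    """Flux [Jy] of a patch with uniform surface brightness [Jy arcsec^-2]."""
    return sb * area_arcsec2

print(jy_arcsec2_to_mjy_sr(1.0))        # MJy / sr
print(jy_arcsec2_to_cgs(1.0))           # CGS specific intensity
print(flux_from_uniform_sb(0.02, 4.0))  # Jy
```

One steradian is about $4.25 \times 10^{10}$ square arcseconds, so $1\;\mathrm{Jy}\;\mathrm{arcsec}^{-2}$ corresponds to roughly $4.25 \times 10^{4}\;\mathrm{MJy}\;\mathrm{sr}^{-1}$.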

Questions for review:

What is the name of $I_\nu$, and what are its units?
What is the name of $F_\nu$, and what are its units?
If we made an astronomical observation of a “point source,” would we report $I_\nu$ or $F_\nu$?
What about for a spatially resolved source?
Is a Jansky a unit for $I_\nu$ or $F_\nu$?

Using $F_\nu$ and $I_\nu$ to represent point sources and spatially resolved sources, respectively. Credit: Ian Czekala

The many temperatures of radio astronomy

From Cosmic Sources only

brightness temperature: $T_B$, the temperature such that a blackbody would emit with the observed specific intensity. N.B.: One needs to be very careful of the context when using brightness temperature. Classic radio astronomers will define brightness temperature using the Rayleigh-Jeans definition (Essential Radio Astronomy by Condon and Ransom, Tools of Radio Astronomy by Wilson et al.) $$ T_B \equiv \frac{c^2}{2 k} \frac{1}{\nu^2} I_\nu $$ even when the Rayleigh-Jeans version of the blackbody formula is not valid for the given temperature and observing frequency. This may not be physically accurate, but has the advantage that intensity and brightness temperature are always linearly related (i.e., a 20% increase in $T_B$ corresponds to a 20% increase in $I_\nu$).

Other, more physics-minded resources will define brightness temperature using the full form of the Planck formula (Physics of the Interstellar and Intergalactic Medium, Draine), $$ I_\nu = \frac{2 h \nu^3}{c^2} \frac{1}{\exp{(h \nu / k T_B)} - 1} $$ such that $$ T_B \equiv \frac{h \nu / k}{\ln\left[1 + 2 h \nu^3 / (c^2 I_\nu)\right]}. $$ If the source is in thermal equilibrium, this form has the advantage that $T_B$ does correspond to a physical temperature. Draine uses the term antenna temperature to refer to brightness temperature arrived at using the Rayleigh-Jeans definition.

Unfortunately, this ambiguity makes communicating using brightness temperature (a great concept!) confusing and error-prone. If there is any doubt about whether the Rayleigh-Jeans approximation is valid (i.e., $h \nu \ll k T_B$) for all observed regions in your field of view, then you should state which form of the blackbody formula you are using in your figure/article.

antenna temperature: if you are in a radio astronomy context, you will also come across this terminology, which is functionally the same as the classic radio astronomer’s definition of brightness temperature. The key point is that specific intensity is linearly related to antenna temperature, which makes it easy to substitute one for the other using the Rayleigh-Jeans approximation $$ T_A(\nu) = \frac{c^2}{2 k \nu^2} I_\nu $$ Typically, antenna temperature is used to describe a measured temperature from the instrument in the broader context of other noise sources.
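To see how much the two brightness temperature definitions can differ, the sketch below evaluates both for a blackbody at an assumed physical temperature of 20 K, observed at 1.4 GHz and at 345 GHz. The temperature, the frequencies, and the function names are illustrative choices, not values from the references above.

```python
import numpy as np

# Physical constants in CGS units
H = 6.62607015e-27  # Planck constant [erg s]
K_B = 1.380649e-16  # Boltzmann constant [erg / K]
C = 2.99792458e10   # speed of light [cm / s]

def planck_intensity(nu, temp):
    """Blackbody specific intensity [erg s^-1 cm^-2 sr^-1 Hz^-1]."""
    return 2.0 * H * nu**3 / C**2 / np.expm1(H * nu / (K_B * temp))

def tb_rayleigh_jeans(nu, intensity):
    """Brightness temperature using the Rayleigh-Jeans definition."""
    return C**2 * intensity / (2.0 * K_B * nu**2)

def tb_planck(nu, intensity):
    """Brightness temperature from inverting the full Planck formula."""
    return (H * nu / K_B) / np.log1p(2.0 * H * nu**3 / (C**2 * intensity))

temp_true = 20.0  # K, assumed physical temperature of the blackbody
for nu in (1.4e9, 345e9):
    i_nu = planck_intensity(nu, temp_true)
    print(f"{nu / 1e9:6.1f} GHz: "
          f"RJ T_B = {tb_rayleigh_jeans(nu, i_nu):6.2f} K, "
          f"Planck T_B = {tb_planck(nu, i_nu):6.2f} K")
```

At 1.4 GHz the two definitions agree to well within a tenth of a kelvin, while at 345 GHz the Rayleigh-Jeans value drops to roughly 13 K even though the source is at 20 K. This is the regime where stating your convention matters.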

In the field of radio astronomy, be aware that one frequently combines temperatures in other interesting ways. One can express random noise power in terms of an effective temperature $$ P = k T \Delta \nu $$ where $\Delta \nu$ is the bandwidth of the observation. Here the power is equal to the noise power delivered to a matched load by a resistor at physical temperature $T$. By matched load, we mean we connect a resistor to the input terminals of a linear amplifier. The fact that this resistor has some temperature (i.e., we haven’t cooled it to absolute zero…) means that the thermal motion of the electrons will produce a random, variable current $i(t)$ input to the amplifier. The mean value of this current is zero, but the root mean squared value is non-zero, and this represents a non-zero power. I.e., you can draw (some) power from a resistor at room temperature, purely from thermal motions. The situation is not dissimilar to the random walk of a particle in Brownian motion. For more details, see Tools of Radio Astronomy, Chapter 1.8.
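For a sense of scale, here is a short sketch of the noise power relation $P = k T \Delta \nu$ in SI units. The 300 K resistor and 1 GHz bandwidth are illustrative assumptions.

```python
K_BOLTZMANN = 1.380649e-23  # Boltzmann constant [J / K]

def noise_power_watts(temperature_k, bandwidth_hz):
    """Noise power P = k T dnu delivered to a matched load [W]."""
    return K_BOLTZMANN * temperature_k * bandwidth_hz

def equivalent_temperature(power_w, bandwidth_hz):
    """Invert P = k T dnu: temperature equivalent to a noise power [K]."""
    return power_w / (K_BOLTZMANN * bandwidth_hz)

p_room = noise_power_watts(300.0, 1.0e9)
print(f"P = {p_room:.3e} W")
print(f"T = {equivalent_temperature(p_room, 1.0e9):.1f} K")
```

A room-temperature resistor delivers only a few picowatts over a 1 GHz bandwidth, which is why it is convenient to quote these powers as temperatures instead.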

Including noise sources

Antenna temperature $T_A$: the component of the power received by the antenna from cosmic sources. It has the same interpretation as before (though we’ll talk about beam dilution in a second).
Receiver temperature $T_R$: the component of the power from internal noise of the receiver components themselves, ground radiation, atmospheric emission, etc.
System temperature $T_S = T_A + T_R$: the sum of receiver temperature and antenna temperature. It’s the one power number coming out of the backend of your telescope. It’s up to you to calibrate $T_R$ accurately enough to measure $T_A$.

In any observation, you will have your cosmic signal of interest and several contributions of noise (see Essential Radio Astronomy, Ch. 3.6.1) $$ T_S = T_\mathrm{cmb} + T_\mathrm{rsb} + T_A + [1 - \exp(-\tau_A)] T_\mathrm{atm} + T_\mathrm{spill} + T_r + \ldots $$ such as the CMB, other galactic background sources, the atmosphere, spillover radiation from the ground, the temperature of the radiometer itself (hopefully cryogenically cooled), etc.

In the limit that $T_A \ll T_S$ (most astronomy situations, unfortunately!), we have $$ S/N \approx C \frac{T_A}{T_S} \sqrt{\Delta \nu \Delta t} $$ where $C$ is a constant of proportionality greater than or equal to 1, and $\Delta t$ is the integration time. If we let $\Delta \nu \approx 1\;\mathrm{GHz}$ and $\Delta t \approx 1\;\mathrm{h}$, then we can get $\sqrt{\Delta \nu \Delta t} \approx 10^6$, allowing us to detect a signal which is less than $10^{-6}$ the system noise. A great illustration of this capability is the COBE satellite that studied CMB anisotropies with brightness temperatures $< 10^{-7}$ that of the system temperature. To achieve these contrasts, however, it’s important to keep systematics under control, otherwise the S/N scaling won’t hold!
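The scaling above is easy to evaluate directly. This is a minimal sketch assuming $C = 1$; the antenna and system temperatures in the example are made-up illustrative numbers, and the function names are hypothetical.

```python
import math

def radiometer_snr(t_a, t_sys, bandwidth_hz, t_int_s, const=1.0):
    """S/N ~ C (T_A / T_S) sqrt(dnu dt), valid when T_A << T_S."""
    return const * (t_a / t_sys) * math.sqrt(bandwidth_hz * t_int_s)

def integration_time(snr, t_a, t_sys, bandwidth_hz, const=1.0):
    """Integration time [s] needed to reach a target S/N."""
    return (snr * t_sys / (const * t_a)) ** 2 / bandwidth_hz

# 1 GHz of bandwidth and one hour of integration
gain = math.sqrt(1.0e9 * 3600.0)
print(f"sqrt(dnu dt) = {gain:.3e}")

# Illustrative numbers: a 0.1 mK signal on top of a 50 K system temperature
print(f"S/N = {radiometer_snr(1.0e-4, 50.0, 1.0e9, 3600.0):.2f}")
print(f"t(S/N = 5) = {integration_time(5.0, 1.0e-4, 50.0, 1.0e9):.0f} s")
```

Because S/N grows only as $\sqrt{\Delta t}$, doubling the signal-to-noise ratio requires four times the integration time.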

Beam dilution

In previous lectures, we’ve been talking about specific intensity $I_\nu(\Omega)$ as a “known” quantity of direction (e.g., R.A., Dec.) and used antenna temperature $T_A$ as a linear proxy for its specification. For the following discussion, we’re going to move into the realm of observations, and discuss the ways $T_A$ can be an unfaithful proxy for the “true” specific intensity distribution or brightness temperature. We’ll use the symbol $T_b(\Omega)$ to denote the “true” brightness/antenna temperature, assuming we’re in the Rayleigh-Jeans limit, and redefine $T_A$ to mean the response of the telescope to the cosmic radiation.

When we’re doing observations, we don’t always have access to the highest resolution version of $T_b(\Omega)$, but rather we have access to a quantity which is the true $T_b(\Omega)$ convolved with the beam of the telescope, which is the implication of $T_A$ for this discussion.

When astrophysical sources are insufficiently resolved, our measurements of $I_\nu$ do not trace the true sky distribution. Specifically, features are smeared out over a larger spatial extent and peak intensities are reduced. This means that these insufficiently resolved measurements of $T_A$ will not accurately trace the true underlying temperatures of the astrophysical source (even if the emission from the source is actually thermal in origin). Of course, we never have infinite spatial resolution, so there will always be structure on scales beyond that of our observations. Credit: Ian Czekala

large (fully resolved) source

In the limit that we are observing a source that subtends a solid angle much larger than the beam of the antenna, $$ \Omega_S \gg \Omega_A, $$ the convolution with the beam doesn’t matter: we’re still sensing approximately the same $T_b(\Omega)$, such that $$ T_A(\Omega) \approx T_b(\Omega). $$ If the source is in LTE, then we also have that $T_A \approx T$.

small (unresolved) source

If the source is more compact than the antenna beam, $$ \Omega_S < \Omega_A, $$ then the measured antenna temperature is basically the “true” intensity averaged over the area of the main beam. This lowers the measured antenna temperature by a factor $$ \frac{T_A}{T_b} = \frac{\Omega_S}{\Omega_A} $$ where the ratio $\frac{\Omega_S}{\Omega_A}$ is called the beam filling factor.

For example, you could have a compact source with $T_b = 10^4$ K, but if it only fills 1% of the beam solid angle then you would measure an antenna temperature of 100 K. If you took your observations at face-value (and assumed LTE), then you would incorrectly conclude that the source is 100x cooler than it actually is.
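The arithmetic of this example can be written as a tiny helper. This is only a sketch: the function name is hypothetical, and capping the filling factor at 1 is a crude stand-in for the fully resolved limit, not a proper beam convolution.

```python
def diluted_antenna_temperature(t_b, omega_s, omega_a):
    """Antenna temperature of a uniform source seen by a larger beam.

    t_b     : true brightness temperature of the source [K]
    omega_s : solid angle of the source
    omega_a : solid angle of the antenna beam (same units as omega_s)
    """
    # Assumption: once the source fills the beam, T_A ~ T_b
    filling_factor = min(omega_s / omega_a, 1.0)
    return t_b * filling_factor

t_a = diluted_antenna_temperature(1.0e4, omega_s=0.01, omega_a=1.0)
print(f"T_A = {t_a:.0f} K")  # a 1% filling factor lowers 10^4 K to 100 K
```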

Beam dilution also applies to observations of sources that have structure on spatial scales below the observable limit, which, to be honest, is going to be most astrophysical sources of interest. For example, consider a gas filament in a star-forming region.

Radio-bright, spatially concentrated regions will be “smeared out” by the beam. If you wanted to use antenna temperature (and assume LTE) to estimate the physical conditions of the gas filament, you do so at the peril of measuring incorrect temperatures. The unfortunate reality here is that, without higher resolution images to guide you (which sometimes exist at optical or infrared wavelengths), it’s quite difficult to estimate how badly your measurements are affected by beam dilution.
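The smearing can be illustrated with a one-dimensional toy model: a narrow Gaussian brightness temperature profile convolved with a wider Gaussian beam. The widths, the peak temperature, and the grid are illustrative assumptions. In one dimension, the peak is reduced by the factor $\sigma_\mathrm{src} / \sqrt{\sigma_\mathrm{src}^2 + \sigma_\mathrm{beam}^2}$, which the numerical convolution should reproduce.

```python
import numpy as np

def gaussian(x, sigma):
    """Unit-peak Gaussian profile."""
    return np.exp(-0.5 * (x / sigma) ** 2)

x = np.linspace(-50.0, 50.0, 2001)  # sky offset [arcsec]

sigma_src = 1.0   # intrinsic width of the filament [arcsec]
sigma_beam = 5.0  # width of the telescope beam [arcsec]

t_true = 1.0e4 * gaussian(x, sigma_src)  # "true" T_b profile [K]
beam = gaussian(x, sigma_beam)
beam /= beam.sum()  # normalise so a uniform sky is left unchanged

# What the telescope measures: the true profile convolved with the beam
t_meas = np.convolve(t_true, beam, mode="same")

peak_ratio = t_meas.max() / t_true.max()
expected = sigma_src / np.hypot(sigma_src, sigma_beam)
print(f"peak T_A / peak T_b = {peak_ratio:.3f} (analytic: {expected:.3f})")
```

With these numbers, the measured peak is only about 20% of the true peak, so a temperature inferred from the peak alone would be low by roughly a factor of five.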

For more useful single-dish guidance, see these notes by James Jackson, or Ch 3.1.6 of Essential Radio Astronomy.

Introduction to interferometric arrays, ALMA, VLA, SMA, VLBA

Now that we’ve covered some of the fundamentals around radio telescopes and single-dish antennas, we’ll move on to discussing how we combine the signals from multiple antennas to do interferometry. Here are some of the interferometers operating today:

The Atacama Large (Sub)millimeter Array, an interferometric array of 66 antennas operating at sub-millimeter wavelengths. The largest antennas in the array are only 12m in diameter, yet through interferometry, the array is able to obtain far higher spatial resolution than the largest single-dish antennas. Credit: NRAO/ESO/NAOJ/JAO

The Very Large Array, located in Socorro, NM. Credit: NRAO

One antenna at the Eastern end of the Very Long Baseline Array (VLBA), St. Croix, U.S. Virgin Islands. Credit: Cumulus Clouds

NOEMA, located in the French Alps on the Plateau de Bure. Credit: IRAM/Rebus

The Submillimeter Array (SMA), located on Mauna Kea, Hawaii. Credit: I. Czekala

Correlation and Covariance in ALMA Data

Tue, 06 Dec 2022 00:00:00 +0000

Zoom Slides

Regularized Maximum Likelihood (RML) II

Fri, 21 Oct 2022 00:00:00 +0000

References

MPoL introduction
Machine Learning: A Probabilistic Perspective by Murphy, Chapter 10
Pattern Recognition and Machine Learning by Bishop, Chapter 8
The fourth paper in the 2019 Event Horizon Telescope Collaboration series describing the imaging principles
Maximum entropy image restoration in astronomy (ARA&A) by Narayan and Nityananda 1986
Multi-GPU maximum entropy image synthesis for radio astronomy by Cárcamo et al. 2018
Regularized Maximum Likelihood Techniques for ALMA Observations by Zawadzki, Czekala, et al.
Fitting Very Flexible Models: Linear Regression With Large Numbers of Parameters by Hogg and Villar

Last time

Recap of (parametric) forward modeling in a Bayesian context
Recap of the CLEAN procedural image deconvolution algorithm
Introduction of RML process as a non-parametric model
Discussion of regularization, in the context of priors
Discussion of loss function space (defined by probability distribution) vs. the optimization engineering that helps you navigate it

Today

Overarching question—how do you assess whether something is good? Forays into Machine Learning
Deeper dive into future RML applications and opportunities

Model comparison

$$ \lambda \sum_{i=0}^{N-1}|a_i|^2 $$

This math is fully equivalent to the RML imaging problem we introduced last week, and it also raises the same problem: how do we set the regularizer strength? What is the best choice?

Cross validation

Useful thoughts from https://biometry.github.io/APES/LectureNotes/2017-Resampling/CrossValidationLecture.html

The idea is to test the predictive power of your model. In this case, the model would be your setup of your. In the RML case, the model would be the settings of your image pixelization,

If we have the right model, we will generalize perfectly to new data. The problem is that our training data are always limited and will almost always contain some noise.
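As a concrete toy version of this idea, the sketch below uses K-fold cross validation to choose the strength $\lambda$ of an L2 penalty $\lambda \sum_i |a_i|^2$ on the coefficients of a polynomial fit. Everything here is an assumption made for illustration (the synthetic data, the polynomial basis, the grid of $\lambda$ values, and the function names); it is not the MPoL procedure or an RML imaging workflow.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data set: noisy samples of a smooth function
x = np.sort(rng.uniform(-1.0, 1.0, 40))
y = np.sin(3.0 * x) + rng.normal(0.0, 0.2, x.size)

def design_matrix(x, n_coeff):
    """Polynomial basis: columns are x^0, x^1, ..., x^(n_coeff - 1)."""
    return np.vander(x, n_coeff, increasing=True)

def fit_ridge(X, y, lam):
    """Minimise |y - X a|^2 + lam * sum(a_i^2) in closed form."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

def cv_score(x, y, lam, folds, n_coeff=12):
    """Mean squared prediction error on the held-out folds."""
    errors = []
    for test_idx in folds:
        train = np.ones(x.size, dtype=bool)
        train[test_idx] = False
        a = fit_ridge(design_matrix(x[train], n_coeff), y[train], lam)
        pred = design_matrix(x[test_idx], n_coeff) @ a
        errors.append(np.mean((y[test_idx] - pred) ** 2))
    return float(np.mean(errors))

# Use the same 5 folds for every trial value of lambda
folds = np.array_split(rng.permutation(x.size), 5)
lambdas = 10.0 ** np.arange(-8, 3)
scores = [cv_score(x, y, lam, folds) for lam in lambdas]
best_lam = lambdas[int(np.argmin(scores))]

for lam, score in zip(lambdas, scores):
    print(f"lambda = {lam:8.1e}  CV error = {score:.4f}")
print(f"best lambda = {best_lam:.1e}")
```

The held-out error measures predictive power rather than goodness of fit to the training data. A very small $\lambda$ lets the model chase the noise, a very large $\lambda$ suppresses the real signal, and the cross validation score is one way to pick a value in between.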

Regularized Maximum Likelihood (RML) I

Mon, 17 Oct 2022 00:00:00 +0000

References

MPoL introduction
Machine Learning: A Probabilistic Perspective by Murphy, Chapter 10
Pattern Recognition and Machine Learning by Bishop, Chapter 8
The fourth paper in the 2019 Event Horizon Telescope Collaboration series describing the imaging principles
Maximum entropy image restoration in astronomy (ARA&A) by Narayan and Nityananda 1986
Multi-GPU maximum entropy image synthesis for radio astronomy by Cárcamo et al. 2018
Regularized Maximum Likelihood Techniques for ALMA Observations by Zawadzki, Czekala, et al.

Last time

Discussed $u,v$ coverage and sampling (weights)
Introduced the “dirty image” as the inverse Fourier transform of the visibility samples
Introduced the CLEAN image deconvolution procedure

This time

Review parametric vs. non-parametric models
Introduce Regularized Maximum Likelihood (RML) imaging