DSC 155 & Math 182

Winter 2022, Lecture A00 (Dumitriu) MWF 11:00-11:50am

Hidden Data in Random Matrices


Course Information

  Instructional Staff

Name Role E-mail Meetings (Zoom, or in-person)
Ioana Dumitriu Instructor idumitriu@ucsd.edu Lecture:   MWF 11-11:50am;
                  (Zoom or CSB 001)

Office  Hours:   M 3-5pm; Thu 6-7pm
                             (Zoom or APM 5824)
Haixiao Wang
Teaching Assistant h9wang@ucsd.edu Discussion/Lab:  Thu  2-2:50pm (A01), 3-3:50pm (A02)
                                 (Zoom or APM B402A)

Office Hours:  Tue 3-5pm (combined) 
                           (Zoom or TBA)

Calendar

This is a tentative course outline and might be adjusted during the quarter.

Week Monday Tuesday Wednesday Thursday Friday
1
Jan 3
Lecture 1
OH (Dumitriu)
Jan 4
OH (Wang)
Jan 5
Lecture 2
Jan 6
Lab/Discussion
OH (Dumitriu)
POSTED: Lab 1
Jan 7
Lecture 3
2
Jan 10
Lecture 4
OH (Dumitriu)
POSTED: HW 1
Jan 11
OH (Wang)

Jan 12
Lecture 5
Jan 13
Lab/Discussion
OH (Dumitriu)
Jan 14
Lecture 6
DUE: Lab 1
3
Jan 17
MLK Day, no instruction
Jan 18
OH (Wang)
DUE: HW 1
Jan 19
Lecture 7
Jan 20
Lab/Discussion
OH (Dumitriu)
POSTED: Lab 2
Jan 21
Lecture 8

4
Jan 24
Lecture 9
OH (Dumitriu)
POSTED: HW 2
Jan 25
OH (Wang)
Jan 26
Lecture 10
Jan 27
Lab/Discussion
OH (Dumitriu)
Jan 28
Lecture 11
DUE: Lab 2
5
Jan 31
Lecture 12
OH (Dumitriu)
Feb 1
OH (Wang)
Feb 2
Lecture 13
Feb 3
Lab/Discussion
OH(Dumitriu)
POSTED: Lab 3
DUE: HW 2
Feb 4
Lecture 14
POSTED: Final Project
6
Feb 7
Lecture: REVIEW
OH (Dumitriu)
POSTED: Midterm
Feb 8
OH (Wang)
DUE: Midterm
POSTED: HW 3
Feb 9
Lecture 15
Feb 10
Lab/Discussion
OH (Dumitriu)
Feb 11
Lecture 16
DUE: Lab 3

7
Feb 14
Lecture 17
OH (Dumitriu)
Feb 15
OH (Wang)
DUE: HW 3
Feb 16
Lecture 18
Feb 17
Lab/Discussion
OH (Dumitriu)
POSTED: Lab 4
Feb 18
Lecture 19

8
Feb 21
Presidents' Day,
no instruction

Feb 22
OH (Wang)

Feb 23
Lecture 20
Feb 24
Lab/Discussion
OH (Dumitriu)
POSTED: Lab 5
Feb 25
Lecture 21
DUE: Lab 4
9
Feb 28
Lecture 22
Todd Kemp OH (Dumitriu)
POSTED: HW 4
Mar 1
OH (Wang)
Mar 2
Lecture 23
Mar 3
Lab/Discussion
OH (Dumitriu)
Mar 4
Lecture 24
DUE: Lab 5
10
Mar 7
Lecture 25
OH (Dumitriu)
Mar 8
OH (Wang)
DUE: HW 4
Mar 9
Lecture 26
Mar 10
Lab/Discussion
OH (Dumitriu)
Mar 11
Lecture: REVIEW
DUE: Final Project
11
Mar 14
FINAL EXAM




Top


Lecture Notes


All lecture notes, both "before" and "after" the lecture, will be provided on Canvas. You will be able to see the recorded videos on Canvas, too, both the Zoom ones and the ones that will result from podcasting once we are back in-person.

In addition, Professor Todd Kemp has kindly agreed to allow us to use his notes from the previous iteration of the course. You may find a PDF with his notes on the Canvas "Home" website for the course. While these notes are not required, you might find them helpful.

Top


Syllabus


Here is the very short version of the syllabus; please make sure you read the entire website.

DSC 155 & MATH 182 is a one quarter topics course on the theory of Principal Component Analysis (PCA), a tool from statistics and data analysis that is extremely wide-spread and effective for understanding large, high-dimensional data sets. PCA is a method of projecting a high-dimensional data set into a lower-dimensional affine subspace (of chosen dimension) that best fits the data (in least squares sense), or equivalently maximizes the preserved variance of the original data. It is a computationally efficient algorithm (utilizing effective numerical methods for the singular value decomposition of matrices), and so is often a first-stop for advanced data analytics of big data sets. The algorithm produces a canonical basis for the projected subspace; these vectors are called the principal components. The question of choosing the dimension for the projection is subtle. In virtually all real-world applications, a "cut-off phenomenon" occurs: a small number (usually 2 or 3) of the singular values of the sample covariance matrix for the data account for a large majority of the variance in the data set.

A full (and honestly still developing) theoretical understanding of this "cut-off phenomenon" has only arisen in the last 15 years, using the tools of random matrix theory. The result, known as the BBP Transition (named after Jinho Baik, Gerard Ben Arous, and Sandrine Peche, who discovered it in 2005), explains the phenomenon in terms of analysis of outlier singular values in low-rank perturbations of random covariance matrices. This uses a simple model (standard in signal processing and information theory) of a signal in a noisy channel to tease out exactly when an outlier will appear in PCA analysis of noisy data, and further predicts more subtle effects that current statistical methods are being developed to correct to produce even more accurate data analysis.

The goal of this course is to present and understand the PCA algorithm, and then analyze it to understand how (and when) it works. Time permitting, we will then use these ideas to apply to some current interesting problems in data science and computer science.

There is no textbook that covers this material. It is an experimental course designed by Professor Todd Kemp, whose ideas and materials we will rely on (I have posted a PDF with his Lecture notes for this course on Canvas). Here is a tentative list of topics we intend to cover, in the order they will be covered. (Depending on time, we may have to skip some of the more advanced topics at the end).
  • Introductory material:
    • Review of relevant linear algebra concepts (linear equations; rank and nullity; orthogonal rotations and projections; eignevalues and eigenvectors), and introduction to singular value decomposition.
    • Review of relevant topics from probability (distributions and densities; random vectors, covariance, and correlation).
    • Introduction to basic statistical concepts: unbiased estimators, maximal likelihood estimation, sample covariance, (linear) least squares and regression.
  • Principal Component Analysis (PCA):
    • as best approximating d-dimensional affine subspace projection
    • as highest variance d-dimensional affine subspace projection
    • equivalence of these two characterizations
    • finding the principal components: singular value decomposition
    • complexity and correctness
  • Noise without signal: spectral statistics of totally random data sets:
    • Method of Moments
    • Wigner's semicircle law
    • Marchenko-Pastur laws for Gaussian rectangular matrices
    • Universality of eigenvalue statistics
  • Outlier singular values:
    • Robustness of the bulk singular values
    • Spiked covariance models
    • The BBP transition
  • Application of the BBP transition to understand the generic behavior of PCA
  • Introduction to community detection: the stochastic block model.

Prerequisites:  The prerequisites are Linear Algebra (MATH 18) and Probability Theory (MATH 180A). MATH 109 (Mathematical Reasoning) is also strongly recommended as a prerequisite or co-requisite. Also: MATH 102 (Applied Linear Algebra) would be beneficial, but is not required. For the lab component of the course, some familiarity with Python and MATLAB is helpful, but not required.

Lecture:  Attending the lecture is a fundamental part of the course; you are responsible for material presented in the lecture whether or not it is discussed in the notes. You should expect questions on the homework and exams that will test your understanding of concepts discussed in the lecture.

Homework:  Homework assignments will be posted on Canvas on the dates indicated in the calendar and will be due at 10:59pm on the indicated due date.  You must turn in your homework through Gradescope; if you have produced it on paper, you can scan it or simply take clear photos of it to upload. It is allowed (and even encouraged!) to discuss homework problems with your classmates and your instructor and TA, but your final write up of your homework solutions must be your own work. If you collaborate on the homework, you must list the people you collaborated with on the write up.

Labs:  The data science labs are accessible through DataHub. The turn-in components should be exported as pdf files and turned in through Gradescope; they are due at 10:59pm on the dates indicated on the labs. MAKE SURE YOU ATTEND THE FIRST LAB/DISCUSSION SESSION.

Lab Project:  You will choose a real-world high-dimensional data set, and implement the PCA algorithm to analyze it. You will use the tools explored in this class to give a careful analysis of how the PCA algorithm performed, what it discovered about the data, and what structural shortcomings were evidence in the analysis. Topics and data-sets to be approved by the instructor.

Take-Home Midterm Exam:  There will be a single take-home midterm exam, available immediately after the lecture on Monday, February 7, due the following day before 10:59pm. You are free to use any paper / online resources you like during the exam, but collaboration with other people is not allowed. This will be enforced by the honor-system; be warned, we will grade carefully looking for evidence of collaboration on the exam, and any suspicious cases will be reported as academic integrity violations (with likely severe penalties).

Final Exam:  If in-person, the final exam will be held on March 14, from 11:30am-2:29pm, location TBA. If we are still remote by then, we will make other arrangements.

  • It is your responsibility to ensure that you do not have a schedule conflict involving the final examination; you should not enroll in this class if you cannot take the final examination at its scheduled time.
  • The exam will be open book: you may use the recommended course textbooks, your course notes, and all resources on this course webpage (slide/note transcripts before/after, course notes, and the EVT recording of the final lecture) during the exam. You will not be allowed to use other resources. You may not collaborate or interact with other humans during the exam.

Administrative Links:    Here are two links regarding UC San Diego policies on exams:

Regrade Policy:  

Grading: Your cumulative average will be the best of the following two weighted averages:

Your course grade will be determined by your cumulative average at the end of the quarter, and will be based on the following scale:

A+ A A- B+ B B- C+ C C-
97 93 90 87 83 80 77 73 70

The above scale is guaranteed: for example, if your cumulative average is 80, your final grade will be at least B-. However, your instructor may adjust the above scale to be more generous.

Etiquette

In addition, here are a few of my expectations for etiquette (updated for the era of remote teaching). Naturally, if and when we go back to in-person instruction, these expectations will be modified accordingly.

Accommodations:

Students requesting accommodations for this course due to a disability must provide a current Authorization for Accommodation (AFA) letter issued by the Office for Students with Disabilities (OSD) which is located in University Center 202 behind Center Hall. The AFA letter may be issued by the OSD electronically or in hard-copy; in either case, please make arrangements to discuss your accommodations with me in advance (by the end of Week 2). We will make every effort to arrange for whatever accommodations are stipulated by the OSD. For more information, see here.

Top


Academic Integrity Policies


UC San Diego's code of academic integrity outlines the expected academic honesty of all students and faculty, and details the consequences for academic dishonesty. The main issues are cheating and plagiarism, of course, for which we have a zero-tolerance policy. (Penalties for these offenses always include assignment of a failing grade in the course, and usually involve an administrative penalty, such as suspension or expulsion, as well.) However, academic integrity also includes things like giving credit where credit is due (listing your collaborators on homework assignments, noting books or papers containing information you used in solutions, etc.), and treating your peers respectfully in class.

To ensure that Academic Integrity is upheld during any and all exams (quizzes, midterm, and final), at the beginning of each exam you will be signing a pledge that you will submit alongside your solutions, whereby you will agree to obey all the rules that have been outlined. In addition, you will agree to the following: In case the instructor suspects academic misconduct in completing the exam, you will be invited to defend your solutions during a short Zoom session with the instructor and/or the TA. If you decline, or if you accept but fail to defend your solution(s) to the instructor’s or the TA’s satisfaction, the instructor will refer your case to the Academic Integrity Office for an investigation. If you accept and defend your solution(s) satisfactorily, the case will be closed and you will receive a grade for the exam and for the class.

Top


About Gradescope


We will be using Gradescope for the grading of both homework and exams.
  • Your login is your university email, and your password can be changed here. The same link can be used if you need to originally set your password.
  • Please make sure your files are legible before submitting.
  • Most word processors can save files as a pdf.
  • There are many tools to combine pdfs, such as here, and others for turning jpgs into pdfs, such as here.
  • If you have not yet been added to the course, the Gradescope entry code is RWNK4J . Use your UCSD email!

Top


Homework


There will be a total of 4 homework assignments, which will be posted (with eventual solutions) on Canvas. See the calendar for their posting and due dates. Note that you will need to submit the solutions through Gradescope. Late homework will not be accepted.