The topic this week is the Singular Value Decomposition. In class, I talked about an example using weather data; we’ll walk through that example again here.
The basics
Given a matrix A, A.singular_values() returns the singular values of A, and A.SVD() gives the singular value decomposition. There’s a wrinkle, though: Sage only knows how to compute A.SVD() for certain kinds of matrices, like matrices of floats. So, for instance:
A = matrix([[1,2,3],[4,5,6]])
A.SVD()
gives an error, while
A = matrix(RDF,[[1,2,3],[4,5,6]])
A.SVD()
works fine. (Here, “RDF” stands for “Real Double Field”; this tells Sage we want to think of the entries as double-precision floating-point real numbers.)
The command A.SVD() returns a triple (U,S,V) so that A=USV^T; U and V are orthogonal matrices; and S is a “diagonal” (but not square) matrix. So, the columns of U are left singular vectors of A, and the columns of V are right singular vectors of A.
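As a quick sanity check (using the small RDF example from above), you can verify the factorization and the orthogonality of U and V; the products below should come out essentially equal to A, the 2×2 identity, and the 3×3 identity, up to floating-point error:
A = matrix(RDF,[[1,2,3],[4,5,6]])
U, S, V = A.SVD()
U*S*V.transpose()   # should reproduce A (up to rounding)
U.transpose()*U     # should be (essentially) the 2x2 identity
V.transpose()*V     # should be (essentially) the 3x3 identity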
I find the output of A.SVD() hard to read, so I like to look at its three components separately, with:
A.SVD()[0]
A.SVD()[1]
A.SVD()[2]
As always, you can get some help by typing
A.SVD?
Getting some data
We’ll compute an SVD for some weather data, provided by the National Oceanic and Atmospheric Administration (NOAA). (The main reason I chose this example is that their data is easy to get hold of. Actually, lots of federal government data is accessible. data.gov keeps track of a lot of it, and census.gov has lots of data useful for social sciences.)
To get some data from NOAA, visit the National Climate Data Center at the NOAA, select “I want to search for data at a particular location”, and then select “Search Tool”. Then hack around for a while until you have some interesting data, and download it. This took me several tries (and each try takes a few minutes): lots of the monitoring stations were not reporting the data I wanted, but this wasn’t obvious until I downloaded it. That’s how dealing with real-world data is at the moment; in fact, the NOAA data is much easier to get hold of than most.
You want to download the data as a .csv (comma-separated values) file. Here is the .csv file I used: WeatherData2020.csv. My data is from a monitoring station at Vancouver airport (near Portland), with one entry per day in 2020, recording the amount of precipitation, maximum temperature, minimum temperature, average wind speed, fastest 2 minute wind speed, and fastest 5 second wind speed.
You can open this csv file with Excel or any other spreadsheet. More important for us, Python (and hence Sage) imports csv files fairly easily.
To import the file into Sage, save it to your computer, and then upload it to Sage Math Cloud by clicking the “New” (file) button. Here is code that will import the data into a Sage worksheet, once it’s been uploaded to Sage Math Cloud:
import csv
reader = csv.reader(open('WeatherData2020.csv'))
data = []
for row in reader:
    data.append(row)
(If the filename is something other than WeatherData2020.csv, change the second line accordingly.) The result is a list of lists: each line in the data file (spreadsheet) is a list of entries, and ‘data’ is a list of those lines. Type
data[0]
data[1]
to get a feeling for what the data looks like.
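You can also check how many lines were read in (one header line plus one line per day, if your file looks like mine):
len(data)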
Converting the data to vectors
We want to:
- Extract the numerical data from ‘data’, creating a list of vectors (instead of a list of lists).
- Compute the average of these vectors.
- Subtract the average from the vectors, creating the “re-centered” or “mean-deviation” vectors.
- Turn these re-centered vectors into a matrix A.
- Compute the singular value decomposition of A.
- Interpret the result, at least a little.
Step 1. The first real line of data is data[1]. Let’s set
v1=data[1]; v1
The numerical entries of v1 start at the fourth entry, which is v1[3] (because Sage counts from 0). Also, the last two entries are missing. To get the part of v1 starting from entry 3 and going through the third-to-last entry, type
v1[3:-2]
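If the slice notation is unfamiliar, here is a tiny example with a made-up list (nothing to do with the weather data):
L = ['a','b','c','d','e','f','g']
L[3:-2]   # gives ['d', 'e']: entries 3 and 4, stopping before the last two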
Actually, the second line of data is missing the third-to-last entry, too, and the first line is weirdly missing the average temperature. So let’s throw out those entries and everything for January 1 (for convenience). So, the numerical parts are:
numerical_vectors = [x[3:-3] for x in data[2:]]
(As the notation suggests, this creates a list consisting of x[3:-3] for each line x in data[2:]. This is called a list comprehension (see also Section 3.3 here), and it’s awesome.)
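If list comprehensions are new to you, here is a tiny, unrelated example:
[n^2 for n in range(5)]   # gives [0, 1, 4, 9, 16]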
The list numerical_vectors is still a list of lists. We want to turn it into a list of vectors whose entries are real (floating-point) numbers. A similar command does the trick:
vecs = [vector([float(t) for t in x]) for x in numerical_vectors]
Check that you have what you expect:
data[2]
vecs[0]
Step 2. Computing the average. Note that there are 128 vectors in vecs. (You could ask Sage, with the command len(vecs).)
M = sum(vecs) / 128
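(To avoid hard-coding the 128, you could equivalently write:)
M = sum(vecs) / len(vecs)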
Step 3. Again, this is easiest to do using list comprehension:
cvecs = [v - M for v in vecs]
Step 4. This is now easy:
A = matrix(cvecs)
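As a quick sanity check on the shape (the exact numbers depend on what you downloaded; with my file I expect 128 rows):
A.dimensions()   # (number of days, number of numerical fields)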
Interlude: cleaning the data
The NOAA data is pretty clean — values seem to be pretty reasonable, and in particular of the right type — except that lots of values tend to be missing. Dealing with missing data is a hard problem, which is receiving a lot of attention at the moment. We avoided it by just throwing away the columns (and one row) with missing data.
It used to be that NOAA reported missing fields as -9999 (see the previous edition of this post). This made it easy to get garbage answers out, if you didn’t notice the -9999s and just computed. It seems NOAA noticed that, and improved their data downloads to correctly report missing information as blank.
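(If you would rather find the incomplete rows programmatically than by eyeballing the spreadsheet, a sketch along these lines would do it; here ‘data’ is the list of rows from above, and my assumption that the fields of interest start in column 3 is just for illustration:)
# keep only rows in which every field from column 3 on is non-blank
complete_rows = [row for row in data[1:] if all(entry.strip() != '' for entry in row[3:])]
len(complete_rows)   # how many days have complete records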
SVD and visualizing the result
Now, let’s look at the singular values and vectors.
%display typeset
A.singular_values()
A.SVD()[2]
Notice that there are some pretty big jumps between singular values. The second singular value is a lot smaller than the first, and the fifth is a lot smaller than the fourth. So our data is quite well approximated by a 4-dimensional space, and even just knowing a 1-dimensional projection of this data should give us a reasonable approximation. Let’s make a list of the singular vectors giving these projections, and look at them a bit more.
sing_vecs = A.SVD()[2].columns()
sing_vecs[0]
This gives the first singular vector. I get roughly (0.0021,0.0002,0.0017,−0.0027,0.0035,0.99998,0.0024). The entries correspond to (average wind speed, precipitation, average temperature, maximum temperature, minimum temperature, wind direction in the fastest 2 minutes, wind speed in the fastest 2 minutes).
This explains the .99998: the wind direction is given in degrees (0 to 360), and there’s way more variation there than there is in the other entries. (So, we’re suffering from heterogeneous data; see “Further thoughts” below for more discussion.) Applying the SVD to an angle doesn’t really make sense: the angles 358 and 3 are very close together, but the linear algebra doesn’t see that. Let’s solve this problem by…throwing out that component:
numerical_vectors2 = [x[3:8]+x[9:-3] for x in data[2:]]
vecs2 = [vector([float(t) for t in x]) for x in numerical_vectors2]
M = sum(vecs2) / 128
cvecs2 = [v - M for v in vecs2]
B = matrix(cvecs2)
B.singular_values()
B.SVD()[2]
sing_vecs2 = B.SVD()[2].columns()
sing_vecs2[0]
Now I get roughly (0.0014,−0.0011,0.5472,0.7334,0.4010,0.04303), which is a little more varied. The fields are now (average wind speed, precipitation, average temperature, maximum temperature, minimum temperature, wind speed in the fastest 2 minutes). What we’re seeing is that the most variation is in the three temperature entries, which all move together. The wind speed tends to be higher when the temperature is higher (I didn’t know that), and there’s more precipitation when the temperature is lower (I did know that; this is the Pacific Northwest, remember).
The singular vectors give us new coordinates: the new coordinates of a (re-centered) vector are the dot products with the singular vectors:
def coords(v):
    return vector([v.dot_product(w) for w in sing_vecs2])
If you look at the coordinates of some of our re-centered vectors, you’ll notice they (almost always) come in decreasing order, and the fourth, fifth, and sixth are small. For instance:
coords(cvecs2[17])
(That’s the weather for January 19 that we’re looking at, by the way.)
We can use the coordinates to see how well the data is approximated by 1, 2, and 3-dimensional subspaces. Let’s define functions which give the 1-dimensional, 2-dimensional, and 3-dimensional approximations, and a function that measures the error.
def approx1(v):
    c = coords(v)
    return c[0]*sing_vecs2[0]

def approx2(v):
    c = coords(v)
    return c[0]*sing_vecs2[0] + c[1]*sing_vecs2[1]

def approx3(v):
    c = coords(v)
    return c[0]*sing_vecs2[0] + c[1]*sing_vecs2[1] + c[2]*sing_vecs2[2]

def approx4(v):
    c = coords(v)
    return c[0]*sing_vecs2[0] + c[1]*sing_vecs2[1] + c[2]*sing_vecs2[2] + c[3]*sing_vecs2[3]

def approx5(v):
    c = coords(v)
    return c[0]*sing_vecs2[0] + c[1]*sing_vecs2[1] + c[2]*sing_vecs2[2] + c[3]*sing_vecs2[3] + c[4]*sing_vecs2[4]

def error(v, w):
    return (v - w).norm()
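(If you prefer, here is a single function with the number of dimensions as a parameter; it is equivalent to the five approx functions above:)
def approx(v, k):
    # best approximation of v using the first k singular vectors
    c = coords(v)
    return sum(c[i]*sing_vecs2[i] for i in range(k))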
For example, the approximations for January 19, and the errors, are:
v = cvecs2[17]
v
approx1(v)
error(v, approx1(v))
approx2(v)
error(v, approx2(v))
approx3(v)
error(v, approx3(v))
approx4(v)
error(v, approx4(v))
approx5(v)
error(v, approx5(v))
One way to see how good the approximations are is with a scatter plot, showing the sizes of the original vectors and the sizes of the errors in the first through fifth approximations, for each day in the data:
( list_plot([v.norm() for v in cvecs2])
  + list_plot([error(approx1(v), v) for v in cvecs2], color='red')
  + list_plot([error(approx2(v), v) for v in cvecs2], color='orange')
  + list_plot([error(approx3(v), v) for v in cvecs2], color='green')
  + list_plot([error(approx4(v), v) for v in cvecs2], color='purple')
  + list_plot([error(approx5(v), v) for v in cvecs2], color='black') )
Blue is the length of the original re-centered data. The red, orange, green, purple, and black dots are the errors in the first, second, third, fourth, and fifth approximations. The fact that they get closer to 0, fast, tells you these really are good approximations. You can also see days which are unusual from the point of view of the first singular vectors (e.g., entry 16, which is January 18, seems to have been weird).
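Another way to quantify how good the k-dimensional approximations are, without looking day by day, uses the fact that the squared Frobenius norm of B is the sum of the squares of its singular values. Here is a small sketch; it prints the fraction of the total squared norm captured by the first k singular vectors, for each k:
s = B.singular_values()
total = sum(si^2 for si in s)
[sum(si^2 for si in s[:k]) / total for k in range(1, len(s)+1)]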
Further thoughts
The data we’ve been looking at is heterogeneous — it doesn’t really make sense to compare the magnitude of the wind speed to the magnitude of the temperature. So, perhaps we should have renormalized all of the components in our vectors first to have standard deviation 1, say, or maybe done something else more complicated. (If we renormalize, we’re blowing up small variations into big ones; is this what we want?) At the very least, our visualization of the errors is a little silly.
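(If you wanted to experiment with that, here is a rough sketch of one way to rescale each column of the re-centered data by its standard deviation before taking the SVD; the use of numpy is just my convenience, and it assumes no column is constant:)
import numpy as np
Bn = np.array([list(v) for v in cvecs2])   # rows are days, columns are fields
stds = Bn.std(axis=0)                      # standard deviation of each column
B_scaled = matrix(RDF, Bn / stds)          # each column now has standard deviation 1
B_scaled.singular_values()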
Another unsatisfying point is understanding what the second and third singular vectors are telling us, other than that they capture most of the information in the original vectors.
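At the very least, you can print them and stare:
sing_vecs2[1]
sing_vecs2[2]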
Finally, again, our way of handling missing data is thoroughly unsatisfying. (This is an active area of research, with applications ranging from genomics to music recommendations.)