In this example, we look at several aspects of the 2011 voting records of the US House of Representatives. The votes were compiled from http://clerk.house.gov.

We have the votes (or absence of a vote) for each member of the House of Representatives on nearly 950 roll call votes in 2011, with 426 members appearing in every roll call. Not surprisingly, these votes are clearly structured along party lines. We will apply PCA and see that the first component pretty much determines party affiliation.

I generated the data file 2011_cleaned_votes.csv using the following function, which scrapes the votes for a given roll call in a given year.

In[1]:

```
import urllib, os.path, time
from xml.etree import ElementTree as ET  # in the standard library since Python 2.5
import numpy as np
```

In[2]:

```
def scrape_votes(year, roll, sleep=True):
    """
    For a given year and roll call number, download the votes
    as an XML file, retrieve all votes and return an array that
    has a row for each member of the House, their ID, their party
    and their vote, represented as a string in the "vote" field
    and a numeric representation in the "numeric_vote" field.
    A "Yea" or "Aye" was coded as 1, a "Nay" or "No" as -1, with
    "Not Voting" and "Present" coded as 0.
    """
    # download the data and make a local copy
    if not os.path.exists("../vote/%s_%s.xml" % (year, roll)):
        if sleep:
            time.sleep(np.random.randint(2, 4))
        data = urllib.urlopen("http://clerk.house.gov/evs/%d/roll%03d.xml" % (year, roll)).read()
        f = file("../vote/%s_%s.xml" % (year, roll), 'w')
        f.write(data)
        f.close()
    data = file("../vote/%s_%s.xml" % (year, roll))
    # parse the XML tree
    # an investigation of one of the .xml files shows that
    # the information we want is in the <recorded-vote> tags
    xml_data = ET.parse(data)
    values = []
    for vote in xml_data.find('vote-data').findall('recorded-vote'):
        legislator = vote.find('legislator')
        name = legislator.get('unaccented-name')
        ident = legislator.get('name-id')
        party = legislator.get('party')
        v = vote.find('vote').text
        values.append((name, v, party, ident,
                       {'Yea': 1, 'Aye': 1, 'Nay': -1, 'No': -1,
                        'Not Voting': 0, 'Present': 0}[v]))
    return np.array(values, np.dtype([('name', '|S50'),
                                      ('vote%d' % roll, '|S12'),
                                      ('party', '|S15'),
                                      ('id', '|S12'),
                                      ('numeric_vote%d' % roll, np.int)]))
```
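The function above uses Python 2 idioms (`urllib.urlopen`, `file`). On Python 3 the same parsing step can be sketched as follows; the inline XML snippet is a hypothetical miniature mimicking the clerk.house.gov format, not real downloaded data.

```python
# Python 3 sketch of the parsing step in scrape_votes.
# The XML below is a made-up minimal example, not a real roll call file.
import xml.etree.ElementTree as ET

VOTE_CODES = {'Yea': 1, 'Aye': 1, 'Nay': -1, 'No': -1,
              'Not Voting': 0, 'Present': 0}

sample_xml = """<rollcall-vote>
  <vote-data>
    <recorded-vote>
      <legislator name-id="A000022" unaccented-name="Ackerman" party="D"/>
      <vote>Yea</vote>
    </recorded-vote>
    <recorded-vote>
      <legislator name-id="A000366" unaccented-name="Adams" party="R"/>
      <vote>Nay</vote>
    </recorded-vote>
  </vote-data>
</rollcall-vote>"""

def parse_votes(xml_text):
    """Return one (name, vote, party, id, numeric_vote) tuple per member."""
    root = ET.fromstring(xml_text)
    rows = []
    for rv in root.find('vote-data').findall('recorded-vote'):
        leg = rv.find('legislator')
        v = rv.find('vote').text
        rows.append((leg.get('unaccented-name'), v,
                     leg.get('party'), leg.get('name-id'),
                     VOTE_CODES[v]))
    return rows

print(parse_votes(sample_xml))
```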

Here’s a sample of what this function returns.

In[3]:

```
scrape_votes(2011,21)[:20]
```

Out[3]:

```
array([('Ackerman', 'Yea', 'D', 'A000022', 1),
       ('Adams', 'Yea', 'R', 'A000366', 1),
       ('Aderholt', 'Yea', 'R', 'A000055', 1),
       ('Akin', 'Yea', 'R', 'A000358', 1),
       ('Alexander', 'Yea', 'R', 'A000361', 1),
       ('Altmire', 'Yea', 'D', 'A000362', 1),
       ('Amash', 'Yea', 'R', 'A000367', 1),
       ('Andrews', 'Yea', 'D', 'A000210', 1),
       ('Austria', 'Yea', 'R', 'A000365', 1),
       ('Baca', 'Yea', 'D', 'B001234', 1),
       ('Bachmann', 'Yea', 'R', 'B001256', 1),
       ('Bachus', 'Yea', 'R', 'B000013', 1),
       ('Baldwin', 'Yea', 'D', 'B001230', 1),
       ('Barletta', 'Yea', 'R', 'B001269', 1),
       ('Barrow', 'Yea', 'D', 'B001252', 1),
       ('Bartlett', 'Yea', 'R', 'B000208', 1),
       ('Barton (TX)', 'Yea', 'R', 'B000213', 1),
       ('Bass (CA)', 'Yea', 'D', 'B001270', 1),
       ('Bass (NH)', 'Yea', 'R', 'B000220', 1),
       ('Becerra', 'Yea', 'D', 'B000287', 1)],
      dtype=[('name', '|S50'), ('vote21', '|S12'), ('party', '|S15'), ('id', '|S12'), ('numeric_vote21', '<i4')])
```

We will get all votes for 2011 using a simple loop (with a little sleep between requests so we don’t get kicked off the server).

In[4]:

```
import matplotlib.mlab as ML  # for recarray utils

def get_all_votes():
    data = scrape_votes(2011, 1)
    rolls_found = [1]
    for i in range(2, 950):
        try:
            new_votes = scrape_votes(2011, i)
            newdata = ML.rec_join('id', data,
                                  ML.rec_drop_fields(new_votes, ['name', 'party']))
            if newdata.shape[0] < 10:
                # A merge seems to have failed if this happens...
                raise ValueError
            else:
                data = newdata
                rolls_found.append(i)
        except:
            # Roll 2 is election of the Speaker... not a Y/N vote
            pass
        # every 100th vote, let's save what we have to disk
        if i % 100 == 0:
            ML.rec2csv(data, '../data/2011_votes.csv', delimiter=';')
    # we don't need all columns to analyse the data
    cleaned = ML.rec_keep_fields(data, ['party'] +
                                 [n for n in data.dtype.names if 'numeric_vote' in n])
    ML.rec2csv(cleaned, '../data/2011_cleaned_votes.csv', delimiter=';')
    ML.rec2csv(data, '../data/2011_votes.csv', delimiter=';')

force = False
if not os.path.exists('../data/2011_cleaned_votes.csv') or force:
    get_all_votes()
```
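The key step in the loop is the inner join on the id field: rec_join keeps only members present in both tables, so after all rolls are merged, only members with a recorded position in every roll call survive. A minimal Python sketch of that logic on toy data:

```python
# Sketch of the inner-join-on-'id' logic performed by rec_join.
# Toy data, not the real scrape: members missing from a roll are dropped.
def inner_join(table, new_votes):
    """table: {id: [votes so far]}, new_votes: {id: vote} -> joined table."""
    common = table.keys() & new_votes.keys()
    return {i: table[i] + [new_votes[i]] for i in common}

table = {'A000022': [1], 'A000366': [1], 'B000287': [-1]}
roll2 = {'A000022': -1, 'A000366': 1}       # B000287 absent from this roll
table = inner_join(table, roll2)
print(table)                                 # B000287 is gone
```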

I will run my R analysis from within IPython, though all the R code is, of course, executable in R itself. The IPython notebook allows me to keep some Python preprocessing along with the R code in one file.

```
votes = read.table("http://stats202.stanford.edu/data/2011_cleaned_votes.csv",
                   header = TRUE, sep = ";")
full_votes = read.table("http://stats202.stanford.edu/data/2011_votes.csv",
                        header = TRUE, sep = ";")
dim(votes)
```

```
## [1] 426 948
```

```
summary(votes$party)
```

```
##   D   R 
## 189 237
```

Principal components analysis is an unsupervised technique that yields some “interesting” linear combinations of the votes. However, as implemented in R’s princomp function, it only works when there are fewer votes (variables) than members (observations). Let’s take a random sample of 20 votes and look at the result.

```
set.seed(0)  # this can be changed
subset = sample(2:ncol(votes), 20)
pca.votes = princomp(votes[, subset])
plot(pca.votes$scores[, 1], pca.votes$scores[, 2])
```

This data is also “supervised”, in the sense that we have a label: the member’s party. Let’s color our points with this label.

```
red = votes$party == "R"
blue = votes$party == "D"
plot(pca.votes$scores[, 1], pca.votes$scores[, 2], type = "n",
     xlab = "Variable 1", ylab = "Variable 2")
points(pca.votes$scores[red, 1], pca.votes$scores[red, 2], pch = 23, bg = "red")
points(pca.votes$scores[blue, 1], pca.votes$scores[blue, 2], pch = 23, bg = "blue")
```

Clearly, there is some separation between the two parties along
*Variable 1*. However, there does not appear to be much separation
between the parties on *Variable 2*.
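What PCA is doing here can be sketched in a few lines of numpy on synthetic party-line data (an illustration with made-up votes, not the real dataset): two blocs vote mostly together, and the first principal component recovers the bloc almost perfectly.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic "party-line" votes: two blocs of 30 members, 20 votes,
# each member defecting from the bloc position 10% of the time
n_members, n_votes = 60, 20
party = np.repeat([1, -1], n_members // 2)
flips = rng.random((n_members, n_votes)) < 0.1
votes = party[:, None] * np.where(flips, -1, 1)

# PCA by hand: center the matrix and take its SVD; the first right
# singular vector is the first loading, and X @ loading gives the scores
X = votes - votes.mean(axis=0)
_, _, Vt = np.linalg.svd(X, full_matrices=False)
scores = X @ Vt[0]

# the sign of the first score separates the two blocs almost perfectly
pred = np.sign(scores) * np.sign(scores[party == 1].mean())
accuracy = (pred == party).mean()
print(accuracy)
```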

We can also take a larger sample of votes, and the picture looks similar.

```
subset = sample(2:ncol(votes), 100)
pca.votes = princomp(votes[, subset])
plot(pca.votes$scores[, 1], pca.votes$scores[, 2], type = "n",
     xlab = "Variable 1", ylab = "Variable 2")
points(pca.votes$scores[red, 1], pca.votes$scores[red, 2], pch = 23, bg = "red")
points(pca.votes$scores[blue, 1], pca.votes$scores[blue, 2], pch = 23, bg = "blue")
```

In summary, *Variable 1* seems highly associated with *party*. There are
more than 2 score variables; in fact, in this example, there are 100, one
per principal component. Each component “explains” some amount of the
variability in the *votes* dataset. How much each one explains can be
visualized with what is known as a scree plot.

```
plot(pca.votes)
```
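The quantities a scree plot displays can be computed directly: the variance of each component’s scores, which comes from the squared singular values of the centered data matrix. A numpy sketch on random data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 8))
X = X - X.mean(axis=0)           # center each column

# component variances = squared singular values / (n - 1),
# returned in decreasing order by the SVD
s = np.linalg.svd(X, compute_uv=False)
variances = s**2 / (X.shape[0] - 1)

# the fraction of total variability each component "explains"
explained = variances / variances.sum()
print(explained)
```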

```
summary(pca.votes$scores[red, 1])
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.03    6.13    6.60    6.41    7.02    7.42 
```

```
summary(pca.votes$scores[blue, 1])
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -10.10   -9.24   -8.88   -8.04   -7.99    2.54 
```

Most of the scores for each party are on one side of 0. This task of
creating a new variable from several votes is an example of *dimension
reduction* or *feature extraction*, and it is an *unsupervised* task. Often
it is used as a preprocessing step whose output is fed into a *supervised*
task such as predicting party labels from voting records. Note that we did
not use the party variable to find these new variables (this is *very*
important).

```
sign_R = sign(mean(pca.votes$scores[red, 1]))
predictions = character(length(votes$party))
predictions[sign(pca.votes$scores[, 1]) == sign_R] = "R"
predictions[sign(pca.votes$scores[, 1]) != sign_R] = "D"
```
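The rule is just a sign comparison: a member is labeled “R” when their first-component score lies on the same side of zero as the Republican mean. A Python sketch with hypothetical scores:

```python
# Sketch of the sign-based party classifier.
# The scores and sign_R below are hypothetical illustrations,
# not the fitted values from the R session.
def classify(scores, sign_R):
    return ['R' if (s > 0) == (sign_R > 0) else 'D' for s in scores]

scores = [6.4, -8.9, 7.0, -8.1, 2.5]
sign_R = 1          # here the Republicans' mean score is positive
print(classify(scores, sign_R))   # → ['R', 'D', 'R', 'D', 'R']
```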

How good is our classifier?

```
sum(predictions != votes$party)
```

```
## [1] 4
```

It looks pretty good. To be fair, we should probably fit the classifier on a subset of the members and then validate it on the remaining members. We do this for logistic regression below.

In any case, given our simple classifier, we can see how many people “cross party lines” on this variable:

```
which(sign(pca.votes$scores[blue, 1]) == sign_R)
```

```
## [1]  17 117 135 148
```

```
which(sign(pca.votes$scores[red, 1]) != sign_R)
```

```
## integer(0)
```

So, we see that 4 Democrats had a “Republican” score on the first variable, while no Republicans had a “Democratic” score. Who is the first of these Democrats?

```
full_votes$name[17]
```

```
## [1] Bishop (GA)
## 426 Levels: Ackerman Adams Aderholt Akin Alexander Altmire ... Young (IN)
```

Later in the class, we will discuss classifiers, a simple one being logistic regression. Let’s fit the logistic regression model to our data here and see what it tells us.

```
party_N = numeric(length(votes$party))
party_N[red] = 1
party_N[blue] = 0
variable = pca.votes$scores[, 1]
logistic.model = glm(votes$party ~ variable, family = binomial())
```

```
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
```

How well did the model fit? The logistic model estimates probabilities between 0 and 1. We can turn these into 0/1 labels by the rule: assign 1 if the fitted probability is > 0.5, and 0 otherwise.
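The thresholding step itself is simple; here is a Python sketch using a hypothetical intercept and slope for illustration (not the coefficients fitted by R):

```python
import math

def sigmoid(x):
    """The logistic function mapping a linear predictor to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def predict_label(prob, threshold=0.5):
    """Turn a fitted probability into a 0/1 label."""
    return 1 if prob > threshold else 0

# hypothetical intercept and slope, for illustration only
intercept, coef = 0.2, 1.5
for score in (-6.0, 0.1, 5.0):
    p = sigmoid(intercept + coef * score)
    print(round(p, 3), predict_label(p))
```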

```
predictions_logistic = fitted(logistic.model) > 0.5
sum(predictions_logistic != party_N)
```

```
## [1] 2
```

To be fair, we are testing our classifier on the same members we used to extract the feature. Let’s set aside a subset of members before we do PCA, and then see how well our classifier does on the rest.

```
subset_members = sample(1:nrow(votes), 400)
subset_votes = sample(2:ncol(votes), 100)
pca.votes = princomp(votes[subset_members, subset_votes])
```

We now have to evaluate the new variable on the held-out members; this is
what the *loadings* are for.

```
variable = as.matrix(votes[-subset_members, subset_votes]) %*% pca.votes$loadings[, 1]
```
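In matrix terms this is a single product of the held-out rows with the first loading vector. A numpy sketch with toy data; note that, to be fully careful, the held-out rows should also be centered by the training mean, a step the R one-liner above skips:

```python
import numpy as np

rng = np.random.default_rng(2)
X_train = rng.standard_normal((40, 5))   # members used to fit the PCA
X_new = rng.standard_normal((3, 5))      # held-out members

# fit PCA on the training members only
mean = X_train.mean(axis=0)
_, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
loading1 = Vt[0]                         # first loading vector (unit length)

# evaluate the new variable on held-out members: subtract the
# training mean, then project onto the first loading
variable = (X_new - mean) @ loading1
print(variable.shape)                    # one score per held-out member
```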

Our thresholded classifier still does reasonably well:

```
red = votes[subset_members, ]$party == "R"
sign_R = sign(mean(pca.votes$scores[red, 1]))
# the new predictions
predictions = numeric(length(party_N[-subset_members]))
predictions[sign(variable) == sign_R] = 1
predictions[sign(variable) != sign_R] = 0
predictions
```

```
##  [1] 1 0 1 0 1 0 1 1 0 0 1 0 0 0 0 0 0 0 1 1 1 0 0 1 1 1
```

```
party_N[-subset_members]
```

```
##  [1] 1 0 1 0 1 0 1 1 0 0 1 0 0 0 0 0 0 0 1 1 1 0 0 1 1 1
```

```
sum(predictions != party_N[-subset_members])
```

```
## [1] 0
```

We can also predict the labels of the remaining 26 members using the logistic model with this variable.

```
logistic.model = glm(votes$party[-subset_members] ~ variable, family = binomial())
```

```
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
```

```
predictions_logistic = fitted(logistic.model) > 0.5
sum(predictions_logistic != party_N[-subset_members])
```

```
## [1] 0
```

Both models do pretty well. Is this a good thing or a bad thing?