
Voting Records 2011 Congress

In this example, we look at several aspects of the voting records of the 2011 US Congress. The votes were compiled from http://clerk.house.gov.

We have the vote (or absence of a vote) of each member of the House of Representatives on over 939 roll call votes in 2011, with 426 members having a recorded value on every roll call. Not surprisingly, these votes are clearly structured along party lines. We will apply PCA and see that the first principal component pretty much determines party affiliation.

Preprocessing

I generated the data file 2011_cleaned_votes.csv using the following function that scrapes the votes for a given roll call in a given year.

In[1]:

import urllib, os.path, time
from xml.etree import ElementTree as ET
import numpy as np

In[2]:

def scrape_votes(year, roll, sleep=True):
       """
       For a given year and roll call number, download the votes
       as an XML file, retrieve all votes and return an array that
       has a row for each member of the House, their ID, their party
       and their vote, represented as a string in the "vote" field
       and a numeric representation in their "numeric_vote" field.

       An "Aye" was coded as 1, "Nay" as -1, with "Not Voting" and
       "Present" coded as 0.
       """

       # download the data and make a local copy
       if not os.path.exists("../vote/%s_%s.xml" % (year, roll)):
           if sleep:
                time.sleep(np.random.randint(2,4))
           data = urllib.urlopen("http://clerk.house.gov/evs/%d/roll%03d.xml" % (year, roll)).read()
           f = file("../vote/%s_%s.xml" % (year, roll), 'w')
           f.write(data)
           f.close()
       data = file("../vote/%s_%s.xml" % (year, roll))

       # parse the XML tree
       # an investigation of one of the .xml files shows that
       # the information we want is in the <recorded-vote> tags

       xml_data = ET.parse(data)
       values = []
       for vote in xml_data.find('vote-data').findall('recorded-vote'):
           legislator = vote.find('legislator')
           name = legislator.get('unaccented-name')
           ident = legislator.get('name-id')
           party = legislator.get('party')
           v = vote.find('vote').text
           values.append((name, v, party, ident, {'Yea':1,'Aye':1, 'Nay':-1, 'Not Voting':0, 'Present':0, 'No':-1}[v]))

       return np.array(values, np.dtype([('name', '|S50'),
                                         ('vote%d' % roll, '|S12'),
                                         ('party', '|S15'),
                                         ('id', '|S12'),
                                         ('numeric_vote%d' % roll, np.int)]))

Here’s a sample of what this function returns.

In[3]:

scrape_votes(2011,21)[:20]

Out[3]:

array([('Ackerman', 'Yea', 'D', 'A000022', 1),
       ('Adams', 'Yea', 'R', 'A000366', 1),
       ('Aderholt', 'Yea', 'R', 'A000055', 1),
       ('Akin', 'Yea', 'R', 'A000358', 1),
       ('Alexander', 'Yea', 'R', 'A000361', 1),
       ('Altmire', 'Yea', 'D', 'A000362', 1),
       ('Amash', 'Yea', 'R', 'A000367', 1),
       ('Andrews', 'Yea', 'D', 'A000210', 1),
       ('Austria', 'Yea', 'R', 'A000365', 1),
       ('Baca', 'Yea', 'D', 'B001234', 1),
       ('Bachmann', 'Yea', 'R', 'B001256', 1),
       ('Bachus', 'Yea', 'R', 'B000013', 1),
       ('Baldwin', 'Yea', 'D', 'B001230', 1),
       ('Barletta', 'Yea', 'R', 'B001269', 1),
       ('Barrow', 'Yea', 'D', 'B001252', 1),
       ('Bartlett', 'Yea', 'R', 'B000208', 1),
       ('Barton (TX)', 'Yea', 'R', 'B000213', 1),
       ('Bass (CA)', 'Yea', 'D', 'B001270', 1),
       ('Bass (NH)', 'Yea', 'R', 'B000220', 1),
       ('Becerra', 'Yea', 'D', 'B000287', 1)],
      dtype=[('name', '|S50'), ('vote21', '|S12'), ('party', '|S15'), ('id', '|S12'), ('numeric_vote21', '<i4')])

We will get all the votes for 2011 using a simple loop (with a little sleep between requests so we don’t get kicked off the server).

In[4]:

import matplotlib.mlab as ML # for recarray utils

def get_all_votes():
    data = scrape_votes(2011, 1)
    rolls_found = [1]
    for i in range(2,950):
        try:
            new_votes = scrape_votes(2011,i)
            newdata = ML.rec_join('id', data, ML.rec_drop_fields(new_votes,
                                                             ['name', 'party']))
            if newdata.shape[0] < 10:
                # A merge seems to have failed if this happens...
                raise ValueError
            else:
                data = newdata
            rolls_found.append(i)
        except (ValueError, KeyError, IOError):
            # some rolls are not Yea/Nay votes -- e.g. roll 2,
            # the election of the Speaker -- so we skip them
            pass

        # every 100th vote, let's save what we have to disk
        if i % 100 == 0:
            ML.rec2csv(data, '../data/2011_votes.csv', delimiter=';')

    # we don't need all columns to analyse the data
    cleaned = ML.rec_keep_fields(data, ['party'] + [n for n in data.dtype.names if 'numeric_vote' in n])
    ML.rec2csv(cleaned, '../data/2011_cleaned_votes.csv', delimiter=';')
    ML.rec2csv(data, '../data/2011_votes.csv', delimiter=';')

force = False
if not os.path.exists('../data/2011_cleaned_votes.csv') or force:
    get_all_votes()
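The key step above is ML.rec_join, an inner join on the member id that drops any member missing from either table (this is why only 426 members survive all rolls). A minimal pure-Python sketch of that join, using a couple of hypothetical rows:

```python
def inner_join(left, right, key):
    """Keep only records whose key occurs in both tables and merge
    their fields; a toy stand-in for ML.rec_join(key, left, right)."""
    right_by_key = {row[key]: row for row in right}
    merged = []
    for row in left:
        if row[key] in right_by_key:
            combined = dict(row)
            combined.update(right_by_key[row[key]])
            merged.append(combined)
    return merged

# hypothetical rows keyed on the member id
left = [{'id': 'A000022', 'party': 'D', 'numeric_vote21': 1},
        {'id': 'B001234', 'party': 'D', 'numeric_vote21': 1}]
right = [{'id': 'A000022', 'numeric_vote22': -1}]
print(inner_join(left, right, 'id'))  # one merged row; B001234 drops out
```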

Analysis in R

I will run my R analysis from within IPython, though all the R code is, of course, executable in R itself. The IPython notebook allows me to keep some Python preprocessing along with R code in one file.

    votes = read.table("http://stats202.stanford.edu/data/2011_cleaned_votes.csv",
        header = TRUE, sep = ";")
    full_votes = read.table("http://stats202.stanford.edu/data/2011_votes.csv",
        header = TRUE, sep = ";")
    dim(votes)

    ## [1] 426 948

    summary(votes$party)

    ##   D   R
    ## 189 237

Principal components analysis is an unsupervised technique that yields some “interesting” linear combinations of the votes. However, as implemented in R’s princomp, it only works when there are fewer votes than members. Let’s take a random sample of 20 votes and look at the result.

    set.seed(0)  # this can be changed
    subset = sample(2:ncol(votes), 20)
    pca.votes = princomp(votes[, subset])
    plot(pca.votes$scores[, 1], pca.votes$scores[, 2])

[Figure voting_fig_00.png: scores on the first two principal components]
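Under the hood, princomp diagonalizes the covariance of the centered vote matrix. The same scores can be sketched in NumPy via the SVD, which (unlike princomp) also handles more votes than members; the matrix below is a random stand-in for the real data:

```python
import numpy as np

rng = np.random.default_rng(0)
# random stand-in for the numeric vote matrix: rows = members, cols = votes
X = rng.choice([-1.0, 0.0, 1.0], size=(30, 20))

Xc = X - X.mean(axis=0)                            # center each vote column
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U * s                                     # analogue of pca.votes$scores
loadings = Vt.T                                    # analogue of pca.votes$loadings
print(scores.shape)                                # (30, 20)
```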

This data is also “supervised”, in the sense that we have a label: the member’s party. Let’s color our points with this label.

    red = votes$party == "R"
    blue = votes$party == "D"

    plot(pca.votes$scores[, 1], pca.votes$scores[, 2], type = "n", xlab = "Variable 1",
        ylab = "Variable 2")
    points(pca.votes$scores[red, 1], pca.votes$scores[red, 2], pch = 23, bg = "red")
    points(pca.votes$scores[blue, 1], pca.votes$scores[blue, 2], pch = 23, bg = "blue")

[Figure voting_fig_01.png: first two principal component scores, colored by party]

Clearly, there is some separation between the two parties along Variable 1. However, there does not appear to be much separation between the parties on Variable 2.

We can also take a larger sample of votes, and the picture looks similar.

    subset = sample(2:ncol(votes), 100)
    pca.votes = princomp(votes[, subset])

    plot(pca.votes$scores[, 1], pca.votes$scores[, 2], type = "n", xlab = "Variable 1",
        ylab = "Variable 2")
    points(pca.votes$scores[red, 1], pca.votes$scores[red, 2], pch = 23, bg = "red")
    points(pca.votes$scores[blue, 1], pca.votes$scores[blue, 2], pch = 23, bg = "blue")

[Figure voting_fig_02.png: first two principal component scores for 100 votes, colored by party]

In summary, Variable 1 seems highly associated with party. There are more than 2 components; in fact, in this example, there are 100, and each component’s scores “explain” some amount of the variability in the votes dataset. How much each one explains can be visualized with what is known as a scree plot.

    plot(pca.votes)

[Figure voting_fig_03.png: scree plot of the principal components]

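The heights in a scree plot are the variances of the successive component scores, often reported as fractions of the total. A NumPy sketch of that computation, on random stand-in data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 10))             # stand-in data: 40 members, 10 votes
Xc = X - X.mean(axis=0)
_, s, _ = np.linalg.svd(Xc, full_matrices=False)

var = s**2 / Xc.shape[0]      # variance of each component's scores
frac = var / var.sum()        # fraction of total variability "explained"
print(frac.round(3))          # decreasing, sums to 1: what a scree plot shows
```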
    summary(pca.votes$scores[red, 1])

    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    ##    2.03    6.13    6.60    6.41    7.02    7.42

    summary(pca.votes$scores[blue, 1])

    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    ##  -10.10   -9.24   -8.88   -8.04   -7.99    2.54

Most of the scores for each party are on one side of 0. This task of creating a new variable from several votes is an example of dimension reduction or feature extraction, and it is an unsupervised task. Often, this is used as a preprocessing step whose output is fed into a supervised task like predicting party labels from voting records. Note that we did not use the party variable to find these new variables (this is very important).

Prediction of party from voting record

    sign_R = sign(mean(pca.votes$scores[red, 1]))
    predictions = numeric(length(votes$party))
    predictions[sign(pca.votes$scores[, 1]) == sign_R] = "R"
    predictions[sign(pca.votes$scores[, 1]) != sign_R] = "D"

How good is our classifier?

    sum(predictions != votes$party)

    ## [1] 4

It looks pretty good. To be fair, we should probably fit the classifier on a subset of the members and then validate it on the remaining members. We do this for logistic regression below.
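The sign rule above is the whole classifier: orient the first component so that the Republican mean is positive, then threshold at 0. A NumPy sketch with a few hypothetical first-component scores:

```python
import numpy as np

# hypothetical first-PC scores and party labels
scores = np.array([6.1, 7.0, -8.9, -9.2, 5.8, -8.0])
party = np.array(["R", "R", "D", "D", "R", "D"])

# orient the rule using the Republican mean, as in the R code
sign_R = np.sign(scores[party == "R"].mean())
pred = np.where(np.sign(scores) == sign_R, "R", "D")
print((pred != party).sum())   # number of misclassified members: 0 here
```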

In any case, given our simple classifier, we can see how many people “cross party lines” on this variable:

    which(sign(pca.votes$scores[blue, 1]) == sign_R)

    ## [1]  17 117 135 148

    which(sign(pca.votes$scores[red, 1]) != sign_R)

    ## integer(0)

So, we see that 4 Democrats had a “Republican” score on the first variable, while no Republicans had a “Democratic” one. Who is the first of them?

    full_votes$name[17]

    ## [1] Bishop (GA)
    ## 426 Levels: Ackerman Adams Aderholt Akin Alexander Altmire ... Young (IN)

Logistic regression

Later in the class, we will discuss classifiers, a simple one being logistic regression. Let’s fit the logistic regression model to our data here and see what it tells us.

    party_N = numeric(length(votes$party))
    party_N[red] = 1
    party_N[blue] = 0

    variable = pca.votes$scores[, 1]
    logistic.model = glm(votes$party ~ variable, family = binomial())

    ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

How well did the model fit? Well, the logistic model estimates probabilities between 0 and 1. We can turn these into 0-1 labels by the rule: assign 1 if the probability is > 0.5 and 0 otherwise.

    predictions_logistic = fitted(logistic.model) > 0.5
    sum(predictions_logistic != party_N)

    ## [1] 2
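In this setting the 0.5 rule is equivalent to thresholding the linear predictor at 0, because the logistic (sigmoid) function crosses 0.5 exactly at 0. A small NumPy sketch, with made-up linear-predictor values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

eta = np.array([4.2, -3.1, 0.3, -0.2])   # hypothetical beta0 + beta1 * score
prob = sigmoid(eta)                      # fitted probabilities in (0, 1)
labels = (prob > 0.5).astype(int)        # the > 0.5 decision rule
print(labels)                            # [1 0 1 0], same as (eta > 0)
```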

Validation and test data

To be fair, we are testing our classifier on the same members whose votes we used to extract the feature in the first place. Let’s set aside a subset of members before we do PCA, and then see how well our classifier does on the rest.

    subset_members = sample(1:nrow(votes), 400)
    subset_votes = sample(2:ncol(votes), 100)
    pca.votes = princomp(votes[subset_members, subset_votes])

We now have to evaluate the variable on the held-out members; this is what the loadings are for.

    variable = as.matrix(votes[-subset_members, subset_votes]) %*% pca.votes$loadings[,
        1]
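One subtlety: princomp centers its input before computing scores, so to score held-out members on exactly the same footing one would subtract the training-set column means before multiplying by the loadings. A NumPy sketch of that projection, on random stand-in data:

```python
import numpy as np

rng = np.random.default_rng(2)
X_train = rng.normal(size=(20, 5))   # members used to fit the PCA
X_new = rng.normal(size=(4, 5))      # held-out members

mu = X_train.mean(axis=0)
_, _, Vt = np.linalg.svd(X_train - mu, full_matrices=False)
loadings = Vt.T

# project held-out members onto the first component,
# subtracting the *training* means, as princomp did internally
variable = (X_new - mu) @ loadings[:, 0]
print(variable.shape)   # (4,)
```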

Our thresholded classifier still does reasonably well.

    red = votes[subset_members, ]$party == "R"
    sign_R = sign(mean(pca.votes$scores[red, 1]))

    # the new predictions
    predictions = numeric(length(party_N[-subset_members]))
    predictions[sign(variable) == sign_R] = 1
    predictions[sign(variable) != sign_R] = 0
    predictions

    ##  [1] 1 0 1 0 1 0 1 1 0 0 1 0 0 0 0 0 0 0 1 1 1 0 0 1 1 1

    party_N[-subset_members]

    ##  [1] 1 0 1 0 1 0 1 1 0 0 1 0 0 0 0 0 0 0 1 1 1 0 0 1 1 1

    sum(predictions != party_N[-subset_members])

    ## [1] 0

We can also predict the labels of the remaining 26 members by fitting the logistic model on this variable.

    logistic.model = glm(votes$party[-subset_members] ~ variable, family = binomial())

    ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

    predictions_logistic = fitted(logistic.model) > 0.5
    sum(predictions_logistic != party_N[-subset_members])

    ## [1] 0

Both models do pretty well. Is this a good thing or a bad thing?