py4sci

Table Of Contents

Previous topic

Getting started with R

Next topic

Some help for R

This Page

Plotting Unemployment Data

This example recreates an R version of a nice visualization of state level unemployment that can be found on the FlowingData blog. The blog mentioned above implements the map completely in python, we will use python to preprocess, then plot in R. Actually, the python way is faster because the result is another .svg file.

Preprocessing

Sometimes, downloading and preprocessing data can be the hardest part of “data mining”. Once the data is manageable, there are lots of tools that can manipulate it. In this little example, we download some data from the web on county level unemployment in the US and create a plot.

The preprocessing is in python, and the plot is in R.

In[1]:

import urllib, csv, numpy as np

## {{{ http://code.activestate.com/recipes/577305/ (r1)
state_dict = {
        'AK': 'Alaska',
        'AL': 'Alabama',
        'AR': 'Arkansas',
        'AS': 'American Samoa',
        'AZ': 'Arizona',
        'CA': 'California',
        'CO': 'Colorado',
        'CT': 'Connecticut',
        'DC': 'District of Columbia',
        'DE': 'Delaware',
        'FL': 'Florida',
        'GA': 'Georgia',
        'GU': 'Guam',
        'HI': 'Hawaii',
        'IA': 'Iowa',
        'ID': 'Idaho',
        'IL': 'Illinois',
        'IN': 'Indiana',
        'KS': 'Kansas',
        'KY': 'Kentucky',
        'LA': 'Louisiana',
        'MA': 'Massachusetts',
        'MD': 'Maryland',
        'ME': 'Maine',
        'MI': 'Michigan',
        'MN': 'Minnesota',
        'MO': 'Missouri',
        'MP': 'Northern Mariana Islands',
        'MS': 'Mississippi',
        'MT': 'Montana',
        'NA': 'National',
        'NC': 'North Carolina',
        'ND': 'North Dakota',
        'NE': 'Nebraska',
        'NH': 'New Hampshire',
        'NJ': 'New Jersey',
        'NM': 'New Mexico',
        'NV': 'Nevada',
        'NY': 'New York',
        'OH': 'Ohio',
        'OK': 'Oklahoma',
        'OR': 'Oregon',
        'PA': 'Pennsylvania',
        'PR': 'Puerto Rico',
        'RI': 'Rhode Island',
        'SC': 'South Carolina',
        'SD': 'South Dakota',
        'TN': 'Tennessee',
        'TX': 'Texas',
        'UT': 'Utah',
        'VA': 'Virginia',
        'VI': 'Virgin Islands',
        'VT': 'Vermont',
        'WA': 'Washington',
        'WI': 'Wisconsin',
        'WV': 'West Virginia',
        'WY': 'Wyoming'
       }

Our first step is to download the data. The following code downloads a copy of the data and does not redownload it as the script is rerun.

In[2]:

ds = np.DataSource('local')
raw = ds.open("http://www.bls.gov/lau/laucntycur14.txt").read()

After looking at the file, there are 7 lines we’d like to get rid of at the top and 6 lines at the end. We split the string by the newline character “\n”.

In[3]:

good_lines = raw.split("\n")[7:-6]

Next, we write out the data to a CSV file with ”;” as separators. R’s libraries for plotting state and county data assume lower case state and county names.

In[4]:

output_file = file("../data/unemployment_2011.csv", "w")
writer = csv.writer(output_file, delimiter='|')
col_names = ['code', 'state_fips', 'county_fips', 'period', 'civilian', 'employed', 'unemployed', 'rate', 'county', 'state', 'state_code']
writer.writerow(col_names)

In[5]:

for row in good_lines:
    row = [value.strip() for value in row.split("|")]
    region = row.pop(3).split(',') # region is the 4th entry with row[0] the first...
    row = row[:4] + [int(v.replace(',','')) for v in row[4:7]] + row[7:] # delete the commas in counts
    if len(region) == 1: # District of Columbia
        county, state_code = "", "DC"
    else:
        county, state_code = region
    state_code = state_code.strip()
    county = county.lower()

    # R's map_data has no quotes or mentions of city, parish, county, etc.
    for term in ['county', 'parish', 'municipio', "'", '.']:
        county = county.replace(term, '')
    state = state_dict[state_code].lower()
    row += [county.strip(), state, state_code]
    writer.writerow(row)

output_file.close()

Plotting via SVG

In[6]:

import urllib, os.path
from xml.etree import ElementTree
from StringIO import StringIO

import numpy as np, matplotlib.mlab as ML

Let’s download our processed data

In[7]:

ds = np.DataSource('local')
unemployment = ML.csv2rec(ds.open("http://stats202.stanford.edu/data/unemployment_2011.csv"), delimiter='|')
unemployment_jul11 = unemployment[unemployment['period']=='Jul-11']

To create a county map, we must create a dictionary with keys given by the full FIPS code.

In[8]:

unemployment_dict = {}
for i in range(unemployment_jul11.shape[0]):
    full_fips = "%02d" % unemployment_jul11[i]['state_fips']+"%03d" % unemployment_jul11[i]['county_fips']
    unemployment_dict[full_fips] = unemployment_jul11[i]['rate']

We now download a local copy of the SVG file.

In[9]:

county_svg = os.path.join('local', 'counties.svg')
if not os.path.exists(county_svg):
    svg = urllib.urlopen("http://upload.wikimedia.org/wikipedia/commons/5/5f/USA_Counties_with_FIPS_and_names.svg").read()
    f = file(county_svg, 'w')
    f.write(svg)
    f.close()

Finally, we plot it by writing out an SVG file.

In[10]:

def map_county_data(unemployment_dict, cmap, outname, county_svg):
    """Create an SVG map with colors populated
    by a dictionary of (FIPS, rate) value pairs.
    """

    county_svg = file(county_svg).read()
    county_data = ElementTree.fromstring(county_svg)

    # County style
    style_template = 'font-size:12px;fill:%(fill)s;fill-rule:nonzero;stroke:%(stroke)s;stroke-opacity:1;stroke-width:0.1;stroke-miterlimit:4;stroke-dasharray:none;stroke-linecap:butt;marker-start:none;stroke-linejoin:bevel'

    for path in county_data:
        path_id = path.get('id')
        if path_id in unemployment_dict:
            path.set('style', style_template % {'fill':cmap(unemployment_dict[path_id]), 'stroke':'#ffffff'})

    f = file(outname, 'w')
    f.write(ElementTree.tostring(county_data))
    f.close()

Now, given a color map as a function of the rate, we will colour our map.

In[11]:

colors = ["#F1EEF6", "#D4B9DA", "#C994C7", "#DF65B0", "#DD1C77", "#980043"]
def cmap(rate):
    return colors[max(0, min(int(rate/2), 5))]

In[13]:

states_svg = os.path.join('local', 'states.svg')
if not os.path.exists(states_svg):
    svg = urllib.urlopen("http://upload.wikimedia.org/wikipedia/commons/3/32/Blank_US_Map.svg").read()
    f = file(states_svg, 'w')
    f.write(svg)
    f.close()

map_county_data(unemployment_dict, cmap, 'unemployment_jul11.svg', county_svg)
July 2011

July 2011

Let’s compare this with a year ago. We could also make a map of the ratio...

July 2010

July 2010

Plotting the same data using ggplot2

This example follows the Revolutions blog, though the commands have had to change as the library has changed a little.

    unemployment = read.table("http://www.stanford.edu/class/stats202/data/unemployment_2011.csv",
        header = TRUE, sep = "|")
    names(unemployment)

    ##  [1] "code"        "state_fips"  "county_fips" "period"      "civilian"
    ##  [6] "employed"    "unemployed"  "rate"        "county"      "state"
    ## [11] "state_code"

We need some libraries for the maps, the plotting tool as well as the alpha function.

    library(maps)  # mapdata
    library(ggplot2)  # ggplot

    ## Warning: package 'ggplot2' was built under R version 2.15.1

    library(scales)  # alpha

    ## Warning: package 'scales' was built under R version 2.15.1

We will plot county data here, followed by state data below. We use the state_df below to add the state boundaries to the map.

    county_df = map_data("county")
    state_df = map_data("state")

    county_df$state = county_df$region
    county_df$county = county_df$subregion

Now, we will add our unemployment data into the county_df dataframe with a merge followed by a reorder. The order is necessary for the rendering of the map to come.

    unemployment_jul11 = unemployment[unemployment$period == "Jul-11", ]
    map_data = merge(county_df, unemployment_jul11, by = c("state", "county"))
    map_data = map_data[order(map_data$order), ]

We actually plot a discretized rate rather than the rate itself.

    map_data$rate_d = cut(map_data$rate, breaks = c(seq(0, 10, by = 2), 35))

Finally, the map

    data_source = ggplot(map_data, aes(long, lat, group = group))  # this is where the data lives
    county_polygons = geom_polygon(aes(fill = rate_d), colour = alpha("white", 1/2),
        size = 0.2)  # add in the polygons in the data, filled by the attribute rate_d
    state_boundaries = geom_polygon(data = state_df, colour = "white", fill = NA)  # add in the state boundaries from state_df
    color_mapper = scale_fill_brewer("Unemployment (Jul-11)", palette = "PuRd")  # how do we fill polygons in this plot?

    unemployment_plot = (data_source + county_polygons + state_boundaries + color_mapper)
    print(unemployment_plot)

_images/unemployment_fig_00.png

State level data

    # create a new variable for unemployment

    state_df$avg_unemployment_jul11 = numeric(nrow(state_df))
    sorted_states = sort(unique(map_data$state))
    for (state in sorted_states) {
        state_subs = unemployment_jul11$state == state
        state_df$avg_unemployment_jul11[state_df$region == state] = mean(unemployment_jul11$rate[state_subs])
    }
    state_df$rate_d = cut(state_df$avg_unemployment_jul11, breaks = c(seq(0, 10,
        by = 2), 35))

    data_source = ggplot(state_df, aes(long, lat, group = group))
    state_polygons = geom_polygon(aes(fill = rate_d), colour = alpha("white", 1/2),
        size = 0.2)
    color_mapper = scale_fill_brewer("State level (Jul-11)")
    state_level_data = (data_source + state_polygons + state_boundaries + color_mapper)
    print(state_level_data)

_images/unemployment_fig_01.png

Making a function for reuse later

We might decide that code was useful, and that we can reuse it for other purposes. So, let’s make a function that will take some data for each state and plot it. It will have this fixed palette, but we can change it later.

    avg_unemployment_jul11 = state_df$avg_unemployment_jul11

    # rate_df should have at least two columns: 'state', and 'rate' we are
    # assuming that there is one row per state and will fill in the state_df

    make_state_map = function(rate_df, breaks, label, palette = 1) {

        state_df = map_data("state")
        state_df$rate = numeric(nrow(state_df))
        sorted_states = sort(unique(state_df$region))
        for (state in sorted_states) {
            state_df$rate[state_df$region == state] = rate_df$rate[rate_df$state ==
                state]
        }
        state_df$rate_d = cut(state_df$rate, breaks = c(seq(0, 10, by = 2), 35))

        # Now, we discretize the rate
        state_df$rate_d = cut(state_df$rate, breaks = breaks)

        data_source = ggplot(state_df, aes(long, lat, group = group))
        state_polygons = geom_polygon(aes(fill = rate_d), colour = alpha("white",
            1/2), size = 0.2)
        color_mapper = scale_fill_brewer(label, palette = palette)

        result = data_source + state_polygons + color_mapper
        return(result)
    }

We can reproduce our above plot as follows:

    new_data = unique(data.frame(rate = avg_unemployment_jul11, state = state_df$region))
    print(make_state_map(new_data, 5, "State level (Jul-11)"))

_images/unemployment_fig_02.png