py4sci

Table Of Contents

Previous topic

Multidimensional scaling

Next topic

Decision trees

This Page

Multidimensional arrays

This example creates the iris data cubes constructed in the text book as well as data cubes for the unemployment data.

Iris data

We now construct the same cubes used in your textbook, based on separating the irises by petal width and petal length.

    iris = read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
        sep = ",", header = FALSE)
    names(iris) = c("sepal.length", "sepal.width", "petal.length", "petal.width",
        "iris.type")
    attach(iris)

    width_cuts = c(0, 0.75, 1.75, Inf)
    length_cuts = c(0, 2.5, 5, Inf)
    widths = c("low_w", "medium_w", "high_w")
    lengths = c("low_l", "medium_l", "high_l")
    types = sort(levels(iris.type))
    cube = array(0, c(3, 3, 3), dimnames = list(types, widths, lengths))

    for (i in 1:3) {
        for (j in 1:3) {
            for (k in 1:3) {
                subset = iris.type == types[k]
                cube[k, i, j] = sum((petal.width[subset] < width_cuts[i + 1]) &
                    (petal.width[subset] >= width_cuts[i]) & (petal.length[subset] <
                    length_cuts[j + 1]) & (petal.length[subset] >= length_cuts[j]))
            }
        }
    }

    types[1]

    ## [1] "Iris-setosa"

    cube[1, , ]

    ##          low_l medium_l high_l
    ## low_w       50        0      0
    ## medium_w     0        0      0
    ## high_w       0        0      0

    types[2]

    ## [1] "Iris-versicolor"

    cube[2, , ]

    ##          low_l medium_l high_l
    ## low_w        0        0      0
    ## medium_w     0       47      2
    ## high_w       0        1      0

    types[3]

    ## [1] "Iris-virginica"

    cube[3, , ]

    ##          low_l medium_l high_l
    ## low_w        0        0      0
    ## medium_w     0        1      4
    ## high_w       0        5     40

    detach(iris)

Using python: pandas

First, we import the dataset into ipython using the R magic

In[5]:

%load_ext rmagic
%R -d iris

In[6]:

import pandas, numpy as np
iris_df = pandas.DataFrame(iris)

We will write custom functions to discretize the variables, then aggregate. When going through this originally in class, I had used a custom function but one can use np.digitize

In[7]:

iris_df['cut.width'] = np.digitize(iris_df['petal.width'], (0,0.75, 1.75, np.inf))
iris_df['cut.length'] = np.digitize(iris_df['petal.length'], (0,2.5,5,np.inf))
iris_df['iris.name'] = map(lambda x: {1:'setosa', 2:'versicolor', 3:'virginica'}[x], iris_df['iris.type'])

Let’s see how many we have in each category. Note that it doesn’t form the array as we have in R, and it leaves out some groups where there is no data.

In[8]:

groups = iris_df.groupby(['iris.name', 'cut.width', 'cut.length'])
V = groups.agg(len)
print V
                                 sepal.length  sepal.width  petal.length  petal.width  iris.type
iris.name  cut.width cut.length
setosa     1         1           50            50           50            50           50
versicolor 2         2           47            47           47            47           47
                     3           2             2            2             2            2
versicolor 3         2           1             1            1             1            1
virginica  2         2           1             1            1             1            1
                     3           4             4            4             4            4
virginica  3         2           5             5            5             5            5
                     3           40            40           40            40           40

We can also compute mean or max easily

In[9]:

print groups.agg(np.mean)
                                 sepal.length  sepal.width  petal.length  petal.width  iris.type
iris.name  cut.width cut.length
setosa     1         1           5.006         3.418        1.464         0.244        1
versicolor 2         2           5.919         2.757        4.215         1.302        2
                     3           6.35          2.85         5.05          1.65         2
versicolor 3         2           5.9           3.2          4.8           1.8          2
virginica  2         2           4.9           2.5          4.5           1.7          3
                     3           6.4           2.65         5.375         1.5          3
virginica  3         2           6.04          2.86         4.86          1.84         3
                     3           6.717         3.032        5.682         2.11         3

In[10]:

print groups.agg(np.max)
                                 sepal.length  sepal.width  petal.length  petal.width  iris.type
iris.name  cut.width cut.length
setosa     1         1           5.8           4.4          1.9           0.6          1
versicolor 2         2           7             3.4          4.9           1.6          2
                     3           6.7           3            5.1           1.7          2
versicolor 3         2           5.9           3.2          4.8           1.8          2
virginica  2         2           4.9           2.5          4.5           1.7          3
                     3           7.2           3            5.8           1.6          3
virginica  3         2           6.3           3            4.9           2            3
                     3           7.9           3.8          6.9           2.5          3

More specialized output can be formed by passing a dictionary to agg

In[11]:

groups.agg({'sepal.length':np.mean, 'sepal.width':np.max})

Out[11]:

                                 sepal.length  sepal.width
iris.name  cut.width cut.length
setosa     1         1           5.006         4.4
versicolor 2         2           5.919         3.4
                     3           6.35          3
versicolor 3         2           5.9           3.2
virginica  2         2           4.9           2.5
                     3           6.4           3
virginica  3         2           6.04          3
                     3           6.717         3.8

Unemployment data

We will construct a cube with one entry for each state, each period and each of the variables [rate, unemployed, civilian]

    uemp = read.table("http://stats202.stanford.edu/data/unemployment_2011.csv",
        sep = "|", header = TRUE)

    periods = levels(uemp$period)
    states = levels(uemp$state)
    vars = c("rate", "civilian", "unemployed")

The following is a possibly inefficient way to construct the data cube, though it shows exactly what goes into each cell of the cube.

First, we construct an empty cube:

    cube = array(0, c(length(states), length(periods), length(vars)), dimnames = list(states,
        periods, vars))

Next, we choose some function to apply over each variable. We want the mean, but some counties have missing data, so we remove the missing (NA) by default

    mean.na = function(x) {
        return(mean(x, na.rm = TRUE))
    }

    max.na = function(x) {
        return(max(x, na.rm = TRUE))
    }

    fns = list(mean.na, sum, sum)

For each variable, we apply a different function across the counties of that state for that period. In our example, we will compute the raw mean unemployment rate across counties, the total number of unemployed and the total number of civilians. Note the use of list in the arguments to array. A list is similar to a call to c, but the entries can be named as well.

Let’s now loop over the 3 axes of the cube, populating the entries. This is slightly inefficient...

    for (i in 1:dim(cube)[1]) {
        for (j in 1:dim(cube)[2]) {
            for (k in 1:dim(cube)[3]) {
                cube[i, j, k] = fns[[k]](uemp[(uemp$state == states[i]) & (uemp$period ==
                    periods[j]), vars[k]])
            }
        }
    }

Let’s redo one of the problems on your homework:

    cube["california", , ]

    ##            rate civilian unemployed
    ## Apr-12    12.69 18367346    1930972
    ## Aug-11    12.86 18449357    2199257
    ## Dec-11    12.89 18432229    2017792
    ## Feb-12    13.75 18450464    2100526
    ## Jan-12    13.60 18364985    2080980
    ## Jul-11    13.44 18436446    2264485
    ## Jul-12    11.90  4847286     576580
    ## Jul-12(p) 12.22 13640987    1434847
    ## Jun-11    13.55 18364776    2207433
    ## Jun-12    12.32 18444636    1972365
    ## Mar-12    13.94 18498666    2119946
    ## May-12    12.17 18431866    1912724
    ## Nov-11    12.43 18429657    2014476
    ## Oct-11    12.31 18482754    2083997
    ## Sep-11    12.38 18473678    2124652

As we saw in the homework, the mean of the rates is not the same as the overall unemployment. This is because each county has a different population, so averaging the averages does not yield the overall average.

    cube["california", , "unemployed"]/cube["california", , "civilian"]

    ##    Apr-12    Aug-11    Dec-11    Feb-12    Jan-12    Jul-11    Jul-12
    ##    0.1051    0.1192    0.1095    0.1138    0.1133    0.1228    0.1189
    ## Jul-12(p)    Jun-11    Jun-12    Mar-12    May-12    Nov-11    Oct-11
    ##    0.1052    0.1202    0.1069    0.1146    0.1038    0.1093    0.1128
    ##    Sep-11
    ##    0.1150

Some example of aggregations would be to take the maximum of the entries across states. This will yield a two-dimensional cube with periods and variables.

    apply(cube, c(2, 3), max.na)

    ##            rate civilian unemployed
    ## Apr-12    16.25 18367346    1930972
    ## Aug-11    18.43 18449357    2199257
    ## Dec-11    15.11 18432229    2017792
    ## Feb-12    16.96 18450464    2100526
    ## Jan-12    17.45 18364985    2080980
    ## Jul-11    18.34 18436446    2264485
    ## Jul-12    11.90  4847286     576580
    ## Jul-12(p) 16.92 13640987    1434847
    ## Jun-11    17.05 18364776    2207433
    ## Jun-12    15.91 18444636    1972365
    ## Mar-12    17.06 18498666    2119946
    ## May-12    15.54 18431866    1912724
    ## Nov-11    17.22 18429657    2014476
    ## Oct-11    18.37 18482754    2083997
    ## Sep-11    17.34 18473678    2124652

To compute the maximum across periods, we would use the call

    apply(cube, c(1, 3), max.na)

    ##                        rate civilian unemployed
    ## alabama              11.387  2212409     217390
    ## alaska               12.907   376353      28891
    ## arizona              13.153  3038650     304069
    ## arkansas              9.509  1404676     117593
    ## california           13.945 18498666    2264485
    ## colorado              7.875  2764916     233130
    ## connecticut           9.037  1945156     179351
    ## delaware              7.733   443372      33830
    ## district of columbia 11.300   360710      39268
    ## florida              10.927  9341312    1025822
    ## georgia              11.275  4814162     489052
    ## hawaii                8.975   665420      49070
    ## idaho                 9.961   789839      69948
    ## illinois             10.525  6668291     688463
    ## indiana               9.468  3217446     297604
    ## iowa                  6.181  1672921     101104
    ## kansas                5.825  1520191     105842
    ## kentucky             10.613  2108261     207019
    ## louisiana             9.861  2116953     182653
    ## maine                 9.275   724085      58151
    ## maryland              8.304  3136247     231566
    ## massachusetts         8.836  3510445     267181
    ## michigan             11.373  4749360     545392
    ## minnesota             7.629  3014554     209140
    ## mississippi          13.376  1362235     158342
    ## missouri              8.972  3089671     274973
    ## montana               7.154   518795      37166
    ## nebraska              4.419  1029874      48528
    ## nevada               11.665  1391576     195222
    ## new hampshire         5.860   750633      42865
    ## new jersey           10.329  4660351     470628
    ## new mexico            8.164   934234      76339
    ## new york              9.581  9733982     888935
    ## north carolina       11.740  4711336     513245
    ## north dakota          4.625   398500      15487
    ## ohio                  9.866  5879995     534920
    ## oklahoma              6.338  1821340     114678
    ## oregon               11.306  2008691     192600
    ## pennsylvania          8.639  6589268     549617
    ## puerto rico          18.435  1297689     209487
    ## rhode island         11.560   569797      67519
    ## south carolina       12.535  2191325     236844
    ## south dakota          5.459   454677      21977
    ## tennessee            11.160  3157774     308290
    ## texas                 8.034 12729403    1073931
    ## utah                  7.990  1361906      96055
    ## vermont               6.386   363638      20564
    ## virginia              7.806  4383650     281012
    ## washington           10.621  3545307     327109
    ## west virginia         9.300   816760      66382
    ## wisconsin             8.908  3119082     247757
    ## wyoming               6.370   312331      19194

Finally, the mean across periods:

    apply(cube, c(1, 3), mean.na)

    ##                        rate civilian unemployed
    ## alabama               9.709  2024172     168915
    ## alaska               10.278   343866      25203
    ## arizona              11.905  2814567     251287
    ## arkansas              8.605  1288981      99720
    ## california           12.829 17207676    1936069
    ## colorado              7.451  2548757     206818
    ## connecticut           8.182  1790952     149932
    ## delaware              7.140   411031      29289
    ## district of columbia  9.879   324717      32069
    ## florida               9.601  8659793     838766
    ## georgia              10.459  4430280     419911
    ## hawaii                7.798   613353      40660
    ## idaho                 8.549   724739      59130
    ## illinois              9.207  6152551     583545
    ## indiana               8.713  2985326     258672
    ## iowa                  5.463  1551418      85552
    ## kansas                5.312  1404166      90010
    ## kentucky              9.689  1931987     172655
    ## louisiana             8.485  1930627     141619
    ## maine                 8.102   661097      48421
    ## maryland              7.596  2878238     199061
    ## massachusetts         6.958  3229570     220111
    ## michigan             10.186  4352379     415137
    ## minnesota             6.196  2781210     166719
    ## mississippi          11.463  1254891     126064
    ## missouri              8.074  2833175     223044
    ## montana               6.258   474092      30891
    ## nebraska              3.831   945655      39875
    ## nevada               10.829  1284463     164161
    ## new hampshire         5.303   691914      36756
    ## new jersey            9.658  4275501     400955
    ## new mexico            7.215   867114      62206
    ## new york              8.433  8904396     754669
    ## north carolina       10.923  4358332     441871
    ## north dakota          3.711   362494      11879
    ## ohio                  8.675  5418135     433022
    ## oklahoma              5.628  1666113      95881
    ## oregon                9.938  1858861     167279
    ## pennsylvania          7.853  5990602     468396
    ## puerto rico          16.998  1191293     178279
    ## rhode island         10.219   522782      58101
    ## south carolina       11.511  2015339     196498
    ## south dakota          4.895   418020      18358
    ## tennessee             9.908  2917991     250935
    ## texas                 6.994 11705231     881706
    ## utah                  6.894  1250943      77664
    ## vermont               5.619   335339      17288
    ## virginia              7.241  4044809     243252
    ## washington            9.538  3266361     282989
    ## west virginia         8.359   749590      56211
    ## wisconsin             7.699  2863893     206440
    ## wyoming               5.580   286014      15994