Getting started with R

In this example, we will load some data into R, and “explore it”, at least in a simple sense.

Loading some data

In class, I will sometimes be using the IPython notebook to run R, which has to be enabled with this magic.

The data is in an R package called ggplot2. Load the package, and then the dataset, with these commands (if the package is not installed yet, run install.packages("ggplot2") first):

    library(ggplot2)
    data(diamonds)
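If `library(ggplot2)` stops with an error saying that there is no package called ggplot2, the package has probably not been installed yet. Here is a minimal sketch of an "install only if missing" step. It assumes you have internet access, and the `repos` URL is just one CRAN mirror chosen for illustration.

```r
# Install ggplot2 only if it is not already available, then load it.
# The repos URL below is an assumption; any CRAN mirror should work.
if (!requireNamespace("ggplot2", quietly = TRUE)) {
    install.packages("ggplot2", repos = "https://cloud.r-project.org")
}
library(ggplot2)
data(diamonds)   # makes the diamonds data frame available by name

# Quick sanity check: number of rows, then number of columns
dim(diamonds)
```

The check with `requireNamespace` avoids reinstalling the package every time the script runs.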

To find out what information there is about the dataset, you can run this command:

    help(diamonds)

To find a more numeric summary of the data, try

    summary(diamonds)

    ##      carat              cut        color        clarity
    ##  Min.   :0.200   Fair     : 1610   D: 6775   SI1    :13065
    ##  1st Qu.:0.400   Good     : 4906   E: 9797   VS2    :12258
    ##  Median :0.700   Very Good:12082   F: 9542   SI2    : 9194
    ##  Mean   :0.798   Premium  :13791   G:11292   VS1    : 8171
    ##  3rd Qu.:1.040   Ideal    :21551   H: 8304   VVS2   : 5066
    ##  Max.   :5.010                     I: 5422   VVS1   : 3655
    ##                                    J: 2808   (Other): 2531
    ##      depth          table          price             x
    ##  Min.   :43.0   Min.   :43.0   Min.   :  326   Min.   : 0.00
    ##  1st Qu.:61.0   1st Qu.:56.0   1st Qu.:  950   1st Qu.: 4.71
    ##  Median :61.8   Median :57.0   Median : 2401   Median : 5.70
    ##  Mean   :61.8   Mean   :57.5   Mean   : 3933   Mean   : 5.73
    ##  3rd Qu.:62.5   3rd Qu.:59.0   3rd Qu.: 5324   3rd Qu.: 6.54
    ##  Max.   :79.0   Max.   :95.0   Max.   :18823   Max.   :10.74
    ##
    ##        y               z
    ##  Min.   : 0.00   Min.   : 0.00
    ##  1st Qu.: 4.72   1st Qu.: 2.91
    ##  Median : 5.71   Median : 3.53
    ##  Mean   : 5.73   Mean   : 3.54
    ##  3rd Qu.: 6.54   3rd Qu.: 4.04
    ##  Max.   :58.90   Max.   :31.80

To view another textual summary, try

    str(diamonds)

    ## 'data.frame':        53940 obs. of  10 variables:
    ##  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
    ##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
    ##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
    ##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
    ##  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
    ##  $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
    ##  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
    ##  $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
    ##  $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
    ##  $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
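The pieces of information that `str` prints can also be requested one at a time, which is handy when you want to use them in later code. The sketch below uses only base R functions on the same `diamonds` data.

```r
library(ggplot2)
data(diamonds)

dim(diamonds)          # number of rows, then number of columns
names(diamonds)        # column names, in order

# First class of each column (ordered factors have two classes,
# so we keep only the first to get one label per column)
sapply(diamonds, function(col) class(col)[1])

levels(diamonds$cut)   # categories of an ordered factor, lowest first
```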

To peek at the first few rows of the data, try

    head(diamonds)

    ##   carat       cut color clarity depth table price    x    y    z
    ## 1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
    ## 2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
    ## 3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
    ## 4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
    ## 5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
    ## 6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48

Or, if you want the 10th through 20th rows (inclusive):

    diamonds[10:20, ]

    ##    carat       cut color clarity depth table price    x    y    z
    ## 10  0.23 Very Good     H     VS1  59.4    61   338 4.00 4.05 2.39
    ## 11  0.30      Good     J     SI1  64.0    55   339 4.25 4.28 2.73
    ## 12  0.23     Ideal     J     VS1  62.8    56   340 3.93 3.90 2.46
    ## 13  0.22   Premium     F     SI1  60.4    61   342 3.88 3.84 2.33
    ## 14  0.31     Ideal     J     SI2  62.2    54   344 4.35 4.37 2.71
    ## 15  0.20   Premium     E     SI2  60.2    62   345 3.79 3.75 2.27
    ## 16  0.32   Premium     E      I1  60.9    58   345 4.38 4.42 2.68
    ## 17  0.30     Ideal     I     SI2  62.0    54   348 4.31 4.34 2.68
    ## 18  0.30      Good     J     SI1  63.4    54   351 4.23 4.29 2.70
    ## 19  0.30      Good     J     SI1  63.8    56   351 4.23 4.26 2.71
    ## 20  0.30 Very Good     J     SI1  62.7    59   351 4.21 4.27 2.66
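The part before the comma in `diamonds[rows, ]` can be any vector of row positions, not just a range. A few variations, as a sketch on the same data (the variable name `rows_10_to_20` is only for illustration):

```r
library(ggplot2)
data(diamonds)

head(diamonds, n = 3)        # first 3 rows instead of the default 6
tail(diamonds, n = 3)        # last 3 rows
diamonds[c(1, 10, 100), ]    # an arbitrary set of rows, chosen by position

# 10:20 is shorthand for the integer vector 10, 11, ..., 20
rows_10_to_20 <- diamonds[10:20, ]
nrow(rows_10_to_20)          # 11, because both endpoints are included
```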

You can access variables by name:

    summary(diamonds[, c("clarity", "price")])

    ##     clarity          price
    ##  SI1    :13065   Min.   :  326
    ##  VS2    :12258   1st Qu.:  950
    ##  SI2    : 9194   Median : 2401
    ##  VS1    : 8171   Mean   : 3933
    ##  VVS2   : 5066   3rd Qu.: 5324
    ##  VVS1   : 3655   Max.   :18823
    ##  (Other): 2531
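When you need a single variable as a plain vector rather than as a smaller data frame, `$` or `[[ ]]` does the job. A short sketch (the names `prices` and `same_prices` are made up for this example):

```r
library(ggplot2)
data(diamonds)

prices <- diamonds$price             # a plain integer vector
same_prices <- diamonds[["price"]]   # equivalent, with the name given as a string
identical(prices, same_prices)

median(prices)        # compare with the Median entry that summary() reports
table(diamonds$cut)   # counts per category for a factor
```

The `[[ ]]` form is useful when the column name is stored in a variable, since `$` only accepts a literal name.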

We might want a visual summary of some variables as well

    pairs(diamonds[, c("depth", "price")])

_images/diamonds_fig_00.png

Some of the variables are discrete, or categorical. A boxplot shows how price is distributed within each clarity level:

    boxplot(diamonds$price ~ diamonds$clarity)

_images/diamonds_fig_01.png
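The formula in `boxplot` can also be written with a `data` argument, which saves repeating `diamonds$`. If you want numbers to go with the picture, `tapply` computes a statistic within each group. This is a sketch; `median_by_clarity` is a name chosen for this example.

```r
library(ggplot2)
data(diamonds)

# Same boxes as before; the data argument tells R where to find
# the columns named in the formula.
boxplot(price ~ clarity, data = diamonds)

# Median price within each clarity level, as a named numeric vector
median_by_clarity <- tapply(diamonds$price, diamonds$clarity, median)
median_by_clarity
```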

You may want all rows for diamonds with a price higher than $4000:

    diamonds_more_than_4000 = diamonds[diamonds$price > 4000, ]
    head(diamonds_more_than_4000)

    ##      carat       cut color clarity depth table price    x    y    z
    ## 6212  1.07 Very Good     I     SI1  58.4    60  4001 6.68 6.78 3.93
    ## 6213  0.90     Ideal     G     SI1  61.6    57  4001 6.17 6.24 3.82
    ## 6214  0.90     Ideal     H     SI2  62.1    55  4001 6.17 6.20 3.84
    ## 6215  1.03      Good     G     SI2  63.7    60  4001 6.35 6.28 4.02
    ## 6216  0.80 Very Good     G    VVS2  62.5    56  4002 5.95 5.98 3.73
    ## 6217  0.99 Very Good     J     SI1  60.3    57  4002 6.44 6.49 3.90
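What makes this work is that `diamonds$price > 4000` is a logical vector with one TRUE or FALSE per row, and R keeps the rows where it is TRUE. The sketch below makes that step explicit and shows two common variations. The variable names are illustrative only.

```r
library(ggplot2)
data(diamonds)

expensive <- diamonds$price > 4000   # logical vector, one entry per row
sum(expensive)                       # TRUE counts as 1: number of matching rows

# Conditions can be combined with & (and) and | (or)
expensive_ideal <- diamonds[diamonds$price > 4000 & diamonds$cut == "Ideal", ]

# subset() is a base R alternative that looks up column names
# inside the data frame, so the diamonds$ prefix is not needed
expensive_again <- subset(diamonds, price > 4000)
nrow(expensive_again) == sum(expensive)
```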

To extract only the color, clarity, and price of these diamonds:

    color_clarity_more_than_4000 = diamonds[diamonds$price > 4000, c("color", "clarity",
        "price")]
    head(color_clarity_more_than_4000)

    ##      color clarity price
    ## 6212     I     SI1  4001
    ## 6213     G     SI1  4001
    ## 6214     H     SI2  4001
    ## 6215     G     SI2  4001
    ## 6216     G    VVS2  4002
    ## 6217     J     SI1  4002

Or, realizing that color and clarity are the 3rd and 4th columns and price is the 7th, we can find the same data with this command:

    color_clarity_more_than_4000 = diamonds[diamonds$price > 4000, c(3, 4, 7)]
    head(color_clarity_more_than_4000)

    ##      color clarity price
    ## 6212     I     SI1  4001
    ## 6213     G     SI1  4001
    ## 6214     H     SI2  4001
    ## 6215     G     SI2  4001
    ## 6216     G    VVS2  4002
    ## 6217     J     SI1  4002
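Counting columns by hand is easy to get wrong, so it can help to let R look the positions up by name. A minimal sketch using base R's `match` (the names `wanted`, `positions`, `by_position` and `by_name` are made up for this example):

```r
library(ggplot2)
data(diamonds)

# Find the positions of the columns we want, by name
wanted <- c("color", "clarity", "price")
positions <- match(wanted, names(diamonds))
positions   # 3 4 7

# Selecting by these positions picks the same columns as selecting by name
by_position <- diamonds[diamonds$price > 4000, positions]
by_name <- diamonds[diamonds$price > 4000, wanted]
names(by_position)
```

Selecting by name is usually the safer habit, since the code keeps working if the column order changes.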