py4sci

Table Of Contents

Previous topic

Visual summaries

Next topic

Multidimensional arrays

This Page

Multidimensional scaling

The following Olympic example shows the relation between MDS and PCA when the similarity matrix is a Euclidean one.

Olympic data

    library(ade4)

    ## Attaching package: 'ade4'

    ## The following object(s) are masked from 'package:base':
    ##
    ## within

    data(olympic)
    pca.olympic = princomp(olympic$tab)
    plot(pca.olympic$scores[, 1], pca.olympic$scores[, 2], pch = 23, bg = "red",
        xlab = "Score 1", ylab = "Score 2")

_images/mds_fig_00.png

Here are the MDS scores which seem to have an axis flipped.

    X = as.matrix(olympic$tab)
    similarity = X %*% t(X)
    n = nrow(olympic$tab)
    D = sqrt((outer(rep(1, n), diag(similarity)) - 2 * similarity + outer(diag(similarity),
        rep(1, n))))
    D = as.dist(D)
    MDS = cmdscale(D)
    plot(MDS[, 1], MDS[, 2], pch = 23, bg = "red", xlab = "Score 1", ylab = "Score 2")

_images/mds_fig_01.png
    plot(MDS[, 1], pca.olympic$scores[, 1], pch = 23, bg = "red")

_images/mds_fig_02.png
    plot(MDS[, 2], pca.olympic$scores[, 2], pch = 23, bg = "red")

_images/mds_fig_03.png

This exact relation between the MDS coordinates and PCA coordinates depends on the fact that we are using the Euclidean distance between cases. Let’s try the Manhattan distance.

    D_L1 = dist(X, "manhattan")
    MDS_L1 = cmdscale(D_L1)
    plot(MDS_L1[, 1], MDS_L1[, 2], pch = 23, bg = "red")

_images/mds_fig_04.png

On inspection, it is not the same as before (which is not surprising). We see the scores are correlated, but not perfectly as before.

    plot(MDS[, 1], MDS_L1[, 1], pch = 23, bg = "red")

_images/mds_fig_05.png
    plot(MDS[, 2], MDS_L1[, 2], pch = 23, bg = "red")

_images/mds_fig_06.png

U.S. capitals

I formed a file of the distance matrix for the 50 U.S. capitals using Google’s map API following this example.

In[7]:

import os, urllib, time, random

Rather than type state capitals, I copied a list

In[8]:

capitals_str = """
Alabama - Montgomery<br>
Alaska - Juneau<br>
Arizona - Phoenix<br>
Arkansas - Little Rock<br>
California - Sacramento<br>
Colorado - Denver<br>
Connecticut - Hartford<br>
Delaware - Dover<br>
Florida - Tallahassee<br>
Georgia - Atlanta<br>
Hawaii - Honolulu<br>
Idaho - Boise<br>
Illinois - Springfield<br>
Indiana - Indianapolis<br>
Iowa - Des Moines<br>
Kansas - Topeka<br>
Kentucky - Frankfort<br>
Louisiana - Baton Rouge<br>
Maine - Augusta<br>
Maryland - Annapolis<br>
Massachusetts - Boston<br>
Michigan - Lansing<br>
Minnesota - St. Paul<br>
Mississippi - Jackson<br>
Missouri - Jefferson City<br>
Montana - Helena<br>
Nebraska - Lincoln<br>
Nevada - Carson City<br>
New Hampshire - Concord<br>
New Jersey - Trenton<br>
New Mexico - Santa Fe<br>
New York - Albany<br>
North Carolina - Raleigh<br>
North Dakota - Bismarck<br>
Ohio - Columbus<br>
Oklahoma - Oklahoma City<br>
Oregon - Salem<br>
Pennsylvania - Harrisburg<br>
Rhode Island - Providence<br>
South Carolina - Columbia<br>
South Dakota - Pierre<br>
Tennessee - Nashville<br>
Texas - Austin<br>
Utah - Salt Lake City<br>
Vermont - Montpelier<br>
Virginia - Richmond<br>
Washington - Olympia<br>
West Virginia - Charleston<br>
Wisconsin - Madison<br>
Wyoming - Cheyenne
"""

state_dict = {
        'AK': 'Alaska',
        'AL': 'Alabama',
        'AR': 'Arkansas',
        'AS': 'American Samoa',
        'AZ': 'Arizona',
        'CA': 'California',
        'CO': 'Colorado',
        'CT': 'Connecticut',
        'DC': 'District of Columbia',
        'DE': 'Delaware',
        'FL': 'Florida',
        'GA': 'Georgia',
        'GU': 'Guam',
        'HI': 'Hawaii',
        'IA': 'Iowa',
        'ID': 'Idaho',
        'IL': 'Illinois',
        'IN': 'Indiana',
        'KS': 'Kansas',
        'KY': 'Kentucky',
        'LA': 'Louisiana',
        'MA': 'Massachusetts',
        'MD': 'Maryland',
        'ME': 'Maine',
        'MI': 'Michigan',
        'MN': 'Minnesota',
        'MO': 'Missouri',
        'MP': 'Northern Mariana Islands',
        'MS': 'Mississippi',
        'MT': 'Montana',
        'NA': 'National',
        'NC': 'North Carolina',
        'ND': 'North Dakota',
        'NE': 'Nebraska',
        'NH': 'New Hampshire',
        'NJ': 'New Jersey',
        'NM': 'New Mexico',
        'NV': 'Nevada',
        'NY': 'New York',
        'OH': 'Ohio',
        'OK': 'Oklahoma',
        'OR': 'Oregon',
        'PA': 'Pennsylvania',
        'PR': 'Puerto Rico',
        'RI': 'Rhode Island',
        'SC': 'South Carolina',
        'SD': 'South Dakota',
        'TN': 'Tennessee',
        'TX': 'Texas',
        'UT': 'Utah',
        'VA': 'Virginia',
        'VI': 'Virgin Islands',
        'VT': 'Vermont',
        'WA': 'Washington',
        'WI': 'Wisconsin',
        'WV': 'West Virginia',
        'WY': 'Wyoming'
       }

Next, I will form a list of capitals that google map will be able to return the coordinates of.

In[9]:

capitals_str = capitals_str.replace("<br>","").strip()
continental_50 = []
for k in state_dict:
    if state_dict[k] in capitals_str:
        continental_50.append(k)
        # this fails for Indianopolis unfortunately
        capitals_str = capitals_str.replace(state_dict[k], k)

capitals = [','.join(c.split('-')[::-1]) for c in capitals_str.split('\n')]

Retrieve the results in blocks of 25. We do this because google restricts to 100 results every 10 seconds, 2500 results per 24 hours so we sleep for at least 15 seconds.

We don’t need to retrieve the whole square matrix as it’s symmetric, we retrieve it in 5x5 blocks

In[10]:

def get_all_distances():

    if not os.path.exists(os.path.join('..', 'distances')):
        os.makedirs(os.path.join("..", "distances"))

    for i in range(10):
        for j in range(i+1):
            origins = "|".join(capitals[(i*5):(i+1)*5])
            destinations = "|".join(capitals[(j*5):(j+1)*5])
            url = "http://maps.googleapis.com/maps/api/distancematrix/json?origins=%s&destinations=%s&sensor=false" % (origins, destinations)
            # form a proper url
            url = url.replace(' ','+')
            fname = os.path.join('..', "distances", "results_%d%d.json" % (i,j))
            if not os.path.exists(fname):
                results = urllib.urlopen(url).read()
                f = file(fname, 'w')
                f.write(results)
                f.close()
                time.sleep(random.randint(15,25))
                print i, j

get_all_distances()

Now, we parse all the resulting JSON files

In[11]:

from glob import glob
import simplejson
import numpy as np, matplotlib.mlab as ML

def make_data_matrix():
    distance = {}
    duration = {}
    for f in glob("../distances/*json"):
        results = simplejson.load(file(f))
        for o, row in zip(results['origin_addresses'], results['rows']):
            for d, col in zip(results['destination_addresses'], row['elements']):
                o_st = o.split(',')[-2].strip().upper().split(' ')[0]
                d_st = d.split(',')[-2].strip().upper().split(' ')[0]
                distance[(o_st,d_st)] = float(col['distance']['value'])
                distance[(d_st,o_st)] = distance[(o_st,d_st)]

                duration[(o_st,d_st)] = float(col['duration']['value'])
                duration[(d_st,o_st)] = duration[(o_st,d_st)]

    states = sorted(continental_50)
    distance_matrix = np.empty(50, [('state', '|S4')] + [(r, np.float) for r in states])
    duration_matrix = np.empty(50, [('state', '|S4')] + [(r, np.float) for r in states])

    for i, ostate in enumerate(states):
        distance_matrix['state'][i] = ostate
        duration_matrix['state'][i] = ostate
        for dstate in states:
            distance_matrix[dstate][i] = distance[(ostate, dstate)]
            duration_matrix[dstate][i] = duration[(ostate, dstate)]
    ML.rec2csv(distance_matrix, '../data/capital_distance.csv')
    ML.rec2csv(duration_matrix, '../data/capital_duration.csv')

make_data_matrix()

Let’s make a map that we can view to compare our MDS plots to. First, we form a URL, which we paste into an img HTML tag as src below.

In[12]:

url = """http://maps.googleapis.com/maps/api/staticmap?markers=color:blue|%(capitals)s&sensor=false&size=640x640&zoom=3""" % {'capitals':'|'.join(capitals[:50]).replace(' ','')}
html = '<img src="%s">' % url
html

Out[12]:

'<img src="http://maps.googleapis.com/maps/api/staticmap?markers=color:blue|Montgomery,AL|Juneau,AK|Phoenix,AZ|LittleRock,AR|Sacramento,CA|Denver,CO|Hartford,CT|Dover,DE|Tallahassee,FL|Atlanta,GA|Honolulu,HI|Boise,ID|Springfield,IL|INpolis,IN|DesMoines,IA|Topeka,KS|Frankfort,KY|BatonRouge,LA|Augusta,ME|Annapolis,MD|Boston,MA|Lansing,MI|St.Paul,MN|Jackson,MS|JeffersonCity,MO|Helena,MT|Lincoln,NE|CarsonCity,NV|Concord,NH|Trenton,NJ|SantaFe,NM|Albany,NY|Raleigh,NC|Bismarck,ND|Columbus,OH|OKCity,OK|Salem,OR|Harrisburg,PA|Providence,RI|Columbia,SC|Pierre,SD|Nashville,TN|Austin,TX|SaltLakeCity,UT|Montpelier,VT|Richmond,VA|Olympia,WA|Charleston,WV|Madison,WI|Cheyenne,WY&sensor=false&size=640x640&zoom=3">'
Map of U.S. capitals

U.S. Capitals

    capitals = read.table("http://stats202.stanford.edu/data/capital_distance.csv",
        header = TRUE, sep = ",")
    MDS = cmdscale(as.dist(capitals[, -1]))
    plot(MDS, type = "n")
    text(MDS[, 1], MDS[, 2], names(capitals[, -1]), xlab = "", ylab = "")

_images/mds_fig_07.png

Let’s correct this East-West swap.

    plot(MDS, type = "n")
    text(-MDS[, 1], MDS[, 2], names(capitals[, -1]), xlab = "", ylab = "")

_images/mds_fig_08.png

Not perfect, but not too bad.

Iris data

Let’s take another look at how different distances affect the MDS coordinates.

    iris = read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
        sep = ",", header = FALSE)
    names(iris) = c("sepal.length", "sepal.width", "petal.length", "petal.width",
        "iris.type")

First, we use Euclidean distance, yielding the usual (unstandardized) PCA scores.

    D = dist(iris[, -5], method = "euclidean")
    MDS = cmdscale(D)
    plot(MDS[, 1], MDS[, 2], pch = 21, bg = c("red", "green", "blue")[unclass(iris$iris.type)])

_images/mds_fig_09.png

How about \ell_1 or Manhattan?

    D = dist(iris[, -5], method = "manhattan")
    MDS = cmdscale(D)
    plot(MDS[, 1], MDS[, 2], pch = 21, bg = c("red", "green", "blue")[unclass(iris$iris.type)])

_images/mds_fig_10.png

And the \ell_{\infty} or max norm?

    D = dist(iris[, -5], method = "maximum")
    MDS = cmdscale(D)
    plot(MDS[, 1], MDS[, 2], pch = 21, bg = c("red", "green", "blue")[unclass(iris$iris.type)])

_images/mds_fig_11.png

Just for fun, let’s try some Minkowski distances. With p=20, this should look close to \ell_{\infty}

    D = dist(iris[, -5], method = "minkowski", p = 20)
    MDS = cmdscale(D)
    plot(MDS[, 1], MDS[, 2], pch = 21, bg = c("red", "green", "blue")[unclass(iris$iris.type)])

_images/mds_fig_12.png

For small p, it should look closer to \ell_1. We can even take p < 1.

    D = dist(iris[, -5], method = "minkowski", p = 0.2)
    MDS = cmdscale(D)
    plot(MDS[, 1], MDS[, 2], pch = 21, bg = c("red", "green", "blue")[unclass(iris$iris.type)])

_images/mds_fig_13.png

In short, it seems easy to distinguish setosa from virginica and versicolor but these two may be harder to separate.