Stat 290: Paradigms for Computing with Data
Philosophy
Paradigms for Computing with Data presents computing tools and concepts for all stages of dealing with the modern data deluge: statistical computing at the center, but also the essential surrounding tasks, including data organization, presentation of results and the user interface. This approach is needed to deal with the challenges posed by modern technology, challenges that are also opportunities for better use of data. The size and complexity of data sources have increased enormously, while the importance of learning from the data has been recognized as never before. New modes of computing such as large-scale parallelism and cloud computing can help, but require new approaches to programming. The key challenge, however, is to use our own time effectively by choosing the best programming approach for each stage of a project.
To meet these challenges,
we present a range of computing paradigms and corresponding
languages, each designed for ease of use but also providing
a rich set of tools. We use the R language and the
thousands of packages written for it for core statistical
computing.
Other languages are
discussed for tasks where they excel. For example,
Python
provides a similarly strong language and a set of supporting
packages for data processing, scientific computing and
interactive interfaces. An approach through
inter-system interfaces and interactive front ends allows
you to add features from these languages without mastering
all the details. Alternatively, a solution can be
programmed largely in another system when appropriate, and
then made available in R. Object-oriented
programming techniques are particularly valuable. We discuss
these both in the functional form found in R and also in the
encapsulated form typical of languages such as Java and C++.
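As a rough illustration of the two object-oriented styles mentioned above, the following Python sketch contrasts an encapsulated method (defined inside the class, as is typical in Java or C++) with a generic function whose methods are registered per argument type (closer in spirit to the functional form found in R). The names Sample and summarize are hypothetical and chosen only for this example; this is not course material.

```python
from dataclasses import dataclass
from functools import singledispatch


@dataclass
class Sample:
    """A hypothetical container for numeric observations."""
    values: list

    # Encapsulated style (as in Java or C++): the method lives inside the class.
    def mean(self) -> float:
        return sum(self.values) / len(self.values)


# Functional style (closer in spirit to R generics): the generic function is
# defined outside any class, and methods are registered per argument type.
@singledispatch
def summarize(x) -> str:
    # Default method, used when no more specific method is registered.
    return f"object of type {type(x).__name__}"


@summarize.register
def _(x: Sample) -> str:
    return f"Sample with n={len(x.values)}, mean={x.mean():.2f}"


@summarize.register
def _(x: list) -> str:
    return f"list of length {len(x)}"
```

In the functional style, a new method for summarize can be added for an existing class without modifying that class, which is one reason this form suits interactive data analysis.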
Course Description
This course covers programming and computing techniques to support projects in data analysis and related research. It is suitable for students at the graduate level in statistics or in other fields where substantial data analysis and development of associated computational software is part of the student's research activity. Prerequisites are basic competence in computer programming and in statistics plus a serious interest in applying computing to data analysis. Experience with R will be an advantage.
The course will cover the major concepts in programming with R, emphasizing its use to implement and share research and applications of data analysis through R packages. It should be of interest to anyone involved in applying these tools in Statistical Computing, Bioinformatics and Data Mining.
To meet the challenges of modern data analysis, other languages and systems will be included to support data acquisition and management, data visualization and graphics, and user interfaces, especially via the Web. Python will be emphasized for its effective use in many of these areas. Techniques using Java and XML will be included. Discussions will emphasize inter-system interfaces. Examples will be drawn from bioinformatics, databases, and distributed and web-based data sources.
The course will include
four homework exercise sets. A main requirement is a
final project, either an R package or other software
contribution of similar scope. Students may choose
from a list of projects that will be provided or
propose a project. (The latter is subject to the
instructor's approval.) A proposal is expected by week
2 and a final decision by week 3.
Prerequisites
■Basic Statistics (at the level of Stat 110 or Stat 141)
Please note that hands-on programming will be necessary in this class.
Instructor
■TAs: TBA
In Winter 2015, Prof. Chambers will lecture during the first week of class.
Meeting Times and Location
■Monday, Wednesday, Friday, 10:00-10:50AM in Gates B3
Class Materials
Topics covered will include R programming at a fairly
advanced level, objects, database tools, data formats,
XML, XSL, JSON,
language interfaces to C, Python and Java, graphics,
parallel computing basics, and big data resources in
R. There will be one or two guest lectures, to be announced later.
Web Resources
See the coursework website: W14-STATS-290-01
Textbooks
There is no textbook that covers all the material for this class. The R-related sections will use the book Software for Data Analysis by John Chambers, Springer (2008), along with Advanced R by Hadley Wickham, CRC Press (2014), and XML and Web Technologies for Data Sciences with R by Deborah Nolan and Duncan Temple Lang, Springer (2014). Other books and online resources for material on Python, bioinformatics, XML, graphics and databases will be listed on this page. These include:
■ggplot2: Elegant Graphics for Data Analysis, Hadley Wickham, Springer, 2009.
■R Programming for Bioinformatics by R. Gentleman, Chapman and Hall, 2008.
■Learning Python, Mark Lutz, O'Reilly.
■Parallel R, O'Reilly.
■Advanced R Development (forthcoming) by Hadley Wickham. See Advanced R Wiki
■Visualizing Data, Ben Fry, O'Reilly.