SAMSI Co-Sponsored Event: Workshop on R & Spark – Tools for Data Science Workflows


Space is limited – Register using button below:


This two-day workshop is being made possible by SAMSI and the National Institute of Statistical Sciences (NISS).

Directions: The address is 79 T. W. Alexander Drive in Research Triangle Park, in the Research Commons complex. Heading west on T. W. Alexander from the Durham Freeway (Route 147), take the third left-hand entrance into the complex. Park in the first lot you see, in front of building 4501; there is a red sign saying MEMA on top of the building. SAMSI is on the third floor.


R is a flexible, extensible statistical computing environment, but by default it is limited to single-core execution. Spark is a distributed computing environment that treats R as a first-class programming language. This course introduces data structures in R and their use in functional programming workflows relevant to data science.

The course covers the initial steps in the data science process, i.e., extract, transform, load (ETL):

  • extracting data from source systems
  • transforming data into tidy form
  • loading data into distributed file systems, distributed data warehouses, and NoSQL databases
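
As a rough illustration of these three steps, here is a minimal sketch. It is not workshop material: the workshop itself works in R, and this example uses Python only to show the idea. The sample data, the column names, and the `readings` table are all hypothetical, and an in-memory SQLite database stands in for the distributed stores named above.

```python
import csv
import io
import sqlite3

# Hypothetical wide-format source data: one column per year.
RAW_CSV = """station,2015,2016
A,1.5,2.0
B,3.0,
"""


def extract(text):
    """Extract: read CSV text into a list of dicts, one per source row."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: reshape wide rows into tidy (station, year, value) records.

    Each output record holds exactly one observation; empty cells are dropped.
    """
    tidy = []
    for row in rows:
        for key, value in row.items():
            if key == "station" or value is None or value == "":
                continue
            tidy.append((row["station"], int(key), float(value)))
    return tidy


def load(records, conn):
    """Load: write tidy records into a table (SQLite is a stand-in here)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings "
        "(station TEXT, year INTEGER, value REAL)"
    )
    conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
rows = conn.execute(
    "SELECT station, year, value FROM readings ORDER BY station, year"
).fetchall()
print(rows)
```

Keeping the three steps as separate functions means the load target could be swapped for another store without touching the extract or transform logic.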

SparkR and sparklyr are then used as interfaces for modeling big data with supervised learning methods for regression and classification. Unsupervised learning methods, such as clustering and dimension reduction, are also covered. Additional methods, such as gradient boosting and deep learning, are illustrated using the h2o and rsparkling R packages. Finally, methods for analyzing streaming data are presented. The course finishes with an in-depth example. The infrastructure and content are containerized with Docker for easy download to your laptop.

The course will be taught by E. James Harner, Professor Emeritus of Statistics at West Virginia University.

Questions: email