SAMSI Co-Sponsored Event: Workshop on R & Spark – Tools for Data Science Workflows


This two-day workshop was made possible by SAMSI and the National Institute of Statistical Sciences (NISS).


R is a flexible, extensible statistical computing environment, but by default it runs on a single core. Spark is a distributed computing environment that treats R as a first-class programming language. This course introduced data structures in R and their use in functional programming workflows relevant to data science.

The course covered the initial steps in the data science process, i.e., extract, transform, load (ETL):

  • extracting data from source systems
  • transforming data into a tidy form
  • loading data into distributed file systems, distributed data warehouses, and NoSQL databases
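
The three steps listed above can be sketched in a few lines. This is only an illustration and not the workshop's own material: it is written in Python rather than R, the CSV content and the table name `readings` are made up, and an in-memory SQLite database stands in for the distributed stores mentioned in the list.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from a source system (made-up data, wide format).
RAW_CSV = """station,2019,2020
A,10.5,11.0
B,,9.25
"""

def extract(text):
    """Extract: read the raw rows from the CSV source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: reshape wide rows into tidy (station, year, value) records,
    one observation per record, skipping missing values."""
    tidy = []
    for row in rows:
        for year in ("2019", "2020"):
            if row[year] != "":
                tidy.append((row["station"], int(year), float(row[year])))
    return tidy

def load(records, conn):
    """Load: write the tidy records into a database table.
    SQLite is an assumption here, standing in for a distributed store."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings "
        "(station TEXT, year INTEGER, value REAL)"
    )
    conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")
tidy_records = transform(extract(RAW_CSV))
load(tidy_records, conn)
row_count = conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
print(row_count)
```

Keeping each step as its own function mirrors the extract, transform, load split, so any one stage could be swapped for a distributed equivalent without touching the others.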

SparkR and sparklyr were then used as interfaces for modeling big data with supervised learning methods for regression and classification. Unsupervised learning methods, such as clustering and dimension reduction, were also covered. Additional methods, such as gradient boosting and deep learning, were illustrated using the h2o and rsparkling R packages. Finally, methods for analyzing streaming data were presented. The course finished with an in-depth example. The infrastructure and content were containerized for easy download to students’ laptops.

The course was taught by E. James Harner, Professor Emeritus of Statistics at West Virginia University.

Questions: email