The 5th IEEE International Conference on Data Science and Advanced Analytics

1–4 October 2018
Turin, Italy

Tutorials

NDlib: Modelling and Analyzing Diffusion Processes over Complex Networks

Abstract:

Nowadays the analysis of dynamics of and on networks represents a hot topic in the Social Network Analysis playground. To support students, teachers, developers and researchers, we developed a novel framework, named NDlib, an environment designed to describe diffusion simulations. NDlib is designed to be a multi-level ecosystem that can be fruitfully used by different user segments. On top of NDlib, we designed a simulation server that allows remote execution of experiments, as well as an online visualization tool that abstracts its programmatic interface and makes the simulation platform available to non-technical users.
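
As a flavour of the library, the sketch below runs a classic SIR diffusion model on a synthetic network using NDlib's Python package (ndlib). It is a minimal, assumed example based on the documented quickstart; exact module paths may differ slightly across ndlib releases.

    # Minimal NDlib sketch (assumes the Python packages "ndlib" and "networkx"
    # are installed; module paths follow recent ndlib releases).
    import networkx as nx
    import ndlib.models.epidemics as ep
    from ndlib.models.ModelConfig import Configuration

    g = nx.erdos_renyi_graph(1000, 0.01)           # synthetic contact network

    model = ep.SIRModel(g)                         # classic SIR diffusion model
    config = Configuration()
    config.add_model_parameter("beta", 0.01)       # infection probability
    config.add_model_parameter("gamma", 0.005)     # recovery probability
    config.add_model_parameter("fraction_infected", 0.05)
    model.set_initial_status(config)

    iterations = model.iteration_bunch(200)        # run 200 simulation steps
    trends = model.build_trends(iterations)        # aggregate S/I/R counts over time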

Presenters:

Giulio Rossetti, KDD Lab. ISTI-CNR, Pisa, Italy
Letizia Milli, KDD Lab., University of Pisa
Salvatore Rinzivillo, KDD Lab. ISTI-CNR, Pisa, Italy

Project Management for Data Science

Abstract:

This short tutorial introduces the fundamental principles and best practices of project management; it specifically targets data science projects, including those that involve supervised machine learning (ML) and natural language processing (NLP) elements. The view presented is broadly consistent with traditional project management methodology as taught by the Project Management Institute (PMI), but adapts and extends it where necessary and beneficial for the data science context, which is more experimental and data-driven (and thus open-ended and uncertain) than, say, a bridge construction project. By including practical aspects such as interfacing between technologists and business stakeholders in industry projects and integrating ethics and privacy considerations, we hope to provide a holistic, useful and practically applicable account.

Presenter:

Jochen L. Leidner, Thomson Reuters Research & Development and University of Sheffield, United Kingdom

Reproducible research using lessons learned from software development

Abstract:

A cornerstone of data-driven empirical research is reproducibility. The credibility of an analysis or a forecasting system rests on the promise that the entire analysis process can be reproduced by an independent party, yielding similar results. Modern data scientists are faced with the challenge of maintaining the reproducibility of their results while the software infrastructure required to compute and adequately present these results is becoming increasingly complex. This tutorial is geared towards novice and intermediate data scientists who want to improve the reproducibility of their results. To this end, well-established tools and procedures from software development are applied to a data analysis workflow to improve the reproducibility of the results. In particular, the tutorial will cover the use of markdown and the ‘knitr’ package for combining code, results, and description (literate programming); ‘make’ for organizing and automating complex build processes in a data analysis; git for version control and collaboration; and finally the use of container technology (Docker) to isolate an entire data analysis, including the underlying operating system. The tutorial will introduce each of the above-mentioned technologies for its particular role in making a data analysis more reproducible and is thus geared towards an audience with little or no experience in any or all of the required techniques. The example analyses will be written in R and Python.
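
The tutorial's worked examples are in R and Python; purely as an illustration in the same spirit (and not the presenter's material), the Python snippet below records the exact git commit and runtime environment next to an analysis result, so that an independent party can tell which code version produced it.

    # Illustrative only: one small habit in the spirit of the tutorial --
    # recording the exact code version and environment alongside each result.
    import json
    import platform
    import subprocess
    import sys

    def provenance():
        """Collect the git commit, Python version and platform for a result."""
        commit = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True
        ).stdout.strip()
        return {
            "git_commit": commit,
            "python": sys.version,
            "platform": platform.platform(),
        }

    # Write the provenance record as a sidecar file next to the analysis output.
    with open("provenance.json", "w") as fh:
        json.dump(provenance(), fh, indent=2)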

Presenter:

Kevin Kunzmann, University of Cambridge, MRC Biostatistics Unit, United Kingdom

Deep Learning for Computer Vision: A practitioner’s viewpoint

Abstract:

This tutorial will focus on Deep Learning for image classification, adopting a pragmatic perspective and dealing with data scarcity, a scenario where training models from scratch leads to overfitting. We are going to tackle these problems by learning from practical examples. We will show in code examples, using Jupyter notebooks, how to deal with model selection with an example dataset. We will show how the theory of approximation-generalization works in practice, by producing and interpreting the learning curves for different models and estimating the amount of data necessary to obtain a given performance. Finally, we will introduce the transfer learning technique and show how it allows to obtain better performance with less data and limited resources.
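
The abstract does not name a framework, so the following is a hedged sketch of the transfer learning idea using tf.keras: an ImageNet-pretrained backbone is frozen and only a small classification head is trained on the scarce data. The dataset paths and the choice of MobileNetV2 are placeholders, not the tutorial's actual setup.

    # Transfer learning sketch with tf.keras (TensorFlow 2.9+ assumed);
    # directory paths below are placeholders for a small labelled dataset.
    import tensorflow as tf

    train_ds = tf.keras.utils.image_dataset_from_directory(
        "data/train", image_size=(224, 224), batch_size=32)
    val_ds = tf.keras.utils.image_dataset_from_directory(
        "data/val", image_size=(224, 224), batch_size=32)

    # ImageNet-pretrained backbone, frozen so the scarce data only fits the head.
    base = tf.keras.applications.MobileNetV2(
        input_shape=(224, 224, 3), include_top=False, weights="imagenet")
    base.trainable = False

    inputs = tf.keras.Input(shape=(224, 224, 3))
    x = tf.keras.applications.mobilenet_v2.preprocess_input(inputs)
    x = base(x, training=False)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    outputs = tf.keras.layers.Dense(
        len(train_ds.class_names), activation="softmax")(x)
    model = tf.keras.Model(inputs, outputs)

    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    # The resulting history object yields the learning curves discussed above.
    history = model.fit(train_ds, validation_data=val_ds, epochs=5)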

Presenters:

André Panisson, ISI Foundation, Turin, Italy
Alan Perotti, ISI Foundation, Turin, Italy

Data Science Workflows Using R and Spark

Abstract:

This tutorial covers the data science process using R as a programming language and Spark as a big-data platform. Powerful workflows are developed for data extraction, data transformation and tidying, data modeling, and data visualization. During the course, R-based illustrations show how data are transported using REST APIs, sockets, etc. into persistent data stores such as the Hadoop Distributed File System (HDFS) and relational databases, and in some cases sent directly to Spark's real-time compute engine. Workflows using dplyr verbs are used for data manipulation within R, within relational databases (PostgreSQL), and within Spark using sparklyr. These data-based workflows extend into machine learning algorithms, model evaluation, and data visualization. The machine learning algorithms taught in this tutorial include supervised techniques such as linear regression, logistic regression, decision trees, gradient-boosted trees, and random forests. Feature selection is done primarily by regularization, and models are evaluated using various metrics. Unsupervised techniques include k-means clustering and dimension reduction. Big-data architectures are discussed, including the Docker containers used to build the tutorial infrastructure, called rspark.

See: https://github.com/jharner/rspark
The Docker containers can be run on the desktop, run using Vagrant, or deployed to Amazon Web Services (AWS). As a result, students will have access to a full big-data computing platform and extensive course content.
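
The tutorial's own code is written in R with dplyr and sparklyr; to keep the code examples in this document in a single language, the sketch below shows the analogous extract-transform-model workflow in PySpark (Spark's Python API). It is a substitute illustration, not the tutorial's material, and the file path and column names are placeholders.

    # Rough PySpark equivalent of the R/sparklyr workflow idea: read, transform, model.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("workflow-sketch").getOrCreate()

    # Extract: load a CSV (could equally come from HDFS or a relational store).
    df = spark.read.csv("hdfs:///data/flights.csv", header=True, inferSchema=True)

    # Transform/tidy: dplyr-style verbs map onto Spark DataFrame operations.
    tidy = (df
            .filter(F.col("dep_delay").isNotNull())
            .withColumn("late", (F.col("arr_delay") > 15).cast("double")))

    # Model: assemble features and fit a logistic regression with Spark ML.
    features = VectorAssembler(inputCols=["dep_delay", "distance"],
                               outputCol="features").transform(tidy)
    train, test = features.randomSplit([0.8, 0.2], seed=42)
    model = LogisticRegression(labelCol="late").fit(train)
    print(model.evaluate(test).areaUnderROC)   # one of several evaluation metrics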

Presenter:

E. James Harner, West Virginia University, USA

Kernel Methods for Machine Learning and Data Science

Abstract:

This tutorial is presented at an intermediate level and seeks to explore the variety of ways in which modern kernel methods use the concept of similarity measures in Reproducing Kernel Hilbert Spaces (RKHS) to extend, strengthen and enrich existing learning machines while creating brand new ones. Techniques such as kernel PCA, which extends the ubiquitous method of principal component analysis, are presented, as well as spectral clustering, kernel k-means, and the whole suite of kernel regression techniques, from radial basis function regression to relevance vector machine regression, support vector regression and Gaussian process regression, just to name a few. Examples involving small, medium and extremely large (big data) datasets are used to illustrate the methods. The software environment used is RStudio.
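
The tutorial's environment is R/RStudio; purely for illustration, and in the single example language used throughout this document, the scikit-learn sketch below applies a few of the techniques named above (kernel PCA, RBF support vector regression and Gaussian process regression) to a toy dataset.

    # Illustration only (the tutorial itself uses R / RStudio): a few of the
    # kernel methods named above, sketched with scikit-learn on toy data.
    import numpy as np
    from sklearn.decomposition import KernelPCA
    from sklearn.svm import SVR
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(200, 1))
    y = np.sin(X).ravel() + 0.1 * rng.standard_normal(200)

    # Kernel PCA: nonlinear feature extraction in the RKHS induced by an RBF kernel.
    X_kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.5).fit_transform(X)
    print(X_kpca.shape)                      # (200, 2)

    # Support vector regression with the same kernel family.
    svr = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X, y)

    # Gaussian process regression with an RBF covariance function.
    gpr = GaussianProcessRegressor(kernel=RBF(length_scale=1.0)).fit(X, y)

    X_new = np.linspace(-3, 3, 5).reshape(-1, 1)
    print(svr.predict(X_new))
    print(gpr.predict(X_new, return_std=True))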

Presenter:

Ernest Fokoué, School of Mathematical Sciences, Rochester Institute of Technology, USA