The 5th IEEE International Conference on Data Science and Advanced Analytics

1–4 October 2018
Turin, Italy

Tutorials

NDlib: Modelling and Analyzing Diffusion Processes over Complex Networks

Abstract:

Nowadays the analysis of dynamics of and on networks represents a hot topic in the Social Network Analysis playground. To support students, teachers, developers and researchers, we developed a novel framework, named NDlib, an environment designed to describe diffusion simulations. NDlib is designed to be a multi-level ecosystem that can be fruitfully used by different user segments. On top of NDlib, we designed a simulation server that allows remote execution of experiments, as well as an online visualization tool that abstracts its programmatic interface and makes the simulation platform available to non-technical users.
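
As a flavour of the library, the sketch below runs a classic SIR diffusion model on a synthetic network using NDlib's Python package (ndlib). It is a minimal, assumed example based on the documented quickstart; exact module paths may differ slightly across ndlib releases.

    # Minimal NDlib sketch (assumes the Python packages "ndlib" and "networkx"
    # are installed; module paths follow recent ndlib releases).
    import networkx as nx
    import ndlib.models.epidemics as ep
    from ndlib.models.ModelConfig import Configuration

    g = nx.erdos_renyi_graph(1000, 0.01)           # synthetic contact network

    model = ep.SIRModel(g)                         # classic SIR diffusion model
    config = Configuration()
    config.add_model_parameter("beta", 0.01)       # infection probability
    config.add_model_parameter("gamma", 0.005)     # recovery probability
    config.add_model_parameter("fraction_infected", 0.05)
    model.set_initial_status(config)

    iterations = model.iteration_bunch(200)        # run 200 simulation steps
    trends = model.build_trends(iterations)        # aggregate S/I/R counts over time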

Presenters:

Giulio Rossetti, KDD Lab. ISTI-CNR, Pisa, Italy
Letizia Milli, KDD Lab., University of Pisa
Salvatore Rinzivillo, KDD Lab. ISTI-CNR, Pisa, Italy

Project Management for Data Science

Abstract:

This short tutorial introduces the fundamental principles and best practices of project management; it specifically targets data science projects, including those that involve supervised machine learning (ML) and natural language processing (NLP) elements. The view presented is broadly consistent with traditional project management methodology as taught by the Project Management Institute (PMI), but adapts and extends it where necessary and beneficial for the data science context, which is more experimental and data-driven (and thus open-ended and uncertain) than, say, a bridge construction project. By including practical aspects such as interfacing between technologists and business stakeholders in industry projects and integrating ethics and privacy considerations, we hope to provide a holistic, useful and practically applicable account.

Presenter:

Jochen L. Leidner, Thomson Reuters Research & Development and University of Sheffield, United Kingdom

Reproducible research using lessons learned from software development

Abstract:

A cornerstone of data-driven empirical research is reproducibility. The credibility of an analysis or a forecasting system rests on the promise that the entire analysis process can be reproduced by an independent party, yielding similar results. Modern data scientists are faced with the challenge of maintaining the reproducibility of their results while the software infrastructure required to compute and adequately present these results is becoming increasingly complex. This tutorial is geared towards novice and intermediate data scientists who want to improve the reproducibility of their results. To this end, well-established tools and procedures from software development are applied to a data analysis workflow to improve the reproducibility of the results. In particular, the tutorial will cover the use of markdown and the ‘knitr’ package for combining code, results, and description (literate programming); ‘make’ for organizing and automating complex build processes in a data analysis; git for version control and collaboration; and finally the use of container technology (Docker) to isolate an entire data analysis, including the underlying operating system. The tutorial will introduce each of the above-mentioned technologies for its particular role in making a data analysis more reproducible and is thus geared towards an audience with little or no experience in any or all of the required techniques. The example analyses will be written in R and Python.
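
The tutorial's worked examples are in R and Python; purely as an illustration in the same spirit (and not the presenter's material), the Python snippet below records the exact git commit and runtime environment next to an analysis result, so that an independent party can tell which code version produced it.

    # Illustrative only: one small habit in the spirit of the tutorial --
    # recording the exact code version and environment alongside each result.
    import json
    import platform
    import subprocess
    import sys

    def provenance():
        """Collect the git commit, Python version and platform for a result."""
        commit = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True
        ).stdout.strip()
        return {
            "git_commit": commit,
            "python": sys.version,
            "platform": platform.platform(),
        }

    # Write the provenance record as a sidecar file next to the analysis output.
    with open("provenance.json", "w") as fh:
        json.dump(provenance(), fh, indent=2)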

Presenter:

Kevin Kunzmann, University of Cambridge, MRC Biostatistics Unit, United Kingdom

Deep Learning for Computer Vision: A practitioner’s viewpoint

Abstract:

This tutorial will focus on Deep Learning for image classification, adopting a pragmatic perspective and dealing with data scarcity, a scenario where training models from scratch leads to overfitting. We are going to tackle these problems by learning from practical examples. We will show in code examples, using Jupyter notebooks, how to deal with model selection with an example dataset. We will show how the theory of approximation-generalization works in practice, by producing and interpreting the learning curves for different models and estimating the amount of data necessary to obtain a given performance. Finally, we will introduce the transfer learning technique and show how it allows to obtain better performance with less data and limited resources.
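
The abstract does not name a framework, so the following is a hedged sketch of the transfer learning idea using tf.keras: an ImageNet-pretrained backbone is frozen and only a small classification head is trained on the scarce data. The dataset paths and the choice of MobileNetV2 are placeholders, not the tutorial's actual setup.

    # Transfer learning sketch with tf.keras (TensorFlow 2.9+ assumed);
    # directory paths below are placeholders for a small labelled dataset.
    import tensorflow as tf

    train_ds = tf.keras.utils.image_dataset_from_directory(
        "data/train", image_size=(224, 224), batch_size=32)
    val_ds = tf.keras.utils.image_dataset_from_directory(
        "data/val", image_size=(224, 224), batch_size=32)

    # ImageNet-pretrained backbone, frozen so the scarce data only fits the head.
    base = tf.keras.applications.MobileNetV2(
        input_shape=(224, 224, 3), include_top=False, weights="imagenet")
    base.trainable = False

    inputs = tf.keras.Input(shape=(224, 224, 3))
    x = tf.keras.applications.mobilenet_v2.preprocess_input(inputs)
    x = base(x, training=False)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    outputs = tf.keras.layers.Dense(
        len(train_ds.class_names), activation="softmax")(x)
    model = tf.keras.Model(inputs, outputs)

    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    # The resulting history object yields the learning curves discussed above.
    history = model.fit(train_ds, validation_data=val_ds, epochs=5)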

Presenters:

André Panisson, ISI Foundation, Turin, Italy
Alan Perotti, ISI Foundation, Turin, Italy

Data Science Workflows Using R and Spark

Abstract:

This tutorial covers the data science process using R as a programming language and Spark as a big-data platform. Powerful workflows are developed for data extraction, data transformation and tidying, data modeling, and data visualization. During the course, R-based illustrations show how data are transported using REST APIs, sockets, etc. into persistent data stores such as the Hadoop Distributed File System (HDFS) and relational databases, and in some cases sent directly to Spark's real-time compute engine. Workflows using dplyr verbs are used for data manipulation within R, within relational databases (PostgreSQL), and within Spark using sparklyr. These data-based workflows extend into machine learning algorithms, model evaluation, and data visualization. The machine learning algorithms taught in this tutorial include supervised techniques such as linear regression, logistic regression, decision trees, gradient-boosted trees, and random forests. Feature selection is done primarily by regularization, and models are evaluated using various metrics. Unsupervised techniques include k-means clustering and dimension reduction. Big-data architectures are discussed, including the Docker containers used to build the tutorial infrastructure, called rspark.

See: https://github.com/jharner/rspark
The Docker containers can be run on the desktop, run using Vagrant, or deployed to Amazon Web Services (AWS). As a result, students will have access to a full big-data computing platform and extensive course content.
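
The tutorial's own code is written in R with dplyr and sparklyr; to keep the code examples in this document in a single language, the sketch below shows the analogous extract-transform-model workflow in PySpark (Spark's Python API). It is a substitute illustration, not the tutorial's material, and the file path and column names are placeholders.

    # Rough PySpark equivalent of the R/sparklyr workflow idea: read, transform, model.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("workflow-sketch").getOrCreate()

    # Extract: load a CSV (could equally come from HDFS or a relational store).
    df = spark.read.csv("hdfs:///data/flights.csv", header=True, inferSchema=True)

    # Transform/tidy: dplyr-style verbs map onto Spark DataFrame operations.
    tidy = (df
            .filter(F.col("dep_delay").isNotNull())
            .withColumn("late", (F.col("arr_delay") > 15).cast("double")))

    # Model: assemble features and fit a logistic regression with Spark ML.
    features = VectorAssembler(inputCols=["dep_delay", "distance"],
                               outputCol="features").transform(tidy)
    train, test = features.randomSplit([0.8, 0.2], seed=42)
    model = LogisticRegression(labelCol="late").fit(train)
    print(model.evaluate(test).areaUnderROC)   # one of several evaluation metrics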

Presenter:

E. James Harner, West Virginia University, USA

Kernel Methods for Machine Learning and Data Science

Abstract:

This tutorial is presented at an intermediate level and seeks to explore the variety of ways in which modern kernel methods use the concept of similarity measures in Reproducing Kernel Hilbert Spaces (RKHS) to extend, strengthen and enrich existing learning machines while creating brand new ones. Techniques such as kernel PCA, which extends the ubiquitous method of principal component analysis, are presented, as well as spectral clustering, kernel k-means, and the whole suite of kernel regression techniques, from radial basis function regression to relevance vector machine regression, support vector regression and Gaussian process regression, just to name a few. Examples involving small, medium and extremely large (big data) datasets are used to illustrate the methods. The software environment used is RStudio.
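
The tutorial's environment is R/RStudio; purely for illustration, and in the single example language used throughout this document, the scikit-learn sketch below applies a few of the techniques named above (kernel PCA, RBF support vector regression and Gaussian process regression) to a toy dataset.

    # Illustration only (the tutorial itself uses R / RStudio): a few of the
    # kernel methods named above, sketched with scikit-learn on toy data.
    import numpy as np
    from sklearn.decomposition import KernelPCA
    from sklearn.svm import SVR
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(200, 1))
    y = np.sin(X).ravel() + 0.1 * rng.standard_normal(200)

    # Kernel PCA: nonlinear feature extraction in the RKHS induced by an RBF kernel.
    X_kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.5).fit_transform(X)
    print(X_kpca.shape)                      # (200, 2)

    # Support vector regression with the same kernel family.
    svr = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X, y)

    # Gaussian process regression with an RBF covariance function.
    gpr = GaussianProcessRegressor(kernel=RBF(length_scale=1.0)).fit(X, y)

    X_new = np.linspace(-3, 3, 5).reshape(-1, 1)
    print(svr.predict(X_new))
    print(gpr.predict(X_new, return_std=True))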

Presenter:

Ernest Fokoué, School of Mathematical Sciences, Rochester Institute of Technology, USA