Data Wrangling & Spark^2

Welcome to the next meeting with three exciting new talks! 

Join us for some of the latest tech buzz, good drinks and a pleasant get-together. 

The talks are:

Talk 1: “Data wrangling – The Key to Successful Data Science”
by Lars Grammel (Trifacta)

Talk 2: “Tips & tricks for making Spark lightning fast – selected case studies on caching and shuffle avoidance”

Talk 3: “Visual framework for Spark”
Talks 2 & 3 by Adam Jakubowski and Michael Szostek (deepsense.io)

We are also happy to visit the marvellous new venue #openspace this time.

Detailed Description Talk 1:

Data preparation can take up between 60% and 80% of the time spent on data science projects and is crucial to their success. This talk examines the different data preparation activities, such as data discovery, structuring, cleaning, enriching and validating, highlights the challenges involved, and explores how combinations of intelligent algorithms and user interfaces can speed up data preparation.
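
To give a flavour of those activities, here is a minimal sketch of structuring, cleaning and validating done with plain Spark DataFrames in Scala. It is not Trifacta’s approach; the data, column names and checks are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object WranglingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("data-wrangling-sketch")
      .master("local[*]")          // local mode, just for the sketch
      .getOrCreate()
    import spark.implicits._

    // Hypothetical raw input: customer records with messy values.
    val raw = Seq(
      ("  Alice ", "1985-04-02", "42"),
      ("Bob",      "not-a-date", ""),
      ("Carol",    "1990-11-23", "37")
    ).toDF("name", "birth_date", "age")

    // Structuring & cleaning: trim strings, parse types, drop invalid rows.
    val cleaned = raw
      .withColumn("name", trim($"name"))
      .withColumn("birth_date", to_date($"birth_date", "yyyy-MM-dd"))
      .withColumn("age", $"age".cast("int"))
      .filter($"birth_date".isNotNull && $"age".isNotNull)

    // Validating: a simple sanity check on the cleaned data.
    val invalidCount = cleaned.filter($"age" < 0 || $"age" > 120).count()
    println(s"rows failing the age check: $invalidCount")

    cleaned.show()
    spark.stop()
  }
}
```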

Biography:

Lars is the manager of the German office of Trifacta, a San Francisco-based startup that is the leader in the data transformation space. At Trifacta, he has led the development of central functionality for each major Trifacta release, starting with the first private beta release. In 2015, he started and built out the Trifacta office in Berlin. Lars holds a PhD in computer science (specialising in data visualization) from the University of Victoria, Canada, and a Master’s degree in computer science (specialising in software engineering) from RWTH Aachen University, Germany.

Detailed Description Talk 2:

Our experience has shown numerous examples of suboptimal usage of Spark that leads to severe performance bottlenecks. Such issues very often crystallise around two specific problem areas: caching and shuffling. Having helped multiple companies regain lost performance, we’ll share our common observations. In this talk, we’ll describe
● the most common misconceptions about Spark’s caching,
● when and how caching should be used,
● why caching is the user’s responsibility (rather than Spark’s),
● the difference between caching and checkpointing,
● how to substitute shuffling with aggregation (a minimal sketch follows this list).
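
As a rough illustration of the last three points, here is a small, self-contained Scala sketch (not taken from the talk; the data and the checkpoint path are hypothetical) showing explicit caching, checkpointing, and map-side aggregation with reduceByKey instead of a groupByKey shuffle.

```scala
import org.apache.spark.sql.SparkSession

object CachingAndShuffleSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("caching-and-shuffle-sketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical input: (key, value) pairs.
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)))

    // Caching is the user's responsibility: Spark keeps this RDD in memory only
    // because we explicitly ask for it before reusing it twice below.
    val cached = pairs.cache()

    // Shuffle-heavy version: groupByKey moves every value across the network
    // before summing.
    val viaGroup = cached.groupByKey().mapValues(_.sum)

    // Aggregation-based version: reduceByKey combines values within each
    // partition first (map-side aggregation), so far less data is shuffled.
    val viaReduce = cached.reduceByKey(_ + _)

    // Unlike cache(), checkpoint() writes the data to reliable storage and
    // truncates the lineage; it needs a checkpoint directory (made-up path).
    sc.setCheckpointDir("/tmp/spark-checkpoints")
    viaReduce.checkpoint()

    // Both variants produce the same result.
    println(viaGroup.collect().toMap == viaReduce.collect().toMap)
    spark.stop()
  }
}
```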

Detailed Description Talk 3:

In this session, we will present Seahorse, a tool for building Spark applications visually, with limited coding skills required. Its graphical user interface, based on manipulating directed acyclic graphs of operations, speeds up the process of building data processing pipelines and helps avoid writing boilerplate code. During the presentation, we will give a live demo of Seahorse and show how to build, test, and productionize a distributed computing application on Spark.
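
For context, the kind of hand-written boilerplate that such a visual tool aims to spare you might look roughly like the following three-node Spark pipeline in Scala (read, aggregate, write). This is not Seahorse output; the paths and column name are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object PipelineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("pipeline-sketch")
      .master("local[*]")
      .getOrCreate()

    spark.read
      .option("header", "true")
      .csv("/tmp/input.csv")              // node 1: data source (made-up path)
      .groupBy(col("category"))           // node 2: an aggregation step
      .count()
      .write
      .mode("overwrite")
      .parquet("/tmp/output.parquet")     // node 3: data sink (made-up path)

    spark.stop()
  }
}
```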

Biography Speakers Talk 2 & 3:

Michał Iwanowski: Michał earned a Master’s degree in Computer Science from Warsaw University of Technology, specialising in software engineering and machine learning. Prior to deepsense.io, he worked on Big Data processing, predictive analytics and data warehousing at IBM. An author of a number of publications and invention disclosures, he has collaborated with medical researchers on the statistical analysis of medical data and built systems for computer-aided experiment design.

Michał is a Software Engineer at deepsense.io, a company focused on providing deep-learning solutions for the enterprise. He studied Computer Science at the University of Warsaw and is dedicated to creating reliable, large-scale solutions.

We wish to thank #openspace for sponsoring this event!