DevOps, Spark/KNIME, GDPR/Hadoop

Dear Big Data Beers Members,

I am happy to announce the next Big Data Beers meeting in the Sony Center, thanks to our sponsor (Think Big Analytics):

• Talk 1: Dr. Arif Wider, Sebastian Herold: “DataDevOps – A Manifesto on Shared Data Responsibility in Times of Microservices”

• Talk 2: Dr. Tobias Kötter: “Heterogeneous Data Mining with Spark”

• Talk 3: Janosch Woschitz: “The elephant in the room: GDPR and Hadoop”

The Details:

Speaker 1: Dr. Arif Wider, Sebastian Herold

Title: DataDevOps – A Manifesto on Shared Data Responsibility in Times of Microservices

Abstract: More and more companies successfully migrate their monolithic applications to a microservices architecture. However, this has only made it more challenging to maintain a consistent and usable data landscape: unstructured data, huge amounts of data, and hundreds of data sources. A centralized data team does not scale in this setting, as it becomes the bottleneck between application developers and business analysts.

We created a Data Manifesto of seven principles that break with traditional role separations and show a way to deal with distributed data in a federated and scalable fashion. This leads to DataDevOps: a culture where application developers also own their data. Learn about our experiences in facilitating this cultural transformation at Scout24, the provider of Europe’s largest online markets for cars and real estate.

Bio: (1) Dr. Arif Wider is a senior consultant and developer at ThoughtWorks Germany, where he builds scalable web applications, teaches Scala, and consults on Big Data topics. Before joining ThoughtWorks, he worked in research with a focus on data synchronization, bidirectional transformations, and domain-specific languages. (2) Sebastian Herold is a big data architect and data engineering manager at Scout24 AG. He leads a team of data engineers that builds a scalable, cloud-based, self-service data landscape for ImmobilienScout24 and AutoScout24. As a former software architect, he still has a passion for transferring engineering paradigms into the data world and for leveling engineers up to become data engineers.

Speaker 2: Dr. Tobias Kötter

Title: Heterogeneous Data Mining with Spark

Abstract: In this talk, we will look at a case of learning a predictive model to forecast flight delays using heterogeneous flight data. The talk will cover the whole analysis flow. We will start by reading data from various sources using the Spark Data Source API. Once we have the data in Spark, we will perform feature engineering to extract useful information from large and unstructured data, such as radar images and textual weather reports. We then combine all the extracted features and use ad-hoc analysis methods to get a better feel for the data. Finally, we learn a model from the combined data to predict flight delays. The talk ends with a look behind the scenes of the prototype that we developed at KNIME for the feature engineering part.

Bio: Tobias Kötter received his Ph.D. in Computer Science from the University of Konstanz in 2012. In 2013 he was a postdoctoral fellow at Carnegie Mellon University. Since 2014 he has worked as a Sr. Data Scientist at KNIME, where he is responsible for the integration of various big data platforms and tools such as Hive, Impala, and Spark. His research interests include large-scale data integration and mining, graph mining, text mining, and creativity support systems.

Speaker 3: Janosch Woschitz

Title: The elephant in the room: GDPR and Hadoop

Abstract: The European General Data Protection Regulation (GDPR) will come into effect in May 2018 and it will impact all organizations that store or process personal data of EU citizens. The European Commission is exporting European data protection principles to the rest of the world while widening the definition of personal data and enforcing privacy by design. These changes will not only have an impact on the organizations but also on the software which is used for data processing. How does it affect the Hadoop ecosystem?

Distributed data processing at scale is one of Hadoop’s core features and we will explore how the GDPR could potentially affect it. We will also take a look at the technical aspects of the rights of data subjects and see if and how we can address those, with a particular focus on open-source technologies.

This talk will give you an overview of the key themes of the GDPR including the rights of the data subject and will investigate the technical implications for data processing within the Hadoop ecosystem.

Bio: Janosch Woschitz is a Senior Data Engineer at Think Big Analytics. He currently focuses on helping clients define and implement large-scale analytical solutions with modern big data technologies. Janosch’s main areas of interest and expertise are large-scale data processing, streaming, and automation. Before Think Big Analytics, he worked at ResearchGate, where he was responsible for an engineering team with a focus on big data infrastructure.

Happy to meet you all!