DevOps, Spark/KNIME, GDPR/Hadoop

Dear Big Data Beers Members,

I am happy to announce the next Big Data Beers meeting in the Sony Center, thanks to our sponsor, Think Big Analytics:

• Talk 1: Dr. Arif Wider, Sebastian Herold: “DataDevOps – A Manifesto on Shared Data Responsibility in Times of Microservices”

• Talk 2: Dr. Tobias Kötter: “Heterogeneous Data Mining with Spark”

• Talk 3: Janosch Woschitz: “The elephant in the room: GDPR and Hadoop”

The Details:

Speaker 1: Dr. Arif Wider, Sebastian Herold

Title: DataDevOps – A Manifesto on Shared Data Responsibility in Times of Microservices

Abstract: More and more companies successfully migrate their monolithic applications to a microservices architecture. However, maintaining a consistent and usable data landscape has only become more challenging as a result: unstructured data, huge amounts of data, and hundreds of data sources. A centralized data team does not scale in this setting, as it becomes the bottleneck between application developers and business analysts.

We created a Data Manifesto of seven principles that break with traditional role separations and show a path toward dealing with distributed data in a federated and scalable fashion. This leads to DataDevOps: a culture where application developers also own their data. Learn about the experience we gained facilitating this cultural transformation at Scout24, the provider of Europe’s largest online markets for cars and real estate.

Bios: (1) Dr. Arif Wider is a senior consultant and developer at ThoughtWorks Germany, where he builds scalable web applications, teaches Scala, and consults on Big Data topics. Before joining ThoughtWorks, he worked in research with a focus on data synchronization, bidirectional transformations, and domain-specific languages. (2) Sebastian Herold is a big data architect and data engineering manager at Scout24 AG. He leads a team of data engineers that builds a scalable, cloud-based, self-service data landscape for ImmobilienScout24 and AutoScout24. As a former software architect, he still has a passion for transferring engineering paradigms into the data world and leveling engineers up to become data engineers.

Speaker 2: Dr. Tobias Kötter

Title: Heterogeneous Data Mining with Spark

Abstract: In this talk, we will look at a case of learning a predictive model to forecast flight delays using heterogeneous flight data. The talk will cover the whole analysis flow. We will start by reading data from various sources using the Spark Data Source API. Once we have the data in Spark, we will perform feature engineering to extract useful information from large and unstructured data, such as radar images and textual weather reports. We then combine all the extracted features and use ad-hoc analysis methods to get a better feeling for the data. Finally, we learn a model from the combined data to predict flight delays. The talk ends with a look behind the scenes of the prototype that we developed at KNIME for the feature engineering part.
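To make that flow concrete, here is a hedged sketch of the same shape of pipeline in Spark 2.x. It is not the KNIME prototype itself; the file paths, column names, and features are invented for illustration:

```scala
// Sketch only: paths, schemas, and features below are invented, not from
// the talk. The shape matches the described flow: read heterogeneous
// sources via the Data Source API, engineer features, and fit a model.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression

object FlightDelaySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("flight-delay-sketch").getOrCreate()

    // Data Source API: a dedicated reader per source format.
    val flights = spark.read.option("header", "true").csv("hdfs:///data/flights.csv")
    val weather = spark.read.json("hdfs:///data/weather-reports.json")

    // Feature engineering: cast numeric columns and derive a crude flag
    // from the textual weather report.
    val features = flights
      .join(weather, Seq("airport", "date"))
      .withColumn("depDelay", col("depDelay").cast("double"))
      .withColumn("windSpeed", col("windSpeed").cast("double"))
      .withColumn("stormFlag",
        when(lower(col("report")).contains("storm"), 1.0).otherwise(0.0))
      .na.drop()

    // Combine the extracted features into a single vector column.
    val assembled = new VectorAssembler()
      .setInputCols(Array("windSpeed", "stormFlag"))
      .setOutputCol("features")
      .transform(features)

    // Learn a simple model for the delay in minutes.
    val model = new LinearRegression()
      .setLabelCol("depDelay")
      .setFeaturesCol("features")
      .fit(assembled)
    println(model.coefficients)

    spark.stop()
  }
}
```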

Bio: Tobias Kötter received his Ph.D. in Computer Science from the University of Konstanz in 2012. In 2013 he was a postdoctoral fellow at Carnegie Mellon University. Since 2014 he has worked as a Senior Data Scientist at KNIME, where he is responsible for the integration of various big data platforms and tools such as Hive, Impala, and Spark. His research interests include large-scale data integration and mining, graph mining, and text mining, as well as creativity support systems.

Speaker 3: Janosch Woschitz

Title:  The elephant in the room: GDPR and Hadoop

Abstract: The European General Data Protection Regulation (GDPR) will come into effect in May 2018 and will impact all organizations that store or process personal data of EU citizens. The European Commission is exporting European data protection principles to the rest of the world while widening the definition of personal data and enforcing privacy by design. These changes will not only affect organizations but also the software used for data processing. How does this affect the Hadoop ecosystem?

Distributed data processing at scale is one of Hadoop’s core features, and we will explore how the GDPR could potentially affect it. We will also take a look at the technical aspects of the rights of data subjects and see if and how we can address those, with a particular focus on open-source technologies.

This talk will give you an overview of the key themes of the GDPR, including the rights of the data subject, and will investigate the technical implications for data processing within the Hadoop ecosystem.
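As one hedged illustration of how a data-subject right might be addressed technically (this example is mine, not necessarily covered by the talk): the right to erasure is hard on append-only storage like HDFS, and one commonly discussed workaround is crypto-shredding, where each subject’s records are encrypted with a per-subject key and “erasure” is implemented by destroying that key:

```scala
// Illustrative sketch of crypto-shredding, not a production design: a real
// system would use authenticated encryption (e.g. AES/GCM) and a proper
// key management service rather than an in-memory map.
import javax.crypto.{Cipher, KeyGenerator, SecretKey}
import scala.collection.mutable

object CryptoShredSketch {
  private val keys = mutable.Map.empty[String, SecretKey] // per-subject keys

  private def keyFor(subject: String): SecretKey =
    keys.getOrElseUpdate(subject, {
      val kg = KeyGenerator.getInstance("AES")
      kg.init(128)
      kg.generateKey()
    })

  def encrypt(subject: String, data: Array[Byte]): Array[Byte] = {
    val c = Cipher.getInstance("AES")
    c.init(Cipher.ENCRYPT_MODE, keyFor(subject))
    c.doFinal(data)
  }

  // "Erasure": the ciphertext can stay in immutable HDFS files forever;
  // once the key is gone, the personal data is effectively unreadable.
  def forget(subject: String): Unit = keys.remove(subject)

  def main(args: Array[String]): Unit = {
    val blob = encrypt("user-123", "street=Example Str. 1".getBytes("UTF-8"))
    forget("user-123")
    println(s"ciphertext kept: ${blob.length} bytes, key present: ${keys.contains("user-123")}")
  }
}
```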

Bio: Janosch Woschitz is a Senior Data Engineer at Think Big Analytics. He currently focuses on helping clients define and implement large-scale analytical solutions with modern big data technologies. Janosch’s main areas of interest and expertise are large-scale data processing, streaming, and automation. Before Think Big Analytics, he worked at ResearchGate, where he was responsible for an engineering team focused on big data infrastructure.

Happy to meet you all!

Stefan

Spark/Flink, Emma and Kafka Meetup

Welcome to our next meeting. We have three interesting talks and a new hot location in the centre of Berlin. The details:

• Talk 1: “Meet Emma: A quotation-based Scala DSL for Scalable Data Analysis” by Alexander Alexandrov

• Talk 2: “Spark and Flink at the limit: Benchmarking Data Flow Systems for Scalable Machine Learning” by Christoph Boden

• Talk 3: “Kafka Streams Test-Drive” by Christoph Bauer

The location: SAP Kantine, Rosenthaler Straße 30, 10178 Berlin-Mitte. More info: Data-Space, IoT-StartUp.

Detailed talk info 1:

Title: Meet Emma: A quotation-based Scala DSL for Scalable Data Analysis

Abstract: Scala DSLs for data-parallel collection processing are usually embedded through types (e.g., RDD, DataFrame, and Dataset in Spark; Dataset and Table in Flink). This approach introduces a design trade-off between two important DSL features: deep reuse of syntactic constructs from the host language (e.g., for-comprehensions, while loops, conditionals, pattern matching) on the one side, and the ability to lift DSL terms to an intermediate representation (IR) suitable for automatic optimizations on the other. We argue that a different embedding approach based on quotations allows for reconciling these features. As a proof of concept, we present Emma – a Scala DSL for scalable data analysis based on quotations.
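To illustrate the trade-off, here is a sketch. The first part shows a limit of the type-based embedding against Spark’s real API; the second is pseudocode in the spirit of Emma’s quotation-based style (names such as emma.onSpark and DataBag follow the Emma project’s published examples, but treat the exact API as an assumption):

```scala
// Part 1: with Spark's type-based embedding, a nested for-comprehension
// over two RDDs does NOT compile, because RDD#flatMap expects a local
// collection. The cross product must be spelled out operator by operator:
import org.apache.spark.rdd.RDD

def pairsSpark(xs: RDD[Int], ys: RDD[Int]): RDD[(Int, Int)] =
  xs.filter(_ % 2 == 0).cartesian(ys)
// not expressible as: for { x <- xs; if x % 2 == 0; y <- ys } yield (x, y)

// Part 2 (pseudocode, API assumed from the Emma project's examples): the
// whole comprehension is quoted as-is, lifted to an IR, optimized
// holistically, and only then offloaded to the Spark backend.
// val result = emma.onSpark {
//   for {
//     x <- DataBag(1 to 100)
//     y <- DataBag(1 to 100)
//     if x % 2 == 0
//   } yield (x, y)
// }
```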

Bio: Alexander Alexandrov is a PhD candidate at the Database and Information Management (DIMA) group at Technische Universität Berlin. His main research interest is in bridging the gap between the demands of modern data analysis platforms and the need for high-level, declarative analytics languages. He is also interested in methods and techniques for scalable data generation and benchmarking of data analysis platforms.

He is the lead developer of two open-source projects:

Peel – a framework that helps you to define, execute, analyze, and share experiments for distributed systems and algorithms.

Emma – a quotation-based Scala DSL for scalable data analysis.

Detailed talk info 2:

Title: Spark and Flink at the limit: Benchmarking Data Flow Systems for Scalable Machine Learning

Abstract: Distributed data flow systems such as Apache Spark or Apache Flink are popular choices for scaling machine learning algorithms in production. Industry applications of large-scale machine learning, such as click-through-rate prediction, rely on models trained on billions of data points which are both highly sparse and high-dimensional. This talk will shed light on the performance of both systems for scalable machine learning workloads. Rather than relying on existing library implementations, we implemented a representative set of distributed machine learning algorithms suitable for large-scale distributed settings, which closely resemble industry-relevant applications and provide generalizable insights into system performance. We tuned relevant system parameters and ran a comprehensive set of experiments to assess the scalability of Apache Flink and Apache Spark for datasets of up to four billion data points and 100 million dimensions. This talk will present the results and insights of these experiments, as well as lessons learned and pitfalls encountered while tuning the relevant system parameters.
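For a flavour of the workloads in question, here is a minimal sketch of batch gradient descent for logistic regression on Spark RDDs – the classic shape of algorithm such benchmarks exercise. It is not the speakers’ benchmark code: the input path is hypothetical, and real runs at 100 million dimensions would require sparse vectors rather than dense arrays:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LogRegSketch {
  // Labels are -1.0 or +1.0; features are a dense array (sketch only).
  case class Point(x: Array[Double], y: Double)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("logreg-sketch"))
    val dims = 100 // the talk scales this to 10^8 with sparse representations

    val points = sc.textFile("hdfs:///data/points") // hypothetical path
      .map(_.split(' '))
      .map(cols => Point(cols.tail.map(_.toDouble), cols.head.toDouble))
      .cache() // iterative access: without this, every pass re-reads HDFS

    var w = Array.fill(dims)(0.0)
    for (_ <- 1 to 10) {
      // Gradient of the logistic loss, summed across the cluster.
      val grad = points.map { p =>
        val margin = p.y * p.x.zip(w).map { case (a, b) => a * b }.sum
        val scale = (1.0 / (1.0 + math.exp(-margin)) - 1.0) * p.y
        p.x.map(_ * scale)
      }.reduce((a, b) => a.zip(b).map { case (u, v) => u + v })
      w = w.zip(grad).map { case (wi, gi) => wi - 0.1 * gi }
    }
    println(w.take(5).mkString(", "))
    sc.stop()
  }
}
```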

Bio: Christoph Boden is a Computer Science research associate at the TU Berlin Database Systems and Information Management (DIMA) group, where he contributes to the coordination and management of the Berlin Big Data Center (BBDC). His research focuses on benchmarking distributed data processing platforms such as Apache Spark and Apache Flink for scalable machine learning workloads, large-scale data analysis, and text mining. He studied industrial engineering at Technische Universität Dresden, Technische Universität Berlin, and the University of California, Berkeley, and received a master’s degree (“Dipl.-Ing.”) from TU Berlin in 2011. Christoph teaches a graduate-level course on scalable data analytics at TU Berlin and is a laureate of the Software Campus program. He has published numerous peer-reviewed papers at international conferences, workshops, and journals in the field of distributed data processing systems and data mining.

Detailed talk info 3:

Title: Kafka Streams Test-Drive

Abstract: In March 2016, Confluent, Inc. introduced yet another stream processing API. With Apache Spark, Apache Storm, Apache Flink, etc. out there, the question arises: why another one? This talk will give a general introduction to the library and will try to answer some fundamental questions.

We will cover
– what it is,
– what it looks like,
– how it works,
– deployments, and
– what it’s good for (a minimal word-count sketch follows below).
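To make “what it looks like” concrete, here is a minimal word-count sketch against the Kafka Streams DSL roughly as of the 0.10.x releases. Topic names are invented, and later Kafka versions renamed several of these calls, so treat the exact signatures as an assumption rather than a reference:

```scala
// Word-count sketch against the (then-new) Kafka Streams DSL, ~0.10.x API.
import java.util.Properties
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
import org.apache.kafka.streams.kstream.{KStream, KStreamBuilder, KTable, KeyValueMapper, ValueMapper}
import scala.collection.JavaConverters._

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-sketch")
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
    props.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass.getName)
    props.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass.getName)

    val builder = new KStreamBuilder()
    val lines: KStream[String, String] =
      builder.stream[String, String]("text-lines") // hypothetical topic name

    // Split each line into words, re-key by word, and keep a running count
    // in a local, fault-tolerant state store backed by a changelog topic.
    val counts: KTable[String, java.lang.Long] = lines
      .flatMapValues(new ValueMapper[String, java.lang.Iterable[String]] {
        def apply(line: String) = line.toLowerCase.split("\\W+").toIterable.asJava
      })
      .groupBy(new KeyValueMapper[String, String, String] {
        def apply(key: String, word: String) = word
      })
      .count("counts-store")

    counts.to(Serdes.String(), Serdes.Long(), "word-counts")

    new KafkaStreams(builder, props).start()
  }
}
```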

Bio: Christoph is a big data engineer and consultant based in Berlin. His first visit to the zoo was in 2010, and he has been with the elephants, pigs, bees, … ever since. He has also weathered his share of storms and electrical discharges in the form of sparks.

Hope to see you all!

Best Regards

Stefan Edlich

Data Wrangling & Spark^2

Welcome to the next meeting with three exciting new talks! 

Join us for some of the latest tech buzz, good drinks and a pleasant get-together. 

The talks are:

Talk 1: “Data Wrangling – The Key to Successful Data Science”
by Lars Grammel (Trifacta)

Talk 2: “Tips & tricks for making Spark lightning fast – selected case studies on caching and shuffle avoidance”

Talk 3: “Visual framework for Spark”
Talks 2 & 3 by Adam Jakubowski and Michael Szostek (Deepsense.io)

We are also happy to visit the marvellous new venue #openspace this time.

Detailed Description Talk 1:

Data preparation can take up between 60% and 80% of the time spent on data science projects and is crucial for their success. In this talk, the different data preparation activities – data discovery, structuring, cleaning, enriching, and validating – are examined, challenges are highlighted, and combinations of intelligent algorithms and user interfaces that speed up data preparation are explored.
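As a hedged illustration of these activities – not Trifacta’s product, just a generic Spark SQL sketch with invented paths and columns:

```scala
// Illustrative wrangling steps: structuring, cleaning, enriching,
// validating. Paths, columns, and the 1% threshold are all assumptions.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object WranglingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("wrangling-sketch").getOrCreate()
    import spark.implicits._

    // Structuring: parse semi-structured CSV into typed columns.
    val raw = spark.read.option("header", "true").csv("hdfs:///raw/orders.csv")
    val typed = raw
      .withColumn("amount", $"amount".cast("double"))
      .withColumn("ts", $"ts".cast("timestamp"))

    // Cleaning: drop rows with unparseable amounts, normalize country codes.
    val clean = typed
      .filter($"amount".isNotNull)
      .withColumn("country", upper(trim($"country")))

    // Enriching: join against a reference dataset (hypothetical file).
    val countries = spark.read.json("hdfs:///ref/countries.json")
    val enriched = clean.join(countries, Seq("country"), "left")

    // Validating: fail fast if too many rows lost their enrichment.
    val missing = enriched.filter($"region".isNull).count()
    require(missing < enriched.count() * 0.01, s"$missing rows unmatched")

    spark.stop()
  }
}
```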

Biography:

Lars is the manager of the German office of Trifacta, a San Francisco-based startup that is the leader in the data transformation space. At Trifacta, he has led the development of central functionality for each major Trifacta release, starting with the first private beta release. In 2015, he started and built out the Trifacta office in Berlin. Lars holds a PhD in computer science (specialising in data visualization) from the University of Victoria, Canada, and a Master’s degree in computer science (specialising in software engineering) from RWTH Aachen University, Germany.

Detailed Description Talk 2:

Our experience has shown numerous examples of suboptimal Spark usage that leads to severe performance bottlenecks. Such issues very often crystallise around two specific problem areas: caching and shuffling. Having helped multiple companies regain the lost performance, we’ll share our common observations. In this talk, we’ll describe
● the most common misconceptions about Spark’s caching,
● when and how caching should be used,
● why caching is the user’s responsibility (rather than Spark’s),
● the difference between caching and checkpointing,
● how to substitute shuffling with aggregation (see the sketch below).
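A minimal sketch of two of these tips, with an invented input path: cache an RDD that is reused across actions, and prefer reduceByKey’s map-side aggregation over groupByKey’s full shuffle:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SparkTipsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("tips-sketch"))
    val events = sc.textFile("hdfs:///logs/events") // hypothetical path
      .map(_.split('\t'))
      .map(cols => (cols(0), cols(1).toLong))

    // Caching: 'events' feeds two actions below; without cache() it would
    // be recomputed from HDFS for each of them.
    events.cache()

    // Shuffle avoidance: groupByKey ships every value across the network...
    val slow = events.groupByKey().mapValues(_.sum)
    // ...while reduceByKey combines values per partition before shuffling.
    val fast = events.reduceByKey(_ + _)

    println(fast.count())
    println(slow.count())
    sc.stop()
  }
}
```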

Detailed Description Talk 3:

In this session, we will present Seahorse, a tool for building Spark applications visually, requiring only limited coding skills. A graphical user interface based on manipulating directed acyclic graphs of operations speeds up the process of building data processing pipelines and helps avoid boilerplate code. During the presentation, we will give a live demo of Seahorse and show how to build, test, and productionize a distributed computing application on Spark.

Biography Speakers Talk 2 & 3:

Michał Iwanowski: Michał earned a Master’s degree in Computer Science from the Warsaw University of Technology, specializing in software engineering and machine learning. Prior to deepsense.io, he worked on big data processing, predictive analytics, and data warehousing at IBM. An author of a number of publications and invention disclosures, he has collaborated with medical researchers on statistical analysis of medical data and built systems for computer-aided experiment design.

Michał is a Software Engineer at deepsense.io, a company focused on providing deep-learning solutions for the enterprise. He studied Computer Science at the University of Warsaw. He is dedicated to creating reliable, large-scale solutions.

We wish to thank #openspace for sponsoring this event!

Multimodel-DataChallenge-AnomalyDetection

It is time for the next meetup, with three wonderful and quite different talks:

1. Next Generation NoSQL: Multi Model Databases (Michael Hackstein)
2. SMS Digital Data Challenge (Wagner, Weimer-Hablitzel, Arnold)
3. What the heck is normal? (Christian Glatschke)

We meet at Lieferando.de (thanks!!) which is very close to Potsdamer Platz!

Here are the details:

TALK 1: Next Generation NoSQL: Multi Model Databases

Recent progress in database development has led us down a long and winding road:

1. Using relational databases for everything
2. NoSQL and polyglot persistence: introducing a ton of different technologies to learn
3. The next wave: multi-model databases and polyglot data

In this talk we will look at the pros and cons of each of these stages.

We will also give a more in-depth introduction to multi-model databases and explain why they are a wonderful next step in database evolution. Finally, we will talk about big data requirements and how they fit into this landscape of database technologies. How can we scale and run databases in production? We use a whole cluster of machines, collecting and mining terabytes of data with ease with the help of Mesos and DC/OS.

Speaker: Michael Hackstein holds a master’s degree in computer science and is the creator of ArangoDB’s graph capabilities. During his academic career he focused on complex algorithms and especially on graph databases. Michael is an internationally experienced speaker who loves salad, cake, and clean code.

TALK 2: SMS digital’s Data Challenge 

Data Science meets Steel Industry – an exciting match of heavy metal and high tech. SMS digital presents its first open data challenge. SMS digital is the digital laboratory of the SMS group, a worldwide leading steel plant manufacturer from Düsseldorf. Its mission is to test and develop digital business models with the help of state-of-the-art technologies. Naturally, applied data science plays a major role in this initiative.

Casting slabs of quality steel is a demanding process that is highly sensitive to changes in its production and environmental parameters. Hence, predicting casting defects is essential to provide ongoing quality assurance and to indicate when production parameters need adjustment.

The potential for optimization, and the implied business opportunity, are huge. So far, the primary method for predicting and detecting slab surface defects has been stochastic procedures on continuous measurements. In the future, however, SMS digital hopes to improve the predictive quality of the existing procedures by applying state-of-the-art data mining and analytics methods. This is the objective of SMS digital’s data challenge, which promises five-figure prize money for the top contributors. At our meetup, SMS digital’s team will present this data science use case, explain the provided data sets, and answer all remaining questions regarding the data challenge.

Speaker:

Maximilian Wagner, SMS digital, CEO

Max combines extensive hands-on experience in the steel industry with an ever-curious, entrepreneurial mindset. As CEO of SMS digital he’s always looking for trends and technologies that might lead to the ‘next big thing’ in his industry and beyond.

Marc Weimer-Hablitzel, etventure, Senior Manager

Marc combines technical knowledge, creativity and business expertise. Having worked as a Data Mining Expert for several companies in the past, Marc knows the realities and raptures of data science. Now he is on a mission to revolutionize the steel industry and the B2B sector with etventure and SMS Digital.

Friedrich Arnold, etventure, Project Manager

As a mechanical engineer and serial entrepreneur, Friedrich is very excited about the endless opportunities of the industrial B2B sector. He is currently responsible for testing and scaling SMS digital’s new Data Lab.

Talk 3: What the heck is normal?

To analyse and understand data, along with the circumstances and factors that organize it, we need constant supervision of certain influences. In physics or information technology, this could mean boundary values and the definition of a ‘normal’ range.

But who defines what is normal?

Not only in psychoanalysis but also in other sciences, it is a tough challenge to tell what kind of behavior is beyond the norm.

Today we can answer these questions for many applications using algorithms and machine learning. This talk will show insights, principles, and tools for successful anomaly detection & analytics.
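As a toy illustration of the simplest possible definition of “normal” – not tied to Splunk, Lucidworks, or Anodot – consider flagging values that fall outside three standard deviations of a baseline window’s mean; real systems use rolling windows and adaptive baselines:

```scala
object ZScoreSketch {
  // Learn what "normal" looks like from a baseline window of values...
  def baseline(xs: Seq[Double]): (Double, Double) = {
    val mean = xs.sum / xs.size
    val std = math.sqrt(xs.map(x => math.pow(x - mean, 2)).sum / xs.size)
    (mean, std)
  }

  // ...and flag new points that fall outside mean +/- k * std.
  def isAnomaly(x: Double, mean: Double, std: Double, k: Double = 3.0): Boolean =
    math.abs(x - mean) > k * std

  def main(args: Array[String]): Unit = {
    val normalWindow = Seq(12.0, 11.5, 13.1, 12.4, 11.9, 12.2) // e.g. latencies in ms
    val (m, s) = baseline(normalWindow)
    Seq(12.6, 95.0).foreach { x =>
      println(s"$x -> anomaly: ${isAnomaly(x, m, s)}") // 12.6 normal, 95.0 flagged
    }
  }
}
```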

Speaker: Christian Glatschke
17 years of IT experience, 8 years in Big Data and Analytics (Splunk, Lucidworks, and Anodot)

Tech Day – Berlin – http://goo.gl/EV5eXj

Come and spend a morning dedicated to learning about new technology designed to maximize the value of your existing compute resources. Tech Day brings together the brightest minds across a variety of industry verticals in a local forum to share best practices and to collaborate on building modern HPC and data analytics ecosystems.

The networking event is sponsored by Univa, a leading innovator of workload management solutions that optimize throughput and performance of applications, containers and services.

Details and RSVP at: http://goo.gl/EV5eXj

Live Webinar: Introduction to Apache Spark & Scala

Hello,

We’d like to invite you to an expert live webinar on ‘Apache Spark & Scala’, scheduled for 2nd September 2016, 11:30 AM to 1:00 PM EDT.

TOPICS

• Introduction to Big Data

• Introduction to Spark

• Why Spark

• Spark ecosystem

• Introduction to Scala

• Practicals on Spark

This promises to be an extremely enriching session and we hope you can make it – Register Now

In case you can’t make it, sign up anyway – we’ll send you the recording.

Cheers!

Live Webinar: Data Loading Techniques in Hadoop 2.x

Hello,

We’d like to invite you to an expert live webinar on ‘Data Loading Techniques in Hadoop’, scheduled for Thursday, 25th August 2016, 11:00 AM to 12:30 PM (EDT).

TOPICS

• Introduction to Big Data

• Challenges of Big Data

• Introduction to Hadoop 

• HDFS Overview

• Hadoop characteristics and HDFS

• Structured & Unstructured data

• Data loading techniques using HDFS, Hive, Pig, and HBase

• Practical Demonstration

• Q&A Session

This promises to be an extremely enriching session and we hope you can make it – Register Now

In case you can’t make it, sign up anyway – we’ll send you the recording.

Cheers!

Live Webinar: Apache Spark VS Hadoop MapReduce

Hello,

We’d like to invite you to an expert live webinar on ‘Apache Spark vs. Hadoop MapReduce’, scheduled for 22nd August 2016, 9:30 PM to 11:00 PM (EDT).

TOPICS

• Introduction to Big Data and its challenges

• Introduction to Hadoop and its characteristics

• Hadoop ecosystem

• HDFS and MapReduce (Yarn)

• Advantages and disadvantages of Hadoop

• Introduction to Spark and Scala

• Why Spark and Scala

• Data Loading Using RDD

• Differences between Spark and Hadoop

This promises to be an extremely enriching session and we hope you can make it – Register Now

In case you can’t make it, sign up anyway – we’ll send you the recording.

Cheers!

Live Webinar: Apache Spark VS Hadoop MapReduce

Hello,

We’d like to invite you to an expert live webinar on ‘Apache Spark vs. Hadoop MapReduce’, scheduled for Thursday, 18th August 2016, 9:30 PM to 11:00 PM (EDT).

TOPICS

• Introduction to Big Data and its challenges

• Introduction to Hadoop and its characteristics

• Hadoop ecosystem

• HDFS and MapReduce (Yarn)

• Advantages and disadvantages of Hadoop

• Introduction to Spark and Scala

• Why Spark and Scala

• Data Loading Using RDD

• Differences between Spark and Hadoop

This promises to be an extremely enriching session and we hope you can make it – Register Now

In case you can’t make it, sign up anyway – we’ll send you the recording.

Cheers!

Database Time!

Dear Big Data Beers Community,

We are excited to announce the next meeting. This time we have a wonderful location provided by Amazon Berlin (thanks!).

Furthermore, we have two interesting and distinct databases to be presented: BigchainDB and CortexDB.

Happy to welcome you all!

The details:

Title: Blockchains in a Big Data World
Speaker: Trent McConaghy

Abstract: “Big data” distributed databases emerged as a new storage paradigm and quickly rose to prominence. Now we have “blockchain” storage, with promising applications. Yet blockchains have been strangely separate from the big data database world. What if they weren’t? What if there was a way to reconcile blockchains with big data? This talk describes how, using the example of BigchainDB. BigchainDB starts with a “big data” distributed database and adds blockchain characteristics: decentralized control, greater tamper-resistance, and the ability to issue and transfer assets.
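As a toy illustration of the tamper-resistance idea – a generic hash chain, not BigchainDB’s actual implementation – each block carries the hash of its predecessor, so altering any historic record invalidates every later link:

```scala
import java.security.MessageDigest

// Each block's hash covers its own data plus the previous block's hash.
case class Block(data: String, prevHash: String) {
  val hash: String =
    MessageDigest.getInstance("SHA-256")
      .digest((data + prevHash).getBytes("UTF-8"))
      .map("%02x".format(_)).mkString
}

object ChainSketch {
  // Verification walks the chain and recomputes the links.
  def valid(chain: List[Block]): Boolean =
    chain.zip(chain.tail).forall { case (a, b) => b.prevHash == a.hash }

  def main(args: Array[String]): Unit = {
    val genesis = Block("create asset: bike", "0" * 64)
    val b1 = Block("transfer bike: alice -> bob", genesis.hash)
    val b2 = Block("transfer bike: bob -> carol", b1.hash)

    println(valid(List(genesis, b1, b2)))                         // true
    println(valid(List(genesis, b1.copy(data = "tampered"), b2))) // false
  }
}
```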

Bio: Trent McConaghy is co-inventor of the BigchainDB blockchain database. Previously, he co-founded Solido Design Automation, which uses big data and large-scale machine learning to help drive Moore’s Law. Trent has written two critically acclaimed books on machine learning and circuits, in addition to 50 papers and patents. He has given keynotes and invited talks at MIT, Columbia, Berkeley, JPL, Nvidia, and more.

Title: “CortexDB – a NoSQL Database inspired by human brain research”
Speakers: Jan Buss, Michael Backhaus

CortexDB is the world’s only database that provides automatic transformation of imported raw data into the highest (6th) normal form.

This results in a number of significant advantages (e.g. a semantic network) for the development of analytical applications in big data environments. Jan Buss (CEO) and Michael Backhaus (core developer) will show, based on live demos, how current digital trends and challenges in software development can be addressed with Cortex. We are looking forward to a lively discussion.

CortexDB technology features (a toy decomposition illustrating the 6NF idea follows below):
• CortexNF6: automatic transformation of the raw data into the highest (6th) normal form (entity relationship model) / extremely fast context queries with a low footprint on top of big data
• Schema-less multi-model NoSQL technology (key-value store, document store, graph DB, multi-value DB, column DB)
• Temporal database technology (bi-temporal) with transaction date and validity period of data
• Embedded Google V8 / JavaScript library for executing program code and algorithms directly in CortexDB
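As a toy illustration of what 6NF decomposition means in practice – my reading of the idea, not CortexDB’s internals – a conventional row is split into irreducible per-attribute facts that can be versioned and queried independently:

```scala
import java.time.LocalDate

// One irreducible fact: entity, attribute, value, and validity start date.
case class Fact(entity: String, attribute: String, value: String, validFrom: LocalDate)

object SixNFSketch {
  def main(args: Array[String]): Unit = {
    // A conventional row (customer #42 with name and city) becomes one
    // fact per attribute; an update only appends a new fact, so the old
    // value stays queryable (the bi-temporal angle, simplified).
    val facts = List(
      Fact("customer:42", "name", "Ada",     LocalDate.parse("2015-01-01")),
      Fact("customer:42", "city", "Berlin",  LocalDate.parse("2015-01-01")),
      Fact("customer:42", "city", "Hamburg", LocalDate.parse("2016-06-01"))
    )

    // "Current" view: the latest fact per (entity, attribute).
    val current = facts
      .groupBy(f => (f.entity, f.attribute))
      .map { case (_, fs) => fs.maxBy(_.validFrom.toEpochDay) }
    current.foreach(println)
  }
}
```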