The year of the COVID-19 pandemic has spotlighted, as never before, the many shortcomings of the world’s data management workflows. The lack of established ways to exchange and access data was a widely recognized contributing factor in our poor response to the pandemic. On multiple occasions we have witnessed poor practices around reproducibility and provenance completely sidetrack major vaccine research efforts, prompting calls to action from the scientific and medical communities to address these problems.
Breaking down silos, reproducibility, and provenance are all complex problems that will not disappear overnight; solving them requires a continuous process of incremental improvements. Unfortunately, we believe our workflows are not suited even for that. Modern data science encourages routine copying of data, with every transformation step producing data that is disjoint from its source. It is hard to tell where most data comes from or how it was altered, and there is no practical way to verify that no malicious or accidental alterations were made. Our common data workflows directly contradict the essential prerequisites for collaboration and trust, so even when results are shared they often cannot be easily reused.
This talk is the result of two years of R&D work that takes a completely different perspective on data pipeline design. We demonstrate what happens when the prerequisites for collaboration, such as repeatability, verifiability, and provenance, are chosen as core properties of the system. We present a new open standard for decentralized and trusted data transformation and exchange that leverages the latest advancements in modern data processing frameworks like Apache Spark and Apache Flink to create a truly global data pipeline. We also present a prototype tool that implements this standard and show how its core ideas can scale from a laptop to a data center, and on to a worldwide data processing network that encourages reuse and collaboration.
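To make the verifiability and provenance goals above concrete, here is a minimal sketch of how an append-only, hash-linked metadata chain can make a dataset's transformation history tamper-evident. The structure and field names below are hypothetical illustrations for this description only, not the actual Open Data Fabric format.

import hashlib
import json
from dataclasses import dataclass
from typing import Optional

@dataclass
class MetadataBlock:
    """One entry in a dataset's append-only metadata chain (hypothetical layout)."""
    prev_block_hash: Optional[str]  # hash of the previous block, None for the first block
    transform: str                  # description of the query that produced this slice
    output_data_hash: str           # hash of the data slice this step produced
    block_hash: str = ""            # filled in after the other fields are set

    def compute_hash(self) -> str:
        payload = json.dumps(
            {"prev": self.prev_block_hash, "transform": self.transform, "data": self.output_data_hash},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode()).hexdigest()

def append_block(chain: list, transform: str, output_data: bytes) -> None:
    """Append a block that commits to both the produced data and the chain's history."""
    block = MetadataBlock(
        prev_block_hash=chain[-1].block_hash if chain else None,
        transform=transform,
        output_data_hash=hashlib.sha256(output_data).hexdigest(),
    )
    block.block_hash = block.compute_hash()
    chain.append(block)

def verify_chain(chain: list) -> bool:
    """Re-derive every block hash and check the linkage; altering any recorded field breaks the chain."""
    prev_hash = None
    for block in chain:
        if block.prev_block_hash != prev_hash or block.block_hash != block.compute_hash():
            return False
        prev_hash = block.block_hash
    return True

# Example: two transformation steps, then verification.
chain = []
append_block(chain, "SELECT * FROM raw WHERE valid = true", b"...filtered slice...")
append_block(chain, "SELECT region, SUM(cases) FROM filtered GROUP BY region", b"...aggregated slice...")
print(verify_chain(chain))  # True; changing any recorded step or hash would make this False

Because each block commits to both the previous block and the data it produced, a downstream consumer can re-check the entire derivation history without having to trust the publisher.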
What you will learn:
– Shortcomings of modern data management workflows and tools
– The important role of the temporal dimension in data
– How the latest data modeling techniques in OLTP, OLAP, and stream processing converge
– How bitemporal data modeling ideas apply to data streams (see the sketch after this list)
– How, by combining these ideas, we can satisfy all the preconditions for trust and collaboration
– A summary of the proposed “Open Data Fabric” protocol for decentralized exchange and transformation of data
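As a taste of the bitemporal modeling idea mentioned above, the following sketch shows stream records that carry both an event time (when something happened) and a system time (when it was recorded), and how that lets us reconstruct what the dataset looked like at any earlier point. The schema and figures are hypothetical examples, not taken from the talk.

from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class Record:
    """A stream record carrying both temporal dimensions (hypothetical schema)."""
    event_time: datetime    # when the fact occurred in the real world
    system_time: datetime   # when the fact was recorded in the dataset
    region: str
    cases: int

def as_of(records, system_time: datetime):
    """Reconstruct the dataset as it was known at a given point in system time,
    keeping only the latest correction per (region, event_time)."""
    latest = {}
    for r in sorted(records, key=lambda r: r.system_time):
        if r.system_time <= system_time:
            latest[(r.region, r.event_time)] = r
    return list(latest.values())

utc = timezone.utc
records = [
    # Initial report, then a late correction for the same event day.
    Record(datetime(2020, 3, 1, tzinfo=utc), datetime(2020, 3, 2, tzinfo=utc), "CA", 100),
    Record(datetime(2020, 3, 1, tzinfo=utc), datetime(2020, 3, 10, tzinfo=utc), "CA", 142),
]

# What did we believe on March 5th versus what do we believe later?
print(as_of(records, datetime(2020, 3, 5, tzinfo=utc)))   # sees the original figure (100)
print(as_of(records, datetime(2020, 3, 15, tzinfo=utc)))  # sees the correction (142)

Keeping both timelines is what makes results reproducible: a derived dataset can always be recomputed against the inputs exactly as they were known at the time.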
About:
Databricks provides a unified data analytics platform, powered by Apache Spark™, that accelerates innovation by unifying data science, engineering and business.