In-Memory Storage Evolution in Apache Spark
Dr. Kazuaki Ishizaki, IBM

This talk summarizes recent work in the Apache Spark developer community to enhance columnar storage in Spark 2.3. Columnar storage is an efficient format because it keeps the values of each column in consecutive memory. Earlier versions of Spark used columnar storage in only a few places, and only as an internal data structure. Spark 2.3 published the abstract class ColumnVector as a public API and uses it to support several columnar storages efficiently, with significant performance improvements.

Versions before Spark 2.3 used columnar storage for reading Apache Parquet and for creating the table cache in a program written in SQL, DataFrame, or Dataset (e.g. df.cache()). These columnar storages were accessed through different internal APIs, and that difference made the table cache inefficient. With ColumnVector as a public API, Spark 2.3 can read data in Apache Arrow and Apache ORC through ColumnVector without extra data conversion or data copy. PySpark before Spark 2.3 had a large serialization and deserialization overhead; Spark 2.3 eliminated this overhead by using pandas with Apache Arrow, which improves PySpark performance. Spark 2.3 also accesses the columnar storage for the table cache through ColumnVector without data copy, which improves table cache performance.

Here are the takeaways of this talk:
(1) ColumnVector in Spark 2.3 is a public API for columnar storage, used to exchange data with other columnar storages.
(2) Spark 2.3 uses ColumnVector to exchange data with the widely used columnar formats Apache Arrow and Apache ORC with low overhead, which improves performance.
(3) Spark 2.3 and later versions improve PySpark performance by using pandas with Apache Arrow.
(4) Spark 2.3 and later versions use ColumnVector for the table cache, which improves its performance.
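To make the idea of one shared columnar interface concrete, here is a minimal sketch in plain Python. It is not Spark's ColumnVector API: the class names `IntColumn`, `ArrayIntColumn`, and `RowFieldIntColumn` and the helper `column_sum` are hypothetical, chosen only for illustration. The sketch assumes a single nullable integer column and shows how a consumer written against one abstract interface can read different underlying storages without converting them first.

```python
from abc import ABC, abstractmethod
from array import array


class IntColumn(ABC):
    """Hypothetical read-only view of one integer column (not Spark's API)."""

    @abstractmethod
    def num_rows(self) -> int: ...

    @abstractmethod
    def is_null_at(self, row_id: int) -> bool: ...

    @abstractmethod
    def get_int(self, row_id: int) -> int: ...


class ArrayIntColumn(IntColumn):
    """Columnar storage: values sit contiguously in a typed array."""

    def __init__(self, values):
        # A parallel null mask records which slots hold no value.
        self._nulls = [v is None for v in values]
        self._data = array("i", [0 if v is None else v for v in values])

    def num_rows(self) -> int:
        return len(self._data)

    def is_null_at(self, row_id: int) -> bool:
        return self._nulls[row_id]

    def get_int(self, row_id: int) -> int:
        return self._data[row_id]


class RowFieldIntColumn(IntColumn):
    """Adapter: exposes one field of row-oriented tuples as a column."""

    def __init__(self, rows, field_index: int):
        self._rows = rows
        self._field = field_index

    def num_rows(self) -> int:
        return len(self._rows)

    def is_null_at(self, row_id: int) -> bool:
        return self._rows[row_id][self._field] is None

    def get_int(self, row_id: int) -> int:
        return self._rows[row_id][self._field]


def column_sum(col: IntColumn) -> int:
    """Consumer code depends only on the interface, not on the storage."""
    return sum(
        col.get_int(i) for i in range(col.num_rows()) if not col.is_null_at(i)
    )


columnar = ArrayIntColumn([1, None, 3])
row_based = RowFieldIntColumn([("a", 1), ("b", None), ("c", 3)], field_index=1)
print(column_sum(columnar), column_sum(row_based))
```

Both calls go through the same `column_sum` function, so adding another storage format only requires another `IntColumn` subclass. This mirrors, in a very simplified way, the role the abstract describes for ColumnVector; the real class covers many data types and storage layouts that this sketch leaves out.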

