This talk summarizes recent work in the Apache Spark developer community to enhance columnar storage in Spark 2.3. Columnar storage is an efficient format because it keeps the values of a column in consecutive memory. Earlier versions of Spark used columnar storage in only a few places, and only as an internal data structure. Spark 2.3 publishes the abstract class ColumnVector as a public API and uses it to support several columnar storages efficiently, with large performance improvements.

Before Spark 2.3, columnar storage was used for reading Apache Parquet and for creating the table cache in a program written in SQL, DataFrame, or Dataset (e.g. df.cache()). These two columnar storages were accessed through different internal APIs, and this difference made the table cache inefficient. With ColumnVector as a public API, Spark 2.3 can read data in Apache Arrow and Apache ORC through ColumnVector without extra data conversion or data copy. PySpark before Spark 2.3 had a large serialization and deserialization overhead; Spark 2.3 eliminates this overhead by using pandas with Apache Arrow, which improves PySpark performance. Spark 2.3 also accesses the columnar storage of the table cache through ColumnVector without data copy, which improves table cache performance.

Takeaways of this talk: (1) ColumnVector in Spark 2.3 is a public API for columnar storage that allows data exchange with other columnar storages. (2) Spark 2.3 uses ColumnVector to exchange data with the widely used columnar storages Apache Arrow and Apache ORC at low overhead, which improves performance. (3) Spark 2.3 and later versions improve PySpark performance by using pandas with Apache Arrow. (4) Spark 2.3 and later versions use ColumnVector for the table cache, which improves its performance.
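To make the idea of a common columnar accessor concrete, here is a minimal pure-Python sketch. It is not Spark's implementation: the real ColumnVector is a Java abstract class, and the Python class names, the `num_rows` method, and the null-set representation below are assumptions made for illustration. Only the accessor names loosely mirror Spark's `isNullAt` and `getInt`. The sketch shows one consumer reading two different column layouts through the same interface, without first converting either layout into rows.

```python
from abc import ABC, abstractmethod
from array import array
import struct


class ColumnVector(ABC):
    """Hypothetical read-only view over one integer column (illustration only)."""

    @abstractmethod
    def num_rows(self) -> int: ...

    @abstractmethod
    def is_null_at(self, row_id: int) -> bool: ...

    @abstractmethod
    def get_int(self, row_id: int) -> int: ...


class ArrayColumnVector(ColumnVector):
    """Column kept in a typed in-memory array, as a table cache might keep it."""

    def __init__(self, values, nulls=()):
        self._values = array("i", values)
        self._nulls = frozenset(nulls)  # assumed null encoding: set of row ids

    def num_rows(self) -> int:
        return len(self._values)

    def is_null_at(self, row_id: int) -> bool:
        return row_id in self._nulls

    def get_int(self, row_id: int) -> int:
        return self._values[row_id]


class BufferColumnVector(ColumnVector):
    """Column read in place from a packed little-endian int32 byte buffer."""

    def __init__(self, buffer: bytes, nulls=()):
        self._buffer = buffer
        self._nulls = frozenset(nulls)

    def num_rows(self) -> int:
        return len(self._buffer) // 4

    def is_null_at(self, row_id: int) -> bool:
        return row_id in self._nulls

    def get_int(self, row_id: int) -> int:
        # Decode a single value at its offset; the buffer is never copied.
        return struct.unpack_from("<i", self._buffer, 4 * row_id)[0]


def sum_column(col: ColumnVector) -> int:
    """Consumer written once against the shared interface."""
    total = 0
    for row_id in range(col.num_rows()):
        if not col.is_null_at(row_id):
            total += col.get_int(row_id)
    return total


cached = ArrayColumnVector([1, 2, 3, 4], nulls={2})
packed = BufferColumnVector(struct.pack("<4i", 1, 2, 3, 4), nulls={2})
print(sum_column(cached), sum_column(packed))
```

Both calls go through the same `sum_column` code path, which is the property the abstract attributes to ColumnVector: the reader does not need a separate internal API per storage format.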
About: Databricks provides a unified data analytics platform, powered by Apache Spark™, that accelerates innovation by unifying data science, engineering and business.
Databricks is proud to announce that Gartner has named us a Leader in both the 2021 Magic Quadrant for Cloud Database Management Systems and the 2021 Magic Quadrant for Data Science and Machine Learning Platforms.