This video walks through a commonly used Batch Data Processing Pipeline and the best practices that improve its efficiency.
Best practices to follow during implementation:
---------------------------------------------------------------------------------
✅Proper File formats
✅Proper Partitioning
✅Transient EMR clusters instead of a Persistent Cluster
✅External Hive Metastore instead of Embedded Metastore
✅Metadata Driven Ingestion Framework
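As a rough illustration of the transient-cluster and external-metastore practices above, the sketch below assembles a `run_job_flow` request for a short-lived EMR cluster that uses the AWS Glue Data Catalog as its Hive metastore and terminates itself once the Spark step finishes. Bucket paths, instance types, and the release label are placeholder assumptions, not values from the video.

```python
# Minimal sketch: transient EMR cluster with the Glue Data Catalog as the
# external Hive metastore. All names and S3 paths below are hypothetical.

def build_transient_cluster_request(job_name, script_s3_path, log_s3_path):
    """Assemble the request dict that would be passed to emr.run_job_flow()."""
    glue_metastore = {
        "hive.metastore.client.factory.class":
            "com.amazonaws.glue.catalog.metastore."
            "AWSGlueDataCatalogHiveClientFactory"
    }
    return {
        "Name": job_name,
        "ReleaseLabel": "emr-6.10.0",  # example release, adjust as needed
        "LogUri": log_s3_path,
        "Applications": [{"Name": "Spark"}, {"Name": "Hive"}],
        # Point Hive and Spark at the Glue Data Catalog instead of the
        # embedded (local Derby) metastore, so table metadata survives
        # cluster termination.
        "Configurations": [
            {"Classification": "hive-site", "Properties": glue_metastore},
            {"Classification": "spark-hive-site", "Properties": glue_metastore},
        ],
        "Instances": {
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge",
                 "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m5.xlarge",
                 "InstanceCount": 2},
            ],
            # Transient behaviour: tear the cluster down when steps finish.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [{
            "Name": "batch-spark-job",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "--deploy-mode", "cluster",
                         script_s3_path],
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }

request = build_transient_cluster_request(
    "transient-batch-pipeline",
    "s3://my-bucket/jobs/batch_job.py",  # hypothetical script location
    "s3://my-bucket/emr-logs/",
)
# A Lambda trigger would then call (requires boto3 and AWS credentials):
#   boto3.client("emr").run_job_flow(**request)
```

`KeepJobFlowAliveWhenNoSteps: False` together with `ActionOnFailure: TERMINATE_CLUSTER` is what makes the cluster transient, so you pay only for the job's duration; and because the metastore lives in Glue, the tables remain queryable from Athena or the next cluster.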
Architecture:
--------------------------
[ Link ]
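The metadata-driven ingestion idea can be sketched as a small config table that one generic job loops over, so onboarding a new source is a metadata change rather than new code. The table layout, field names, and sources here are illustrative assumptions, not the framework shown in the video.

```python
# Illustrative sketch of metadata-driven ingestion: one generic loop,
# with per-source behaviour (format, path, partition columns) coming
# from metadata. Sources and field names are hypothetical.

INGESTION_METADATA = [
    {"source": "orders",    "format": "parquet",
     "input_path": "s3://raw/orders/",    "partition_by": ["order_date"]},
    {"source": "customers", "format": "csv",
     "input_path": "s3://raw/customers/", "partition_by": ["country"]},
]

def build_ingestion_plan(metadata):
    """Turn metadata rows into the read/write instructions a Spark job
    (or any engine) would execute; returned as plain dicts for clarity."""
    plan = []
    for row in metadata:
        plan.append({
            "read":  {"format": row["format"], "path": row["input_path"]},
            "write": {"format": "parquet",  # columnar output format
                      "path": f"s3://curated/{row['source']}/",
                      "partition_by": row["partition_by"]},
        })
    return plan

plan = build_ingestion_plan(INGESTION_METADATA)
for step in plan:
    print(step["read"]["path"], "->", step["write"]["path"])
```

In a real pipeline each plan entry would map to something like `spark.read.format(...).load(...)` followed by `df.write.partitionBy(...).parquet(...)`, but the planning layer itself stays engine-agnostic.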
Reference Links:
------------------------------
Transient Cluster on AWS from Scratch using boto3 | Trigger Spark job from AWS Lambda
[ Link ]
Using the AWS Glue Data Catalog as the metastore for Hive
[ Link ]
AWS Glue Data Catalog as the centralized metastore for Athena & PySpark from EMR
[ Link ]
Generic Framework to load data from an S3 External Stage into a Snowflake External Table
[ Link ]
Check this playlist for more Data Engineering related videos:
[ Link ]
Apache Kafka from scratch
[ Link ]
Snowflake Complete Course from scratch, with an End-to-End Project and in-depth explanation
[ Link ]
🙏🙏🙏🙏🙏🙏🙏🙏
YOU JUST NEED TO DO
3 THINGS to support my channel
LIKE
SHARE
&
SUBSCRIBE
TO MY YOUTUBE CHANNEL