
Explore Delta Lake

Delta Lake is an open-source storage layer that runs on top of existing data lake storage, such as HDFS or Amazon S3. It was originally developed by Databricks, the company behind a unified analytics platform for data engineering, data science, and machine learning, and is now maintained as an open-source project under the Linux Foundation.

Delta Lake provides ACID transactions, schema enforcement, and other data management features on top of data lakes, which are typically used for storing large volumes of unstructured and semi-structured data. By adding these features, Delta Lake makes data lakes more suitable for use cases where data quality, consistency, and reliability are important, such as data science, machine learning, and analytics.

Some of the key features of Delta Lake include:

ACID Transactions: Delta Lake provides transactional guarantees for both batch and streaming data. This means that data operations, such as inserts, updates, and deletes, are executed in an atomic, consistent, isolated, and durable (ACID) manner. ACID transactions ensure data integrity and consistency, even in the face of concurrent updates or failures.
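To make this concrete, here is a minimal PySpark sketch of transactional writes, updates, and deletes against a Delta table. It assumes a local Spark session with the delta-spark package installed (pip install delta-spark); the table path /tmp/delta/users is hypothetical.

from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip
from delta.tables import DeltaTable

# Configure a Spark session with the Delta Lake extensions.
builder = (
    SparkSession.builder.appName("delta-acid-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Each write is a single atomic commit to the Delta transaction log:
# it either becomes fully visible or is not visible at all.
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
df.write.format("delta").mode("overwrite").save("/tmp/delta/users")

# Updates and deletes are likewise committed transactionally.
users = DeltaTable.forPath(spark, "/tmp/delta/users")
users.update(condition="id = 1", set={"name": "'alicia'"})
users.delete("id = 2")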

Schema Enforcement: Delta Lake provides schema enforcement, which ensures that data adheres to a predefined schema. This feature allows for easier data validation, data quality checks, and data governance. Additionally, Delta Lake allows for schema evolution, meaning that the schema can be changed over time without disrupting existing data.
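For example, the short sketch below shows both behaviors, reusing the spark session and the hypothetical /tmp/delta/users table from the previous example: an append with an unexpected column is rejected, while the same append succeeds once schema evolution is explicitly enabled.

from pyspark.sql.utils import AnalysisException

# A row with an extra column that is not part of the table schema.
bad = spark.createDataFrame([(3, "carol", "vip")], ["id", "name", "note"])

try:
    # Schema enforcement rejects the mismatched append.
    bad.write.format("delta").mode("append").save("/tmp/delta/users")
except AnalysisException as err:
    print("Rejected by schema enforcement:", err)

# Opting in to schema evolution adds the new column instead of failing.
(bad.write.format("delta").mode("append")
    .option("mergeSchema", "true")
    .save("/tmp/delta/users"))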



Image source: https://delta.io/static/delta-hp-hero-bottom-46084c40468376aaecdedc066291e2d8.png

Time Travel: Delta Lake provides time travel, which allows for versioning and historical querying of data. This feature enables data analysts and data scientists to query data at different points in time, enabling them to track changes and understand trends over time.
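As a sketch (again assuming the spark session and hypothetical table path from the earlier examples, and a made-up timestamp), time travel is exposed through read options and the table's commit history:

from delta.tables import DeltaTable

# Read the table as of an earlier commit version...
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/users")

# ...or as of a point in time (this timestamp is hypothetical).
snapshot = (spark.read.format("delta")
            .option("timestampAsOf", "2024-01-01 00:00:00")
            .load("/tmp/delta/users"))

# The commit history of the table is queryable for auditing.
DeltaTable.forPath(spark, "/tmp/delta/users").history().show(truncate=False)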

Unified Batch and Streaming: Delta Lake supports both batch and streaming workloads, allowing for real-time data processing and analytics. This feature makes it easier to build end-to-end data pipelines that can handle both batch and streaming data, without requiring separate data storage or processing systems.
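The sketch below illustrates this with Spark Structured Streaming: a Delta table written by a streaming query can be read back either as a batch DataFrame or as a new stream. The built-in rate source, the paths, and the checkpoint location are all placeholders.

# Stream into a Delta table (the "rate" source generates demo rows).
events = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

query = (events.writeStream
         .format("delta")
         .option("checkpointLocation", "/tmp/delta/_checkpoints/events")
         .outputMode("append")
         .start("/tmp/delta/events"))

# The same table serves batch queries...
batch_df = spark.read.format("delta").load("/tmp/delta/events")

# ...and downstream streaming reads, without a separate storage system.
stream_df = spark.readStream.format("delta").load("/tmp/delta/events")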

Open Source: Delta Lake is an open-source project, meaning that it is freely available to use and contribute to. This allows organizations to customize and extend Delta Lake to meet their specific needs and use cases.

Overall, Delta Lake provides a more reliable and consistent data storage layer on top of data lakes, making it easier for organizations to manage and analyze their data effectively. It has become a popular choice for data science and analytics workloads, particularly for organizations that use the Apache Spark processing engine.
