Lakehouse Concept | Data Warehouse | Data Lake

Lakehouse concept:

In the context of data management and analytics, a "lakehouse" refers to a modern data architecture that combines the capabilities of data lakes and data warehouses. It aims to address the limitations and challenges associated with these traditional data storage and processing approaches.

A data lakehouse provides a unified and scalable platform for storing, managing, and analyzing large volumes of structured and unstructured data. It incorporates the following key features:

Data Storage: Like a data lake, a lakehouse stores raw, unprocessed data in its native format. This includes structured data (e.g., relational tables, CSV files), semi-structured data (e.g., logs, sensor data), and unstructured data (e.g., images). Because everything lands in a common storage layer, such as a distributed file system or cloud object storage, data can be ingested from various sources without upfront schema design or transformation.
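To make the "store first, interpret later" idea concrete, here is a minimal sketch using only the Python standard library. A local temporary directory stands in for the shared storage layer, and the names ingest_raw, read_with_schema, and the sample files are hypothetical, chosen purely for illustration.

```python
import csv
import json
import tempfile
from pathlib import Path


def ingest_raw(landing_dir: Path, name: str, payload: bytes) -> Path:
    """Store the payload byte-for-byte; nothing is parsed or validated on write."""
    landing_dir.mkdir(parents=True, exist_ok=True)
    target = landing_dir / name
    target.write_bytes(payload)
    return target


def read_with_schema(path: Path) -> list[dict]:
    """Interpret a raw file only at read time (schema-on-read)."""
    text = path.read_text(encoding="utf-8")
    if path.suffix == ".csv":
        return list(csv.DictReader(text.splitlines()))
    if path.suffix == ".jsonl":
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    raise ValueError(f"no reader registered for {path.suffix!r}")


# A temporary directory plays the role of the common storage layer (assumption).
landing = Path(tempfile.mkdtemp()) / "landing"
orders = ingest_raw(landing, "orders.csv", b"order_id,amount\n1,9.50\n2,20.00\n")
events = ingest_raw(landing, "events.jsonl", b'{"sensor": "t1", "temp": 21.5}\n')

print(read_with_schema(orders))
print(read_with_schema(events))
```

Note that the CSV values come back as strings: no types were imposed at ingestion, so any typing is up to whoever reads the data.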


Ref.: https://www.databricks.com/wp-content/uploads/2020/01/data-lakehouse-new-1024x538.png

ACID Transactions: Unlike traditional data lakes, a lakehouse provides ACID (Atomicity, Consistency, Isolation, Durability) transactional guarantees. This means that data can be updated, deleted, and queried reliably, ensuring consistency and data integrity. Concurrent readers and writers see a consistent view of each table, so batch and real-time jobs can safely work against the same data.
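One common way to get transactional behaviour on top of plain files is an append-only commit log: a write only becomes visible once a numbered log entry for it exists, and two writers cannot both claim the same number. The sketch below is a toy, single-machine illustration of that idea and not the implementation of any particular lakehouse product. ToyTable, its directory layout, and the file names are all hypothetical, and it ignores details such as a reader observing a half-written log entry.

```python
import json
import tempfile
import uuid
from pathlib import Path


class ToyTable:
    """A toy table: data files plus an append-only, numbered commit log."""

    def __init__(self, root: Path):
        self.data_dir = root / "data"
        self.log_dir = root / "_log"
        self.data_dir.mkdir(parents=True, exist_ok=True)
        self.log_dir.mkdir(parents=True, exist_ok=True)

    def latest_version(self) -> int:
        versions = [int(p.stem) for p in self.log_dir.glob("*.json")]
        return max(versions, default=-1)

    def commit(self, rows: list[dict], read_version: int) -> int:
        """Write a data file, then publish it by creating the next log entry."""
        version = read_version + 1
        # Unique data file name, so a losing writer never overwrites committed data.
        data_file = self.data_dir / f"part-{uuid.uuid4().hex}.json"
        data_file.write_text(json.dumps(rows))
        entry = self.log_dir / f"{version:05d}.json"
        try:
            # Mode "x" fails if the entry exists: only one writer wins a version.
            with open(entry, "x") as f:
                json.dump({"add": data_file.name}, f)
        except FileExistsError:
            data_file.unlink()  # never referenced by the log, so never visible
            raise RuntimeError(f"version {version} already committed; retry") from None
        return version

    def read(self) -> list[dict]:
        """Readers only see data files referenced by committed log entries."""
        rows = []
        for entry in sorted(self.log_dir.glob("*.json")):
            name = json.loads(entry.read_text())["add"]
            rows.extend(json.loads((self.data_dir / name).read_text()))
        return rows


table = ToyTable(Path(tempfile.mkdtemp()))
v0 = table.commit([{"id": 1}], read_version=table.latest_version())

# A writer that started from a stale version loses the race and must retry.
try:
    table.commit([{"id": 99}], read_version=-1)
    conflict = False
except RuntimeError:
    conflict = True

v1 = table.commit([{"id": 2}], read_version=table.latest_version())
print(table.read())
```

The rejected write leaves no trace: its data file is removed and it never appears in the log, which is the all-or-nothing behaviour the paragraph above describes.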

Schema Enforcement: A lakehouse supports both schema enforcement and schema evolution. A schema can be defined for a table so that incoming data which does not match it is rejected at ingestion, while schema evolution lets that schema change over time (for example, by adding a column). This makes it easier to maintain data quality, enforce governance policies, and enable self-service analytics.
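The difference between enforcement and evolution can be sketched in a few lines of plain Python. This is an illustration under simple assumptions (a schema is a mapping of column name to Python type, and every column is nullable), not how any specific engine implements it. SCHEMA, SchemaError, enforce, and evolve are hypothetical names.

```python
SCHEMA = {"order_id": int, "amount": float}


class SchemaError(ValueError):
    """Raised when a row does not match the declared schema."""


def enforce(schema: dict[str, type], row: dict) -> dict:
    """Reject rows whose columns or types do not match the schema."""
    unknown = set(row) - set(schema)
    if unknown:
        raise SchemaError(f"unexpected columns: {sorted(unknown)}")
    checked = {}
    for column, expected in schema.items():
        value = row.get(column)  # a missing column is treated as null
        if value is not None and not isinstance(value, expected):
            raise SchemaError(
                f"{column}: expected {expected.__name__}, got {type(value).__name__}"
            )
        checked[column] = value
    return checked


def evolve(schema: dict[str, type], column: str, column_type: type) -> dict[str, type]:
    """Schema evolution: add a new nullable column, leaving old rows valid."""
    if column in schema:
        raise SchemaError(f"column {column!r} already exists")
    return {**schema, column: column_type}


good = enforce(SCHEMA, {"order_id": 1, "amount": 9.5})

try:
    enforce(SCHEMA, {"order_id": "1", "amount": 9.5})  # wrong type for order_id
    rejected = False
except SchemaError:
    rejected = True

schema_v2 = evolve(SCHEMA, "currency", str)
old_row = enforce(schema_v2, {"order_id": 1, "amount": 9.5})
new_row = enforce(schema_v2, {"order_id": 2, "amount": 20.0, "currency": "EUR"})
print(good, rejected, old_row, new_row)
```

Rows written before the new column existed still validate under the evolved schema, with the new column read as null.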

Data Processing: A lakehouse incorporates data processing capabilities, typically using distributed processing frameworks like Apache Spark. This enables data transformation, cleansing, aggregation, and other data preparation tasks. The processing capabilities are integrated within the same platform, eliminating the need to move data between different systems.

Querying and Analytics: A lakehouse provides SQL-based querying capabilities, allowing users to perform ad-hoc and complex analytics directly on the stored data. It supports both batch processing and real-time streaming analytics, enabling organizations to derive insights and make data-driven decisions in a timely manner.
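As a rough idea of what an ad-hoc SQL query looks like, the sketch below runs an aggregation with Python's built-in sqlite3 module. SQLite is only a stand-in here so the example is runnable anywhere; in a lakehouse, the SQL engine would run the same kind of statement directly against tables in the shared storage layer. The orders table and its rows are made up for illustration.

```python
import sqlite3

# In-memory SQLite database as a stand-in for a lakehouse SQL engine (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "EU", 9.5), (2, "EU", 20.0), (3, "US", 5.0)],
)

# Ad-hoc aggregation: total revenue per region, largest first.
revenue_by_region = conn.execute(
    """
    SELECT region, SUM(amount) AS revenue
    FROM orders
    GROUP BY region
    ORDER BY revenue DESC
    """
).fetchall()

print(revenue_by_region)
conn.close()
```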

By combining the strengths of data lakes and data warehouses, a lakehouse architecture aims to provide a more flexible, scalable, and efficient approach to managing and analyzing data. It enables organizations to store and process data in a cost-effective manner while supporting a wide range of data analytics use cases.
