How to build an on-premise Data Lake? | Build your own Data Lake | Open Source Tools | On-Premise

  ‱ Published 27 Jul 2024
  ‱ In this video, we dive into the world of data lakes. A data lake is an essential component of the modern data stack. We previously built a data lake in the AWS environment using AWS S3, Glue, and Athena. What if we want to deploy our own data lake with open-source tools on our own infrastructure? Here we deploy an on-premise data lake using open-source technologies. This way we learn the technologies behind a data lake, and most cloud offerings use the same technologies under the hood. A short query sketch follows the chapter list below.
    What is Data Lake? aws.amazon.com/big-data/datal...
    Link to GitHub repo: github.com/hnawaz007/pythonda...
    đŸ’„Subscribe to our channel:
    / haqnawaz
    📌 Links
    -----------------------------------------
    #ïžâƒŁ Follow me on social media! #ïžâƒŁ
    🔗 GitHub: github.com/hnawaz007
    📾 Instagram: / bi_insights_inc
    📝 LinkedIn: / haq-nawaz
    🔗 / hnawaz100
    -----------------------------------------
    #dataanalytics #datalake #opensource
    Topics covered in this video:
    ==================================
    0:00 - Introduction to Data Lake
    1:36 - Tech Stack of on-premise Data Lake
    1:49 - Docker Containers Overview
    3:26 - Data Lake Configurations
    4:48 - Start Docker Containers
    5:59 - MinIO (S3) Bucket and File(s)
    6:53 - File mapping to SQL Table
    7:12 - Trino Cluster
    7:32 - Trino SQL Engine Connection
    8:37 - Create Schema
    9:03 - Create Table
    9:36 - Query External Table
    10:12 - SQL Analysis
    10:29 - Data Lake Tech Review
    11:51 - Coming Soon
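
    A hedged end-to-end sketch of the steps above (Trino connection, create schema, query). This is not the exact code from the video; the catalog name "minio", schema "sales", table "sales_data", and port 8080 are assumptions mirroring the walkthrough. It uses the Trino Python client (pip install trino):

    import trino  # Trino Python client

    # Connect to the Trino coordinator started by the docker-compose setup (7:32).
    conn = trino.dbapi.connect(host="localhost", port=8080, user="admin",
                               catalog="minio", schema="default")
    cur = conn.cursor()

    # Create a schema backed by a MinIO bucket (8:37)...
    cur.execute("CREATE SCHEMA IF NOT EXISTS minio.sales WITH (location = 's3a://sales/')")
    cur.fetchall()

    # ...then query an external table mapped to files in that bucket (9:36).
    cur.execute("SELECT COUNT(*) FROM minio.sales.sales_data")
    print(cur.fetchall())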
  ‱ Science & Technology

Comments ‱ 34

  • @alaab82
    @alaab82 1 month ago

    one of the best tutorials on YouTube, thank you so much!

  • @hernanlopezvergara6133
    @hernanlopezvergara6133 5 months ago

    Thank you very much for this short but very useful video!

  • @datawise.education
    @datawise.education 6 months ago

    I love all your videos Haq. Great work. :)

  • @rafaelg8238
    @rafaelg8238 3 months ago

    great video, congrats.

  • @wallacecamargo1043
    @wallacecamargo1043 8 months ago

    Thanksss!!!

  • @LucasRalambo-bp3vb
    @LucasRalambo-bp3vb 1 year ago

    This is very informative!!! Thank you...
    Can you please also make a video about creating an open-source version of Amazon Forecast?

    • @BiInsightsInc
      @BiInsightsInc  1 year ago +1

      Amazon Forecast is a time-series forecasting service based on machine learning (ML). We can certainly do it using open source. I will cover time-series forecasting in the future. In the meantime, check out ML predictive analytics in the following video: czcams.com/video/TR6vn4lZ3Mo/video.html&t

  • @TheMahardiany
    @TheMahardiany 1 year ago +1

    Thank you for the video, great as always 🎉
    I want to ask: in this video, when we use Trino as the query engine, can we use DML and even DDL on that external table,
    or can we only select from it?
    Thank you

    • @BiInsightsInc
      @BiInsightsInc  1 year ago +1

      There are a number of limitations on DML in Hive. Please read the documentation for more details: cwiki.apache.org/confluence/display/Hive/Hive+Transactions. It's recommended not to use DML on Hive managed tables, especially if the data volume is huge, as these operations become too slow. DML operations are considerably faster when done on a partition/bucket instead of the full table. Nevertheless, it is better to handle edits in the files and do a full refresh via the external table, and only use DML on managed tables as a last resort. As for DDL: we define the table via DDL, so yes.
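
      For illustration, a minimal sketch of such DDL through the Trino Python client (pip install trino); the catalog "minio", schema "sales", and bucket path are assumptions, not names confirmed by the video:

      import trino

      conn = trino.dbapi.connect(host="localhost", port=8080, user="admin",
                                 catalog="minio", schema="sales")
      cur = conn.cursor()

      # DDL: map files already sitting in the MinIO bucket to an external table.
      # Note: Trino's Hive connector requires all columns of a CSV table to be VARCHAR.
      cur.execute("""
          CREATE TABLE IF NOT EXISTS sales_data (
              order_id VARCHAR,
              amount   VARCHAR
          )
          WITH (
              external_location = 's3a://sales/',
              format = 'CSV'
          )
      """)
      cur.fetchall()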

  • @zera215
    @zera215 4 months ago

    You mentioned to someone that Apache Iceberg could be an alternative to Hive. Would you be interested in recording a new video about it?

    • @BiInsightsInc
      @BiInsightsInc  4 months ago

      @zera215 I have covered Apache Iceberg and how to use it in a similar setup in the following video: czcams.com/video/vnNHDylGtEk/video.html

    • @zera215
      @zera215 4 months ago

      @@BiInsightsInc I am looking for a fully open-source solution. Do you know if I can just swap Hive for Iceberg in the architecture of this video?

    • @BiInsightsInc
      @BiInsightsInc  4 months ago

      @@zera215 You can use Hive and Iceberg together. You still need a metastore in order to work with Iceberg. Here is an example of how to use them together.
      iceberg.apache.org/hive-quickstart/
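
      As a rough illustration of the pairing (catalog and table names are assumptions): a Trino catalog using the Iceberg connector can point at the same Hive metastore, and Iceberg tables then accept row-level DML that plain Hive external tables do not:

      import trino

      conn = trino.dbapi.connect(host="localhost", port=8080, user="admin",
                                 catalog="iceberg", schema="sales")
      cur = conn.cursor()

      # Iceberg tables created through Trino support row-level DML.
      cur.execute("CREATE TABLE IF NOT EXISTS sales_iceberg "
                  "(order_id VARCHAR, amount DOUBLE) WITH (format = 'PARQUET')")
      cur.fetchall()
      cur.execute("DELETE FROM sales_iceberg WHERE order_id = '42'")
      cur.fetchall()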

    • @zera215
      @zera215 4 months ago

      @@BiInsightsInc Thank you, and congrats for your great work =-D

  • @hungnguyenthanh4101
    @hungnguyenthanh4101 11 months ago

    How do we build a data lakehouse? Please make the next video on the lakehouse topic.❀

    • @BiInsightsInc
      @BiInsightsInc  11 months ago

      Yes, the data lakehouse is on my radar. I will cover it in future videos.

  • @akaile2233
    @akaile2233 11 months ago +1

    Thank you for the video.
    Hi sir, if I want to use Spark to save data to the data lake you built, how do I do that? (I just started learning about data lakes and Spark.)

    • @BiInsightsInc
      @BiInsightsInc  11 months ago +2

      Below is sample code that reads a CSV from a MinIO bucket with Spark and writes it back to the lake as Parquet.

      package com.medium.scala.sparkbasics

      import com.amazonaws.SDKGlobalConfiguration
      import org.apache.spark.sql.SparkSession

      object MinIORead_Medium extends App {
        // Skip TLS certificate checks; only sensible for a local test setup.
        System.setProperty(SDKGlobalConfiguration.DISABLE_CERT_CHECKING_SYSTEM_PROPERTY, "true")

        lazy val spark = SparkSession.builder().appName("MinIOTest").master("local[*]").getOrCreate()

        // Default credentials and endpoint of a local MinIO container.
        val s3accessKeyAws = "minioadmin"
        val s3secretKeyAws = "minioadmin"
        val connectionTimeOut = "600000"
        val s3endPointLoc: String = "127.0.0.1:9000"

        // Point the s3a filesystem at MinIO instead of AWS S3.
        val hadoopConf = spark.sparkContext.hadoopConfiguration
        hadoopConf.set("fs.s3a.endpoint", s3endPointLoc)
        hadoopConf.set("fs.s3a.access.key", s3accessKeyAws)
        hadoopConf.set("fs.s3a.secret.key", s3secretKeyAws)
        hadoopConf.set("fs.s3a.connection.timeout", connectionTimeOut)
        hadoopConf.set("fs.s3a.path.style.access", "true") // MinIO serves path-style URLs
        hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
        // A local MinIO serves plain HTTP by default; set to "true" if yours has TLS.
        hadoopConf.set("fs.s3a.connection.ssl.enabled", "false")

        val yourBucket: String = "minio-test-bucket"
        val inputPath: String = s"s3a://$yourBucket/data.csv"
        val outputPath = s"s3a://$yourBucket/output_data" // written as a Parquet directory

        // Read the CSV from the bucket...
        val df = spark.read
          .option("header", "true")
          .csv(inputPath)

        // ...and write it back to the data lake as Parquet.
        df.write
          .mode("overwrite")
          .parquet(outputPath)
      }

    • @akaile2233
      @akaile2233 11 months ago

      @@BiInsightsInc Hi sir, I did everything like in your video and it worked fine, but when I remove the schema 'sales' there is an 'access denied' error?

    • @BiInsightsInc
      @BiInsightsInc  11 months ago

      @@akaile2233 You cannot delete objects from the Trino engine. You can do so in the Hive metastore. In this example, we're using MariaDB, so you can connect to it and delete objects from there. The changes will be reflected in the mappings you see in Trino.

    • @akaile2233
      @akaile2233 10 months ago

      @@BiInsightsInc Sorry to bother, there are too many tables in metastore_db, which ones should I delete?

  • @user-wd1od9cu1g
    @user-wd1od9cu1g 1 year ago

    Thank you. Can we build a transactional data lake using Iceberg/Hudi on this MinIO storage?

    • @BiInsightsInc
      @BiInsightsInc  1 year ago

      Yes, you can build a data lake using Iceberg and MinIO. Here is a guide that showcases both of these tools in conjunction.
      resources.min.io/c/lakehouse-architecture-with-iceberg-minio

  • @anujsharma4011
    @anujsharma4011 1 year ago

    Can we directly connect Trino with S3, with no Hive in between? I want to install Trino on EC2.

    • @BiInsightsInc
      @BiInsightsInc  1 year ago

      I'm afraid not. Trino needs the table schema/metadata, and that's managed by the Hive metastore. Alternatively, we can use Apache Iceberg, but we still need the table mappings before the Trino query engine can access the data stored in S3.
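
      For reference, the glue between Trino, the metastore, and S3/MinIO is just a catalog properties file. A minimal sketch, assuming container hostnames hive-metastore and minio and default MinIO credentials:

      # etc/catalog/minio.properties (hypothetical file name)
      connector.name=hive
      hive.metastore.uri=thrift://hive-metastore:9083
      hive.s3.endpoint=http://minio:9000
      hive.s3.aws-access-key=minioadmin
      hive.s3.aws-secret-key=minioadmin
      hive.s3.path-style-access=true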

  • @juliovalentim6178
    @juliovalentim6178 1 year ago

    Sorry for the noob question, but can I create a data lake like this on a PowerEdge T550 server instead of my desktop or laptop, without resorting to paid cloud services?

    • @BiInsightsInc
      @BiInsightsInc  1 year ago +1

      Yes, you can create this setup on your server. That way you use your own infrastructure and avoid paid services and exposing data to outside services.

    • @juliovalentim6178
      @juliovalentim6178 1 year ago

      @@BiInsightsInc Hi! Thank you very much for answering my question. And would you be able to tell me what RAM, cache and SSD requirements I need to have on the server to implement this setup, without slowing down processing for Data Science?

    • @BiInsightsInc
      @BiInsightsInc  1 year ago

      @@juliovalentim6178 The hardware requirements ultimately depend on the amount of data you are processing, and you can tweak them once you perform tests with actual data. Anyway, here are some recommendations from MinIO. The first is an actual data lake deployment you can use for reference. The second link is for a production-scale data lake. Hope this helps.
      blog.min.io/building-an-on-premise-ml-ecosystem-with-minio-powered-by-presto-r-and-s3select-feature/
      min.io/product/reference-hardware

    • @juliovalentim6178
      @juliovalentim6178 1 year ago

      @@BiInsightsInc Of course it helped! Thank you so much again. Congratulations on the excellent content of your channel. I will always be following. Best Regards!

  • @oscardelacruz3087
    @oscardelacruz3087 4 months ago

    Hi, Haq. Nice video. I'm trying to make it work but cannot load the MinIO catalog.

    • @BiInsightsInc
      @BiInsightsInc  4 months ago

      You are not able to connect to MinIO in DBeaver? What's the error you receive there?

    • @oscardelacruz3087
      @oscardelacruz3087 4 months ago

      @@BiInsightsInc Hi Haq, thanks. I finally connected MinIO and Trino. But I have a question: how deep can Trino read Parquet files? I am trying to read Parquet files from MinIO with the directory structure s3a://datalake/bronze/erp/customers/. Inside the customers folder I have folders for each year/month/day. When I try to read the files, Trino returns 0 rows.