How to build an on-premise Data Lake? | Build your own Data Lake | Open Source Tools | On-Premise

  ‱ Published 27 Jul 2024
  ‱ In this video, we dive into the world of data lakes. A data lake is an essential component of the modern data stack. We previously built a data lake in the AWS environment using AWS S3, Glue, and Athena. What if we want to deploy our own data lake with open-source tools on our own infrastructure? Here we deploy an on-premise data lake using open-source technologies. This way we learn the technologies behind a data lake, and most cloud offerings use the same technologies under the hood. A short query sketch follows the chapter list below.
    What is Data Lake? aws.amazon.com/big-data/datal...
    Link to GitHub repo: github.com/hnawaz007/pythonda...
    đŸ’„Subscribe to our channel:
    / haqnawaz
    📌 Links
    -----------------------------------------
    #ïžâƒŁ Follow me on social media! #ïžâƒŁ
    🔗 GitHub: github.com/hnawaz007
    📾 Instagram: / bi_insights_inc
    📝 LinkedIn: / haq-nawaz
    🔗 / hnawaz100
    -----------------------------------------
    #dataanalytics #datalake #opensource
    Topics covered in this video:
    ==================================
    0:00 - Introduction to Data Lake
    1:36 - Tech Stack of on-premise Data Lake
    1:49 - Docker Containers Overview
    3:26 - Data Lake Configurations
    4:48 - Start Docker Containers
    5:59 - MinIO (S3) Bucket and File(s)
    6:53 - File mapping to SQL Table
    7:12 - Trino Cluster
    7:32 - Trino SQL Engine Connection
    8:37 - Create Schema
    9:03 - Create Table
    9:36 - Query External Table
    10:12 - SQL Analysis
    10:29 - Data Lake Tech Review
    11:51 - Coming Soon
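
    A hedged end-to-end sketch of the steps above (Trino connection, create schema, query). This is not the exact code from the video; the catalog name "minio", schema "sales", table "sales_data", and port 8080 are assumptions mirroring the walkthrough. It uses the Trino Python client (pip install trino):

    import trino  # Trino Python client

    # Connect to the Trino coordinator started by the docker-compose setup (7:32).
    conn = trino.dbapi.connect(host="localhost", port=8080, user="admin",
                               catalog="minio", schema="default")
    cur = conn.cursor()

    # Create a schema backed by a MinIO bucket (8:37)...
    cur.execute("CREATE SCHEMA IF NOT EXISTS minio.sales WITH (location = 's3a://sales/')")
    cur.fetchall()

    # ...then query an external table mapped to files in that bucket (9:36).
    cur.execute("SELECT COUNT(*) FROM minio.sales.sales_data")
    print(cur.fetchall())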
  ‱ Science & Technology

Comments ‱ 34

  • @alaab82
    @alaab82 1 month ago

    one of the best tutorials on YouTube, thank you so much!

  • @hernanlopezvergara6133
    @hernanlopezvergara6133 5 months ago

    Thank you very much for this short but very useful video!

  • @datawise.education
    @datawise.education 6 months ago

    I love all your videos Haq. Great work. :)

  • @rafaelg8238
    @rafaelg8238 3 months ago

    great video, congrats.

  • @wallacecamargo1043
    @wallacecamargo1043 8 months ago

    Thanksss!!!

  • @LucasRalambo-bp3vb
    @LucasRalambo-bp3vb 1 year ago

    This is very informative!!! Thank you...
    Can you please also make a video about creating an open-source version of Amazon Forecast?

    • @BiInsightsInc
      @BiInsightsInc  1 year ago +1

      Amazon Forecast is a time-series forecasting service based on machine learning (ML). We can certainly do it using open source. I will cover time-series forecasting in the future. In the meantime, check out ML predictive analytics in the following video: czcams.com/video/TR6vn4lZ3Mo/video.html&t

  • @TheMahardiany
    @TheMahardiany 1 year ago +1

    Thank you for the video, great as always 🎉
    I want to ask: in this video, when we use Trino as the query engine, can we use DML and even DDL on that external table,
    or can we only select from it?
    Thank you

    • @BiInsightsInc
      @BiInsightsInc  1 year ago +1

      There are a number of limitations on DML in Hive. Please read the documentation for more details: cwiki.apache.org/confluence/display/Hive/Hive+Transactions. It's recommended not to use DML on Hive managed tables, especially if the data volume is huge, as these operations become too slow. DML operations are considerably faster when done on a partition/bucket instead of the full table. Nevertheless, it is better to handle edits in the files and do a full refresh via the external table, and only use DML on managed tables as a last resort. As for DDL: we define the table via DDL, so yes.
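
      For illustration, a minimal sketch of such DDL through the Trino Python client (pip install trino); the catalog "minio", schema "sales", and bucket path are assumptions, not names confirmed by the video:

      import trino

      conn = trino.dbapi.connect(host="localhost", port=8080, user="admin",
                                 catalog="minio", schema="sales")
      cur = conn.cursor()

      # DDL: map files already sitting in the MinIO bucket to an external table.
      # Note: Trino's Hive connector requires all columns of a CSV table to be VARCHAR.
      cur.execute("""
          CREATE TABLE IF NOT EXISTS sales_data (
              order_id VARCHAR,
              amount   VARCHAR
          )
          WITH (
              external_location = 's3a://sales/',
              format = 'CSV'
          )
      """)
      cur.fetchall()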

  • @zera215
    @zera215 4 months ago

    You mentioned to someone that Apache Iceberg could be an alternative to Hive. Would you be interested in recording a new video about it?

    • @BiInsightsInc
      @BiInsightsInc  4 months ago

      @zera215 I have covered Apache Iceberg and how to use it in a similar setup in the following video: czcams.com/video/vnNHDylGtEk/video.html

    • @zera215
      @zera215 4 months ago

      @@BiInsightsInc I am looking for a fully open-source solution. Do you know if I can just swap Hive for Iceberg in the architecture of this video?

    • @BiInsightsInc
      @BiInsightsInc  4 months ago

      @@zera215 You can use Hive and Iceberg together. You still need a metastore in order to work with Iceberg. Here is an example of how to use them together.
      iceberg.apache.org/hive-quickstart/
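
      As a rough illustration of the pairing (catalog and table names are assumptions): a Trino catalog using the Iceberg connector can point at the same Hive metastore, and Iceberg tables then accept row-level DML that plain Hive external tables do not:

      import trino

      conn = trino.dbapi.connect(host="localhost", port=8080, user="admin",
                                 catalog="iceberg", schema="sales")
      cur = conn.cursor()

      # Iceberg tables created through Trino support row-level DML.
      cur.execute("CREATE TABLE IF NOT EXISTS sales_iceberg "
                  "(order_id VARCHAR, amount DOUBLE) WITH (format = 'PARQUET')")
      cur.fetchall()
      cur.execute("DELETE FROM sales_iceberg WHERE order_id = '42'")
      cur.fetchall()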

    • @zera215
      @zera215 4 months ago

      @@BiInsightsInc Thank you, and congrats for your great work =-D

  • @hungnguyenthanh4101
    @hungnguyenthanh4101 11 months ago

    How do we build a data lakehouse? Please make the next video on the lakehouse topic.❀

    • @BiInsightsInc
      @BiInsightsInc  11 months ago

      Yes, the data lakehouse is on my radar. I will cover it in future videos.

  • @akaile2233
    @akaile2233 11 months ago +1

    Thank you for the video.
    Hi sir, if I want to use Spark to save data to the data lake you built, how do I do that? (I just started learning about data lakes and Spark.)

    • @BiInsightsInc
      @BiInsightsInc  11 months ago +2

      Below is sample code that reads a CSV from a MinIO bucket with Spark and writes it back to the lake as Parquet.

      package com.medium.scala.sparkbasics

      import com.amazonaws.SDKGlobalConfiguration
      import org.apache.spark.sql.SparkSession

      object MinIORead_Medium extends App {
        // Skip TLS certificate checks; only sensible for a local test setup.
        System.setProperty(SDKGlobalConfiguration.DISABLE_CERT_CHECKING_SYSTEM_PROPERTY, "true")

        lazy val spark = SparkSession.builder().appName("MinIOTest").master("local[*]").getOrCreate()

        // Default credentials and endpoint of a local MinIO container.
        val s3accessKeyAws = "minioadmin"
        val s3secretKeyAws = "minioadmin"
        val connectionTimeOut = "600000"
        val s3endPointLoc: String = "127.0.0.1:9000"

        // Point the s3a filesystem at MinIO instead of AWS S3.
        val hadoopConf = spark.sparkContext.hadoopConfiguration
        hadoopConf.set("fs.s3a.endpoint", s3endPointLoc)
        hadoopConf.set("fs.s3a.access.key", s3accessKeyAws)
        hadoopConf.set("fs.s3a.secret.key", s3secretKeyAws)
        hadoopConf.set("fs.s3a.connection.timeout", connectionTimeOut)
        hadoopConf.set("fs.s3a.path.style.access", "true") // MinIO serves path-style URLs
        hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
        // A local MinIO serves plain HTTP by default; set to "true" if yours has TLS.
        hadoopConf.set("fs.s3a.connection.ssl.enabled", "false")

        val yourBucket: String = "minio-test-bucket"
        val inputPath: String = s"s3a://$yourBucket/data.csv"
        val outputPath = s"s3a://$yourBucket/output_data" // written as a Parquet directory

        // Read the CSV from the bucket...
        val df = spark.read
          .option("header", "true")
          .csv(inputPath)

        // ...and write it back to the data lake as Parquet.
        df.write
          .mode("overwrite")
          .parquet(outputPath)
      }

    • @akaile2233
      @akaile2233 11 months ago

      @@BiInsightsInc Hi sir, I did everything like in your video and it worked fine, but when I remove the schema 'sales' there is an 'access denied' error?

    • @BiInsightsInc
      @BiInsightsInc  11 months ago

      @@akaile2233 You cannot delete objects from the Trino engine. You can do so in the Hive metastore. In this example, we're using MariaDB, so you can connect to it and delete objects from there. The changes will be reflected in the mappings you see in Trino.

    • @akaile2233
      @akaile2233 10 months ago

      @@BiInsightsInc Sorry to bother, there are too many tables in metastore_db, which ones should I delete?

  • @user-wd1od9cu1g
    @user-wd1od9cu1g 1 year ago

    Thank you. Can we build a transactional data lake using Iceberg/Hudi on this MinIO storage?

    • @BiInsightsInc
      @BiInsightsInc  1 year ago

      Yes, you can build a data lake using Iceberg and MinIO. Here is a guide that showcases both of these tools in conjunction.
      resources.min.io/c/lakehouse-architecture-with-iceberg-minio

  • @anujsharma4011
    @anujsharma4011 1 year ago

    Can we directly connect Trino with S3, with no Hive in between? I want to install Trino on EC2.

    • @BiInsightsInc
      @BiInsightsInc  1 year ago

      I'm afraid not. Trino needs the table schema/metadata, and that's managed by the Hive metastore. Alternatively, we can use Apache Iceberg, but we still need the table mappings before the Trino query engine can access the data stored in S3.
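
      For reference, the glue between Trino, the metastore, and S3/MinIO is just a catalog properties file. A minimal sketch, assuming container hostnames hive-metastore and minio and default MinIO credentials:

      # etc/catalog/minio.properties (hypothetical file name)
      connector.name=hive
      hive.metastore.uri=thrift://hive-metastore:9083
      hive.s3.endpoint=http://minio:9000
      hive.s3.aws-access-key=minioadmin
      hive.s3.aws-secret-key=minioadmin
      hive.s3.path-style-access=true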

  • @juliovalentim6178
    @juliovalentim6178 1 year ago

    Sorry for the noob question, but can I create a data lake like this on a PowerEdge T550 server instead of my desktop or laptop, without resorting to paid cloud services?

    • @BiInsightsInc
      @BiInsightsInc  1 year ago +1

      Yes, you can create this setup on your server. That way you use your own infrastructure and avoid paid services and exposing data to outside services.

    • @juliovalentim6178
      @juliovalentim6178 1 year ago

      @@BiInsightsInc Hi! Thank you very much for answering my question. And would you be able to tell me what RAM, cache and SSD requirements I need to have on the server to implement this setup, without slowing down processing for Data Science?

    • @BiInsightsInc
      @BiInsightsInc  1 year ago

      @@juliovalentim6178 The hardware requirements ultimately depend on the amount of data you are processing, and you can tweak them once you perform tests with actual data. Anyway, here are some recommendations from MinIO. The first is an actual data lake deployment you can use for reference. The second link is for a production-scale data lake. Hope this helps.
      blog.min.io/building-an-on-premise-ml-ecosystem-with-minio-powered-by-presto-r-and-s3select-feature/
      min.io/product/reference-hardware

    • @juliovalentim6178
      @juliovalentim6178 1 year ago

      @@BiInsightsInc Of course it helped! Thank you so much again. Congratulations on the excellent content of your channel. I will always be following. Best Regards!

  • @oscardelacruz3087
    @oscardelacruz3087 4 months ago

    Hi, Haq. Nice video. I'm trying to make it work but cannot load the MinIO catalog.

    • @BiInsightsInc
      @BiInsightsInc  4 months ago

      You are not able to connect to MinIO in DBeaver? What's the error you receive there?

    • @oscardelacruz3087
      @oscardelacruz3087 4 months ago

      @@BiInsightsInc Hi Haq, thanks. I finally connected MinIO and Trino. But I have a question: how deep can Trino read Parquet files? I am trying to read Parquet files from MinIO with the directory structure s3a://datalake/bronze/erp/customers/. Inside the customers folder I have folders for each year/month/day. When I try to read the files, Trino returns 0 rows.