ETL | AWS Glue | AWS S3 | Load Data from AWS S3 to Amazon Redshift

  • Published 14. 08. 2021
  • This video shows a complete end-to-end ETL demo with the AWS Glue service, where data is loaded from an AWS S3 bucket to Amazon Redshift. It includes a clean architectural explanation, a part-by-part walkthrough, and a demo of configuring the complete automation setup (a sketch of such a Glue job script appears below).
    Please find the relevant files in the repo: github.com/RekhuGopal/PythonH...
  • Science & Technology
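
A minimal sketch of the kind of Glue job script the demo builds, assuming a hypothetical catalog database etl_demo_db, a crawled S3 source table source_csv, a Redshift table registered in the catalog as public_target_table, and a hypothetical temp S3 path (none of these names come from the video or the repo):

```python
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# Standard Glue job bootstrap.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the source table that the S3 crawler registered in the Data Catalog.
source = glue_context.create_dynamic_frame.from_catalog(
    database="etl_demo_db",       # hypothetical catalog database
    table_name="source_csv",      # hypothetical crawled S3 table
    transformation_ctx="source",  # lets job bookmarks track this read
)

# Write into Redshift through the catalog entry created by the Redshift crawler.
glue_context.write_dynamic_frame.from_catalog(
    frame=source,
    database="etl_demo_db",
    table_name="public_target_table",                      # hypothetical Redshift table
    redshift_tmp_dir="s3://my-temp-bucket/redshift-tmp/",  # hypothetical staging dir
)
job.commit()
```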

Comments • 133

  • @nikhilrao1266
    @nikhilrao1266 2 years ago +2

    This is awesome. I subbed. Hope to see more videos on AWS.

    • @cloudquicklabs
      @cloudquicklabs  2 years ago

      Thank you very much for the encouraging words and for watching my videos.
      Looking forward to contributing more value to the AWS community.

  • @olorunfemirichard5976
    @olorunfemirichard5976 2 years ago +1

    Great presentation. You speak exactly like someone I met at Generation & I.

  • @funforkids9927
    @funforkids9927 1 year ago +1

    This is a very useful video with a deep level of understanding. Thank you so much for these kinds of videos. Looking forward to upcoming videos.

    • @cloudquicklabs
      @cloudquicklabs  1 year ago

      Thank you for watching my videos.
      And thank you very much for the encouraging words. Keep watching and keep learning from my videos.

  • @senthilkumar5129
    @senthilkumar5129 2 years ago +1

    Clean explanation. Thank you, keep uploading more.

    • @cloudquicklabs
      @cloudquicklabs  2 years ago

      Thank you for watching my videos.
      Keep watching, lots of videos are coming ahead.

  • @behnamvakili1638
    @behnamvakili1638 2 years ago +1

    It was a great demo. Thank you.

    • @cloudquicklabs
      @cloudquicklabs  2 years ago

      You are welcome. Thank you for watching my videos.

  • @90daner
    @90daner 9 months ago +1

    A wonderful video! Thanks man!

    • @cloudquicklabs
      @cloudquicklabs  9 months ago

      Thank you for watching my videos.
      Glad that it helped you.

  • @sriharsharompicherla9704
    @sriharsharompicherla9704 2 years ago +1

    Awesome session, very helpful 💐

    • @cloudquicklabs
      @cloudquicklabs  2 years ago

      Thank you for watching my videos.
      Please do share it with your friends as well.

  • @canye1662
    @canye1662 2 years ago +1

    Nice video and explanation.

  • @artijadon9734
    @artijadon9734 1 year ago +2

    Thanks for the demo. Have you created the (Setting the ETL Infrastructure using Terraform) demo for the same?

    • @cloudquicklabs
      @cloudquicklabs  1 year ago

      Thank you for watching my videos.
      I set up the infrastructure manually here, but it's a good idea to manage it with Infrastructure as Code, i.e. Terraform. I shall keep this in mind for my next video.

  • @gourbhoi722
    @gourbhoi722 11 months ago +1

    Thanks man

    • @cloudquicklabs
      @cloudquicklabs  11 months ago

      Thank you for watching my videos.
      Glad that it helped you.

  • @DanielWeikert
    @DanielWeikert 2 years ago +1

    Hi,
    when exactly do I need to set up a route table and NAT gateway for using Glue? Can you elaborate on that? Can I also use Glue and crawlers without the route table as long as the roles are correct?
    Thanks

    • @cloudquicklabs
      @cloudquicklabs  2 years ago

      Thank you for watching my videos.
      Yes, please make sure that all traffic from the internet is allowed.
      Then create the Glue crawlers and jobs, and it works as expected.

  • @nachiket26
    @nachiket26 2 years ago +10

    This demo is good, but there are two things missing here. Only the incremental data should be loaded into Redshift when the job runs again; in your case there is duplication. Another thing you could have shown is making changes on the fly in the Excel sheet, which invokes the crawler and automatically runs the job.

    • @cloudquicklabs
      @cloudquicklabs  2 years ago +3

      Thank you for watching my videos.
      These are indeed valuable inputs.
      I am planning to make the next version of this video where these inputs are taken care of.
      Keep watching and happy learning.

    • @ManojKumar-vp1zj
      @ManojKumar-vp1zj 1 year ago

      right

    • @akshaymuktiramrodge3233
      @akshaymuktiramrodge3233 8 months ago +1

      Hi brother, if you have any update on how we can load only incremental data, please share it with me....

    • @cloudquicklabs
      @cloudquicklabs  8 months ago

      Thank you for the reminder.
      I am on it; I shall upload it soon.

    • @akshaymuktiramrodge3233
      @akshaymuktiramrodge3233 8 months ago

      @@cloudquicklabs ok

  • @gilbertyoungjr4898
    @gilbertyoungjr4898 2 years ago +2

    What if you have multiple tables? Do you need to create a Redshift crawler for each table??

    • @cloudquicklabs
      @cloudquicklabs  2 years ago +2

      Thank you for watching my videos.
      With the current capability of AWS Glue crawlers, you need to create multiple crawlers when you have multiple tables.

  • @sandeepreddy9689
    @sandeepreddy9689 24 days ago +1

    Hi,
    @cloudquicklabs What was the use of the second crawler? Is it only run once, to get the schema of Redshift into the temp database, and later it won't run at any time?

    • @cloudquicklabs
      @cloudquicklabs  18 days ago

      Thank you for watching my videos.
      It's to load the data from S3 into the AWS Glue Data Catalog and then on to Amazon Redshift.
      Did you watch the second version of this video? czcams.com/video/RGSKeK9xow0/video.htmlsi=FB_1BXVQp-SnfUtq

  • @kishlayamourya3141
    @kishlayamourya3141 2 years ago +3

    Awesome video!!
    I have a query:
    I want to push S3 data (CSV) to Redshift tables.
    Can I somehow use the table schema created by the crawler to create the table in Redshift?
    In every tutorial the instructor first creates a table in Redshift by hand, then uses the crawler again to create the schema in Glue, then pushes the data to Redshift... so what is the use of creating the schema using the crawler?

    • @cloudquicklabs
      @cloudquicklabs  2 years ago +3

      Thank you for watching my videos.
      This is a really good question. Glue crawlers work based on a mapping between a source (pre-configured details about the source) and a destination (pre-configured details about the destination), and they expect both parts to be present before the job starts pumping the data.
      Your requirement is more of a custom and dynamic one; in this case you might need to rely on AWS Lambda, where you can dynamically create the table and upload the data.
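
A sketch of that Lambda-based idea (not code from the video): read the schema the crawler inferred with the Glue API, build a CREATE TABLE statement, and run it through the Redshift Data API. Database, table, cluster and secret names are hypothetical, and the type mapping is deliberately rough:

```python
import boto3

glue = boto3.client("glue")
rsd = boto3.client("redshift-data")

# Very rough Glue-type -> Redshift-type mapping; extend as needed.
TYPE_MAP = {"string": "VARCHAR(256)", "bigint": "BIGINT", "int": "INTEGER", "double": "FLOAT8"}

def lambda_handler(event, context):
    # Read the schema the crawler inferred for the S3 CSV.
    table = glue.get_table(DatabaseName="etl_demo_db", Name="source_csv")["Table"]
    cols = ", ".join(
        f'{c["Name"]} {TYPE_MAP.get(c["Type"], "VARCHAR(256)")}'
        for c in table["StorageDescriptor"]["Columns"]
    )
    # Create the Redshift table from that schema via the Redshift Data API.
    rsd.execute_statement(
        ClusterIdentifier="redshift-cluster-1",  # hypothetical cluster
        Database="dev",
        SecretArn="arn:aws:secretsmanager:us-east-1:123456789012:secret:redshift-creds",
        Sql=f"CREATE TABLE IF NOT EXISTS public.source_csv ({cols});",
    )
```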

    • @MrMadmaggot
      @MrMadmaggot 1 year ago

      There must be a way to get the schema easily without having to use many resources, but I haven't found it yet; what about you?
      I know that Pandas has a function that helps with that; maybe I can download the data, read a chunk and create the schema. But what if the data is really big, like terabytes? No way I'm going to download that.
      As far as I'm concerned, the crawler gives you a data schema from the CSV which you can use.
      Right now I will try to upload a CSV to S3. The Redshift cluster is connected to that S3, which means I can copy the CSV from S3 to a table in Redshift, but I want to see if the CSV already appears in the DB of the Redshift cluster.

  • @pallavkan
    @pallavkan 1 year ago +2

    Appreciate your efforts! But you have not explained the significance of creating the Redshift crawler, or why a security group is required with the same name... I would request you to go step by step to make these things easier to understand. Patience!

    • @cloudquicklabs
      @cloudquicklabs  1 year ago

      Thank you for watching my videos.
      I appreciate your patience, and the valuable feedback is taken. Keep learning. :)

  • @shadabbarmare7797
    @shadabbarmare7797 7 months ago +1

    It was a good demo... thank you so much. Just a small question:
    why didn't we use an S3 trigger to the Lambda function when a new file is added or updated, instead of EventBridge?

    • @cloudquicklabs
      @cloudquicklabs  7 months ago

      Thank you for watching my videos.
      This is the custom solution designed here. But indeed, we can use the S3 object-create event to trigger the workflow as well.
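
A sketch of the S3-trigger alternative mentioned in the reply, assuming a hypothetical bucket and function ARN; the function also needs a resource-based permission (lambda add-permission) so S3 may invoke it:

```python
import boto3

s3 = boto3.client("s3")

# Send "object created" events from the bucket straight to the Lambda function.
s3.put_bucket_notification_configuration(
    Bucket="my-etl-source-bucket",  # hypothetical
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:start-etl",
                "Events": ["s3:ObjectCreated:*"],
            }
        ]
    },
)
```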

  • @vijaykumarreddyt3287
    @vijaykumarreddyt3287 1 month ago +1

    I am getting "Account *** is denied access" while creating crawlers. I tried signing in afresh; it's not helping. Also, while modifying inbound rules for security groups, I selected allow-all for the types.

    • @cloudquicklabs
      @cloudquicklabs  1 month ago

      Thank you for watching my videos.
      AWS might have disabled the crawler or ETL job in your account; raising the service quota limit or raising a support case should help you.

  • @phanikumar9310
    @phanikumar9310 15 days ago +1

    Create a few more examples and post the videos. Super.

    • @cloudquicklabs
      @cloudquicklabs  9 days ago

      Indeed, I shall create more in this data engineering context.

  • @harikaanasuya8976
    @harikaanasuya8976 1 year ago +2

    We can also load data into Redshift using the COPY command from S3. So what is the difference between the above method and using the COPY command?

    • @cloudquicklabs
      @cloudquicklabs  1 year ago +1

      Thank you for watching my videos.
      This is really a good question.
      The method shown in this video uses a fully AWS-managed service (Glue crawlers and jobs) to extract, load and transform, whereas the COPY command is a SQL method that connects to the Redshift cluster and uploads the data.
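
For comparison, a hedged sketch of the COPY path driven from Python via the Redshift Data API, so no direct SQL connection has to be managed; the cluster, secret, bucket and role ARN are hypothetical:

```python
import boto3

rsd = boto3.client("redshift-data")

# Issue the COPY against the cluster without managing a SQL connection ourselves.
rsd.execute_statement(
    ClusterIdentifier="redshift-cluster-1",  # hypothetical
    Database="dev",
    SecretArn="arn:aws:secretsmanager:us-east-1:123456789012:secret:redshift-creds",
    Sql="""
        COPY public.target_table
        FROM 's3://my-etl-source-bucket/data.csv'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        FORMAT CSV
        IGNOREHEADER 1;
    """,
)
```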

  • @rohitmukherjee4271
    @rohitmukherjee4271 1 year ago +1

    Hey, quick question:
    why is there no trigger at the target point? Is it because the destination would be a one-time activity?? Please revert.

    • @cloudquicklabs
      @cloudquicklabs  1 year ago

      Thank you for watching my videos.
      Yes. Here in this lab, I have shown one table at the Redshift cluster, which is the destination, and it would be a one-time activity. But source files could keep coming in, and they need to be updated into the Redshift cluster table.

  • @amanchoudhary9518
    @amanchoudhary9518 1 year ago +1

    Could you please explain the creation of the S3 endpoint in detail, and what policy is attached while creating the S3 endpoint?

    • @cloudquicklabs
      @cloudquicklabs  1 year ago

      Thank you for watching my videos.
      The S3 endpoint here enables a VPC endpoint for the S3 bucket; this is needed to securely connect to Redshift via the AWS VPC backbone network. You need VPC and S3 bucket access to do this.
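
A sketch of creating such a gateway endpoint for S3 with boto3, assuming hypothetical VPC and route table IDs (the default endpoint policy allows full access and can be narrowed):

```python
import boto3

ec2 = boto3.client("ec2")

# Gateway endpoint so traffic to S3 stays on the AWS backbone network.
ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",                 # hypothetical VPC
    ServiceName="com.amazonaws.us-east-1.s3",      # region-specific service name
    RouteTableIds=["rtb-0123456789abcdef0"],       # hypothetical route table
)
```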

  • @veerachegu
    @veerachegu 2 years ago +1

    We have invoked the crawler by Lambda, but it crawled the same table again. What is the purpose of pushing the same table again? We already have the table in Redshift from the source.

    • @cloudquicklabs
      @cloudquicklabs  2 years ago

      There are two aspects shown there.
      1. A Glue-job-based ETL pipeline (manually triggered once) which creates the table for the first time in Redshift.
      2. Next, we trigger the same Glue pipeline using an event rule (when a file is uploaded to the S3 bucket); this is the automation scenario.
      Hope this helps you understand it.
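
A sketch of the two trigger hops described in the reply, with hypothetical crawler and job names:

```python
import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    # Fired by the EventBridge rule on the S3 upload: re-run the source crawler.
    glue.start_crawler(Name="s3-source-crawler")  # hypothetical crawler name

def on_crawler_succeeded(event, context):
    # A second rule on the crawler's "Succeeded" state change can start the job.
    glue.start_job_run(JobName="s3-to-redshift-job")  # hypothetical job name
```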

  • @monishbhanushali3214
    @monishbhanushali3214 2 years ago +1

    @Cloud Quick Labs How do I configure CloudWatch and the Lambda function to trigger when the job has succeeded? I tried, but my Lambda is not getting invoked.

    • @cloudquicklabs
      @cloudquicklabs  2 years ago

      Thank you for watching my videos.
      Hope you have followed all the steps mentioned in the video to invoke Lambda.
      In that case, could you please check the permissions assigned to the Lambda role?

    • @monishbhanushali3214
      @monishbhanushali3214 2 years ago +1

      @@cloudquicklabs Hi, thanks for replying, really appreciated.
      I checked my Lambda role: the permission applied was administrative access, and in the trust policy I allowed Lambda, AWS Glue and S3 by adding each in the JSON.

    • @cloudquicklabs
      @cloudquicklabs  2 years ago

      Thanks for coming back on this.
      Hope this helped you.

  • @damienspectre4231
    @damienspectre4231 1 year ago +1

    Can you do the same with Firehose? It supports Redshift from S3 automatically using the COPY command...

    • @cloudquicklabs
      @cloudquicklabs  1 year ago

      Thank you for watching my videos.
      Indeed, I shall create a new video on it; thank you for giving new ideas.

  • @Parker_Perry
    @Parker_Perry 2 years ago +1

    At 9:11, how did you add both Lambda and Glue as trusted entities? I'm only allowed to add one, and then I'm unsure how to edit the JSON to add Glue.

    • @cloudquicklabs
      @cloudquicklabs  2 years ago

      Thank you for watching my videos.
      This can be added when you are defining the role; put the trust value as 'glue.amazonaws.com'.

    • @Parker_Perry
      @Parker_Perry 2 years ago +1

      @@cloudquicklabs It doesn't allow you to put multiple trust values. I couldn't add both Lambda and Glue. I had to edit the JSON to add Glue.
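
A sketch of that JSON edit, expressed as the trust policy such a role could be created with so that both services can assume it (the role name is hypothetical):

```python
import json
import boto3

# Trust policy listing both service principals, as done by editing the JSON.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": ["glue.amazonaws.com", "lambda.amazonaws.com"]},
            "Action": "sts:AssumeRole",
        }
    ],
}

iam = boto3.client("iam")
iam.create_role(
    RoleName="ETLGlueLambdaRole",  # hypothetical
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)
```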

    • @Parker_Perry
      @Parker_Perry 2 years ago +1

      Great video though!!

    • @cloudquicklabs
      @cloudquicklabs  2 years ago

      Exactly, this is the way we need to add it.

    • @cloudquicklabs
      @cloudquicklabs  2 years ago

      Thank you for watching my videos. Happy learning.

  • @Pavan-kn5pg
    @Pavan-kn5pg 1 year ago +3

    Good content, but the flow of your explanation is quite confusing.

    • @cloudquicklabs
      @cloudquicklabs  1 year ago +1

      Thank you for watching my video.
      Apologies that you find this video a bit confusing.
      Please watch it with patience; it should help you.

  • @mindset873
    @mindset873 2 years ago +1

    Hi, I am having a problem with the IAM role step at minute 37:54.
    "No roles found with service principal aws-glue" is the UI message I see after clicking the IAM roles. I really need your help to solve my problem.

    • @cloudquicklabs
      @cloudquicklabs  2 years ago +1

      Thank you for watching my videos.
      In my video I have suggested using one IAM role with an administrative policy attached to it and trust for the AWS Glue and Lambda services.
      Please have this configuration done on the IAM role and use the same role while creating the Glue crawlers, job and Lambda. It should work.

  • @user-qx4iy6nq9x
    @user-qx4iy6nq9x 8 months ago +1

    Running the crawler, watchdog and Lambda again creates duplicate rows, right?
    How will new data from source to target be kept updated automatically? Kindly explain please... Other than that, your full demo session is wonderful for beginners, bro, thanks.

    • @cloudquicklabs
      @cloudquicklabs  8 months ago

      Thank you for watching my videos.
      I am currently looking to create a v2 video of this lab; I shall solve this problem in that video.

  • @Alexmontoyaangel
    @Alexmontoyaangel 2 years ago +1

    Thanks for the tutorial.
    I would like your take on two doubts that I have:
    - What is the best way to debug the code that does the Glue ETL process?
    - Is it possible to create all this infrastructure using IaC? Do you have any example?

    • @cloudquicklabs
      @cloudquicklabs  2 years ago

      Thank you for watching my videos.
      Please find my answers below.
      1. To debug the AWS Glue configuration, we have to check the logs and follow the standard methods.
      2. Yes, the complete configuration done above can be maintained as Infrastructure as Code using Terraform; for examples, please check my AWS Infrastructure as Code using Terraform videos.

  • @asishb
    @asishb 10 months ago +1

    Hello!
    Wonderful video. You got a sub. Thanks for saving me hours of research.
    However, I have two questions/requests.
    1) When I use a crawler in the console, the Glue Catalog is able to create a table automatically. But when I use PySpark to write the code, I am required to provide a table. Why is that?
    2) Can you show how to transfer data from S3 (CSV file) to Redshift after some transformation using Glue, in the visual editor of the Glue console?

    • @cloudquicklabs
      @cloudquicklabs  10 months ago +1

      Thank you for watching my videos.
      1. It could be because PySpark does not have that capability; it is meant for transforming the data and provides highly scalable power.
      2. Indeed, I am going to create a new video soon on how to migrate data from S3 (CSV file) to a Redshift cluster.

    • @asishb
      @asishb 10 months ago

      @@cloudquicklabs Thanks for the info

  • @suryatejasingasani256
    @suryatejasingasani256 1 year ago +1

    Hi bro, I have a doubt: I have a DataStage job exported as an XML file, and I want to convert the XML file into a Glue job. How can I do that?

    • @cloudquicklabs
      @cloudquicklabs  1 year ago

      Thank you for watching my videos.
      Do you mean that you want to load the XML file data into the Redshift cluster DB?

  • @manjindersingh8677
    @manjindersingh8677 2 years ago +2

    Hello Sir,
    I hope you are doing well.
    I have some confusion while creating a new role.
    At step 1 of creating the role, it gives an option to choose a use case. What should we choose there in order to keep both (Glue and Lambda) as trusted entities?
    Your guidance will be appreciated.
    Thanks

    • @cloudquicklabs
      @cloudquicklabs  2 years ago

      Thank you for watching my videos.
      I am doing good; hope you are doing well too.
      You could add 'S3' and 'AWS Lambda' in the trusted entities so that the same role can be used on either side.

  • @himansusekharmishra5114
    @himansusekharmishra5114 2 years ago +1

    Hi, thank you for this beautiful video.
    Unfortunately, when we run the Glue job it puts duplicate data inside Redshift. Suppose we have new data in S3 and we run the crawler job; after the job succeeds it triggers the Glue job, the new data is updated inside Redshift, and the data in Redshift gets doubled as it duplicates the past data. Can you please help with this...??

    • @cloudquicklabs
      @cloudquicklabs  2 years ago +1

      Thank you for watching my videos and reaching out on this.
      The solution here is not intelligent enough to deduplicate the data, or rather to upload only the differential data into Redshift, so I would suggest cleaning up the file that was uploaded to Redshift and making sure only the required data is present on the S3 bucket side when a file is uploaded.

    • @himansusekharmishra5114
      @himansusekharmishra5114 2 years ago +1

      @@cloudquicklabs Thank you again. Shall we do it with the Glue job itself?

    • @cloudquicklabs
      @cloudquicklabs  2 years ago

      I believe we can't do it with the Glue job itself, but rather by making the S3 file contain unique data, and perhaps by cleaning the temp table created by the crawlers on the source side.

    • @himansusekharmishra5114
      @himansusekharmishra5114 2 years ago +1

      @@cloudquicklabs Hi, I got the solution. I think all we need is to enable bookmarks on the Glue job from the console, and it will not repeat the old data.

    • @cloudquicklabs
      @cloudquicklabs  2 years ago +1

      Great to hear back from you.
      Happy for you that you explored it.
      And thank you very much for sharing it here.
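
A hedged sketch of enabling that bookmark option programmatically; the job name, role and script location are hypothetical, JobUpdate replaces the whole job definition, and the script itself must use transformation_ctx on its reads and call job.commit() for bookmarks to take effect:

```python
import boto3

glue = boto3.client("glue")

# Turn on job bookmarks so re-runs process only new data.
glue.update_job(
    JobName="s3-to-redshift-job",     # hypothetical
    JobUpdate={
        "Role": "ETLGlueLambdaRole",  # hypothetical; JobUpdate is a full replacement
        "Command": {"Name": "glueetl", "ScriptLocation": "s3://my-scripts/etl.py"},
        "DefaultArguments": {"--job-bookmark-option": "job-bookmark-enable"},
    },
)
```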

  • @jhonamorgan9194
    @jhonamorgan9194 2 years ago +1

    Please try to add subtitles to your videos

    • @cloudquicklabs
      @cloudquicklabs  2 years ago +1

      Thank you for watching my videos.
      Yes, I am enabling it right away.

  • @sabaakram9310
    @sabaakram9310 2 years ago

    What are all the policies you have attached in ETLlambdaAccessRole?

    • @cloudquicklabs
      @cloudquicklabs  2 years ago +1

      Thank you for watching my videos.
      I have attached the 'AdministratorAccess' policy to the role. You need to fine-grain it (for example, give only download access on S3; CloudWatch log group creation and log put access; and AWS Lambda, EventBridge, AWS Glue and Redshift access) in production, to follow the least-privilege method.
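
A hedged sketch of what that fine-graining could look like as an inline role policy; the actions are real IAM actions, but the selection, scoping and names are hypothetical and need adjusting per account:

```python
import json
import boto3

# Hypothetical scoped-down inline policy covering the services listed above.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow",
         "Action": ["s3:GetObject", "s3:ListBucket"],
         "Resource": ["arn:aws:s3:::my-etl-source-bucket",
                      "arn:aws:s3:::my-etl-source-bucket/*"]},
        {"Effect": "Allow",
         "Action": ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"],
         "Resource": "*"},
        {"Effect": "Allow",
         "Action": ["glue:StartCrawler", "glue:StartJobRun", "glue:Get*"],
         "Resource": "*"},
        {"Effect": "Allow",
         "Action": ["redshift:GetClusterCredentials", "redshift-data:ExecuteStatement"],
         "Resource": "*"},
    ],
}

iam = boto3.client("iam")
iam.put_role_policy(
    RoleName="ETLlambdaAccessRole",    # role name from the question above
    PolicyName="etl-least-privilege",  # hypothetical
    PolicyDocument=json.dumps(policy),
)
```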

    • @sabaakram9310
      @sabaakram9310 2 years ago +1

      @@cloudquicklabs thank you 😊

    • @cloudquicklabs
      @cloudquicklabs  2 years ago

      You are welcome.

  • @udaykumar-tb5kn
    @udaykumar-tb5kn 1 year ago +1

    How to configure Redshift and the VPC? Your explanation is high level.

    • @cloudquicklabs
      @cloudquicklabs  1 year ago

      Thank you for watching my videos.
      Thank you for your encouraging words.

    • @udaykumar-tb5kn
      @udaykumar-tb5kn 1 year ago

      @@cloudquicklabs No comment; please answer my question. Provide your data.

  • @phanikumar9310
    @phanikumar9310 15 days ago +1

    Is there any trial version of the AWS Glue software?

    • @cloudquicklabs
      @cloudquicklabs  9 days ago

      Thank you for watching my videos.
      Glue is PaaS; there is no trial version here, but the feature options can be explored.

  • @santaellal
    @santaellal 2 years ago +1

    Thanks for the tutorial; please upload with Spanish subtitles :( !

    • @cloudquicklabs
      @cloudquicklabs  2 years ago +1

      Thank you for watching my videos.
      Yes, I shall try to update it.

    • @santaellal
      @santaellal 2 years ago +1

      @@cloudquicklabs Thanks. Can I load .txt with AWS Glue, or only CSV?

    • @cloudquicklabs
      @cloudquicklabs  2 years ago

      It supports CSV (as it has columns and rows). I don't think it supports .txt.

  • @kailashchand7596
    @kailashchand7596 1 year ago +2

    Confusing in many places.

    • @cloudquicklabs
      @cloudquicklabs  1 year ago

      Thank you for watching my videos.
      Apologies that it confused you. Please watch it again; maybe just pause and rewind. Feel free to ask doubts as well.

  • @udaykumar-tb5kn
    @udaykumar-tb5kn 1 year ago +1

    Can you share the files and script used here?

    • @cloudquicklabs
      @cloudquicklabs  1 year ago

      Thank you for watching my videos.
      You can find the files in the video description; the repo is github.com/RekhuGopal/PythonHacks/tree/main/AWSBoto3Hacks/AWS-ETL-S3-Glue-RedShift

  • @100derman
    @100derman 2 years ago +1

    34 minutes to import a CSV into a table... good grief...

    • @cloudquicklabs
      @cloudquicklabs  2 years ago

      Thank you for watching my videos.
      Yes, it took that much time as it has a step-by-step explanation.

  • @sauravverma6179
    @sauravverma6179 1 year ago +1

    Dude, Administrator access!! SERIOUSLY!

    • @cloudquicklabs
      @cloudquicklabs  1 year ago +1

      Thank you for watching my videos.
      Administrator access is blanket access, but you could fine-grain your IAM role with a least-privilege approach and use it for the solution (I left this part as self-service).

    • @sauravverma6179
      @sauravverma6179 1 year ago +2

      @@cloudquicklabs Yes.
      People who are new to DevOps will just use this administrator role.
      You could have provided an IAM policy for people to use.

    • @cloudquicklabs
      @cloudquicklabs  1 year ago +1

      Thank you for the valuable inputs.
      I shall consider these points in my next video.

  • @eylesit9268
    @eylesit9268 9 months ago +1

    I tried to watch this video, but there were adverts every 2 or 3 minutes. For me this is unacceptable and unwatchable.

    • @cloudquicklabs
      @cloudquicklabs  9 months ago

      Thank you for watching my videos.
      Apologies that you do not like the frequent adverts that come up while playing it. But note that this is how YouTube works.

    • @eylesit9268
      @eylesit9268 9 months ago +1

      @@cloudquicklabs No other videos I watch have this many adverts. The overwhelming majority have one advert at the start and no others. Forcing viewers to watch adverts every few minutes disrupts the learning process.

    • @AshisRaj
      @AshisRaj 6 months ago +1

      Mr. Author, you could be more humble. Your tone should be modulated.

    • @cloudquicklabs
      @cloudquicklabs  6 months ago

      Thank you for watching my videos.
      I shall consider this input. Thank you again for your inputs.

    • @cloudquicklabs
      @cloudquicklabs  6 months ago

      I understand the pain point here.
      Apologies for the inconvenience.

  • @jubrilbabatunde3040
    @jubrilbabatunde3040 2 years ago +1

    Poor video quality.

    • @cloudquicklabs
      @cloudquicklabs  2 years ago

      Thank you for the valuable input; I am working to improve the quality soon.
      Till then, please do rely on this and keep learning.

  • @saurabhpuranik5474
    @saurabhpuranik5474 2 years ago +2

    Most confusing trainer I have ever seen. This guy just jumps here and there. Follow proper steps if you want to create proper content.

    • @cloudquicklabs
      @cloudquicklabs  2 years ago

      Thank you for giving valuable inputs.
      I shall take them on board to fine-tune myself.
      This lab scenario is unfortunately a bit complex, and the content was meant to reach a wider audience: starters, proficient users and experts.

  • @suneelkumarparvatala7527
    @suneelkumarparvatala7527 Před 25 dny +1

    Hi Thank you for the informational videos, would you clarify this doubt I have, here gluecrawler created and ran only once I believe instead of creating glue crawler can it be possible to migrate data directly from s3source crawler to redshift table .
    My intention of asking this query is since we are running gluecrawler only once or only at a time when we would like to see records in redshift table since the functionally of accessing records in redshift is possible by quering in the query editor

    • @cloudquicklabs
      @cloudquicklabs  Před 25 dny

      Thank you for watching my videos.
      We migrating data from source s3 to Amazon redshift which is destination here. Crawlers can be scheduled or invoked on demand.