DataSpark
  • 30
  • 52 849
Rename Columns in PySpark || Quick Tips for Renaming Columns in PySpark DataFrames || #pyspark
In this video we discuss how to dynamically rename DataFrame columns that contain spaces using PySpark, in order to create Delta tables. Column names with spaces are not allowed when creating Delta tables.
Source link:
drive.google.com/file/d/1VfTu9TAE_wkyVa35iw0f95nXNB-Mppfi/view?usp=sharing
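A minimal sketch of the idea (the file path and table name are illustrative, not from the video):

    # Replace spaces in every column name so the DataFrame
    # can be saved as a Delta table.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.option("header", True).csv("/path/to/source.csv")  # illustrative path

    for c in df.columns:
        df = df.withColumnRenamed(c, c.strip().replace(" ", "_"))

    df.write.format("delta").mode("overwrite").saveAsTable("bronze.my_table")  # illustrative name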
#PySpark #BigData #DataScience #Python #DataEngineering #ApacheSpark #MachineLearning #DataAnalytics #Coding #Programming #TechTutorial #DataTransformation #ETL #DataProcessing #CodeWithMe
views: 199

Video

Handle or Fill Null Values Using PySpark Dynamically | Real Time Scenario | #pyspark #dataengineers
249 views, a month ago
In this video, we dive into the essential techniques for handling and filling null values dynamically using PySpark. You will learn: how to identify and handle null values in PySpark DataFrames; techniques to dynamically fill null values based on various conditions; and practical examples with step-by-step code demonstrations. Notebook link: drive.google.com/file/d/1oHJTDblzt2fi...
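In outline, one way to fill nulls dynamically is to derive a per-column default from each column's data type (a sketch; the notebook's exact defaults may differ):

    # Build a {column: default} map from the schema, then fillna once.
    from pyspark.sql.types import NumericType, StringType

    defaults = {}
    for field in df.schema.fields:
        if isinstance(field.dataType, NumericType):
            defaults[field.name] = 0          # assumed numeric default
        elif isinstance(field.dataType, StringType):
            defaults[field.name] = "NA"       # assumed string default

    filled_df = df.fillna(defaults)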
ADF UNTIL ACTIVITY || REAL TIME SCENARIO || VERIFY COUNT OF RECORDS || #azuredatafactory
124 views, a month ago
In this video we will understand the advantages of using the Until activity in ADF.

    CREATE TABLE dev.TableMetadata (
        TableName NVARCHAR(128),
        LastRowCount INT
    );

    -- Initialize the metadata for your table if not already done
    IF NOT EXISTS (SELECT 1 FROM dev.TableMetadata WHERE TableName = 'YourTableName')
    BEGIN
        INSERT INTO dev.TableMetadata (TableName, LastRowCount)
        VALUES ('YourTableName', 0);
    END

    CRE...
Lakehouse Arch || DWH v/s DATALAKE v/s DELTALAKE || #dataengineering #databricks
207 views, 2 months ago
In this video we discuss the differences between a DWH vs a DATALAKE vs a DELTALAKE. In the next part we will look at the practical drawbacks of a DATALAKE. Link for notes: drive.google.com/file/d/10gbSmYnNUThWYHWCIZR9vWR1pJiVm14x/view?usp=sharing #dataanalytics #azuredataengineer #databricks #pyspark #datawarehouse #datalake #sql
ADF COPYDATACTIVITY || Copy Behavior || Quote & Escape Characters || Hands On || #dataengineering
177 views, 2 months ago
In this video we discuss the quote character, escape character and copy behavior in the ADF Copy Data activity, considering a CSV file as the source. Note: if you leave the copy behavior empty, by default it takes "Preserve Hierarchy". Notes link: drive.google.com/file/d/1yVsU1HsdShe2On21LBKyurR4JfPOWDO9/view?usp=sharing #dataengineering #azuredataengineer #azuredatabricks #pyspark #database #databricks
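The video demonstrates the ADF UI settings; for comparison, the analogous options when reading such a CSV with PySpark look roughly like this (the option values and path are illustrative):

    df = (spark.read.format("csv")
          .option("header", True)
          .option("quote", '"')      # quote character, as in the ADF dataset settings
          .option("escape", "\\")    # escape character
          .load("/path/to/source.csv"))  # illustrative path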
Data Validation with Pyspark || Rename columns Dynamically ||Real Time Scenario
259 views, 3 months ago
In this video we explain how to dynamically rename selected columns from a source file using PySpark. To execute this dynamically we use metadata files. Important links: Meta columns: drive.google.com/file/d/1EWxcWNpG52rznjK2MnRo9jUGQpfpnxyl/view?usp=sharing MetaFiles: drive.google.com/file/d/1szbTXZuDxYk2Hk6kk_VttEoetZBdQj4E/view?usp=sharing Source Files 4 wheelers...
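A rough sketch of the metadata-driven rename (the old_name/new_name column names are assumptions, not the exact schema of the meta file):

    # Read the rename rules from a metadata file, then apply them
    # only to columns that actually exist in the source.
    meta = spark.read.option("header", True).csv("/path/to/meta_columns.csv")  # illustrative path
    mapping = {r["old_name"]: r["new_name"] for r in meta.collect()}           # assumed meta schema

    src = spark.read.option("header", True).csv("/path/to/source.csv")
    for old, new in mapping.items():
        if old in src.columns:
            src = src.withColumnRenamed(old, new)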
Pyspark with YAML File || Part-3 || Real Time Scenario || #pyspark #python #interviewquestions
205 views, 4 months ago
In this video we will see how to read SQL Server tables from a YAML file and create a PySpark df on top of that. part2 link: czcams.com/video/aQlazXrjgrU/video.html part1 link: czcams.com/video/ujoF2Wd_2T0/video.htmlsi=kV48HUg88exWVJY2 Playlist link: czcams.com/play/PLWhMEKuFLBt8Kt-Y2DOeTzFtAOxzQdwTe.html #pyspark #dataengineering #pythonprogramming #sql #spark #databricks #dataanalytics
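In outline (a sketch with assumed YAML keys, not the video's exact config file):

    # Read SQL Server connection details from YAML, then load via JDBC.
    import yaml

    with open("config.yaml") as f:
        cfg = yaml.safe_load(f)

    url = "jdbc:sqlserver://{};databaseName={}".format(cfg["server"], cfg["database"])
    df = (spark.read.format("jdbc")
          .option("url", url)
          .option("dbtable", cfg["table"])
          .option("user", cfg["user"])
          .option("password", cfg["password"])
          .load())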
Pyspark with YAML File || Part-2 || Real Time Scenario || #pyspark #python #interviewquestions
224 views, 5 months ago
In this video we will look at the issue/error from part-1 and see how to read CSV sources from a YAML file. part1 link: czcams.com/video/ujoF2Wd_2T0/video.htmlsi=kV48HUg88exWVJY2 Playlist link: czcams.com/play/PLWhMEKuFLBt8Kt-Y2DOeTzFtAOxzQdwTe.html #pyspark #dataengineering #pythonprogramming #sql #spark #databricks #dataanalytics
Pyspark with YAML file || Part-1 || Pyspark Real Time Scenario || #pyspark
561 views, 5 months ago
In this video, we treat a YAML file as a config file and read the sources mentioned in the YAML file to load the data. Playlist link: czcams.com/play/PLWhMEKuFLBt8Kt-Y2DOeTzFtAOxzQdwTe.html #pyspark #databricks #dataanalytics #spark #interviewquestions #pythonprogramming #dataengineering #databricks #yaml
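The general pattern looks something like this (a sketch; the YAML layout shown is an assumption):

    # config.yaml (assumed layout):
    # sources:
    #   - format: csv
    #     path: /data/sales.csv
    import yaml

    with open("config.yaml") as f:
        cfg = yaml.safe_load(f)

    frames = {}
    for src in cfg["sources"]:
        frames[src["path"]] = (spark.read.format(src["format"])
                               .option("header", True)
                               .load(src["path"]))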
Data Insertion in DimDate || Using SQL || Real Time Scenario
199 views, 5 months ago
In this video we discuss basic date functions and how to use them to insert data into DimDate in the data warehouse model recursively. Once the 90 days are up, re-run the script. Code link: drive.google.com/file/d/1gyIQMOtVjHNTqwzMdT0jMj5yJSQWOv_c/view?usp=sharing #pyspark #sqlserver #dataengineering #dataanalytics #datawarehouse #sql
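The video works in T-SQL; the same 90-day DimDate idea in PySpark would look roughly like this (a hedged equivalent sketch, not the video's script):

    # Generate one row per day for a 90-day window, plus derived attributes.
    from pyspark.sql import functions as F

    dim = (spark.sql("SELECT explode(sequence(to_date('2024-01-01'), "
                     "to_date('2024-03-30'), interval 1 day)) AS full_date")
           .withColumn("year", F.year("full_date"))
           .withColumn("month", F.month("full_date"))
           .withColumn("day", F.dayofmonth("full_date"))
           .withColumn("day_name", F.date_format("full_date", "EEEE")))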
Data Validation using pyspark || Handle Unexpected Records || Real Time Scenario ||
314 views, 5 months ago
This video shows how we can handle unexpected records using PySpark on top of the source DataFrame. Playlist link: czcams.com/play/PLWhMEKuFLBt8Kt-Y2DOeTzFtAOxzQdwTe.html&si=tg-Du5LOsXUe8-Ju Code: drive.google.com/file/d/1Z r3KePT0uI_WpvKJSKN0GGK8vWZdq5/view?usp=sharing #pyspark #databricks #dataanalytics #data #dataengineering
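In essence (a sketch; the validity rule here is an assumption, the video's condition may differ):

    from pyspark.sql import functions as F

    # Split the source into expected vs. unexpected rows.
    valid = F.col("id").isNotNull() & F.col("amount").cast("double").isNotNull()  # assumed rule

    expected_df = source_df.filter(valid)
    unexpected_df = source_df.filter(~valid)   # route these to a reject/quarantine path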
Data Validations using Pyspark || Filtering Duplicate Records || Real Time Scenarios
356 views, 6 months ago
This video shows how we can filter out or handle duplicate records using PySpark in a dynamic way. #azuredatabricks #dataengineering #dataanalysis #pyspark #pythonprogramming #python #sql Playlist link: czcams.com/play/PLWhMEKuFLBt8Kt-Y2DOeTzFtAOxzQdwTe.html&si=oRPexgefXxT0R8Y7
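The core of a dynamic approach (a sketch; the key and ordering columns are assumptions that would come from metadata in practice):

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    key_cols = ["id"]                                                      # assumed business key
    w = Window.partitionBy(*key_cols).orderBy(F.col("load_date").desc())   # assumed ordering column

    ranked = df.withColumn("rn", F.row_number().over(w))
    clean_df = ranked.filter("rn = 1").drop("rn")   # keep the latest row per key
    dupes_df = ranked.filter("rn > 1").drop("rn")   # duplicates, kept for review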
Data Validation Using Pyspark || ColumnPositionComparision ||
437 views, 6 months ago
How we can develop a function or script using PySpark to compare column positions while loading data into the raw layer from the stage or source layer. #pyspark #databricks #dataanalytics #spark #interviewquestions #pythonprogramming #dataengineering LinkedIn: www.linkedin.com/in/lokeswar-reddy-valluru-b57b63188/
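The gist of the check (a sketch; in practice the expected order would come from the reference file):

    expected = ["id", "name", "amount"]        # assumed reference order
    actual = [c.lower() for c in df.columns]

    mismatches = [(i, a, e) for i, (a, e) in enumerate(zip(actual, expected)) if a != e]
    for pos, a, e in mismatches:
        print(f"position {pos}: found '{a}', expected '{e}'")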
Data Validation with Pyspark || Schema Comparison || Dynamically || Real Time Scenario
1.3K views, 7 months ago
In this video we cover how to perform quick data validations such as schema comparison between source and target. In the next video we will look into date/timestamp format checks and a duplicate count check. Column comparison link: czcams.com/video/U9QqTh9ynAM/video.html #dataanalytics #dataengineeringessentials #azuredatabricks #dataanalysis #pyspark #pythonprogramming #sql #databricks #PySpa...
Data Validation with Pyspark || Real Time Scenario
4.2K views, 7 months ago
In this video we discuss how to perform data validation with PySpark dynamically. Data sources link: drive.google.com/drive/folders/10aEhm5xcazOHgOGzRouc8cDZw4X8KC0o?usp=sharing #pyspark #databricks #dataanalytics #data #dataengineering
Implementing Pyspark Real Time Application || End-to-End Project || Part-5 || HiveTable ||MYSQL
3.8K views, a year ago
Introduction to Spark [Part-1] || Spark Architecture || How does it work internally!!
758 views, a year ago
Implementing Pyspark Real Time Application || End-to-End Project || Part-4
2.3K views, a year ago
Implementing Pyspark Real Time Application || End-to-End Project || Part-3||
2K views, a year ago
Implementing Pyspark Real Time Application || End-to-End Project || Part-2
3.5K views, a year ago
Implementing Pyspark Real Time Application || End-to-End Project || Part-1
23K views, a year ago
Implementing SCD-Type2 in ADF||Part2-Updated
818 views, a year ago
Implementing SCD-Type2 in Azure Data Factory Dynamically ||Part-1
1.3K views, a year ago
Implementing FBS Project with Azure Part-3 ||Azure Data Engineer End-To End Project
506 views, a year ago
Implementing FBS Project with Azure Part-2 ||Azure Data Engineer End-To End Project
702 views, a year ago
Implementing FBS Project with Azure Part-1 ||Azure Data Engineer End-To End Project
3.5K views, a year ago
Excel Multiple Sheets to Azure SQL Dynamically || Using Azure Data Factory || Data Factory Pipelines
473 views, a year ago
Full Load Data Pipeline Using Azure Data Factory Part 1 || Azure Data Factory || Data Engineering
381 views, a year ago
Incremental Data Loading Part - 2 || For Multiple Tables Using Azure Data Factory
447 views, a year ago
Incremental Data Loading Part 1 || For Single Table Using Azure Data Factory
909 views, a year ago

Comments

  • @sainadhvenkata, 7 days ago

    @dataspark Could you please provide those data links again? They have expired.

  • @tejathunder, 18 days ago

    Sir, please upload the continuation of this project.

  • @samar8136, a month ago

    It is now possible to save a Delta table with column names containing spaces: see "Rename and drop columns with Delta Lake column mapping".

    • @DataSpark45, a month ago

      We renamed the columns (removed the spaces) and then created the tables.

  • @shaasif, a month ago

    Thank you so much for your real-time project explanation across 5 parts; it's really awesome. Can you please upload the remaining videos on the multiple-files and file-name concepts?

    • @DataSpark45, a month ago

      Hi, that concept is actually covered in the Data Validation playlist, by creating metadata files. Thanks.

    • @shaasif, a month ago

      @DataSpark45 Can you share your email ID? I want to get in touch with you.

  • @amandoshi5803, a month ago

    Source code?

    • @DataSpark45, a month ago

      from pyspark.sql.functions import col

      def SchemaComparision(controldf, spsession, refdf):
          try:
              # iterate controldf and get the filename and filepath
              for x in controldf.collect():
                  filename = x['filename']
                  filepath = x['filepath']
                  # define the dataframe from the filepath
                  print("Data frame is creating for {} or {}".format(filepath, filename))
                  dfs = spsession.read.format('csv').option('header', True) \
                      .option('inferSchema', True).load(filepath)
                  print("DF Created for {} or {}".format(filepath, filename))
                  ref_filter = refdf.filter(col('SrcFileName') == filename)
                  for ref in ref_filter.collect():
                      columnNames = ref['SrcColumns']
                      refTypes = ref['SrcColumnType']
                      columnNamesList = [c.strip().lower() for c in columnNames.split(",")]
                      refTypesList = [t.strip().lower() for t in refTypes.split(",")]
                      # StringType() : string, IntegerType() : int
                      dfsTypesList = [dfs.schema[c.strip()].dataType.simpleString().lower()
                                      for c in columnNames.split(",")]
                      # e.g. columnName : Row id, DataFrameType : int, refType : int
                      missmatchedcolumns = [(col_name, df_type, ref_type)
                                            for (col_name, df_type, ref_type)
                                            in zip(columnNamesList, dfsTypesList, refTypesList)
                                            if df_type != ref_type]
                      if missmatchedcolumns:
                          print("Schema comparison failed / mismatched for {}".format(filename))
                          for col_name, df_type, ref_type in missmatchedcolumns:
                              print(f"columnName : {col_name}, DataFrameType : {df_type}, referenceType : {ref_type}")
                      else:
                          print("Schema comparison is done and successful for {}".format(filename))
          except Exception as e:
              print("An error occurred : ", str(e))
              return False

  • @maheswariramadasu1301, a month ago

    Highly underrated channel; we need more videos.

  • @ArabindaMohapatra, a month ago

    I just started watching this playlist. I'm hoping to learn how to deal with schema-related issues in real time. Thanks.

  • @gregt7725, 2 months ago

    That is great, but how do we handle deletions from the source? I do not understand why, after successful changes/inserts, a deletion from the source (e.g. row number 2) creates duplicate rows of previously changed records. (last_updated_date causes it, but why?)

    • @DataSpark45, 2 months ago

      Hi, can you please share the details or a screenshot of where you have the doubt?

  • @erwinfrerick3891, 2 months ago

    Great explanation, very clear; this video was very helpful for me.

  • @ChetanSharma-oy4ge, 2 months ago

    How can I find this code? Is there any repo where you have uploaded it?

    • @DataSpark45, 2 months ago

      Sorry to say this, bro; unfortunately we lost those files.

  • @waseemMohammad-qx7ix, 2 months ago

    Thank you for making this project; it has helped me a lot.

  • @maheswariramadasu1301, 2 months ago

    This video really helped me, because tomorrow I have to explain these topics and I was searching CZcams for the best explanation. This video helped me learn it from scratch.

  • @mohitupadhayay1439, 2 months ago

    Amazing content. Keep a playlist of real-time industry scenarios.

  • @mohitupadhayay1439, 2 months ago

    Very underrated channel!

  • @ajaykiranchundi9979, 2 months ago

    Very helpful! Thank you.

  • @shahnawazahmed7474, 2 months ago

    I'm looking for ADF training; will you provide that? How can I contact you? Thanks.

    • @DataSpark45, 2 months ago

      Hi, you can contact me through LinkedIn: Lokeswar Reddy Valluru.

  • @MuzicForSoul, 2 months ago

    Sir, can you please also show us a failing run? You are only showing the passing case. When I tested by swapping the columns in the DataFrame, it still did not fail, because the set still has them in the same order.

    • @DataSpark45, 2 months ago

      The set values come from the reference df, so it is always a constant.

  • @pranaykumar581, 2 months ago

    Can you provide me the source data file?

    • @DataSpark45, 2 months ago

      Hi, I provided the link in the description, bro.

  • @MuzicForSoul, 2 months ago

    Why do we have to do ColumnPositionComparision? Shouldn't the column name comparison you did earlier catch this?

  • @irfanzain8086, 2 months ago

    Bro, thanks a lot! Great explanation 👍 Can you share part 2?

  • @vamshimerugu6184, 3 months ago

    Sir, can you make a video on how to connect ADLS to Databricks using a service principal?

    • @DataSpark45, 3 months ago

      Thanks for asking; I will do that one for sure.

  • @rohilarohi, 3 months ago

    This video helped me a lot. I hope we can expect more real-time scenarios like this.

  • @SuprajaGSLV, 3 months ago

    This really helped me understand the topic better. Great content!

    • @DataSpark45, 3 months ago

      Glad to hear it!

    • @SuprajaGSLV, 3 months ago

      Could you please upload a video on the differences between Data Lake vs Data Warehouse vs Delta Tables?

    • @DataSpark45, 2 months ago

      Thanks a million. I will do that for sure.

  • @Lucky-eo8cl, 3 months ago

    Good explanation, bro 👏🏻. It's really helpful.

  • @vamshimerugu6184, 3 months ago

    I think schema comparison is an important topic in PySpark. Great explanation, sir ❤

  • @vamshimerugu6184, 3 months ago

    Great explanation ❤. Keep uploading more content on PySpark.

  • @saibhargavreddy5992, 3 months ago

    I found this very useful, as I had a similar issue with data validations. It helped a lot while completing my project.

  • @maheswariramadasu1301, 3 months ago

    This video helped me understand multiple event triggers in ADF.

  • @maheswariramadasu1301, 3 months ago

    It helps me a lot in learning PySpark easily.

  • @0adarsh101, 3 months ago

    Can I use Databricks Community Edition?

    • @DataSpark45, 3 months ago

      Hi, you can use Databricks; you just have to play around with the dbutils.fs methods to get the file list / file paths, as we did in the get_env.py file. Thank you.

  • @VaanisToonWorld-rp5xy, 3 months ago

    Please share the files for the FBS project.

    • @DataSpark45, 3 months ago

      Unfortunately, we lost those files and the account.

  • @SaadAhmed-js5ew, 4 months ago

    Where is your parquet file located?

    • @DataSpark45, 3 months ago

      Hi, are you talking about the source parquet file? It's under the source folder.

  • @OmkarGurme, 4 months ago

    While working with Databricks we don't need to start a Spark session, right?

    • @DataSpark45, 4 months ago

      No need, brother; we can continue without defining a Spark session. I just kept it in for practice.

  • @listentoyourheart45, 4 months ago

    Nice explanation, sir.

  • @kaushikvarma2571, 4 months ago

    Is this a continuation of part-2? In part-2 we never discussed test.py and udfs.py.

    • @DataSpark45, 4 months ago

      Yes; test.py here is used just to run the functions, and for udfs.py please watch from the 15:00 mark onwards.

  • @kaushikvarma2571, 4 months ago

    To solve the header error, replace the CSV branch with this:

        elif file_format == 'csv':
            df = spark.read.format(file_format).option("header", True).option("inferSchema", True).load(file_dir)

  • @charangowdamn8661, 4 months ago

    Hi sir, how can I reach you?

  • @charangowdamn8661, 4 months ago

    Hi sir, how can I reach you? Can you please share your email ID, or tell me how I can connect with you?

    • @DataSpark45, 4 months ago

      You can reach me on LinkedIn: Valluru Lokeswar Reddy.

  • @aiviet5497, 5 months ago

    I can't download the dataset 😭.

    • @DataSpark45, 5 months ago

      Take a look at this: drive.google.com/drive/folders/1XMthOh9IVAScA8Lk-wfbBnKCEtmZ6UKF?usp=sharing

  • @sauravkumar9454, 5 months ago

    Sir, you are the best; I love how you have taught and mentioned even the smallest of things. I'll be looking forward to more videos like this.

  • @World_Exploror, 5 months ago

    How did you define reference_df and control_df?

    • @DataSpark45, 5 months ago

      We would define them as tables in a database. For now, I used them as CSVs.

  • @mrunalshahare4841, 5 months ago

    Can you share part 2?


  • @vishavsi, 5 months ago

    I am getting an error with logging: Python\Python39\lib\configparser.py, line 1254, in __getitem__, raise KeyError(key) -> KeyError: 'keys'. Can you share the code written in the video?

    • @DataSpark45, 5 months ago

      Sure, here is the link: drive.google.com/drive/folders/1QD8635pBSzDtxI-ykTx8yquop2i4Xghn?usp=sharing

    • @vishavsi, 5 months ago

      Thanks, @DataSpark45

    • @subhankarmodumudi9033, 5 months ago

      Did your problem get resolved, @vishavsi?

  • @jitrana6813, 5 months ago

    How can we use spark.sql instead of the PySpark DataFrame select commands? Can you advise how to do that?

    • @DataSpark45, 5 months ago

      Hi, when you write a df to Hive we generally use df.write.saveAsTable(), so the table gets created in the Hive environment; then we can use spark.sql("select * from table"). If you don't want to use Hive, we can use df.registerTempTable("TableName") instead.
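
      For instance (a small sketch of both routes; the table and view names are illustrative):

          # Write the df as a Hive table, then query it with spark.sql
          df.write.mode("overwrite").saveAsTable("sales_tbl")
          spark.sql("SELECT * FROM sales_tbl").show()

          # Without Hive, register a temporary view instead
          # (createOrReplaceTempView is the current name for registerTempTable)
          df.createOrReplaceTempView("sales_view")
          spark.sql("SELECT * FROM sales_view").show()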

  • @ritesh_ojha, 5 months ago

    <Error>
      <Code>AuthenticationFailed</Code>
      <Message>Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature. RequestId:ea8e17b4-701e-004d-1db1-573f6a000000 Time:2024-02-04T21:31:20.0816196Z</Message>
      <AuthenticationErrorDetail>Signature not valid in the specified time frame: Start [Tue, 22 Nov 2022 07:36:34 GMT] - Expiry [Wed, 22 Nov 2023 15:36:34 GMT] - Current [Sun, 04 Feb 2024 21:31:20 GMT]</AuthenticationErrorDetail>
    </Error>

    • @DataSpark45, 5 months ago

      Where did you get this error, bro?

    • @ritesh_ojha, 5 months ago

      @DataSpark45 While downloading the data. But I got the data from part 2.

  • @user-fz1rj6gz2g, 5 months ago

    Thank you for the amazing project, sir. Can you please provide the GitHub link for this project or the project files?

  • @user-fz1rj6gz2g, 5 months ago

    Thanks for the amazing content; please upload more videos like this.

  • @ranjithrampally7982, 5 months ago

    Do you provide training?

    • @DataSpark45, 5 months ago

      As of now I'm not providing training, bro. But you can reach out to me at any time for any sort of doubts. Thank you.

  • @vinothkannaramsingh8224, 6 months ago

    Sort both the ref/df column names in alphabetical order and compare the column names? Would that be sufficient?

    • @DataSpark45, 6 months ago

      Certainly not; whatever order is specified in reference_df is the correct order we expect. If we sort the df's column names in alphabetical order, there would be chances of failure. Thank you.