Implementing Pyspark Real Time Application || End-to-End Project || Part-2

  • Added on 7 Sep 2024
  • This video covers data processing (cleaning) and validation.
    Cleaning the years_of_exp column using regexp_extract:
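    from pyspark.sql.functions import col, regexp_extract
    pattern = r'\d+'  # one or more digits
    idx = 0  # group 0 is the whole match
    df_presc_sel = df_presc_sel.withColumn('years_of_exp', regexp_extract(col('years_of_exp'), pattern, idx))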
    Part 1:
    • Implementing Pyspark R...
    Link to the file:
    drive.google.c...
    #azuredatabricks
    #dataanalysis
    #dataengineering
    #pyspark
    #pythonprogramming
    #python
    #sql
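
To see what the cleaning step in the description keeps, here is a minimal sketch in plain Python rather than Spark. The helper name `extract_years` and the sample values are hypothetical and only for illustration. It mimics `regexp_extract` with group index 0: the first run of digits is returned, and an empty string is returned when nothing matches.

```python
import re

PATTERN = r"\d+"  # same pattern as in the description: one or more digits


def extract_years(raw):
    """Plain-Python stand-in for regexp_extract(col, PATTERN, 0).

    Returns the first run of digits, or "" when there is none
    (Spark's regexp_extract also yields an empty string on no match).
    None stays None, mirroring how a null input stays null.
    """
    if raw is None:
        return None
    match = re.search(PATTERN, raw)
    return match.group(0) if match else ""


# Made-up sample values, not taken from the video's dataset.
samples = ["5 years", "= 12.0", "none", None]
cleaned = [extract_years(s) for s in samples]
print(cleaned)  # ['5', '12', '', None]
```

Note that the result is still a string; if the column should be numeric, a cast would be needed afterwards.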

Comments • 11

  • @kaushikvarma2571 (6 months ago, +6)

    To fix the header error, replace the CSV code with this:
    elif file_format == 'csv':
        df = spark.read.format(file_format).option("header", True).option("inferSchema", True).load(file_dir)

  • @sachinmittal5308 (1 month ago)

    Hello Sir, the Google Drive link to download the full code is not working.

  • @Amarjeet-fb3lk (1 year ago, +2)

    Why is the null count zero for all columns when you dropped nulls from only two columns?

    • @DataSpark45 (1 year ago)

      Hi Amarjeet, I did that on purpose there; we will come back to it in the next part. Thanks for watching.

  • @skateforlife3679 (10 months ago)

    It is not good that for every transformation we need to execute all the code again and again. So what is the best practice? Work in a notebook cell by cell, and then develop the production code in .py files once everything has been tested in the notebook?

  • @nikhilgr7539 (11 months ago)

    I am still getting the same header error even after reformatting.

    • @Vidush05 (10 months ago, +1)

      Hi Nikhil, use the line below and the issue will be resolved:
      df = spark.read.format("csv").option("header", header).option("inferSchema", inferSchema).load(file_dir)

    • @balaa2670 (9 months ago, +2)

      In the ingest.py file, replace (header=header) and (inferschema=inferschema) with ("header", header) and ("inferschema", inferschema).
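
The fixes suggested in this thread can be put together as the sketch below. It assumes the error came from passing keyword arguments to `option()`, which expects the option name and its value as two positional arguments. The function name `load_file` and its parameters are hypothetical, not the video's actual ingest.py code, and `spark` is assumed to be an already created SparkSession.

```python
def load_file(spark, file_dir, file_format, header=True, infer_schema=True):
    """Read a file into a DataFrame (hypothetical helper for illustration).

    `spark` is an existing SparkSession supplied by the caller.
    """
    reader = spark.read.format(file_format)
    if file_format == "csv":
        # option() takes the option name and its value as two positional
        # arguments; option(header=header) raises a TypeError instead.
        reader = reader.option("header", header).option("inferSchema", infer_schema)
    # Other formats (for example parquet) are loaded without CSV options.
    return reader.load(file_dir)
```

A call would look like `df = load_file(spark, file_dir, "csv")`. If keyword arguments are preferred, `options(header=header, inferSchema=infer_schema)` (plural) accepts them.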

  • @yogeshpalegar9269 (10 months ago)

    Hi sir, how can I contact you about the course? You have not mentioned any contact details.

    • @DataSpark45 (4 months ago)

      Hi Yogesh, you can reach out to me on LinkedIn: Lokeswar Reddy Valluru.