Databricks for Apache Spark - Importing and Setting up dataset

  • Published 18 Jan 2020
  • #databricks #apachespark #sparkml
    This video demonstrates how to get access to the Community Edition, upload your file, and start analyzing it.
    I will be following this up with a detailed walk-through of Spark ML code. In case you have not seen my Intro to Spark ML, you can view it below:
    • Machine Learning using...
  • Science & Technology

Comments • 42

  • @AIEngineeringLife
    @AIEngineeringLife  4 years ago +4

    Dataset used in this video is available in my git repo here - github.com/srivatsan88/CZcamsLI/tree/master/dataset

  • @user-ji9og4oh8f
    @user-ji9og4oh8f 5 months ago

    awesome video

  • @bensycamore
    @bensycamore 4 years ago

    Do you know if IntelliSense can be turned on in the notebook, i.e. dropdowns for functions, columns in a table, etc.?

  • @close_to_life7954
    @close_to_life7954 3 years ago +3

    For newcomers:
    The file_location = "/FileStore/tables/..."
    Here, the "/FileStore/tables" prefix is mandatory,
    and after that you have to write the file name with its .csv extension.
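
    A minimal sketch of that load, assuming a hypothetical uploaded file named my_data.csv (spark and display are predefined in Databricks notebooks):

    # Read a CSV uploaded via the "Add Data" UI into a Spark DataFrame
    file_location = "/FileStore/tables/my_data.csv"   # hypothetical file name
    df = (spark.read
          .format("csv")
          .option("header", "true")        # first row holds column names
          .option("inferSchema", "true")   # let Spark guess column types
          .load(file_location))
    display(df)   # Databricks helper that renders the DataFrame as a table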

  • @shubhamtalks9718
    @shubhamtalks9718 3 years ago

    Are there any advantages to using Spark in the Databricks Community Edition versus using it in Google Colab, or will both give the same performance? If there are advantages, what are they?

  • @amirabouaouina883
    @amirabouaouina883 2 years ago

    thanks a lot, keep going

  • @pradipawasthi2883
    @pradipawasthi2883 2 years ago

    I need serious help from you. I have sequence data, i.e. a single row of data, and I want to split it into multiple rows after every 5th delimiter ('|'). How can I do that?
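
    One possible approach (not an answer from the thread): a minimal PySpark sketch, assuming Spark 2.4+ and a hypothetical column named seq:

    # Split one '|'-delimited string into rows of five fields each
    from pyspark.sql import functions as F

    df = spark.createDataFrame([("a|b|c|d|e|f|g|h|i|j",)], ["seq"])   # hypothetical data

    rows = (df
            .withColumn("tokens", F.split("seq", r"\|"))
            # one row index per 5-field chunk: 0 .. size/5 - 1
            .withColumn("i", F.explode(F.expr("sequence(0, size(tokens) div 5 - 1)")))
            # slice is 1-based: chunk i covers positions i*5 + 1 .. i*5 + 5
            .select(F.expr("slice(tokens, i * 5 + 1, 5)").alias("fields")))

    rows.show(truncate=False)   # [a, b, c, d, e] then [f, g, h, i, j]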

  • @raniataha9876
    @raniataha9876 3 years ago +1

    Thanks, good job.
    I need help solving a problem, for a fee. Can you help me?

  • @nsudeesh601
    @nsudeesh601 3 years ago

    Does Auto Loader perform notebook orchestration?

  • @tanushreenagar3116
    @tanushreenagar3116 3 years ago +1

    Nice, thank you

  • @ssoupa334
    @ssoupa334 4 years ago +1

    thanks a lot

  • @rashmimalhotra123
    @rashmimalhotra123 3 years ago +1

    Hi, thanks for this video.
    Where can we check the logs or details when a notebook is stuck and the status only says running, and we finally have to cancel it? What could be the reason?

    • @AIEngineeringLife
      @AIEngineeringLife  3 years ago

      You can check the Spark UI, which you can find at the top of the Databricks page. If a job is running for long, you can check its stages and see if there is any bottleneck that can be fixed.

  • @maheshteja7407
    @maheshteja7407 4 years ago +1

    I want to load data directly from my local system into the code. I don't want to load the data to DBFS and then use it. Can you help me?

    • @AIEngineeringLife
      @AIEngineeringLife  4 years ago +1

      Mahesh, then you have to install PySpark locally; Databricks needs the file in its own environment. If you do not want to upload, you can also pull the file for that session during notebook execution from GitHub or any URL. I have a similar sample in my Colab demo:
      czcams.com/video/_kFNxF2MM_M/video.html
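
      A minimal sketch of that session-scoped pull (the URL and file name are hypothetical; reading a driver-local file like this works on a single-node Community Edition cluster):

      import urllib.request

      # Download the file to the driver's local disk for this session only
      url = "https://raw.githubusercontent.com/srivatsan88/CZcamsLI/master/dataset/my_data.csv"   # hypothetical
      urllib.request.urlretrieve(url, "/tmp/my_data.csv")

      # "file:" tells Spark to read from the driver's local filesystem
      df = spark.read.csv("file:/tmp/my_data.csv", header=True, inferSchema=True)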

  • @ankushojha5089
    @ankushojha5089 4 years ago +1

    Please help me with how to restart a 'Terminated' cluster in Databricks. In this video you already have a terminated cluster; can we restart the same one?

    • @AIEngineeringLife
      @AIEngineeringLife  4 years ago +1

      Kush, just start a new cluster with the same name or a different name. In the Community Edition I don't think you can restart it.

    • @0615801523
      @0615801523 3 years ago +1

      As far as I know you can't restart it; you should create a new cluster :)

  • @sumalisamanta295
    @sumalisamanta295 3 years ago +1

    What is the difference between Azure Databricks and the standalone Databricks analytics platform?

    • @AIEngineeringLife
      @AIEngineeringLife  3 years ago +1

      The functionality is the same on the Databricks end. It is only that connectivity to Azure storage services is easy on Databricks hosted on Azure, while connectivity to AWS storage like S3 is easy on the one hosted on Amazon.

  • @vaishnavigurav8890
    @vaishnavigurav8890 2 years ago

    I thought Apache Spark was some quantum physics 🙄 until this video popped up in my recommendations 😅... Apache Spark, here I come 💃

  • @priteshpatel9051
    @priteshpatel9051 4 years ago +1

    I want to import community data using wget directly from an HTTP endpoint. I tried using mkdir and wget, but I don't see any folder set up in my data. Can you please suggest?
    %mkdir ../data
    !wget -O ../data/aclImdb_v1.tar.gz ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
    !tar -zxf ../data/aclImdb_v1.tar.gz -C ../data

    • @AIEngineeringLife
      @AIEngineeringLife  4 years ago

      See if the below works, but wget should also work. You can write to the /tmp directory and then move the file to DBFS:
      %sh curl -O 'ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'
      %fs ls "file:/databricks/driver"
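
      A minimal sketch of that download-then-move flow (dbutils is predefined in Databricks notebooks; the DBFS target path is illustrative):

      import urllib.request

      # Download to the driver's local disk first
      urllib.request.urlretrieve(
          "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz",
          "/tmp/aclImdb_v1.tar.gz")

      # Move the file into DBFS so it persists and is visible to Spark
      dbutils.fs.mv("file:/tmp/aclImdb_v1.tar.gz", "dbfs:/tmp/aclImdb_v1.tar.gz")
      display(dbutils.fs.ls("dbfs:/tmp"))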

  • @santhoshreddy1612
    @santhoshreddy1612 4 years ago +3

    In Add Data, what file did you upload?

    • @AIEngineeringLife
      @AIEngineeringLife  4 years ago +1

      I just used the coronavirus dataset that I prepared from Johns Hopkins University data.
      It is in my GitHub:
      github.com/srivatsan88/CZcamsLI/tree/master/dataset/coronavirus

  • @daniela.3851
    @daniela.3851 4 years ago

    I'll probably check it out for myself before you answer this, but:
    does the Spark connector only have support for Scala?

    • @AIEngineeringLife
      @AIEngineeringLife  4 years ago +1

      I did not get it. If you mean language support, Spark does support Python. Spark is written in Scala but has Python bindings.

    • @daniela.3851
      @daniela.3851 4 years ago

      @AIEngineeringLife I work with PySpark; I just wanted to know whether this option existed inside Databricks. Thank you!!

    • @praveenprakash143
      @praveenprakash143 3 years ago

      @daniela.3851 It's supported.

  • @krishnakishorepeddisetti4387

    I want to pivot data, using Python, that is present in a table inside Databricks. How do I do that? Can you help me?
    By the way, great tutorial.

    • @AIEngineeringLife
      @AIEngineeringLife  3 years ago

      Krishna, if you are asking about pivot in Python, then the pandas DataFrame has a pivot function. A Spark DataFrame has pivot as well, and you can check it here - databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-apache-spark.html
      I think I have covered it as part of my Spark transformation videos, but I do not recollect exactly.
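
      A minimal sketch of both pivots, with hypothetical data and column names:

      from pyspark.sql import functions as F

      sales = spark.createDataFrame(
          [("2020", "US", 100), ("2020", "IN", 75), ("2021", "US", 120)],
          ["year", "country", "amount"])

      # Spark DataFrame pivot: one row per year, one column per country
      sales.groupBy("year").pivot("country").agg(F.sum("amount")).show()

      # pandas equivalent, for tables small enough to collect to the driver
      pdf = sales.toPandas()
      print(pdf.pivot(index="year", columns="country", values="amount"))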

  • @aarthi1111
    @aarthi1111 3 years ago

    For the Community Edition on the free trial, how many clusters can I create?

  • @shubhamtalks9718
    @shubhamtalks9718 3 years ago

    Are there other Databricks videos on this channel as well?

    • @AIEngineeringLife
      @AIEngineeringLife  3 years ago +2

      Yes, I have quite a good number in this playlist - czcams.com/play/PL3N9eeOlCrP5PfpYrP6YxMNtt5Hw27ZlO.html

    • @shubhamtalks9718
      @shubhamtalks9718 3 years ago

      @AIEngineeringLife I was following your Apache Spark for Data Scientists playlist. I guess it is not sorted sequentially.

    • @AIEngineeringLife
      @AIEngineeringLife  3 years ago +1

      @shubhamtalks9718 The data scientist playlist only focuses on ML, but the one I shared covers both data engineering and machine learning.

  • @shubhamtalks9718
    @shubhamtalks9718 3 years ago

    What is the Parquet format in Databricks?

    • @AIEngineeringLife
      @AIEngineeringLife  3 years ago +1

      Parquet is a file format for storing data. It is a columnar format, and many processing engines are optimized for it (see the sketch below).

    • @MsVimarsha
      @MsVimarsha 3 years ago

      Parquet files are in columnar format, a best fit for OLAP workloads! The format is efficient for parallel processing.
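
      A minimal sketch of writing and reading Parquet (paths and data are hypothetical):

      # Write a small DataFrame out in Parquet format
      df = spark.range(1000).withColumnRenamed("id", "user_id")
      df.write.mode("overwrite").parquet("/tmp/parquet_demo")

      # A columnar read only touches the columns the query selects
      back = spark.read.parquet("/tmp/parquet_demo")
      back.select("user_id").show(5)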
