Solve Data Science Tasks In Python (Saving Hours) (automating CSV file imports to a database)

  • added Sep 4, 2024

Comments • 106

  • @stratascratch
    @stratascratch  3 years ago +10

    Part 2 where I fully automate the notebook is here czcams.com/video/TDwy1lSjEZo/video.html

  • @sahej97
    @sahej97 3 years ago +6

    Love this series of real-world, on-the-job tasks

  • @prateek2159
    @prateek2159 3 years ago +9

    Hey Nate, your videos are just too good. I love how your channel is so dedicated to real-world data science. By the way, I noticed that you started a video series, "For your Data Science Project", and I really want you to continue making videos for that series, because there's literally no one on CZcams with this kind of guidance on DS projects and I have been looking for it for a very long time. I have my placements in just 12 months and I really want to build a full-stack data science project. Thank you.

  • @lettalkaboutit
    @lettalkaboutit a year ago +1

    Your video is fantastic and it helped me, but here is where I struggle, and I don't know if you can help with the following: my entire CSV file needs to go into the results table, but the dataframe created from the CSV file has columns that contain foreign keys. I mean the CSV file is a mix of values from different tables; each column of my dataframe represents a table. My use case is:
    1/ First step: loop through each column and compare the values in the column with the values from the corresponding table in the database; if a value does not exist, create a SQL INSERT query to add the new value. After getting all the new values into the database, I do step 2.
    2/ Second step: get the values from each corresponding table in the database, replace the data in each column with the table ID (foreign key), and then send the whole dataframe to the database.
    Can you help me achieve this, please?
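
    A minimal sketch of that two-step pattern, with hypothetical table and column names, and psycopg2 plus pandas assumed (this is not the video's code):

    import pandas as pd
    import psycopg2

    conn = psycopg2.connect('<your connection string>')  # assumed connection
    cursor = conn.cursor()
    df = pd.read_csv('results.csv')  # hypothetical CSV

    # column -> (lookup table, value column, id column); hypothetical mapping
    lookup = {'city': ('cities', 'city_name', 'city_id')}

    # Step 1: insert any values not yet present in the lookup tables
    for col, (table, value_col, id_col) in lookup.items():
        cursor.execute(f'SELECT {value_col} FROM {table}')
        existing = {row[0] for row in cursor.fetchall()}
        for value in set(df[col]) - existing:
            cursor.execute(f'INSERT INTO {table} ({value_col}) VALUES (%s)', (value,))
    conn.commit()

    # Step 2: swap each value for its foreign key, then load the dataframe
    for col, (table, value_col, id_col) in lookup.items():
        cursor.execute(f'SELECT {value_col}, {id_col} FROM {table}')
        df[col] = df[col].map(dict(cursor.fetchall()))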

  • @hameedmulani21
    @hameedmulani21 a year ago +2

    One of the best tutorials I have ever seen

  • @brentsimpson3791
    @brentsimpson3791 a year ago

    Mate! I'm a newbie Python 'programmer' and this video has to be one of the most useful practical examples I've found. It'll take me a while to absorb it all, but THANK YOU!😊
    Brent

  • @mickyelmb
    @mickyelmb 2 years ago +2

    Excellent tutorial. I was able to follow along in Jupyter with no issues. Thank you for posting it.

  • @austincelestine1925
    @austincelestine1925 2 years ago +1

    Holy cow, I wish I had come across this earlier. This helped a lot with importing these CSV files into MySQL. The 2nd video was great too, except my formatting was different for my job. Great work!!! Absolutely love your content, it really helps!

    • @stratascratch
      @stratascratch  2 years ago +1

      Thank you and glad you found this vid useful!

  • @alanamantinodasilva4362
    @alanamantinodasilva4362 2 years ago +2

    Awesome. Thank you for this content.

  • @stratascratch
    @stratascratch  3 years ago +3

    Timestamps:
    Intro: (0:00)
    Outlining steps & import python libraries: (2:23)
    Reading your CSV file as a pandas dataframe: (4:31)
    Cleaning your table name: (6:03)
    Cleaning your column names: (10:25)
    Generating SQL query to create database table: (12:40)
    Connecting to your database: (19:04)
    Generating SQL queries to import CSV file to database: (21:53)
    Double checking file was imported correctly: (30:50)
    Conclusion: (31:47)

  • @luzestrada9536
    @luzestrada9536 a year ago +1

    You are the best! 💯 Thank you so much for sharing this valuable information 👏

  • @rj_nelson_usmc
    @rj_nelson_usmc a year ago +1

    Thanks again for the tutorial:
    I ran this code:
    # Upload to database
    SQL_STATEMENT = """
    COPY fast_food_data FROM STDIN WITH
    CSV
    HEADER
    DELIMITER AS ','
    """
    cursor.copy_expert(sql=SQL_STATEMENT, file=my_file)
    print('File copied to database')
    I received this error:
    InFailedSqlTransaction                 Traceback (most recent call last)
    Input In [34], in <module>
          1 # Upload to database
          3 SQL_STATEMENT = """
          4 COPY fast_food_data FROM STDIN WITH
          5 CSV
          6 HEADER
          7 DELIMITER AS ','
          8 """
    ---> 10 cursor.copy_expert(sql=SQL_STATEMENT, file=my_file)
         11 print('File copied to database')
    InFailedSqlTransaction: current transaction is aborted, commands ignored until end of transaction block
    Currently stuck. Bummer. I'll keep working it.
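
    For later readers: InFailedSqlTransaction means an earlier statement in the same session had already failed, so Postgres rejects everything until the transaction is rolled back. A minimal recovery sketch, assuming the conn, cursor, my_file, and SQL_STATEMENT objects from the video:

    conn.rollback()   # clear the aborted transaction (and fix whatever statement failed earlier)
    my_file.seek(0)   # rewind the file handle in case it was partially read
    cursor.copy_expert(sql=SQL_STATEMENT, file=my_file)
    conn.commit()
    print('File copied to database')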

  • @StefanoVerugi
    @StefanoVerugi a year ago +2

    As a beginner I find this video useful and well explained, thanks for posting it.
    To remove non-alphanumeric characters, can't you use re.sub from the re module?
    Like: print(re.sub("[^a-zA-Z.0-9]+", "", string))
    You get more or less the same result; is anything wrong with this method?
    cheers

    • @stratascratch
      @stratascratch  a year ago +2

      Thanks for the kind words. Yes, you can use re.sub just fine!

    • @cubano100pct
      @cubano100pct a year ago

      @@stratascratch There is still the issue of collapsing whitespace into a single _ character, and then removing every other character that isn't a-z, 0-9, or _:
      tablename = re.sub(r'[\x00-\x1F\s\-]+', '_', filename.lower())
      tablename = re.sub(r'[^0-9a-z_]*', '', tablename)

  • @radhakrishnanaik5656
    @radhakrishnanaik5656 2 years ago

    Hey Nate, your way of explaining is really very good. I would like to request a video on fetching a database table into a CSV file with Python code, in the form of a pipeline.

  • @exili
    @exili 3 years ago +1

    Extremely helpful video and a very useful tool; glad I found this on Reddit.

    • @stratascratch
      @stratascratch  3 years ago

      I'm glad you found it useful. You should also take a look at part 2 where I talk about automating the entire process (czcams.com/video/TDwy1lSjEZo/video.html). Part 3 will come out soon and will cover using software development techniques to polish off the code and share with your team.

  • @akshayjadiya
    @akshayjadiya 3 years ago +2

    Great video Nate! Can you make a video playlist on how to get started with cloud services, or provide relevant reading material for that here in the comments?

    • @stratascratch
      @stratascratch  3 years ago +4

      Absolutely! I'd start simple and try to create a database on AWS first. If you're not going to be creating apps, there's no reason to use the other services like EC2. Here's a nice overview on how to create one (medium.com/edureka/rds-aws-tutorial-for-aws-solution-architects-eec7217774dd). In reality it probably won't be so simple, but a lot of learning this stuff is trial and error. My videos always seem simple, but there are actually hours and hours of troubleshooting that never make it into the video. Maybe I should add a few minutes of "outtakes" showing all the errors I get in my next video =)

  • @zynkers401
    @zynkers401 2 years ago

    This series is the best! Thanks Nate

  • @RedShipsofSpainAgain
    @RedShipsofSpainAgain 3 years ago +1

    Great video, Nate. A few suggestions:
    6:58 It is clearer to use a variable name like "file_name" rather than the more ambiguous name "file".
    8:31 To improve readability, consider putting each .replace() call on a separate line followed by the line-continuation character. And perhaps consider consolidating all of the replace statements into a single call using, say, a regex (both sketched just below).
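
    A quick illustration of both suggestions (the raw value is hypothetical; wrapping the chain in parentheses even avoids the backslash continuation entirely):

    import re

    raw = 'Customer Contract$2021'  # hypothetical table name

    # one .replace() per line, chained inside parentheses for readability
    clean = (raw.lower()
                .replace(' ', '_')
                .replace('-', '_')
                .replace('$', ''))

    # or consolidated into regex passes: whitespace/dashes to '_', drop the rest
    clean = re.sub(r'[\s\-]+', '_', raw.lower())
    clean = re.sub(r'[^0-9a-z_]', '', clean)   # -> 'customer_contract2021'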

    • @stratascratch
      @stratascratch  3 years ago +1

      Thanks so much for your input. I'll keep those in mind in my next python video. Keep the feedback coming. Also, if you have any requests on topics, I'd be happy to try to cover them on this channel.

    • @RedShipsofSpainAgain
      @RedShipsofSpainAgain 3 years ago +1

      @@stratascratch Happy to help and offer constructive feedback! Admittedly my advice is opinionated, but I've found that adopting details like these really helps to prevent headaches in debugging and in future development of your code. Sort of like a set of "best practices". Love your videos, btw. Keep up the great work Nate.

    • @stratascratch
      @stratascratch  3 years ago +1

      @@RedShipsofSpainAgain Thanks so much for the kind words. I'm hoping to get into more python code but it takes sooo long to record compared to SQL stuff. Hope to have another python series in Jan or Feb. Your "best practices" are great advice and I'll keep these in mind as I build my solutions out.

  • @gambu4810
    @gambu4810 2 years ago +3

    Hi Nate. Your work is so good, I can't believe my eyes. Can you please let me know how to make a connection to SQL Server (the one managed through SQL Server Management Studio) instead? It would really help to know how to do this with SQL Server instead of Postgres.

    • @ihhdina
      @ihhdina 6 months ago +1

      I also need help with importing CSV to SQL.

  • @smsyrandomacts1725
    @smsyrandomacts1725 3 years ago

    Simply amazing knowledge-transfer series. Great job Nate 👍

  • @Davidkiania
    @Davidkiania 2 years ago

    This is really good, thanks. You've saved me lots of manual hours.

  • @TheSwil325
    @TheSwil325 3 years ago

    I really appreciate the quality of your videos! 💯

  • @gregNFL
    @gregNFL 2 years ago

    10/10 Clear, concise.

  • @arpitakar3384
    @arpitakar3384 a year ago +1

    I loved it, man. That's a serious effort, bro.

  • @cristiansoto7581
    @cristiansoto7581 a year ago

    Hi Nate, thank you for this valuable information. Could you make more videos about this?
    I'm subscribed to StrataScratch and it's amazing.

  • @komaldiwe5667
    @komaldiwe5667 2 years ago +1

    Hi Nate, how can we do cursor.copy_expert(sql=SQL_statement, file=my_file) for MySQL?

  • @sergeitokarev6033
    @sergeitokarev6033 2 years ago

    If someone has problems connecting to the database, just edit your VPC Security Group and allow all IPs. It helped in my case.

    • @stratascratch
      @stratascratch  2 years ago

      Wonderful suggestion!

    • @tacy91
      @tacy91 2 years ago

      Yea. Please can you help? How do I edit it? I've been trying since yesterday. Thanks

  • @junaidmalik9593
    @junaidmalik9593 2 years ago

    Fantastic video, thanks Nate, you're awesome.
    I just struggled to understand the col_str part at 17:50 of the video, maybe because I am new to this!

  • @yassw1995
    @yassw1995 2 years ago +1

    Hi Nate, this video is amazing and concise, thank you so much. However, I'm trying to learn this using as few advanced Python libraries as possible for a class, so I can't use pandas or the like; I'm limiting myself to the mysql.connector and csv libraries. For this reason, MySQLCursor does not accept the "cursor.copy_expert" syntax. What would be a good alternative method to insert the contents of the CSV file we saved during that step into the SQL database?
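
    One way under those constraints is the csv module plus cursor.executemany(); a sketch with hypothetical credentials, table name, and column count (mysql.connector assumed):

    import csv
    import mysql.connector

    conn = mysql.connector.connect(host='localhost', user='root',
                                   password='...', database='mydb')  # assumed credentials
    cursor = conn.cursor()

    with open('my_file.csv', newline='') as f:  # the CSV saved earlier
        reader = csv.reader(f)
        next(reader)              # skip the header row
        rows = list(reader)

    # one %s placeholder per column; adjust to match your table
    cursor.executemany('INSERT INTO my_table VALUES (%s, %s, %s)', rows)
    conn.commit()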

  • @adithya5604
    @adithya5604 a year ago +1

    Thank you so much 🎉

  • @chyang0107
    @chyang0107 a year ago

    Hello Nate, I appreciate your valuable work! Could you give your insights on these two questions? First, what is 'STDIN' within the SQL statement """COPY Customer_contract FROM STDIN WITH ...""" around (28:51)?
    Second, if I would like to insert values into a table from an Excel file, would my SQL statement be
    """
    COPY Customer_contract FROM STDIN WITH
    EXCEL
    HEADER
    """
    Am I on the right track?
    Thanks,
    Hank
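
    For what it's worth: STDIN in COPY means the rows are streamed from the client (here, the Python process) rather than read from a file on the database server, and COPY only understands text, CSV, and binary formats, not Excel. A common workaround is converting the sheet first; a sketch with a hypothetical workbook name (pandas with openpyxl assumed, conn/cursor being the psycopg2 objects from the video):

    import io
    import pandas as pd

    df = pd.read_excel('customer_contract.xlsx')  # hypothetical workbook
    buffer = io.StringIO()
    df.to_csv(buffer, index=False)   # re-serialize the sheet as CSV text
    buffer.seek(0)

    cursor.copy_expert(sql='COPY customer_contract FROM STDIN WITH CSV HEADER',
                       file=buffer)
    conn.commit()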

  • @javierjdaza
    @javierjdaza 3 years ago

    How the fuck does this video only have 1000+ views? You deserve more visibility, man; you're a fucking genius and great at explaining things. Keep going, you've got a new sub.

    • @stratascratch
      @stratascratch  3 years ago

      I guess not a lot of people like to see me code =) But, honestly, thanks for the kind words. I'll keep trying to code and provide thoughtful explanations of my solutions and approaches. This Python series took a really long time, and I had probably 20 pages of notes that I covered across the 3 videos.

  • @sd0n0158
    @sd0n0158 2 years ago +1

    Do you know if there is a workaround for Redshift without using an S3 bucket? It doesn't allow COPY FROM STDIN.

    • @stratascratch
      @stratascratch  2 years ago

      No, I don't know of such a workaround. Have you tried asking on AWS re:Post or Stack Overflow?

  • @prabirkumarsahoo6368
    @prabirkumarsahoo6368 a year ago

    Could you tell me where you copied the 'AWS connection string' from?

  • @CawlCawl1
    @CawlCawl1 3 years ago +2

    Amazing video, thank you! I am using MS SQL Server Management Studio, thus I have to use pyodbc, and I get the error "'pyodbc.Cursor' object has no attribute 'copy_expert'" on the line "cursor.copy_expert(SQL_STATEMENT, file=my_file)". Is there any alternative code I can use, since I'm using pyodbc?

    • @stratascratch
      @stratascratch  3 years ago +1

      If you're going to use pyodbc, you'll probably need to INSERT your values row by row rather than using something like copy_expert(), which takes the entire file and INSERTs it all at once. Here's the documentation on INSERTs for pyodbc (docs.microsoft.com/en-us/sql/connect/python/pyodbc/step-3-proof-of-concept-connecting-to-sql-using-pyodbc?view=sql-server-ver15). Every database is different and requires a different Python library. Hope this helps.
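
      A minimal sketch of that row-by-row route (driver string, table name, and placeholder count are hypothetical; df stands in for the dataframe from the read_csv step):

      import pyodbc

      conn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};'
                            'SERVER=localhost;DATABASE=mydb;Trusted_Connection=yes;')
      cursor = conn.cursor()

      for row in df.itertuples(index=False, name=None):
          cursor.execute('INSERT INTO my_table VALUES (?, ?, ?)', row)  # one ? per column
      conn.commit()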

  • @sureshkumarramachandran6993

    Great job!

  • @AndyLeal-pz7xw
    @AndyLeal-pz7xw a year ago

    This is incredibly helpful. I have a question: I'm using paramiko to connect to an SFTP server to read CSV files from there; how would I go about setting the cwd to the SFTP server? I can only navigate to local directories currently.

    • @stratascratch
      @stratascratch  a year ago

      Thanks for the kind words! Is there a way to navigate to the directory on the SFTP server? It might be a permissions issue that you can have your IT dept resolve. If they are able to allow you to access the SFTP server via your script, then you're all set. Otherwise, you'll need to manually move the files.

    • @AndyLeal-pz7xw
      @AndyLeal-pz7xw a year ago

      @@stratascratch Thanks for the quick feedback! I checked permissions and everything seemed fine. I ended up switching to pysftp and it worked on the first try! Keep up the great work!
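
      For anyone else stuck there: paramiko itself can change into and list remote directories once the SFTP session is open (host, credentials, and paths below are hypothetical):

      import paramiko

      ssh = paramiko.SSHClient()
      ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
      ssh.connect('sftp.example.com', username='user', password='...')  # assumed credentials

      sftp = ssh.open_sftp()
      sftp.chdir('/incoming/csv')        # the remote 'cwd', not a local directory
      print(sftp.listdir())              # remote file listing
      sftp.get('data.csv', 'data.csv')   # download into the local working directory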

  • @Jay-eh5rw
    @Jay-eh5rw 2 years ago

    Thank you very much. Great help:)

  • @BelowIce
    @BelowIce 2 years ago

    Love this video. Can you share why you copy the dataframe to a CSV and then import the CSV into the database? Is it possible to directly reference the values in the dataframe and use those to populate the INSERT statements?

    • @stratascratch
      @stratascratch  2 years ago +1

      You can definitely just upload the df to the database directly. I used a function that required a CSV to import into a db, but other approaches like the one you mentioned would likely work as well.
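
      The direct dataframe route the comment asks about is pandas' to_sql() with a SQLAlchemy engine (connection string and table name hypothetical):

      import pandas as pd
      from sqlalchemy import create_engine

      engine = create_engine('postgresql://user:password@host:5432/dbname')  # assumed credentials
      df.to_sql('customer_contracts', engine, if_exists='replace', index=False)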

  • @darshpancholi4919
    @darshpancholi4919 2 years ago

    I use a MacBook M1 and I am unable to import the psycopg2 library in my Jupyter notebook. I have tried almost all the solutions provided on YouTube and Stack Overflow. Can someone help me out?
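
    If the import fails because psycopg2 can't find a libpq build on the M1, one common fix is the prebuilt wheel, installed from inside the notebook so it lands in the active environment:

    %pip install psycopg2-binary

    then restart the kernel and try import psycopg2 again.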

  • @sasongkosan8832
    @sasongkosan8832 2 years ago

    Very good :) Nice tutorial...

  • @sai6837
    @sai6837 2 years ago +1

    Hi, I am using MySQL... what keyword should I use instead of copy_expert?

    • @stratascratch
      @stratascratch  a year ago

      I don't use MySQL that often in these situations, but shouldn't execute() or executemany() work?

    • @sai6837
      @sai6837 a year ago +1

      @@stratascratch No worries, I tried it and it worked.

  • @thelatecomer4245
    @thelatecomer4245 2 years ago

    Hey, can you give a solution for exporting a table from SQL Server to a local file with a multi-character delimiter?

  • @CodingNinja-lt8ge
    @CodingNinja-lt8ge a year ago

    Hi Nate,
    Thanks for the wonderful explanation. I tried to create the SQL table with your data and code using a Jupyter notebook. The code runs fine, but it failed to create the table in the Postgres database, and I am wondering what the reason could be. In pgAdmin I found this under the SQL tab. I would appreciate your hints to help me figure out why I am not getting the desired result although the code works:
    Database: aw
    -- DROP DATABASE IF EXISTS aw;
    CREATE DATABASE aw
    WITH
    OWNER = postgres
    ENCODING = 'UTF8'
    LC_COLLATE = 'English_Finland.1252'
    LC_CTYPE = 'English_Finland.1252'
    TABLESPACE = pg_default
    CONNECTION LIMIT = -1
    IS_TEMPLATE = False;

    • @stratascratch
      @stratascratch  a year ago

      Maybe you need to commit the transaction to the db? It's a Postgres thing.
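
      Concretely: psycopg2 opens a transaction implicitly and nothing is persisted until it is committed, so the CREATE TABLE only shows up in pgAdmin after a commit (conn is the connection from the video; create_table_sql is a hypothetical stand-in for whatever variable holds the generated statement):

      cursor.execute(create_table_sql)   # the generated CREATE TABLE statement
      conn.commit()                      # without this, pgAdmin never sees the table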

  • @jerrywang1550
    @jerrywang1550 2 years ago +1

    The connecting-to-AWS part is difficult.

    • @stratascratch
      @stratascratch  2 years ago

      I hear you, could be tricky.

    • @tacy91
      @tacy91 2 years ago

      Yeah. I've been trying for 2 days. Is there any solution for that? The error: could not connect to server: Connection timed out.

    • @tacy91
      @tacy91 2 years ago

      @Andrew Levinton Yeah I did....I don't know what else to do. Thanks for your reply🙏

  • @martithon
    @martithon 2 years ago

    [SOLVED] At 16:10 I'm getting a "', ' KeyError" and have no idea how to solve it. Any tips?
    edit: I had misplaced the closing bracket of the join function.
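
    For anyone hitting the same KeyError: the closing parenthesis of join() has to come after the entire generator expression, e.g. (df and replacements as in the video's code):

    col_str = ', '.join('{} {}'.format(n, d)
                        for n, d in zip(df.columns, df.dtypes.replace(replacements)))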

  • @its_me7363
    @its_me7363 3 years ago

    Thanks for these amazing, helpful videos, something different from the other YouTubers. Can we use VS Code to connect to and open an AWS database and write queries? I don't want to download yet another piece of software for the database.

    • @stratascratch
      @stratascratch  3 years ago +1

      I think you can use VS Code to connect to a db, but I've never used that platform before, so I don't know how. But seeing as VS Code is for developers, it's likely you can.

    • @its_me7363
      @its_me7363 3 years ago

      @@stratascratch I am trying to use a VS Code extension named SQLTools, but it is not able to connect to the AWS RDS PostgreSQL database I created as you showed.

    • @stratascratch
      @stratascratch  3 years ago +1

      @@its_me7363 You might need to find another platform that allows you to connect to an AWS RDS. Another reason could be that the database you created isn't configured right; there could be a firewall in place or closed ports. Try to see if you can connect to the RDS some other way.

  • @rj_nelson_usmc
    @rj_nelson_usmc a year ago +2

    Good video. I'm presently looking for something that will process a massive dataset with thousands of rows, then upload that data into a database. I have yet to run across a tutorial that processes and uploads a massive dataset into a database. I'm running into some interesting errors.
    1st Code Attempt: df = pd.read_csv('fast-food-data-2020.csv')
    1st Error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 20: invalid continuation byte
    2nd Code Attempt: df = pd.read_csv('fast-food-data-2020.csv', encoding='UTF-8')
    2nd Error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 52676: invalid continuation byte
    3rd Code Attempt: df = pd.read_csv('fast-food-data-2020.csv', encoding='latin-1')
    3rd Error: DtypeWarning: Columns (12,13,14,15,16,17,18,19,20,21,22,24,25,26,27,28,29,30,37,49,52) have mixed types. Specify dtype option on import or set low_memory=False.
    df = pd.read_csv('fast-food-data-2020.csv', encoding='latin-1')
    4th Code Attempt: df = pd.read_csv('fast-food-data-2020.csv', error_bad_lines=False, index_col=False, dtype='unicode')
    4th Error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 20: invalid continuation byte
    5th Code Attempt: df = pd.read_csv('fast-food-data-2020.csv', low_memory=False)
    5th Error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 20: invalid continuation byte
    6th Code Attempt: df = pd.read_csv('fast-food-data-2020.csv', engine='python')
    6th Code Error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 3524: invalid continuation byte
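
    For the record, the third attempt actually worked: a DtypeWarning is just a warning about mixed column types, not a failure. The 0xe9 byte is 'é' in Latin-1/Windows-1252, which is why every utf-8 read failed. Combining the working encoding with an explicit dtype keeps it quiet (same file name as above):

    df = pd.read_csv('fast-food-data-2020.csv', encoding='latin-1',
                     dtype=str, low_memory=False)

    encoding='cp1252' is the other common candidate for files exported from Windows tools.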

  • @MichelUNITED
    @MichelUNITED 3 years ago +1

    Hello, I ran this code:
    col_str = ", ".join("{} {}".format(n, d) for (n, d) in zip(dta.columns, dta.dtypes.replace(replacements)))
    col_str
    The error I received was that it cannot compare types ndarray(dtype=object) and str. Please help.

    • @stratascratch
      @stratascratch  3 years ago

      Check out the notebook here: github.com/Strata-Scratch/csv_to_db_automation Hopefully that helps you troubleshoot.

    • @stratascratch
      @stratascratch  3 years ago +1

      But the error sounds like you have an object dtype, which is expected, but you're trying to compare it to a string. It leads me to believe that your "col_str" might be wrong? Or your zip() code? Hard to say without seeing your code in its entirety.

    • @MichelUNITED
      @MichelUNITED 3 years ago

      @@stratascratch Hello, thank you for the reply. I did exactly what you did; I followed your lines one after the other.
      datype_str = ", ".join("{} {}".format(n, d) for (n, d) in zip(dta.columns, dta.dtypes.replace(replacements)))
      This is the line. Sorry, but I don't see any kind of comparing going on.

    • @stratascratch
      @stratascratch  3 years ago

      @@MichelUNITED I believe there might be something wrong with the dta values or column headers. You should also check your replacements array to make sure you're getting the same output as I did. See here for info: stackoverflow.com/questions/53478932/cannot-compare-types-ndarraydtype-int64-and-str

    • @MichelUNITED
      @MichelUNITED 3 years ago

      @@stratascratch Can I send you my code for you to have a look at, please? I hope you see this; I am working on something related to your video.
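
      Context for readers of this thread: the comparison error comes from dtypes.replace() matching pandas dtype objects against string keys. Casting the dtypes to strings first sidesteps it; the replacements mapping below is the usual pandas-to-Postgres one, an assumption rather than a copy of the video's dict:

      replacements = {
          'object': 'varchar',
          'float64': 'float',
          'int64': 'int',
          'datetime64[ns]': 'timestamp',
          'timedelta64[ns]': 'varchar',
      }

      col_str = ', '.join(
          '{} {}'.format(n, d)
          for n, d in zip(dta.columns, dta.dtypes.astype(str).replace(replacements))
      )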

  • @full_bladder
    @full_bladder 2 years ago

    You did not blur out your password. If it is an important db, or a password you've used before for other accounts, change it.

  • @arpittrivedi6636
    @arpittrivedi6636 2 years ago +1

    What about Excel files?

    • @stratascratch
      @stratascratch  2 years ago

      That can also work. But for now, we can use Python.

  • @huseyindansman7415
    @huseyindansman7415 2 years ago

    It's really slow to insert a lot of big data into an MS SQL Server db using the pyodbc library; any idea you have for doing it faster would be really useful to me.

    • @stratascratch
      @stratascratch  2 years ago

      I feel like it's always a slow process, no matter the db engine. There are engines that allow you to multi-thread the process, but otherwise batch processing is the only thing that comes to mind. There's usually a parameter setting for batch loading in whatever Python library you're using to import your data.
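
      With pyodbc specifically, that parameter is fast_executemany, which binds parameters in bulk instead of making one server round trip per row (connection details and table are hypothetical; df is the dataframe from earlier):

      import pyodbc

      conn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};'
                            'SERVER=localhost;DATABASE=mydb;Trusted_Connection=yes;')
      cursor = conn.cursor()
      cursor.fast_executemany = True   # bulk parameter binding

      rows = list(df.itertuples(index=False, name=None))
      cursor.executemany('INSERT INTO my_table VALUES (?, ?, ?)', rows)
      conn.commit()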

    • @huseyindansman7415
      @huseyindansman7415 2 years ago

      @@stratascratch Thank you so much for your reply, but I found a way to make it faster using a different parameter; now I'm able to insert 16 thousand rows within 20 or 30 seconds.

  • @limichelle6895
    @limichelle6895 3 years ago

    Hi, I have a question: how do I upload only the newest CSV file in a folder? Thanks

    • @stratascratch
      @stratascratch  3 years ago

      Sorry, not sure what you mean?

    • @limichelle6895
      @limichelle6895 3 years ago

      @@stratascratch Hi, thanks for getting back to me. I have a folder that gets a new CSV file every single day, and I would like to upload only the newest data to MySQL. Can you help me? Thanks

    • @stratascratch
      @stratascratch  3 years ago +1

      @@limichelle6895 You can just upload it to Colab every day; that's the manual way. Otherwise, you could automate it by specifying the name of the CSV file in your code. If there's a new CSV file every day, I would make sure the file has the current date at the end of its name, then have your code find the file name with the current date. The script will find the file and import it.

    • @limichelle6895
      @limichelle6895 3 years ago

      And my CSV data comes with a new date every day, for example 280721, 290721, and so on.
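
      Given date-stamped names like those (DDMMYY), a small sketch for picking the newest file, assuming the six digits sit right before the .csv extension (folder and file prefix are hypothetical):

      import glob
      from datetime import datetime

      import pandas as pd

      files = glob.glob('daily_csvs/*.csv')   # e.g. sales_280721.csv, sales_290721.csv
      newest = max(files, key=lambda f: datetime.strptime(f[-10:-4], '%d%m%y'))
      df = pd.read_csv(newest)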

  • @mariusbotezatu6365
    @mariusbotezatu6365 a year ago

    Try the same thing with 1M rows in the CSV file :)

    • @stratascratch
      @stratascratch  a year ago +1

      It'll break! You'll need to import in batches. I think there's a parameter for batching in one of the import functions in python. =)
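
      The batching parameter in question is most likely chunksize, which pandas accepts on both read_csv and to_sql (connection string and names are hypothetical):

      import pandas as pd
      from sqlalchemy import create_engine

      engine = create_engine('postgresql://user:password@host:5432/dbname')  # assumed
      for chunk in pd.read_csv('big_file.csv', chunksize=100_000):
          chunk.to_sql('big_table', engine, if_exists='append', index=False)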

  • @bennguyen1313
    @bennguyen1313 6 months ago

    What if the data file is not in a perfect tabular/CSV format? For example, I have a large text file like this:
    #Time (HHH:MM:SS): 002:34:02
    # T(ms) BUS CMD1 CMD2 FROM SA TO SA WC TXST RXST ERROR DT00 DT01 DT02 DT03 DT04 DT05 DT06 DT07
    # ===== === ==== ==== ==== == ==== == == ==== ==== ====== ==== ==== ==== ==== ==== ==== ==== ====
    816 B0 D84E BC RT27 2 14 D800 2100 0316 0000 0000 0000 0000 CCCD 0000
    817 A0 DC50 RT27 2 BC 16 D800 2120 0000 4080 3000 0000 3000 0000 0000
    #Time (HHH:MM:SS): 002:34:03
    # T(ms) BUS CMD1 CMD2 FROM SA TO SA WC TXST RXST ERROR DT00 DT01 DT02 DT03 DT04 DT05 DT06 DT07
    # ===== === ==== ==== ==== == ==== == == ==== ==== ====== ==== ==== ==== ==== ==== ==== ==== ====
    056 B0 D84E BC RT27 2 14 D800 2100 0316 0000 0000 0000 0000 CCCD 0000
    057 A0 DC50 RT27 2 BC 16 D800 2120 0000 4080 3000 0000 3000 0000 0000
    How can I get just the data from DT00 through DT07 into an array, without doing lots of preprocessing to scrub out the repeating #Time headers that appear throughout the file?
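
    A heavily hedged sketch resting on one assumption: that the DT words are always the trailing whitespace-separated tokens on each data line (the sample rows carry fewer than eight, so take what is there; the file name is hypothetical):

    dt_rows = []
    with open('bus_log.txt') as f:
        for line in f:
            parts = line.split()
            if not parts or parts[0].startswith('#'):
                continue                    # skip blanks and the repeating #Time/header blocks
            dt_rows.append(parts[-8:])      # assumed: DT00..DT07 are at most the last 8 tokens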