Pdf Data Extraction Using Python | Pypdf2 Extract PDF Data to Excel | Extract Text From PDF to Excel

  • Added 8 Sep 2024

Comments • 66

  • @danilsagidullin8116
    @danilsagidullin8116 3 months ago +1

    Thank you so much! I racked my brain trying to solve this kind of task.

  • @ideationtosuccess5439
    @ideationtosuccess5439 4 months ago +1

    Fantastic. Exactly what I was looking for. Thanks, buddy! One quick question: I see you are using 3 PDF files for this demo, named file1, file2, file3. Do the PDF names need to be in sequential order for the loop to iterate over each file, or can the file names be anything?

    • @Python2020
      @Python2020 4 months ago +1

      No, that is just an example. You can loop over the files in a folder any way you like, and if you want, you can sort the files by different properties.
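A minimal sketch of that idea, assuming only the standard library. The helper name `list_pdfs` and its `sort_by` parameter are made up for illustration and are not taken from the video.

```python
from pathlib import Path


def list_pdfs(folder, sort_by="name"):
    """Return the PDF paths in `folder`, sorted by name or by modification time."""
    # File names do not need any particular pattern; only the extension is checked.
    pdfs = [p for p in Path(folder).iterdir()
            if p.is_file() and p.suffix.lower() == ".pdf"]
    if sort_by == "mtime":
        # Oldest file first.
        return sorted(pdfs, key=lambda p: p.stat().st_mtime)
    # Default: case-insensitive alphabetical order.
    return sorted(pdfs, key=lambda p: p.name.lower())
```

Each returned path can then be opened in the loop, for example `for pdf_path in list_pdfs("all_format_pdf"): ...`.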

  • @elitebrightfuture4405
    @elitebrightfuture4405 1 year ago +1

    Please make a video on PyPDF2 installation, because I have an issue: it downloads as a file, like a Word or Internet Explorer file.

  • @yashchavan7880
    @yashchavan7880 2 years ago +3

    Can you please share the code used in this video?

  • @technicalknowledge9128
    @technicalknowledge9128 1 month ago

    I have a problem with my code. Can you help debug it and find where the problem is?

  • @elitebrightfuture4405
    @elitebrightfuture4405 1 year ago +1

    How did you install PyPDF2 on your PC?

    • @Python2020
      @Python2020 1 year ago

      No, in a Python environment. Watch video 95 on my channel, on how to create a new Python environment.

  • @md-mohammed8455
    @md-mohammed8455 1 year ago +1

    Sir, in my case I have scanned PDF files, and I want to copy only a specific text from each page.
    Text sample: xxxxx-xxx-xxx-IR-xxxxx
    I only know that IR is fixed on each page. I don't know what the text before and after IR is, but I want to copy it all together from each page.
    Please help. If you have a VBA macro, please share it, or any other tool.
    Thanks

    • @Python2020
      @Python2020 1 year ago +1

      There is another video on my channel: Get Text from Scan pdf

    • @md-mohammed8455
      @md-mohammed8455 1 year ago

      @Python2020 Sir, specific text also?

    • @md-mohammed8455
      @md-mohammed8455 1 year ago

      @Python2020 Sir, I searched but could not find it. Please share the link.

    • @Python2020
      @Python2020 1 year ago

      Here: czcams.com/video/Eg5pkNpYdmE/video.html
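Once the page text is available (for a scanned PDF that means after OCR, which is not shown here), the code around the fixed `IR` part could be picked out with a regular expression. This is only a sketch: the function name `find_ir_codes` is hypothetical, and it assumes the code is made of letters or digits joined by hyphens with no spaces inside it, as in the sample.

```python
import re

# Hyphen-separated groups with a literal "IR" group somewhere in the middle,
# e.g. xxxxx-xxx-xxx-IR-xxxxx. What comes before and after IR is not fixed.
IR_CODE_RE = re.compile(r"\b\w+(?:-\w+)*-IR-\w+(?:-\w+)*")


def find_ir_codes(page_text):
    """Return every hyphenated code on the page that contains the group 'IR'."""
    return IR_CODE_RE.findall(page_text)
```

Calling `find_ir_codes` on the text of each page gives a list per page, which may be empty if the OCR output split or misread the code.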

  • @miladmirzaei1762
    @miladmirzaei1762 1 year ago

    Hello, I have an error:
    Traceback (most recent call last):
    File "C:\Users\milad\PycharmProjects\pythonProject5\main.py", line 6, in <module>
    for file_name in os.listdir('all_format_pdf'):
    FileNotFoundError: [WinError 3] The system cannot find the path specified: 'all_format_pdf'
    How can I fix it?
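A relative folder name such as `'all_format_pdf'` is looked up in the directory the script is running from, so this error usually means the folder is somewhere else. The sketch below, using a made-up helper name `resolve_folder`, shows one way to make that visible before looping.

```python
import os
from pathlib import Path


def resolve_folder(folder):
    """Return the absolute path of `folder`, or fail with a hint about where Python looked."""
    path = Path(folder).resolve()  # relative names are resolved against os.getcwd()
    if not path.is_dir():
        raise FileNotFoundError(
            f"{path} does not exist; the script is running from {os.getcwd()}"
        )
    return path
```

If the message shows an unexpected location, passing the full folder path to `os.listdir`, or changing the working directory with `os.chdir`, are the two fixes mentioned elsewhere in these comments.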

  • @1stlookdigitalmedia
    @1stlookdigitalmedia 1 year ago

    Sir, I have one question; please answer.
    I have PDFs in regional languages like Hindi and Marathi, and the fonts in the PDFs are not Unicode. How can I extract the data from the PDF into Excel? I need the PDF data as Unicode in Excel.
    Example: PDF files like voter lists in regional languages.
    Please answer; I keep trying, but everything has been disappointing.
    Thanks in advance

    • @Python2020
      @Python2020 1 year ago

      You can install fonts for the local languages and use them for the text. You first need to get proper text output; after that, the regex concept remains the same.

  • @dilkashgazala831
    @dilkashgazala831 1 year ago

    Hi, can you please tell me whether it is possible to extract tables with a similar structure from different PDFs into an Excel sheet using Python?

    • @Python2020
      @Python2020 1 year ago

      You can share a sample at hiteshb0101@gmail.com

    • @dilkashgazala831
      @dilkashgazala831 1 year ago

      @Python2020 Hi, there is some confidential data that I can't share, but I can describe my problem. Suppose I have three PDFs containing student details in a tabular format with the same schema, but each PDF is from a different institute. I want to extract the table data from all three PDFs into an Excel sheet using Python. Kindly help me.

    • @Python2020
      @Python2020 1 year ago +1

      OK. In this video I am already storing the data in variables. You have to split on the start and end of the table. Next is how to write the data into cells; for that, watch video no. 5.
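The start and end split could look roughly like the sketch below. The function name `lines_between` and the marker strings are assumptions for illustration; real markers depend on what the extracted text of each PDF actually contains.

```python
def lines_between(text, start_marker, end_marker):
    """Return the non-empty lines after the line containing start_marker
    and before the line containing end_marker."""
    rows = []
    inside = False
    for line in text.splitlines():
        if not inside:
            # Still looking for the table header (or other start keyword).
            if start_marker in line:
                inside = True
            continue
        if end_marker in line:
            break  # reached the end of the table
        if line.strip():
            rows.append(line.strip())
    return rows
```

Each returned row is still a plain string. Splitting it into cells and passing the list to `sheet.append(...)` from openpyxl is one way to write it to Excel.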

  • @hungsingtsoi9078
    @hungsingtsoi9078 1 year ago

    for file_name in os.listdir('BAML'):
    print(file_name) #Loop on Files
    load_pdf = open(r'M:\Public Trade Operation\\Middle Office Package\\2023\\01 January\\Trade Confirmation\\US trades\\BAML\\'+file_name, 'rb')
    read_pdf = PyPDF2.PdfReader(load_pdf) #Load All Pdf in Variable
    page_count = len(read_pdf.pages) #Count the pdf pages
    first_page = read_pdf.pages[0] #read only the first page
    page_content = first_page.extract_text() #extract string output
    page_content = page_content.replace('\n','')
    print(page_content)
    Hello, I have multiple PDFs in the folder. It did show all the PDF names when I print(file_name).
    However, it only prints the content of the first PDF rather than all of them.
    Would you please take a look? Thanks a lot.

    • @Python2020
      @Python2020 1 year ago

      You need to get all the pages in a list and then loop over each page. Add the code in the loop body.

    • @hungsingtsoi9078
      @hungsingtsoi9078 1 year ago

      Can you provide an example?
      I am new to Python and not very familiar with it.
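A possible shape for that loop, as a sketch rather than the code from the video. It assumes a PyPDF2 version that provides `PdfReader`, `.pages` and `extract_text()`, as used in the comment above. The function names `extract_all_text` and `read_folder` are made up.

```python
import os


def extract_all_text(reader):
    """Join the text of every page of a reader (any object exposing .pages)."""
    parts = []
    for page in reader.pages:  # loop over all pages instead of reading pages[0] only
        parts.append(page.extract_text() or "")
    return "\n".join(parts)


def read_folder(folder):
    """Return {file name: full text} for every PDF in `folder`."""
    import PyPDF2  # third-party; imported here so extract_all_text works without it

    results = {}
    for file_name in sorted(os.listdir(folder)):
        if not file_name.lower().endswith(".pdf"):
            continue
        # Everything below stays inside the for loop, so it runs once per file.
        with open(os.path.join(folder, file_name), "rb") as load_pdf:
            read_pdf = PyPDF2.PdfReader(load_pdf)
            results[file_name] = extract_all_text(read_pdf)
    return results
```

The key points are that the reading code is indented under the `for file_name` loop, and that a second loop walks over `read_pdf.pages`.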

  • @rajasekharreddy.g3952
    @rajasekharreddy.g3952 2 years ago

    Do we need to install prerequisites beforehand? Please share how.

    • @Python2020
      @Python2020 2 years ago

      Yes, we have to. Check video no. 95 for installing a Python library.

  • @Flixrin
    @Flixrin 2 years ago

    When running the code at 2:39 I got a syntax error for `for file_name`. Any idea why? I'm using Jupyter Notebook.

    • @Python2020
      @Python2020 2 years ago

      Share the full line of your code.

    • @Flixrin
      @Flixrin 2 years ago +1

      @Python2020 Hi, never mind. The error was because I had not installed the PyPDF2 module.

  • @FMP_Media
    @FMP_Media 2 years ago

    I have another question, please: how can I extract all the PDF pages, not only the first page? Some of my PDF files have around 14 pages, so how should I extract all of them?

    • @Python2020
      @Python2020 2 years ago

      There is a line where I have used zero. That is where you have to add a loop, after getting the page count.

    • @FMP_Media
      @FMP_Media 2 years ago

      @Python2020 Thank you, it works 👍🏻❤️

  • @ritwikmishra4841
    @ritwikmishra4841 1 year ago

    Does anyone know how to convert Excel file sheets into PDF format using Python?

    • @Python2020
      @Python2020 1 year ago

      There is macro code on my channel that may work for you.

  • @kibtiachowdhury6011
    @kibtiachowdhury6011 2 years ago

    Sir, I want to remove the header and footer from every page. Could you please help me?

    • @Python2020
      @Python2020 2 years ago

      You can create a new PDF: first extract the data from the existing one, then use that data to write the new PDF. Refer to video no. 12 on my channel.

  • @kiraningale_
    @kiraningale_ 2 years ago

    Sir, I want to extract the text fields and also the table below them. Can we do that?

    • @Python2020
      @Python2020 2 years ago

      For tables there is a different approach. You have to identify some keyword at the start of the table and at the end of the table, and run a loop over what is in between.

    • @kiraningale_
      @kiraningale_ 2 years ago

      @Python2020 I need a specific column or a specific row, as the user wants.

    • @Python2020
      @Python2020 2 years ago

      Use a counter inside the loop, or skip a column by identifying its text. It's complex logic, I know.
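As a rough illustration of picking columns by position, the sketch below assumes each table row has already been isolated as one line of text and that cells are separated by whitespace, which breaks down if a cell itself contains spaces. The name `pick_columns` is hypothetical.

```python
def pick_columns(rows, wanted):
    """Keep only the cells at the given 0-based positions from each text row."""
    picked = []
    for row in rows:
        cells = row.split()  # naive split: one cell per whitespace-separated token
        if len(cells) <= max(wanted):
            continue  # row is too short to be a real table row; skip it
        picked.append([cells[i] for i in wanted])
    return picked
```

Selecting a specific row works the same way, by indexing into the list of rows instead of the list of cells.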

  • @gourav0934
    @gourav0934 1 year ago

    I am getting the error "zipfile.BadZipFile: File is not a zip file". Can you please help me?

    • @Python2020
      @Python2020 1 year ago

      Your file might be a scanned one. Try copying the text manually and see whether you can select it and paste it into Notepad.
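Another possibility, not confirmed anywhere in this thread: an `.xlsx` workbook is a ZIP container, so this error can also appear when a spreadsheet library is asked to load a file that is not really an `.xlsx` (for example a renamed `.xls` or `.csv`). A quick standard-library check, with a made-up helper name:

```python
import zipfile


def is_zip_container(path):
    """Return True if the file at `path` is a ZIP archive, as every .xlsx file is."""
    return zipfile.is_zipfile(path)
```

If this returns False for the Excel file the script opens, the workbook file rather than the PDF may be the thing to look at.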

  • @rajasekharreddy.g3952
    @rajasekharreddy.g3952 2 years ago

    I am getting this error when I try to run the script: 'tuple' object has no attribute 'seek'

    • @Python2020
      @Python2020 2 years ago

      Check the last line of the error and paste the last 2 lines into Google, or post the line of code that is causing the error.

    • @rajasekharreddy.g3952
      @rajasekharreddy.g3952 2 years ago

      @Python2020 This is the error I got:
      line 7, in <module>
      read_pdf = PyPDF2.PdfFileReader(load_pdf)
      in read
      stream.seek(-1, 2)
      AttributeError: 'tuple' object has no attribute 'seek'

    • @Python2020
      @Python2020 2 years ago

      @rajasekharreddy.g3952 Send me your file at hiteshb0101@gmail.com

    • @rajasekharreddy.g3952
      @rajasekharreddy.g3952 2 years ago

      @Python2020 I sent the file to your email.

    • @Python2020
      @Python2020 2 years ago

      Can you pass the full folder path in the for loop and try? I just checked your mail on my mobile.

  • @lepdenlkr2427
    @lepdenlkr2427 1 year ago +1

    Are you on Fiverr?

  • @FMP_Media
    @FMP_Media 2 years ago

    Bro, I've done exactly as you did and installed the required libraries, but an error pops up on line 4 ( for file_name in os.listdir('.........') )
    The error is: FileNotFoundError: [WinError 3] The system cannot find the path specified: '....'
    Do you have any solution that might help, please?

    • @Python2020
      @Python2020 2 years ago

      The code is correct; just check the points below. Use the complete folder path in os.listdir. The PDF file should be in that folder. The PDF should be text, not scanned.

    • @Python2020
      @Python2020 2 years ago

      Your error relates to os.listdir. Search Google for: iterate over pdf files in a folder using python

    • @FMP_Media
      @FMP_Media 2 years ago

      @Python2020 Yeah, the code is correct for sure. I even ran another, very simple piece of code just to open a random file in Python, but I always get the same error (FileNotFoundError...), not only for PDF files but also for normal text files.
      I'm using PyCharm too; I really don't know what exactly the problem is.
      I have thousands of PDF files, scanned and not scanned. I need to extract the data from them and write it to Excel, but I can't do anything because of this error...

    • @FMP_Media
      @FMP_Media 2 years ago +2

      @Python2020 Finally I found the solution: it was because of a wrong path. I ran os.chdir to change the path, then put in the path of my PDF files, and it works. Anyway, thank you for your time, man 👍🏻

  • @harishbollineni2588
    @harishbollineni2588 2 years ago

    Can you please send me the code, sir?

    • @Python2020
      @Python2020 2 years ago

      Hi Harish, I have explained the code in the video. If you have a doubt at any point, mention the time and the question. I don't keep files saved. Let me know if you face any error.

  • @FMP_Media
    @FMP_Media 2 years ago

    My last question 😁 I'm trying to extract this data from the PDF files into just 2 columns in Excel: the first column is the PDF file name and the second column is the text I extracted for that file. I used your method from the video, but it only works for the file names in column 1, not for the extracted text in column 2. Whenever I run the code for both columns, a long error pops up:
    [ Traceback (most recent call last):
    ...........
    in check_string raise IllegalCharacterError
    openpyxl.utils.exceptions.IllegalCharacterError ]
    When I run the code for just the first column, it works well and writes the PDF file names to the first column in Excel.
    I just want to write the whole extracted text for each PDF file to column 2 in Excel, not specific details like names, addresses, and mobile numbers. If you have any tips or a solution for that, please tell me. Thank you so much 🤗

    • @Python2020
      @Python2020 2 years ago

      As per the error, there is probably a string encoding issue; that is when the illegal character error comes. You can try trimming the text or changing the encoding. For fetching a specific value you can use slicing or regex.

    • @FMP_Media
      @FMP_Media 2 years ago

      @Python2020 Alright, thank you 🙏🏻
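One way to apply the "trim" suggestion is to strip control characters from the extracted text before writing it to a cell. The sketch below assumes the characters openpyxl rejects are the ASCII control characters other than tab, newline and carriage return. The helper name `clean_for_excel` is made up.

```python
import re

# ASCII control characters except tab (\x09), newline (\x0a) and carriage return (\x0d).
CONTROL_CHARS_RE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def clean_for_excel(text):
    """Remove control characters that spreadsheet cells typically cannot store."""
    return CONTROL_CHARS_RE.sub("", text)
```

The cleaned string can then be written to column 2 in the same way the file name is written to column 1.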

  • @anushyaa5442
    @anushyaa5442 1 year ago

    Hi sir, I have a doubt. I need to extract specific text such as email (xxxxx@xxxx), phone numbers ((xxx)xxx-xxxx), and name into a sheet and save the data as Excel or CSV. I wrote this program; please help me solve it. The code is below.
    import PyPDF2
    import openpyxl
    import re
    import pytesseract as tess
    tess.pytesseract.tesseract_cmd=r"C:/Tesseract-OCR/tesseract.exe"
    from PIL import Image
    excel=openpyxl.Workbook()
    sheet=excel.active
    sheet.title='pdf'
    sheet.append(['phone number'])
    pdfFileObj = open('C:\Program Files\Python310/filename.pdf', 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    num_pages = pdfReader.numPages
    count = 0
    text = ""
    #The while loop will read each page.
    while count < num_pages:
        pageObj = pdfReader.getPage(count)
        count +=1
        text += pageObj.extractText()
    text=text.replace('\n','')
    #print(text)
    print('---------')
    """

    """
    phone = re.findall('\(\d{3}\)\d{3}-\d{4}', text)
    #print(phone)
    zip_code=re.findall('\d{5}',text)
    my_zip=set(zip_code)
    #print(my_zip)
    email=re.findall('@*?\.',text)
    print (email)
    sheet.append(['phone','zip_code'])
    excel.save('C:\Program Files\Python310/file.xlsx')
    print('DONE!!')

    • @Python2020
      @Python2020 1 year ago

      After reading the PDF, use regular expressions, and use a variable to append the data to the CSV.

    • @anushyaa5442
      @anushyaa5442 1 year ago

      @Python2020 I am not able to understand. An example, please.
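A small sketch of the regex-then-CSV step, working on text that has already been extracted from the PDF. The patterns are simplified assumptions (US-style phone numbers as in the sample, 5-digit zip codes, basic email shapes), and the function names `extract_contacts` and `write_csv` are made up for illustration.

```python
import csv
import re

PHONE_RE = re.compile(r"\(\d{3}\)\s?\d{3}-\d{4}")       # (xxx)xxx-xxxx
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")  # name@domain.tld
ZIP_RE = re.compile(r"\b\d{5}\b")                       # exactly five digits


def extract_contacts(text):
    """Collect phone numbers, emails and unique zip codes found in `text`."""
    return {
        "phone": PHONE_RE.findall(text),
        "email": EMAIL_RE.findall(text),
        "zip_code": sorted(set(ZIP_RE.findall(text))),
    }


def write_csv(path, contacts):
    """Write one row per extracted value, so the found data (not the labels) is saved."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["field", "value"])
        for field, values in contacts.items():
            for value in values:
                writer.writerow([field, value])
```

Compared with the program above, the email pattern matches a whole address instead of `'@*?\.'`, and the values returned by `findall` are written out rather than the literal strings `'phone'` and `'zip_code'`.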