Pdf Data Extraction Using Python | Pypdf2 Extract PDF Data to Excel | Extract Text From PDF to Excel

Python2020

20,375 views


Added on 8 Sep 2024

Comments • 66

@danilsagidullin8116 3 months ago ⁺¹
Thank you so much! I broke my brain trying to solve a task like this.
@Python2020 3 months ago
☺
@ideationtosuccess5439 4 months ago ⁺¹
Fantastic. Exactly what I was looking for. Thanks buddy! One quick question: I noticed you are using 3 PDF files for this demo, named file1, file2, file3. Do the PDF names need to be in some sequential order to iterate over each file in the loop, or can the file names be anything?
@Python2020 4 months ago ⁺¹
No, that is just an example. You can loop over the files in the folder any way you like, and if you want you can sort the files by different properties.
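A minimal sketch of that reply, assuming the PDFs sit directly in one folder. The helper name `list_pdfs` and its `sort_by` parameter are made up for illustration and are not from the video; the file names themselves can be anything.

```python
from pathlib import Path


def list_pdfs(folder, sort_by="name"):
    """Return the PDF files in `folder`, sorted by name, size, or modified time.

    `list_pdfs` and `sort_by` are illustrative names, not part of any library.
    """
    # Keep only regular files whose extension is .pdf (case-insensitive).
    pdfs = [p for p in Path(folder).iterdir()
            if p.is_file() and p.suffix.lower() == ".pdf"]
    sort_keys = {
        "name": lambda p: p.name.lower(),
        "size": lambda p: p.stat().st_size,
        "modified": lambda p: p.stat().st_mtime,
    }
    return sorted(pdfs, key=sort_keys[sort_by])
```

Each returned path can then be opened with `open(path, 'rb')` inside the loop, exactly as with file1, file2, file3.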
@elitebrightfuture4405 1 year ago ⁺¹
Please make a video on PyPDF2 installation, because I have an issue: it downloads as a file, like a Word or Internet Explorer file.
@yashchavan7880 2 years ago ⁺³
Can you please share the code used in this video?
@technicalknowledge9128 1 month ago
I have a problem with my code. Can you debug it and find where the problem is?
@elitebrightfuture4405 1 year ago ⁺¹
How did you install PyPDF2 on your PC?
@Python2020 1 year ago
No, in a Python environment. Watch video 95 on my channel: how to create a new Python environment.
@md-mohammed8455 1 year ago ⁺¹
Sir, in my case I have scanned PDF files, and I want to copy only a specific text from each page.
Text sample: xxxxx-xxx-xxx-IR-xxxxx
I only know that IR is fixed on each page. I don't know what the text before and after IR is, but I want to copy it all together from each page.
Please help. If you have a VBA macro, please share it, or any other tool.
Thanks
@Python2020 1 year ago ⁺¹
There is another video on my channel: Get Text from Scan pdf
@md-mohammed8455 1 year ago
@Python2020 Sir, for specific text also?
@md-mohammed8455 1 year ago
@Python2020 Sir, I searched but could not find it. Please share the link.
@Python2020 1 year ago
Here: czcams.com/video/Eg5pkNpYdmE/video.html
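Once a scanned page has been turned into text by OCR (the subject of the linked video), the code around the fixed "IR" part can be picked out with a regular expression. The sketch below assumes the code is made of letter/digit groups joined by hyphens, as in the sample xxxxx-xxx-xxx-IR-xxxxx; the name `find_ir_codes` is hypothetical.

```python
import re

# Assumption: hyphen-separated groups of letters/digits, with a literal IR group.
IR_CODE = re.compile(
    r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*-IR-[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*"
)


def find_ir_codes(page_text):
    """Return every hyphenated code on the page that contains the '-IR-' part."""
    return IR_CODE.findall(page_text)
```

Calling `find_ir_codes` on the text of each page returns a list, so a page with no match simply gives an empty list.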
@miladmirzaei1762 1 year ago
Hello, I have an error:
Traceback (most recent call last):
  File "C:\Users\milad\PycharmProjects\pythonProject5\main.py", line 6, in <module>
    for file_name in os.listdir('all_format_pdf'):
FileNotFoundError: [WinError 3] The system cannot find the path specified: 'all_format_pdf'
How can I fix it?
@Python2020 1 year ago
Give the complete path.
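A relative name such as 'all_format_pdf' is looked up in the current working directory, which is not always the folder that contains the script. The sketch below is one way to build the complete path and get a clearer message when the folder is missing; `resolve_folder` and `base_dir` are hypothetical names used only for illustration.

```python
from pathlib import Path


def resolve_folder(folder, base_dir=None):
    """Turn a folder name into a full path, or fail with a readable message."""
    base = Path(base_dir) if base_dir is not None else Path.cwd()
    path = Path(folder)
    if not path.is_absolute():
        path = base / path
    if not path.is_dir():
        raise FileNotFoundError(
            f"{path} is not a folder (current working directory: {Path.cwd()})"
        )
    return path


# Possible usage inside main.py, resolving relative to the script's own folder:
#   pdf_folder = resolve_folder('all_format_pdf', base_dir=Path(__file__).parent)
#   for file_name in os.listdir(pdf_folder):
#       ...
```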
@1stlookdigitalmedia 1 year ago
Sir, I have one question, please answer me.
I have PDFs in regional languages like Hindi and Marathi, and the fonts in the PDFs are not Unicode. How can I extract the data from the PDF into Excel? I need that PDF data in Unicode within Excel.
Example: PDF files like voter lists in regional languages.
Please answer, as I have been trying all the time but every attempt has been disappointing.
Thanks in advance
@Python2020 1 year ago
You can install fonts for the local languages and use them for the text. You first need to get proper text output; after that the regex concept remains the same.
@dilkashgazala831 1 year ago
Hi, can you please tell me whether it is possible to extract tables of similar structure from different PDFs into an Excel sheet using Python?
@Python2020 1 year ago
You can share a sample at hiteshb0101@gmail.com
@dilkashgazala831 1 year ago
@Python2020 Hi, there is some confidential data that I can't share, but I can describe my problem. Suppose I have three PDFs that contain student details in a tabular format with the same schema, but each PDF is from a different institute. I want to extract the data in the tables from all three PDFs into an Excel sheet using Python. Kindly help me.
@Python2020 1 year ago ⁺¹
OK. In this video I am already storing the data in variables. You have to identify the start and end of the table. Next is how to write the data into cells; for that, watch video no. 5.
@hungsingtsoi9078 1 year ago
for file_name in os.listdir('BAML'):
    print(file_name) #Loop on Files
load_pdf = open(r'M:\Public Trade Operation\\Middle Office Package\\2023\\01 January\\Trade Confirmation\\US trades\\BAML\\'+file_name, 'rb')
read_pdf = PyPDF2.PdfReader(load_pdf) #Load All Pdf in Variable
page_count = len(read_pdf.pages) #Count the pdf pages
first_page = read_pdf.pages[0] #read only the first page
page_content = first_page.extract_text() #extract string output
page_content = page_content.replace('\n','')
print(page_content)
Hello, I have multiple PDFs in the folder. It shows all the PDF names when I print(file_name).
However, it only prints the content of one PDF rather than all of them.
Would you please take a look? Thanks a lot.
@Python2020 1 year ago
You need to get all the pages in a list and then loop over each page. Add the code inside the loop body.
@hungsingtsoi9078 1 year ago
Can you provide an example?
I am new to Python and not very familiar with it.
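One way to read the advice above: everything that opens and reads a PDF has to be indented under the for loop so it runs once per file, and the pages are looped over instead of taking only `pages[0]`. This is a sketch, not the code from the video. It uses `PyPDF2.PdfReader`, `.pages`, and `extract_text()` as in the comment above; the `make_reader` parameter is a hypothetical hook added so the loop can be tried without real PDFs.

```python
import os


def extract_pdf_text(reader):
    """Join the text of every page of an already opened PDF reader."""
    parts = []
    for page in reader.pages:                 # every page, not just pages[0]
        parts.append(page.extract_text() or "")
    return "".join(parts).replace("\n", "")


def extract_folder(folder, make_reader=None):
    """Return {file_name: text} for every PDF found in `folder`."""
    if make_reader is None:
        import PyPDF2                         # third-party, imported only when needed
        make_reader = PyPDF2.PdfReader
    results = {}
    for file_name in sorted(os.listdir(folder)):
        if not file_name.lower().endswith(".pdf"):
            continue
        # These lines are inside the loop body, so they run for each file.
        with open(os.path.join(folder, file_name), "rb") as load_pdf:
            results[file_name] = extract_pdf_text(make_reader(load_pdf))
    return results
```

Passing the full folder path to `extract_folder` and printing each value gives one block of text per PDF.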
@rajasekharreddy.g3952 2 years ago
Do we need to install any prerequisites first? Please share how.
@Python2020 2 years ago
Yes, we have to. Check video no. 95 for installing a Python library.
@Flixrin 2 years ago
When running the code at 2:39 I got a syntax error on the for file_name line. Any idea why? I'm using Jupyter Notebook.
@Python2020 2 years ago
Share the full line of your code.
@Flixrin 2 years ago ⁺¹
@Python2020 Hi, never mind. The error was because I had not installed the PyPDF2 module.
@FMP_Media 2 years ago
I have another question, please: how can I extract all the PDF pages, not only the first page? Some of my PDF files have around 14 pages, so how should I extract all of them?
@Python2020 2 years ago
There is a line where I have used zero as the page index. There you have to add a loop after getting the page count.
@FMP_Media 2 years ago
@Python2020 Thank you, it works 👍🏻❤️
@ritwikmishra4841 1 year ago
Does anyone know how to convert Excel file sheets into PDF format using Python?
@Python2020 1 year ago
There is macro code on my channel that may work for you.
@kibtiachowdhury6011 2 years ago
Sir, I want to remove the header and footer from every page. Could you please help me?
@Python2020 2 years ago
You can create a new PDF. First extract the data from the existing one, then use that data to write the new PDF. Refer to video no. 12 on my channel.
@kiraningale_ 2 years ago
Sir, I want to extract the text fields and also the table below them. Can we do that?
@Python2020 2 years ago
For a table there is a different approach. You have to identify some keyword at the start of the table and at the end of the table, and run a loop over what is in between.
@kiraningale_ 2 years ago
@Python2020 I need a specific column or a specific row, as the user wants.
@Python2020 2 years ago
Use a counter inside the loop, or skip a column by identifying its text. It's complex logic, I know.
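The keyword approach described in this thread could look roughly like the sketch below. It assumes the extracted text has one table row per line and that cells are separated by whitespace with no spaces inside a cell, which real PDFs often violate. `table_rows` and `pick_column` are hypothetical helper names.

```python
def table_rows(lines, start_keyword, end_keyword):
    """Yield the rows found between the start-keyword line and the end-keyword line."""
    inside = False
    for line in lines:
        if not inside:
            # The line containing the start keyword (the header) is skipped.
            inside = start_keyword in line
            continue
        if end_keyword in line:
            break
        if line.strip():
            yield line.split()   # assumption: whitespace-separated cells


def pick_column(rows, index):
    """Return one column (0-based index) from the rows, skipping short rows."""
    return [row[index] for row in rows if len(row) > index]
```

For a specific row instead of a column, the list built from `table_rows` can be indexed directly, which plays the role of the counter mentioned above.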
@gourav0934 1 year ago
I am getting the error "zipfile.BadZipFile: File is not a zip file". Can you please help me?
@Python2020 1 year ago
Your file might be a scanned one. Copy the text manually and see whether you are able to select it and paste it into Notepad.
@rajasekharreddy.g3952 2 years ago
I am getting this error when I try to run the script: 'tuple' object has no attribute 'seek'
@Python2020 2 years ago
Check the last line of the error and copy and paste the last 2 lines into Google, or post the line of code that is causing the error.
@rajasekharreddy.g3952 2 years ago
@Python2020 This is the error I got:
line 7, in <module>
    read_pdf = PyPDF2.PdfFileReader(load_pdf)
in read
    stream.seek(-1, 2)
AttributeError: 'tuple' object has no attribute 'seek'
@Python2020 2 years ago
@rajasekharreddy.g3952 Send me your file at hiteshb0101@gmail.com
@rajasekharreddy.g3952 2 years ago
@Python2020 I sent the file to your email.
@Python2020 2 years ago
Can you pass the full folder path in the for loop and try? I just checked your mail on mobile.
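The traceback shows the reader calling `stream.seek(...)`, so whatever was passed in as `load_pdf` was a tuple rather than an open file. The commenter's actual line is not shown, so the cause below is only a guess: a comma without the `open(...)` call builds a tuple. `require_stream` is a hypothetical helper that reports the problem before the PDF library does.

```python
def require_stream(obj):
    """Return obj if it looks like an open file; otherwise raise a clear error."""
    if not hasattr(obj, "seek"):
        raise TypeError(f"expected an open file, got {type(obj).__name__}: {obj!r}")
    return obj


# Hypothetical mistake: without the open(...) call, the comma builds a tuple.
load_pdf = "some_folder/file1.pdf", "rb"
print(type(load_pdf).__name__)  # prints: tuple
```

Wrapping the value as `require_stream(load_pdf)` just before creating the reader makes this kind of mistake visible on the line where it happens.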
@lepdenlkr2427 1 year ago ⁺¹
Are you on Fiverr?
@Python2020 1 year ago
No
@FMP_Media 2 years ago
Bro, I've done exactly as you did and installed the required libraries, but there's an error popping up in line 4 ( for file_name in os.listdir('.........') )
The error is: FileNotFoundError: [WinError 3] The system cannot find the path specified: '....'
Do you have any solution that might help, please?
@Python2020 2 years ago
The code is correct. Only check the following: os.listdir needs the complete folder path, the PDF file should be in that folder, and the PDF should be text, not scanned.
@Python2020 2 years ago
Your error relates to os.listdir. Search Google for "iterate over pdf files in a folder using python".
@FMP_Media 2 years ago
@Python2020 Yeah, the code is correct for sure. I even ran another, very simple piece of code just to open a random file in Python, but I always get the same error (FileNotFoundError...). Not only for PDF files, but also for normal text files, same error.
I'm using PyCharm too, and I really don't know what exactly the problem is.
I have thousands of PDF files, scanned and not scanned. I need to extract the data from them and write it to Excel, but I can't do anything because of this error...
@FMP_Media 2 years ago ⁺²
@Python2020 Finally I found the solution: it was a wrong path. I ran os.chdir to change the working directory to the path of my PDF files, and it works. Anyway, thank you man for your time 👍🏻
@harishbollineni2588 2 years ago
Can you please send me the code, sir?
@Python2020 2 years ago
Hi Harish, I have explained the code in the video. If you have a doubt at any point, mention the time and the question. I don't keep the files saved. Let me know if you face any errors.
@FMP_Media 2 years ago
My last question 😁 I'm trying to extract the data from the PDF files into just 2 columns in Excel: the first column is the PDF file name and the second column is the text I extracted from that PDF file. I used the method from your video, but it only works for the PDF file names in column 1, not for the extracted text in column 2. Whenever I run the code for both columns, a long error pops up:
[ Traceback ( most recent call last):
...........
in check_string raise IllegalCharacterError
openpyxl.utils.exceptions.IllegalCharacterError ]
When I run the code just for the first column, it works well and writes the PDF file names to the first column in Excel.
I just want to write the text I extracted from each PDF file into column 2. I don't want specific details like names, addresses, and mobile numbers; I want to write the whole extracted PDF text to Excel. If you have any tips or a solution for that, please tell me, and thank you so much 🤗
@Python2020 2 years ago
As per the error, there is probably a string encoding issue; that is when the illegal character error comes up. You can try trimming the text or changing the encoding. For fetching a specific value you can use slicing or regex.
@FMP_Media 2 years ago
@Python2020 Alright, thank you 🙏🏻
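A common cause of `IllegalCharacterError` is control characters left in the extracted text, which Excel cells cannot hold. The sketch below strips them before the text is written. The character ranges are an assumption (ASCII control characters other than tab, newline, and carriage return); openpyxl appears to expose its own pattern as `ILLEGAL_CHARACTERS_RE` in `openpyxl.cell.cell`, which is worth checking in the installed version. `clean_for_excel` is a hypothetical name.

```python
import re

# Assumption: control characters except tab (\x09), newline (\x0a), and
# carriage return (\x0d) are the ones rejected when writing a cell.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def clean_for_excel(text):
    """Remove control characters so the text can be stored in a worksheet cell."""
    return CONTROL_CHARS.sub("", text)
```

The cleaned string can then go into column 2, for example `sheet.append([file_name, clean_for_excel(page_content)])`.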
@anushyaa5442 1 year ago
Hi sir, I have a doubt. I need to extract specified text such as email (xxxxx@xxxx), phone numbers ((xxx)xxx-xxxx), and name into a sheet and convert the data into Excel or CSV. I wrote this program. Please help me solve it. The code is below.
import PyPDF2
import openpyxl
import re
import pytesseract as tess
tess.pytesseract.tesseract_cmd=r"C:/Tesseract-OCR/tesseract.exe"
from PIL import Image
excel=openpyxl.Workbook()
sheet=excel.active
sheet.title='pdf'
sheet.append(['phone number'])
pdfFileObj = open('C:\Program Files\Python310/filename.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
num_pages = pdfReader.numPages
count = 0
text = ""
#The while loop will read each page.
while count < num_pages:
    pageObj = pdfReader.getPage(count)
    count +=1
    text += pageObj.extractText()
text=text.replace('\n','')
#print(text)
print('---------')
"""

"""
phone = re.findall('\(\d{3}\)\d{3}-\d{4}', text)
#print(phone)
zip_code=re.findall('\d{5}',text)
my_zip=set(zip_code)
#print(my_zip)
email=re.findall('@*?\.',text)
print (email)
sheet.append(['phone','zip_code'])
excel.save('C:\Program Files\Python310/file.xlsx')
print('DONE!!')
@Python2020 1 year ago
After reading the PDF, use regular expressions, and use a variable to append the data to the CSV.
@anushyaa5442 1 year ago
@Python2020 I am not able to understand. Some example, please.
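As a possible example of that reply: in the posted code the pattern '@*?\.' only matches a dot (optionally preceded by @ signs), so it returns dots rather than addresses, and `sheet.append(['phone','zip_code'])` writes the two words instead of the variables. The sketch below runs on text that has already been extracted from the PDF and writes the matches to a CSV with the standard `csv` module. The patterns are simplified assumptions (US-style phone numbers, 5-digit zip codes, basic email shapes), and the function names are hypothetical.

```python
import csv
import re

PHONE = re.compile(r"\(\d{3}\)\s?\d{3}-\d{4}")       # (xxx)xxx-xxxx
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")  # simplified email shape
ZIP_CODE = re.compile(r"\b\d{5}\b")                  # exactly five digits


def extract_contacts(text):
    """Collect phone numbers, emails, and unique zip codes from extracted text."""
    return {
        "phone": PHONE.findall(text),
        "email": EMAIL.findall(text),
        "zip_code": sorted(set(ZIP_CODE.findall(text))),
    }


def write_contacts_csv(contacts, csv_path):
    """Write one row per value: the field name, then the matched value."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["field", "value"])
        for field, values in contacts.items():
            for value in values:
                writer.writerow([field, value])
```

Here `text` would be the string built up in the while loop above, and the CSV opens directly in Excel.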
