VBA Expert: Reading Scanned PDF's

Kalkytron

32,700 views

Added on 20. 08. 2024
Working example:
drive.google.c...
[updated on 19.01.2021]
Tools: Magick and Tesseract
Tesseract:
github.com/tes...
Variables:
• VBA Beginner: 02. Vari...
Loops:
• VBA Beginner: 03. Loops
Conditionals
/
API's
/
Shell
• Topical: VBA & Shell F...
FSO library
/

Comments • 32

@matewojno2103 1 year ago
I am trying to OCR scanned documents in pol, then extract only specific boxes and paste or import them into an Excel table. I have a solution in mind but not enough programming knowledge.
@Lucian0623 7 years ago
Thank you for the video!! Keep it up!
@bcapp7937 3 years ago
Thanks very much for sharing this. You mentioned that you have a 2.0 version where you put the Tesseract and Unar (apparently Magick now) zip files in the code. Could you share the code/file for this? Thanks!
@drag1c 3 years ago ⁺¹
Hi Kalkytron! I am using Professional 2013, Magick 7.0.11 and Tesseract 5. Is there a chance your Excel file does not work with them? Of course, I changed the source parts of your Excel file.
Also, to point out, I do not have the Microsoft Excel 16.0 Object Library. I have the Microsoft Excel 15.0 Object Library. The same goes for the Microsoft Excel 16.0 Office Library.
The problem is that it loads indefinitely while Excel takes 0 MB of RAM. It seems to hang.
@drag1c 3 years ago
I've found out that the Shell function does not work properly for the part that converts from PDF to JPG. When I run it, I simply don't get a JPG file.
I've tried Part 4 of the program on a JPG file (I manually added a JPG file with text to the folder) and it works.
Do you know how to fix the Shell part for PDF to JPG?
@jetzza1995 4 years ago
Hi Bart, can we please get the working example link? This would be very helpful for us.
@BRVR_ 2 years ago
Hello, could you say how I can set a higher resolution for the output JPG files? I see that in the PDF to JPG step, VBA produces JPGs that are very small and unrecognizable for Tesseract. When VBA tries to recognize the text from the JPG, the command line shows "Resolution 98" and the TXT result is mostly "abracadabra".
I think I should put some command here like magick convert input.pdf -resize 150% -quality 200 output.jpg
Call Shell(sMagick & " """ & oFile.Path & """ """ & Left(oFile.Path, Len(oFile.Path) - 3) & "jpg" & """", vbNormalFocus) 'Run Magick: PDF to JPG
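A possible direction for the resolution question, offered as a sketch rather than the author's method: ImageMagick's -density option, placed before the input file, sets the DPI at which the PDF is rasterized, and -quality takes a JPEG quality between 1 and 100 (so 200 is out of range). The helper name BuildMagickCommand and the default of 300 DPI are assumptions for illustration; sMagick is the path to magick.exe as in the line above.

```vba
' Sketch: build an ImageMagick command line that rasterizes a PDF at a higher DPI.
' BuildMagickCommand is a hypothetical helper; 300 DPI is an assumed starting value.
Function BuildMagickCommand(sMagick As String, sPdfPath As String, _
                            Optional lDensity As Long = 300) As String
    Dim sJpgPath As String
    ' Swap the "pdf" extension for "jpg", as the original Shell line does.
    sJpgPath = Left$(sPdfPath, Len(sPdfPath) - 3) & "jpg"
    ' -density comes before the input file so it applies while the PDF is read.
    ' -quality is the JPEG quality setting (1 to 100).
    BuildMagickCommand = """" & sMagick & """ -density " & lDensity & _
        " """ & sPdfPath & """ -quality 90 """ & sJpgPath & """"
End Function
```

It could then replace the hand-built string: Call Shell(BuildMagickCommand(sMagick, oFile.Path), vbNormalFocus). Note that ImageMagick relies on Ghostscript being installed to read PDF files.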
@KhalilYasser 5 years ago
Thanks a lot. The working example link is not found. Can you update it, please?
@jeronimo6159 1 year ago
Hi, do you do paid work? If so, how do I contact you?
@daviddarby3738 6 years ago ⁺¹
Complex code and no demo?
@miguelangelsorianobueno5816 7 years ago
Hello, the Working Example link doesn't work anymore. Could you fix it please? TY!!!!
@ThePimentajoao 6 years ago
Hi there!
I find this video very useful! Could you please help me by posting the code in an answer, or update the links please? Thanks in advance for the help!
João Pimenta
@danielohlsson3649 7 years ago
Hi, thanks for a helpful video. I'm however trying to read scanned documents (I have them as .tif files) that have a few check boxes, and I would like to know which of the check boxes have a check mark in them. Do you know if Tesseract supports this kind of issue? I have read about OMR (Optical Mark Recognition) but I haven't found anything for custom implementation in VBA or Python, which are the languages I know. Thank you for your help!
@kalkytron6385 7 years ago
Hi Daniel. I haven't used Tesseract with .tiff files. Also, Tesseract doesn't return anything for check boxes when read from a JPG file. But when I google "Tesseract tif" then I immediately get some possibly useful hits. I suggest you try to make it work at the command prompt first based on those articles. Once you've covered that you can move it to the script.
@deandog7223 3 years ago
Hi @Kalkytron, is it possible that you still have the working source code?
@kalkytron6385 3 years ago
Hi DeanDog. Try this link: drive.google.com/file/d/1-vCvJRWg6m6k1d23_fRC2NyQ0q4vsAdP/view?usp=sharing
It's an updated version.
@Eric-pi2rn 5 years ago
Hi,
Does anyone have a working version of this? I tried manually copying everything, but had no luck ...
This would be exactly what I need -.-
@kalkytron6385 3 years ago
Hi Eric. Here is an updated version: drive.google.com/file/d/1-vCvJRWg6m6k1d23_fRC2NyQ0q4vsAdP/view?usp=sharing
@daytodatainc.2520 6 years ago
Hello, the links are not working. Can you please update them?
@romanlight5525 6 years ago
Hello, can you update the example link?
@chriscatterall4698 7 years ago
Thanks for the really useful video. I am trying to recreate this code using MS Excel 2010 on Windows 10. Is there a Windows equivalent for Unar and Tesseract, or is this solution only for Mac? Thanks again for your help.
@kalkytron6385 7 years ago
Hi Chris. The video is made on a Windows machine. So Unar and Tesseract definitely work on there. Also, MS Excel 2010 is fine.
@chriscatterall4698 7 years ago
Hi Bart,
Thanks for your reply. I managed to find the Windows versions of Unar and Tesseract, and have downloaded your example spreadsheet. I had the same problem with the Shell command as Andre. I therefore substituted Andre's code (i.e. Function PDF_To_Txt down to ==================Put all the txt files into 1 & rename======================= .... End Function) and changed the file directory locations for my computer. However, I get a runtime error '53' "File not found" message, and the 'Call Shell(spdfEx & " " & """" & sPDF & """ """ & sTXT & """", vbNormalFocus)' line of code immediately below ==============Convert PDF to TXT====================== is highlighted. I've pasted the end of the code below. Am I missing something obvious? I'd be grateful if you could take a look. Thanks again.
Chris
Function PDF_To_Txt(sPDF As String)
'Tools -> Reference -> Microsoft Scripting Runtime
'Process:
' - PDF to JPG/TIFF with Unar --> output is 1 picture per PDF page
' - Make sure the pictures are in the correct folder
' - JPG's to TXT's
' - All TXT's into 1 TXT
' - Collect everything in an Output folder
'To do:
' - If picture; don't call Unar, but straight to Teseract
' - Add .exe and dll's to Macro
' - Check whether files already exist before creating them
Dim sPath As String
Dim spdfEx As String
Dim sTesseract As String
Dim sUnar As String
Dim sTXT As String
Dim i As Integer
Dim iSlashCounter As Integer
Dim iPageCounter As Integer
Dim sPDFname As String
Dim sNewPath As String
Dim FSO As Scripting.FileSystemObject
Dim fsoFolder As Scripting.Folder
Dim fsoFile As Scripting.File
Dim fsoFile2 As Scripting.File
Dim oTxt As Object
Dim oTxtGet As Object
Dim sAPI As String
Set FSO = CreateObject("Scripting.FileSystemObject")
iPageCounter = 1
If UCase(Right(sPDF, 3)) <> "PDF" Then
Exit Function
End If
'Get Path and PDF name from dir
iSlashCounter = 0
For i = Len(sPDF) To 1 Step -1
If Mid(sPDF, i, 1) = "\" Then
iSlashCounter = iSlashCounter + 1
If iSlashCounter = 2 Then
sPath = Mid(sPDF, 1, i)
Exit For
ElseIf iSlashCounter = 1 Then
sPDFname = Mid(sPDF, i + 1)
sPDFname = Mid(sPDFname, 1, Len(sPDFname) - 4)
End If
End If
Next i
If FSO.FolderExists(sPath & "Output") = False Then
MkDir sPath & "Output\"
Sleep 500
End If
sUnar = sPath & "unar.exe" 'PDF to JPG converter
sTesseract = sPath & "Tesseract\tesseract.exe" 'JPG to TXT converter
spdfEx = sPath & "pdfEx\pdfExtractor.exe"
sTXT = sPath & sPDFname & ".txt"
'==============Convert PDF to TXT======================
Call Shell(spdfEx & " " & """" & sPDF & """ """ & sTXT & """", vbNormalFocus)
sAPI = FindWindow(vbNullString, spdfEx)
i = 0
Do Until sAPI <> "0" Or i >= 50 'Catch the screen
Sleep 50
sAPI = FindWindow(vbNullString, spdfEx)
i = i + 1
Loop
i = 0
Do Until sAPI = "0" Or i >= 50 'loop until the screen is away
Sleep 500
sAPI = FindWindow(vbNullString, spdfEx)
i = i + 1
Loop
'Check whether there is something in the file
Set oTxtGet = FSO.OpenTextFile(sTXT, ForReading)
If FileLen(sTXT)
@kalkytron6385 7 years ago
Hi Chris, I noticed that the code I uploaded contained more than what I explained in the video. I first call an app to directly read the PDF to TXT. This one doesn't work for scans but goes a lot faster. So only if this fails I go to Unar & Tesseract. In any case, I'm sorry for the confusion.
Regarding the error in the shell, I created this video to explain the shell function and what to pay attention to: czcams.com/video/YiHVMF5N9BY/video.html
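One thing that may be worth checking for the run-time error 53 mentioned above: Shell raises "File not found" when it cannot locate the program to run, which can happen if the executable path is wrong or contains spaces and is not quoted. The following is only a sketch under that assumption; PathsExist and QuotedCommand are hypothetical helper names, not part of the uploaded workbook.

```vba
' Sketch: confirm the executable and the input file exist before calling Shell.
' PathsExist is a hypothetical helper name.
Function PathsExist(sExe As String, sInput As String) As Boolean
    Dim oFso As Object
    Set oFso = CreateObject("Scripting.FileSystemObject")
    PathsExist = oFso.FileExists(sExe) And oFso.FileExists(sInput)
End Function

' Sketch: quote every part of the command line so paths with spaces survive.
' QuotedCommand is a hypothetical helper name.
Function QuotedCommand(sExe As String, sInput As String, sOutput As String) As String
    QuotedCommand = """" & sExe & """ """ & sInput & """ """ & sOutput & """"
End Function
```

For example: If PathsExist(spdfEx, sPDF) Then Call Shell(QuotedCommand(spdfEx, sPDF, sTXT), vbNormalFocus).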
@andrejackson6020 7 years ago
Hi Bart, thanks for the video. I'm having some issues with the shell command. Please help!
@kalkytron6385 7 years ago
Hi Andre. Can you paste the code that you are using here? And maybe the error you are getting as well.
@andrejackson6020 7 years ago
I found that the code in the original downloadable version was different from the code at the end of the video, but maybe I'm wrong. I'm only interested in the Unar and Tesseract parts. When I execute the code (F8), it seems that it's skipping the Unar part and trying to convert the PDF directly to a text file rather than going through the whole process.
Thanks in advance.
Function PDF_To_Txt(sPDF As String)
'Tools -> Reference -> Microsoft Scripting Runtime
'Process:
' - PDF to JPG/TIFF with Unar --> output is 1 picture per PDF page
' - Make sure the pictures are in the correct folder
' - JPG's to TXT's
' - All TXT's into 1 TXT
' - Collect everything in an Output folder
'To do:
' - If picture; don't call Unar, but straight to Teseract
' - Add .exe and dll's to Macro
' - Check whether files already exist before creating them
Dim sPath As String
sPath = "C:\Users\Andre\Desktop\Tesseract-OCR\Tesseract\ReadPDFs\"
Dim i As Integer
Dim iSlashCounter As Integer
Dim iPageCounter As Integer
Dim sPDFname As String
Dim fsoFolder As Scripting.Folder
Dim fsoFile As Scripting.File
Dim fsoFile2 As Scripting.File
Dim oTxt As Object
Dim oTxtGet As Object
Dim sAPI As Long
Dim FSO As Scripting.FileSystemObject
Dim oFolder As Scripting.Folder
Dim oFile As Scripting.File
Dim sFolder As String
Dim sUnar As String
Dim sTesseract As String
Dim sTxt As String
Dim sNewPath As String
Set FSO = CreateObject("Scripting.FileSystemObject")
sFolder = "C:\Users\Andre\Desktop\Tesseract-OCR\Tesseract\ReadPDFs"
Set oFolder = FSO.GetFolder(sFolder)
sUnar = "C:\Users\Andre\Desktop\Tesseract-OCR\Tesseract/" & "unar.exe"
sTesseract = "C:\Users\Andre\Desktop\Tesseract-OCR\Tesseract" & "/Tesseract.exe"
iPageCounter = 1
For Each oFile In oFolder.Files
sTxt = sFolder & oFile.Name & "convert"
sNewPath = sFolder & "\" & oFile.Name & "pdf"
'==============Convert PDF to TXT======================
Call Shell(sUnar & " " & """" & oFile.Name & """", vbNormalFocus) 'Run Unar: PDF to JPG
sAPI = FindWindow(vbNullString, sUnar)
i = 0
Do Until sAPI <> "0" Or i >= 50 'Catch the screen
Sleep 50
sAPI = FindWindow(vbNullString, sUnar)
i = i + 1
Loop
i = 0
Do Until sAPI = "0" Or i >= 50 'loop until the screen is away
Sleep 500
sAPI = FindWindow(vbNullString, sUnar)
i = i + 1
Loop
'Check whether a folder is made by unar. If not; make one and copy the jpg to it
If FSO.FolderExists(sNewPath) = False Then
MkDir sNewPath
Do Until FSO.FolderExists(sNewPath) = True
Sleep 100
Loop
Dim SourceFile, DestinationFile As String
SourceFile = sPath & oFile.Name 'Define source file name.
DestinationFile = sNewPath & oFile.Name 'Define target file name.
'Copy the jpg to the newly created folder
Set oFolder = FSO.GetFolder(sPath)
If Mid(oFile.Name, 1, 4) = "Page" Then
FileCopy SourceFile, DestinationFile
Sleep 500
Kill sPath & oFile.Name
Sleep 500
Exit For
End If
End If
Call Shell(Chr(34) & sTesseract & Chr(34) & " " & Chr(34) & sNewPath & oFile.Name & Chr(34) & " " & Chr(34) & sTxt & Chr(34), vbNormalFocus)
sAPI = FindWindow(vbNullString, sTesseract)
i = 0
Do Until sAPI <> "0" Or i >= 50 'Catch the screen
Sleep 50
sAPI = FindWindow(vbNullString, sTesseract)
i = i + 1
Loop
i = 0
Do Until sAPI = "0" Or i >= 50 'Loop until the screen is away
Sleep 500
sAPI = FindWindow(vbNullString, sTesseract)
i = i + 1
Loop
Next oFile
'==================Put all the txt files into 1 & rename=======================
Set oFolder = FSO.GetFolder(sNewPath)
sTxt = sNewPath & sPDFname & ".txt"
Set oTxt = FSO.CreateTextFile(sTxt, True) 'Create new txt file
Do Until FSO.FileExists(sTxt) = True
Sleep 100
Loop
For Each oTxtGet In oFolder.Files 'Loop through other files and copy text
If CStr(oTxtGet) <> sTxt Then
Set oTxtGet = FSO.OpenTextFile(oTxtGet, ForReading)
Sleep 1500
On Error Resume Next
oTxt.WriteLine (oTxtGet.ReadAll)
On Error GoTo 0
oTxt.WriteLine (" _-_-_-_-_-_-_-_-_- " & "Page " & iPageCounter & " _-_-_-_-_-_-_-_-_- ")
oTxtGet.Close
iPageCounter = iPageCounter + 1
End If
Next oTxtGet
oTxt.Close
'Copy txt file to an output folder
On Error Resume Next
MkDir sPath & "Output\"
On Error GoTo 0
FileCopy sTxt, sPath & "Output\" & sPDFname & ".txt"
On Error Resume Next
Kill sNewPath & "*" 'delete all files in folder
Do Until Err.Number = 0
On Error GoTo 0
On Error Resume Next
Kill sNewPath & "*" 'delete all files in folder
Sleep 200
Loop
On Error GoTo 0
End Function
@kalkytron6385 7 years ago
Hi Andre, I don't recommend walking through the code with F8 because the code is trying to find windows and handle them. You might interrupt this by always going back to the VBE.
Also, the code you show starts with running Unar. So that seems OK....
I do have a version in which I first try to read the PDF with a different console app (works in the case of e.g. Word files saved as PDF). If this fails, then I go for Unar and Tesseract. In any case, I don't think that I have this in the uploaded version on Dropbox.
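As an alternative to catching console windows with FindWindow and Sleep, which is what makes stepping with F8 fragile, a command can be run synchronously through the Windows Script Host. This is a sketch of a different technique, not what the video or the uploaded workbook does; RunAndWait is a hypothetical helper name.

```vba
' Sketch: run a command line and block until the process exits.
' RunAndWait is a hypothetical helper, not part of the original workbook.
Function RunAndWait(sCommand As String) As Long
    Dim oShell As Object
    Set oShell = CreateObject("WScript.Shell")
    ' Arguments: command line, window style (0 = hidden), wait for exit (True).
    ' With waiting enabled, Run returns the exit code of the process.
    RunAndWait = oShell.Run(sCommand, 0, True)
End Function
```

Because the call only returns once the tool has finished, the two polling loops after each Shell call would no longer be needed, and a non-zero return value could be used to detect a failed conversion.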
@D3_Business_Analytics 6 years ago
The link is not working, dear.