VBA Expert: Reading Scanned PDF's

  • Added on 20. 08. 2024
  • Working example:
    drive.google.c...
    [updated on 19.01.2021]
    Tools: Magick and Tesseract
    Tesseract:
    github.com/tes...
    Variables:
    • VBA Beginner: 02. Vari...
    Loops:
    • VBA Beginner: 03. Loops
    Conditionals
    /
    API's
    /
    Shell
    • Topical: VBA & Shell F...
    FSO library
    /

Comments • 32

  • @matewojno2103
    @matewojno2103 1 year ago

    I am trying to OCR scanned documents in Polish (pol), then extract only specific boxes and paste or import them into an Excel table. I have a solution in mind but not enough programming knowledge.

  • @Lucian0623
    @Lucian0623 7 years ago

    thank you for the video!! keep it up!

  • @bcapp7937
    @bcapp7937 3 years ago

    Thanks very much for sharing this. You mentioned that you have a 2.0 version where you put the Tesseract and Unar (apparently Magick now) zip files in the code. Could you share the code/file for this? Thanks!

  • @drag1c
    @drag1c 3 years ago +1

    Hi Kalkytron! I am using Professional 2013, Magick 7.0.11 and Tesseract 5. Is there a chance your Excel file does not work with them? Of course, I changed the source parts of your Excel file.
    Also, to point out, I do not have the Microsoft Excel 16.0 Object Library. I have the Microsoft Excel 15.0 Object Library. The same goes for the Microsoft Excel 16.0 Office Library.
    The problem is that it loads indefinitely while Excel takes 0 MB of RAM. It gets stuck.

    • @drag1c
      @drag1c 3 years ago

      I've found out that the Shell function does not work properly for the PDF-to-JPG conversion step. Simply put, when I run it, I don't get a JPG file.
      I've tried Part 4 of the program on a JPG file (a JPG with text that I added to the folder manually) and it works.
      Do you know how to fix the Shell part for PDF to JPG?
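
One way to narrow down a silent failure like the one described above is to run the conversion synchronously and look at the exit code. The sketch below assumes Windows Script Host is available; `QuoteArg`, `RunAndWait`, `ConvertPdfToJpg_Demo`, and all paths are hypothetical names chosen for illustration and are not taken from the original workbook. ImageMagick relies on Ghostscript to read PDFs, so a missing Ghostscript install is one possible cause worth checking, although nothing in this thread confirms it.

```vba
Option Explicit

' Hypothetical helper: wrap a path in double quotes so spaces survive the command line.
Public Function QuoteArg(ByVal s As String) As String
    QuoteArg = Chr$(34) & s & Chr$(34)
End Function

' Hypothetical helper: run a command, block until it exits, return its exit code.
Public Function RunAndWait(ByVal sCommand As String) As Long
    Dim oShell As Object
    Set oShell = CreateObject("WScript.Shell")
    ' 0 = hidden window, True = wait for the process to finish
    RunAndWait = oShell.Run(sCommand, 0, True)
End Function

Public Sub ConvertPdfToJpg_Demo()
    Dim sMagick As String, sPdf As String, sJpg As String
    Dim lExit As Long

    ' Assumed locations; adjust to the local install
    sMagick = "C:\Tools\ImageMagick\magick.exe"
    sPdf = "C:\Scans\input.pdf"
    sJpg = "C:\Scans\input.jpg"

    lExit = RunAndWait(QuoteArg(sMagick) & " " & QuoteArg(sPdf) & " " & QuoteArg(sJpg))

    ' Multi-page PDFs produce input-0.jpg, input-1.jpg, ..., hence the wildcard
    If lExit <> 0 Or Dir$("C:\Scans\input*.jpg") = "" Then
        Debug.Print "Conversion failed, exit code " & lExit
    Else
        Debug.Print "Conversion succeeded"
    End If
End Sub
```

Running the same command by hand in a command prompt first usually shows the actual error message, which the hidden window above would otherwise swallow.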

  • @jetzza1995
    @jetzza1995 4 years ago

    Hi Bart, can we please get the working example link? This would be very helpful for us.

  • @BRVR_
    @BRVR_ 2 years ago

    Hello, could you say how I can set a higher resolution for the output JPG files? I see that in the PDF-to-JPG step, VBA produces JPGs that are very small and unreadable for Tesseract. When VBA tries to recognize the text from the JPG, it shows in the command line: Resolution 98 and mostly gives an "abracadabra" result in the TXT.
    I think I should put some command here like magick convert input.pdf -resize 150% -quality 200 output.jpg
    Call Shell(sMagick & " """ & oFile.Path & """ """ & Left(oFile.Path, Len(oFile.Path) - 3) & "jpg" & """", vbNormalFocus) 'Run Magick: PDF to JPG
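
A possible direction for the resolution problem above, offered as a sketch rather than a confirmed fix: ImageMagick's `-density` option sets the resolution at which a PDF is rasterized and has to appear before the input file, while `-quality` for JPEG only accepts values from 1 to 100. Resizing after a low-resolution rasterization just enlarges a blurry image. The function name `BuildMagickCommand` and the default numbers are assumptions for illustration.

```vba
Option Explicit

' Hypothetical helper: build a magick command line that rasterizes a PDF at a given DPI.
Public Function BuildMagickCommand(ByVal sMagick As String, _
                                   ByVal sPdf As String, _
                                   ByVal sJpg As String, _
                                   Optional ByVal lDpi As Long = 300, _
                                   Optional ByVal lQuality As Long = 90) As String
    Dim q As String
    q = Chr$(34)

    ' Clamp JPEG quality to the 1..100 range
    If lQuality < 1 Then lQuality = 1
    If lQuality > 100 Then lQuality = 100

    ' -density comes before the input so it applies while the PDF is read
    BuildMagickCommand = q & sMagick & q & _
                         " -density " & CStr(lDpi) & _
                         " " & q & sPdf & q & _
                         " -quality " & CStr(lQuality) & _
                         " " & q & sJpg & q
End Function
```

With the variables from the line above, the call could then look like `Call Shell(BuildMagickCommand(sMagick, oFile.Path, Left(oFile.Path, Len(oFile.Path) - 3) & "jpg"), vbNormalFocus)`.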

  • @KhalilYasser
    @KhalilYasser 5 years ago

    Thanks a lot. The working example link is not found. Can you update it, please?

  • @jeronimo6159
    @jeronimo6159 1 year ago

    Hi, do you do paid work? If so, how do I contact you?

  • @daviddarby3738
    @daviddarby3738 6 years ago +1

    Complex code and no demo?

  • @miguelangelsorianobueno5816

    Hello, the Working Example link doesn't work anymore. Could you fix it, please? TY!!!!

  • @ThePimentajoao
    @ThePimentajoao 6 years ago

    Hi there!
    I find this video very useful! Could you please help me by posting the code in an answer or updating the links? Thanks in advance for the help!
    João Pimenta

  • @danielohlsson3649
    @danielohlsson3649 7 years ago

    Hi, thanks for a helpful video. I'm however trying to read scanned documents (I have them as .tif files) that have a few check boxes, and I would like to know which of the check boxes have a check mark in them. Do you know if Tesseract supports this kind of issue? I have read about OMR (Optical Mark Recognition) but I haven't found anything for custom implementation in VBA or Python, which are the languages I know. Thank you for your help!

    • @kalkytron6385
      @kalkytron6385 7 years ago

      Hi Daniel. I haven't used Tesseract with .tiff files. Also, Tesseract doesn't return anything for check boxes when read from a JPG file. But when I google "Tesseract tif" then I immediately get some possibly useful hits. I suggest you try to make it work at the command prompt first based on those articles. Once you've covered that you can move it to the script.
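
For trying this at the command prompt first, as suggested above, the basic Tesseract invocation is `tesseract imagename outputbase -l lang`, where Tesseract appends `.txt` to the output base. The sketch below only assembles that command string; `BuildTesseractCommand` is a hypothetical helper, the paths are made up, and it assumes a Tesseract build that reads TIFF input directly. It does not address check-box detection, which the reply above says Tesseract does not return.

```vba
Option Explicit

' Hypothetical helper: build a Tesseract command line for one image file.
' sOutBase is the output path WITHOUT extension; Tesseract adds ".txt" itself.
Public Function BuildTesseractCommand(ByVal sTesseract As String, _
                                      ByVal sImage As String, _
                                      ByVal sOutBase As String, _
                                      Optional ByVal sLang As String = "eng") As String
    Dim q As String
    q = Chr$(34)
    BuildTesseractCommand = q & sTesseract & q & " " & _
                            q & sImage & q & " " & _
                            q & sOutBase & q & _
                            " -l " & sLang
End Function

Public Sub Tesseract_Demo()
    ' Assumed paths for illustration only
    Debug.Print BuildTesseractCommand("C:\Tools\Tesseract\tesseract.exe", _
                                      "C:\Scans\form.tif", _
                                      "C:\Scans\form")
End Sub
```

Pasting the printed command into a command prompt is a quick way to confirm that the image format is accepted before wiring it into a `Shell` call.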

  • @deandog7223
    @deandog7223 3 years ago

    Hi @Kalkytron, is it possible that you still have the working source code?

    • @kalkytron6385
      @kalkytron6385 3 years ago

      Hi DeanDog. Try this link: drive.google.com/file/d/1-vCvJRWg6m6k1d23_fRC2NyQ0q4vsAdP/view?usp=sharing
      It's an updated version.

  • @Eric-pi2rn
    @Eric-pi2rn 5 years ago

    Hi,
    Does anyone have a working version of this? I tried manually copying everything, but had no luck ...
    This would be exactly what I need -.-

    • @kalkytron6385
      @kalkytron6385 3 years ago

      Hi Eric. Here is an updated version: drive.google.com/file/d/1-vCvJRWg6m6k1d23_fRC2NyQ0q4vsAdP/view?usp=sharing

  • @daytodatainc.2520
    @daytodatainc.2520 6 years ago

    Hello, the links are not working. Can you please update?

  • @romanlight5525
    @romanlight5525 6 years ago

    Hello, can you update the example link?

  • @chriscatterall4698
    @chriscatterall4698 7 years ago

    Thanks for the really useful video. I am trying to recreate this code using MS Excel 2010 on Windows 10. Is there a Windows equivalent for Unar and Tesseract, or is this solution only for Mac? Thanks again for your help.

    • @kalkytron6385
      @kalkytron6385 7 years ago

      Hi Chris. The video is made on a Windows machine. So Unar and Tesseract definitely work on there. Also, MS Excel 2010 is fine.

    • @chriscatterall4698
      @chriscatterall4698 7 years ago

      Hi Bart,
      Thanks for your reply. I managed to find the Windows versions of Unar and Tesseract, and have downloaded your example spreadsheet. I had the same problem with the Shell command as Andre. I therefore substituted Andre's code (i.e. Function PDF_To_Txt down to ==================Put all the txt files into 1 & rename======================= .... End Function). I changed the file directory locations for my computer. However, I get a runtime error '53' File not found message, and the 'Call Shell(spdfEx & " " & """" & sPDF & """ """ & sTXT & """", vbNormalFocus)' line of code immediately below ==============Convert PDF to TXT====================== is highlighted. I've pasted the end of the code below. Am I missing something obvious? I'd be grateful if you could take a look. Thanks again.
      Chris
      Function PDF_To_Txt(sPDF As String)
          'Tools -> Reference -> Microsoft Scripting Runtime
          'Process:
          ' - PDF to JPG/TIFF with Unar --> output is 1 picture per PDF page
          ' - Make sure the pictures are in the correct folder
          ' - JPG's to TXT's
          ' - All TXT's into 1 TXT
          ' - Collect everything in an Output folder
          'To do:
          ' - If picture; don't call Unar, but straight to Tesseract
          ' - Add .exe and dll's to Macro
          ' - Check whether files already exist before creating them
          Dim sPath As String
          Dim spdfEx As String
          Dim sTesseract As String
          Dim sUnar As String
          Dim sTXT As String
          Dim i As Integer
          Dim iSlashCounter As Integer
          Dim iPageCounter As Integer
          Dim sPDFname As String
          Dim sNewPath As String
          Dim FSO As Scripting.FileSystemObject
          Dim fsoFolder As Scripting.Folder
          Dim fsoFile As Scripting.File
          Dim fsoFile2 As Scripting.File
          Dim oTxt As Object
          Dim oTxtGet As Object
          Dim sAPI As String
          Set FSO = CreateObject("Scripting.FileSystemObject")
          iPageCounter = 1
          If UCase(Right(sPDF, 3)) <> "PDF" Then
              Exit Function
          End If
          'Get Path and PDF name from dir
          iSlashCounter = 0
          For i = Len(sPDF) To 1 Step -1
              If Mid(sPDF, i, 1) = "\" Then
                  iSlashCounter = iSlashCounter + 1
                  If iSlashCounter = 2 Then
                      sPath = Mid(sPDF, 1, i)
                      Exit For
                  ElseIf iSlashCounter = 1 Then
                      sPDFname = Mid(sPDF, i + 1)
                      sPDFname = Mid(sPDFname, 1, Len(sPDFname) - 4)
                  End If
              End If
          Next i
          If FSO.FolderExists(sPath & "Output") = False Then
              MkDir sPath & "Output\"
              Sleep 500
          End If
          sUnar = sPath & "unar.exe" 'PDF to JPG converter
          sTesseract = sPath & "Tesseract\tesseract.exe" 'JPG to TXT converter
          spdfEx = sPath & "pdfEx\pdfExtractor.exe"
          sTXT = sPath & sPDFname & ".txt"
          '==============Convert PDF to TXT======================
          Call Shell(spdfEx & " " & """" & sPDF & """ """ & sTXT & """", vbNormalFocus)
          sAPI = FindWindow(vbNullString, spdfEx)
          i = 0
          Do Until sAPI <> "0" Or i >= 50 'Catch the screen
              Sleep 50
              sAPI = FindWindow(vbNullString, spdfEx)
              i = i + 1
          Loop
          i = 0
          Do Until sAPI = "0" Or i >= 50 'loop until the screen is away
              Sleep 500
              sAPI = FindWindow(vbNullString, spdfEx)
              i = i + 1
          Loop
          'Check whether there is something in the file
          Set oTxtGet = FSO.OpenTextFile(sTXT, ForReading)
          If FileLen(sTXT)

    • @kalkytron6385
      @kalkytron6385 7 years ago

      Hi Chris, I noticed that the code I uploaded contained more than what I explained in the video. I first call an app to directly read the PDF to TXT. This one doesn't work for scans but goes a lot faster. So only if this fails I go to Unar & Tesseract. In any case, I'm sorry for the confusion.
      Regarding the error in the shell, I created this video to explain the shell function and what to pay attention to: czcams.com/video/YiHVMF5N9BY/video.html
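
Run-time error 53 on a `Shell` line typically means the program named in the command could not be found, so one low-effort check is to confirm that the executable and the input file exist before calling `Shell`, and to quote the executable path as well as the arguments. The following is a minimal sketch under those assumptions; `MissingPathMessage`, `SafeShell_Demo`, and the paths are hypothetical and not part of the uploaded workbook.

```vba
Option Explicit

' Hypothetical helper: return a description of the first missing file,
' or an empty string when both the executable and the input exist.
Public Function MissingPathMessage(ByVal sExe As String, ByVal sInput As String) As String
    Dim oFso As Object
    Set oFso = CreateObject("Scripting.FileSystemObject")

    If Not oFso.FileExists(sExe) Then
        MissingPathMessage = "Executable not found: " & sExe
    ElseIf Not oFso.FileExists(sInput) Then
        MissingPathMessage = "Input file not found: " & sInput
    Else
        MissingPathMessage = vbNullString
    End If
End Function

Public Sub SafeShell_Demo()
    Dim sExe As String, sPdf As String, sTxt As String, sProblem As String
    Dim q As String
    q = Chr$(34)

    ' Assumed paths for illustration only
    sExe = "C:\ReadPDFs\pdfEx\pdfExtractor.exe"
    sPdf = "C:\ReadPDFs\Input\scan.pdf"
    sTxt = "C:\ReadPDFs\scan.txt"

    sProblem = MissingPathMessage(sExe, sPdf)
    If Len(sProblem) > 0 Then
        Debug.Print sProblem      ' report instead of letting Shell raise error 53
        Exit Sub
    End If

    ' Quote the executable path too, in case it contains spaces
    Shell q & sExe & q & " " & q & sPdf & q & " " & q & sTxt & q, vbNormalFocus
End Sub
```

Printing the assembled path with `Debug.Print` before the `Shell` call also makes it easy to spot a path that was built from the wrong folder level.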

  • @andrejackson6020
    @andrejackson6020 7 years ago

    Hi Bart, thanks for the video. I'm having some issues with the Shell command. Please help!

    • @kalkytron6385
      @kalkytron6385 7 years ago

      Hi Andre. Can you paste the code that you are using here? And maybe the error you are getting as well.

    • @andrejackson6020
      @andrejackson6020 7 years ago

      I found that the code in the original downloadable version was different from the code at the end of the video, but maybe I'm wrong. I'm only interested in the Unar and Tesseract parts; when I execute the code (F8), it seems that it's skipping the Unar part and trying to convert the PDF directly to a text file rather than going through the whole process.
      Thanks in advance.
      Function PDF_To_Txt(sPDF As String)
          'Tools -> Reference -> Microsoft Scripting Runtime
          'Process:
          ' - PDF to JPG/TIFF with Unar --> output is 1 picture per PDF page
          ' - Make sure the pictures are in the correct folder
          ' - JPG's to TXT's
          ' - All TXT's into 1 TXT
          ' - Collect everything in an Output folder
          'To do:
          ' - If picture; don't call Unar, but straight to Tesseract
          ' - Add .exe and dll's to Macro
          ' - Check whether files already exist before creating them
          Dim sPath As String
          sPath = "C:\Users\Andre\Desktop\Tesseract-OCR\Tesseract\ReadPDFs\"
          Dim i As Integer
          Dim iSlashCounter As Integer
          Dim iPageCounter As Integer
          Dim sPDFname As String
          Dim fsoFolder As Scripting.Folder
          Dim fsoFile As Scripting.File
          Dim fsoFile2 As Scripting.File
          Dim oTxt As Object
          Dim oTxtGet As Object
          Dim sAPI As Long
          Dim FSO As Scripting.FileSystemObject
          Dim oFolder As Scripting.Folder
          Dim oFile As Scripting.File
          Dim sFolder As String
          Dim sUnar As String
          Dim sTesseract As String
          Dim sTxt As String
          Dim sNewPath As String
          Set FSO = CreateObject("Scripting.FileSystemObject")
          sFolder = "C:\Users\Andre\Desktop\Tesseract-OCR\Tesseract\ReadPDFs"
          Set oFolder = FSO.GetFolder(sFolder)
          sUnar = "C:\Users\Andre\Desktop\Tesseract-OCR\Tesseract/" & "unar.exe"
          sTesseract = "C:\Users\Andre\Desktop\Tesseract-OCR\Tesseract" & "/Tesseract.exe"
          iPageCounter = 1
          For Each oFile In oFolder.Files
              sTxt = sFolder & oFile.Name & "convert"
              sNewPath = sFolder & "\" & oFile.Name & "pdf"
              '==============Convert PDF to TXT======================
              Call Shell(sUnar & " " & """" & oFile.Name & """", vbNormalFocus) 'Run Unar: PDF to JPG
              sAPI = FindWindow(vbNullString, sUnar)
              i = 0
              Do Until sAPI <> "0" Or i >= 50 'Catch the screen
                  Sleep 50
                  sAPI = FindWindow(vbNullString, sUnar)
                  i = i + 1
              Loop
              i = 0
              Do Until sAPI = "0" Or i >= 50 'loop until the screen is away
                  Sleep 500
                  sAPI = FindWindow(vbNullString, sUnar)
                  i = i + 1
              Loop
              'Check whether a folder is made by unar. If not; make one and copy the jpg to it
              If FSO.FolderExists(sNewPath) = False Then
                  MkDir sNewPath
                  Do Until FSO.FolderExists(sNewPath) = True
                      Sleep 100
                  Loop
                  Dim SourceFile, DestinationFile As String
                  SourceFile = sPath & oFile.Name 'Define source file name.
                  DestinationFile = sNewPath & oFile.Name 'Define target file name.
                  'Copy the jpg to the newly created folder
                  Set oFolder = FSO.GetFolder(sPath)
                  If Mid(oFile.Name, 1, 4) = "Page" Then
                      FileCopy SourceFile, DestinationFile
                      Sleep 500
                      Kill sPath & oFile.Name
                      Sleep 500
                      Exit For
                  End If
              End If
              Call Shell(Chr(34) & sTesseract & Chr(34) & " " & Chr(34) & sNewPath & oFile.Name & Chr(34) & " " & Chr(34) & sTxt & Chr(34), vbNormalFocus)
              sAPI = FindWindow(vbNullString, sTesseract)
              i = 0
              Do Until sAPI <> "0" Or i >= 50 'Catch the screen
                  Sleep 50
                  sAPI = FindWindow(vbNullString, sTesseract)
                  i = i + 1
              Loop
              i = 0
              Do Until sAPI = "0" Or i >= 50 'Loop until the screen is away
                  Sleep 500
                  sAPI = FindWindow(vbNullString, sTesseract)
                  i = i + 1
              Loop
          Next oFile
          '==================Put all the txt files into 1 & rename=======================
          Set oFolder = FSO.GetFolder(sNewPath)
          sTxt = sNewPath & sPDFname & ".txt"
          Set oTxt = FSO.CreateTextFile(sTxt, True) 'Create new txt file
          Do Until FSO.FileExists(sTxt) = True
              Sleep 100
          Loop
          For Each oTxtGet In oFolder.Files 'Loop through other files and copy text
              If CStr(oTxtGet) <> sTxt Then
                  Set oTxtGet = FSO.OpenTextFile(oTxtGet, ForReading)
                  Sleep 1500
                  On Error Resume Next
                  oTxt.WriteLine (oTxtGet.ReadAll)
                  On Error GoTo 0
                  oTxt.WriteLine (" _-_-_-_-_-_-_-_-_- " & "Page " & iPageCounter & " _-_-_-_-_-_-_-_-_- ")
                  oTxtGet.Close
                  iPageCounter = iPageCounter + 1
              End If
          Next oTxtGet
          oTxt.Close
          'Copy txt file to an output folder
          On Error Resume Next
          MkDir sPath & "Output\"
          On Error GoTo 0
          FileCopy sTxt, sPath & "Output\" & sPDFname & ".txt"
          On Error Resume Next
          Kill sNewPath & "*" 'delete all files in folder
          Do Until Err.Number = 0
              On Error GoTo 0
              On Error Resume Next
              Kill sNewPath & "*" 'delete all files in folder
              Sleep 200
          Loop
          On Error GoTo 0
      End Function

    • @kalkytron6385
      @kalkytron6385 7 years ago

      Hi Andre, I don't recommend walking through the code with F8 because the code is trying to find windows and handle them. You might interrupt this by always going back to the VBE.
      Also, the code you show starts with running Unar. So that seems OK.
      I do have a version in which I first try to read the PDF with a different console app (works in the case of e.g. Word files saved as PDF). If this fails, then I go for Unar and Tesseract. In any case, I don't think that I have this in the uploaded version on Dropbox.
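
The two-stage approach described above (direct text extraction first, OCR only if that fails) needs some test for "fails". The thread does not say what test the author uses, so the sketch below assumes one plausible definition: the text file is missing, empty, or contains only whitespace. `NeedsOcr` is a hypothetical name.

```vba
Option Explicit

' Hypothetical check: True when the direct PDF-to-TXT step produced nothing usable,
' i.e. the file is missing, empty, or whitespace-only.
Public Function NeedsOcr(ByVal sTxtPath As String) As Boolean
    Dim oFso As Object
    Dim oStream As Object
    Dim sContent As String

    Set oFso = CreateObject("Scripting.FileSystemObject")

    NeedsOcr = True
    If Not oFso.FileExists(sTxtPath) Then Exit Function
    ' ReadAll raises an error on a zero-byte file, so test the size first
    If oFso.GetFile(sTxtPath).Size = 0 Then Exit Function

    Set oStream = oFso.OpenTextFile(sTxtPath, 1)   ' 1 = ForReading
    sContent = oStream.ReadAll
    oStream.Close

    ' Strip line breaks, tabs and form feeds before checking for real text
    sContent = Replace(sContent, vbCr, "")
    sContent = Replace(sContent, vbLf, "")
    sContent = Replace(sContent, vbTab, "")
    sContent = Replace(sContent, Chr$(12), "")
    NeedsOcr = (Len(Trim$(sContent)) = 0)
End Function
```

In a flow like the one described, the Unar (or Magick) and Tesseract steps would run only when `NeedsOcr` returns True for the text file produced by the first stage.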

  • @D3_Business_Analytics
    @D3_Business_Analytics 6 years ago

    The link is not working dear