
Best way to OCR a PDF


5 replies to this topic

#1 gad4

    e-Sword Fanatic

  • Veterans
  • 163 posts

Posted 18 November 2021 - 10:23 AM

Some documents contain Greek or Hebrew characters, so I'm unsure what the best method is when either is included.

Thanks

#2 APsit190

    e-Sword Tools Developer

  • Members (T)
  • 2,861 posts
  • Location: Land of the Long White Cloud (AKA New Zealand)

Posted 18 November 2021 - 03:39 PM

gad4 said: "Some documents contain Greek or Hebrew characters, so I'm unsure what the best method is when either is included."

So am I.

 

I guess that after scanning, you edit out what you don't want and keep the rest. That works fine for me.

 

Blessings,


 


#3 gad4

Posted 19 November 2021 - 03:59 PM

I appreciate the response. I was curious whether there is a known OCR system that keeps the Greek and/or Hebrew rather than editing it out.

Thanks

#4 Tj Higgins

    e-Sword Fanatic

  • Members (T)
  • 1,448 posts

Posted 19 November 2021 - 05:19 PM

gad4 said: "I was curious whether there is a known OCR system that keeps the Greek and/or Hebrew rather than editing it out."

There are a number of OCR software packages available, such as PDF Wondershare, that should do what you want.



#5 JPG

    Jon.

  • Moderators
  • 1,665 posts

Posted 20 November 2021 - 12:53 AM

Do you have a document you can share for testing?

I have ABBYY FineReader 15.

Plain Greek or polytonic Greek can be trained to work reasonably well.

Hebrew is more challenging if it has vowels. I have not got that working so well. Plain Hebrew works reasonably well with a good clear scan and some training and fixing.



#6 silverys

    New to Bible Support

  • Members
  • 7 posts
  • Location: Caribbean Sea

Posted 02 October 2023 - 05:40 PM

On Linux we use a CLI tool called tesseract-ocr, which supports several languages. For example, on a Debian system you will need the packages:

tesseract-ocr-script-hebr,
tesseract-ocr-heb

for old Hebrew, and

tesseract-ocr-grc

for Ancient Greek. Tesseract is usually applied to an image file, but you can run it directly on a PDF file using ocrmypdf, a tool that uses the Tesseract engine and supports its features.
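
As a rough sketch of how the two tools above might be called from a script, the Python below only builds the command lines and runs them if the program is installed. It assumes the language codes heb and grc correspond to the tesseract-ocr-heb and tesseract-ocr-grc packages. The helper names (tesseract_cmd, ocrmypdf_cmd, run_if_installed) and any file names are made up for illustration.

```python
import shutil
import subprocess

# Assumed Tesseract language codes for the Debian packages named above:
# tesseract-ocr-heb -> "heb", tesseract-ocr-grc -> "grc".
LANGS = ["heb", "grc"]


def tesseract_cmd(image, out_base, langs=LANGS):
    """Build the command to OCR one image file.

    Tesseract writes the recognised text to out_base + ".txt".
    Several languages are joined with "+".
    """
    return ["tesseract", image, out_base, "-l", "+".join(langs)]


def ocrmypdf_cmd(pdf_in, pdf_out, langs=LANGS):
    """Build the command to add a searchable text layer to a PDF."""
    return ["ocrmypdf", "-l", "+".join(langs), pdf_in, pdf_out]


def run_if_installed(cmd):
    """Run cmd only if its program is on PATH.

    Returns True if the program ran and exited with status 0.
    """
    if shutil.which(cmd[0]) is None:
        print(f"{cmd[0]} not found; would run: {' '.join(cmd)}")
        return False
    return subprocess.run(cmd).returncode == 0
```

For example, run_if_installed(ocrmypdf_cmd("scan.pdf", "scan_ocr.pdf")) would try to OCR a scanned PDF with both languages enabled, and run_if_installed(tesseract_cmd("page.png", "page")) would do the same for a single image. How well pointed Hebrew or polytonic Greek comes out will still depend on the scan quality.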

 

 

 

 

 

 

 

 

 





