Talking about scanning


10 replies to this topic

#1 emilzalez

emilzalez

    e-Sword Addict

  • Veterans
  • 64 posts

Posted 13 July 2014 - 07:39 PM

I've downloaded some material from Archive.org, via Google. I'd like to convert it into modules, but I've read that it is better to scan the material before making modules from it. My questions: why is it necessary to scan something that is already a PDF, and how do I do it? Do I have to print it and scan it again?

Blessings.



#2 Josh Bond

Josh Bond

    Administrator

  • Administrators
  • 2,890 posts
  • Location: Gallatin, TN

Posted 13 July 2014 - 09:13 PM

When someone scans a book with a scanner, this results in a PDF. Each scanned page of the PDF is a digital picture.

 

To the computer, each scanned page is a picture, similar to a picture you would take with your digital camera or phone. The computer doesn't "know" that the digital photos in the PDF are really snapshots of text.

 

The OCR process recognizes the text of each page. This results in digitized text. Digitized text can then be used by e-Sword. Text that is used by e-Sword can be tooltipped, the text can be formatted for the page with different and seamless line widths, the text can be searched, etc.
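For anyone wondering whether a particular PDF is only pictures, here is a rough sketch of a check in Python. It is a crude heuristic and not a real PDF parser: it assumes that an image-only PDF contains image objects but no font resources in its raw bytes, which can give wrong answers for PDFs that store their objects in compressed form. The function name `looks_image_only` is made up for this example.

```python
import re


def looks_image_only(pdf_bytes: bytes) -> bool:
    """Crude guess: True if the raw bytes mention page images but no fonts.

    Assumption: a PDF with real (selectable) text references at least one
    /Font resource, while a pure scan only holds /Subtype /Image objects.
    PDFs that compress their object dictionaries can fool this check.
    """
    has_image = re.search(rb"/Subtype\s*/Image", pdf_bytes) is not None
    has_font = b"/Font" in pdf_bytes
    return has_image and not has_font
```

To try it, read the file in binary mode (for example `open("book.pdf", "rb").read()`, where `book.pdf` is a placeholder name) and pass the bytes in. A result of `True` suggests the PDF probably needs OCR before e-Sword can use its text.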

 

If you Google OCR, you'll find various OCR software packages, like ABBYY FineReader. OCRing text can be very time-consuming, and it is the reason certain modules have never been made...

 

Josh



#3 emilzalez


Posted 13 July 2014 - 10:17 PM

Josh, thank you for helping me understand. I have an idea of what happens in the process, but if I decide to give my time to it, what should I do with these PDF books from Archive? Because I think any scanner comes with OCR software, right?



#4 bjohns

bjohns

    e-Sword Supporter

  • Members (T)
  • 30 posts

Posted 13 July 2014 - 10:23 PM

I would recommend that you try a freeware program called FreeOCR. I have version 5.02 on my system, and it works great with a scanner. You can also load a large PDF file into the program and it will "scan" each page in the PDF via OCR. Naturally, it isn't perfect, but it sure is a helpful program. I use it on a regular basis, and it might help you do what you want to do...

 

bjohns

 

coc.myfreewebhosting.net



#5 Josh Bond


Posted 13 July 2014 - 10:23 PM

Yes, most scanners probably have some type of OCR software.

 

Let me give you 2 pointers:

 

1. Not all scans are equal. You can scan a book at different resolutions. Higher resolution = higher quality, and higher quality means the OCR process recognizes the text with fewer errors. Archive.org scans are lower quality because they are trying to save disk space and bandwidth.

 

2. Not all OCR software is equal, especially when trying to recognize older books. A couple of titles are great. A few are mediocre. The rest are junk.

 

Scanned text that you just printed from Microsoft Word? Sure, many titles can handle this. But older text, with all the inconsistencies of the typesetting of that era? I would only try ABBYY FineReader or OmniPage Pro if the project were at all large, especially on Archive.org scans.
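To put the first pointer in numbers, here is a small Python sketch of how resolution affects the amount of detail in a scanned page. The 6 x 9 inch page size and the 150 and 300 DPI settings are assumptions chosen for illustration, not figures from Archive.org.

```python
def page_pixels(width_in: float, height_in: float, dpi: int) -> int:
    """Number of pixels in a scan of one page at the given resolution."""
    return round(width_in * dpi) * round(height_in * dpi)


# Hypothetical 6 x 9 inch book page scanned at two resolutions.
low = page_pixels(6, 9, 150)   # 900 x 1350 pixels
high = page_pixels(6, 9, 300)  # 1800 x 2700 pixels

print(low, high, high / low)
```

Doubling the resolution gives four times as many pixels for the OCR software to work with, which is also why higher-quality scans take up so much more disk space.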



#6 bjohns


Posted 13 July 2014 - 10:33 PM

You may find the program FreeOCR at the following website:

 

http://www.freeocr.net/

 

It will load digital (.PDF) documents from Google and scan each page and insert the text into the built-in word processor and I believe it will paste the text directly into Microsoft Word.

 

bjohns

 

coc.myfreewebhosting.net



#7 BaptizedBeliever

BaptizedBeliever

    Christian

  • Members (T)
  • 924 posts

Posted 13 July 2014 - 10:46 PM

Having worked on several scan/pdf projects, I can say that if it is possible, it's best to work with your own scan of something.  That way, you control the quality of the pdf.  For example, Google and Archive have some books that are 600+ pages, and only 4.5 megabytes.  That's a really low-quality scan, and when you run an OCR program on it, you will find tons of errors in the conversion.  You'll likely find the word "arid" several times when it should be "and."  "n" and "u" are often mistaken for each other.  And if the book has footnotes, you can just plan ahead to re-type every one of them because the OCR program won't have a clue what those are supposed to be.
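As a quick sanity check on a downloaded file, the example above (600 pages in 4.5 megabytes) can be turned into an average size per page. This is only a rough sketch in Python: the function name is made up for this example, it treats a megabyte as 1,000,000 bytes, and file size is just a hint about scan quality since compression settings vary.

```python
def average_bytes_per_page(file_size_mb: float, pages: int) -> float:
    """Average number of bytes each page gets, taking 1 MB as 1,000,000 bytes."""
    if pages <= 0:
        raise ValueError("pages must be positive")
    return file_size_mb * 1_000_000 / pages


# The example from this post: a 600-page book in a 4.5 MB PDF.
print(average_bytes_per_page(4.5, 600))  # 7500.0
```

About 7.5 kilobytes per page leaves very little room for fine detail such as small footnote type.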

 

But if you can get a high-quality scan (which is easy to do if you're the one scanning), then most of those problems disappear, and you will have much less work to do in the proofreading process (which, honestly, is the part that takes the most time).
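Since proofreading is the slow part, a small helper can at least point out likely misreads such as "arid" for "and" or a swapped "n" and "u". The following is a minimal Python sketch, not a spell checker: the `CONFUSIONS` list, the function names, and the tiny word list are all assumptions for illustration, and it only suggests candidates for a human to confirm.

```python
import re

# Confusions mentioned in this thread, as (misread, intended) pairs; extend as needed.
CONFUSIONS = [("ri", "n"), ("n", "u"), ("u", "n")]


def candidate_fixes(word: str, known_words: set[str]) -> list[str]:
    """Return known words reachable from `word` by undoing one confusion."""
    fixes = []
    lower = word.lower()
    for wrong, right in CONFUSIONS:
        start = lower.find(wrong)
        while start != -1:
            candidate = lower[:start] + right + lower[start + len(wrong):]
            if candidate in known_words and candidate not in fixes:
                fixes.append(candidate)
            start = lower.find(wrong, start + 1)
    return fixes


def flag_suspects(text, known_words, always_check=frozenset({"arid"})):
    """List (position, word, suggestions) for words that look like OCR misreads.

    Words in `always_check` are examined even if they are real words,
    because "arid" is valid English but is usually a misread "and" here.
    """
    suspects = []
    for match in re.finditer(r"[A-Za-z]+", text):
        word = match.group()
        if word.lower() in known_words and word.lower() not in always_check:
            continue
        fixes = candidate_fixes(word, known_words)
        if fixes:
            suspects.append((match.start(), word, fixes))
    return suspects


known = {"and", "the", "sun", "found", "he"}
print(flag_suspects("He fouud the sun arid the moon", known))
```

In practice `known_words` would be loaded from a full word list, and every suggestion still needs a human to check it against the page image.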

 

Have I used archive and Google books before for OCR projects?  Yep.  Some books they have are ones that I couldn't get a hold of in print without a substantial financial investment.  Others are just plain unavailable anywhere else.  But many of them have had to have entire pages re-typed by hand, because the scan was so bad.

 

I've never tried the OCR program that Bennie Johns mentions above.  I'm quite thrilled with ABBYY FineReader.  It's better than any other program I've used.  But it isn't cheap.

 

-Brad



#8 bjohns


Posted 13 July 2014 - 10:47 PM

About FreeOCR

FreeOCR is free Optical Character Recognition software for Windows. It supports scanning from most TWAIN scanners and can also open most scanned PDFs and multi-page TIFF images, as well as popular image file formats. FreeOCR outputs plain text and can export directly to Microsoft Word format.

 

 

FreeOCR uses the latest Tesseract (v3.01) OCR engine. It includes a Windows installer, is very simple to use, and supports opening multi-page TIFF documents, Adobe PDF and fax documents, as well as most image types, including compressed TIFFs, which the Tesseract engine on its own cannot read. It can now scan using TWAIN and WIA scanning drivers.

FreeOCR V4 includes Tesseract V3 which increases accuracy and has page layout analysis so more accurate results can be achieved without using the zone selection tool.

 

Scanning Software

As well as OCR, FreeOCR can scan and save images as JPGs, and we are currently working on "Scan to PDF" capability with the option to save as a searchable PDF.

 

OCR Engine

The included Tesseract OCR PDF engine is an open source product released by Google. It was developed at Hewlett Packard Laboratories between 1985 and 1995. In 1995 it was one of the top 3 performers at the OCR accuracy contest organized by University of Nevada in Las Vegas. The Tesseract engine source code is now maintained by Google and the project can be found here: http://code.google.c.../tesseract-ocr/
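Because Tesseract is also available as a standalone command-line program, it can be driven from a script without FreeOCR. The Python sketch below assumes the `tesseract` executable is installed and on the PATH, and that the input is a page image such as a PNG or TIFF; the function names and file names are made up for this example.

```python
import shutil
import subprocess


def build_tesseract_command(image_path: str, output_base: str,
                            language: str = "eng") -> list[str]:
    """Build the command line: tesseract <image> <output base> -l <language>."""
    return ["tesseract", image_path, output_base, "-l", language]


def ocr_image(image_path: str, output_base: str, language: str = "eng") -> str:
    """Run Tesseract on one page image and return the path of the text file."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(build_tesseract_command(image_path, output_base, language),
                   check=True)
    # Tesseract writes its plain-text result to <output base>.txt
    return output_base + ".txt"
```

For example, `ocr_image("page_001.png", "page_001")` would write the recognized text to `page_001.txt`. Each page of a PDF would first need to be exported as an image, which is the step FreeOCR handles for you.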

License

FreeOCR is a freeware OCR & scanning software and you can do what you like with it including commercial use. The included Tesseract OCR engine is distributed under the Apache V2.0 license.



#9 bjohns


Posted 13 July 2014 - 10:52 PM

Bradley,

 

I am not Bennie Johns, but believe it or not, my dad, who passed away in January, was related to Bennie. His name was Clennie Johns. Some kind of a play on words. My dad, like Bennie, was a preacher in the Church of Christ for about 50 years...

 

bjohns (Mark Johns)

 

coc.myfreewebhosting.net


Edited by bjohns, 13 July 2014 - 10:58 PM.


#10 BaptizedBeliever


Posted 13 July 2014 - 11:46 PM

Sorry, I assumed that the BJohns and the talk about scanners could have only been Bennie.  Nice to meet you!




