Volunteering to make e-Sword modules


13 replies to this topic

#11 dysert
e-Sword Supporter, Veterans, 42 posts, Indiana, USA

Posted 10 December 2014 - 02:19 AM

There are a few "tricks". Larry is right, the bottom line is you have to OCR the text. And then proof that text against the original. The "trick" is the tools you use. Even with good tools, it's time consuming. With bad tools, it's unmanageable long term.

 

For the text I digitized, I bought the books myself and scanned them to get a higher resolution scan. Archive.org scans are ok for reading, but for digitizing text, you want a higher resolution scan IF the text is small (like with a commentary--not so much a 75 page devotional book with large print). A PDF scan is just a digital picture of each page. The better picture you take (high res scan) the easier the text will be to interpret.

 

The text interpretation is best done, in my experience, with ABBYY FineReader Pro. It's the most efficient, with the fewest errors. I like how you can go page by page in a split-screen mode, with the original PDF page on the left side and the digitized text on the right side. Words that FineReader is uncertain of, or words that don't pass the spell check, are highlighted for your review. You can click a word in the digitized half of the screen, and that word's location is highlighted on the PDF side of the screen.

 

That's the trick: 1) scan resolution and 2) decent OCR software. You can do it without either, it's just a question of time--how much time it's going to take.

But isn't the bottom line that you end up re-typing the whole book into a word processing program? If you're re-typing everything to get an electronic copy, then the quality of your OCR software doesn't seem all that important to me. Or are you saying that the OCR image becomes the e-Sword module? I'm obviously missing some fundamental point.



#12 dysert
e-Sword Supporter, Veterans, 42 posts, Indiana, USA

Posted 10 December 2014 - 06:44 AM

Ok. I'm awake now and I think I know what's going on. I have started on a .topx module but am getting an error when trying to save it. (Is there a topic for module makers and/or to report errors when running the ToolTip Tool?) The error is:

 

Report the following:
INDEX ERROR
TGen[46]     hdr←'prefix' Defines 'ReadTxt*' ⍝∇{*:SystemError ⋄ E1Pop ⋄ →0} 
                          ^
----------------------------------------
⍎⎕ELX                                      
TGen[46] *                                 
>[fmTxCtrl.mModules.mTopics.mSave2;Click]  
<[fmTxCtrl;Wait]                           
Start[70]                                  


#13 Josh Bond
Administrator, 2,891 posts, Gallatin, TN

Posted 10 December 2014 - 11:50 AM



dysert, on 10 December 2014 - 02:19 AM, said:

But isn't the bottom line that you end up re-typing the whole book into a word processing program? If you're re-typing everything to get an electronic copy, then the quality of your OCR software doesn't seem all that important to me. Or are you saying that the OCR image becomes the e-Sword module? I'm obviously missing some fundamental point.

 

No, you don't have to retype the whole text. Usually, more than 99% of the text is accurate, if the original scan is good.
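To put a rough number on that, here is a small Python sketch of the arithmetic. The page count and characters-per-page figures in the usage note are made-up assumptions for illustration, not measurements from any real book, and the function name is hypothetical.

```python
def expected_corrections(pages: int, chars_per_page: int, accuracy: float) -> int:
    """Estimate how many characters would need manual correction after OCR.

    accuracy is the fraction of characters the OCR engine gets right (0.0 to 1.0).
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    total_chars = pages * chars_per_page
    # The characters left over after the accurate ones are the ones to proof and fix.
    return round(total_chars * (1.0 - accuracy))
```

For a hypothetical 500-page book at 3,000 characters per page, `expected_corrections(500, 3000, 0.99)` comes out to 15,000 characters to fix out of 1.5 million. That is a lot of proofing, but far less work than retyping the whole book.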

 

In the image below, the left side of the screen is the original PDF. The right side is the resulting electronic text. Suspect characters are highlighted in green. Potentially misspelled words have a red squiggly line.

 

I have the cursor over the i in the i6:8 reference (which is a mis-read). The 1 in 16:8 in the PDF has a small box around it to show you the original character.

 

(The text is from a New Testament commentary set called An American Commentary on the New Testament, edited by Alvah Hovey in the late 1800s.)

 

Attached file: finereader.jpg (201.86K)
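If you are proofing plain text outside of an OCR program, a misread like i6:8 can also be caught with a short script. The following is only a sketch in Python, not how FineReader works internally. The look-alike table and the function name are my own assumptions, and the pattern will miss some cases and may flag a few false positives.

```python
import re

# Letters that OCR commonly confuses with digits. This table is an
# illustrative assumption for the sketch, not an exhaustive list.
LOOKALIKES = {"i": "1", "l": "1", "I": "1", "o": "0", "O": "0", "S": "5", "B": "8"}

# A token shaped like chapter:verse where either side may contain
# digits or look-alike letters.
REF_PATTERN = re.compile(r"\b([0-9ilIoOSB]{1,3}):([0-9ilIoOSB]{1,3})\b")


def suspect_references(text):
    """Return (found, suggested) pairs for chapter:verse tokens that contain letters."""
    results = []
    for match in REF_PATTERN.finditer(text):
        token = match.group(0)
        # Clean references such as 3:16 contain no letters and are skipped.
        if any(ch.isalpha() for ch in token):
            fixed = "".join(LOOKALIKES.get(ch, ch) for ch in token)
            results.append((token, fixed))
    return results
```

Running `suspect_references("See Mark i6:8 and John 3:16.")` flags only the i6:8 token and suggests 16:8. The suggestion still has to be checked against the scan, since the script is guessing at what the original character was.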



#14 anh Mike
e-Sword Fanatic, Veterans, 194 posts, Chicago, IL

Posted 15 January 2015 - 02:08 PM

You can cut and paste from Google's texts if they have the work. This is a bit time consuming, but generally they have better OCR than Archive.org and keep italics and small caps. Not all of the Archive.org OCR is poor quality; you have to open the text, select all, copy, and paste. A lot has to do with the printing of the original work as well. After downloading the text file from IA, MS Word kept asking for something when opening it, so I gave up on that and went with the select-all, copy, and paste method. Buying the books can be expensive, they take up storage space, and scanning the pages is time consuming unless you have the right software and scanner. Sometimes there are multiple copies (the University of Toronto ones are usually very good), and you can piece together a better copy. You can also ask the IA to redo a scan if it looks like someone was asleep doing it or had a heart attack. Scanning a total of 5,000 dual-column pages of a multi-volume dictionary like Hastings does not excite too many people, so in some cases you may have to cut and paste.

 

Yes, I have several works ready to be made into modules, but sorry, procrastination sets in, and sidetracking as well.

 

At this time, unless Logos beats me to it (theirs is still delayed), I have the best OCR of Hastings' Dictionary of the Bible (vols. 1-5). Volumes 2-4 need proofreading. Volumes 1 and 5 need colons inserted between the reference numbers, and some of the Greek, Josephus, and Philo references in the footers need attention. Not perfect, but better than OCRing the work again. It's the German references that get to me now and then.





