Converting large collection (newbie)


11 replies to this topic

#1 Carl Cerecke

Carl Cerecke

    New to Bible Support

  • Veterans
  • 6 posts
  • LocationAuckland, New Zealand

Posted 26 August 2012 - 04:42 AM

Hi,

I'm a newbie to bible software, but I have 20+ years experience writing software. This is a bit long, sorry, but I do have a question by the time you get to the end!

What I want to achieve:

Firstly, convert the books into a more suitable format: one which enables the essential non-linearity of much of the information to be expressed naturally. For example, a side-bar on "judgement" might be useful to display in a number of books, or even a few times in a single book, but the side-bar information itself should only need to be written once. Any update should edit the canonical copy of that information, and everywhere the side-bar appears can simply 'point' to the one source. This sort of non-linearity is probably best represented with a website, as the www is inherently non-linear.

Once the books have been dissected into their component parts, a "book" can then be specified by a sequential listing of the individual components of which that book is composed.

A tool can then be written which takes a "book" specification, and generates the output format that is to be desired, whether that is pdf, MS-word, e-sword module, or sword-project module, or something else. I'm yet to find really good low-level information on how these modules are formatted, which is essential if I'm to write such a tool.

The advantage of this system is that any change to the source material can be propagated to all the different works that include that material by simply pushing a virtual button.
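To make the idea concrete, here is a minimal sketch of a component store and a "book" specification. Everything in it (the component ids, the `COMPONENTS` dictionary, the `assemble` function) is made up for illustration; a real system would presumably keep components in files or a database rather than in a dictionary.

```python
# Hypothetical component store: each piece of content is written exactly once.
COMPONENTS = {
    "intro": "Introduction text.",
    "sidebar-judgement": "Sidebar: notes on judgement.",
    "chapter-1": "Chapter one text.",
    "chapter-2": "Chapter two text.",
}

# A "book" is an ordered list of component ids; the sidebar is reused twice.
BOOK_SPEC = [
    "intro",
    "chapter-1",
    "sidebar-judgement",
    "chapter-2",
    "sidebar-judgement",
]


def assemble(spec, components):
    """Return the book's text by looking up each component id in order."""
    missing = [cid for cid in spec if cid not in components]
    if missing:
        raise KeyError(f"unknown component ids: {missing}")
    return "\n\n".join(components[cid] for cid in spec)
```

Editing the single `sidebar-judgement` entry and calling `assemble` again changes both places where the sidebar appears, which is the "push a button" behaviour described above.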

Question:
1. Is there a set of tools for doing this? (I'm guessing "no" here. At least, I haven't found anything obvious)
2. If I'm going to do this, are there any tools/advice/information that can help me? And what pitfalls are out there?

Thanks,
Carl.

#2 david psalms

david psalms

    Resource Builder

  • Moderators
  • 984 posts
  • LocationAndhra Pradesh, India

Posted 26 August 2012 - 05:45 AM

Hi,

Many of the members here are ready to help you. You can search the e-Sword downloads on this site to find out what is already available. If you let us know what is not available here, many of us will pitch in and help you convert the modules. Thanks for posting the information.

david

The first goal in life is to make ourselves acceptable to the LORD


#3 Carl Cerecke

Carl Cerecke

    New to Bible Support

  • Veterans
  • 6 posts
  • LocationAuckland, New Zealand

Posted 26 August 2012 - 06:44 PM

Many of the members here are ready to help you. You can search the e-Sword downloads on this site to find out what is already available. If you let us know what is not available here, many of us will pitch in and help you convert the modules. Thanks for posting the information.


Actually, this is the sort of thing I'm trying to avoid. I don't want to take a static book and then have someone make a module from it. I want to automate the creation of modules for e-sword and other bible software. That way, when there are any changes in the source material, the modules can be automatically recreated. I don't see any reason why, given suitably organised source material, an automated conversion would not work. I guess what I want to know is: is there a suitable super-format for theological literature from which automatic module creation can be performed for whatever bible software is required?

To me, this is the "right way" to do things for material which is still in a state of flux (rather than, say, a commentary from the 19th Century, which is, for all intents and purposes, a static document). If it's not the best way to do things, please, by all means, stop me from doing something that could be a large waste of time....

Cheers,
Carl.

#4 Josh Bond

Josh Bond

    Administrator

  • Administrators
  • 2,890 posts
  • LocationGallatin, TN

Posted 26 August 2012 - 06:55 PM

Actually, this is the sort of thing I'm trying to avoid. I don't want to take a static book and then have someone make a module from it. I want to automate the creation of modules for e-sword and other bible software. That way, when there are any changes in the source material, the modules can be automatically recreated. I don't see any reason why, given suitably organised source material, an automated conversion would not work. I guess what I want to know is: is there a suitable super-format for theological literature from which automatic module creation can be performed for whatever bible software is required?

To me, this is the "right way" to do things for material which is still in a state of flux (rather than, say, a commentary from the 19th Century, which is, for all intents and purposes, a static document). If it's not the best way to do things, please, by all means, stop me from doing something that could be a large waste of time....

Cheers,
Carl.


There's no way to dynamically update content. I could come close with a set of regular expressions to transform documents into a format readable by ToolTip NT, the software used to create e-Sword modules. But it would not be perfect and would require manual intervention on occasion.

#5 Carl Cerecke

Carl Cerecke

    New to Bible Support

  • Veterans
  • 6 posts
  • LocationAuckland, New Zealand

Posted 26 August 2012 - 07:33 PM

There's no way to dynamically update content. I could come close with a set of regular expressions to transform documents into a format readable by ToolTip NT, the software used to create e-Sword modules. But it would not be perfect and would require manual intervention on occasion.


I was thinking of directly generating the appropriate sqlite DB from the source format without going through another program. I haven't had a look yet at the DB files, so not sure how feasible this is.
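As a rough sketch of what "directly generating the sqlite DB" could look like with Python's standard `sqlite3` module: the table and column names below (`entries`, `title`, `body`) are placeholders I made up, NOT the real e-Sword schema, which would have to be worked out by inspecting existing modules first.

```python
import sqlite3


def write_module(conn, entries):
    """Rebuild a module table from scratch from (title, body) pairs.

    NOTE: "entries", "title" and "body" are hypothetical names used only
    for illustration; they are not the actual e-Sword module layout.
    """
    with conn:  # commits on success, rolls back on error
        # Dropping first means every run regenerates the module from source.
        conn.execute("DROP TABLE IF EXISTS entries")
        conn.execute(
            "CREATE TABLE entries ("
            "id INTEGER PRIMARY KEY, title TEXT, body TEXT)"
        )
        conn.executemany(
            "INSERT INTO entries (title, body) VALUES (?, ?)", entries
        )
```

Usage would be something like `write_module(sqlite3.connect("output.db"), pairs)`, where `output.db` is again just an example filename. Because the table is dropped and recreated each time, re-running the tool after a source change gives a fresh module.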

Don't want to reinvent the wheel though.

Cheers,
Carl.

Edited by Carl Cerecke, 26 August 2012 - 08:01 PM.


#6 david psalms

david psalms

    Resource Builder

  • Moderators
  • 984 posts
  • LocationAndhra Pradesh, India

Posted 27 August 2012 - 06:28 AM

I don't see any reason why, given suitably organised source material, an automated conversion would not work.


So this means the source material has to be in some particular format for this automation to work? I doubt that is possible. We are not in control of the source materials. They are gathered mostly from OCR text of works written long ago; some sources were in PDF or plain text format. As they come from various authors and sources, we cannot control their format.

To learn about e-Sword databases, have a look at these two posts:
http://www.biblesupport.com/topic/2240-computer-science-e-sword-and-databases-part-1/
http://www.biblesupport.com/topic/2270-computer-science-e-sword-and-databases-part-2/

I think your idea is still at a 10,000-foot view. Can you give some low-level detail?
  • What is your technical plan? What would the source format be? XML/text/RTF/...
  • Have you already worked in this direction, i.e. on similar projects? Has any POC been done?
  • What technologies would you like to use? Based on this we can see whether any of our members already know them and can pitch in to help.
  • Are you willing to make the tool freely available after development?

The first goal in life is to make ourselves acceptable to the LORD


#7 jonathon

jonathon

    e-Sword Fanatic

  • Contributors
  • 753 posts

Posted 27 August 2012 - 09:53 AM

Firstly, convert the books into a more suitable format.


Do you want to preserve presentation markup, or semantic markup?

Maybe the first question should be "Is there any semantic markup that needs to be preserved?",  followed by "Is there any presentation markup that needs to be preserved?"

You don't say what file format the data currently is in.  That makes a major difference in how easy it will be to convert the content to USFM, OSIS, Z-XML, ThML, or other markup language.

One which enables the essential non-linearity of much of the information to be expressed naturally


TeX.

A tool can then be written which takes a "book" specification, and generates the output format that is to be desired, whether that is pdf, MS-word, e-sword module, or sword-project module, or something else.


Keep things simple.
Write one tool for each target file format.
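A minimal sketch of that arrangement: one shared intermediate form, and one small writer function per target format. The `(kind, text)` pair structure and the names `to_html`, `to_text` and `WRITERS` are assumptions for illustration only; real targets (e-Sword, SWORD, PDF) would each need a far more involved writer.

```python
import html

# Hypothetical intermediate form: a list of (kind, text) pairs.
DOC = [
    ("heading", "Judgement"),
    ("para", "Text with <angle> brackets & more."),
]


def to_html(doc):
    """Writer for HTML: map each kind to a tag and escape the text."""
    tags = {"heading": "h2", "para": "p"}
    return "\n".join(
        f"<{tags[kind]}>{html.escape(text)}</{tags[kind]}>"
        for kind, text in doc
    )


def to_text(doc):
    """Writer for plain text: headings are upper-cased, paragraphs kept."""
    blocks = [text.upper() if kind == "heading" else text for kind, text in doc]
    return "\n\n".join(blocks)


# One entry per target format; adding a format means adding one function.
WRITERS = {"html": to_html, "text": to_text}


def convert(doc, target):
    """Dispatch to the writer registered for the requested target format."""
    return WRITERS[target](doc)
```

Each writer only knows about its own output format, so a bug or quirk in one target does not touch the others.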

If semantic content is irrelevant, then HTML 5.0 with CSS 3.0 is the simplest file format to preserve content, and it also enables easy transformations to document file formats.

I'm yet to find really good low-level information on how these modules are formatted, which is essential if I'm to write such a tool.


For Biblical software file formats, the only reliable way of finding out how the modules are formatted, is to analyze half a dozen or more resources of each module type.
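For modules that turn out to be ordinary SQLite files (as mentioned earlier in the thread for e-Sword), that analysis can be partly scripted. The sketch below uses only Python's standard `sqlite3` module; it assumes the file is a plain, unencrypted SQLite database, and the function name `describe_schema` is my own.

```python
import sqlite3


def describe_schema(conn):
    """Return {table_name: [column names]} for every table in the database."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    schema = {}
    for table in tables:
        # Quote the identifier so unusual table names do not break the PRAGMA.
        quoted = '"' + table.replace('"', '""') + '"'
        # PRAGMA table_info returns one row per column; index 1 is its name.
        columns = conn.execute(f"PRAGMA table_info({quoted})").fetchall()
        schema[table] = [column[1] for column in columns]
    return schema
```

Running this over several modules of the same type and comparing the results should show which tables and columns are common to all of them.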

Formal specifications for ISO/IEC 29500:2008 can be obtained from http://www.iso.org/i...csnumber=51463.

Formal specifications for ISO 32000-1:2008 can be obtained from
http://www.iso.org/i...?csnumber=51502

Formal specifications for ISO 26300 can be obtained from
https://lists.oasis-...1/msg00001.html



1. Is there a set of tools for doing this?


Only if you want to describe Perl or Python as your pre-existing set of tools.

2. If I'm going to do this, are there any tools/advice/information that can help me? And what pitfalls are out there?


Representatives from Olive Tree, Libronix, OakTree Software, and Laridian have told me on several different occasions that their tool chain to create resources has to be fine-tuned for each specific resource. Sometimes the changes are minor. Sometimes the changes are major. Either way, automatic conversion results in errors in the target resource.

jonathon

#8 Carl Cerecke

Carl Cerecke

    New to Bible Support

  • Veterans
  • 6 posts
  • LocationAuckland, New Zealand

Posted 28 August 2012 - 01:13 AM

So this means the source material has to be in some particular format for this automation to work? I doubt that is possible. We are not in control of the source materials. They are gathered mostly from OCR text of works written long ago; some sources were in PDF or plain text format. As they come from various authors and sources, we cannot control their format.


I know. But I have access to a collection of books, and they are all in MS Word (That's what the authors knew, so that's what they used.)

I think your idea is still at a 10,000-foot view. Can you give some low-level detail?

  • What is your technical plan? What would the source format be? XML/text/RTF/...
  • Have you already worked in this direction, i.e. on similar projects? Has any POC been done?
  • What technologies would you like to use? Based on this we can see whether any of our members already know them and can pitch in to help.
  • Are you willing to make the tool freely available after development?


Yes, it is a high-level view; I have to start somewhere. But I am not ignorant of the low-level details.
  • Technical plan? Convert the collection to something. I haven't yet decided. It needs to be easily editable online by the authors - some sort of wiki-like thing. And easily processed by computer program. Bit vague about this yet. The current documents have some repetition which I would also like to eliminate. Follow the DRY principle (Don't Repeat Yourself): "Every piece of knowledge must have a single, unambiguous, authoritative representation within a system."
  • Have I worked in this direction already? Not theological markup specifically, but information manipulation in other contexts. My PhD in Computer Science was in the area of parsing computer languages - markup languages are pretty easy in comparison.
  • Technologies? Whatever is handy for the job. Probably python to stick it all together.
  • Make the tool freely available after development? I'll go one step further - I'll make it freely available *before* development is finished. It would be free software (The FSF definition; beer and speech).
Cheers,
Carl.

#9 Carl Cerecke

Carl Cerecke

    New to Bible Support

  • Veterans
  • 6 posts
  • LocationAuckland, New Zealand

Posted 28 August 2012 - 01:38 AM

Do you want to preserve presentation markup, or semantic markup?


Semantic. Presentation without semantics would be a waste of time for my project. See previous comment in thread for some ideas.

You don't say what file format the data currently is in. That makes a major difference in how easy it will be to convert the content to USFM, OSIS, Z-XML, ThML, or other markup language.


MS Word files.

LaTeX could be an output type, on the way to generating nice pdf.

Keep things simple.
Write one tool for each target file format.


Yes, I would.
The critical hinge is to get the source information in a form that is both easily editable by the non-techy authors (some sort of wiki maybe), yet rich enough in semantic information that it is possible to write tools for automatically generating modules in different formats.


For Biblical software file formats, the only reliable way of finding out how the modules are formatted, is to analyze half a dozen or more resources of each module type.

Formal specifications for ISO/IEC 29500:2008 can be obtained from http://www.iso.org/i...csnumber=51463.

Formal specifications for ISO 32000-1:2008 can be obtained from
http://www.iso.org/i...?csnumber=51502

Formal specifications for ISO 26300 can be obtained from
https://lists.oasis-...1/msg00001.html

Thanks for the links.

1. Is there a set of tools for doing this? (I'm guessing "no" here. At least, I haven't found anything obvious)


Only if you want to describe Perl or Python as your pre-existing set of tools.

I have over 10 years of python experience.

Representatives from Olive Tree, Libronix, OakTree Software, and Laridian have told me on several different occasions that their tool chain to create resources has to be fine-tuned for each specific resource. Sometimes the changes are minor. Sometimes the changes are major. Either way, automatic conversion results in errors in the target resource.


Thanks. I'm hoping that by putting the effort into the first step - converting the MS Word files to a semantically marked up format - that my 'tool chain' will only have to deal with the one input resource type. I would only target non-proprietary output formats.
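For that first step, one possible starting point is to pull paragraph text and style ids straight out of the Word files. This is only a sketch under several assumptions: the files are .docx (Office Open XML, the ISO/IEC 29500 format linked above) rather than legacy .doc; the authors used paragraph styles consistently enough for the style id to act as a semantic hint; and the function name `paragraphs_with_styles` is my own. It uses just the standard library.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml inside a .docx.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def paragraphs_with_styles(docx):
    """Yield (style_id, text) for each paragraph in a .docx file.

    `docx` may be a path or a binary file-like object. style_id is None
    when the paragraph carries no explicit paragraph style.
    """
    with zipfile.ZipFile(docx) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    for para in root.iter(W + "p"):
        style = para.find(f"{W}pPr/{W}pStyle")
        style_id = style.get(W + "val") if style is not None else None
        # A paragraph's text is split over runs; join all <w:t> pieces.
        text = "".join(node.text or "" for node in para.iter(W + "t"))
        yield style_id, text
```

This is deliberately simplistic: it ignores character-level formatting, footnotes and tables, and paragraphs nested inside text boxes would show up more than once. It should be enough to see how much semantic information the existing styles actually carry.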

Thanks for your comments Jonathon.

Cheers,
Carl.

#10 jonathon

jonathon

    e-Sword Fanatic

  • Contributors
  • 753 posts

Posted 28 August 2012 - 07:58 AM

Follow the DRY principle (Don't Repeat Yourself): "Every piece of knowledge must have a single, unambiguous, authoritative representation within a system."


OSIS comes close, but I shudder at using it for content other than Bibles and commentaries.
  • Technologies? Whatever is handy for the job. Probably python to stick it all together

Python has the additional virtue of being cross-platform, which means that non-Windows users might be able to put together a third-party tool chain to create resources.

jonathon



