Portreve wrote: If the original material is in good condition and your scanner does its job without problems, then 600 or 300 dpi will not matter for purposes of OCR, because they're both above the minimum resolution for reliable interpretation.
Hi, Portreve. Thank you for this tip. I wasn't considering high-res (e.g. 600 dpi) solely for OCR purposes, but also for the quality of the scan. I didn't want to regret later that my 300 dpi scan of document X is hard to read when it would have been readable (or at least easier to decipher) at a higher resolution.
Initially, as suggested above, I would probably scan everything to TIFF, if that's the only good option you have. Tagged Image File Format, which is what TIFF stands for, is in and of itself an uncompressed format which later on (and now a very, very long time ago) incorporated basic lossless compression, such as Lempel-Ziv-Welch (LZW), and maybe even ZIP-style Deflate (not really sure any more), to bring down the file sizes. This is great because you ideally want no data loss initially as you're getting ready to set up your archive.
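To put rough numbers on the file-size point, here is a small sketch of the arithmetic. It assumes a US Letter page (8.5 x 11 inches) scanned in 24-bit colour and ignores TIFF headers and row padding, so treat the figures as ballpark only. The function name is just made up for illustration.

```python
def uncompressed_scan_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Approximate raw pixel data size in bytes for one page, before compression."""
    width_px = round(width_in * dpi)    # pixels across
    height_px = round(height_in * dpi)  # pixels down
    return width_px * height_px * bits_per_pixel // 8


# Assumed page: US Letter, 8.5 x 11 inches, 24-bit colour.
for dpi in (300, 600):
    size = uncompressed_scan_bytes(8.5, 11, dpi, 24)
    print(f"{dpi} dpi: {size / 1_000_000:.1f} MB")
# 300 dpi: 25.2 MB
# 600 dpi: 101.0 MB
```

Doubling the resolution quadruples the pixel count, so a 600 dpi page is about four times the size of a 300 dpi one before compression. How much lossless compression claws back depends on the page content.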
I've updated my OP to mention that I have access to two other scanners, with different output formats. Taking the three scanners together, I can output to TIFF, PDF, PDF/A, Compact PDF, JPEG, XPS, Compact XPS, OOXML (pptx).
I was online-chatting with an Adobe rep, and I thought they said that when rearranging PDFs (like adding a page, or removing a page, or moving the last page to the first page), there is no loss of quality. Is the agent wrong? Or did I just misunderstand what was said?
What I would suggest then is to take those TIFFs of text (which, frankly, 300 dpi should be adequate for) and get them OCR'd prior to creating your permanent archive.
Can Acrobat Pro OCR TIFF files? Please note that though my laptop and desktop both run Linux Mint, I was planning on using a Windows computer to run Acrobat Pro/DC for the OCR. How do you OCR TIFF files?
On the other hand, I personally would scan actual photos at 600 dpi and then save them as PNGs, and not JPGs, simply on the basis that you don't sacrifice the image quality in PNG that you do with JPG. Moreover, JPG has had patent claims hanging over it in the past, whereas PNG was designed to be libre from the start, and I'll go with the libre option, short of there being some imperative to the contrary, every time.
Sadly, none of the 3 scanners can output to PNG. I wish PNG was an option, but it isn't.
As for creating PDFs, well... I would suggest you use PDF where layout and appearance are significant factors (for example, a form of some kind), straight text files where only the content matters, and something like ODT for those documents in the middle where exact layout is unimportant, but for some reason you wish to have at least a degree of formatting (font, for example, or emphasized text like italics or bold face, etc.).
I wasn't going to do any formatting after scanning (so I don't think I'll use ODT). I'm scanning because:
- I want to digitize my filing cabinet (reduce papers)
- They will potentially be easier to search through (with OCR applied)
- I don't have to be near the filing cabinet to access a document (I can do so wherever I have internet access, as I plan to upload to the cloud -- Google Drive or OneDrive are my two considerations)
The only other thing I would add is that, even though it is a more manual process, you will benefit from a good and well thought-through organizational structure and a good file naming convention, because then you will never be dependent upon any kind of file management or organization software in the future.
Thank you for adding this tip. The Ricoh scanner's default file naming is something like 201703171559.pdf (or .whatever). I haven't changed the names at the outset, but can do so later.
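As a rough sketch of what renaming those files later could look like: this assumes the default name really is a YYYYMMDDHHMM timestamp, as in the example above. The category and description parts are a made-up convention for illustration, not something the scanner provides.

```python
from datetime import datetime
from pathlib import Path


def descriptive_name(scanner_filename, category, description):
    """Turn a timestamp-only scanner filename into a dated, descriptive one."""
    path = Path(scanner_filename)
    # Assumption: the stem is a YYYYMMDDHHMM timestamp, e.g. 201703171559.
    stamp = datetime.strptime(path.stem, "%Y%m%d%H%M")
    # Lowercase the description and join its words with hyphens.
    slug = "-".join(description.lower().split())
    return f"{stamp:%Y-%m-%d}_{category}_{slug}{path.suffix}"


print(descriptive_name("201703171559.pdf", "insurance", "Car policy renewal"))
# 2017-03-17_insurance_car-policy-renewal.pdf
```

Putting the date first in year-month-day order means a plain alphabetical sort in any file manager is also a chronological sort, which fits the point about not depending on special organization software.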
I really didn't think I needed to bother with organization, coz I thought OCR would be the solution. (Kinda like how in gmail, though I do use labels, I rely on keyword searching.) Do you still think OCR is not enough on its own, and that organizing and file-naming are important?