Questions on scanning to PDFs, and Acrobat Pro/DC on Win/Mac

pepperminty
Level 4
Posts: 376
Joined: Thu Jun 23, 2011 10:51 pm

Postby pepperminty » Fri Mar 17, 2017 5:47 pm

CLARIFICATION: What I originally wrote in this OP was about one business-class (i.e. multithousand-dollar) scanner I have access to, a Ricoh machine. This Ricoh can output in the following formats:
    Single page: Tiff, Jpeg, Pdf, Pdf/a
    Multipage: Tiff, Pdf, Pdf/a
The second machine I can access is a Konica-Minolta. It can output to TIFF, JPEG, PDF, Compact PDF, XPS, Compact XPS, OOXML (pptx).

The third machine I can access is a Sharp. It can output to:
    Monochrome: TIFF, PDF, Encrypted PDF, XPS
    Color: Grayscale, Color TIFF, JPG, PDF, Encrypted PDF.
--- end of clarification---


I've taken a number of folders from my real-life filing cabinet and traveled to a three-foot-tall, hundred-pound scanner. Because the scanner doesn't have OCR, I plan to apply OCR technology on a different occasion. The PDFs I create will not be sent to anyone else. The PDFs are just for me.
Before I spend the time scanning, I wanted to ask PDF or Acrobat Pro/DC experts:

1. Let's say I scan 100 pages into one PDF. Later, in Acrobat Pro/DC, I decide to split that PDF into smaller groups (e.g. pages 1-6 become recipes.pdf, pages 7-14 become business.pdf, etc). Will doing so cause deterioration of image quality? Will sharp text now look fuzzy?

2. To be safe, I thought of scanning at the highest resolution of 600 dpi, and then later reducing the dpi to a level that won't sacrifice sharpness but will give a smaller file size. When I decrease DPI in software like Acrobat Pro/DC, will there be any negative effects?

3a. Should I choose PDF/A? What are reasons to choose PDF/A? What are reasons not to?
3b. Can a PDF/A file have OCR applied to it? (In other words, to be made searchable?)

4. Any reason to choose TIFF as file type?

5. In this 100-page stack of various subjects, is it safe to choose "Auto Color Select"? The scanner will determine whether each page is:
    Black&White text;
    B&W Text + Line Art;
    B&W Text + Photo;
    B&W Photo;
    Gray Scale;
    Full color Text + Photo
    Full Color: Glossy Photo

6. In this 100-page stack, is it safe to choose "Auto Detection of Scan/Paper size", or is it better to manually choose the size?
Last edited by pepperminty on Sat Mar 18, 2017 6:02 pm, edited 2 times in total.

TooMuchTime
Level 4
Posts: 404
Joined: Fri Mar 11, 2016 10:30 pm

Re: Questions on scanning to PDFs, and Acrobat Pro/DC on Win/Mac

Postby TooMuchTime » Fri Mar 17, 2017 10:00 pm

1) Not in my experience with Acrobat. Just make sure you extract the pages.

2) If you scan at 600 dpi, you should be able to reduce it to 300 without too much bad happening. Do some testing first.

3) I don't know.

4) You should be able to import a TIFF into a PDF at a later date, but I don't know how well it will handle OCR, if at all.

5) You may want to sort out the color photos and scan them separately. That way you can get the exact look you want by tweaking the settings. If you have text and color photos in the same document, you may want to separate them too.

6) Any scanning software should be able to auto-detect the paper size. Test a few mixed pages to be sure.

BenTrabetere
Level 4
Posts: 236
Joined: Sat Jul 19, 2014 12:04 am
Location: Mississippi, USA

Re: Questions on scanning to PDFs, and Acrobat Pro/DC on Win/Mac

Postby BenTrabetere » Sat Mar 18, 2017 2:58 am

I do not know if I qualify as an Acrobat Pro expert, but I have worked with PDFs and Acrobat Pro for a long time.

1. I would scan as individual TIFFs and convert to PDF afterwards with the ImageMagick Convert tool. If you scan to a multi-page PDF, I would use either PDF-Shuffler, PDF Mod or PDFtk to extract pages into smaller PDFs. You could use Acrobat Pro (or Master PDF Editor), but I find the simpler tools are faster, easier to use and better suited for the task.

Extracting pages should not result in a loss of quality.

2. The ONLY documents I would scan at 600 dpi are ones of historic importance or ones that need a lot of correcting. The files are just too big for normal use.

My default scanning resolution is 300 dpi. If the document has any font that is less than 10pts, I may increase the resolution to 400 dpi. For low quality originals like faxes or documents printed on a dot matrix or thermal printer, I lower the resolution to 150 dpi.

3. I do not have any experience with PDF/A, but I am pretty sure it does not apply to scanned documents. It may be applicable after you run it through OCR software.

4. I prefer to scan to TIFF and later convert to PDF, mainly because TIFFs are easier to clean up with image editing software. OCR software should be able to handle TIFFs without any problem.

5. I have never had much success with Auto Color - too often for me it defaults to Color, when Gray Scale or Black & White would be more appropriate. I suggest doing a test run with a handful of mixed documents to see how well the scanner/software handles the task. Otherwise, I suggest sorting the documents and scanning “like” documents in batches.

6. The scanner should be able to detect the paper size.
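
For anyone who wants to script the page extraction from item 1, here is a minimal Python sketch that builds PDFtk `cat` commands. It assumes `pdftk` is installed and on the PATH; `scan.pdf`, the output names, and the page ranges are hypothetical, borrowed from the example in the original question. As far as I know PDFtk copies the selected pages as they are rather than re-encoding the scanned images, but test on a copy first.

```python
import subprocess


def pdftk_split_commands(source_pdf, groups):
    """Build one `pdftk <in> cat <range> output <out>` command per group.

    `groups` maps an output file name to an inclusive (first, last)
    page range, counted from 1.
    """
    commands = []
    for output_pdf, (first, last) in groups.items():
        if first < 1 or last < first:
            raise ValueError(f"bad page range for {output_pdf}: {first}-{last}")
        commands.append(
            ["pdftk", source_pdf, "cat", f"{first}-{last}", "output", output_pdf]
        )
    return commands


def split_pdf(source_pdf, groups):
    """Run each command in turn; raises if pdftk reports an error."""
    for command in pdftk_split_commands(source_pdf, groups):
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Hypothetical file names and ranges, for illustration only.
    plan = {"recipes.pdf": (1, 6), "business.pdf": (7, 14)}
    for command in pdftk_split_commands("scan.pdf", plan):
        print(" ".join(command))
```

Printing the commands before running them makes it easy to check the page ranges against the paper stack.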

Portreve
Level 5
Posts: 519
Joined: Mon Apr 18, 2011 12:03 am
Location: Florida

Re: Questions on scanning to PDFs, and Acrobat Pro/DC on Win/Mac

Postby Portreve » Sat Mar 18, 2017 11:40 am

If the original material is in good condition and the scanning job is done without problems by the scanner you're using, then 600 or 300 dpi will not matter for purposes of OCR, because they're both above the minimum resolution for reliable interpretation.
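
To put rough numbers on the 300 vs 600 dpi choice, here is a small Python sketch of the raw (uncompressed) size of one scanned page. The US Letter page size (8.5 x 11 in) is my assumption, not something stated in this thread, and real files will be smaller once the scanner compresses them. The point is only that doubling the dpi quadruples the pixel count.

```python
def uncompressed_scan_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Approximate raw image size in bytes for one scanned page.

    Ignores file headers and per-row padding, so treat it as an estimate.
    """
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


if __name__ == "__main__":
    # Assumed page size: US Letter, 8.5 x 11 inches.
    modes = (("1-bit B&W", 1), ("8-bit gray", 8), ("24-bit color", 24))
    for dpi in (300, 600):
        for label, bits in modes:
            size = uncompressed_scan_bytes(8.5, 11, dpi, bits)
            print(f"{dpi} dpi, {label}: {size / 1_000_000:.1f} MB")
```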

Initially, as suggested above, I would probably scan everything to TIFF, if that's the only good option you have. Tagged Image File Format, which is what TIFF stands for, is in and of itself a non-compressed standard which later on (and now a very, very long time ago) incorporated basic non-lossy compression, such as Lempel-Ziv-Welch (LZW), and maybe even ZIP (not really sure any more) to bring down the file sizes. This is great because you ideally want no data loss initially as you're getting ready to set up your archive.

What I would suggest then is to take those TIFFs of text (which, frankly, 300 dpi should be adequate for) and get them OCR'd prior to creating your permanent archive. On the other hand, I personally would scan actual photos at 600 dpi and then save them as PNGs, and not JPGs, simply on the basis that you don't sacrifice the image quality in PNG that you do with JPG. Moreover, JPG is still technically a proprietary format, whereas PNG is libre, and I'll go with the libre option, short of there being some imperative to the contrary, every time.

As for creating PDFs, well... I would suggest you use PDF where layout and appearance are significant factors (for example, a form of some kind, or something where the appearance of the layout is significant), straight text files where only the content matters, and something like ODT for those documents in the middle where exact layout is unimportant, but for some reason you wish to have at least a degree of formatting (font, for example, or emphasized text like italics or bold face, etc.).

The only other thing I would add is that, even though it is a more manual process, you will benefit from a good and well thought-through organizational structure and a good file naming convention, because then you will never be dependent upon any kind of file management or organization software in the future.
Everything is in hand. With this tapestry... and with patience, there is nothing one cannot achieve.

No hamsters were harmed in the authoring of this post.

Re: Questions on scanning to PDFs, and Acrobat Pro/DC on Win/Mac

Postby BenTrabetere » Sat Mar 18, 2017 1:07 pm

BenTrabetere wrote:I would scan as individual TIFFs and convert to PDF afterwards with the ImageMagick Convert tool.


I intended to mention libtiff-tools - tiff2pdf can be used to convert a single TIF to a PDF, and tiffcp can be used to combine one or more files. Output tends to be smaller than with the ImageMagick Convert tool.

You can also use convert and tiffcp to create a multi-page TIFF. Most of the image and document viewers I have tried recognize only one image in the collection; xreader (and, I assume, Atril, etc.) treats it as a collection of images, an album of sorts. It might be useful for obfuscating some documents without resorting to encryption.
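
The tiffcp and tiff2pdf steps could be chained from a script. Below is a minimal Python sketch, assuming the libtiff tools are installed; the file names are hypothetical. It only builds the two command lines (`tiffcp src... dst` to combine pages, then `tiff2pdf -o out.pdf in.tif` to convert), so you can inspect them before running anything.

```python
import subprocess


def tiffs_to_pdf_commands(tiff_pages, combined_tiff, output_pdf):
    """Return the tiffcp and tiff2pdf commands for a list of page scans."""
    if not tiff_pages:
        raise ValueError("need at least one TIFF page")
    # tiffcp takes the source files first and the destination last.
    combine = ["tiffcp", *tiff_pages, combined_tiff]
    # tiff2pdf writes to the file named after -o.
    convert = ["tiff2pdf", "-o", output_pdf, combined_tiff]
    return [combine, convert]


def tiffs_to_pdf(tiff_pages, combined_tiff, output_pdf):
    """Run both steps, stopping at the first failure."""
    for command in tiffs_to_pdf_commands(tiff_pages, combined_tiff, output_pdf):
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    pages = ["page1.tif", "page2.tif"]
    for command in tiffs_to_pdf_commands(pages, "all.tif", "all.pdf"):
        print(" ".join(command))
```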

Isn't Pdf file a lossless type? Why I'm scanning. Does OCRing override need for organizing? (and more!)

Postby pepperminty » Sat Mar 18, 2017 6:19 pm

Portreve wrote:If the original material is in good condition and the scanning job is done without problems by the scanner you're using, then 600 or 300 dpi will not matter for purposes of OCR, because they're both above the minimum resolution for reliable interpretation.

Hi, Portreve. Thank you for this tip. I wasn't considering high-res (e.g. 600 dpi) solely for OCR purposes. But also just for quality of scan. I didn't want any regrets that my 300dpi scan of document X is hard to read, but would be readable (or at least easier to read/decipher) with a higher resolution.

Portreve wrote:Initially, as suggested above, I would probably scan everything to TIFF, if that's the only good option you have. Tagged Image File Format, which is what TIFF stands for, is in and of itself a non-compressed standard which later on (and now a very, very long time ago) incorporated basic non-lossy compression, such as Lempel-Ziv-Welch (LZW), and maybe even ZIP (not really sure any more) to bring down the file sizes. This is great because you ideally want no data loss initially as you're getting ready to set up your archive.


I've updated my OP to mention that I have access to two other scanners, with different output formats. Taking the three scanners together, I can output to Tiff, Pdf, Pdf/a, Compact Pdf, Jpeg, Xps, Compact Xps, OOXML (pptx).
I was online-chatting with an Adobe rep, and I thought they said that when rearranging PDFs (like adding a page, or removing a page, or moving the last page to the first page), there is no loss of quality. Is the agent wrong? Or did I just misunderstand what was said?
Portreve wrote:What I would suggest then is to take those TIFFs of text (which, frankly, 300 dpi should be adequate for) and get them OCR'd prior to creating your permanent archive.

Can Acrobat Pro OCR Tiff files? Please note that though my laptop and desktop are all Linux Mint, I was planning on using a Windows computer to use the Acrobat Pro/DC program to OCR. How do you OCR Tiff files?

Portreve wrote: On the other hand, I personally would scan actual photos at 600 dpi and then save them as PNGs, and not JPGs, simply on the basis that you don't sacrifice the image quality in PNG that you do with JPG. Moreover, JPG is still technically a proprietary format, whereas PNG is libre, and I'll go with the libre option, short of there being some imperative to the contrary, every time.

Sadly, none of the 3 scanners can output to PNG. I wish PNG was an option, but it isn't.

Portreve wrote:
As for creating PDFs, well... I would suggest you use PDF where layout and appearance are significant factors (for example, a form of some kind, or something where the appearance of the layout is significant), straight text files where only the content matters, and something like ODT for those documents in the middle where exact layout is unimportant, but for some reason you wish to have at least a degree of formatting (font, for example, or emphasized text like italics or bold face, etc.).

I wasn't going to do any formatting after scanning (so I don't think I'll use ODT). I'm scanning because:
    I want to digitize my filing cabinet (reduce papers)
    They will be potentially easier to search through (with OCR applied)
    I don't have to be near filing cabinet to access a document (I can do so wherever I have internet access, as I plan to upload to the cloud -- Google Drive or OneDrive are my two considerations)


Portreve wrote:The only other thing I would add is that, even though it is a more manual process, you will benefit from a good and well thought-through organizational structure and a good file naming convention, because then you will never be dependent upon any kind of file management or organization software in the future.

Thank you for adding this tip. The Ricoh scanner's default filenaming is something like 201703171559.pdf (or .whatever). I haven't changed them at the outset, but can do so later.
I really didn't think I needed to bother with organization, coz I thought OCR would be the solution. (Kinda like how in gmail, though I do use labels, I rely on keyword searching.) Do you still think OCR is not enough on its own, and that organizing and file-naming are important?

In Linux Mint, can you search through a folderfull of OCRed PDFs by in-text keyword?

Postby pepperminty » Sat Mar 18, 2017 6:26 pm

Thank you to TooMuchTime, BenTrabetere, and Portreve who have replied.

I know that with OCR tech, you can load up a PDF (in a program like Xreader on Mint or Acrobat Reader on Windows), press Control-F and search for text located inside the body.

But is there a way to search through a folder full of OCRed PDFs? Let's say you have a folder containing 100 OCRed PDFs. I want to know which PDFs have the word foo. Is there one terminal command (or GUI counterpart) that can search through all of them at one go? Google Drive can do this. I wonder if Linux Mint can too.
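
One possible terminal-side answer, sketched in Python: walk the folder, pull the text layer out of each PDF, and list the files that contain the keyword. This assumes the PDFs have already been OCRed and that the `pdftotext` tool from poppler-utils is installed; the function names are my own. There is no index, so it re-reads every file on each search.

```python
import subprocess
from pathlib import Path


def pdftotext_extract(pdf_path):
    """Return the text layer of a PDF using the pdftotext CLI ("-" = stdout)."""
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=True,
    )
    return result.stdout


def find_pdfs_containing(folder, keyword, extract_text=pdftotext_extract):
    """List PDFs under `folder` whose text contains `keyword` (case-insensitive)."""
    needle = keyword.casefold()
    matches = []
    for pdf_path in sorted(Path(folder).rglob("*.pdf")):
        try:
            text = extract_text(pdf_path)
        except (OSError, subprocess.CalledProcessError):
            continue  # skip files that cannot be read or converted
        if needle in text.casefold():
            matches.append(pdf_path)
    return matches
```

Calling `find_pdfs_containing("scans", "foo")` would return the matching paths, where `scans` is a hypothetical folder name. An image-only PDF has no text layer, so it will never match until it has been OCRed.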

Re: Isn't Pdf file a lossless type? Why I'm scanning. Does OCRing override need for organizing? (and more!)

Postby Portreve » Sun Mar 19, 2017 9:00 pm

pepperminty wrote:Hi, Portreve.

Couple things...

Acrobat Pro, as I understand it, does have built-in OCR capabilities. It's probably sensible to use it for that purpose because Adobe most likely has incorporated image massaging capabilities into it, and that can be useful if your original has problems.

Scanners do not output into any of those formats. The data received by the scanning software is output into one of those formats. Therefore, pick something non-lossy like TIFF, then just convert them using GIMP.

You must OCR text data by some means for it to be searchable. Therefore, simply "scanning to PDF" is a waste of time, because those will just be images stuck in a PDF. Why would you want to do that?

Also, why would you want to combine different things together? I understand that you might want a multi-page document to be one single file, but you would not want to have lots of different things in one single file.

I cannot for the life of me understand — this is not directed at you personally — why someone would want to have an archive of files which were all named some kind of meaningless gibberish. That is the exact sort of crap that gets you dependent on a file management and backup system. I get that you would ideally wish to have a facility for searching, but when the searchable type files already have searchable content to them (i.e. text) then any decent setup, even more arcane and harder-to-use terminal-based ones, could search the file archive. Therefore, any other system worthy of your consideration can just as readily search through it.

jimallyn
Level 14
Posts: 5055
Joined: Thu Jun 05, 2014 7:34 pm
Location: Wenatchee, WA USA

Re: In Linux Mint, can you search through a folderfull of OCRed PDFs by in-text keyword?

Postby jimallyn » Sun Mar 19, 2017 9:17 pm

pepperminty wrote:But is there a way to search through a folder full of OCRed PDFs? Let's say you have a folder containing 100 OCRed PDFs. I want to know which PDFs have the word foo. Is there one terminal command (or GUI counterpart) that can search through all of them at one go?

Look into catfish and recoll. I haven't yet used them, but somebody mentioned them recently and I made a note of it to look into later. I believe they will do what you want.

“If the government were coming for your TVs and cars, then you'd be upset. But, as it is, they're only coming for your sons.” - Daniel Berrigan

Re: Isn't Pdf file a lossless type? Why I'm scanning. Does OCRing override need for organizing? (and more!)

Postby pepperminty » Mon Mar 20, 2017 1:29 pm

Portreve wrote:Acrobat Pro, as I understand it, does have built-in OCR capabilities. It's probably sensible to use it for that purpose because Adobe most likely has incorporated image massaging capabilities into it, and that can be useful if your original has problems.

I don't think there is an option (much less a free one) for Mint. Or is there?

Portreve wrote: Therefore, pick something non-lossy like TIFF, then just convert them using GIMP.


Okay, henceforth I'll choose Tiff over pdf.

Portreve wrote:You must OCR text data by some means for it to be searchable. Therefore, simply "scanning to PDF" is a waste of time, because those will just be images stuck in a PDF. Why would you want to do that?

Because my gameplan was to scan to image-based PDF on the machine, then go to the Windows computer with Acrobat Pro/DC and OCR the PDFs to make them searchable.


Portreve wrote:Also, why would you want to combine different things together? I understand that you might want a multi-page document to be one single file, but you would not want to have lots of different things in one single file.

Glad you asked. The 3 big machines I have access to are not in my place of residence. So I thought, to save time, I'd just scan quickly while using the machines, then, when I am away from the machines and at leisure, I can rearrange (e.g. turn one PDF into three PDFs).

Portreve wrote:I cannot for the life of me understand — this is not directed at you personally — why someone would want to have an archive of files which were all named some kind of meaningless gibberish. That is the exact sort of crap that gets you dependent on a file management and backup system. I get that you would ideally wish to have a facility for searching, but when the searchable type files already have searchable content to them (i.e. text) then any decent setup, even more arcane and harder-to-use terminal-based ones, could search the file archive. Therefore, any other system worthy of your consideration can just as readily search through it.

You've convinced me. For now, I'll keep the machine's default filenames (e.g. 201703210930.pdf) and then, when I have leisure, make the name more descriptive (e.g. hospitalbills2017.pdf).
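
The later renaming could be semi-automated. Here is a minimal Python sketch that keeps the scan time from the machine's default name and appends a descriptive label. It assumes the default names are a 12-digit YYYYMMDDhhmm timestamp, which is only my reading of the examples in this thread, and the output pattern is just one possible convention.

```python
from datetime import datetime
from pathlib import Path


def descriptive_name(scanner_name, label):
    """Turn e.g. 201703210930.pdf into 2017-03-21_0930_<label>.pdf.

    Raises ValueError if the file stem is not a YYYYMMDDhhmm timestamp.
    """
    path = Path(scanner_name)
    stamp = datetime.strptime(path.stem, "%Y%m%d%H%M")
    return f"{stamp:%Y-%m-%d_%H%M}_{label}{path.suffix}"


if __name__ == "__main__":
    print(descriptive_name("201703210930.pdf", "hospitalbills"))
```

Keeping the date at the front means the files still sort chronologically in any file manager.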

Re: In Linux Mint, can you search through a folderfull of OCRed PDFs by in-text keyword?

Postby pepperminty » Mon Mar 20, 2017 1:31 pm

jimallyn wrote:
pepperminty wrote:But is there a way to search through a folder full of OCRed PDFs? Let's say you have a folder containing 100 OCRed PDFs. I want to know which PDFs have the word foo. Is there one terminal command (or GUI counterpart) that can search through all of them at one go?

Look into catfish and recoll. I haven't yet used them, but somebody mentioned them recently and I made a note of it to look into later. I believe they will do what you want.

Thanks, Jim.
Looking through http://www.twotoasts.de/index.php/catfish/ and https://www.lesbonscomptes.com/recoll/

ColdBoot
Level 2
Posts: 82
Joined: Thu Feb 16, 2017 10:40 pm

Re: Questions on scanning to PDFs, and Acrobat Pro/DC on Win/Mac

Postby ColdBoot » Tue Mar 21, 2017 6:49 am

Since you need to find keywords present in PDF files, I think Catfish is not suitable, though I don't know whether there is a plugin that lets it search inside that type of file. Anyway, it also does not support indexing, so even if it can do it, searching is probably very slow for that number of files. Try DocFetcher instead, especially if you often have to do proximity searches, which are also not supported in Catfish. It even has a preview window with highlights where you can check whether the context you're looking for is right.

Oh well, OCR-ed documents... :mrgreen: Catfish can do it, for sure but mentioned utility beats it hands down anyways.
Linux Mint Cinnamon 18.1
Intel G1820, DDR3 8GB, Nvidia GT720(2GB)

docfetcher and recoll

Postby pepperminty » Tue Mar 21, 2017 6:53 pm

ColdBoot wrote:Since you need to find keywords present in PDF files, I think Catfish is not suitable

ColdBoot wrote:Try DocFetcher instead, especially if you often have to do proximity searches, which are also not supported in Catfish. It even has a preview window with highlights where you can check whether the context you're looking for is right.

You're probably right about Catfish. Based on the screenshots on http://www.twotoasts.de/index.php/catfish/, it appears that Catfish only looks for keywords in the filename.

I'll look into your suggestion of DocFetcher, as well as jimallyn's other suggestion of recoll.

ColdBoot wrote:Oh well, OCR-ed documents... :mrgreen: Catfish can do it, for sure but mentioned utility beats it hands down anyways.

I don't understand.

Re: Questions on scanning to PDFs, and Acrobat Pro/DC on Win/Mac

Postby ColdBoot » Tue Mar 21, 2017 8:45 pm

pepperminty wrote:I don't understand.


Nothing really, for a moment it slipped my attention that the PDF-s you were referring to are in fact scanned material, sorry.

Also, I've used Catfish to find file names, since XFCE's file manager doesn't search for files in subdirectories, and it is not a tool that searches file contents. I'd recommend you try the portable version of DocFetcher, as you can easily re-index whenever you add files. The routine is quite fast, so I don't see the need for yet another running daemon.