• Re: pdf grep?

    From Robert Heller@21:1/5 to dieterhansbritz@gmail.com on Wed Apr 3 14:03:37 2024
    Grep may sort of work with PDF files too. You might also want to use the strings command to get "clean" strings. Note: *some* PDF files are just images (no actual text). These would be PDFs created by scanning a document (without
    using OCR). Also, many typesetting programs (TeX/LaTeX, word processors, etc.) might do some typesetting "magic" (eg ligitures, etc.) that might make things hard for grep.
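
    To picture the ligature problem: a typesetter may emit the single glyph U+FB01 where the source had "fi", so a plain search for "file" misses it. Below is a minimal Python sketch, assuming the text has already been extracted from the PDF and that Unicode NFKC normalization is acceptable for searching. The function name search_normalized is made up for illustration.

```python
import re
import unicodedata


def search_normalized(text, pattern):
    """Return the lines of `text` that match `pattern` after folding ligatures.

    NFKC normalization replaces compatibility characters such as the
    'fi' ligature (U+FB01) with their plain-letter equivalents.
    """
    regex = re.compile(pattern)
    hits = []
    for line in text.splitlines():
        folded = unicodedata.normalize("NFKC", line)
        if regex.search(folded):
            hits.append(folded)
    return hits


if __name__ == "__main__":
    # "\ufb01" is the single-glyph 'fi' ligature.
    sample = "the \ufb01le was typeset with ligatures\nno match here"
    for hit in search_normalized(sample, "file"):
        print(hit)
```

    This only helps with ligatures that Unicode knows about; glyphs that a font maps to private code points would need a different approach.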

    xpdf includes a text search button as part of its UI.

    At Wed, 3 Apr 2024 12:45:20 -0000 (UTC) db <dieterhansbritz@gmail.com> wrote:


    Under Linux, I can use grep to search a bunch of
    files for a character string. Is there an equivalent
    command for searching pdf files?



    --
    Robert Heller -- Cell: 413-658-7953 GV: 978-633-5364
    Deepwoods Software -- Custom Software Services
    http://www.deepsoft.com/ -- Linux Administration Services
    heller@deepsoft.com -- Webhosting Services

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Ram@21:1/5 to Robert Heller on Wed Apr 3 14:17:22 2024
    Robert Heller <heller@deepsoft.com> wrote or quoted:
    might do some typesetting "magic" (eg ligitures, etc.) that might make things

    "ligatures"

    Text in PDFs is sometimes compressed. So one can either use
    programs like "Agent Ransack" to search for text in PDFs or
    tools like "pdftotext" to first create a text file for every
    PDF file and then grep those text files.
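
    A small Python sketch of why compression defeats a byte-level grep. The content stream below is invented for the demonstration and real PDF streams differ in detail; the only assumption carried over from PDF is that its FlateDecode filter uses the same deflate format as zlib.

```python
import zlib

# A made-up PDF-style content stream, repeated so that it compresses well.
stream = b"BT /F1 12 Tf (searching for ligatures in a PDF) Tj ET\n" * 20
needle = b"ligatures"

# Compress the stream the way a Flate-encoded PDF stream would be stored.
compressed = zlib.compress(stream)

# What a byte-level search of the stored data would see.
found_raw = needle in compressed

# What a tool sees once it has decompressed the stream first.
found_decoded = needle in zlib.decompress(compressed)

if __name__ == "__main__":
    print("found in compressed bytes:", found_raw)
    print("found after decompression:", found_decoded)
```

    Tools such as pdftotext do the decompression (and the font decoding) for you, which is why their output is greppable.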

  • From Stefan Ram@21:1/5 to Stefan Ram on Wed Apr 3 14:29:40 2024
    ram@zedat.fu-berlin.de (Stefan Ram) wrote or quoted:
    Text in PDFs is sometimes compressed. So one can either use
    programs like "Agent Ransack" to search for text in PDFs or
    tools like "pdftotext" to first create a text file for every
    PDF file and then grep those text files.

    PS: "Agent Ransack" is Windows software. "pdftotext" is also
    available for Linux. Converting all PDFs to text files needs
    to be done only once, and then search operations on those
    text files are faster than scanning the PDF files for text
    on every search!
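
    The convert-once, search-many idea could be sketched in Python roughly as follows. This assumes pdftotext is on the PATH and that the .txt copies live next to the PDFs; the function names refresh_text_cache and grep_text_cache are hypothetical.

```python
import re
import shutil
import subprocess
from pathlib import Path


def refresh_text_cache(pdf_dir):
    """Run pdftotext for each PDF whose .txt copy is missing or older."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        txt = pdf.with_suffix(".txt")
        if not txt.exists() or txt.stat().st_mtime < pdf.stat().st_mtime:
            # pdftotext takes the input PDF followed by the output text file.
            subprocess.run(["pdftotext", str(pdf), str(txt)], check=True)


def grep_text_cache(pdf_dir, pattern):
    """Yield (file name, line number, line) for each matching cached line."""
    regex = re.compile(pattern)
    for txt in sorted(Path(pdf_dir).glob("*.txt")):
        with txt.open(encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if regex.search(line):
                    yield txt.name, number, line.rstrip("\n")


if __name__ == "__main__":
    refresh_text_cache(".")
    for name, number, line in grep_text_cache(".", "grep"):
        print(f"{name}:{number}:{line}")
```

    Only refresh_text_cache touches the PDFs, so repeated searches just read the plain text files.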

  • From Tim Landscheidt@21:1/5 to dieterhansbritz@gmail.com on Wed Apr 3 14:22:18 2024
    db <dieterhansbritz@gmail.com> wrote:

    Under Linux, I can use grep to search a bunch of
    files for a character string. Is there an equivalent
    command for searching pdf files?

    You can use pdfgrep (https://pdfgrep.org/) for that. It is
    available as a package in Fedora and Debian as well.
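
    If you want to drive pdfgrep from a script, a thin Python wrapper might look like the sketch below. It assumes pdfgrep is installed and uses its -r (recursive), -n (page number) and -i (ignore case) options; the helper names pdfgrep_command and run_pdfgrep are made up here.

```python
import shutil
import subprocess


def pdfgrep_command(pattern, root, ignore_case=False):
    """Build the argument list for a recursive pdfgrep with page numbers."""
    cmd = ["pdfgrep", "-r", "-n"]
    if ignore_case:
        cmd.append("-i")
    cmd += [pattern, root]
    return cmd


def run_pdfgrep(pattern, root, ignore_case=False):
    """Run pdfgrep and return its output lines (empty list if none)."""
    if shutil.which("pdfgrep") is None:
        raise RuntimeError("pdfgrep not found on PATH")
    result = subprocess.run(
        pdfgrep_command(pattern, root, ignore_case),
        capture_output=True,
        text=True,
    )
    return result.stdout.splitlines()


if __name__ == "__main__":
    for line in run_pdfgrep("ligature", "."):
        print(line)
```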

    Tim

  • From Peter Flynn@21:1/5 to All on Thu Apr 4 16:57:49 2024
    On 04/04/2024 10:50, db wrote:
    [...]
    I installed pdfgrep in my Kubuntu system, but it is
    not happy. Although the man file is there, even help
    doesn't work:

    I just installed pdfgrep_2.1.2-1build1_amd64.deb in my Mint 20.1 and it
    seems to work OK. What version is the Kubuntu one?

    Peter
