• How to remove a link in a PDF that is found in a thousand pages

    From Andrew@21:1/5 to All on Fri May 24 00:04:52 2024
    XPost: comp.text.pdf, comp.editors

    I have a PDF with a link in it of the form:
    http://domain.com
    in a million places (usually at the top, bottom or middle of a page that is mostly empty), where all I want to do is delete it completely.

    I want to delete those links, and the only PDF editor I know of that will delete them easily is Adobe Acrobat (the writer), but it deletes them one by one. Yuck. I'm doing that, but is there a better way?

    Googling, I find that Calibre will delete them, but oh my god, is that a complicated action, where you have to do CSS rules and crazy stuff like that.

    You can't just search and replace for some godforsaken reason.

    Hence I implore you for help. The PDF can easily be converted to
    any EPUB format, if there's a way to do this other than with a PDF editor.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Paul@21:1/5 to Andrew on Thu May 23 23:00:28 2024
    XPost: comp.text.pdf, comp.editors

    PDF files are normally "binary" in appearance, but they can be
    translated to an "ASCII" form. Notice there is a gubbin near the top
    which is not ASCII, and that alone keeps the file binary. For example,
    some scripting you might do could have an issue with the four binary
    characters. (Those binary bytes could be different in a different
    version of PDF file.)

    I don't know whether this sample file is internally consistent or not.
    It's just intended to show how simple the format could have been.
    (Normal files will NOT be simple, so you can forget that right now.)

    *********************** PDF in Text Mode ***********************
    %PDF-1.1
    %¥±ë

    1 0 obj
    << /Type /Catalog
    /Pages 2 0 R
    >>
    endobj

    2 0 obj
    << /Type /Pages
    /Kids [3 0 R]
    /Count 1
    /MediaBox [0 0 300 144]
    >>
    endobj

    3 0 obj
    << /Type /Page
    /Parent 2 0 R
    /Resources
    << /Font
    << /F1
    << /Type /Font
    /Subtype /Type1
    /BaseFont /Times-Roman
    >>
    >>
    >>
    /Contents 4 0 R
    >>
    endobj

    4 0 obj
    << /Length 55 >>
    stream
    BT
    /F1 18 Tf
    0 0 Td
    (Hello World) Tj
    ET
    endstream
    endobj

    xref
    0 5
    0000000000 65535 f
    0000000018 00000 n
    0000000077 00000 n
    0000000178 00000 n
    0000000457 00000 n
    trailer
    << /Root 1 0 R
    /Size 5
    >>
    startxref
    565
    %%EOF
    *********************** PDF in Text Mode ***********************

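    If you just delete the string in question, the reader is going to say
    "this file is damaged".

    The document has consistency checks (the xref table records the byte
    offset of every object, and each stream declares its /Length), and
    that's how it can tell the file has been edited.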
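
    To make the offset problem concrete, here is a minimal Python sketch (it does not use the sample file above). It builds a tiny one-page PDF, computing the xref offsets as it goes, then checks whether startxref and each xref entry still point where they claim. The helper names build_tiny_pdf and offsets_consistent are made up for illustration, and the checker only understands a single classic xref table, not xref streams or incremental updates.

```python
import re


def build_tiny_pdf(text=b"http://domain.com"):
    """Assemble a one-page PDF, recording each object's byte offset as we go."""
    stream = b"BT /F1 12 Tf 10 10 Td (" + text + b") Tj ET"
    bodies = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 /MediaBox [0 0 300 144] >>",
        b"<< /Type /Page /Parent 2 0 R /Contents 4 0 R /Resources << /Font "
        b"<< /F1 << /Type /Font /Subtype /Type1 /BaseFont /Times-Roman >> >> >> >>",
        b"<< /Length %d >>\nstream\n%s\nendstream" % (len(stream), stream),
    ]
    out = bytearray(b"%PDF-1.1\n")
    offsets = []
    for num, body in enumerate(bodies, start=1):
        offsets.append(len(out))  # where "N 0 obj" starts
        out += b"%d 0 obj\n%s\nendobj\n" % (num, body)
    xref_pos = len(out)
    out += b"xref\n0 %d\n0000000000 65535 f \n" % (len(bodies) + 1)
    for off in offsets:
        out += b"%010d 00000 n \n" % off
    out += b"trailer\n<< /Root 1 0 R /Size %d >>\nstartxref\n%d\n%%%%EOF\n" % (
        len(bodies) + 1, xref_pos)
    return bytes(out)


def offsets_consistent(data):
    """True if startxref points at 'xref' and every in-use entry points at its object."""
    tail = re.search(rb"startxref\s+(\d+)\s+%%EOF", data)
    if tail is None:
        return False
    head = re.compile(rb"xref\s+(\d+)\s+(\d+)\s+").match(data, int(tail.group(1)))
    if head is None:
        return False  # startxref no longer lands on the table
    first, count = int(head.group(1)), int(head.group(2))
    entries = re.findall(rb"(\d{10}) (\d{5}) ([nf])", data[head.end():])[:count]
    if len(entries) != count:
        return False
    for num, (off, gen, kind) in enumerate(entries, start=first):
        if kind == b"n" and not data.startswith(b"%d %d obj" % (num, int(gen)), int(off)):
            return False
    return True
```

    With pdf = build_tiny_pdf(), offsets_consistent(pdf) is True. Deleting the 17 bytes of the URL with pdf.replace(b"http://domain.com", b"") moves the xref table, so startxref goes stale and the check fails. The stale /Length would be a second inconsistency, which this sketch does not check.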

    You can tell from this, they were just screwing with us. The
    format before this, PostScript, didn't have counters. When you
    found a section in PostScript that said "Do not delete this section",
    you just deleted it :-) Well, when they invented PDF, they messed
    with it a bit, in the bomb-squad sense.

    Adobe makes a "book" available about the PDF standard, and
    you could use that. But that's a learning experience.

    The only command of note, in my Notes file, is this, and I have
    not placed any comments to tell me what it does :-) This makes
    the ASCII-like flavor of file.

    mutool.exe convert -F pdf -O decompress,clean -o output.pdf input.pdf

    And when we talk of "binary to ascii", there is DEFINITELY binary
    still in there. The commercial fonts can be encoded somehow, and they are
    still transferred as a binary blob. If not handled properly, you will break
    the fonts. This puts some constraints on how you work on the file, for sure.
    I could use HxD (a hex editor) for the edit, for example, while keeping
    another tool open to read the ASCII portion of the file more comfortably.

    There are various ways to obscure text in the document. Even in
    "ASCII mode", nothing says you will see "https://www.something.com".
    You might see bunches of numbers instead. If this string of yours
    is intended as a watermark, then of course the file will be augmented
    for maximum annoyance. A lot of the watermarks we played with as kids,
    they were not hardened. You might have concluded nobody cared to do
    a good job. I can assure you that some commercial tools, definitely
    take their watermark design seriously.
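
    As a rough illustration of the forms a URL can take inside a decompressed file, this Python sketch looks for three simple encodings: plain bytes, a PDF hex string, and UTF-16BE. The function name find_encodings is hypothetical. It will not see text drawn through a font with a custom encoding, text split into pieces for kerning, or hex strings with whitespace between the digits, which is the "bunches of numbers" case.

```python
def find_encodings(data, url):
    """Report which simple encodings of `url` occur in the raw PDF bytes."""
    raw = url.encode("ascii")
    found = []
    if raw in data:
        found.append("plain")      # e.g. (http://domain.com)
    # Hex strings look like <687474703A2F2F...>; the digits may be either case.
    if raw.hex().encode("ascii") in data.lower():
        found.append("hex")
    if url.encode("utf-16-be") in data:
        found.append("utf-16be")   # strings stored as 16-bit characters
    return found
```

    An empty result on a decompressed file suggests the string is encoded some other way, and a plain search and replace will not reach it.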

    Paul

  • From Lawrence D'Oliveiro@21:1/5 to Andrew on Fri May 24 03:45:51 2024
    XPost: comp.text.pdf, comp.editors

    On Fri, 24 May 2024 00:04:52 -0000 (UTC), Andrew wrote:

    > ... is there a better way?

    Write a program using a PDF-manipulation toolkit.

    I have had good results writing Python code using pikepdf <https://github.com/pikepdf/pikepdf>.
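
    A minimal sketch of what such a program might look like, assuming the URL is stored as ordinary link annotations (a /Link annotation whose /A action carries a /URI). That is an assumption about the file, not something confirmed in this thread. The function names, the file names and the prefix match are all illustrative. Note that this only removes the clickable link; if the URL is also drawn as visible text on the page, that text stays.

```python
def is_unwanted_link(subtype, uri, target):
    """Pure decision helper: a /Link annotation whose URI starts with `target`."""
    return subtype == "/Link" and uri is not None and uri.startswith(target)


def strip_links(src, dst, target="http://domain.com"):
    """Drop matching link annotations from every page; return how many were removed."""
    import pikepdf  # third-party package: pip install pikepdf

    removed = 0
    with pikepdf.open(src) as pdf:
        for page in pdf.pages:
            annots = page.obj.get("/Annots")
            if annots is None:
                continue
            kept = []
            for annot in annots:
                action = annot.get("/A")
                uri = action.get("/URI") if action is not None else None
                uri_text = None if uri is None else str(uri)
                if is_unwanted_link(str(annot.get("/Subtype")), uri_text, target):
                    removed += 1
                else:
                    kept.append(annot)
            if kept:
                page.obj["/Annots"] = pikepdf.Array(kept)
            else:
                del page.obj["/Annots"]  # no annotations left on this page
        pdf.save(dst)
    return removed
```

    Usage would be something like strip_links("input.pdf", "output.pdf"), which writes a new file and leaves the original untouched. Because the toolkit rewrites the file, the offsets and lengths come out consistent without any manual bookkeeping.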

  • From Kingfisher@21:1/5 to Andrew on Thu May 23 23:00:35 2024
    XPost: comp.text.pdf, comp.editors

    LibreOffice will open a PDF (by default in Draw rather than Writer),
    let you edit it, and export it as PDF again. It has a Find and Replace
    function that can get all the links in one shot.

  • From Herbert Kleebauer@21:1/5 to Andrew on Fri May 24 08:40:21 2024
    XPost: comp.text.pdf, comp.editors

    If you have Acrobat, save the file as an uncompressed PDF. If you are
    lucky, you will find "http://domain.com" as simple text in the file.
    Replace every occurrence with exactly the same number of blanks, so
    that the byte offsets in the file stay valid. But you have to use an
    editor which preserves the few binary bytes at the beginning of the file.
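
    If no suitable editor is at hand, the same replacement can be sketched in Python by treating the file as raw bytes, which leaves the binary header alone. The function names and the default needle are illustrative, and this only helps when the URL really does appear as plain text in the uncompressed file.

```python
from pathlib import Path


def blank_out(data, needle):
    """Overwrite each occurrence of `needle` with spaces; the length never changes."""
    count = data.count(needle)
    return data.replace(needle, b" " * len(needle)), count


def blank_out_file(src, dst, needle=b"http://domain.com"):
    """Read `src` as bytes, blank the needle, write `dst`; return the hit count."""
    data = Path(src).read_bytes()  # bytes, not text: no decoding, no newline changes
    patched, count = blank_out(data, needle)
    if len(patched) != len(data):
        raise ValueError("length changed; offsets would no longer match")
    Path(dst).write_bytes(patched)
    return count
```

    A return value of 0 from blank_out_file("uncompressed.pdf", "patched.pdf") means the string was not found as plain text. Blanking the text of a link this way leaves the link itself in place, just pointing at a string of spaces.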

  • From Peter Johnson@21:1/5 to All on Fri May 24 16:04:00 2024
    XPost: comp.text.pdf, comp.editors

    How important is the formatting?
    You could extract the text into a Word (or similar) file, run
    find/replace on it and then create a new PDF. That might or might
    not change the formatting, but you could probably fix that before you
    create the new PDF.
