Forum: >>> Magnum BBS <<<

How to remove a link in a PDF that is found in a thousand pages

From Andrew@21:1/5 to All on Fri May 24 00:04:52 2024

XPost: comp.text.pdf, comp.editors

I have a PDF with a link in it of the form:
http://domain.com
in a million places (usually at the top, bottom or middle of a page that is mostly empty), where all I want to do is delete it completely.

I want to delete those links, and the only PDF editor I know of that will delete them easily is Adobe Acrobat (the writer), but it deletes them one by one. Yuck. I'm doing that, but is there a better way?

Googling, I find that Calibre will delete them but oh my god, is that a complicated action, where you have to do CSS rules and crazy stuff like that.

You can't just search and replace for some godforsaken reason.

Hence I implore you for help. The PDF can easily be converted to
EPUB if there's a way to do this other than with a PDF editor.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Paul@21:1/5 to Andrew on Thu May 23 23:00:28 2024

XPost: comp.text.pdf, comp.editors

On 5/23/2024 8:04 PM, Andrew wrote:

> I have a PDF with a link in it of the form:
> http://domain.com
> in a million places (usually at the top, bottom or middle of a page that is mostly empty), where all I want to do is delete it completely.
>
> I want to delete those links, and the only PDF editor I know of that will delete them easily is Adobe Acrobat (the writer), but it deletes them one by one. Yuck. I'm doing that, but is there a better way?
>
> Googling, I find that Calibre will delete them but oh my god, is that a complicated action, where you have to do CSS rules and crazy stuff like that.
>
> You can't just search and replace for some godforsaken reason.
>
> Hence I implore you for help. The PDF can easily be converted to
> EPUB if there's a way to do this other than with a PDF editor.

PDF files are normally "binary" in appearance, but they can be
translated to "ASCII". Notice there is a gubbin near the top which
is not ASCII, and that continues to make the file binary. Some
scripting you might do could have an issue with those four binary
characters, for example. (That binary thing could be different on a
different version of PDF file.)

I don't know if this file has integrity or not. It's just
intended to show how simple the format could have been. (Normal files
will NOT be simple, so you can forget that right now.)

*********************** PDF in Text Mode ***********************
%PDF-1.1
%¥±ë

1 0 obj
<< /Type /Catalog
/Pages 2 0 R
>>
endobj

2 0 obj
<< /Type /Pages
/Kids [3 0 R]
/Count 1
/MediaBox [0 0 300 144]
>>
endobj

3 0 obj
<< /Type /Page
/Parent 2 0 R
/Resources
<< /Font
<< /F1
<< /Type /Font
/Subtype /Type1
/BaseFont /Times-Roman
>>
>>
>>
/Contents 4 0 R
>>
endobj

4 0 obj
<< /Length 55 >>
stream
BT
/F1 18 Tf
0 0 Td
(Hello World) Tj
ET
endstream
endobj

xref
0 5
0000000000 65535 f
0000000018 00000 n
0000000077 00000 n
0000000178 00000 n
0000000457 00000 n
trailer
<< /Root 1 0 R
/Size 5
>>
startxref
565
%%EOF
*********************** PDF in Text Mode ***********************

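If you just delete the string in question, the reader is going to say
"this file is damaged".

The document has consistency checks (each stream records its /Length,
and the xref table records the byte offset of every object), and
that's how it can tell the file has been edited.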
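
To make the counters concrete, here is a rough Python sketch that rebuilds a file much like the one above and fills in the xref offsets and the stream /Length by counting bytes. The function names (stream_object, build_pdf, recorded_offsets, offsets_ok) are made up for this illustration, and it only covers the simple classic xref-table layout, not everything a real PDF can contain.

```python
def stream_object(content: bytes) -> bytes:
    """Wrap content in a stream whose /Length matches its byte count."""
    return b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream"


def build_pdf(bodies: list[bytes]) -> bytes:
    """Number the object bodies from 1 and emit a PDF with a matching xref."""
    out = bytearray(b"%PDF-1.1\n")
    offsets = []
    for num, body in enumerate(bodies, start=1):
        offsets.append(len(out))  # byte position where "N 0 obj" starts
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n"
    xref_pos = len(out)
    size = len(bodies) + 1  # object 0 is the head of the free list
    out += b"xref\n0 %d\n" % size
    out += b"0000000000 65535 f \n"
    for off in offsets:
        out += b"%010d 00000 n \n" % off  # each entry is exactly 20 bytes
    out += b"trailer\n<< /Root 1 0 R /Size %d >>\n" % size
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


def recorded_offsets(pdf: bytes) -> list[int]:
    """Read the in-use ("n") offsets back out of the xref table."""
    table = pdf[pdf.rindex(b"\nxref\n") + 1:pdf.rindex(b"trailer")]
    return [int(line[:10]) for line in table.splitlines()
            if line.rstrip().endswith(b" n")]


def offsets_ok(pdf: bytes) -> bool:
    """Check that every recorded offset really lands on "N 0 obj"."""
    return all(pdf.startswith(b"%d 0 obj" % num, off)
               for num, off in enumerate(recorded_offsets(pdf), start=1))


page_text = b"BT\n/F1 18 Tf\n0 0 Td\n(Hello World) Tj\nET"
pdf = build_pdf([
    b"<< /Type /Catalog /Pages 2 0 R >>",
    b"<< /Type /Pages /Kids [3 0 R] /Count 1 /MediaBox [0 0 300 144] >>",
    b"<< /Type /Page /Parent 2 0 R /Resources << /Font << /F1 << /Type /Font"
    b" /Subtype /Type1 /BaseFont /Times-Roman >> >> >> /Contents 4 0 R >>",
    stream_object(page_text),
])

# Deleting bytes from object 2 shifts objects 3 and 4, but the xref table
# still holds their old positions.
damaged = pdf.replace(b"/MediaBox [0 0 300 144] ", b"")

print(offsets_ok(pdf), offsets_ok(damaged))  # prints: True False
```

Deleting text from inside a stream has the same effect on everything that follows it, and it also leaves that stream's /Length wrong.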

You can tell from this, they were just screwing with us. The
format before this, PostScript, didn't have counters. When you
found a section in PostScript that said "Do not delete this section",
you just deleted it :-) Well, when they invented PDF, they messed
with it a bit, in the bomb-squad sense.

Adobe makes a "book" available about the PDF standard, and
you could use that. But that's a learning experience.

The only command of note, in my Notes file, is this, and I have
not placed any comments to tell me what it does :-) This makes
the ASCII-like flavor of file.

mutool.exe convert -F pdf -O decompress,clean -o output.pdf input.pdf

And when we talk of "binary to ASCII", there is DEFINITELY binary
still in there. The commercial fonts can be encoded somehow, and they are
still transferred as a binary blob. If not handled properly, you will break
the fonts. This puts some constraints on how you work on the file, for sure.
I could use a hex editor such as HxD, for example, while keeping another
tool open to read the ASCII portions of the file more easily.

There are various ways to obscure text in the document. Even in
"ASCII mode", nothing says you will see "https://www.something.com".
You might see bunches of numbers instead. If this string of yours
is intended as a watermark, then of course the file will be augmented
for maximum annoyance. A lot of the watermarks we played with as kids
were not hardened. You might have concluded nobody cared to do
a good job. I can assure you that some commercial tools definitely
take their watermark design seriously.

Paul

From Lawrence D'Oliveiro@21:1/5 to Andrew on Fri May 24 03:45:51 2024

XPost: comp.text.pdf, comp.editors

On Fri, 24 May 2024 00:04:52 -0000 (UTC), Andrew wrote:

> ... is there a better way?

Write a program using a PDF-manipulation toolkit.

I have had good results writing Python code using pikepdf <https://github.com/pikepdf/pikepdf>.
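
As a rough idea of what such a program could look like, here is a minimal sketch. It assumes the links are stored as ordinary /Link annotations with a /URI action, which may not be true of this particular file. The function names is_target_link and strip_links are invented for the example. Note that this only removes the clickable region; if the URL is also drawn as visible text on the page, that text lives in the page's content stream and would need separate handling.

```python
def is_target_link(annot, prefix: str) -> bool:
    """True for a /Link annotation whose URI action starts with prefix."""
    if str(annot.get("/Subtype")) != "/Link":
        return False
    action = annot.get("/A")
    if action is None:
        return False
    uri = action.get("/URI")
    return uri is not None and str(uri).startswith(prefix)


def strip_links(src: str, dst: str, prefix: str) -> int:
    """Copy src to dst without the matching link annotations; return the count."""
    # Imported here so the predicate above can be tried without pikepdf.
    import pikepdf  # third-party: pip install pikepdf

    removed = 0
    with pikepdf.open(src) as pdf:
        for page in pdf.pages:
            annots = page.obj.get("/Annots")
            if annots is None:
                continue  # this page has no annotations at all
            keep = [a for a in annots if not is_target_link(a, prefix)]
            removed += len(annots) - len(keep)
            page.obj["/Annots"] = pikepdf.Array(keep)
        pdf.save(dst)
    return removed
```

Something like strip_links("input.pdf", "output.pdf", "http://domain.com") would then write a cleaned copy and leave the original file untouched.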

From Kingfisher@21:1/5 to Andrew on Thu May 23 23:00:35 2024

XPost: comp.text.pdf, comp.editors

On 5/23/24 17:04, Andrew wrote:

> I have a PDF with a link in it of the form:
> http://domain.com
> in a million places (usually at the top, bottom or middle of a page that is mostly empty), where all I want to do is delete it completely.
>
> I want to delete those links, and the only PDF editor I know of that will delete them easily is Adobe Acrobat (the writer), but it deletes them one by one. Yuck. I'm doing that, but is there a better way?
>
> Googling, I find that Calibre will delete them but oh my god, is that a complicated action, where you have to do CSS rules and crazy stuff like that.
>
> You can't just search and replace for some godforsaken reason.
>
> Hence I implore you for help. The PDF can easily be converted to
> EPUB if there's a way to do this other than with a PDF editor.

LibreOffice will open a PDF (in Draw by default), let you edit it, and
export it as PDF again. Its Find and Replace function can get all the
links in one shot.

From Herbert Kleebauer@21:1/5 to Andrew on Fri May 24 08:40:21 2024

XPost: comp.text.pdf, comp.editors

On 24.05.2024 02:04, Andrew wrote:

> I have a PDF with a link in it of the form:
> http://domain.com
> in a million places (usually at the top, bottom or middle of a page that is mostly empty), where all I want to do is delete it completely.
>
> I want to delete those links, and the only PDF editor I know of that will delete them easily is Adobe Acrobat (the writer), but it deletes them one by one. Yuck. I'm doing that, but is there a better way?

If you have Acrobat, save the file as an uncompressed PDF. If you are
lucky, you will find "http://domain.com" as simple text in the file.
Replace every occurrence with exactly the same number of blanks, so
that no byte offsets change. But you have to use an editor that
preserves the few binary bytes at the beginning of the file.
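
If no suitable editor is at hand, the same replacement can be sketched in a few lines of Python, since files opened in binary mode are read and written byte for byte. This is only an illustration: the names blank_out and blank_out_file are invented here, and it helps only if the URL really appears as plain bytes in the uncompressed file.

```python
def blank_out(data: bytes, needle: bytes) -> tuple[bytes, int]:
    """Overwrite every needle with the same number of spaces; also return the count."""
    count = data.count(needle)
    return data.replace(needle, b" " * len(needle)), count


def blank_out_file(src: str, dst: str, needle: bytes) -> int:
    """Write a patched copy of src to dst and return how many hits were blanked."""
    # "rb"/"wb" keep every byte as-is, including the binary marker near the top.
    with open(src, "rb") as f:
        data = f.read()
    patched, count = blank_out(data, needle)
    # Same length in, same length out, so no offset inside the file moves.
    assert len(patched) == len(data)
    with open(dst, "wb") as f:
        f.write(patched)
    return count
```

A call such as blank_out_file("uncompressed.pdf", "patched.pdf", b"http://domain.com") returns the number of occurrences it blanked; a result of 0 means the string is not stored as plain text, and this approach will not work on that file.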

From Peter Johnson@21:1/5 to All on Fri May 24 16:04:00 2024

XPost: comp.text.pdf, comp.editors

On Fri, 24 May 2024 00:04:52 -0000 (UTC), Andrew <andrew@spam.net> wrote:

> I have a PDF with a link in it of the form:
> http://domain.com
> in a million places (usually at the top, bottom or middle of a page that is mostly empty), where all I want to do is delete it completely.
>
> I want to delete those links, and the only PDF editor I know of that will delete them easily is Adobe Acrobat (the writer), but it deletes them one by one. Yuck. I'm doing that, but is there a better way?
>
> Googling, I find that Calibre will delete them but oh my god, is that a complicated action, where you have to do CSS rules and crazy stuff like that.
>
> You can't just search and replace for some godforsaken reason.
>
> Hence I implore you for help. The PDF can easily be converted to
> EPUB if there's a way to do this other than with a PDF editor.

How important is the formatting?
You could extract the text into a Word (or similar) file, run
find/replace on it and then create a new PDF. That might or might
not change the formatting, but you could probably fix that before you
create the new PDF.
