• [gentoo-user] How to compress lots of tarballs

    From Peter Humphrey@21:1/5 to All on Sun Sep 26 13:00:04 2021
    Hello list,

    I have an external USB-3 drive with various system backups. There are 350 .tar files (not .tar.gz etc.), amounting to 2.5TB. I was sure I wouldn't need to compress them, so I didn't, but now I think I'm going to have to. Is there a reasonably efficient way to do this? I have 500GB spare space on /dev/sda, and the machine runs constantly.

    --
    Regards,
    Peter.

  • From Simon Thelen@21:1/5 to All on Sun Sep 26 13:40:02 2021
    [2021-09-26 11:57] Peter Humphrey <peter@prh.myzen.co.uk>
    Hello list,
    Hi,

    I have an external USB-3 drive with various system backups. There are 350 .tar
    files (not .tar.gz etc.), amounting to 2.5TB. I was sure I wouldn't need to compress them, so I didn't, but now I think I'm going to have to. Is there a reasonably efficient way to do this? I have 500GB spare space on /dev/sda, and
    the machine runs constantly.
    Pick your favorite of gzip, bzip2, xz or lzip (I recommend lzip) and
    then:
    mount USB-3 /mnt; cd /mnt; lzip *

    The archiver you chose will compress the file and add the appropriate
    extension all on its own and tar will use that (and the file magic) to
    find the appropriate decompresser when you want to extract files later
    (you can use `tar tf' to test if you want).
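
For instance, to compress one archive and then spot-check it (the file name
here is just a placeholder):

    $ lzip backup-week38.tar              # produces backup-week38.tar.lz (add -k to keep the uncompressed copy)
    $ tar tf backup-week38.tar.lz | head  # confirm tar can still list the contents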

    --
    Simon Thelen

  • From Ramon Fischer@21:1/5 to Simon Thelen on Sun Sep 26 14:30:01 2021

    In addition to this, you may want to use the parallel implementations of "gzip", "xz", "bzip2" or the new "zstd" (zstandard), which are
    "pigz"[1], "pixz"[2], "pbzip2"[3], or "zstmt" (within package "app-arch/zstd")[4] in order to increase performance:

    $ cd <path_to_mounted_backup_partition>
    $ for tar_archive in *.tar; do pixz "${tar_archive}"; done

    -Ramon

    [1]
    * https://www.zlib.net/pigz/

    [2]
    * https://github.com/vasi/pixz

    [3]
    * https://launchpad.net/pbzip2
    * http://compression.ca/pbzip2/

    [4]
    * https://facebook.github.io/zstd/


    On 26/09/2021 13:36, Simon Thelen wrote:
    [2021-09-26 11:57] Peter Humphrey <peter@prh.myzen.co.uk>
    Hello list,
    Hi,

I have an external USB-3 drive with various system backups. There are 350 .tar
files (not .tar.gz etc.), amounting to 2.5TB. I was sure I wouldn't need to
compress them, so I didn't, but now I think I'm going to have to. Is there a
reasonably efficient way to do this? I have 500GB spare space on /dev/sda, and
the machine runs constantly.
    Pick your favorite of gzip, bzip2, xz or lzip (I recommend lzip) and
    then:
    mount USB-3 /mnt; cd /mnt; lzip *

    The archiver you chose will compress the file and add the appropriate extension all on its own and tar will use that (and the file magic) to
    find the appropriate decompresser when you want to extract files later
    (you can use `tar tf' to test if you want).

    --
    Simon Thelen


    --
    GPG public key: 5983 98DA 5F4D A464 38FD CF87 155B E264 13E6 99BF





  • From Ramon Fischer@21:1/5 to Ramon Fischer on Sun Sep 26 14:30:02 2021

    Addendum:

To complete the list, here is the parallel implementation of "lzip":

    "plzip": https://www.nongnu.org/lzip/plzip.html

    -Ramon

    On 26/09/2021 14:23, Ramon Fischer wrote:
    In addition to this, you may want to use the parallel implementations
    of "gzip", "xz", "bzip2" or the new "zstd" (zstandard), which are
    "pigz"[1], "pixz"[2], "pbzip2"[3], or "zstmt" (within package "app-arch/zstd")[4] in order to increase performance:

       $ cd <path_to_mounted_backup_partition>
       $ for tar_archive in *.tar; do pixz "${tar_archive}"; done

    -Ramon

    [1]
    * https://www.zlib.net/pigz/

    [2]
    * https://github.com/vasi/pixz

    [3]
    * https://launchpad.net/pbzip2
    * http://compression.ca/pbzip2/

    [4]
    * https://facebook.github.io/zstd/


    On 26/09/2021 13:36, Simon Thelen wrote:
    [2021-09-26 11:57] Peter Humphrey <peter@prh.myzen.co.uk>
    Hello list,
    Hi,

    I have an external USB-3 drive with various system backups. There
    are 350 .tar
    files (not .tar.gz etc.), amounting to 2.5TB. I was sure I wouldn't
    need to
    compress them, so I didn't, but now I think I'm going to have to. Is
    there a
    reasonably efficient way to do this? I have 500GB spare space on
    /dev/sda, and
    the machine runs constantly.
    Pick your favorite of gzip, bzip2, xz or lzip (I recommend lzip) and
    then:
    mount USB-3 /mnt; cd /mnt; lzip *

    The archiver you chose will compress the file and add the appropriate
    extension all on its own and tar will use that (and the file magic) to
    find the appropriate decompresser when you want to extract files later
    (you can use `tar tf' to test if you want).

    --
    Simon Thelen



    --
    GPG public key: 5983 98DA 5F4D A464 38FD CF87 155B E264 13E6 99BF





  • From Peter Humphrey@21:1/5 to All on Sun Sep 26 17:40:02 2021
    On Sunday, 26 September 2021 13:25:24 BST Ramon Fischer wrote:
    Addendum:

To complete the list, here is the parallel implementation of "lzip":

    "plzip": https://www.nongnu.org/lzip/plzip.html

    -Ramon

    On 26/09/2021 14:23, Ramon Fischer wrote:
    In addition to this, you may want to use the parallel implementations
    of "gzip", "xz", "bzip2" or the new "zstd" (zstandard), which are "pigz"[1], "pixz"[2], "pbzip2"[3], or "zstmt" (within package "app-arch/zstd")[4] in order to increase performance:

    $ cd <path_to_mounted_backup_partition>
    $ for tar_archive in *.tar; do pixz "${tar_archive}"; done

    -Ramon

    [1]
    * https://www.zlib.net/pigz/

    [2]
    * https://github.com/vasi/pixz

    [3]
    * https://launchpad.net/pbzip2
    * http://compression.ca/pbzip2/

    [4]
    * https://facebook.github.io/zstd/

    On 26/09/2021 13:36, Simon Thelen wrote:
    [2021-09-26 11:57] Peter Humphrey <peter@prh.myzen.co.uk>

    Hello list,

    Hi,

    I have an external USB-3 drive with various system backups. There
    are 350 .tar
    files (not .tar.gz etc.), amounting to 2.5TB. I was sure I wouldn't
    need to
    compress them, so I didn't, but now I think I'm going to have to. Is
    there a
    reasonably efficient way to do this? I have 500GB spare space on
    /dev/sda, and
    the machine runs constantly.

    Pick your favorite of gzip, bzip2, xz or lzip (I recommend lzip) and
    then:
    mount USB-3 /mnt; cd /mnt; lzip *

    The archiver you chose will compress the file and add the appropriate
    extension all on its own and tar will use that (and the file magic) to
    find the appropriate decompresser when you want to extract files later
    (you can use `tar tf' to test if you want).

    Thank you both. Now, as it's a single USB-3 drive, what advantage would a parallel implementation confer? I assume I'd be better compressing from external to SATA, then writing back, or is that wrong?

    Or, I could connect a second USB-3 drive to a different interface, then read from one and write to the other, with or without the SATA between.

    --
    Regards,
    Peter.

  • From antlists@21:1/5 to Peter Humphrey on Sun Sep 26 19:40:01 2021
    On 26/09/2021 16:38, Peter Humphrey wrote:
    Or, I could connect a second USB-3 drive to a different interface, then read from one and write to the other, with or without the SATA between.

    If you've got a second drive, consider changing your strategy ...

    First of all, you want eSATA or USB3 for the speed ...

    Format the drive with lvm, and create an lv-partition big enough to hold
    your backup, but not much more.

Work out the syntax for an in-place rsync backup (sorry, I haven't done
it myself, so I can't help with the details).

    Every time you make a backup, snapshot the lv before you do it.

That way, the in-place rsync will only write the data that has
changed. Your backup volume will grow at an incremental rate, but you'll actually have full backups.
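
Roughly, that workflow might look like this (volume and mount names are made
up, and I haven't tested this exact sequence):

    # take a snapshot of the backup LV before refreshing it
    lvcreate --snapshot --size 20G --name backup_prev /dev/vg_backup/backup
    # refresh the live copy in place; only blocks that change get rewritten,
    # and the old versions of those blocks land in the snapshot
    rsync -aHAX --delete --inplace /home/ /mnt/backup/home/
    # the previous backup stays readable through the snapshot
    mount -o ro /dev/vg_backup/backup_prev /mnt/backup_prev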

The only downside is that if the backup gets damaged, it will corrupt every
copy of the affected files at one stroke. But if you are using said
second drive, you can repurpose your first drive if you can back up
those tar files to DVD or whatever (or throw them away if they've served
their purpose, but I guess they haven't ...). And by alternating the
backup drives, you've got two distinct copies.

    Cheers,
    Wol

  • From Adam Carter@21:1/5 to All on Mon Sep 27 03:40:02 2021
    On Sun, Sep 26, 2021 at 8:57 PM Peter Humphrey <peter@prh.myzen.co.uk>
    wrote:

    Hello list,

    I have an external USB-3 drive with various system backups. There are 350 .tar
    files (not .tar.gz etc.), amounting to 2.5TB. I was sure I wouldn't need
    to
    compress them, so I didn't, but now I think I'm going to have to. Is there
    a
    reasonably efficient way to do this?


    find <mountpoint> -name \*tar -exec zstd -TN {} \;

    Where N is the number of cores you want to allocate. zstd -T0 (or just
    zstdmt) if you want to use all the available cores. I use zstd for
    everything now as it's as good as or better than all the others in the
    general case.

    Parallel means it uses more than one core, so on a modern machine it is
    much faster.
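
If space on the destination is tight, zstd can also delete each tarball once
it has been compressed successfully, e.g. (untested sketch):

    find <mountpoint> -name \*tar -exec zstd -T0 --rm {} \;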

    <div dir="ltr"><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sun, Sep 26, 2021 at 8:57 PM Peter Humphrey &lt;<a href="mailto:peter@prh.myzen.co.uk">peter@prh.myzen.co.uk</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="
    margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hello list,<br>

    I have an external USB-3 drive with various system backups. There are 350 .tar <br>
    files (not .tar.gz etc.), amounting to 2.5TB. I was sure I wouldn&#39;t need to <br>
    compress them, so I didn&#39;t, but now I think I&#39;m going to have to. Is there a <br>
    reasonably efficient way to do this? <br></blockquote><div><br></div><div>find &lt;mountpoint&gt; -name \*tar -exec zstd -TN {} \;</div><div><br></div><div>Where N is the number of cores you want to allocate. zstd -T0 (or just zstdmt) if you want to use
    all the available cores. I use zstd for everything now as it&#39;s as good as or better than all the others in the general case. <br></div><div><br></div><div>Parallel means it uses more than one core, so on a modern machine it is much faster.<br></div></
    </div>

  • From Peter Humphrey@21:1/5 to All on Mon Sep 27 15:40:01 2021
    On Monday, 27 September 2021 02:39:19 BST Adam Carter wrote:
    On Sun, Sep 26, 2021 at 8:57 PM Peter Humphrey <peter@prh.myzen.co.uk>

    wrote:
    Hello list,

    I have an external USB-3 drive with various system backups. There are 350 .tar files (not .tar.gz etc.), amounting to 2.5TB. I was sure I wouldn't need to compress them, so I didn't, but now I think I'm going to have to. Is there a reasonably efficient way to do this?

    find <mountpoint> -name \*tar -exec zstd -TN {} \;

    Where N is the number of cores you want to allocate. zstd -T0 (or just zstdmt) if you want to use all the available cores. I use zstd for
    everything now as it's as good as or better than all the others in the general case.

    Parallel means it uses more than one core, so on a modern machine it is
    much faster.

    Thanks to all who've helped. I can't avoid feeling, though, that the main bottleneck has been missed: that I have to read and write on a USB-3 drive. It's just taken 23 minutes to copy the current system backup from USB-3 to
    SATA SSD: 108GB in 8 .tar files.

    Perhaps I have things out of proportion.

    --
    Regards,
    Peter.

  • From Peter Humphrey@21:1/5 to All on Mon Sep 27 16:20:01 2021
    On Monday, 27 September 2021 14:30:36 BST Peter Humphrey wrote:
    On Monday, 27 September 2021 02:39:19 BST Adam Carter wrote:
    On Sun, Sep 26, 2021 at 8:57 PM Peter Humphrey
    <peter@prh.myzen.co.uk>

    wrote:
    Hello list,

I have an external USB-3 drive with various system backups. There are 350
.tar files (not .tar.gz etc.), amounting to 2.5TB. I was sure I wouldn't
need to compress them, so I didn't, but now I think I'm going to have to.
Is there a reasonably efficient way to do this?

    find <mountpoint> -name \*tar -exec zstd -TN {} \;

    Where N is the number of cores you want to allocate. zstd -T0 (or just zstdmt) if you want to use all the available cores. I use zstd for everything now as it's as good as or better than all the others in the general case.

    Parallel means it uses more than one core, so on a modern machine it is much faster.

    Thanks to all who've helped. I can't avoid feeling, though, that the main bottleneck has been missed: that I have to read and write on a USB-3 drive. It's just taken 23 minutes to copy the current system backup from USB-3 to SATA SSD: 108GB in 8 .tar files.

    I was premature. In contrast to the 23 minutes to copy the files from USB-3 to internal SSD, zstd -T0 took 3:22 to compress them onto another internal SSD. I watched /bin/top and didn't see more than 250% CPU (this is a 24-CPU box) with next-to-nothing else running. The result was 65G of .tar.zst files.

    So, at negligible cost in CPU load*, I can achieve a 40% saving in space. Of course, I'll have to manage the process myself, and I still have to copy the compressed files back to USB-3 - but then I am retired, so what else do I have to do? :)

    Thanks again, all who've helped.

    * ...so I can continue running my 5 BOINC projects at the same time.

    --
    Regards,
    Peter.

  • From Rich Freeman@21:1/5 to peter@prh.myzen.co.uk on Tue Sep 28 13:40:02 2021
    On Mon, Sep 27, 2021 at 9:30 AM Peter Humphrey <peter@prh.myzen.co.uk> wrote:

    Thanks to all who've helped. I can't avoid feeling, though, that the main bottleneck has been missed: that I have to read and write on a USB-3 drive. It's just taken 23 minutes to copy the current system backup from USB-3 to SATA SSD: 108GB in 8 .tar files.

    You keep mentioning USB3, but I think the main factor here is that the
    external drive is probably a spinning hard drive (I don't think you
    explicitly mentioned this but it seems likely esp with the volume of
    data). That math works out to 78MB/s. Hard drive transfer speeds
    depend on the drive itself and especially whether there is more than
    one IO task to be performed, so I can't be entirely sure, but I'm
    guessing that the USB3 interface itself is having almost no adverse
    impact on the transfer rate.

    The main thing to avoid is doing other sustained read/writes from the
    drive at the same time.

    It looks like you ended up doing the bulk of the compression on an
    SSD, and obviously those don't care nearly as much about IOPS.

    I've been playing around with lizardfs for bulk storage and found that
    USB3 hard drives actually work very well, as long as you're mindful
    about what physical ports are on what USB hosts and so on. A USB3
    host can basically handle two hard drives with no loss of performance.
    I'm not dealing with a ton of IO though so I can probably stack more
    drives with pretty minimal impact unless there is a rebuild (in which
    case the gigabit ethernet is probably still the larger bottleneck).
    Even a Raspberry Pi 4 has two USB3 hosts, which means you could stack
    4 hard drives on one and get basically the same performance as SATA.
    When you couple that with the tendency of manufacturers to charge less
    for USB3 drives than SATA drives of the same performance it just
    becomes a much simpler solution than messing with HBAs and so on and
    limiting yourself to hardware that can actually work with an HBA.

    --
    Rich

  • From Peter Humphrey@21:1/5 to All on Tue Sep 28 13:30:02 2021
    On Sunday, 26 September 2021 11:57:43 BST Peter Humphrey wrote:
    Hello list,

    I have an external USB-3 drive with various system backups. There are 350 .tar files (not .tar.gz etc.), amounting to 2.5TB. I was sure I wouldn't
    need to compress them, so I didn't, but now I think I'm going to have to.
    Is there a reasonably efficient way to do this? I have 500GB spare space on /dev/sda, and the machine runs constantly.

    To complete the topic, and in case anyone's interested, I've settled on the following method, which could have been made more general, but this way I
    have a complete set of compressed tarballs in two places in case I need to recover something.

    This is the command to compress one week's backups:

# (cd /mnt/sdc/wstn/main/vvv.old && time (for tar_archive in *.tar; do zstd -T0 --rm "${tar_archive}" -o /mnt/bu-space/vvv.old/"${tar_archive}.zst"; done && cp /mnt/bu-space/vvv.old/* /mnt/sdc/wstn/main/vvv.old/ && sync))

    This was 36GB and took 20 minutes.
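
For readability, the same command unrolled (paths exactly as above):

    (
        cd /mnt/sdc/wstn/main/vvv.old &&
        time (
            for tar_archive in *.tar; do
                # compress each tarball into /mnt/bu-space; --rm deletes the source .tar
                zstd -T0 --rm "${tar_archive}" -o /mnt/bu-space/vvv.old/"${tar_archive}.zst"
            done &&
            # copy the compressed set back alongside where the tarballs were, then flush
            cp /mnt/bu-space/vvv.old/* /mnt/sdc/wstn/main/vvv.old/ &&
            sync
        )
    )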

    --
    Regards,
    Peter.

  • From Peter Humphrey@21:1/5 to All on Tue Sep 28 15:10:01 2021
    On Tuesday, 28 September 2021 12:38:42 BST Rich Freeman wrote:

    You keep mentioning USB3, but I think the main factor here is that the external drive is probably a spinning hard drive (I don't think you explicitly mentioned this but it seems likely esp with the volume of
    data). That math works out to 78MB/s. Hard drive transfer speeds
    depend on the drive itself and especially whether there is more than
    one IO task to be performed, so I can't be entirely sure, but I'm
    guessing that the USB3 interface itself is having almost no adverse
    impact on the transfer rate.

I'm sure you're right, Rich, and yes, this is a 2.5" drive. I'm seeing about 110MB/s reading from USB feeding into zstd.

    The main thing to avoid is doing other sustained read/writes from the
    drive at the same time.

    Quite so.

    It looks like you ended up doing the bulk of the compression on an
    SSD, and obviously those don't care nearly as much about IOPS.

    Yes, input from USB and output to SSD.

    I've been playing around with lizardfs for bulk storage and found that
    USB3 hard drives actually work very well, as long as you're mindful
    about what physical ports are on what USB hosts and so on. A USB3
    host can basically handle two hard drives with no loss of performance.
    I'm not dealing with a ton of IO though so I can probably stack more
    drives with pretty minimal impact unless there is a rebuild (in which
    case the gigabit ethernet is probably still the larger bottleneck).
    Even a Raspberry Pi 4 has two USB3 hosts, which means you could stack
    4 hard drives on one and get basically the same performance as SATA.
    When you couple that with the tendency of manufacturers to charge less
    for USB3 drives than SATA drives of the same performance it just
    becomes a much simpler solution than messing with HBAs and so on and
    limiting yourself to hardware that can actually work with an HBA.

    --
    Regards,
    Peter.

  • From Peter Humphrey@21:1/5 to All on Wed Sep 29 10:30:02 2021
    On Tuesday, 28 September 2021 18:43:06 BST Laurence Perkins wrote:

    There are also backup tools which will handle the compression step for you.

    app-backup/duplicity uses a similar tar file and index system with periodic full and then incremental chains. Plus it keeps a condensed list of file hashes from previous runs so it doesn't have to re-read the entire archive
    to determine what changed the way rsync does.

app-backup/borgbackup is more complex, but is very, very good at deduplicating file data, which saves even more space. Furthermore, it can store backups for multiple systems and deduplicate between them, so if you have any other machines you can have backups there as well, potentially at negligible space cost if you have a lot of redundancy.

    Thanks Laurence. I've looked at borg before, wondering whether I needed a
    more sophisticated tool than just tar, but it looked like too much work for little gain. I didn't know about duplicity, but I'm used to my weekly routine and it seems reliable, so I'll stick with it pro tem. I've been keeping a
    daily KMail archive since the bad old days, and five weekly backups of the whole system, together with 12 monthly backups and, recently an annual
    backup. That last may be overkill, I dare say.

    --
    Regards,
    Peter.

  • From Rich Freeman@21:1/5 to peter@prh.myzen.co.uk on Wed Sep 29 17:40:02 2021
    On Wed, Sep 29, 2021 at 4:27 AM Peter Humphrey <peter@prh.myzen.co.uk> wrote:

    Thanks Laurence. I've looked at borg before, wondering whether I needed a more sophisticated tool than just tar, but it looked like too much work for little gain. I didn't know about duplicity, but I'm used to my weekly routine and it seems reliable, so I'll stick with it pro tem. I've been keeping a daily KMail archive since the bad old days, and five weekly backups of the whole system, together with 12 monthly backups and, recently an annual backup. That last may be overkill, I dare say.

    I think Restic might be gaining some ground on duplicity. I use
    duplicity and it is fine, so I haven't had much need to look at
    anything else. Big advantages of duplicity over tar are:

    1. It will do all the compression/encryption/etc stuff for you - all
    controlled via options.
    2. It uses librsync, which means if one byte in the middle of a 10GB
    file changes, you end up with a few bytes in your archive and not 10GB (pre-compression).
    3. It has a ton of cloud/remote backends, so it is real easy to store
    the data on AWS/Google/whatever. When operating this way it can keep
    local copies of the metadata, and if for some reason those are lost it
can pull just that back down from the cloud to resync without a huge
bill.
    4. It can do all the backup rotation logic (fulls, incrementals,
    retention, etc).
    5. It can prefix files so that on something like AWS you can have the
    big data archive files go to glacier (cheap to store, expensive to
    restore), and the small metadata stays in a data class that is cheap
    to access.
    6. By default local metadata is kept unencrypted, and anything on the
    cloud is encrypted. This means that you can just keep a public key in
    your keyring for completely unattended backups, without fear of access
    to the private key. Obviously if you need to restore your metadata
    from the cloud you'll need the private key for that.
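
For the curious, a minimal duplicity run looks roughly like this (paths, key
ID and the monthly cadence are only illustrative):

    # full backup once a month, incrementals in between, encrypted to a GPG key
    duplicity --encrypt-key 0xDEADBEEF --full-if-older-than 1M /home file:///mnt/backup/home
    # drop backup chains older than six months
    duplicity remove-older-than 6M --force file:///mnt/backup/home
    # pull a single file back out of the latest backup
    duplicity restore --file-to-restore peter/notes.txt file:///mnt/backup/home /tmp/notes.txt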

    If you like the more tar-like process another tool you might want to
    look at is dar. It basically is a near-drop-in replacement for tar
    but it stores indexes at the end of every file, which means that you
    can view archive contents/etc or restore individual files without
    scanning the whole archive. tar was really designed for tape where
    random access is not possible.
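
A hedged sketch of the dar workflow (archive basename and paths invented):

    # create a gzip-compressed archive of /home; dar writes home-w38.1.dar
    dar -c /mnt/backup/home-w38 -R /home -zgzip
    # list the contents via the catalogue at the end, without reading the whole archive
    dar -l /mnt/backup/home-w38
    # restore a single file under /tmp/restore
    dar -x /mnt/backup/home-w38 -R /tmp/restore -g peter/notes.txt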

    --
    Rich

  • From Dale@21:1/5 to Rich Freeman on Wed Sep 29 22:10:01 2021
    Rich Freeman wrote:
    On Wed, Sep 29, 2021 at 4:27 AM Peter Humphrey <peter@prh.myzen.co.uk> wrote:
Thanks Laurence. I've looked at borg before, wondering whether I needed a
more sophisticated tool than just tar, but it looked like too much work for
little gain. I didn't know about duplicity, but I'm used to my weekly routine
and it seems reliable, so I'll stick with it pro tem. I've been keeping a
daily KMail archive since the bad old days, and five weekly backups of the
whole system, together with 12 monthly backups and, recently an annual
backup. That last may be overkill, I dare say.
    I think Restic might be gaining some ground on duplicity. I use
    duplicity and it is fine, so I haven't had much need to look at
    anything else. Big advantages of duplicity over tar are:

    1. It will do all the compression/encryption/etc stuff for you - all controlled via options.
    2. It uses librsync, which means if one byte in the middle of a 10GB
    file changes, you end up with a few bytes in your archive and not 10GB (pre-compression).
    3. It has a ton of cloud/remote backends, so it is real easy to store
    the data on AWS/Google/whatever. When operating this way it can keep
    local copies of the metadata, and if for some reason those are lost it
    can just pull that only down from the cloud to resync without a huge
    bill.
    4. It can do all the backup rotation logic (fulls, incrementals,
    retention, etc).
    5. It can prefix files so that on something like AWS you can have the
    big data archive files go to glacier (cheap to store, expensive to
    restore), and the small metadata stays in a data class that is cheap
    to access.
    6. By default local metadata is kept unencrypted, and anything on the
    cloud is encrypted. This means that you can just keep a public key in
    your keyring for completely unattended backups, without fear of access
    to the private key. Obviously if you need to restore your metadata
    from the cloud you'll need the private key for that.

    If you like the more tar-like process another tool you might want to
    look at is dar. It basically is a near-drop-in replacement for tar
    but it stores indexes at the end of every file, which means that you
    can view archive contents/etc or restore individual files without
    scanning the whole archive. tar was really designed for tape where
    random access is not possible.



Curious question here.  As you may recall, I back up to an external hard drive.  Would it make sense to use that software for an external hard
drive?  Right now, I'm just doing file updates with rsync and the drive
is encrypted.  Thing is, I'm going to have to split into three drives
soon.  So, compressing may help.  Since it is video files, it may not
help much, but I'm not sure about that.  Just curious.

    Dale

    :-)  :-) 

  • From Dale@21:1/5 to Laurence Perkins on Wed Sep 29 23:00:02 2021
    Laurence Perkins wrote:

    Curious question here. As you may recall, I backup to a external hard drive. Would it make sense to use that software for a external hard drive? Right now, I'm just doing file updates with rsync and the drive is encrypted. Thing is, I'm going to
    have to split into three drives soon. So, compressing may help. Since it is video files, it may not help much but I'm not sure about that. Just curious.

    Dale

    :-) :-)


    If I understand correctly you're using rsync+tar and then keeping a set of copies of various ages.

    Actually, it is uncompressed and just stores one version and one copy. 



    If you lose a single file that you want to restore and have to go hunting for it, with tar you can only list the files in the archive by reading the entire thing into memory and only extract by reading from the beginning until you stumble across the
    matching filename. So with large archives to hunt through, that could take... a while...

    dar is compatible with tar (Pretty sure, would have to look again, but I remember that being one of its main selling-points) but adds an index at the end of the file allowing listing of the contents and jumping to particular files without having to
    read the entire thing. Won't help with your space shortage, but will make searching and single-file restores much faster.

    Duplicity and similar has the indices, and additionally a full+incremental scheme. So searching is reasonably quick, and restoring likewise doesn't have to grovel over all the data. It can be slower than tar or dar for restore though because it has
    to restore first from the full, and then walk through however many incrementals are necessary to get the version you want. This comes with a substantial space savings though as each set of archive files after the full contains only the pieces which
    actually changed. Coupled with compression, that might solve your space issues for a while longer.

    Borg and similar break the files into variable-size chunks and store each chunk indexed by its content hash. So each chunk gets stored exactly once regardless of how many times it may occur in the data set. Backups then become simply lists of file
    attributes and what chunks they contain. This results both in storing only changes between backup runs and in deduplication of commonly-occurring data chunks across the entire backup. The database-like structure also means that all backups can be
    searched and restored from in roughly equal amounts of time and that backup sets can be deleted in any order. Many of them (Borg included) also allow mounting backup sets via FUSE. The disadvantage is that restore requires a compatible version of the
    backup tool rather than just a generic utility.

    LMP


    I guess that is the downside of not having just plain uncompressed
    files.  Thing is, so far, I've never needed to restore a single file or
    even several files.  So it's not a big deal for me.  If I accidentally
    delete something tho, that could be a problem, if it has left the trash already. 

Since the drive also uses LVM, someone mentioned using snapshots.  Still
not real clear on those even tho I've read a bit about them.  Some of
the backup techniques are confusing to me.  I get plain files, even
incremental to an extent, but some of the new stuff just muddies the water.

    I really need to just build a file server, RAID or something.  :/

    Dale

    :-)  :-)

  • From Wols Lists@21:1/5 to Dale on Wed Sep 29 23:50:01 2021
    On 29/09/2021 21:58, Dale wrote:
    Since the drive also uses LVM, someone mentioned using snapshots.

    Me?

    Still
    not real clear on those even tho I've read a bit about them.  Some of
    the backup technics are confusing to me.  I get plain files, even incremental to a extent but some of the new stuff just muddies the water.

    An LVM snapshot creates a "copy on write" image. I'm just beginning to
    dig into it myself, but I agree it's a bit confusing.

    I really need to just build a file server, RAID or something.  :/

    Once you've got your logical volume, what I think you do is ask yourself
    how much has changed since the last backup. Then create a snapshot with
    enough space to hold that, before you do an "in place" rsync.

    That updates only the stuff that has changed, moving the original data
    into the backup snapshot you've just made.

    If you need to restore, you can just mount the backup, and copy stuff
    out of it.

    That way, you've now got two full backups, but the old backup is
    actually just storing a diff from the new one - a bit like git actually
    only stores the latest and diffs, recreating older versions on demand.

    From what I gather, you can also revert, which just merges the snapshot
    back in deleting all the new stuff.

    The one worry, as far as I can tell, is that if a snapshot overflows it
    just gets dropped and lost, so sizing is crucial. Although if you're
    backing up, you may be able to over-provision, and then shrink it once
    you've done the backup.

    For what I want it for - snapshotting before I run an emerge - guessing
    the size is far more important because I don't want the snapshot to be
    dropped half-way through the emerge!
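
For the sizing worry, the snapshot's fill level can be watched and the
snapshot grown before it overflows; something like (names invented, untested):

    # Data% shows how full each snapshot's copy-on-write area is
    lvs -o lv_name,lv_size,data_percent vg0
    # give a tight snapshot more room
    lvextend --size +5G /dev/vg0/pre_emerge_snap
    # or revert: merge the snapshot back into the origin
    lvconvert --merge /dev/vg0/pre_emerge_snap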

    Cheers,
    Wol

  • From Rich Freeman@21:1/5 to antlists@youngman.org.uk on Thu Sep 30 01:20:01 2021
    On Wed, Sep 29, 2021 at 5:48 PM Wols Lists <antlists@youngman.org.uk> wrote:

    An LVM snapshot creates a "copy on write" image. I'm just beginning to
    dig into it myself, but I agree it's a bit confusing.

    So, snapshots in general are a solution for making backups atomic.
    That is, they allow a backup to look as if the entire backup was taken
    in an instant.

    The simplest way to accomplish that is via offline backups. Unmount
    the drive, mount it read-only, then perform a backup. That guarantees
    that nothing changes between the time the backup starts/stops. Of
    course, it also can mean that for many hours you can't really use the
    drive.

    Snapshots let you cheat. They create two views of the drive - one
    that can be used normally, and one which is a moment-in-time snapshot
    of what the drive looked like. You backup the snapshot, and you can
    use the regular drive.

    With something like LVM you probably want to unmount the filesystem
    before snapshotting it. Otherwise it is a bit like just pulling the
    plug on the PC before doing an offline backup - the snapshot will be
    in a dirty state with things in various state of modification. This
    is because LVM knows nothing about your filesystem - when you run the
    snapshot it just captures the state of every block on the disk in that
    instant, which means dirty caches/etc (the filesystem caches or file
    buffers wouldn't be covered, because those are above the LVM layer,
    but caches below LVM would be covered, like the physical disk cache).

    Some filesystems like btrfs and zfs also support snapshotting. Those
    are much safer to snapshot while online, since the filesystem is aware
    of everything going on. Of course even this will miss file buffers.
    The cleanest way to snapshot anything is to ensure the filesystem is
    quiescent. This need only be for a few seconds while snapshotting
    though, and then you can go right back to using it.
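
On Linux the quiescing itself can be as simple as wrapping fsfreeze around
whatever snapshot command is underneath (the mountpoint is an example):

    fsfreeze --freeze /srv/data      # block new writes and flush what's in flight
    # ... take the snapshot here (LVM, btrfs, zfs, ...) ...
    fsfreeze --unfreeze /srv/data    # seconds later, back to normal use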

    If you do a full backup on a snapshot, then if you restore that backup
    you'll get the contents of the filesystem at the moment the snapshot
    was taken.

You can also use snapshots themselves as a sort-of backup. Of course
those are on the same physical disk, so they only protect you against
some types of failures. Often there is a way to serialize a snapshot,
    perhaps incrementally - I know zfs and btrfs both allow this. This
    can be a super-efficient way to create backup files since the
    filesystem metadata can be used to determine exactly what blocks to
    back up with perfect reliability. The downside of this is that you
    can only restore to the same filesystem - it isn't a simple file-based
    solution like with tar/etc.
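
With btrfs, that serialized-snapshot idea looks roughly like this (subvolume
and target paths invented):

    # only read-only snapshots can be sent
    btrfs subvolume snapshot -r /data /data/.snap-week39
    # first run: stream the whole snapshot to the backup filesystem
    btrfs send /data/.snap-week39 | btrfs receive /mnt/backup
    # later runs: send only the difference against the previous snapshot
    btrfs subvolume snapshot -r /data /data/.snap-week40
    btrfs send -p /data/.snap-week39 /data/.snap-week40 | btrfs receive /mnt/backup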

    So as to not do a second reply I want to address an earlier question,
    regarding backups that span multiple volumes. This is a need I have,
    and I haven't found a simple tool that does it well. I am VERY
    interested in suggestions here. Tar I think can support something
    like this, since it is common for tape backups to span multiple
    volumes. I have no idea how clean that is. Most tools want you to
    use LVM or a union filesystem or whatever to combine all the drives
    into a single mountpoint and let linux manage the spanning. I'm using
    bacula which can sort-of handle multiple volumes, but not very
    cleanly, and it is an overly complex tool for backing up one system.
    I wouldn't really endorse it wholeheartedly, but it can be made to
    work.

    --
    Rich

  • From antlists@21:1/5 to Rich Freeman on Thu Sep 30 19:20:01 2021
    On 30/09/2021 00:17, Rich Freeman wrote:
    On Wed, Sep 29, 2021 at 5:48 PM Wols Lists<antlists@youngman.org.uk> wrote:
    An LVM snapshot creates a "copy on write" image. I'm just beginning to
    dig into it myself, but I agree it's a bit confusing.

    So, snapshots in general are a solution for making backups atomic.
    That is, they allow a backup to look as if the entire backup was taken
    in an instant.

    The simplest way to accomplish that is via offline backups. Unmount
    the drive, mount it read-only, then perform a backup. That guarantees
    that nothing changes between the time the backup starts/stops. Of
    course, it also can mean that for many hours you can't really use the
    drive.

    Snapshots let you cheat. They create two views of the drive - one
    that can be used normally, and one which is a moment-in-time snapshot
    of what the drive looked like. You backup the snapshot, and you can
    use the regular drive.

    Yup. I'm planning to configure systemd to do most of this for me. As a
    desktop system it goes up and down, so the plan is a trigger will fire
    midnight fri/sat, and the first time it gets booted after that, a
    snapshot will be taken before fstab is run.

    Then I'll have backups of /home and /. I won't keep many root backups,
    but I'll keep /home until I run out of space.

    And that's why I suggested if you want a separate backup rather than a collection of snapshots, you snapshot the backup and use in-place rsync.
    Of course that still means you need to quiesce the bit you're copying,
    but you could back it up piecemeal.

    Cheers,
    Wol

  • From Frank Steinmetzger@21:1/5 to All on Sat Oct 2 00:40:02 2021
On Wed, Sep 29, 2021 at 03:04:41PM -0500, Dale wrote:

    Curious question here.  As you may recall, I backup to a external hard drive.  Would it make sense to use that software for a external hard
    drive?

    Since you are using LVM for everything IIRC, it would be a very efficient
    way for you to make incremental backups with snapshots. But I have no
    knowledge in that area to give you hints.


    But I do use Borg. It’s been my primary backup tool for my systems for
    almost two years now. Before that I used rsnapshot (i.e. rsync with
versioning through hard links) for my home partition and simple rsync for the data partition. Rsnapshot is quite slow, because it has to compare at least
    the inodes of all files on the source and destination. Borg uses a cache,
    which speeds things up drastically.

I have one Borg repo for the root fs, one for ~ and one for the data
partition, and each repo receives that partition from two different hosts,
which have most of their data mirrored daily with Unison. A tool like
Borg can deduplicate all of that and create snapshots of it. This saves
oodles of space, but also allows me to restore an entire host with a simple rsync from a mounted Borg repo. (Only downside: no hardlink support, AFAIK.)
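
For anyone curious, the basic cycle is short (repo path and source paths are
examples; I skip Borg's own encryption because of the LUKS underneath, see below):

    # one-time repository setup on the backup drive
    borg init --encryption=none /mnt/backup/borg-data
    # daily archive; unchanged chunks are deduplicated against the whole repo
    borg create --stats --compression lz4 /mnt/backup/borg-data::'{hostname}-{now}' /home/frank/data
    # browse or restore through FUSE
    borg mount /mnt/backup/borg-data /mnt/borg
    # thin out old archives
    borg prune --keep-daily 7 --keep-weekly 4 --keep-monthly 12 /mnt/backup/borg-data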

    Borg saves its data in 500 MB files, which makes it very SMR-friendly. Rsnapshot will create little holes in the backup FS over time with the
    deletion of old snapshots. And as we all know, this will bring SMR drives
    down to a crawl. If you back-up only big video files, then this may not be a huge problem. But it will be with the ~ partition, with its thousands of
    little files. In Borg, little changes do not trickle down to many random writes. If a data file becomes partially obsolete, it is rewritten into a
    new file and the old one deleted as a whole. Thanks to that, I have no worry using 2.5″ 4 TB drives as main backup drive (as we all know, everything 2.5″
    above 1 TB is SMR).

    Those big data files also make it very efficient to copy a Borg repo (for example to mirror the backup drive to another drive for off-site storage), because it uses a very small number of files itself:

$ borg info data
...
------------------------------------------------------------------------------
                       Original size      Compressed size    Deduplicated size
All archives:               18.09 TB             17.60 TB              1.23 TB

                       Unique chunks         Total chunks
Chunk index:                  722096             10888890

    $ find data -type f | wc -l
    2498

    I have 21 snapshots in that repo, which amount to 18 TB of backed-up data, deduped down to 1.23 TB, spread over only 2498 files including metadata.

    Right now, I'm just doing file updates with rsync and the drive
    is encrypted.

While Borg has an encryption feature, I chose not to use it and rely on the underlying LUKS instead, because then I can use KDE GUI stuff to mount the drive and run my Borg wrapper script without ever having to enter a passphrase.

    Thing is, I'm going to have to split into three drives soon.  So, compressing may help. Since it is video files, it may not help much but
    I'm not sure about that.

    Of my PC’s data partition, almost 50 % is music, 20 % is my JPEG pictures library, 15 % is video files and the rest is misc stuff like Kerbal Space Program, compressed archives of OpenStreetMap files and VM images.
These are the statistics of my last snapshot:

    Original size Compressed size Deduplicated size
    730.80 GB 698.76 GB 1.95 MB

Compression gain is around 4 %, much of which probably comes from empty
areas in VM images and 4 GB of pdf and html files. On my laptop, whose data partition has less VM stuff but a lot more videos, it looks thus:

    Original size Compressed size Deduplicated size
    1.01 TB 1.00 TB 1.67 MB

    So only around 1 % of savings. However, compression is done using lz4 (by default, you can choose other algos), which is extremely fast but not very strong. In fact, Borg tries to compress all chunks, but if it finds that compressing a chunk doesn’t yield enough benefit, it actually discards it
    and uses the uncompressed data to save on CPU load later on.

    --
    Grüße | Greetings | Qapla’
    Please do not share anything from, with or about me on any social network.

    Some people are so tired, they can’t even stay awake until falling asleep.
