• archiver for a rolling history of a file with sparse changes

    From Phil Carmody@21:1/5 to All on Thu Aug 15 17:50:22 2019
    I'm generating a couple of megs of (HTML) data per day and the data
    really doesn't change that much from day to day. Is there an archiver
    which will store the complete history of a file, taking advantage of
    the knowledge of the previous contents of the file?

    I was hoping ZPAQ would do the job, as it's designed to archive files'
    full histories, but I'm now convinced it doesn't make use of that knowledge.
    E.g. one 1.2MB file has a day-to-day diff of ~200-600KB, of which about
    half is removed content, so effectively noise, leaving roughly ~100-300KB
    of genuinely new data. Yet every subsequent day's version I've added to
    the ZPAQ archive has expanded it by almost exactly as much as the first
    day's did. I'm sure deltas that are 1/10-1/4 the size of the file should
    compress to roughly 1/10-1/4 of the size of the compressed file, as
    they're effectively the same type of data.
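
    (For concreteness, the daily update is just an append to the same
    journaling archive, roughly like the following; the file and archive
    names here are only placeholders:)

      $ zpaq add history.zpaq report.html -m4   # append today's version
      $ zpaq list history.zpaq                  # one dated version per day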

    Any ideas what would be a suitable program to use?

    FOSS on Linux preferred; happy to compile from source.

    Phil
    --
    We are no longer hunters and nomads. No longer awed and frightened, as we have gained some understanding of the world in which we live. As such, we can cast aside childish remnants from the dawn of our civilization.
    -- NotSanguine on SoylentNews, after Eugen Weber in /The Western Tradition/

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Keith Thompson@21:1/5 to Phil Carmody on Thu Aug 15 10:55:57 2019
    Phil Carmody <pc+usenet@asdf.org> writes:
    > I'm generating a couple of megs of (HTML) data per day and the data
    > really doesn't change that much from day to day. Is there an archiver
    > which will store the complete history of a file, taking advantage of
    > the knowledge of the previous contents of the file?

    Any decent source control system (Git, CVS, RCS, etc.) should do the
    job. CVS or RCS would do fairly well if the changes can be represented
    compactly as line-oriented diffs.
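
    For instance, a minimal daily routine with git might look something like
    this (the file and repository names are only illustrative):

      $ git init html-history && cp report.html html-history/
      $ cd html-history && git add report.html
      $ git commit -m "snapshot $(date +%F)"
      $ git gc    # occasionally repack loose objects into deltified packs

    With RCS, something like "ci -l -m'daily snapshot' report.html" would do
    the same for a single file; it keeps the latest text in full plus reverse
    line-oriented diffs for the older revisions.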

    --
    Keith Thompson (The_Other_Keith) kst-u@mib.org <http://www.ghoti.net/~kst>
    Will write code for food.
    void Void(void) { Void(); } /* The recursive call of the void */

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Phil Carmody@21:1/5 to Keith Thompson on Mon Aug 19 18:59:48 2019
    Keith Thompson <kst-u@mib.org> writes:
    > Phil Carmody <pc+usenet@asdf.org> writes:
    >> I'm generating a couple of megs of (HTML) data per day and the data
    >> really doesn't change that much from day to day. Is there an archiver
    >> which will store the complete history of a file, taking advantage of
    >> the knowledge of the previous contents of the file?
    >
    > Any decent source control system (Git, CVS, RCS, etc.) should do the
    > job. CVS or RCS would do fairly well if the changes can be represented
    > compactly as line-oriented diffs.

    I'm giving git a go, and after the occasional git gc it does seem to be
    neck and neck with zpaq -m4, but I think -m5 would clearly beat it. zpaq
    doesn't seem able to make use of any similarities between the new and the
    old files; when I change the fragment parameter for deduplication,
    compression gets worse. A quick test of zipping up a set of hand-mangled
    patches (stripped of anything I know shouldn't be needed to reproduce
    each version) came out worse than what git does internally, so that's
    probably a dead end. I'll keep both the zpaq and git archives updating
    daily until one appears to be a clear winner.
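
    In case it's useful to anyone else, raw disk usage is the obvious way to
    compare the two, and git can be squeezed a bit harder than a plain gc
    (the archive and repository names are only illustrative):

      $ du -sh history.zpaq html-history/.git                    # sizes so far
      $ git -C html-history repack -adf --window=250 --depth=250
      $ git -C html-history count-objects -vH                    # size after repacking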

    Cheers,
    Phil
    --
    We are no longer hunters and nomads. No longer awed and frightened, as we have gained some understanding of the world in which we live. As such, we can cast aside childish remnants from the dawn of our civilization.
    -- NotSanguine on SoylentNews, after Eugen Weber in /The Western Tradition/

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)