• djb2 hash using awk

    From Michael Sanders@21:1/5 to All on Fri Sep 22 12:50:09 2023
    Greetings folks, just sharing the knowledge (& test posting using google groups).

    Beware wordwrap...

    function djb2(str, hash, i, char) {

    # initialize hash to an arbitrary prime number
    hash = 5381
    n = length(str)

    # iterate over characters and compute hash
    for(i = 1; i <= n; i++) {
    char = substr(str, i, 1)
    # modulo simulates 32-bit unsigned integer
    hash = (hash * 33 + ord(char)) % 2147483647
    }

    return hash
    }

    function ord(char) {

    return (char ~ /[[:digit:]]/ ? int(char) : int(ord[char]))

    }

    BEGIN {

    for(i=32;i<=126;i++) ord[sprintf("%c", i)] = i

    print djb2("my key is...")

    }

    notes:

    https://stackoverflow.com/questions/1579721/why-are-5381-and-33-so-important-in-the-djb2-algorithm

    https://stackoverflow.com/questions/10696223/reason-for-the-number-5381-in-the-djb-hash-function

    https://groups.google.com/g/comp.lang.c/c/lSKWXiuNOAk

    https://en.wikipedia.org/wiki/Daniel_J._Bernstein

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Janis Papanagnou@21:1/5 to Michael Sanders on Sat Sep 23 10:45:06 2023
    On 22.09.2023 21:50, Michael Sanders wrote:

    Greetings folks, just sharing the knowledge (& test posting using google groups).

    Hello,

    the post through Google groups seems technically fine - it certainly
    wasn't in the past -, only Usenet posting conventions seem violated
    (line length).

    But the Awk code rises a couple questions...

    Beware wordwrap...

    function djb2(str, hash, i, char) {

    # initialize hash to an arbitrary prime number
    hash = 5381
    n = length(str)

    Variable 'n' not declared as local (just for good measure, and in
    case you want to use the code in other code contexts).


    # iterate over characters and compute hash
    for(i = 1; i <= n; i++) {
    char = substr(str, i, 1)
    # modulo simulates 32-bit unsigned integer

    The previous comment suggests that the subsequent code should use
    the constant 2147483648 if you want a 32-bit unsigned integer. On
    the other hand such formulas often use primes (and 2147483647 is
    one). So it leaves me with uncertainty what is actually intended.

    hash = (hash * 33 + ord(char)) % 2147483647

    Is it guaranteed that the expression evaluates correctly in all
    awks? The intermediate results can become as large as 70866960411 (0x108000001B, which is larger than 32 bit), which needs to be
    representable in awk (not sure what POSIX awk defines/requires
    here).

    }

    return hash
    }

    function ord(char) {

    return (char ~ /[[:digit:]]/ ? int(char) : int(ord[char]))

    Why use this costly pattern match and conditional expression
    if you can anyway simply access the ord[] array where all data
    is already present?

    (I would also try to avoid name clashes like ord() and ord[],
    BTW. Here it's simple, because the function is superfluous;
    just replace the ord(char) at the calling side by ord[char]
    and remove the function completely.)


    }

    BEGIN {

    for(i=32;i<=126;i++) ord[sprintf("%c", i)] = i

    What will the code produce with, say, "extended" ASCII, or the
    common ISO 8859-x family of character sets? Is there any reason
    to restrict it here?

    What are the criteria for the chosen character subset? How will
    or how should a TAB character in the data change the result?


    print djb2("my key is...")

    }

    As I said, just a couple of more or less obvious questions.

    Janis

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Manuel Collado@21:1/5 to All on Sat Sep 23 10:54:25 2023
    There is something wrong:

    $ LC_ALL=C gawk 'function ord(char) {
    return (char ~ /[[:digit:]]/ ? int(char) : int(ord[char]))
    }

    BEGIN {print ord("x")}'
    gawk: cmd. line:3: error: function `ord' called with space between name
    and `(',
    or used as a variable or an array

    Please note ord[char] vs. ord(char)

    HTH

    El 22/9/23 a las 21:50, Michael Sanders escribió:

    Greetings folks, just sharing the knowledge (& test posting using google groups).

    Beware wordwrap...

    function djb2(str, hash, i, char) {

    # initialize hash to an arbitrary prime number
    hash = 5381
    n = length(str)

    # iterate over characters and compute hash
    for(i = 1; i <= n; i++) {
    char = substr(str, i, 1)
    # modulo simulates 32-bit unsigned integer
    hash = (hash * 33 + ord(char)) % 2147483647
    }

    return hash
    }

    function ord(char) {

    return (char ~ /[[:digit:]]/ ? int(char) : int(ord[char]))

    }

    BEGIN {

    for(i=32;i<=126;i++) ord[sprintf("%c", i)] = i

    print djb2("my key is...")

    }

    notes:

    https://stackoverflow.com/questions/1579721/why-are-5381-and-33-so-important-in-the-djb2-algorithm

    https://stackoverflow.com/questions/10696223/reason-for-the-number-5381-in-the-djb-hash-function

    https://groups.google.com/g/comp.lang.c/c/lSKWXiuNOAk

    https://en.wikipedia.org/wiki/Daniel_J._Bernstein

    --
    Manuel Collado - http://mcollado.z15.es

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Janis Papanagnou@21:1/5 to Janis Papanagnou on Sat Sep 23 10:48:57 2023
    On 23.09.2023 10:45, Janis Papanagnou wrote:
    On 22.09.2023 21:50, Michael Sanders wrote:
    [...]
    # modulo simulates 32-bit unsigned integer

    The previous comment suggests that the subsequent code should use
    the constant 2147483648 if you want a 32-bit unsigned integer. [...]

    Erm..., _unsigned_ 32 bit integer would of course mean 4294967296.

    Janis

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael Sanders@21:1/5 to Janis Papanagnou on Sat Sep 23 04:39:05 2023
    On Saturday, September 23, 2023 at 3:45:10 AM UTC-5, Janis Papanagnou wrote:

    Variable 'n' not declared as local (just for good measure, and in
    case you want to use the code in other code contexts).

    Aye, I spotted this *after* I posted (of course...). Updated
    function signature below:

    function djb2(str, hash, n, i, char)

    Is it guaranteed that the expression evaluates correctly in all
    awks? The intermediate results can become as large as 70866960411 (0x108000001B, which is larger than 32 bit), which needs to be
    representable in awk (not sure what POSIX awk defines/requires
    here).

    Huh? From my initial post in this thread:

    # modulo simulates 32-bit unsigned integer
    hash = (hash * 33 + ord(char)) % 2147483647

    Modulo wraps at 2147483647, it'll never outgrow
    it limits. Maybe I'm not grokking what you meant.

    And down-thread you remarked:

    Erm..., _unsigned_ 32 bit integer would of course mean 4294967296

    That's the max, but certainly not the only possibility...

    [...]

    (I would also try to avoid name clashes like ord() and ord[],
    BTW. Here it's simple, because the function is superfluous;
    just replace the ord(char) at the calling side by ord[char]
    and remove the function completely.)

    And yet, this rational assumes an extension to the language,
    which implies *only* gawk though right? What about the *standard*
    awks (tested dbj2() using only mawk & busybox's awk thus far -
    the smallest implementations out there). Still, have you by chance
    invoked gawk as:

    gawk --traditional

    I cant on my end as the hard drive died on porkchop.bsd,
    my OpenBSD box...

    What will the code produce with, say, "extended" ASCII, or the
    common ISO 8859-x family of character sets? Is there any reason
    to restrict it here?

    What are the criteria for the chosen character subset? How will
    or how should a TAB character in the data change the result?

    No issues with TAB (see my function ord()), & non 7bit character
    sets? Dunno... only need lower ASCII here, hence the hard-coded
    limit:

    min: 32, max: 126

    If so compelled, test other sets & post your findings here.
    Would be interested to read what you come up with.

    As I said, just a couple of more or less obvious questions.

    Thanks Janis, as always, appreciate your insights. Hoping to get
    back to using tin as my news reader soon, the google groups
    interface its perhaps not my cup of tea. :/

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael Sanders@21:1/5 to Manuel Collado on Sat Sep 23 04:42:52 2023
    Hi Manuel.

    Question, have you tried: --traditional or --posix options?

    See also:

    https://www.gnu.org/software/gawk/manual/html_node/Compatibility-Mode.html

    On Saturday, September 23, 2023 at 3:54:29 AM UTC-5, Manuel Collado wrote:

    There is something wrong:

    $ LC_ALL=C gawk 'function ord(char) {
    return (char ~ /[[:digit:]]/ ? int(char) : int(ord[char]))
    }
    BEGIN {print ord("x")}'
    gawk: cmd. line:3: error: function `ord' called with space between name
    and `(',
    or used as a variable or an array

    Please note ord[char] vs. ord(char)

    HTH

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael Sanders@21:1/5 to Janis Papanagnou on Sat Sep 23 05:25:39 2023
    On Saturday, September 23, 2023 at 3:45:10 AM UTC-5, Janis Papanagnou wrote:

    (I would also try to avoid name clashes like ord() and ord[],
    BTW. Here it's simple, because the function is superfluous;
    just replace the ord(char) at the calling side by ord[char]
    and remove the function completely.)

    One more quick question... does gawk have an ord() builtin
    function or not?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Janis Papanagnou@21:1/5 to Michael Sanders on Sat Sep 23 15:21:10 2023
    On 23.09.2023 14:25, Michael Sanders wrote:
    On Saturday, September 23, 2023 at 3:45:10 AM UTC-5, Janis Papanagnou wrote:

    (I would also try to avoid name clashes like ord() and ord[],
    BTW. Here it's simple, because the function is superfluous;
    just replace the ord(char) at the calling side by ord[char]
    and remove the function completely.)

    One more quick question... does gawk have an ord() builtin
    function or not?

    Not that I know of. - I seem to recall that when I once needed that
    I implemented an array (as you did). - There's a chance that GNU Awk
    has something in some extension libraries (but I'm too lazy to check).

    Janis

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Janis Papanagnou@21:1/5 to Michael Sanders on Sat Sep 23 15:17:32 2023
    On 23.09.2023 13:39, Michael Sanders wrote:
    On Saturday, September 23, 2023 at 3:45:10 AM UTC-5, Janis Papanagnou wrote: [...]
    Is it guaranteed that the expression evaluates correctly in all
    awks? The intermediate results can become as large as 70866960411
    (0x108000001B, which is larger than 32 bit), which needs to be
    representable in awk (not sure what POSIX awk defines/requires
    here).

    Huh? From my initial post in this thread:

    # modulo simulates 32-bit unsigned integer
    hash = (hash * 33 + ord(char)) % 2147483647

    Modulo wraps at 2147483647, it'll never outgrow
    it limits. Maybe I'm not grokking what you meant.

    With this formula the hash variable may (because of the modulus)
    get values up to 2147483647-1, so within the expression the value
    may grow beyond that limit to (2147483647-1)*33+126 because of
    the previous (potentially large) hash value calculated.

    (There's a good chance that modern systems and languages calculate
    that without problem, but if we just assume "32-bit integer" then
    it will (typically implicitly) overflow and produce wrong results.
    A comment in the code may be useful and is often sufficient here,
    so that folks using/porting the code to their environment may have
    a visible caveat and may check that. - Myself I haven't inspected
    the POSIX specs on that, that's why I formulated it as question.)

    [...]

    (I would also try to avoid name clashes like ord() and ord[],
    BTW. Here it's simple, because the function is superfluous;
    just replace the ord(char) at the calling side by ord[char]
    and remove the function completely.)

    And yet, this rational assumes an extension to the language,
    which implies *only* gawk though right? [...]

    Maybe I misunderstand you. My suggestion here is not relying on
    extensions; it's basically just a change of the expression '#<<'

    for(i = 1; i <= n; i++) {
    char = substr(str, i, 1)
    hash = (hash * 33 + ord[char]) % 2147483647 #<<
    }

    # function ord(char) ## deleted

    # once in the BEGIN section (as you've done it)
    for(i=32;i<=126;i++) ord[sprintf("%c", i)] = i


    What will the code produce with, say, "extended" ASCII, or the
    common ISO 8859-x family of character sets? Is there any reason
    to restrict it here?

    What are the criteria for the chosen character subset? How will
    or how should a TAB character in the data change the result?

    No issues with TAB (see my function ord()), & non 7bit character
    sets? Dunno... only need lower ASCII here, hence the hard-coded
    limit:

    min: 32, max: 126

    I understand that. My question with TAB was aiming at what this
    sample text should calculate

    Hello<Blank>brave<Blank>new<Tab>World!<Newline>

    Shall the Blank be part of the hash but the Tab not? - Appears to
    me to be inconsistent, and if it is "as designed" it should have
    an explicit and clear comment on that difference (and also that
    (and maybe also why) e.g. [other] control characters [deliberately]
    evaluate to 0).


    If so compelled, test other sets & post your findings here.

    Not sure what that means - by "sets" you mean "code-sets"? - ...

    Would be interested to read what you come up with.

    ...if so then it would be simple; I'd include the whole range of
    8-bit characters (including control characters), i.e. 0..255 to
    cover all octet streams (and take also means to not exclude the
    \n, or RS and FS in Awk parlance).

    As I said, just a couple of more or less obvious questions.

    Thanks Janis, as always, appreciate your insights. Hoping to get
    back to using tin as my news reader soon, the google groups
    interface its perhaps not my cup of tea. :/

    In your private environment at least you are free to choose. The
    horror is when a company forces you to switch from Unix to Windows,
    when all applications are getting Web based, and if you're not
    allowed to install "local" tools.

    Janis

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Ben Bacarisse@21:1/5 to Michael Sanders on Sat Sep 23 22:29:34 2023
    Michael Sanders <chrome.tty@gmail.com> writes:

    Greetings folks, just sharing the knowledge (& test posting using
    google groups).

    Beware wordwrap...

    function djb2(str, hash, i, char) {

    # initialize hash to an arbitrary prime number
    hash = 5381
    n = length(str)

    # iterate over characters and compute hash
    for(i = 1; i <= n; i++) {
    char = substr(str, i, 1)
    # modulo simulates 32-bit unsigned integer
    hash = (hash * 33 + ord(char)) % 2147483647

    This modulus does nor do what you claim you want to do (as has been
    pointed out). To simulate 32-bit unsigned arithmetic you need

    hash = (hash * 33 + ord(char)) % 4294967296

    here. Mind you, there is no formal definition of what the DJB2 hash is.
    The only reliable reference is a C function that uses unsigned long int calculations. This could be 32-bits wide, but is often in wider.

    }

    return hash
    }

    function ord(char) {

    return (char ~ /[[:digit:]]/ ? int(char) : int(ord[char]))

    What version of AWK, run with what flags, accepts this? I can't get it
    past gawk, nawk, mawk nor original-awk.

    }

    BEGIN {

    for(i=32;i<=126;i++) ord[sprintf("%c", i)] = i

    print djb2("my key is...")

    }

    --
    Ben.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael Sanders@21:1/5 to Ben Bacarisse on Sat Sep 23 19:20:51 2023
    On Saturday, September 23, 2023 at 4:29:40 PM UTC-5, Ben Bacarisse wrote:

    This modulus does nor do what you claim you want to do (as has been
    pointed out).

    No sir, it does, in fact.

    # modulo simulates 32-bit unsigned integer

    https://en.wikipedia.org/wiki/2,147,483,647

    https://www.bing.com/search?FORM=U523DF&PC=U523&q=is+this+2147483647+a+32-bit+number

    https://www.google.com/search?q=is+this+2147483647+a+32-bit+number

    What version of AWK, run with what flags, accepts this? I can't get it
    past gawk, nawk, mawk nor original-awk.

    https://gnuwin32.sourceforge.net/packages/mawk.htm

    https://frippery.org/busybox/

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael Sanders@21:1/5 to Michael Sanders on Sat Sep 23 19:50:41 2023
    On Friday, September 22, 2023 at 2:50:10 PM UTC-5, Michael Sanders wrote:

    Greetings folks, just sharing the knowledge (& test posting using google groups).

    Note to self: built a version for gawk & tested online here:

    https://ideone.com/ZZVQl2

    script follows:

    function djb2gawk(str, n, i, hash, ascii) {

    hash = 5381
    n = length(str)

    for(i = 1; i <= n; ++i) {
    char = substr(str, i, 1)
    ascii = alphanum(char)
    hash = (hash * 33 + ascii) % ((2^32) - 1)
    }

    return hash
    }

    function alphanum(char) {

    return index("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789", char)
    }

    BEGIN {
    str = "comp.lang.awk"
    print djb2gawk(str)
    }

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Janis Papanagnou@21:1/5 to Michael Sanders on Sun Sep 24 06:38:59 2023
    On 24.09.2023 04:20, Michael Sanders wrote:
    On Saturday, September 23, 2023 at 4:29:40 PM UTC-5, Ben Bacarisse wrote:

    This modulus does nor do what you claim you want to do (as has been
    pointed out).

    No sir, it does, in fact.

    # modulo simulates 32-bit unsigned integer

    For unsigned ints with a range that is implementable by binary words
    x % 4294967296 (modulo) is equivalent to x & 4294967295 (bitwise-and),
    x % 2147483648 (modulo) is equivalent to x & 2147483647 (bitwise-and).
    (Mind the differing last digit of modulo and bitwise-and expressions!)
    The expressions on the first line are for 32-bit unsigned integer and
    the expressions on the second line are for 31-bit unsigned integer calculations.

    Your sample (the fact I pointed out in an earlier post and Ben repeated
    here) uses x % 2147483647 which doesn't match a "32-bit simulation".
    (Compare it with the four forms above and think about it.)

    To add something new with this post I want to point out that a modulus
    with primes (like 2147483647, a Mersenne prime) is regularly done e.g.
    in random number generators or in cryptography. In these areas you may
    have large numbers x and a comparable small prime modulus. Since you
    don't have such large registers for the x you can iterate over parts
    of the number sequentially (using a loop) to perform the operation.
    This insight can also be applied in case of the issue you have with
    your original code where internal overflows can arise; in case that
    the prime number _2147483647_ was *intended* (as opposed to 31/32 bit
    unsigned integer operations) you can avoid these computation errors
    by that iterative method (here, in fact, a simple split of x in only
    two parts would suffice for the "iteration").

    Janis

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Manuel Collado@21:1/5 to All on Sun Sep 24 10:27:43 2023
    El 23/9/23 a las 15:21, Janis Papanagnou escribió:
    On 23.09.2023 14:25, Michael Sanders wrote:
    On Saturday, September 23, 2023 at 3:45:10 AM UTC-5, Janis Papanagnou wrote:

    One more quick question... does gawk have an ord() builtin
    function or not?

    Not that I know of. - I seem to recall that when I once needed that
    I implemented an array (as you did). - There's a chance that GNU Awk
    has something in some extension libraries (but I'm too lazy to check).

    Yes, ord() and chr() are included in the gawk distribution as a dynamic extension.

    gawk -l ordchr 'BEGIN{print ord("x")}'
    120

    HTH
    --
    Manuel Collado - http://mcollado.z15.es

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael Sanders@21:1/5 to Janis Papanagnou on Sun Sep 24 03:30:26 2023
    On Saturday, September 23, 2023 at 11:39:04 PM UTC-5, Janis Papanagnou wrote:

    Your sample (the fact I pointed out in an earlier post and Ben repeated
    here) uses x % 2147483647 which doesn't match a "32-bit simulation".

    That's not correct & there is is no error using 2147483647, it works just
    as I intended. But elsewhere... I do think my choice of naming ord()
    & ord[] were poor, as these may be intrinsic to some versions of
    a given awk's namespace.

    (Mind the differing last digit of modulo and bitwise-and expressions!)

    See the dbj2gawk() version I posted elsewhere in this thread. But of
    course the older awks lack bitwise...

    Since you don't have such large registers...

    Exactly, hence the signed int masquerading as an unsigned int.
    We dont want to get sidetracked on that point, there are lots of
    other 32-bit numbers that can be chosen in any event (& ought to
    be to for security reasons). I took one that works with mawk,
    busybox awk. Just think about this & you'll see what I mean.

    in case that the prime number _2147483647_ was *intended*...

    Yes, intended.

    Thanks Janis.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael Sanders@21:1/5 to Manuel Collado on Sun Sep 24 03:31:44 2023
    On Sunday, September 24, 2023 at 3:27:47 AM UTC-5, Manuel Collado wrote:

    Yes it does help. Earnest thanks Manuel!

    Yes, ord() and chr() are included in the gawk distribution as a dynamic extension.

    gawk -l ordchr 'BEGIN{print ord("x")}'
    120

    HTH

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Janis Papanagnou@21:1/5 to Michael Sanders on Sun Sep 24 15:28:02 2023
    On 24.09.2023 12:30, Michael Sanders wrote:
    On Saturday, September 23, 2023 at 11:39:04 PM UTC-5, Janis Papanagnou wrote:

    Your sample (the fact I pointed out in an earlier post and Ben repeated
    here) uses x % 2147483647 which doesn't match a "32-bit simulation".

    That's not correct & there is is no error using 2147483647, it works just
    as I intended.

    I cannot see how "it works" if it is not "working" for the valid
    31-bit number 2147483647; the modulus 2147483647 % 2147483647 == 0
    thus it doesn't "simulate" 31-bit arithmetic (and also not 32-bit
    arithmetic) with that modulus. But maybe you just have a different understanding when speaking about unsigned integer "simulation".

    For an operation x % 2147483647 you are (as I previously mentioned,
    similar as in some coding theory applications) doing an arithmetic
    reduction (or "folding") of longer numbers to smaller ones. That's
    fine per se, but you are not "simulating" unsigned integer math. If
    you wanted to say that you are restricting computation and numeric
    output to 31 bit values then it's okay. With the already mentioned
    flaw that the expressions exceeds 31 and 32 bit in the expression,
    so the math processor needs to support larger integers or FP math
    with sufficient resolution in the mantissa. Which is another point
    why your expression is not [reliably] working [correctly] within a
    32 bit range. To illustrate let's assume that the expression will
    (as shown to be possible) overflow the 32 bit, then the result will
    effectively have implicitly undergone an implicit x % 2147483648
    operation and then an explicit x % 2147483647 operation; together
    res = x % 2147483648 % 2147483647. By this there is algorithmically
    a discontinuity introduced which is almost always not desired.

    But elsewhere... I do think my choice of naming ord()
    & ord[] were poor, as these may be intrinsic to some versions of
    a given awk's namespace.

    (Mind the differing last digit of modulo and bitwise-and expressions!)

    See the dbj2gawk() version I posted elsewhere in this thread. But of
    course the older awks lack bitwise...

    Yes, that's why you'd use the modulus; but x % 2147483648 are
    necessary to handle all 31 bit unsigned integer values and not
    x % 2147483647.

    Notwithstanding that the use the latter may be algorithmically
    needed. Such algorithms may choose also other primes p in x % p.
    But still note the undesired introduced discontinuity mentioned
    above in case of the overflows within expressions.

    Janis

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Janis Papanagnou@21:1/5 to Janis Papanagnou on Sun Sep 24 15:37:20 2023
    On 24.09.2023 06:38, Janis Papanagnou wrote:

    To add something new with this post I want to point out that a modulus
    with primes (like 2147483647, a Mersenne prime) is regularly done e.g.
    in random number generators or in cryptography. In these areas you may
    have large numbers x and a comparable small prime modulus. Since you
    don't have such large registers for the x you can iterate over parts
    of the number sequentially (using a loop) to perform the operation. [...]

    I just found on an old disk some Awk code where I used this method
    to compute and check IBANs some years ago. Here's the diff of two
    approaches, one using external multiple-precision arithmetic using
    'bc' co-process (GNU Awk) and one working on chunks of the numeric
    data string.

    $ diff iban1 iban2
    59a60,72
    function mod(n, s, l, z, m) {
    while (length(s) > 4) {
    z = substr(s, 1, 4)
    m = z % n
    s = m substr(s, 5)
    }
    return s % n
    }

    function mod97(s) {
    return mod(97, s)
    }

    66,67c79
    < print "98 - ( " iban " % 97 )" |& "bc"
    < "bc" |& getline checksum
    ---
    checksum = 98 - mod97(iban)
    72,73c84
    < print iban " % 97 == 1" |& "bc"
    < "bc" |& getline ok
    ---
    ok = mod97(iban) == 1

    (Just for illustration purposes in case someone's interested in that
    method.)

    Janis

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Ben Bacarisse@21:1/5 to Michael Sanders on Sun Sep 24 20:45:30 2023
    Michael Sanders <chrome.tty@gmail.com> writes:

    On Sunday, September 24, 2023 at 3:27:47 AM UTC-5, Manuel Collado wrote:

    Yes it does help. Earnest thanks Manuel!

    Note that the gawk library extension 'ord' will not do what your ord
    function does, particularly with digits. But treating digits
    differently to other characters is not what DJB2 does, so the library
    extension ord will fix a bug.

    --
    Ben.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Ben Bacarisse@21:1/5 to Michael Sanders on Sun Sep 24 20:39:44 2023
    Michael Sanders <chrome.tty@gmail.com> writes:

    On Saturday, September 23, 2023 at 4:29:40 PM UTC-5, Ben Bacarisse wrote:

    This modulus does nor do what you claim you want to do (as has been
    pointed out).

    No sir, it does, in fact.

    I can only repeat that it does not. A simple test shows that it does
    not, so it is probable that you don't know what "simulates 32-bit
    unsigned integer" means.

    Positive integers are not "unsigned". 32-bit unsigned arithmetic has no
    sign representation at all, and the operations are all done modulo
    2**32.

    Even if you interpreted "32-bit unsigned integers" to mean that the
    range should be that of the positive (but signed) 32-bit integers, your
    modulus is still wrong. For that, you would need to use 2147483648 as
    the modulus, giving 0 to 2147483647 inclusive.

    # modulo simulates 32-bit unsigned integer

    https://en.wikipedia.org/wiki/2,147,483,647

    https://www.bing.com/search?FORM=U523DF&PC=U523&q=is+this+2147483647+a+32-bit+number

    https://www.google.com/search?q=is+this+2147483647+a+32-bit+number

    What do you think these links show other than the obvious and
    uncontested fact that the largest signed 32-bit number is a 32-bit
    number?

    What version of AWK, run with what flags, accepts this? I can't get it
    past gawk, nawk, mawk nor original-awk.

    https://gnuwin32.sourceforge.net/packages/mawk.htm

    What version of mawk is that? On my system:

    $ mawk -W version
    mawk 1.3.4 20200120
    Copyright 2008-2019,2020, Thomas E. Dickey
    Copyright 1991-1996,2014, Michael D. Brennan

    random-funcs: srandom/random
    regex-funcs: internal
    compiled limits:
    sprintf buffer 8192
    maximum-integer 2147483647
    $ mawk -f hash.awk
    mawk: hash.awk: line 19: syntax error at or near [
    mawk: hash.awk: line 25: syntax error at or near [

    Lines 19 and 25 have 'ord['... where ord has been previously defined as
    a function.

    https://frippery.org/busybox/

    Ah, yes, busybox awk accepts it. I wonder why? It's definitely an
    outlier. Don't use ord as the name of a function and of an array.

    --
    Ben.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael Sanders@21:1/5 to All on Sun Sep 24 14:17:30 2023
    On Sunday, September 24, 2023 at 2:39:49 PM UTC-5, Ben Bacarisse wrote:

    [...]

    Ben, listen the number wraps (or folds as stated elsewhere)
    to zero at its max using modulo. Thats obvious. I, you, Janis
    (& likely others) see that.

    Now the thing is, the comment I placed in the script was the
    word 'simulates', if you want to fritter away your time debating
    that, have at it. Me? Too busy to deal with the snarky attitude.

    The scripts I posted, both dbj2() & dbj2gawk() work fine & correctly
    on my end & it seems rich to me that you cant get them to run but
    otherwise know that I'm wrong. Hmmm... None of this is the point,
    we simply want a large number (32bit - 1 as is the case) to lesson
    the chance of collisions in the hash & that's it... I cant believe
    you & Janis cant see that for what it is. Good grief, I should've known...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Janis Papanagnou@21:1/5 to Michael Sanders on Mon Sep 25 00:45:50 2023
    On 24.09.2023 23:17, Michael Sanders wrote:
    [...]

    You are right, we all wasted our time.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kenny McCormack@21:1/5 to janis_papanagnou+ng@hotmail.com on Sun Sep 24 23:36:49 2023
    In article <ueqe6u$1i163$1@dont-email.me>,
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    On 24.09.2023 23:17, Michael Sanders wrote:
    [...]

    You are right, we all wasted our time.


    And you've got no one to blame but yourself.

    You & Ben.

    It was pretty clear OP wasn't interested in the usual "human compiler" treatment, but you guys went ahead with it anyway.

    --
    The book "1984" used to be a cautionary tale;
    Now it is a "how-to" manual.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Janis Papanagnou@21:1/5 to Kenny McCormack on Mon Sep 25 02:38:02 2023
    On 25.09.2023 01:36, Kenny McCormack wrote:
    In article <ueqe6u$1i163$1@dont-email.me>,
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    On 24.09.2023 23:17, Michael Sanders wrote:
    [...]

    You are right, we all wasted our time.


    And you've got no one to blame but yourself.

    Sure.


    You & Ben.

    It was pretty clear OP wasn't interested in the usual "human compiler" treatment, but you guys went ahead with it anyway.


    The OP initially said about his interest it's "sharing the knowledge".
    Anyone tell us what that would be in this thread. - Rhetorical, don't
    bother.

    He could as well have contributed BEGIN { print 42 } . "It's working."
    And it's not rising all the questions the OP's code did and still does.

    And (with your contribution and my reply) we continue wasting our time,
    don't we, Kenny?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Ben Bacarisse@21:1/5 to Michael Sanders on Mon Sep 25 02:27:45 2023
    Michael Sanders <chrome.tty@gmail.com> writes:

    On Sunday, September 24, 2023 at 2:39:49 PM UTC-5, Ben Bacarisse wrote:

    [...]

    Ben, listen the number wraps (or folds as stated elsewhere)
    to zero at its max using modulo. Thats obvious. I, you, Janis
    (& likely others) see that.

    Indeed we do all know that.

    Now the thing is, the comment I placed in the script was the
    word 'simulates', if you want to fritter away your time debating
    that, have at it. Me? Too busy to deal with the snarky attitude.

    No, I have no interest in debating it. I thought you might want to know
    how to implement the behaviour described by the comment you wrote,
    especially as that behaviour was a key part of how the original DJB hash
    was defined.

    The scripts I posted, both dbj2() & dbj2gawk() work fine & correctly
    on my end & it seems rich to me that you cant get them to run but
    otherwise know that I'm wrong.

    Of course I can get your code to run. It's trivial to fix the bug of
    using ord for two incompatible purposes, but I find it rich that you
    think it at all odd that someone could know your code is wrong without
    running it. That's what programmers do every day.

    Hmmm... None of this is the point,
    we simply want a large number

    Both the subject line of the thread, and the links you gave in the
    original post, suggest you wanted to implement a specific hash function:
    DJB's 32-bit "multiply by 33" hash. But your code does not do that.
    For example, the DJB hash of the string "abcdef" is 4048079738, but your
    code gives 1900599327. Your hash can go no higher than 2**31 - 2, but
    the original one uses the full range of 32-bit unsigned int.

    (32bit - 1 as is the case) to lesson
    the chance of collisions in the hash & that's it...

    This is why halving the range by using only 31 bits is not a good idea.
    It's not a /terrible/ idea because 31 bits is still a huge range, but
    since 32-bit wrapping is free on most hardware, there is absolutely no
    upside to using modulo 2147483647 arithmetic for a hash.

    And your saying "32bit - 1 as is the case" means you still don't know
    what the code you wrote is really doing.

    I cant believe
    you & Janis cant see that for what it is. Good grief, I should've
    known...

    Well now you know!

    1) DJB's has hash usually uses 32-bit unsigned arithmetic. (Very old implementation might use 16-bit arithmetic, and some newer ones might
    use 64 bits.)

    2) You can do that in many modern awks with modulo 4294967296
    arithmetic.

    3) In gawk you could use its logical 'and' function along with its
    hexadecimal numbers to make it very clear what's going on

    hash = and(hash * 33 + ord(c), 0xFFFFFFFF)

    --
    Ben.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael Sanders@21:1/5 to Ben Bacarisse on Mon Sep 25 00:26:58 2023
    On Sunday, September 24, 2023 at 8:27:50 PM UTC-5, Ben Bacarisse wrote:

    For example, the DJB hash of the string "abcdef" is 4048079738, but your
    code gives 1900599327. Your hash can go no higher than 2**31 - 2, but
    the original one uses the full range of 32-bit unsigned int.

    [...]

    This is why halving the range by using only 31 bits is not a good idea.
    It's not a /terrible/ idea because 31 bits is still a huge range, but
    since 32-bit wrapping is free on most hardware, there is absolutely no
    upside to using modulo 2147483647 arithmetic for a hash.

    And your saying "32bit - 1 as is the case" means you still don't know
    what the code you wrote is really doing.

    Reader's from the future, re-read the 1st paragraph...

    Here's where good ol' Ben has (unwittingly, chuckle) proven, & in his own words mind you, the importance of using of using a unique key number.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael Sanders@21:1/5 to Michael Sanders on Mon Sep 25 12:28:24 2023
    On Friday, September 22, 2023 at 2:50:10 PM UTC-5, Michael Sanders wrote:

    Greetings folks, just sharing the knowledge (& test posting using google groups).

    For the sake of completeness, a key generator (you supply a 32bit key):

    # Michael Sanders 2023: key generator for dbj2() & dbj2gawk()
    #
    # creates a key block (25 keys, 5x5 matrix) as shown below:
    #
    # 2147483647 2147483646 2147483645 2147483644 2147483643
    # 2147483642 2147483641 2147483640 2147483639 2147483638
    # 2147483637 2147483636 2147483635 2147483634 2147483633
    # 2147483632 2147483631 2147483630 2147483629 2147483628
    # 2147483627 2147483626 2147483625 2147483624 2147483623

    BEGIN {

    KEYMAX = 2147483647 # your key's max number
    BLOCKS = 5 # number of blocks to create

    for (x = KEYMAX; x >= 0; x--) {
    printf("%010d ", x);
    if ((KEYMAX - x) % 5 == 4) {
    printf("\n");
    if (++y % 5 == 0) {
    printf("\n")
    if (--BLOCKS < 1) exit
    }
    }
    }

    }

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Ben Bacarisse@21:1/5 to Michael Sanders on Mon Sep 25 23:50:41 2023
    Michael Sanders <chrome.tty@gmail.com> writes:

    On Sunday, September 24, 2023 at 8:27:50 PM UTC-5, Ben Bacarisse wrote:

    For example, the DJB hash of the string "abcdef" is 4048079738, but your
    code gives 1900599327. Your hash can go no higher than 2**31 - 2, but
    the original one uses the full range of 32-bit unsigned int.

    [...]

    This is why halving the range by using only 31 bits is not a good idea.
    It's not a /terrible/ idea because 31 bits is still a huge range, but
    since 32-bit wrapping is free on most hardware, there is absolutely no
    upside to using modulo 2147483647 arithmetic for a hash.

    And your saying "32bit - 1 as is the case" means you still don't know
    what the code you wrote is really doing.

    Reader's from the future, re-read the 1st paragraph...

    Here's where good ol' Ben has (unwittingly, chuckle) proven, & in his own words mind you, the importance of using of using a unique key number.

    What's a key number?

    --
    Ben.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael Sanders@21:1/5 to Ben Bacarisse on Mon Sep 25 20:32:22 2023
    On Monday, September 25, 2023 at 5:50:45 PM UTC-5, Ben Bacarisse wrote:

    What's a key number?

    It doesn't matter Ben. I do appreciate your
    willingness to help despite our p*ssing contest.
    Moving forward, here's an intended usage example.

    Given a CSV file of: Smith, John, HASH

    And parsed as, oh something like:

    if ($1 == "Smith" && $2 == "John") {

    if ($3 == dbj2("policy_number")) {...}
    }

    We've gained simple a way to obfuscate small chucks
    of patient records.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Ben Bacarisse@21:1/5 to Michael Sanders on Tue Sep 26 12:20:17 2023
    Michael Sanders <chrome.tty@gmail.com> writes:

    On Monday, September 25, 2023 at 5:50:45 PM UTC-5, Ben Bacarisse wrote:

    What's a key number?

    It doesn't matter Ben.

    OK. I thought you might want to be understood.

    --
    Ben.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Mike Sanders@21:1/5 to All on Tue Sep 26 19:11:30 2023
    Well, about to bring this little project to its conclusion =)
    A special thanks (in no particular order) to:

    Janis Papanagnou, Manuel Collado, & Ben Bacarisse.

    Great insights, advice & help, appreciate you all.

    tags: awk, sh, djb2, hash, keys, crypto

    # Michael Sanders 2023: djb2 - simple string hash for awk
    #
    # https://porkchop.neocities.org/notes/djb2.txt
    #
    # usage example...
    #
    # given a CSV file of: Smith, John, HASH
    #
    # you could deploy djb2() in the following manner:
    #
    # if ($1 == "Smith" && $2 == "John" && $3 == djb2("medical_condition")) {...}
    #
    # if a unique number is desired, use one of the key generators below

    function djb2(str, hash, x, y, ascii) {

    hash = 5381
    y = length(str)

    for(x = 1; x <= y; x++) {
    ascii = substr(str, x, 1)
    hash = (hash * 33 + alphanum(ascii)) % ((2^32) - 1) # 32-bit key or
    # hash = (hash * 33 + alphanum(ascii)) % 2147483647 # generated key
    }

    return hash

    }

    function alphanum(tmp) {

    return index("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789", tmp)

    }

    BEGIN {

    print djb2("sensitive data")

    }

    # -----------------------------------------------------------------

    # Michael Sanders 2023: key generator for djb2.awk
    #
    # creates 1250 random & unique keys, where each block
    # contains 25 keys within a 5x5 matrix, sample output:
    #
    # 4044119583 2096446957 0146530746 2307162583 1145196028
    # 2074570158 0782926302 0126577871 1957612719 1264962473
    # 2768662034 3770197738 1165669763 3328823363 1160752685
    # 1609089706 2699782284 0621370497 3966710650 2024400675
    # 1453519246 1174954443 1588003602 3567230071 2877329117

    BEGIN {

    MIN = 0123456789 # min 10 digit num
    MAX = 4294967295 # max 10 digit num
    BLX = 50 # number of blocks

    seed = systime(); srand(seed)
    printf("\nSeed: %s\n\n", seed)

    for (x = 1; x <= BLX; x++) {
    for (y = 1; y <= 5; y++) {
    s = ""
    for (z = 1; z <= 5; z++) {
    do {key = int(rand() * (MAX - MIN + 1)) + MIN} while(key in keys)
    keys[key]
    s = (s == "" ? sprintf("%010d", key) : s " " sprintf("%010d", key))
    }
    printf("%s\n", s)
    }
    if (x < BLX) printf("%s", "\n")
    }
    }

    # -----------------------------------------------------------------

    # Michael Sanders 2023: key generator2 for older awks using dbj2.awk
    #
    # use this key generator instead if your awk lacks the abilty to
    # use either large integers or the systime() builtin
    #
    # creates key blocks (25 keys, 5x5 matrix) as shown below:
    #
    # 2147483647 2147483646 2147483645 2147483644 2147483643
    # 2147483642 2147483641 2147483640 2147483639 2147483638
    # 2147483637 2147483636 2147483635 2147483634 2147483633
    # 2147483632 2147483631 2147483630 2147483629 2147483628
    # 2147483627 2147483626 2147483625 2147483624 2147483623

    BEGIN {

    KEYMAX = 2147483647 # your 10 digit max number
    BLOCKS = 50 # number of blocks to create

    for (x = KEYMAX; x >= 0; x--) {
    printf("%010d ", x);
    if ((KEYMAX - x) % 5 == 4) {
    printf("\n");
    if (++y % 5 == 0) {
    printf("\n")
    if (--BLOCKS < 1) exit
    }
    }
    }

    }

    # -----------------------------------------------------------------

    notes:

    https://stackoverflow.com/questions/1579721/why-are-5381-and-33-so-important-in-the-djb2-algorithm

    https://stackoverflow.com/questions/10696223/reason-for-the-number-5381-in-the-djb-hash-function

    https://groups.google.com/g/comp.lang.c/c/lSKWXiuNOAk

    https://en.wikipedia.org/wiki/Daniel_J._Bernstein

    https://groups.google.com/g/comp.lang.awk/c/FZNpn8Sxf9k

    # eof

    --
    :wq
    Mike Sanders

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Mike Sanders@21:1/5 to Ben Bacarisse on Tue Sep 26 19:21:19 2023
    Ben Bacarisse <ben.usenet@bsb.me.uk> wrote:

    OK. I thought you might want to be understood.

    Well, I wake up in a new world everyday it seems =)

    Have look up thread somewhere or another (subject
    contains '[Complete]'.

    --
    :wq
    Mike Sanders

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Keith Thompson@21:1/5 to Mike Sanders on Tue Sep 26 14:31:40 2023
    porkchop@invalid.foo (Mike Sanders) writes:
    [...]
    function djb2(str, hash, x, y, ascii) {

    hash = 5381
    y = length(str)

    for(x = 1; x <= y; x++) {
    ascii = substr(str, x, 1)
    hash = (hash * 33 + alphanum(ascii)) % ((2^32) - 1) # 32-bit key or
    # hash = (hash * 33 + alphanum(ascii)) % 2147483647 # generated key
    }

    return hash

    }

    function alphanum(tmp) {

    return index("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789", tmp)

    }
    [...]

    Why do you call this "djb2"?

    Here's a description of the djb2 hash function from <https://www.eecs.yorku.ca/~oz/hash.html>:

    this algorithm (k=33) was first reported by dan bernstein many years
    ago in comp.lang.c. another version of this algorithm (now favored
    by bernstein) uses xor: hash(i) = hash(i - 1) * 33 ^ str[i]; the
    magic of number 33 (why it works better than many other constants,
    prime or not) has never been adequately explained.

    unsigned long
    hash(unsigned char *str)
    {
    unsigned long hash = 5381;
    int c;

    while (c = *str++)
    hash = ((hash << 5) + hash) + c; /* hash * 33 + c */

    return hash;
    }

    I see some similarities between the original djb and your hash function,
    and yours was undoubtedly inspired by djb2, but they're very much not the
    same thing. Would you agree that calling it "djb2" is misleading?

    I find it odd that your function treats all non-alphanumeric characters
    as the same. Is there a reason for that?

    --
    Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
    Will write code for food.
    void Void(void) { Void(); } /* The recursive call of the void */

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Mike Sanders@21:1/5 to Keith Thompson on Tue Sep 26 22:20:56 2023
    Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:

    Why do you call this "djb2"?

    Well, because that is what its called.

    Here's a description of the djb2 hash function from <https://www.eecs.yorku.ca/~oz/hash.html>:

    Yes, one of the links in the notes section has a reference to that url.
    Did you not read the notes?

    this algorithm (k=33) was first reported by dan bernstein many years
    ago in comp.lang.c. another version of this algorithm (now favored
    by bernstein) uses xor: hash(i) = hash(i - 1) * 33 ^ str[i]; the
    magic of number 33 (why it works better than many other constants,
    prime or not) has never been adequately explained.

    I know, intesting stuff I thought!

    unsigned long
    hash(unsigned char *str)
    {
    unsigned long hash = 5381;
    int c;

    while (c = *str++)
    hash = ((hash << 5) + hash) + c; /* hash * 33 + c */

    return hash;
    }


    Here's my take in virtually indentical take in c (uint32_t):

    uint32_t djb2(const char *str) {

    uint32_t hash = 5381;
    int c;
    while ((c = *str++))
    hash = ((hash << 5) + hash) + c; /* hash * 33 + c */

    return hash;
    }

    I see some similarities between the original djb and your hash function,
    and yours was undoubtedly inspired by djb2, but they're very much not the same thing. Would you agree that calling it "djb2" is misleading?

    Nope, wont change it either. =) I've provided my notes, so others can study it, use it, extend it, trace it back to its source if so compelled... I make
    no claims as to being its originator. Heck the *cough* notes *cough* alone demonstrate that. I feel its 'nifty' for no reason than its short & sweet.
    awk (or least I) needed this functionality. Besides, I'm hard-pressed to think anyone will mistake it as anything misleading. I reckon no good deed will
    go unpunished. Such is life on usenet...

    I find it odd that your function treats all non-alphanumeric characters
    as the same. Is there a reason for that?

    Yes, just dealing (a-z, A-Z, 0-9) here. Personally don't want to imbue
    it with extra, uneeded baggage. Feel free to modify it to suit your tastes.

    --
    :wq
    Mike Sanders

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Keith Thompson@21:1/5 to Mike Sanders on Tue Sep 26 15:38:47 2023
    porkchop@invalid.foo (Mike Sanders) writes:
    Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
    Why do you call this "djb2"?

    Well, because that is what its called.

    You mean that's what you decided to call it. I'm asking why.

    Here's a description of the djb2 hash function from
    <https://www.eecs.yorku.ca/~oz/hash.html>:

    Yes, one of the links in the notes section has a reference to that url.
    Did you not read the notes?

    I did. That's how I noticed that you've implemented a different
    algorithm than any version of the original djb2.

    [...]

    I see some similarities between the original djb and your hash function,
    and yours was undoubtedly inspired by djb2, but they're very much not the
    same thing. Would you agree that calling it "djb2" is misleading?

    Nope, wont change it either. =)
    [...]

    Noted.

    What you've posted is not djb2. You shouldn't claim that it is. You've
    made it clear that you intend to continue to do so.

    --
    Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
    Will write code for food.
    void Void(void) { Void(); } /* The recursive call of the void */

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Mike Sanders@21:1/5 to Keith Thompson on Tue Sep 26 23:12:11 2023
    Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:

    You mean that's what you decided to call it. I'm asking why.

    Well, I don't think I have an anwser that will satisfy you.

    What you've posted is not djb2. You shouldn't claim that it is. You've
    made it clear that you intend to continue to do so.

    What do you feel I ought to do to clarify the problem with this cavet:

    It wont be taken down or renamed.

    Perhaps add clarification to the document how my take only deals with alphanumeric chars? Let's work it out.

    --
    :wq
    Mike Sanders

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Keith Thompson@21:1/5 to Mike Sanders on Tue Sep 26 16:53:41 2023
    porkchop@invalid.foo (Mike Sanders) writes:
    Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
    You mean that's what you decided to call it. I'm asking why.

    Well, I don't think I have an anwser that will satisfy you.

    Then post an answer that doesn't satisfy me. So far you've said nothing.

    What you've posted is not djb2. You shouldn't claim that it is. You've
    made it clear that you intend to continue to do so.

    What do you feel I ought to do to clarify the problem

    Rename it.

    with this cavet:
    It wont be taken down or renamed.

    If you wrote a hash function that yields a 160-bit result that differs
    from the result returned by the actual sha1 function, would you release
    it under the name "sha1"? Assuming your answer is no, how is this
    different? (Admittedly djb2 isn't as well established.)

    Perhaps add clarification to the document how my take only deals with alphanumeric chars? Let's work it out.

    Work it out yourself.

    --
    Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
    Will write code for food.
    void Void(void) { Void(); } /* The recursive call of the void */

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Mike Sanders@21:1/5 to Keith Thompson on Wed Sep 27 00:49:14 2023
    Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:

    Work it out yourself.

    Will do.

    --
    :wq
    Mike Sanders

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Mike Sanders@21:1/5 to Mike Sanders on Wed Sep 27 01:32:05 2023
    Mike Sanders <porkchop@invalid.foo> wrote:

    Will do.

    https://porkchop.neocities.org/notes/mHash.txt

    --
    :wq
    Mike Sanders

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Janis Papanagnou@21:1/5 to Keith Thompson on Wed Sep 27 10:03:59 2023
    On 26.09.2023 23:31, Keith Thompson wrote:
    porkchop@invalid.foo (Mike Sanders) writes:
    [...]

    Why do you call this "djb2"?

    Here's a description of the djb2 hash function from <https://www.eecs.yorku.ca/~oz/hash.html>:
    [...]

    I see some similarities between the original djb and your hash function,
    and yours was undoubtedly inspired by djb2, but they're very much not the same thing. Would you agree that calling it "djb2" is misleading?

    I find it odd that your function treats all non-alphanumeric characters
    as the same. Is there a reason for that?


    Indeed. I as well was misleaded by the impression I've got that he
    was trying to implement (just in a wrong way) an existing algorithm.
    I noticed that (my) mistake not before I saw two versions of his code
    in one post, each using different character set encodings - which of
    course would lead to different results even if we compare only the
    OP's own implementations' results. This arbitrariness in conjunction
    with the inherent flaws the code versions had (and may still have;
    I don't bother any more) makes clear that we obviously have a "Not
    Invented Here" case (https://en.wikipedia.org/wiki/Not_invented_here),
    which most likely makes all in depth discussions void anyway. Glad
    it became obvious after all. (Bad that I noticed that too late. And I
    see that the OP just recently followed your suggestion to rename his
    hash function.)

    It's always good to get information about the _intention_ of a piece
    of posted code early and with the original post; that way bandworm
    threads and misunderstandings could be avoided.


    Let's come back to the actual application and application domain ...

    Inferred from the OP's posting where he wrote: dbj2("policy_number"))
    and "simple a way to obfuscate small chucks of patient records." ...

    In statistical medicine context there's demand to anonymize patient
    data. For statistical purposes and cause-effect insights it's usually
    necessary to be able to assign data to an "individual entity" without disclosing his identity. But a hash code does not support that; it
    can lead to "ID"-clashes that assigns different results to one same
    subject. You need to create (real) keys. - Remember Ben's confusion
    about (the OP's misnomer) "key number"! The OP may seem to require
    keys but he's creating a hash value.

    Of course if a hash value is sufficient - only the OP knows this! -
    then I'd have used some existing and proven algorithm instead of
    inventing my own (with possible flaws or non-obvious bad properties).
    (But this as well has the OP to decide and is not an Awk question
    anyway.)
    </OT>

    Janis

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Mike Sanders@21:1/5 to Janis Papanagnou on Wed Sep 27 10:45:18 2023
    Wow... I've simply attempted to learn & share that with others.
    I cant recall every seeing such, I dunno, at a loss for words.
    It would've taken me a few more iterations to bring it all togther
    but with this environment, its not okay.

    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:

    Indeed. I as well was misleaded by the impression I've got that he
    was trying to implement (just in a wrong way) an existing algorithm.
    I noticed that (my) mistake not before I saw two versions of his code
    in one post, each using different character set encodings - which of
    course would lead to different results even if we compare only the
    OP's own implementations' results. This arbitrariness in conjunction
    with the inherent flaws the code versions had (and may still have;
    I don't bother any more) makes clear that we obviously have a "Not
    Invented Here" case (https://en.wikipedia.org/wiki/Not_invented_here),
    which most likely makes all in depth discussions void anyway. Glad
    it became obvious after all. (Bad that I noticed that too late. And I
    see that the OP just recently followed your suggestion to rename his
    hash function.)

    It's always good to get information about the _intention_ of a piece
    of posted code early and with the original post; that way bandworm
    threads and misunderstandings could be avoided.


    Let's come back to the actual application and application domain ...

    Inferred from the OP's posting where he wrote: dbj2("policy_number"))
    and "simple a way to obfuscate small chucks of patient records." ...

    In statistical medicine context there's demand to anonymize patient
    data. For statistical purposes and cause-effect insights it's usually necessary to be able to assign data to an "individual entity" without disclosing his identity. But a hash code does not support that; it
    can lead to "ID"-clashes that assigns different results to one same
    subject. You need to create (real) keys. - Remember Ben's confusion
    about (the OP's misnomer) "key number"! The OP may seem to require
    keys but he's creating a hash value.

    Of course if a hash value is sufficient - only the OP knows this! -
    then I'd have used some existing and proven algorithm instead of
    inventing my own (with possible flaws or non-obvious bad properties).
    (But this as well has the OP to decide and is not an Awk question
    anyway.)
    </OT>

    Janis



    --
    :wq
    Mike Sanders

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael Sanders@21:1/5 to Janis Papanagnou on Wed Sep 27 11:13:21 2023
    On Wednesday, September 27, 2023 at 3:04:03 AM UTC-5, Janis Papanagnou wrote:

    Of course if a hash value is sufficient - only the OP knows this!

    key = 32bitmax - random_lesser

    (admin issues keys so no duplicates)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)