• why is this procedure so slow?

    From Julieta Shem@21:1/5 to All on Thu Feb 8 13:47:51 2024
    First I define x to be a list of bytes of an NNTP article.

    (defvar x (fetch-article "local.test" "28"))

    Now x is a list of integers, the bytes of the article 28, which is a
    MIME message containing a PDF of 610 KiB.

    Now I want to split it line by line. I am able to do it, but it's very
    slow.

    * (time (length (split-sequence (list 13 10) x nil)))
    Evaluation took:
    65.710 seconds of real time
    65.671875 seconds of total run time (47.093750 user, 18.578125 system)
    [ Run times consist of 23.968 seconds GC time, and 41.704 seconds non-GC time. ]
    99.94% CPU
    170,322,749,168 processor cycles
    79,439,358,864 bytes consed

    11585
    *

    Less than 12k lines. The number 79,439,358,864 of bytes is way more
    than I would have expected for anything regarding this invocation above,
    but I have no idea what this number really measures here. I appreciate
    any help. Thank you.

    Here's the procedure:

    (defun split-sequence (delim ls acc &key limit (so-far 1))
      (let* ((len (length ls))
             (delim delim)
             (pos (search delim ls))
             (n-take (or pos len))
             (n-drop (if pos
                         (+ n-take (length delim))
                         n-take)))
        (cond ((zerop len) acc)
              ((and limit (= so-far limit)) (list ls))
              (t (split-sequence
                  delim (drop n-drop ls)
                  (cons (take n-take ls) acc)
                  :limit limit
                  :so-far (1+ so-far))))))

    (defun take (n seq) (subseq seq 0 n))
    (defun drop (n seq) (subseq seq n))

    (*) A sample of the article

    All lines in it should be pretty short.

    --8<---------------cut here---------------start------------->8---
    Message-Id: <tnkqcqnuujaljsvmzvuc@loop>
    Content-Type: multipart/mixed; boundary="------------PeB0GiqcER01ZhCmBvnP2yr6"
    Date: Wed, 7 Feb 2024 22:22:57 -0300
    Mime-Version: 1.0
    User-Agent: Mozilla Thunderbird
    Newsgroups: local.test
    Content-Language: en-US
    From: Someone <someone@somewhere.org>
    Subject: juris hartmanis

    This is a multi-part message in MIME format.

    JVBERi0xLjIKekdf1fnfSqQYt7AjczYfpmRSIEyEcx8KMSAwIG9iago8PAovVHlwZSAvQ2F0
    YWxvZwovUGFnZXMgMyAwIFIKL091dGxpbmVzIDIgMCBSCj4+CmVuZG9iagoyIDAgb2JqCjw8
    [...]
    IDAwMDAwIG4gCjAwMDA2MjI5NTQgMDAwMDAgbiAKdHJhaWxlcgo8PAovU2l6ZSA0NgovUm9v
    dCAxIDAgUgovSW5mbyA0NSAwIFIKPj4Kc3RhcnR4cmVmCjYyMzI4NgolJUVPRgo=

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kaz Kylheku@21:1/5 to Julieta Shem on Thu Feb 8 19:24:57 2024
    On 2024-02-08, Julieta Shem <jshem@yaxenu.org> wrote:
    First I define x to be a list of bytes of an NNTP article.

    (defvar x (fetch-article "local.test" "28"))

    Now x is a list of integers, the bytes of the article 28, which is a
    MIME message containing a PDF of 610 KiB.

    Now I want to split it line by line. I am able to do it, but it's very
    slow.

    * (time (length (split-sequence (list 13 10) x nil)))
    Evaluation took:
    65.710 seconds of real time

    Your code has O(N^2) behavior because each recursive call performs
    operations that take a time proportional to the length of input
    list, such as taking its length.

    The operation (subseq list n) is required to produce a copy of the
    cell structure. You can skip n elements of a list with (nthcdr n list)
    without allocating memory. It still requires traversing the list
    from the beginning to find the n-th cell.
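    For instance, the sharing is easy to see (an illustrative snippet, not
    from the original post; the variable name xs is made up):

```lisp
;; Illustrative: SUBSEQ copies list structure, NTHCDR shares it.
(let ((xs (list 1 2 3 4 5)))
  (list (subseq xs 2)                     ; (3 4 5), freshly consed
        (nthcdr 2 xs)                     ; (3 4 5), the existing cells
        (eq (nthcdr 2 xs) (cddr xs))      ; T   -- the very same cons
        (eq (subseq xs 2) (cddr xs))))    ; NIL -- a copy
```

    Both walks are O(n) in the start position, but only SUBSEQ allocates.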

    This kind of function is best implemented separately for lists
    and vectors.

    To develop it reasonably efficiently in a generic way, what we can do is
    have some sort of abstract iteration primitives first: a way to get an
    iterator object over any kind of sequence, and then step it. The
    splitting is then done as a state machine. Getting successive items,
    collecting them and watching for the delimiter, collecting the items
    into sequences of the same kind as the input.

    To make it better, you'd really want to specialize this. If the input
    sequence is a list, dispatch code that does efficient list processing
    with minimal consing, and no redundant walking over the list (single
    pass only).

    Otherwise if the sequence is a vector-like sequence, then use subseq.
    But don't wastefully calculate the subseq of the rest of the list.
    Just extract the splits.
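    A list-specialized splitter in that spirit might look like the
    following (a hedged sketch, not Kaz's actual code; the names
    split-list and delim-prefix-p are invented). It walks the input once
    and conses only the output lines:

```lisp
;; Sketch: split a list on a delimiter sublist in a single pass.
(defun delim-prefix-p (delim rest)
  ;; Does REST begin with the elements of DELIM?
  (loop for d in delim
        do (if (and rest (eql d (car rest)))
               (setf rest (cdr rest))
               (return nil))
        finally (return t)))

(defun split-list (delim list)
  (let ((lines '()) (line '()) (rest list))
    (loop
      (cond ((null rest)                    ; end of input: flush last line
             (push (nreverse line) lines)
             (return (nreverse lines)))
            ((delim-prefix-p delim rest)    ; delimiter: start a new line
             (push (nreverse line) lines)
             (setf line '()
                   rest (nthcdr (length delim) rest)))
            (t (push (pop rest) line))))))  ; ordinary element

;; (split-list '(13 10) '(1 2 13 10 3 4)) => ((1 2) (3 4))
```

    Unlike the original split-sequence, a trailing delimiter here yields a
    final empty line.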

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Julieta Shem@21:1/5 to Raymond Wiker on Fri Feb 9 10:24:05 2024
    I'd like to thank everyone in this thread. Valuable help. Thank you.

    Raymond Wiker <rwiker@gmail.com> writes:

    Julieta Shem <jshem@yaxenu.org> writes:

    First I define x to be a list of bytes of an NNTP article.

    (defvar x (fetch-article "local.test" "28"))

    Now x is a list of integers, the bytes of the article 28, which is a
    MIME message containing a PDF of 610 KiB.

    Now I want to split it line by line. I am able to do it, but it's very
    slow.

    * (time (length (split-sequence (list 13 10) x nil)))
    Evaluation took:
    65.710 seconds of real time
    65.671875 seconds of total run time (47.093750 user, 18.578125 system)
    [ Run times consist of 23.968 seconds GC time, and 41.704 seconds non-GC time. ]
    99.94% CPU
    170,322,749,168 processor cycles
    79,439,358,864 bytes consed

    11585
    *

    Less than 12k lines. The number 79,439,358,864 of bytes is way more
    than I would have expected for anything regarding this invocation above,
    but I have no idea what this number really measures here. I appreciate
    any help. Thank you.

    Here's the procedure:

    (defun split-sequence (delim ls acc &key limit (so-far 1))
      (let* ((len (length ls))
             (delim delim)
             (pos (search delim ls))
             (n-take (or pos len))
             (n-drop (if pos
                         (+ n-take (length delim))
                         n-take)))
        (cond ((zerop len) acc)
              ((and limit (= so-far limit)) (list ls))
              (t (split-sequence
                  delim (drop n-drop ls)
                  (cons (take n-take ls) acc)
                  :limit limit
                  :so-far (1+ so-far))))))

    (defun take (n seq) (subseq seq 0 n))
    (defun drop (n seq) (subseq seq n))

    It is slow because

    * You're repeatedly calling #'length on whatever is left of the
    list. Computing the length of a list is an O(n) operation; repeatedly
    calling #'length on the remainder of the list is then O(n^2).

    * You are generating new lists for both the initial segments and the
    rest of the list. In order to generate 12k lines you have to generate
    lists containing the lines, but also 12k lists containing the remainder
    (at various points) of the input.

    That's the take and drop, I think.

    * Each character of the input requires a cons cell, which is probably 16
    bytes (assuming a 64-bit Lisp implementation). So, your input now
    requires 610000*16, so about 9.76MB.

    So that's one of the advantages of using a vector, I think.

    * The remainder lists will consume about 0.5*610000*16*12000, so about
    55GB. This is not too far from the 74GB that time reports. (My
    calculations are probably off, but clearly not _too_ far off.)

    Impressive.

    * I'm going to assume that 74GB is going to be a significant chunk of
    the available memory in your computer. That means that there will be
    significant garbage-collection activity, and you may also end up
    swapping to disk.

    What do you mean by ``available memory''? I have 24 GiB of RAM---23.9
    GiB usable, says Windows. I don't know how to show this right now, but
    I think I don't have any swap space (on Windows). I tried to turn it
    off some time ago and I think I succeeded.

    If you used strings (or byte arrays) instead, you would

    * Require much less memory.

    * Computing #'length for a string or array is O(1), so the overall complexity
    becomes O(n).

    Because strings and arrays store their size somewhere in their data
    structure, I suppose, and lists do not.

    * You don't have to create new arrays containing the successive
    remainders; instead, you can use the :start keyword to #'search.

    I tried this. Here's how I rewrote it. First some user interface that
    gets the length just once, even though now it's probably okay to call
    length multiple times. (The /acc/umulator should probably be
    optional here, but I was afraid of mixing &optional with &key. This is
    my first Common Lisp program.)

    (defun split-vector (delim v acc &key limit (so-far 1))
      (let ((len (length v)))
        (split-vector-helper delim v len acc limit so-far 0)))

    Now, find the position of /delim/ between /start/ and /len/. If you
    can't find it, we're done. If we've reached the limit of lines the
    caller requested, we're done. Now we found /delim/: copy the slice of
    the vector that goes from /start/ to the /pos/ition we found /delim/;
    save it in the accumulator. Update /start/ to move along the vector.
    Now repeat everything.

    (defun split-vector-helper (delim v len acc limit so-far start)
      (if (zerop len)
          acc
          (let ((pos (search delim v :start2 start :end2 len)))
            (cond ((or (not pos) (and limit (= so-far limit)))
                   (nreverse (cons (subseq v start len) acc)))
                  (t (split-vector-helper delim
                                          v
                                          len
                                          (cons (subseq v start (or pos len)) acc)
                                          limit
                                          (1+ so-far)
                                          (+ pos (length delim))))))))

    The result is better:

    * (time (length (split-vector (vector 13 10) x nil)))
    Evaluation took:
    0.011 seconds of real time
    0.000000 seconds of total run time (0.000000 user, 0.000000 system)
    0.00% CPU
    30,857,378 processor cycles
    7,000,064 bytes consed

    11576

    (Difficult to understand 0.011 real time against ``total run time''.)

    (I call /length/ so I won't get the output of split-vector, which is
    long.)

    Question. When split-vector-helper is repeated, will the arguments be
    copied over to the new invocation, or will SBCL somehow pass on
    pointers?

    I haven't studied about the /loop/ macro yet.

    * Also, if you used strings, you could just create a string stream from
    the input string and repeatedly call #'read-line to collect the
    lines. The entire program would be something like this (untested):

    (defun split-article (article-as-string)
    (with-input-from-string (str article-as-string)
    (loop for line = (read-line str nil)
    while line
    collect line)))

    As Spiros Bousbouras remarked in the thread, I'd prefer not to try to
    understand article content because the user has the right to use
    whatever encoding they please in NNTP articles. So I've been trying to
    handle them as byte sequences.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Julieta Shem@21:1/5 to Spiros Bousbouras on Sat Feb 10 11:47:23 2024
    Spiros Bousbouras <spibou@gmail.com> writes:

    On Fri, 09 Feb 2024 10:24:05 -0300
    Julieta Shem <jshem@yaxenu.org> wrote:

    I don't know if you have learned declaration syntax yet but some
    declarations would help make the code clearer and help the compiler
    do optimisations or catch errors. For example

    (declaim (ftype (function (vector vector list
                               &key (:limit (or null integer)) (:so-far integer))
                              list)
                    split-vector))

    I have not. Very interesting. Sounds like ``gradual typing'', if
    ``gradual typing'' is what it sounds like. :)

    (defun split-vector (delim v acc &key limit (so-far 1))
      (let ((len (length v)))
        (split-vector-helper delim v len acc limit so-far 0)))

    Now, find the position of /delim/ between /start/ and /len/. If you
    can't find it, we're done. If we've reached the limit of lines the
    caller requested, we're done. Now we found /delim/: copy the slice of
    the vector that goes from /start/ to the /pos/ition we found /delim/;
    save it in the accumulator. Update /start/ we along over the vector.
    Now repeat everything.

    (defun split-vector-helper (delim v len acc limit so-far start)
      (if (zerop len)
          acc
          (let ((pos (search delim v :start2 start :end2 len)))
            (cond ((or (not pos) (and limit (= so-far limit)))
                   (nreverse (cons (subseq v start len) acc)))
                  (t (split-vector-helper delim
                                          v
                                          len
                                          (cons (subseq v start (or pos len)) acc)
                                                                ^^^^^^^^^^

    If this branch is taken , you already know that pos is non NIL .

    Thanks!

                                          limit
                                          (1+ so-far)
                                          (+ pos (length delim))))))))

    Question. When split-vector-helper is repeated, will the arguments be
    copied over to the new invocation, or will SBCL somehow pass on
    pointers?

    This has been answered in a different post but you can also write the
    helper function within the body of split-vector in which case the
    helper will inherit the surrounding lexical environment so you won't
    need to pass so many arguments to it.

    Sounds very useful. Your ``&aux'' made me think that's possible.
    Great news.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Julieta Shem@21:1/5 to Spiros Bousbouras on Sat Feb 10 11:43:44 2024
    Spiros Bousbouras <spibou@gmail.com> writes:

    [...]

    I haven't studied about the /loop/ macro yet.

    Some people abhor loop, while others find it strangely attractive (or
    even addictive).

    For turning Julieta's recursion into a loop , DO will do just fine.

    Good to know. I haven't studied DO and DO* yet either. Lol. How am I
    writing code over here? Perhaps you'd all be horrified. It's hard to
    learn everything. It seems better to write with the little we have and
    then rewrite it as the years go by. I did learn to use DOLIST.

    [...]

    If it also wants to do spam filtering
    then yes , it will have to decode the body.

    Spam should be the problem of a spam filter---not my problem. :)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Julieta Shem@21:1/5 to Raymond Wiker on Sat Feb 10 11:37:32 2024
    Raymond Wiker <rwiker@gmail.com> writes:

    Julieta Shem <jshem@yaxenu.org> writes:

    I'd like to thank everyone in this thread. Valuable help. Thank you.

    Raymond Wiker <rwiker@gmail.com> writes:

    Julieta Shem <jshem@yaxenu.org> writes:

    First I define x to be a list of bytes of an NNTP article.

    (defvar x (fetch-article "local.test" "28"))

    Now x is a list of integers, the bytes of the article 28, which is a
    MIME message containing a PDF of 610 KiB.

    Now I want to split it line by line. I am able to do it, but it's very
    slow.

    * (time (length (split-sequence (list 13 10) x nil)))
    Evaluation took:
    65.710 seconds of real time
    65.671875 seconds of total run time (47.093750 user, 18.578125 system)
    [ Run times consist of 23.968 seconds GC time, and 41.704 seconds non-GC time. ]
    99.94% CPU
    170,322,749,168 processor cycles
    79,439,358,864 bytes consed

    11585
    *

    Less than 12k lines. The number 79,439,358,864 of bytes is way more
    than I would have expected for anything regarding this invocation above,
    but I have no idea what this number really measures here. I appreciate
    any help. Thank you.

    Here's the procedure:

    (defun split-sequence (delim ls acc &key limit (so-far 1))
      (let* ((len (length ls))
             (delim delim)
             (pos (search delim ls))
             (n-take (or pos len))
             (n-drop (if pos
                         (+ n-take (length delim))
                         n-take)))
        (cond ((zerop len) acc)
              ((and limit (= so-far limit)) (list ls))
              (t (split-sequence
                  delim (drop n-drop ls)
                  (cons (take n-take ls) acc)
                  :limit limit
                  :so-far (1+ so-far))))))

    (defun take (n seq) (subseq seq 0 n))
    (defun drop (n seq) (subseq seq n))

    It is slow because

    * You're repeatedly calling #'length on whatever is left of the
    list. Computing the length of a list is an O(n) operation; repeatedly
    calling #'length on the remainder of the list is then O(n^2).

    * You are generating new lists for both the initial segments and the
    rest of the list. In order to generate 12k lines you have to generate
    lists containing the lines, but also 12k lists containing the remainder
    (at various points) of the input.

    That's the take and drop, I think.

    Yup.

    * Each character of the input requires a cons cell, which is probably 16
    bytes (assuming a 64-bit Lisp implementation). So, your input now
    requires 610000*16, so about 9.76MB.

    So that's one of the advantages of using a vector, I think.

    Right. The other is that accessing any element in an array by position (index) is O(1); for a list, it's O(n).

    * The remainder lists will consume about 0.5*610000*16*12000, so about
    55GB. This is not too far from the 74GB that time reports. (My
    calculations are probably off, but clearly not _too_ far off.)

    Impressive.

    Maybe... but it also appears to be wrong :-)

    That's okay. :-)

    * I'm going to assume that 74GB is going to be a significant chunk of
    the available memory in your computer. That means that there will be
    significant garbage-collection activity, and you may also end up
    swapping to disk.

    What do you mean by ``available memory''? I have 24 GiB of RAM---23.9
    GiB usable, says Windows. I don't know how to show this right now, but
    I think I don't have any swap space (on Windows). I tried to turn it
    off some time ago and I think I succeeded.

    What implementation of Common Lisp are you using? Is it a 32-bit or
    64-bit version?

    I believe it's a 64-bit version.

    %file sbcl.exe
    sbcl.exe: PE32+ executable (console) x86-64, for MS Windows

    I got it from

    https://www.sbcl.org/platform-table.html

    and I downloaded the MSI-file from column AMD64, version 2.3.2.

    %./sbcl.exe --version
    SBCL 2.3.2

    If you used strings (or byte arrays) instead, you would

    * Require much less memory.

    * Computing #'length for a string or array is O(1), so the overall complexity
    becomes O(n).

    Because strings and arrays store their size somewhere in their data
    structure, I suppose, and lists do not.

    Right. Lists are literally composed of objects that contain two values
    (cons cells), where the second value is a pointer to the next cons cell.

    * You don't have to create new arrays containing the successive
    remainders; instead, you can use the :start keyword to #'search.

    I tried this. Here's how I rewrote it. First some user interface that
    gets the length just once, even though now it's probably okay to call
    length multiple times. (The /acc/umulator should probably be
    optional here, but I was afraid of mixing &optional with &key. This is
    my first Common Lisp program.)

    (defun split-vector (delim v acc &key limit (so-far 1))
      (let ((len (length v)))
        (split-vector-helper delim v len acc limit so-far 0)))

    Now, find the position of /delim/ between /start/ and /len/. If you
    can't find it, we're done. If we've reached the limit of lines the
    caller requested, we're done. Now we found /delim/: copy the slice of
    the vector that goes from /start/ to the /pos/ition we found /delim/;
    save it in the accumulator. Update /start/ we along over the vector.
    Now repeat everything.

    (defun split-vector-helper (delim v len acc limit so-far start)
      (if (zerop len)
          acc
          (let ((pos (search delim v :start2 start :end2 len)))
            (cond ((or (not pos) (and limit (= so-far limit)))
                   (nreverse (cons (subseq v start len) acc)))
                  (t (split-vector-helper delim
                                          v
                                          len
                                          (cons (subseq v start (or pos len)) acc)
                                          limit
                                          (1+ so-far)
                                          (+ pos (length delim))))))))

    The result is better:

    * (time (length (split-vector (vector 13 10) x nil)))
    Evaluation took:
    0.011 seconds of real time
    0.000000 seconds of total run time (0.000000 user, 0.000000 system)
    0.00% CPU
    30,857,378 processor cycles
    7,000,064 bytes consed

    11576

    (Difficult to understand 0.011 real time against ``total run time''.)

    That's a good second implementation, and a very nice improvement in run
    time and memory use!

    Thanks!

    Note that you do not actually need the "len" parameter: if you don't
    specify :end2 for #'search or the end position for #'subseq, they will
    assume the value nil, which means the end of the sequence.

    Noted.
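    Concretely, that suggestion shrinks the helper to something like this
    (a sketch, not code from the thread; note that an empty input now
    yields a list holding one empty vector instead of the bare
    accumulator):

```lisp
;; Sketch: with no :end2/end arguments, SEARCH and SUBSEQ run to the
;; end of the sequence, so the len parameter can be dropped.
(defun split-vector-helper (delim v acc limit so-far start)
  (let ((pos (search delim v :start2 start)))
    (if (or (not pos) (and limit (= so-far limit)))
        (nreverse (cons (subseq v start) acc))     ; last piece
        (split-vector-helper delim v
                             (cons (subseq v start pos) acc)
                             limit (1+ so-far)
                             (+ pos (length delim))))))
```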

    I haven't studied about the /loop/ macro yet.

    Some people abhor loop, while others find it strangely attractive (or
    even addictive).

    People say loop is hard to understand. I have a feeling I won't like it
    much: hard-to-understand things don't give us a feeling of control.

    ``The details of the interaction matter, ease of use matters, but I
    want more than correct details, more than a system that is easy to
    learn or to use: I want a system that is enjoyable to use. This is
    an important, dominating design philosophy, easier to say than to
    do. It implies developing systems that provide a strong sense of
    understanding and control. This means tools that reveal their
    underlying conceptual model and allow for interaction, tools that
    emphasize comfort, ease, and pleasure of use [...]. A major factor
    in this debate is the feeling of control that the user has over the
    operations that are being performed. A `powerful,' `intelligent'
    system can lead to the well documented problems of `overautomation,'
    causing the user to be a passive observer of operations, no longer
    in control of either what operations take place, or of how they are
    done. On the other hand, systems that are not sufficiently powerful
    or intelligent can leave too large a gap in the mappings from
    intention to action execution and from system state to psychological
    interpretation. The result is that operation and interpretation are
    complex and difficult, and the user again feels out of control,
    distanced from the system.'' -- ``User Centered System Design'',
    chapter 3, ``cognitive engineering'', ``on the quality of
    human-computer interaction'', pages 48--49, Donald A. Norman, CRC
    Press, 1986, ISBN 0-89859-872-9.

    * Also, if you used strings, you could just create a string stream from
    the input string and repeatedly call #'read-line to collect the
    lines. The entire program would be something like this (untested):

    (defun split-article (article-as-string)
    (with-input-from-string (str article-as-string)
    (loop for line = (read-line str nil)
    while line
    collect line)))

    As Spiros Bousbouras remarked in the thread, I'd prefer not to try to
    understand article content because the user has the right to use
    whatever encoding they please in NNTP articles. So I've been trying to
    handle them as byte sequences.

    At some point you will probably have to encode the article as
    characters. In order to do so, you will have to parse the article
    headers to find what encoding is used. I'm not convinced that this is possible, but if you have a 16-bit character encoding, you may have an
    octet 13 from one character, and an octet 10 from the following
    character.

    That's a good question. I wonder what would happen. Spiros Bousbouras
    seems to say that the NNTP protocol is such that if someone uses an
    encoding that could encode a character as CRLF, it's their problem
    now---NNTP came first.

    So far what I have written is what Bousbouras says. I take an article
    as a sequence of bytes. I look for the first sequence of bytes

    13 10 13 10.

    That separates headers and body. The body is everything that follows
    this mark. Headers are everything that precedes it. I do have to
    decode some headers completely---such as /newsgroups/. To make my life
    easy and let me use string functions, I've been decoding all headers,
    though I would rewrite that if I could and let people put whatever
    encoding they want in the headers so long as they respect
    line-termination as CRLF and respect the header-value separator, which
    is the COLON.

    Perhaps this perspective turns the NNTP protocol into a binary protocol.
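    That byte-level split can be sketched as follows (illustrative only;
    split-headers-body is an invented name, and the article is assumed to
    be held as a vector of bytes):

```lisp
;; Sketch: divide an article into headers and body at the first
;; CRLF CRLF (13 10 13 10) mark.
(defun split-headers-body (article)
  (let ((pos (search #(13 10 13 10) article)))
    (if pos
        (values (subseq article 0 pos)       ; headers, without the mark
                (subseq article (+ pos 4)))  ; body, everything after it
        (values article nil))))              ; no separator found
```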

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Julieta Shem@21:1/5 to Spiros Bousbouras on Sat Feb 10 14:18:13 2024
    Spiros Bousbouras <spibou@gmail.com> writes:

    On Sat, 10 Feb 2024 11:43:44 -0300
    Julieta Shem <jshem@yaxenu.org> wrote:
    Spiros Bousbouras <spibou@gmail.com> writes:

    [...]

    I haven't studied about the /loop/ macro yet.

    Some people abhor loop, while others find it strangely attractive (or
    even addictive).

    For turning Julieta's recursion into a loop , DO will do just fine.

    Good to know. I haven't studied DO and DO* yet either. Lol. How am I
    writing code over here? Perhaps you'd all be horrified. It's hard to
    learn everything. It seems better to write with the little we have and
    then rewrite it as the years go by. I did learn to use DOLIST.

    It will only take you 10 minutes tops to learn to use DO and it will make your life easier.

    That's what non-beginners think, which includes beginners that didn't
    begin yet---I've thought that too. Things are always more complicated
    than it appears. You have no idea how many stupid things I've done
    here. :) Here's a not-first attempt at using DO.

    (defun nntp-read-line-helper (bs)
      (let ((len (length bs))
            acc)
        (do ((i 0 (1+ i)))
            ((= i len) (nreverse acc))
          (let ((x (svref bs i)))
            (if (= x 10)
                (return (nreverse acc))
                (push x acc))))))

    CL-USER> (nntp-read-line-helper (vector 1 2 3 13 10 4 5 6 13 10))
    (1 2 3 13)

    CL-USER> (nntp-read-line-helper (vector 4 5 6 13 10))
    (4 5 6 13)

    CL-USER> (nntp-read-line-helper (vector))
    NIL

    It'd be used like that. (This one might be successful.) I haven't
    written the nntp-read-line driver of this helper yet. (There are some
    things that I consider non-obvious. Network connections can deliver
    partial lines. What do I do when I get a partial line? Also, for good
    performance, I likely need read-sequence here. So, my first version
    uses read-byte, and now I have a slow reading procedure whose slowness
    shows when you attach a file, say. To be honest, I'm not even
    going to allow files to be attached, but I would like to know what to do
    if I wanted a high-performance reader.)

    Vectors are not convenient when we want to build something one element
    at a time. I think coercing a list into a vector is O(n), but that
    seems inevitable because to make a vector we need to know the size and
    we don't know the size up front.

    It must be hard trying to write everything as recursion. Just be
    careful that the CLHS says do iterates over a group of statements
    while a test condition holds.

    which is the opposite of the correct behaviour: the iteration
    terminates when the test condition becomes true. That's an
    embarrassing mistake in the specification. All the implementations
    implement the correct behaviour.

    I don't read the CLHS at first. I first, say, consult Paul Graham's
    ANSI Common Lisp.

    [...]

    If it also wants to do spam filtering
    then yes , it will have to decode the body.

    Spam should be the problem of a spam filter---not my problem. :)

    That depends on whether you are planning to use your server "properly"
    i.e. put it online and peer with other servers , etc. People won't peer
    with you unless you have antispam measures in place. So perhaps spam
    filtering won't be implemented within the server but you probably
    should give some thought on how well your server would integrate with
    preexisting spam filters. I've seen peering policies specifying that
    in order to peer with a server , it has to use a piece of software
    called cleanfeed .

    I'm not going to join the USENET---that would likely be a lot of work.
    The idea of this server is to offer Lisp programmers a little private
    club in which they can play with things. The server will be
    closed---you need an account to join. Accounts can be created by any
    member, so any member can invite other members---giving members a sense
    they own the software that runs the community. (Instead of mailing
    lists, say, you can run this little NNTP guy.) That's the idea---a
    playground. It's also a small-community tool.

    I've got a skeleton of it running. It's been great fun. I've never had
    so much fun in programming. There was a time I was seriously
    considering stopping programming, but then I discovered Lisp. In
    choosing a dialect, I chose Racket and spent considerable time studying
    it. I've written small things in Racket (for the web). I loved their web-server based on continuations. But it seems that life is much
    easier and more fun in Common Lisp---not by a small delta. I always
    felt that pretty much any Lisp would be more or less the same fun, but
    that's not true.

    Big thanks to the people here who directly or indirectly made me
    consider Common Lisp. There's a thread here that began with Racket and
    was eventually renamed to ``Choosing between Lisps and Schemes''. I
    believe it was the POSIX discussion that gave me the idea that maybe I
    would find the Common Lisp world easier to understand than Racket's.
    Maybe that's what made me decide to give it a try. I'm very grateful to Racket, though, in that their books were what I needed to stop thinking
    so much like a C programmer and start thinking more like a functional programmer. The project now is to think like a language-oriented
    programmer and, ironically, I gave up on Racket for that. (Maybe I am
    the problem.)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Alan Bawden@21:1/5 to All on Sun Feb 11 02:00:48 2024
    Julieta Shem <jshem@yaxenu.org> writes:

    Vectors are not convenient when we want to build something one element
    at a time. I think coercing a list into a vector is O(n), but that
    seems inevitable because to make a vector we need to know the size and
    we don't know the size up front.

    This is why we have arrays with fill pointers. You should read up on
    how arrays (and vectors) really work, with an eye on eventually being
    able to use VECTOR-PUSH-EXTEND on an array that's specialized to contain
    just 8-bit bytes.
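    A minimal sketch of what that looks like (assumed usage, not Alan's
    code):

```lisp
;; Sketch: an adjustable vector specialized to (unsigned-byte 8),
;; with a fill pointer, grown one byte at a time.
(let ((buf (make-array 0 :element-type '(unsigned-byte 8)
                         :adjustable t :fill-pointer 0)))
  (dolist (b '(13 10 13 10))
    (vector-push-extend b buf))
  buf)   ; a byte vector of length 4 containing 13 10 13 10
```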

    - Alan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Jeff Barnett@21:1/5 to Alan Bawden on Sun Feb 11 00:33:29 2024
    On 2/11/2024 12:00 AM, Alan Bawden wrote:
    Julieta Shem <jshem@yaxenu.org> writes:

    Vectors are not convenient when we want to build something one element
    at a time. I think coercing a list into a vector is O(n), but that
    seems inevitable because to make a vector we need to know the size and
    we don't know the size up front.

    This is why we have arrays with fill pointers. You should read up on
    how arrays (and vectors) really work, with an eye on eventually being
    able to use VECTOR-PUSH-EXTEND on an array that's specialized to contain
    just 8-bit bytes.

    N.B. The use of vector-push-extend, where extend happens many (for many definitions of "many") times, will consume O(n^2) time AND probably
    memory too. There is no free (or even cheap) lunch when you don't have a
    good estimate of final total size up front.

    I think the best you can do is some sort of balanced tree, e.g., a heap,
    that will consume total build time of O(n*log n) and just O(n) space.
    The average/median retrieval time of a single element by index: O(1) for vector; O(log n) for heap; O(n) for linked list.
    --
    Jeff Barnett

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Ben Bacarisse@21:1/5 to Jeff Barnett on Sun Feb 11 10:38:45 2024
    Jeff Barnett <jbb@notatt.com> writes:

    On 2/11/2024 12:00 AM, Alan Bawden wrote:
    Julieta Shem <jshem@yaxenu.org> writes:
    Vectors are not convenient when we want to build something one element
    at a time. I think coercing a list into a vector is O(n), but that
    seems inevitable because to make a vector we need to know the size and
    we don't know the size up front.
    This is why we have arrays with fill pointers. You should read up on
    how arrays (and vectors) really work, with an eye on eventually being
    able to use VECTOR-PUSH-EXTEND on an array that's specialized to contain
    just 8-bit bytes.

    N.B. The use of vector-push-extend, where extend happens many (for many definitions of "many") times, will consume O(n^2) time AND probably memory too. There is no free (or even cheap) lunch when you don't have a good estimate of final total size up front.

    Is this true for real implementations? I would have thought they would typically use an exponential allocation strategy to give what is often
    called "amortised O(n)" growth.

    That is certainly what I found in a quick experiment using clisp:

    (defun build-vec (size)
      (let ((a (make-array 0 :adjustable t :fill-pointer t)))
        (dotimes (i size (length a)) (vector-push-extend i a))))

    shows linear growth in execution time.

    I think the best you can do is some sort of balanced tree, e.g., a heap,
    that will consume total build time of O(n*log n) and just O(n) space. The average/median retrieval time of a single element by index: O(1) for
    vector; O(log n) for heap; O(n) for linked list.

    --
    Ben.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Ben Bacarisse@21:1/5 to Spiros Bousbouras on Sun Feb 11 22:00:56 2024
    Spiros Bousbouras <spibou@gmail.com> writes:

    On Sun, 11 Feb 2024 10:38:45 +0000
    Ben Bacarisse <ben.usenet@bsb.me.uk> wrote:
    Jeff Barnett <jbb@notatt.com> writes:
    N.B. The use of vector-push-extend, where extend happens many (for many
    definitions of "many") times, will consume O(n^2) time AND probably memory
    too. There is no free (or even cheap) lunch when you don't have a good
    estimate of final total size up front.

    Is this true for real implementations? I would have thought they would
    typically use an exponential allocation strategy to give what is often
    called "amortised O(n)" growth.

    That is certainly what I found in a quick experiment using clisp:

    (defun build-vec (size)
      (let ((a (make-array 0 :adjustable t :fill-pointer t)))
        (dotimes (i size (length a)) (vector-push-extend i a))))

    shows linear growth in execution time.

    What did you measure?

    I did essentially this

    (dolist (i '(16 17 18 19 20 21 22 23)) (time (build-vec (expt 2 i))))

    though I was actually running it interactively to see what gave
    reliable timings.

    --
    Ben.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Alan Bawden@21:1/5 to Alan Bawden on Sun Feb 11 21:06:20 2024
    Jeff Barnett <jbb@notatt.com> writes:

    On 2/11/2024 12:00 AM, Alan Bawden wrote:
    > Julieta Shem <jshem@yaxenu.org> writes:
    >
    > Vectors are not convenient when we want to build something one element
    > at a time. I think coercing a list into a vector is O(n), but that
    > seems inevitable because to make a vector we need to know the size and
    > we don't know the size up front.
    >
    > This is why we have arrays with fill pointers. You should read up on
    > how arrays (and vectors) really work, with an eye on eventually being
    > able to use VECTOR-PUSH-EXTEND on a array that's specialized to contain
    > just 8-bit bytes.

    N.B. The use of vector-push-extend, where extend happens many (for many
    definitions of "many") times, will consume O(n^2) time AND probably memory
    too. There is no free (or even cheap) lunch when you don't have a good
    estimate of final total size up front.

    I think the best you can do is some sort of balanced tree, e.g., a heap,
    that will consume total build time of O(n*log n) and just O(n) space. The
    average/median retrieval time of a single element by index: O(1) for
    vector; O(log n) for heap; O(n) for linked list.

    No. (Well technically you're correct, but this is not what the OP needs
    to hear.)

    Look, we added fill pointers, and VECTOR-PUSH and VECTOR-PUSH-EXTEND to
    Common Lisp PRECISELY TO COVER SITUATIONS LIKE THIS. There's no need
    here for balanced trees or other complicated data structures.

    First you allocate an array with sufficient capacity for an estimated
    typical case. If you're reading bytes from a stream, a size of
    something like 16K is probably enough. Set the fill pointer to zero
    and start accumulating bytes using VECTOR-PUSH-EXTEND.

    Note that the third argument to VECTOR-PUSH-EXTEND is the amount to
    extend the buffer if you get to the point of having more than 16K (or
    whatever) elements. Pick something big enough so that extension is
    rare. Like another 16K for example.

    If you're doing this multiple times, just reset the fill pointer and
    reuse this same array for the next buffer load of input. This way even
    if you estimated wrong about the size of a typical buffer load, you'll
    only pay for as many extensions as are needed for the largest buffer load
    you see.

    This is pretty much the way you would handle things in C, and there is
    no need to get any more complicated than that.

    Yes, technically this is O(n^2) -- if you set the initial buffer
    capacity to 1, and extend it by 1 element each time, this will matter.
    But if you set the initial buffer capacity and extension amount to
    something reasonable, you'll extend your buffer so few times that you'll
    never know it isn't O(n).

    - Alan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
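    [A sketch of the reusable-buffer pattern Alan describes. The constant and
    function names here are illustrative; the optional third argument to
    VECTOR-PUSH-EXTEND is the standard extension amount:]

    ```lisp
    (defconstant +chunk-size+ 16384) ; initial capacity and extension step

    ;; One buffer, allocated once and reused for every buffer load.
    (defvar *buffer*
      (make-array +chunk-size+ :element-type '(unsigned-byte 8)
                               :adjustable t
                               :fill-pointer 0))

    (defun slurp-bytes (stream)
      "Reset the fill pointer and accumulate bytes from STREAM.
    Extension, if it ever happens, grows the array by +CHUNK-SIZE+."
      (setf (fill-pointer *buffer*) 0)
      (loop for byte = (read-byte stream nil nil)
            while byte
            do (vector-push-extend byte *buffer* +chunk-size+))
      *buffer*)
    ```

    Because the fill pointer is reset rather than the array being freed, a
    second call pays for no allocation at all unless this load is larger
    than the largest one seen so far.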
  • From Jeff Barnett@21:1/5 to Ben Bacarisse on Sun Feb 11 21:39:34 2024
    On 2/11/2024 3:38 AM, Ben Bacarisse wrote:
    Jeff Barnett <jbb@notatt.com> writes:

    On 2/11/2024 12:00 AM, Alan Bawden wrote:
    Julieta Shem <jshem@yaxenu.org> writes:
    Vectors are not convenient when we want to build something one element
    at a time. I think coercing a list into a vector is O(n), but that
    seems inevitable because to make a vector we need to know the size and
    we don't know the size up front.
    This is why we have arrays with fill pointers. You should read up on
    how arrays (and vectors) really work, with an eye on eventually being
    able to use VECTOR-PUSH-EXTEND on an array that's specialized to contain
    just 8-bit bytes.

    N.B. The use of vector-push-extend, where extend happens many (for many
    definitions of "many") times, will consume O(n^2) time AND probably memory
    too. There is no free (or even cheap) lunch when you don't have a good
    estimate of final total size up front.

    Is this true for real implementations? I would have thought they would typically use an exponential allocation strategy to give what is often
    called "amortised O(n)" growth.

    That is certainly what I found in a quick experiment using clisp:

    (defun build-vec (size)
      (let ((a (make-array 0 :adjustable t :fill-pointer t)))
        (dotimes (i size (length a)) (vector-push-extend i a))))

    shows linear growth in execution time.

    Every time you do a rebuild initiated by a push-extend you end up
    copying all the current items in the vector; and each time you do the
    copy, more items are copied than for the last copy. This leads, finally,
    to O(n^2) complexity. And as to storage, you must look at all the
    allocations you have done and the overhead to either GC or tack them
    into your heap. Usually, you only notice this degradation if you push a
    whole lot of items - recall, O(.) is really an asymptotic measure. Lots
    of tricks will hide the problem for a while but one usually gets bit in
    the end.

    When you ran build-vec, did you let size be a few billion? What happened
    to run time? My guess is that most storage heap management schemes take
    about the same amount of time and storage to allocate a few bytes and a
    few thousand bytes. The growth effect, if it is real in the
    implementation you are using for testing, must build vectors many many
    times longer than the internal allocation chunk size to see the
    quadratic effect.

    I think the best you can do is some sort of balanced tree, e.g., a heap,
    that will consume total build time of O(n*log n) and just O(n) space. The
    average/median retrieval time of a single element by index: O(1) for
    vector; O(log n) for heap; O(n) for linked list.
    --
    Jeff Barnett

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
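    [The disagreement between Jeff and Ben comes down to the growth policy.
    Counting copied elements under the two policies makes it concrete -- a
    back-of-the-envelope model, not any particular implementation's actual
    behavior:]

    ```lisp
    (defun total-copies (n grow)
      "Elements copied while pushing N items into a vector that starts
    with capacity 1 and, when full, is reallocated to (funcall GROW cap),
    copying its current contents each time."
      (loop with capacity = 1
            with copied = 0
            for size from 0 below n
            when (= size capacity)
              do (incf copied size)
                 (setf capacity (funcall grow capacity))
            finally (return copied)))

    ;; Fixed-increment growth copies O(n^2) elements in total:
    (total-copies 1000000 (lambda (cap) (+ cap 1024)))   ; hundreds of millions
    ;; Doubling copies fewer than 2n elements in total -- amortised O(n):
    (total-copies 1000000 (lambda (cap) (* cap 2)))      ; about a million
    ```

    This is consistent with both positions: Jeff's O(n^2) holds for
    constant-step extension, while Ben's linear measurements suggest his
    implementations double (or similarly scale) the capacity.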
  • From Ben Bacarisse@21:1/5 to Jeff Barnett on Mon Feb 12 20:53:05 2024
    Jeff Barnett <jbb@notatt.com> writes:

    On 2/11/2024 3:38 AM, Ben Bacarisse wrote:
    Jeff Barnett <jbb@notatt.com> writes:

    On 2/11/2024 12:00 AM, Alan Bawden wrote:
    Julieta Shem <jshem@yaxenu.org> writes:
    Vectors are not convenient when we want to build something one element
    at a time. I think coercing a list into a vector is O(n), but that
    seems inevitable because to make a vector we need to know the size and
    we don't know the size up front.
    This is why we have arrays with fill pointers. You should read up on
    how arrays (and vectors) really work, with an eye on eventually being
    able to use VECTOR-PUSH-EXTEND on an array that's specialized to contain
    just 8-bit bytes.

    N.B. The use of vector-push-extend, where extend happens many (for many
    definitions of "many") times, will consume O(n^2) time AND probably memory
    too. There is no free (or even cheap) lunch when you don't have a good
    estimate of final total size up front.
    Is this true for real implementations? I would have thought they would
    typically use an exponential allocation strategy to give what is often
    called "amortised O(n)" growth.
    That is certainly what I found in a quick experiment using clisp:
    (defun build-vec (size)
      (let ((a (make-array 0 :adjustable t :fill-pointer t)))
        (dotimes (i size (length a)) (vector-push-extend i a))))
    shows linear growth in execution time.

    Every time you do a rebuild initiated by a push-extend you end up copying
    all the current items in the vector;

    Why? Having to copy everything, even in a tight reallocation loop like
    the one I used, will be a problem.

    and each time you do the copy, more
    items are copied than for the last copy. This leads, finally, to O(n^2) complexity. And as to storage, you must look at all the allocations you
    have done and the overhead to either GC or tack them into your
    heap. Usually, you only notice this degradation if you push a whole lot of items - recall, O(.) is really an asymptotic measure. Lots of tricks will hide the problem for a while but one usually gets bit in the end.

    When you ran build-vec, did you let size be a few billion? What
    happened to run time?

    No, I could only go up to (expt 2 23) before clisp seg faulted (surely a
    bug?). That's only about 8 million items, but 134 million bytes of
    storage. The time taken was almost perfectly linear.

    My guess is that most storage heap management schemes take about
    the same amount of time and storage to allocate a few bytes and a few thousand bytes. The growth effect, if it is real in the implementation you are using for testing, must build vectors many many times longer than the internal allocation chunk size to see the quadratic effect.

    Can you post an example? I don't know how to get either clisp or SBCL
    to go much larger. They both have options which would seem to control
    this but I can't seem to get them to cooperate! Any help would be much appreciated.

    --
    Ben.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Jeff Barnett@21:1/5 to Ben Bacarisse on Mon Feb 12 21:13:56 2024
    On 2/12/2024 1:53 PM, Ben Bacarisse wrote:
    Jeff Barnett <jbb@notatt.com> writes:

    On 2/11/2024 3:38 AM, Ben Bacarisse wrote:
    Jeff Barnett <jbb@notatt.com> writes:

    On 2/11/2024 12:00 AM, Alan Bawden wrote:
    Julieta Shem <jshem@yaxenu.org> writes:
    Vectors are not convenient when we want to build something one element
    at a time. I think coercing a list into a vector is O(n), but that
    seems inevitable because to make a vector we need to know the size and
    we don't know the size up front.

    This is why we have arrays with fill pointers. You should read up on
    how arrays (and vectors) really work, with an eye on eventually being
    able to use VECTOR-PUSH-EXTEND on an array that's specialized to contain
    just 8-bit bytes.

    N.B. The use of vector-push-extend, where extend happens many (for many
    definitions of "many") times, will consume O(n^2) time AND probably memory
    too. There is no free (or even cheap) lunch when you don't have a good
    estimate of final total size up front.
    Is this true for real implementations? I would have thought they would
    typically use an exponential allocation strategy to give what is often
    called "amortised O(n)" growth.
    That is certainly what I found in a quick experiment using clisp:
    (defun build-vec (size)
      (let ((a (make-array 0 :adjustable t :fill-pointer t)))
        (dotimes (i size (length a)) (vector-push-extend i a))))
    shows linear growth in execution time.

    Every time you do a rebuild initiated by a push-extend you end up copying
    all the current items in the vector;

    Why? Having to copy everything, even in a tight reallocation loop like
    the one I used, will be a problem.

    and each time you do the copy, more
    items are copied than for the last copy. This leads, finally, to O(n^2)
    complexity. And as to storage, you must look at all the allocations you
    have done and the overhead to either GC or tack them into your
    heap. Usually, you only notice this degradation if you push a whole lot of
    items - recall, O(.) is really an asymptotic measure. Lots of tricks will
    hide the problem for a while but one usually gets bit in the end.

    When you ran build-vec, did you let size be a few billion? What
    happened to run time?

    No, I could only go up to (expt 2 23) before clisp seg faulted (surely a bug?). That's only about 8 million items, but 134 million bytes of
    storage. The time taken was almost perfectly linear.

    My guess is that most storage heap management schemes take about
    the same amount of time and storage to allocate a few bytes and a few
    thousand bytes. The growth effect, if it is real in the implementation you
    are using for testing, must build vectors many many times longer than the
    internal allocation chunk size to see the quadratic effect.

    Can you post an example? I don't know how to get either clisp or SBCL
    to go much larger. They both have options which would seem to control
    this but I can't seem to get them to cooperate! Any help would be much appreciated.

    At the moment I have no system to build executable code. About a year
    ago, I contacted Franz Lisp to see what they wanted for a non-commercial
    license - what they used to lease to me years ago as some sort of
    academic thing. Unfortunately, the price was more than I wanted to pay
    to putter around so I passed. At present, I'm working from the back of
    an envelope:

    I don't know if it will help but I'll tell you a few of the assumptions
    I made leading to what I wrote above.

    1. Most variants of push-extend want to maintain the use of indexing to
    find the nth entry. That basically means that you demand that all memory
    used be contiguous (in the address space). There are ways around this by
    using heap (as in heap sort) algorithms but retrieval times go from O(1)
    to O(log size).

    2. Many variants of Malloc/Free set a minimum size, under the table, for
    the blocks they actually maintain: a size like 256 bytes or the same
    size as the virtual address mechanism handles, e.g., 4k bytes. If this
    is the case, you need REALLY big test cases that get to the point where
    you go quadratic in these size chunks.

    3. If you have a good handle on the structure (its ultimate size, growth characteristics, and reference patterns) you can develop or find
    algorithms that are better than I'm estimating above. However, and this
    is where the gotchas come in, using these specialized approaches in
    something like a general purpose Lisp push-extend primitive is all
    wrong: the overhead will eat little applications alive and cause one to
    wonder what's wrong with the system. Microseconds will turn into
    milliseconds, etc. Of course one could take the same approach taken by
    most Lisp hash packages: implement hash as an object type that can not
    only grow but morph its type as the table gets larger. For example, a
    hash table is originally implemented (under the table) as an assoc list;
    when the number of items crosses a threshold another set of methods
    provide hash functionality. And so on.
    --
    Jeff Barnett

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Ben Bacarisse@21:1/5 to Jeff Barnett on Tue Feb 13 14:13:49 2024
    Jeff Barnett <jbb@notatt.com> writes:

    On 2/12/2024 1:53 PM, Ben Bacarisse wrote:
    Jeff Barnett <jbb@notatt.com> writes:

    On 2/11/2024 3:38 AM, Ben Bacarisse wrote:
    Jeff Barnett <jbb@notatt.com> writes:

    On 2/11/2024 12:00 AM, Alan Bawden wrote:
    Julieta Shem <jshem@yaxenu.org> writes:
    Vectors are not convenient when we want to build something one element
    at a time. I think coercing a list into a vector is O(n), but that
    seems inevitable because to make a vector we need to know the size and
    we don't know the size up front.

    This is why we have arrays with fill pointers. You should read up on
    how arrays (and vectors) really work, with an eye on eventually being
    able to use VECTOR-PUSH-EXTEND on an array that's specialized to contain
    just 8-bit bytes.

    N.B. The use of vector-push-extend, where extend happens many (for many
    definitions of "many") times, will consume O(n^2) time AND probably memory
    too. There is no free (or even cheap) lunch when you don't have a good
    estimate of final total size up front.

    Is this true for real implementations? I would have thought they would
    typically use an exponential allocation strategy to give what is often
    called "amortised O(n)" growth.

    That is certainly what I found in a quick experiment using clisp:

    (defun build-vec (size)
      (let ((a (make-array 0 :adjustable t :fill-pointer t)))
        (dotimes (i size (length a)) (vector-push-extend i a))))
    shows linear growth in execution time.

    Every time you do a rebuild initiated by a push-extend you end up copying
    all the current items in the vector;
    Why? Having to copy everything, even in a tight reallocation loop like
    the one I used, will be a problem.

    and each time you do the copy, more
    items are copied than for the last copy. This leads, finally, to O(n^2)
    complexity. And as to storage, you must look at all the allocations you
    have done and the overhead to either GC or tack them into your
    heap. Usually, you only notice this degradation if you push a whole lot of
    items - recall, O(.) is really an asymptotic measure. Lots of tricks will
    hide the problem for a while but one usually gets bit in the end.

    When you ran build-vec, did you let size be a few billion? What
    happened to run time?
    No, I could only go up to (expt 2 23) before clisp seg faulted (surely a
    bug?). That's only about 8 million items, but 134 million bytes of
    storage. The time taken was almost perfectly linear.

    My guess is that most storage heap management schemes take about
    the same amount of time and storage to allocate a few bytes and a few
    thousand bytes. The growth effect, if it is real in the implementation you
    are using for testing, must build vectors many many times longer than the
    internal allocation chunk size to see the quadratic effect.
    Can you post an example? I don't know how to get either clisp or SBCL
    to go much larger. They both have options which would seem to control
    this but I can't seem to get them to cooperate! Any help would be much
    appreciated.
    ...
    I don't know if it will help but I'll tell you a few of the assumptions I made leading to what I wrote above.

    Thanks for what you wrote, but I was hoping for some help to extend my
    tests so I could see the effects you think I should be seeing but which I
    was not. I saw linear growth up to the point where the implementations
    crapped out and I could not see how to extend them.

    By the way, I don't think one can assume that in-place reallocation with
    an exponential growth strategy is always going to be in place. I was
    just suggesting that it's likely to be what is used in some
    implementations.

    --
    Ben.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Spiros Bousbouras on Tue Feb 13 22:14:16 2024
    On Fri, 9 Feb 2024 21:13:18 -0000 (UTC), Spiros Bousbouras wrote:

    (declaim (ftype (function (vector vector list
                               &key (:limit (or null integer))
                                    (:so-far integer))
                              list)
                    split-vector))

    Not a fan of “parenthesis-pileup” layout. This helps make it clearer
    to me:

    (declaim
    (ftype
    (function
    (vector vector list &key (:limit (or null integer)) (:so-far integer))
    list
    )
    split-vector
    )
    )

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Jeff Barnett@21:1/5 to Ben Bacarisse on Tue Feb 13 14:56:10 2024
    On 2/13/2024 7:13 AM, Ben Bacarisse wrote:
    Jeff Barnett <jbb@notatt.com> writes:

    On 2/12/2024 1:53 PM, Ben Bacarisse wrote:
    Jeff Barnett <jbb@notatt.com> writes:

    On 2/11/2024 3:38 AM, Ben Bacarisse wrote:
    Jeff Barnett <jbb@notatt.com> writes:

    On 2/11/2024 12:00 AM, Alan Bawden wrote:
    Julieta Shem <jshem@yaxenu.org> writes:
    Vectors are not convenient when we want to build something one element
    at a time. I think coercing a list into a vector is O(n), but that
    seems inevitable because to make a vector we need to know the size and
    we don't know the size up front.

    This is why we have arrays with fill pointers. You should read up on
    how arrays (and vectors) really work, with an eye on eventually being
    able to use VECTOR-PUSH-EXTEND on an array that's specialized to contain
    just 8-bit bytes.

    N.B. The use of vector-push-extend, where extend happens many (for many
    definitions of "many") times, will consume O(n^2) time AND probably memory
    too. There is no free (or even cheap) lunch when you don't have a good
    estimate of final total size up front.

    Is this true for real implementations? I would have thought they would
    typically use an exponential allocation strategy to give what is often
    called "amortised O(n)" growth.

    That is certainly what I found in a quick experiment using clisp:

    (defun build-vec (size)
      (let ((a (make-array 0 :adjustable t :fill-pointer t)))
        (dotimes (i size (length a)) (vector-push-extend i a))))
    shows linear growth in execution time.

    Every time you do a rebuild initiated by a push-extend you end up copying
    all the current items in the vector;
    Why? Having to copy everything, even in a tight reallocation loop like
    the one I used, will be a problem.

    and each time you do the copy, more
    items are copied than for the last copy. This leads, finally, to O(n^2)
    complexity. And as to storage, you must look at all the allocations you
    have done and the overhead to either GC or tack them into your
    heap. Usually, you only notice this degradation if you push a whole lot of
    items - recall, O(.) is really an asymptotic measure. Lots of tricks will
    hide the problem for a while but one usually gets bit in the end.

    When you ran build-vec, did you let size be a few billion? What
    happened to run time?
    No, I could only go up to (expt 2 23) before clisp seg faulted (surely a
    bug?). That's only about 8 million items, but 134 million bytes of
    storage. The time taken was almost perfectly linear.

    My guess is that most storage heap management schemes take about
    the same amount of time and storage to allocate a few bytes and a few
    thousand bytes. The growth effect, if it is real in the implementation you
    are using for testing, must build vectors many many times longer than the
    internal allocation chunk size to see the quadratic effect.
    Can you post an example? I don't know how to get either clisp or SBCL
    to go much larger. They both have options which would seem to control
    this but I can't seem to get them to cooperate! Any help would be much
    appreciated.
    ...
    I don't know if it will help but I'll tell you a few of the assumptions I
    made leading to what I wrote above.

    Thanks for what you wrote, but I was hoping for some help to extend my
    tests so I could see the effects you think I should be seeing but which I
    was not. I saw linear growth up to the point where the implementations crapped out and I could not see how to extend them.

    By the way, I don't think one can assume that in-place reallocation with
    an exponential growth strategy is always going to be in place. I was
    just suggesting that it's likely to be what is used in some implementations.

    I wasn't assuming an in-place reallocation. My thoughts were more along
    the following lines:

    An implementation of a vector with extend potential must still properly
    allow all vector methods (as in object methods) to meet their original contract. So, for example, the vector must present at least the illusion
    of elements being consecutive in the address space. In Lisp it also
    means that one can (after doing appropriate index validation) access the
    vector by computing its address in most implementations. One can do
    better by ignoring such "requirements" but then the implementation will
    display some kinks. It's just easier to assign new contiguous address
    space to the vector, copy the old contents, then put a magic forwarding
    pointer in the original header, and leave the rest of the original so
    the GC can have at.

    I'm not sure if I understood what you were trying to get at. If I
    flubbed it, please try again.

    PS I hope all goes well with you. It seems we've both been lurking in
    murky parts of USENET.
    --
    Jeff Barnett

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Ben Bacarisse@21:1/5 to Jeff Barnett on Wed Feb 14 20:45:00 2024
    Jeff Barnett <jbb@notatt.com> writes:

    On 2/13/2024 7:13 AM, Ben Bacarisse wrote:
    Jeff Barnett <jbb@notatt.com> writes:

    On 2/12/2024 1:53 PM, Ben Bacarisse wrote:
    Jeff Barnett <jbb@notatt.com> writes:

    On 2/11/2024 3:38 AM, Ben Bacarisse wrote:
    Jeff Barnett <jbb@notatt.com> writes:

    On 2/11/2024 12:00 AM, Alan Bawden wrote:
    Julieta Shem <jshem@yaxenu.org> writes:
    Vectors are not convenient when we want to build something one element
    at a time. I think coercing a list into a vector is O(n), but that
    seems inevitable because to make a vector we need to know the size and
    we don't know the size up front.

    This is why we have arrays with fill pointers. You should read up on
    how arrays (and vectors) really work, with an eye on eventually being
    able to use VECTOR-PUSH-EXTEND on an array that's specialized to contain
    just 8-bit bytes.

    N.B. The use of vector-push-extend, where extend happens many (for many
    definitions of "many") times, will consume O(n^2) time AND probably memory
    too. There is no free (or even cheap) lunch when you don't have a good
    estimate of final total size up front.

    Is this true for real implementations? I would have thought they would
    typically use an exponential allocation strategy to give what is often
    called "amortised O(n)" growth.

    That is certainly what I found in a quick experiment using clisp:

    (defun build-vec (size)
      (let ((a (make-array 0 :adjustable t :fill-pointer t)))
        (dotimes (i size (length a)) (vector-push-extend i a))))
    shows linear growth in execution time.

    Every time you do a rebuild initiated by a push-extend you end up copying
    all the current items in the vector;

    Why? Having to copy everything, even in a tight reallocation loop like
    the one I used, will be a problem.

    and each time you do the copy, more
    items are copied than for the last copy. This leads, finally, to O(n^2)
    complexity. And as to storage, you must look at all the allocations you
    have done and the overhead to either GC or tack them into your
    heap. Usually, you only notice this degradation if you push a whole lot of
    items - recall, O(.) is really an asymptotic measure. Lots of tricks will
    hide the problem for a while but one usually gets bit in the end.

    When you ran build-vec, did you let size be a few billion? What
    happened to run time?
    No, I could only go up to (expt 2 23) before clisp seg faulted (surely a
    bug?). That's only about 8 million items, but 134 million bytes of
    storage. The time taken was almost perfectly linear.

    My guess is that most storage heap management schemes take about
    the same amount of time and storage to allocate a few bytes and a few
    thousand bytes. The growth effect, if it is real in the implementation you
    are using for testing, must build vectors many many times longer than the
    internal allocation chunk size to see the quadratic effect.
    Can you post an example? I don't know how to get either clisp or SBCL
    to go much larger. They both have options which would seem to control
    this but I can't seem to get them to cooperate! Any help would be much
    appreciated.
    ...
    I don't know if it will help but I'll tell you a few of the assumptions I
    made leading to what I wrote above.
    Thanks for what you wrote, but I was hoping for some help to extend my
    tests so I could see the effects you think I should be seeing but which I
    was not. I saw linear growth up to the point where the implementations
    crapped out and I could not see how to extend them.
    By the way, I don't think one can assume that in-place reallocation with
    an exponential growth strategy is always going to be in place. I was
    just suggesting that it's likely to be what is used in some
    implementations.

    I wasn't assuming an in-place reallocation.

    Yes, I know. I was just remarking, based on my own observations, that just
    because at least two implementations appear to offer amortised O(n) cost,
    that result can't be relied upon, as the language does not stipulate any
    underlying strategy.

    I'm not sure if I understood what you were trying to get at. If I flubbed
    it, please try again.

    I was simply reporting the result of a measurement up to the maximum
    size I could manage.

    --
    Ben.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)