• GNU malloc and external fragmentation

    From Rainer Weikusat@21:1/5 to All on Tue Jul 13 20:18:20 2021
    I'm currently working on something supposed to enable matching names in
    DNS queries against a large "list of bad domains you don't want to be
    talking to". The test file I'm working with has 1,118,228
    entries. Reading all of the information in it into memory needs a little
    over 34M of RAM. A program using calloc to allocate the necessary
    storage ends up with an >49M heap due to memory wasted by GNU malloc on $something.

    In contrast to this, using mmap to allocate an area of a suitable size
    and putting the structures and domain names there just wastes 2400
    bytes.
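
    What follows is a minimal sketch of that approach, assuming Linux; the
    names arena_init/arena_alloc are made up. Everything is carved out of
    one anonymous mapping, so there is no per-allocation header at all:

        #include <stddef.h>
        #include <sys/mman.h>

        static char *arena, *arena_end;

        /* Reserve one contiguous region up front. */
        static int arena_init(size_t size)
        {
            arena = mmap(NULL, size, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (arena == MAP_FAILED)
                return -1;
            arena_end = arena + size;
            return 0;
        }

        /* Hand out structures and strings from the region; the only
           waste is alignment padding plus the tail of the last page. */
        static void *arena_alloc(size_t n)
        {
            n = (n + 7) & ~(size_t)7;          /* keep 8-byte alignment */
            if ((size_t)(arena_end - arena) < n)
                return NULL;
            void *p = arena;
            arena += n;
            return p;
        }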

    The code will possibly also run on a modern "embedded system"/ SOHO
    router. While these are pretty powerful nowadays, 15M of RAM getting
    sucked into malloc for no particular purpose seems excessive to
    me.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tavis Ormandy@21:1/5 to Rainer Weikusat on Tue Jul 13 21:14:51 2021
    On 2021-07-13, Rainer Weikusat wrote:
    I'm currently working on something supposed to enable matching names in
    DNS queries against a large "list of bad domains you don't want to be
    talking to". The test file I'm working with has 1,118,228
    entries. Reading all of the information in it into memory needs a little
    over 34M of RAM. A program using calloc to allocate the necessary
    storage ends up with an >49M heap due to memory wasted by GNU malloc on $something.

    You mean each entry is a calloc(), or you do calloc(1118228, sizeof(entry))?

    If it's the former, it sounds about right. I don't think this is
    fragmentation, just the allocator chunk metadata overhead.

    Each chunk needs some overhead to hold the size/flags and prev size. It
    seems like each entry is about 31 bytes (34M / 1118228), and you say
    (49M / 1118228) = ~45 bytes are used, so that's ~14 bytes of overhead,
    probably the two size_t's on x64.

    https://sourceware.org/glibc/wiki/MallocInternals#What_is_a_Chunk.3F
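
    One quick way to check this from a program (glibc assumed; note that
    malloc_usable_size() is glibc-specific):

        #include <malloc.h>   /* malloc_usable_size */
        #include <stdio.h>
        #include <stdlib.h>

        int main(void)
        {
            enum { N = 1000000, REQ = 31 };
            size_t usable = 0;
            for (int i = 0; i < N; i++)
                usable += malloc_usable_size(malloc(REQ));
            /* Add the 8-byte size/flags header each chunk carries. */
            printf("requested %zu, resident roughly %zu\n",
                   (size_t)N * REQ, usable + (size_t)N * 8);
            return 0;
        }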

    Tavis.

    --
    _o) $ lynx lock.cmpxchg8b.com
    /\\ _o) _o) $ finger taviso@sdf.org
    _\_V _( ) _( ) @taviso

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Marcel Mueller@21:1/5 to All on Tue Jul 13 22:46:37 2021
    On 13.07.21 at 21:18, Rainer Weikusat wrote:
    I'm currently working on something supposed to enable matching names in
    DNS queries against a large "list of bad domains you don't want to be
    talking to". The test file I'm working with has 1,118,228
    entries. Reading all of the information in it into memory needs a little
    over 34M of RAM.

    Too much for reasonable cache efficiency. You should consider more
    efficient data structures.

    I wrote something like that about 25 years ago (domain & URL blacklists
    & whitelists for security critical environments).
    Trie (not tree!) structures could significantly reduce the amount of
    memory and avoid excessive random memory access. You could use
    single characters or even domain components like the TLD as the atoms
    of the trie.
    Furthermore tree lookups have high cache efficiency (compared to hash
    tables) since every search starts at the same root.
    In fact you need hybrid structures that decay to ordinary B trees at
    deeper levels to avoid large overhead on long names.

    Trie and B-tree structures can also be combined. E.g. you could use
    chunks of 4 bytes as trie atoms, e.g. ".com" or "n.de" at the top level.
    These 32-bit values could be stored directly in the B-tree nodes used as
    dispatch tables. With a node size of 63 or 127 you get only a few
    bits of overhead per B-tree entry, since an 8-bit value is sufficient for
    node size and node type (node/leaf). And redundant trailing domain parts
    are still deduplicated thanks to the trie structure. The next level then
    holds the preceding 4 bytes of the domain name.
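
    A minimal sketch of such a component-wise trie, with linked children
    instead of the B-tree dispatch nodes described above (all names made
    up):

        #include <stddef.h>
        #include <string.h>

        struct node {
            const char *label;    /* one domain component, e.g. "com" */
            struct node *child;   /* first child */
            struct node *next;    /* next sibling */
            int terminal;         /* a listed domain ends here */
        };

        static struct node *find(struct node *n, const char *lab, size_t len)
        {
            for (; n; n = n->next)
                if (strlen(n->label) == len && !memcmp(n->label, lab, len))
                    return n;
            return NULL;
        }

        /* Walk the components of name right to left: for "www.sun.com"
           that is "com", then "sun", then "www".  Hitting a terminal
           node means the name, or a parent domain of it, is listed. */
        static int blocked(struct node *root, const char *name)
        {
            const char *end = name + strlen(name);
            while (end > name) {
                const char *dot = end;
                while (dot > name && dot[-1] != '.')
                    dot--;
                root = find(root->child, dot, (size_t)(end - dot));
                if (!root)
                    return 0;
                if (root->terminal)
                    return 1;
                end = dot > name ? dot - 1 : name;
            }
            return 0;
        }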

    A program using calloc to allocate the necessary
    storage ends up with an >49M heap due to memory wasted by GNU malloc on $something.

    Do you allocate the storage as a single chunk? (not recommended)
    Or did you allocate every string individually? Then you get a
    considerable overhead.

    In contrast to this, using mmap to allocate an area of a suitable size
    and putting the structures and domain names there just wastes 2400
    bytes.

    mmap operates directly at the MMU level, typically allocating 4 KB pages.
    It is efficient as long as it is not used too often or for too-small chunks.

    The code will possibly also run on a modern "embedded system"/ SOHO
    router.

    Then you should /really/ optimize the data structures. The CPUs of these
    systems are usually not that fast, and even small delays in DNS queries
    are annoying.

    While these are pretty powerful nowadays, 15M of RAM getting
    sucked into malloc for no particular purpose seems excessive to
    me.

    34MB for a domain blacklist on an embedded device sounds unreasonable to
    me too. At least you should ensure that only a small part of this memory
    is "hot", i.e. used for most of the comparisons, for performance reasons.


    Marcel

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Richard Kettlewell@21:1/5 to MrSpud_pjnlvZ@r7hwq328lx.tv on Wed Jul 14 08:32:42 2021
    MrSpud_pjnlvZ@r7hwq328lx.tv writes:
    Marcel Mueller <news.5.maazl@spamgourmet.org> wrote:
    In contrast to this, using mmap to allocate an area of a suitable
    size and putting the structures and domain names there just wastes
    2400 bytes.

    mmap operates directly at the MMU level, typically allocating 4 KB pages.
    It is efficient as long as it is not used too often or for too-small chunks.

    I used mmap on Linux in a compression utility recently that had to
    scan the same file a number of times because I assumed it would be
    faster than normal I/O. Turned out I was wrong and in fact using
    open(), read() etc was about twice as fast which I still find
    confusing because I thought all the standard C I/O used mmap (or its
    kernel hook) eventually anyway. Strange.

    I did a similar experiment a couple of decades ago, with similar
    results.

    The reason, as I understand it, is that changing your process’s virtual memory mapping is relatively expensive, even compared to the copying
    involved in read().
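
    A minimal harness for repeating that comparison (Linux assumed; error
    handling omitted), timing the two functions against each other over
    the same large file:

        #include <fcntl.h>
        #include <stdint.h>
        #include <sys/mman.h>
        #include <sys/stat.h>
        #include <unistd.h>

        /* Checksum via read() into a fixed buffer. */
        static uint64_t scan_read(const char *path)
        {
            static char buf[1 << 16];   /* 64 KB per syscall */
            uint64_t sum = 0;
            int fd = open(path, O_RDONLY);
            ssize_t n;
            while ((n = read(fd, buf, sizeof buf)) > 0)
                for (ssize_t i = 0; i < n; i++)
                    sum += (unsigned char)buf[i];
            close(fd);
            return sum;
        }

        /* Checksum via one big read-only mapping. */
        static uint64_t scan_mmap(const char *path)
        {
            uint64_t sum = 0;
            int fd = open(path, O_RDONLY);
            struct stat st;
            fstat(fd, &st);
            unsigned char *p = mmap(NULL, st.st_size, PROT_READ,
                                    MAP_PRIVATE, fd, 0);
            for (off_t i = 0; i < st.st_size; i++)
                sum += p[i];
            munmap(p, st.st_size);
            close(fd);
            return sum;
        }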

    --
    https://www.greenend.org.uk/rjk/

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MrSpud_pjnlvZ@r7hwq328lx.tv@21:1/5 to Marcel Mueller on Wed Jul 14 07:24:29 2021
    On Tue, 13 Jul 2021 22:46:37 +0200
    Marcel Mueller <news.5.maazl@spamgourmet.org> wrote:
    In contrast to this, using mmap to allocate an area of a suitable size
    and putting the structures and domain names there just wastes 2400
    bytes.

    mmap operates directly at the MMU level, typically allocating 4 KB pages.
    It is efficient as long as it is not used too often or for too-small chunks.

    I used mmap on Linux in a compression utility recently that had to scan
    the same file a number of times because I assumed it would be faster than
    normal I/O. Turned out I was wrong and in fact using open(), read() etc
    was about twice as fast which I still find confusing because I thought
    all the standard C I/O used mmap (or its kernel hook) eventually anyway.
    Strange.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MrSpud_9307fg0b@swzygk2or5gveo8v.go@21:1/5 to Richard Kettlewell on Wed Jul 14 08:16:52 2021
    On Wed, 14 Jul 2021 08:32:42 +0100
    Richard Kettlewell <invalid@invalid.invalid> wrote:
    MrSpud_pjnlvZ@r7hwq328lx.tv writes:
    Marcel Mueller <news.5.maazl@spamgourmet.org> wrote:
    In contrast to this, using mmap to allocate an area of a suitable
    size and putting the structures and domain names there just wastes
    2400 bytes.

    mmap operates directly at the MMU level, typically allocating 4 KB pages.
    It is efficient as long as it is not used too often or for too-small chunks.

    I used mmap on Linux in a compression utility recently that had to
    scan the same file a number of times because I assumed it would be
    faster than normal I/O. Turned out I was wrong and in fact using
    open(), read() etc was about twice as fast which I still find
    confusing because I thought all the standard C I/O used mmap (or its
    kernel hook) eventually anyway. Strange.

    I did a similar experiment a couple of decades ago, with similar
    results.

    The reason, as I understand it, is that changing your process’s virtual
    memory mapping is relatively expensive, even compared to the copying
    involved in read().

    Ah OK, that makes sense, but it's a bit of a shame.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Nicolas George@21:1/5 to All on Wed Jul 14 09:55:01 2021
    Tavis Ormandy, in message <il6e2bFlhmcU1@mid.individual.net>, wrote:
    If it's the former, it sounds about right. I don't think this is fragmentation, just the allocator chunk metadata overhead.

    Each chunk needs some overhead to hold the size/flags and prev size. It
    seems like each entry is about 31 bytes (34M / 1118228), and you say
    (49M / 1118228) = ~45 bytes are used, so that's ~14 bytes of overhead, probably the two size_t's on x64.

    Almost exactly. A few checks show that GNU malloc has an overhead of 8
    chars = 1 size_t on x86_64, but it guarantees an alignment of 16.

    That means if you allocate between 25 and 40 chars inclusive, the size
    plus overhead comes to between 33 and 48 bytes, which gets rounded up to 48.
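
    The same rounding in code form (glibc on x86_64 assumed; the helper
    name is made up):

        #include <stddef.h>

        /* Bytes consumed by a malloc(n) chunk: add the 8-byte header,
           round up to a multiple of 16, enforce the 32-byte minimum. */
        static size_t chunk_size(size_t n)
        {
            size_t sz = (n + 8 + 15) & ~(size_t)15;
            return sz < 32 ? 32 : sz;
        }
        /* chunk_size(25) == chunk_size(40) == 48 */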

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Rainer Weikusat@21:1/5 to Tavis Ormandy on Wed Jul 14 15:11:16 2021
    Tavis Ormandy <taviso@gmail.com> writes:
    On 2021-07-13, Rainer Weikusat wrote:
    I'm currently working on something supposed to enable matching names in
    DNS queries against a large "list of bad domains you don't want to be
    talking to". The test file I'm working with has 1,118,228
    entries. Reading all of the information in it into memory needs a little
    over 34M of RAM. A program using calloc to allocate the necessary
    storage ends up with an >49M heap due to memory wasted by GNU malloc on
    $something.

    You mean each entry is a calloc(), or you do calloc(1118228, sizeof(entry))?

    If it's the former, it sounds about right. I don't think this is fragmentation, just the allocator chunk metadata overhead.

    Each chunk needs some overhead to hold the size/flags and prev size. It
    seems like each entry is about 31 bytes (34M / 1118228), and you say
    (49M / 1118228) = ~45 bytes are used, so that's ~14 bytes of overhead, probably the two size_t's on x64.

    The entries are of different sizes, although 32 bytes is by far the most
    common. Judging from

    When a chunk is in use by the application, the only data that's
    "remembered" is the size of the chunk.

    there should be an 8-byte overhead per chunk, which accounts for about
    half of the memory overhead.
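
    Working that through: 1,118,228 chunks × 8 bytes of header is about
    8.5M, roughly half of the ~15M gap, and rounding each chunk up to a
    16-byte boundary plausibly accounts for most of the rest.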

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Rainer Weikusat on Wed Jul 14 15:02:06 2021
    Rainer Weikusat <rweikusat@talktalk.net> writes:
    I'm currently working on something supposed to enable matching names in
    DNS queries against a large "list of bad domains you don't want to be
    talking to". The test file I'm working with has 1,118,228
    entries. Reading all of the information in it into memory needs a little
    over 34M of RAM. A program using calloc to allocate the necessary
    storage ends up with an >49M heap due to memory wasted by GNU malloc on $something.

    Seems like a perfect application for a tree structure keyed by domain
    component: "." at the top, then the TLD, then the subdomains, etc.

    . -> com -> sun -> www

    etc.


    Lookups will be quick as well.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Marcel Mueller@21:1/5 to All on Thu Jul 15 07:37:42 2021
    On 14.07.21 at 09:24, MrSpud_pjnlvZ@r7hwq328lx.tv wrote:
    On Tue, 13 Jul 2021 22:46:37 +0200
    Marcel Mueller <news.5.maazl@spamgourmet.org> wrote:
    In contrast to this, using mmap to allocate an area of a suitable size
    and putting the structures and domain names there just wastes 2400
    bytes.

    mmap operates directly at the MMU level, typically allocating 4 KB pages.
    It is efficient as long as it is not used too often or for too-small chunks.

    I used mmap on Linux in a compression utility recently that had to scan
    the same file a number of times because I assumed it would be faster than
    normal I/O. Turned out I was wrong and in fact using open(), read() etc
    was about twice as fast which I still find confusing because I thought
    all the standard C I/O used mmap (or its kernel hook) eventually anyway.
    Strange.

    The problem with mmap is that it does not read anything up front. You
    get a page fault every 4 KB, so the data is read in 4 KB chunks,
    allocating physical memory at the same granularity. The latter can cause
    significant fragmentation of the physical memory, leading to many TLB
    entries, and those are a limited resource. If your thread needs
    more than the hardware provides, you get further page faults.
    Reading data in 4 KB chunks was inefficient 30 years ago and is even more
    inefficient nowadays. Page faults are expensive too; you lose on the
    order of 10³ clock cycles.
    In contrast, you probably did not use read() with only 4 KB chunks. But
    even if you did, the file system cache has optimizations for sequential
    reads and uses read-ahead.
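
    On the read() side this read-ahead can also be requested explicitly
    (POSIX; a sketch):

        #include <fcntl.h>

        /* Hint that the whole file will be read sequentially, so the
           kernel can read ahead aggressively. */
        static void advise_sequential(int fd)
        {
            posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
        }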


    Marcel

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MrSpud_na@7a31.net@21:1/5 to Marcel Mueller on Thu Jul 15 08:35:33 2021
    On Thu, 15 Jul 2021 07:37:42 +0200
    Marcel Mueller <news.5.maazl@spamgourmet.org> wrote:
    In contrast, you probably did not use read() with only 4 KB chunks. But
    even if you did, the file system cache has optimizations for sequential
    reads and uses read-ahead.

    In that case does anyone know how read() etc on Linux actually access the
    file information? Seems like I was wrong about the I/O library mapping the
    file (unless it does it in chunks) so are there other kernel hooks it
    uses?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to MrSpud_na@7a31.net on Thu Jul 15 14:05:40 2021
    MrSpud_na@7a31.net writes:
    On Thu, 15 Jul 2021 07:37:42 +0200
    Marcel Mueller <news.5.maazl@spamgourmet.org> wrote:
    In contrast, you probably did not use read() with only 4 KB chunks. But
    even if you did, the file system cache has optimizations for sequential
    reads and uses read-ahead.

    In that case does anyone know how read() etc on Linux actually access the
    file information? Seems like I was wrong about the I/O library mapping the
    file (unless it does it in chunks) so are there other kernel hooks it
    uses?

    In the normal case, read() will issue a request through the I/O subsystem
    which will find its way to the filesystem-specific code, which will
    read the 4k (on intel, 4/16/64k on armv8) block containing the required
    data into an unused page of memory. Subsequent reads will be satisfied
    from that page until the read crosses a page boundary, in which case the
    kernel will read the next page of data from the file and the read will
    be satisfied from that. This is generally known as the page cache,
    the size of which varies from a configured minimum to as much memory
    as is currently not used for any other purpose. Pages that
    exceed the configured minimum will be reclaimed if needed for
    e.g. fork() of a new process, with those in the configured minimum
    subject to a replacement algorithm (e.g. LRU).

    O_DIRECT can be used to bypass the page cache and read data directly
    into the application address space.
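
    A sketch of the O_DIRECT variant (Linux; O_DIRECT requires the buffer
    address, file offset and transfer size to be suitably aligned, and a
    4096-byte alignment is assumed here):

        #define _GNU_SOURCE            /* for O_DIRECT */
        #include <fcntl.h>
        #include <stdlib.h>
        #include <unistd.h>

        /* Read the first len bytes of a file, bypassing the page cache;
           len should be a multiple of 4096. */
        static ssize_t read_direct(const char *path, void **out, size_t len)
        {
            void *buf = NULL;
            int fd = open(path, O_RDONLY | O_DIRECT);
            if (fd < 0)
                return -1;
            if (posix_memalign(&buf, 4096, len) != 0) {
                close(fd);
                return -1;
            }
            ssize_t n = read(fd, buf, len);
            close(fd);
            *out = buf;
            return n;
        }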

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Marcel Mueller on Thu Jul 15 14:00:02 2021
    Marcel Mueller <news.5.maazl@spamgourmet.org> writes:
    Am 14.07.21 um 09:24 schrieb MrSpud_pjnlvZ@r7hwq328lx.tv:
    On Tue, 13 Jul 2021 22:46:37 +0200
    Marcel Mueller <news.5.maazl@spamgourmet.org> wrote:
    In contrast to this, using mmap to allocate an area of a suitable size
    and putting the structures and domain names there just wastes 2400
    bytes.

    mmap operates directly at the MMU level, typically allocating 4 KB pages.
    It is efficient as long as it is not used too often or for too-small chunks.

    I used mmap on Linux in a compression utility recently that had to scan
    the same file a number of times because I assumed it would be faster than
    normal I/O. Turned out I was wrong and in fact using open(), read() etc
    was about twice as fast which I still find confusing because I thought
    all the standard C I/O used mmap (or its kernel hook) eventually anyway.
    Strange.

    The problem with mmap is that it does not read anything up front. You
    get a page fault every 4 KB, so the data is read in 4 KB chunks,
    allocating physical memory at the same granularity.

    (1) the underlying page size varies by processor (e.g. 4k, 16k or 64k on
    ARMv8) and arm/intel/amd processors can configure larger block
    sizes that will consume a single TLB entry per block (4M on 32-bit x86,
    2M/1G on 64-bit x86_64, and a wide range of sizes on ARMv8).

    And mmap on Linux allows one to map using large pages.

    (2) one can use madvise to tell the kernel that the access
    is sequential allowing the kernel to both pre-read successive
    pages before the page fault and invalidate the TLB entry for
    the prior page.

    (3) The Transparent Huge Pages feature on Linux will automatically
    use large pages when conditions allow.
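
    A minimal illustration of point (2), assuming Linux (the helper name
    is made up):

        #include <stddef.h>
        #include <sys/mman.h>

        /* Map a file read-only and hint sequential access: the kernel
           can then read ahead aggressively and reclaim pages behind the
           reader; MADV_WILLNEED starts populating pages before the
           first fault. */
        static void *map_sequential(int fd, size_t len)
        {
            void *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
            if (p == MAP_FAILED)
                return NULL;
            madvise(p, len, MADV_SEQUENTIAL);
            madvise(p, len, MADV_WILLNEED);
            return p;
        }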


    The latter could cause
    significant fragmentation of the physical memory leading to many TLB
    entries.

    The size of the TLB(s) is fixed. There are so many 4K I/DTLBs
    associated with the L1 cache, and so many 4k/2M/4M/1G TLB entries
    associated with the L2 cache. They're filled on first reference
    and replaced typically using a form of LRU replacement algorithm.

    Check processor specific documentation for the size of the TLBs on
    each processor.

    The number of pages used for the mmap is irrelevant with respect
    to TLB pressure - the access pattern is far more relevant.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Richard Kettlewell@21:1/5 to MrSpud_na@7a31.net on Thu Jul 15 17:09:54 2021
    MrSpud_na@7a31.net writes:
    Marcel Mueller <news.5.maazl@spamgourmet.org> wrote:
    In contrast, you probably did not use read() with only 4 KB chunks. But
    even if you did, the file system cache has optimizations for sequential
    reads and uses read-ahead.

    In that case does anyone know how read() etc on Linux actually access the file information? Seems like I was wrong about the I/O library mapping the file (unless it does it in chunks) so are there other kernel hooks it
    uses?

    Scott has given you most of the answer. The connection to memory mapping
    is that when you memory map (part of) a file, the physical RAM used for
    that mapping is the same RAM that is used for the kernel’s page cache for
    the corresponding part of the file (with some caveats about private
    mappings).

    --
    https://www.greenend.org.uk/rjk/

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From James K. Lowden@21:1/5 to Richard Kettlewell on Thu Jul 15 12:38:19 2021
    On Wed, 14 Jul 2021 08:32:42 +0100
    Richard Kettlewell <invalid@invalid.invalid> wrote:

    I used mmap on Linux in a compression utility recently that had to
    scan the same file a number of times because I assumed it would be
    faster than normal I/O. Turned out I was wrong and in fact using
    open(), read() etc was about twice as fast which I still find
    confusing because I thought all the standard C I/O used mmap (or its
    kernel hook) eventually anyway. Strange.

    I did a similar experiment a couple of decades ago, with similar
    results.

    The reason, as I understand it, is that changing your process's
    virtual memory mapping is relatively expensive, even compared to the
    copying involved in read().

    My working hypothesis is that the cost of rearranging the process's
    memory exceeds the cost of copying the data into existing memory when
    the file is scanned only once (N = 1). For some larger value of N, the
    rearrangement cost is amortized, and total I/O cost is minimized using
    mmap.

    I'm relying on empirical evidence. Linux uses mmap for executables, so
    it can't be terrible. And LMDB has had great success with performance
    using memory-mapped stores.

    --jkl

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Richard Kettlewell on Thu Jul 15 17:22:10 2021
    Richard Kettlewell <invalid@invalid.invalid> writes:
    MrSpud_na@7a31.net writes:
    Marcel Mueller <news.5.maazl@spamgourmet.org> wrote:
    In contrast, you probably did not use read() with only 4 KB chunks. But
    even if you did, the file system cache has optimizations for sequential
    reads and uses read-ahead.

    In that case does anyone know how read() etc on Linux actually access the
    file information? Seems like I was wrong about the I/O library mapping the >> file (unless it does it in chunks) so are there other kernel hooks it
    uses?

    Scott has given you most of the answer. The connection to memory mapping
    is that when you memory map (part of) a file, the physical RAM used for
    that mapping is the same RAM that is used for the kernel’s page cache for
    the corresponding part of the file (with some caveats about private
    mappings).

    And some caveats about huge pages. The Linux mmap() allows the
    application to specify the use of legacy huge pages (2 MB on x86_64,
    4 MB on ia32) for the mapping, in which case the mapping uses physical
    memory reserved at boot time. (TLB entries for huge pages must be
    aligned to the page size, which makes creating a properly aligned huge
    page out of a 4k pool of pages at run time a very difficult and costly
    proposition.)

    More recently, Linux added THP (Transparent Huge Pages), which will
    opportunistically use huge pages when circumstances allow.

    Using large pages when appropriate can significantly improve the
    TLB hit rate; table walks (particularly in virtual machines) can
    be quite expensive.
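
    A sketch of the explicit variant (Linux; falls back to ordinary
    pages, where THP may still apply; the function name is made up):

        #include <stddef.h>
        #include <sys/mman.h>

        /* Try a hugetlbfs-backed mapping first; this needs pages
           reserved via /proc/sys/vm/nr_hugepages and a length that is a
           multiple of the huge page size. */
        static void *alloc_maybe_huge(size_t len)
        {
            void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
                           -1, 0);
            if (p != MAP_FAILED)
                return p;
            return mmap(NULL, len, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        }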

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)