• GNU malloc and external fragmentation

    From Rainer Weikusat@21:1/5 to All on Tue Jul 13 20:18:20 2021
    I'm currently working on something supposed to enable matching names in
    DNS queries against a large "list of bad domains you don't want to be
    talking to". The test file I'm working with has 1,118,228
    entries. Reading all of the information in it into memory needs a little
    over 34M of RAM. A program using calloc to allocate the necessary
    storage ends up with an >49M heap due to memory wasted by GNU malloc on $something.

    In contrast to this, using mmap to allocate an area of a suitable size
    and putting the structures and domain names there just wastes 2400
    bytes.
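
    What follows is a minimal sketch of that approach, assuming Linux; the
    names arena_init/arena_alloc are made up. Everything is carved out of
    one anonymous mapping, so there is no per-allocation header at all:

        #include <stddef.h>
        #include <sys/mman.h>

        static char *arena, *arena_end;

        /* Reserve one contiguous region up front. */
        static int arena_init(size_t size)
        {
            arena = mmap(NULL, size, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (arena == MAP_FAILED)
                return -1;
            arena_end = arena + size;
            return 0;
        }

        /* Hand out structures and strings from the region; the only
           waste is alignment padding plus the tail of the last page. */
        static void *arena_alloc(size_t n)
        {
            n = (n + 7) & ~(size_t)7;          /* keep 8-byte alignment */
            if ((size_t)(arena_end - arena) < n)
                return NULL;
            void *p = arena;
            arena += n;
            return p;
        }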

    The code will possibly also run on a modern "embedded system"/ SOHO
    router. While these are pretty powerful nowadays, 15M of RAM getting
    sucked into malloc for no particular purpose seems excessive to
    me.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tavis Ormandy@21:1/5 to Rainer Weikusat on Tue Jul 13 21:14:51 2021
    On 2021-07-13, Rainer Weikusat wrote:
    I'm currently working on something supposed to enable matching names in
    DNS queries against a large "list of bad domains you don't want to be
    talking to". The test file I'm working with has 1,118,228
    entries. Reading all of the information in it into memory needs a little
    over 34M of RAM. A program using calloc to allocate the necessary
    storage ends up with an >49M heap due to memory wasted by GNU malloc on $something.

    You mean each entry is a calloc(), or you do calloc(1118228, sizeof(entry))?

    If it's the former, it sounds about right. I don't think this is
    fragmentation, just the allocator chunk metadata overhead.

    Each chunk needs some overhead to hold the size/flags and prev size. It
    seems like each entry is about 31 bytes (34M / 1118228), and you say
    (49M / 1118228) = ~45 bytes are used, so that's ~14 bytes of overhead,
    probably the two size_t's on x64.

    https://sourceware.org/glibc/wiki/MallocInternals#What_is_a_Chunk.3F
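
    One quick way to check this from a program (glibc assumed; note that
    malloc_usable_size() is glibc-specific):

        #include <malloc.h>   /* malloc_usable_size */
        #include <stdio.h>
        #include <stdlib.h>

        int main(void)
        {
            enum { N = 1000000, REQ = 31 };
            size_t usable = 0;
            for (int i = 0; i < N; i++)
                usable += malloc_usable_size(malloc(REQ));
            /* Add the 8-byte size/flags header each chunk carries. */
            printf("requested %zu, resident roughly %zu\n",
                   (size_t)N * REQ, usable + (size_t)N * 8);
            return 0;
        }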

    Tavis.

    --
    _o) $ lynx lock.cmpxchg8b.com
    /\\ _o) _o) $ finger taviso@sdf.org
    _\_V _( ) _( ) @taviso

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Marcel Mueller@21:1/5 to All on Tue Jul 13 22:46:37 2021
    On 13.07.21 at 21:18, Rainer Weikusat wrote:
    I'm currently working on something supposed to enable matching names in
    DNS queries against a large "list of bad domains you don't want to be
    talking to". The test file I'm working with has 1,118,228
    entries. Reading all of the information in it into memory needs a little
    over 34M of RAM.

    Too much for reasonable cache efficiency. You should consider more
    efficient data structures.

    I wrote something like that about 25 years ago (domain & URL blacklists
    & whitelists for security critical environments).
    Trie (not tree!) structures could significantly reduce the amount of
    memory and avoid excessive random memory access. You could use
    single characters or even domain components like the TLD as the atoms
    of the trie.
    Furthermore tree lookups have high cache efficiency (compared to hash
    tables) since every search starts at the same root.
    In fact you need hybrid structures that decay to ordinary B trees at
    deeper levels to avoid large overhead on long names.

    Trie and B-tree structures can also be combined. E.g. you could use
    chunks of 4 bytes as trie atoms, e.g. ".com" or "n.de" at the top level.
    These 32-bit values could be stored directly in the B-tree nodes used as
    dispatch tables. With a node size of 63 or 127 you get only a few
    bits of overhead per B-tree entry, since an 8-bit value is sufficient for
    node size and node type (node/leaf). And redundant trailing domain parts
    are still deduplicated thanks to the trie structure. The next level then
    holds the preceding 4 bytes of the domain name.
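
    A minimal sketch of such a component-wise trie, with linked children
    instead of the B-tree dispatch nodes described above (all names made
    up):

        #include <stddef.h>
        #include <string.h>

        struct node {
            const char *label;    /* one domain component, e.g. "com" */
            struct node *child;   /* first child */
            struct node *next;    /* next sibling */
            int terminal;         /* a listed domain ends here */
        };

        static struct node *find(struct node *n, const char *lab, size_t len)
        {
            for (; n; n = n->next)
                if (strlen(n->label) == len && !memcmp(n->label, lab, len))
                    return n;
            return NULL;
        }

        /* Walk the components of name right to left: for "www.sun.com"
           that is "com", then "sun", then "www".  Hitting a terminal
           node means the name, or a parent domain of it, is listed. */
        static int blocked(struct node *root, const char *name)
        {
            const char *end = name + strlen(name);
            while (end > name) {
                const char *dot = end;
                while (dot > name && dot[-1] != '.')
                    dot--;
                root = find(root->child, dot, (size_t)(end - dot));
                if (!root)
                    return 0;
                if (root->terminal)
                    return 1;
                end = dot > name ? dot - 1 : name;
            }
            return 0;
        }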

    A program using calloc to allocate the necessary
    storage ends up with an >49M heap due to memory wasted by GNU malloc on $something.

    Do you allocate the storage as a single chunk? (not recommended)
    Or did you allocate every string individually? Then you get a
    considerable overhead.

    In contrast to this, using mmap to allocate an area of a suitable size
    and putting the structures and domain names there just wastes 2400
    bytes.

    mmap operates directly at the MMU level, typically allocating 4 KB pages.
    It is efficient as long as it is not used too often or for too-small chunks.

    The code will possibly also run on a modern "embedded system"/ SOHO
    router.

    Then you should /really/ optimize the data structures. The CPUs of these
    systems are usually not that fast, and even small delays in DNS queries
    are annoying.

    While these are pretty powerful nowadays, 15M of RAM getting
    sucked into malloc for no particular purpose seems excessive to
    me.

    34MB for a domain blacklist on an embedded device sounds unreasonable to
    me too. At least you should ensure that only a small part of this memory
    is "hot", i.e. used for most of the comparisons, for performance reasons.


    Marcel

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Richard Kettlewell@21:1/5 to MrSpud_pjnlvZ@r7hwq328lx.tv on Wed Jul 14 08:32:42 2021
    MrSpud_pjnlvZ@r7hwq328lx.tv writes:
    Marcel Mueller <news.5.maazl@spamgourmet.org> wrote:
    In contrast to this, using mmap to allocate an area of a suitable
    size and putting the structures and domain names there just wastes
    2400 bytes.

    mmap operates directly at the MMU level, typically allocating 4 KB pages.
    It is efficient as long as it is not used too often or for too-small chunks.

    I used mmap on Linux in a compression utility recently that had to
    scan the same file a number of times because I assumed it would be
    faster than normal I/O. Turned out I was wrong and in fact using
    open(), read() etc was about twice as fast which I still find
    confusing because I thought all the standard C I/O used mmap (or its
    kernel hook) eventually anyway. Strange.

    I did a similar experiment a couple of decades ago, with similar
    results.

    The reason, as I understand it, is that changing your process’s virtual memory mapping is relatively expensive, even compared to the copying
    involved in read().
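
    A minimal harness for repeating that comparison (Linux assumed; error
    handling omitted), timing the two functions against each other over
    the same large file:

        #include <fcntl.h>
        #include <stdint.h>
        #include <sys/mman.h>
        #include <sys/stat.h>
        #include <unistd.h>

        /* Checksum via read() into a fixed buffer. */
        static uint64_t scan_read(const char *path)
        {
            static char buf[1 << 16];   /* 64 KB per syscall */
            uint64_t sum = 0;
            int fd = open(path, O_RDONLY);
            ssize_t n;
            while ((n = read(fd, buf, sizeof buf)) > 0)
                for (ssize_t i = 0; i < n; i++)
                    sum += (unsigned char)buf[i];
            close(fd);
            return sum;
        }

        /* Checksum via one big read-only mapping. */
        static uint64_t scan_mmap(const char *path)
        {
            uint64_t sum = 0;
            int fd = open(path, O_RDONLY);
            struct stat st;
            fstat(fd, &st);
            unsigned char *p = mmap(NULL, st.st_size, PROT_READ,
                                    MAP_PRIVATE, fd, 0);
            for (off_t i = 0; i < st.st_size; i++)
                sum += p[i];
            munmap(p, st.st_size);
            close(fd);
            return sum;
        }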

    --
    https://www.greenend.org.uk/rjk/

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MrSpud_pjnlvZ@r7hwq328lx.tv@21:1/5 to Marcel Mueller on Wed Jul 14 07:24:29 2021
    On Tue, 13 Jul 2021 22:46:37 +0200
    Marcel Mueller <news.5.maazl@spamgourmet.org> wrote:
    In contrast to this, using mmap to allocate an area of a suitable size
    and putting the structures and domain names there just wastes 2400
    bytes.

    mmap operates directly at the MMU level, typically allocating 4 KB pages.
    It is efficient as long as it is not used too often or for too-small chunks.

    I used mmap on Linux in a compression utility recently that had to scan
    the same file a number of times because I assumed it would be faster than
    normal I/O. Turned out I was wrong and in fact using open(), read() etc
    was about twice as fast which I still find confusing because I thought
    all the standard C I/O used mmap (or its kernel hook) eventually anyway.
    Strange.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MrSpud_9307fg0b@swzygk2or5gveo8v.go@21:1/5 to Richard Kettlewell on Wed Jul 14 08:16:52 2021
    On Wed, 14 Jul 2021 08:32:42 +0100
    Richard Kettlewell <invalid@invalid.invalid> wrote:
    MrSpud_pjnlvZ@r7hwq328lx.tv writes:
    Marcel Mueller <news.5.maazl@spamgourmet.org> wrote:
    In contrast to this, using mmap to allocate an area of a suitable
    size and putting the structures and domain names there just wastes
    2400 bytes.

    mmap operates directly at the MMU level, typically allocating 4 KB pages.
    It is efficient as long as it is not used too often or for too-small chunks.

    I used mmap on Linux in a compression utility recently that had to
    scan the same file a number of times because I assumed it would be
    faster than normal I/O. Turned out I was wrong and in fact using
    open(), read() etc was about twice as fast which I still find
    confusing because I thought all the standard C I/O used mmap (or its
    kernel hook) eventually anyway. Strange.

    I did a similar experiment a couple of decades ago, with similar
    results.

    The reason, as I understand it, is that changing your process’s virtual
    memory mapping is relatively expensive, even compared to the copying
    involved in read().

    Ah OK, that makes sense, but it's a bit of a shame.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Nicolas George@21:1/5 to All on Wed Jul 14 09:55:01 2021
    Tavis Ormandy, in message <il6e2bFlhmcU1@mid.individual.net>, wrote:
    If it's the former, it sounds about right. I don't think this is fragmentation, just the allocator chunk metadata overhead.

    Each chunk needs some overhead to hold the size/flags and prev size. It
    seems like each entry is about 31 bytes (34M / 1118228), and you say
    (49M / 1118228) = ~45 bytes are used, so that's ~14 bytes of overhead, probably the two size_t's on x64.

    Almost exactly. A few checks show that GNU malloc has an overhead of 8
    chars = 1 size_t on x86_64, but it guarantees an alignment of 16.

    That means if you allocate between 25 and 40 chars inclusive, the size
    plus overhead comes to between 33 and 48 bytes, which gets rounded up to 48.
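
    The same rounding in code form (glibc on x86_64 assumed; the helper
    name is made up):

        #include <stddef.h>

        /* Bytes consumed by a malloc(n) chunk: add the 8-byte header,
           round up to a multiple of 16, enforce the 32-byte minimum. */
        static size_t chunk_size(size_t n)
        {
            size_t sz = (n + 8 + 15) & ~(size_t)15;
            return sz < 32 ? 32 : sz;
        }
        /* chunk_size(25) == chunk_size(40) == 48 */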

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Rainer Weikusat@21:1/5 to Tavis Ormandy on Wed Jul 14 15:11:16 2021
    Tavis Ormandy <taviso@gmail.com> writes:
    On 2021-07-13, Rainer Weikusat wrote:
    I'm currently working on something supposed to enable matching names in
    DNS queries against a large "list of bad domains you don't want to be
    talking to". The test file I'm working with has 1,118,228
    entries. Reading all of the information in it into memory needs a little
    over 34M of RAM. A program using calloc to allocate the necessary
    storage ends up with an >49M heap due to memory wasted by GNU malloc on
    $something.

    You mean each entry is a calloc(), or you do calloc(1118228, sizeof(entry))?

    If it's the former, it sounds about right. I don't think this is fragmentation, just the allocator chunk metadata overhead.

    Each chunk needs some overhead to hold the size/flags and prev size. It
    seems like each entry is about 31 bytes (34M / 1118228), and you say
    (49M / 1118228) = ~45 bytes are used, so that's ~14 bytes of overhead, probably the two size_t's on x64.

    The entries are of different sizes, although 32 bytes is by far the most
    common. Judging from

    When a chunk is in use by the application, the only data that's
    "remembered" is the size of the chunk.

    there should be an 8-byte overhead per chunk, which accounts for about
    half of the memory overhead.
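
    Working that through: 1,118,228 chunks × 8 bytes of header is about
    8.5M, roughly half of the ~15M gap, and rounding each chunk up to a
    16-byte boundary plausibly accounts for most of the rest.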

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Rainer Weikusat on Wed Jul 14 15:02:06 2021
    Rainer Weikusat <rweikusat@talktalk.net> writes:
    I'm currently working on something supposed to enable matching names in
    DNS queries against a large "list of bad domains you don't want to be
    talking to". The test file I'm working with has 1,118,228
    entries. Reading all of the information in it into memory needs a little
    over 34M of RAM. A program using calloc to allocate the necessary
    storage ends up with an >49M heap due to memory wasted by GNU malloc on $something.

    Seems like a perfect application for a tree structure keyed by domain
    component: "." at the top, then the TLD, then the subdomains, etc.

    . -> com -> sun -> www

    etc.


    Lookups will be quick as well.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Marcel Mueller@21:1/5 to All on Thu Jul 15 07:37:42 2021
    On 14.07.21 at 09:24, MrSpud_pjnlvZ@r7hwq328lx.tv wrote:
    On Tue, 13 Jul 2021 22:46:37 +0200
    Marcel Mueller <news.5.maazl@spamgourmet.org> wrote:
    In contrast to this, using mmap to allocate an area of a suitable size
    and putting the structures and domain names there just wastes 2400
    bytes.

    mmap operates directly at the MMU level, typically allocating 4 KB pages.
    It is efficient as long as it is not used too often or for too-small chunks.

    I used mmap on Linux in a compression utility recently that had to scan
    the same file a number of times because I assumed it would be faster than
    normal I/O. Turned out I was wrong and in fact using open(), read() etc
    was about twice as fast which I still find confusing because I thought
    all the standard C I/O used mmap (or its kernel hook) eventually anyway.
    Strange.

    The problem with mmap is that it does not read anything up front. You
    get a page fault every 4 KB, so the data is read in 4 KB chunks,
    allocating physical memory at the same granularity. The latter can cause
    significant fragmentation of the physical memory, leading to many TLB
    entries, and those are a limited resource. If your thread needs
    more than the hardware provides, you get further page faults.
    Reading data in 4 KB chunks was inefficient 30 years ago and is even more
    inefficient nowadays. Page faults are expensive too; you lose on the
    order of 10³ clock cycles.
    In contrast, you probably did not use read() with only 4 KB chunks. But
    even if you did, the file system cache has optimizations for sequential
    reads and uses read-ahead.
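
    On the read() side this read-ahead can also be requested explicitly
    (POSIX; a sketch):

        #include <fcntl.h>

        /* Hint that the whole file will be read sequentially, so the
           kernel can read ahead aggressively. */
        static void advise_sequential(int fd)
        {
            posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
        }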


    Marcel

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MrSpud_na@7a31.net@21:1/5 to Marcel Mueller on Thu Jul 15 08:35:33 2021
    On Thu, 15 Jul 2021 07:37:42 +0200
    Marcel Mueller <news.5.maazl@spamgourmet.org> wrote:
    In contrast, you probably did not use read() with only 4 KB chunks. But
    even if you did, the file system cache has optimizations for sequential
    reads and uses read-ahead.

    In that case does anyone know how read() etc on Linux actually access the
    file information? Seems like I was wrong about the I/O library mapping the
    file (unless it does it in chunks) so are there other kernel hooks it
    uses?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to MrSpud_na@7a31.net on Thu Jul 15 14:05:40 2021
    MrSpud_na@7a31.net writes:
    On Thu, 15 Jul 2021 07:37:42 +0200
    Marcel Mueller <news.5.maazl@spamgourmet.org> wrote:
    In contrast, you probably did not use read() with only 4 KB chunks. But
    even if you did, the file system cache has optimizations for sequential
    reads and uses read-ahead.

    In that case does anyone know how read() etc on Linux actually access the
    file information? Seems like I was wrong about the I/O library mapping the
    file (unless it does it in chunks) so are there other kernel hooks it
    uses?

    In the normal case, read() will issue a request through the I/O subsystem
    which will find its way to the filesystem-specific code, which will
    read the 4k (on intel, 4/16/64k on armv8) block containing the required
    data into an unused page of memory. Subsequent reads will be satisfied
    from that page until the read crosses a page boundary, in which case the
    kernel will read the next page of data from the file and the read will
    be satisfied from that. This is generally known as the page cache,
    the size of which varies from a configured minimum to as much memory
    as is currently not used for any other purpose. Pages that
    exceed the configured minimum will be reclaimed if needed for
    e.g. fork() of a new process, with those in the configured minimum
    subject to a replacement algorithm (e.g. LRU).

    O_DIRECT can be used to bypass the page cache and read data directly
    into the application address space.
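
    A sketch of the O_DIRECT variant (Linux; O_DIRECT requires the buffer
    address, file offset and transfer size to be suitably aligned, and a
    4096-byte alignment is assumed here):

        #define _GNU_SOURCE            /* for O_DIRECT */
        #include <fcntl.h>
        #include <stdlib.h>
        #include <unistd.h>

        /* Read the first len bytes of a file, bypassing the page cache;
           len should be a multiple of 4096. */
        static ssize_t read_direct(const char *path, void **out, size_t len)
        {
            void *buf = NULL;
            int fd = open(path, O_RDONLY | O_DIRECT);
            if (fd < 0)
                return -1;
            if (posix_memalign(&buf, 4096, len) != 0) {
                close(fd);
                return -1;
            }
            ssize_t n = read(fd, buf, len);
            close(fd);
            *out = buf;
            return n;
        }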

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Marcel Mueller on Thu Jul 15 14:00:02 2021
    Marcel Mueller <news.5.maazl@spamgourmet.org> writes:
    Am 14.07.21 um 09:24 schrieb MrSpud_pjnlvZ@r7hwq328lx.tv:
    On Tue, 13 Jul 2021 22:46:37 +0200
    Marcel Mueller <news.5.maazl@spamgourmet.org> wrote:
    In contrast to this, using mmap to allocate an area of a suitable size
    and putting the structures and domain names there just wastes 2400
    bytes.

    mmap operates directly at the MMU level, typically allocating 4 KB pages.
    It is efficient as long as it is not used too often or for too-small chunks.

    I used mmap on Linux in a compression utility recently that had to scan
    the same file a number of times because I assumed it would be faster than
    normal I/O. Turned out I was wrong and in fact using open(), read() etc
    was about twice as fast which I still find confusing because I thought
    all the standard C I/O used mmap (or its kernel hook) eventually anyway.
    Strange.

    The problem with mmap is that it does not read anything up front. You
    get a page fault every 4 KB, so the data is read in 4 KB chunks,
    allocating physical memory at the same granularity.

    (1) the underlying page size varies by processor (e.g. 4k, 16k or 64k on
    ARMv8) and arm/intel/amd processors can configure larger block
    sizes that will consume a single TLB entry per block (4M on 32-bit x86,
    2M/1G on 64-bit x86_64, and a wide range of sizes on ARMv8).

    And mmap on Linux allows one to map using large pages.

    (2) one can use madvise to tell the kernel that the access
    is sequential allowing the kernel to both pre-read successive
    pages before the page fault and invalidate the TLB entry for
    the prior page.

    (3) The Transparent Huge Pages feature on Linux will automatically
    use large pages when conditions allow.
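
    A minimal illustration of point (2), assuming Linux (the helper name
    is made up):

        #include <stddef.h>
        #include <sys/mman.h>

        /* Map a file read-only and hint sequential access: the kernel
           can then read ahead aggressively and reclaim pages behind the
           reader; MADV_WILLNEED starts populating pages before the
           first fault. */
        static void *map_sequential(int fd, size_t len)
        {
            void *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
            if (p == MAP_FAILED)
                return NULL;
            madvise(p, len, MADV_SEQUENTIAL);
            madvise(p, len, MADV_WILLNEED);
            return p;
        }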


    The latter could cause
    significant fragmentation of the physical memory leading to many TLB
    entries.

    The size of the TLB(s) is fixed. There are so many 4K I/DTLBs
    associated with the L1 cache, and so many 4k/2M/4M/1G TLB entries
    associated with the L2 cache. They're filled on first reference
    and replaced typically using a form of LRU replacement algorithm.

    Check processor specific documentation for the size of the TLBs on
    each processor.

    The number of pages used for the mmap is irrelevant with respect
    to TLB pressure - the access pattern is far more relevant.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Richard Kettlewell@21:1/5 to MrSpud_na@7a31.net on Thu Jul 15 17:09:54 2021
    MrSpud_na@7a31.net writes:
    Marcel Mueller <news.5.maazl@spamgourmet.org> wrote:
    In contrast, you probably did not use read() with only 4 KB chunks. But
    even if you did, the file system cache has optimizations for sequential
    reads and uses read-ahead.

    In that case does anyone know how read() etc on Linux actually access the file information? Seems like I was wrong about the I/O library mapping the file (unless it does it in chunks) so are there other kernel hooks it
    uses?

    Scott has given you most of the answer. The connection to memory mapping
    is that when you memory map (part of) a file, the physical RAM used for
    that mapping is the same RAM that is used for the kernel’s page cache for
    the corresponding part of the file (with some caveats about private
    mappings).

    --
    https://www.greenend.org.uk/rjk/

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From James K. Lowden@21:1/5 to Richard Kettlewell on Thu Jul 15 12:38:19 2021
    On Wed, 14 Jul 2021 08:32:42 +0100
    Richard Kettlewell <invalid@invalid.invalid> wrote:

    I used mmap on Linux in a compression utility recently that had to
    scan the same file a number of times because I assumed it would be
    faster than normal I/O. Turned out I was wrong and in fact using
    open(), read() etc was about twice as fast which I still find
    confusing because I thought all the standard C I/O used mmap (or its
    kernel hook) eventually anyway. Strange.

    I did a similar experiment a couple of decades ago, with similar
    results.

    The reason, as I understand it, is that changing your process's
    virtual memory mapping is relatively expensive, even compared to the
    copying involved in read().

    My working hypothesis is that the cost of rearranging the process's
    memory exceeds the cost of copying the data into existing memory when
    the file is scanned only once (N = 1). For some larger value of N, the
    rearrangement cost is amortized, and total I/O cost is minimized using
    mmap.

    I'm relying on empirical evidence. Linux uses mmap for executables, so
    it can't be terrible. And LMDB has had great success with performance
    using memory-mapped stores.

    --jkl

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Richard Kettlewell on Thu Jul 15 17:22:10 2021
    Richard Kettlewell <invalid@invalid.invalid> writes:
    MrSpud_na@7a31.net writes:
    Marcel Mueller <news.5.maazl@spamgourmet.org> wrote:
    In contrast, you probably did not use read() with only 4 KB chunks. But
    even if you did, the file system cache has optimizations for sequential
    reads and uses read-ahead.

    In that case does anyone know how read() etc on Linux actually access the
    file information? Seems like I was wrong about the I/O library mapping the >> file (unless it does it in chunks) so are there other kernel hooks it
    uses?

    Scott has given you most of the answer. The connection to memory mapping
    is that when you memory map (part of) a file, the physical RAM used for
    that mapping is the same RAM that is used for the kernel’s page cache for
    the corresponding part of the file (with some caveats about private
    mappings).

    And some caveats about huge pages. The Linux mmap() allows the
    application to specify the use of legacy huge pages (2 MB on x86_64,
    4 MB on ia32) for the mapping, in which case the mapping uses physical
    memory reserved at boot time. (TLB entries for huge pages must be
    aligned to the page size, which makes creating a properly aligned huge
    page out of a 4k pool of pages at run time a very difficult and costly
    proposition.)

    More recently, Linux added THP (Transparent Huge Pages), which will
    opportunistically use huge pages when circumstances allow.

    Using large pages when appropriate can significantly improve the
    TLB hit rate; table walks (particularly in virtual machines) can
    be quite expensive.
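
    A sketch of the explicit variant (Linux; falls back to ordinary
    pages, where THP may still apply; the function name is made up):

        #include <stddef.h>
        #include <sys/mman.h>

        /* Try a hugetlbfs-backed mapping first; this needs pages
           reserved via /proc/sys/vm/nr_hugepages and a length that is a
           multiple of the huge page size. */
        static void *alloc_maybe_huge(size_t len)
        {
            void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
                           -1, 0);
            if (p != MAP_FAILED)
                return p;
            return mmap(NULL, len, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        }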

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)