So just for giggles, I've been thinking more about what (if anything) can be done to improve system performance by tweaks to a CPU's cache architecture.
I've always thought that having a separate cache for supervisor mode references and user mode references _SHOULD_ make things faster, but when I poked around old stuff on the Internet about caches from the beginning of time, I found that while Once Upon A Time, separate supervisor mode and user mode caches were considered something to try, they were apparently abandoned because a unified cache seemed to work better in simulations. Surprise, surprise.
This seems just so odd to me and so I've been wondering how much this result is an artifact of the toy OS that was used in the simulations (Unix) or the (by today's standards) small single layer caches used.
This got me to thinking about the 1110 AKA 1100/40 which had no caches but did have two different types of memory with different access speeds.
(I've always thought of the 1110 AKA 1100/40 as such an ugly machine that I've always ignored it and therefore remained ignorant of it even when I worked for the Company.)
To the extent that the faster (but smaller) memory could be viewed as a "cache" with a 100% hit rate, I've been wondering about how performance differed based on memory placement back then.
Was the Exec (whatever level it might have been ... between 27 and 33?) mostly or wholly loaded into the faster memory?
Was there special code (I think there was) that prioritized placement of certain things in memory and if so how?
What sort of performance gains did use of the faster memory produce or conversely what sort of performance penalties occur when it wasn't?
IOW, anyone care to dig up some old memories about the 1110 AKA 1100/40 you'd like to regale me with? Enquiring minds want to know.
Lewis Cole <l_cole@juno.com> writes:
So just for giggles, I've been thinking more about what (if anything) can be done to improve system performance by tweaks to a CPU's cache architecture.
I've always thought that having a separate cache for supervisor mode references and user mode references _SHOULD_ make things faster, but when I poked around old stuff on the Internet about caches from the beginning of time, I found that while Once Upon A Time, separate supervisor mode and user mode caches were considered something to try, they were apparently abandoned because a unified cache seemed to work better in simulations. Surprise, surprise.
This seems just so odd to me and so I've been wondering how much this result is an artifact of the toy OS that was used in the simulations (Unix) or the (by today's standards) small single layer caches used.
Toy OS?
On 8/13/2023 10:18 PM, Lewis Cole wrote:
So just for giggles, I've been thinking
more about what (if anything) can be
done to improve system performance by
tweaks to a CPU's cache architecture.
I've always thought that having a
separate cache for supervisor mode
references and user mode references
_SHOULD_ make things faster, but when
I poked around old stuff on the
Internet about caches from the
beginning of time, I found that while
Once Upon A Time, separate supervisor
mode and user mode caches were
considered something to try, they
were apparently abandoned because a
unified cache seemed to work better
in simulations. Surprise, surprise.
Yeah. Only using half the cache at any one time would seem to decrease performance. :-)
This seems just so odd to me and so
I've been wondering how much this
result is an artifact of the toy OS
that as used in the simulations
(Unix) or the (by today's standards)
small single layer caches used.
This got me to thinking about the
1110 AKA 1100/40 which had no caches
but did have two different types of
memory with different access speeds.
(I've always thought of the 1110
AKA 1100/40 as such an ugly machine
that I've always ignored it and
therefore remained ignorant of it
even when I worked for the Company.)
To the extent that the faster (but
smaller) memory could be viewed as
a "cache" with a 100% hit rate, I've
been wondering about how performance
differed based on memory placement
back then.
According to the 1110 system description on Bitsavers, the cycle time
for the primary memory (implemented as plated wire) was 325ns for a read
and 520ns for a write, whereas the extended memory (the same core
modules as used for the 1106 main memory) had 1,500 ns cycle time, so a substantial difference, especially for reads.
But it really wasn't a cache. While there was a way to use a channel in a back-to-back configuration to transfer memory blocks from one type of memory to the other (i.e. not using BT instructions), IIRC this was rarely used.
Was the Exec (whatever level it
might have been ... between 27
and 33?) mostly or wholly loaded
into the faster memory?
IIRC, 27 was the last 1108/1106 only level. 28 was an internal
Roseville level to start the integration of 1110 support. Level 29
(again IIRC) was the second internal version, perhaps also used for
early beta site 1110 customers; 30 was the first 1110 version, released
on a limited basis primarily to 1110 customers, while 31 was the general stable release.
Was there special code (I think
there was) that prioritized
placement of certain things in
memory and if so how?
There were options on the bank collector statements to specify prefer or require either primary or extended memory. If you didn't specify, the
default was I-banks in primary, D-banks in extended. That made sense,
as all instructions required an I-bank read, but many instructions don't require a D-bank reference (e.g. register to register, j=U or XU,
control transfer instructions), and the multiple D-bank instructions
(e.g. Search and BT) were rare. Also, since I-banks were almost
entirely reads, you took advantage of the faster read cycle time.
Also, I suspect most programs had a larger D-bank than I-bank, and since
you typically had more extended than primary memory, this allowed more optimal use of the expensive primary memory.
I don't remember what parts of the Exec were where, but I suspect it was
the same as for user programs. Of course, the interrupt vector
instructions had to be in primary due to their hardware fixed addresses.
What sort of performance gains
did use of the faster memory
produce or conversely what sort
of performance penalties occur
when it wasn't?
As you can see from the different cycle times, the differences were substantial.
IOW, anyone care to dig up some
old memories about the 1110 AKA
1100/40 you'd like to regale me
with? Enquiring minds want to know.
I hope this helps. :-)
--
- Stephen Fuld
(e-mail address disguised to prevent spam)
On Monday, August 14, 2023 at 9:14:50 AM UTC-7, Stephen Fuld wrote:
On 8/13/2023 10:18 PM, Lewis Cole wrote:
So just for giggles, I've been thinking
more about what (if anything) can be
done to improve system performance by
tweaks to a CPU's cache architecture.
I've always thought that having a
separate cache for supervisor mode
references and user mode references
_SHOULD_ make things faster, but when
I poked around old stuff on the
Internet about caches from the
beginning of time, I found that while
Once Upon A Time, separate supervisor
mode and user mode caches were
considered something to try, they
were apparently abandoned because a
unified cache seemed to work better
in simulations. Surprise, surprise.
Yeah. Only using half the cache at any one time would seem to decrease
performance. :-)
Of course, the smiley face indicates that you are being facetious.
But just on the off chance that someone wandering through the group might take you seriously, let me point out that re-purposing half of a cache DOES NOT necessarily reduce performance, and may in fact increase it if the way that the "missing" half is used somehow manages to increase the overall hit rate ... such as reducing a unified cache that's used to store both code and data with a separate i-cache for holding instructions and a separate d-cache for holding data which is _de rigueur_ on processor caches these days.
I think it should be clear from the multiple layers of cache these days, each layer being slower but larger than the one above it, that the further you go down (towards memory), the more a given cache is supposed to cache instructions/data that is "high use", but not so much as what's in the cache above it.
And even since the beginning of time (well ... since real live multi-tasking OS appeared), it has been obvious that processors tend to spend most of their time in supervisor mode (OS) code rather than in user (program) code.
From what I've read, the reason why separate supervisor and user mode caches performed worse than a unified cache was because of all the bouncing around throughout the OS that was done.
Back in The Good Old Days where caches were very small and essentially single layer, it is easy to imagine that a substantial fraction of any OS's code/data (including that of a toy) could not fit in the one and only small cache and would not stay there for very long if it somehow managed to get there.
But these days, caches are huge (especially the lower level ones) and it doesn't seem all that unimaginable to me that you could fit and keep a substantial portion of any OS laying around in one of the L3 caches of today ... or worse yet, in an L4 cache if a case for better performance can be made.
This seems just so odd to me and so
I've been wondering how much this
result is an artifact of the toy OS
that as used in the simulations
(Unix) or the (by today's standards)
small single layer caches used.
This got me to thinking about the
1110 AKA 1100/40 which had no caches
but did have two different types of
memory with different access speeds.
(I've always thought of the 1110
AKA 1100/40 as such an ugly machine
that I've always ignored it and
therefore remained ignorant of it
even when I worked for the Company.)
To the extent that the faster (but
smaller) memory could be viewed as
a "cache" with a 100% hit rate, I've
been wondering about how performance
differed based on memory placement
back then.
According to the 1110 system description on Bitsavers, the cycle time
for the primary memory (implemented as plated wire) was 325ns for a read
and 520ns for a write, whereas the extended memory (the same core
modules as used for the 1106 main memory) had 1,500 ns cycle time, so a
substantial difference, especially for reads.
Yes.
But it really wasn't a cache. While there was a way to use a channel in a back-to-back configuration to transfer memory blocks from one type of memory to the other (i.e. not using BT instructions), IIRC this was rarely used.
No, it wasn't a cache, which I thought I made clear in my OP.
Nevertheless, I think one can reasonably view/think of "primary" memory as if it were a slower memory that just happened to be cached where just by some accident, the cache would always return a hit.
Perhaps this seems weird to you, but it seems like a convenient tool to me to see if there might be any advantage to having separate supervisor mode and user mode caches.
Was the Exec (whatever level it
might have been ... between 27
and 33?) mostly or wholly loaded
into the faster memory?
IIRC, 27 was the last 1108/1106 only level. 28 was an internal
Roseville level to start the integration of 1110 support. Level 29
(again IIRC) was the second internal version, perhaps also used for
early beta site 1110 customers; 30 was the first 1110 version, released
on a limited basis primarily to 1110 customers, while 31 was the general
stable release.
Thanks for the history.
Was there special code (I think
there was) that prioritized
placement of certain things in
memory and if so how?
There were options on the bank collector statements to specify prefer or
require either primary or extended memory. If you didn't specify, the
default was I-banks in primary, D-banks in extended. That made sense,
as all instructions required an I-bank read, but many instructions don't
require a D-bank reference (e.g. register to register, j=U or XU,
control transfer instructions), and the multiple D-bank instructions
(e.g. Search and BT) were rare. Also, since I-banks were almost
entirely reads, you took advantage of the faster read cycle time.
Also, I suspect most programs had a larger D-bank than I-bank, and since
you typically had more extended than primary memory, this allowed more
optimal use of the expensive primary memory.
I don't remember what parts of the Exec were where, but I suspect it was
the same as for user programs. Of course, the interrupt vector
instructions had to be in primary due to their hardware fixed addresses.
For me, life started with 36 level, by which time *BOOT1 et al. had given way to *BTBLK et al.
Whatever the old bootstrap did, the new one tried to place the Exec I- and D-banks at opposite ends of memory, presumably so that concurrent accesses stood a better chance of not blocking one another due to being in a physically different memory that was often times interleaved.
IIRC, whether or not this was actually useful, it didn't change until M-Series hit the fan with paging.
What sort of performance gains
did use of the faster memory
produce or conversely what sort
of performance penalties occur
when it wasn't?
As you can see from the different cycle times, the differences were
substantial.
Yes, but do you know of anything that would suggest things were faster/slower because a lot of the OS was in primary storage most of the time ... IOW something that would support/refute the idea that separate supervisor and user mode caches might now be A Good Idea?
On 8/14/2023 8:38 AM, Scott Lurndal wrote:
Lewis Cole <l_cole@juno.com> writes:
So just for giggles, I've been thinking more about what (if anything) can be done to improve system performance by tweaks to a CPU's cache architecture.
I've always thought that having a separate cache for supervisor mode references and user mode references _SHOULD_ make things faster, but when I poked around old stuff on the Internet about caches from the beginning of time, I found that while Once Upon A Time, separate supervisor mode and user mode caches were considered something to try, they were apparently abandoned because a unified cache seemed to work better in simulations. Surprise, surprise.
This seems just so odd to me and so I've been wondering how much this result is an artifact of the toy OS that was used in the simulations (Unix) or the (by today's standards) small single layer caches used.
Toy OS?
Back in the time frame Lewis was talking about (1970s), many mainframe
people regarded Unix as a "toy OS". No one would think that now!
On 8/14/2023 1:47 PM, Lewis Cole wrote:
On Monday, August 14, 2023 at 9:14:50 AM UTC-7, Stephen Fuld wrote:
On 8/13/2023 10:18 PM, Lewis Cole wrote:
So just for giggles, I've been thinking
more about what (if anything) can be
done to improve system performance by
tweaks to a CPU's cache architecture.
I've always thought that having a
separate cache for supervisor mode
references and user mode references
_SHOULD_ make things faster, but when
I poked around old stuff on the
Internet about caches from the
beginning of time, I found that while
Once Upon A Time, separate supervisor
mode and user mode caches were
considered something to try, they
were apparently abandoned because a
unified cache seemed to work better
in simulations. Surprise, surprise.
Yeah. Only using half the cache at any one time would seem to decrease
performance. :-)
Of course, the smiley face indicates that you are being facetious.
But just on the off chance that someone wandering through the group might take you seriously, let me point out that re-purposing half of a cache DOES NOT necessarily reduce performance, and may in fact increase it if the way that the "missing" half is used somehow manages to increase the overall hit rate ... such as reducing a unified cache that's used to store both code and data with a separate i-cache for holding instructions and a separate d-cache for holding data which is _de rigueur_ on processor caches these days.
I think it should be clear from the multiple layers of cache these days, each layer being slower but larger than the one above it, that the further you go down (towards memory), the more a given cache is supposed to cache instructions/data that is "high use", but not so much as what's in the cache above it.
And even since the beginning of time (well ... since real live multi-tasking OS appeared), it has been obvious that processors tend to spend most of their time in supervisor mode (OS) code rather than in user (program) code.
I don't want to get into an argument about caching with you, but I am
sure that the percentage of time spent in supervisor mode is very
workload dependent.
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
On 8/14/2023 8:38 AM, Scott Lurndal wrote:
Lewis Cole <l_cole@juno.com> writes:
So just for giggles, I've been thinking more about what (if anything) can be done to improve system performance by tweaks to a CPU's cache architecture.
I've always thought that having a separate cache for supervisor mode references and user mode references _SHOULD_ make things faster, but when I poked around old stuff on the Internet about caches from the beginning of time, I found that while Once Upon A Time, separate supervisor mode and user mode caches were considered something to try, they were apparently abandoned because a unified cache seemed to work better in simulations. Surprise, surprise.
This seems just so odd to me and so I've been wondering how much this result is an artifact of the toy OS that was used in the simulations (Unix) or the (by today's standards) small single layer caches used.
Toy OS?
Back in the time frame Lewis was talking about (1970s), many mainframe
people regarded Unix as a "toy OS". No one would think that now!
Some people, perhaps.
Burroughs, on the other hand, had unix offerings via Convergent Technologies,
and as Unisys, developed the unix-based OPUS systems (distributed, massively parallel
intel-based systems running a custom microkernel-based distributed version of SVR4).
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
On 8/14/2023 1:47 PM, Lewis Cole wrote:
On Monday, August 14, 2023 at 9:14:50 AM UTC-7, Stephen Fuld wrote:
On 8/13/2023 10:18 PM, Lewis Cole wrote:
So just for giggles, I've been thinking
more about what (if anything) can be
done to improve system performance by
tweaks to a CPU's cache architecture.
I've always thought that having a
separate cache for supervisor mode
references and user mode references
_SHOULD_ make things faster, but when
I poked around old stuff on the
Internet about caches from the
beginning of time, I found that while
Once Upon A Time, separate supervisor
mode and user mode caches were
considered something to try, they
were apparently abandoned because a
unified cache seemed to work better
in simulations. Surprise, surprise.
Yeah. Only using half the cache at any one time would seem to decrease performance. :-)
Of course, the smiley face indicates that you are being facetious.
But just on the off chance that someone wandering through the group might take you seriously, let me point out that re-purposing half of a cache DOES NOT necessarily reduce performance, and may in fact increase it if the way that the "missing" half is used somehow manages to increase the overall hit rate ... such as reducing a unified cache that's used to store both code and data with a separate i-cache for holding instructions and a separate d-cache for holding data which is _de rigueur_ on processor caches these days.
I think it should be clear from the multiple layers of cache these days, each layer being slower but larger than the one above it, that the further you go down (towards memory), the more a given cache is supposed to cache instructions/data that is "high use", but not so much as what's in the cache above it.
And even since the beginning of time (well ... since real live multi-tasking OS appeared), it has been obvious that processors tend to spend most of their time in supervisor mode (OS) code rather than in user (program) code.
I don't want to get into an argument about caching with you, but I am
sure that the percentage of time spent in supervisor mode is very
workload dependent.
Indeed. On modern toy unix systems, the split is closer to 10% system, 90% user.
For example, a large parallel compilation job[*] (using half of the available 64 cores):
%Cpu(s): 28.6 us, 2.7 sy, 0.1 ni, 66.6 id, 1.8 wa, 0.0 hi, 0.3 si, 0.0 st
That's 28.6% in user mode, 2.7% in system (supervisor) mode.
Most modern server processors (intel, arm64) offer programmable cache partitioning
mechanisms that allow the OS to designate that a schedulable entity belongs to a
specific partition, and provides controls to designate portions of the cache are
reserved to those schedulable entities (threads, processes).
Note also that in modern server grade processors, there are extensions to the instruction set to allow the application to instruct the cache that data will be used in the future, in which case the cache controller _may_ pre-load the data in anticipation of future use.
Most cache subsystems also include logic to anticipate future accesses and preload the data into the cache before the processor requires it based on historic patterns of access.
[*] takes close to an hour on a single core system, over 9 million SLOC.
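For illustration, here is a minimal sketch of the kind of software prefetch hint described above, using the x86 _mm_prefetch() intrinsic (arm64 has a comparable PRFM instruction); the function and data below are invented purely for illustration, and the hint is advisory only:

/* Sketch only: x86-specific software prefetch.  The hint asks the cache
 * controller to pull a line in ahead of use; the hardware may ignore it. */
#include <stddef.h>
#include <xmmintrin.h>          /* _mm_prefetch, _MM_HINT_T0 */

double sum_with_prefetch(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* Ask for data ~16 elements ahead so it is (hopefully) already
         * resident by the time the loop reaches it. */
        if (i + 16 < n)
            _mm_prefetch((const char *)&a[i + 16], _MM_HINT_T0);
        sum += a[i];
    }
    return sum;
}

Whether this actually buys anything depends on the workload; for simple sequential patterns the hardware prefetchers mentioned above are often already doing the same job.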
On 8/15/2023 7:07 AM, Scott Lurndal wrote:
Most modern server processors (intel, arm64) offer programmable cache partitioning
mechanisms that allow the OS to designate that a schedulable entity belongs to a
specific partition, and provides controls to designate portions of the cache are
reserved to those schedulable entities (threads, processes).
I wasn't aware of this. I will have to do some research. :-)
Note also that in modern server grade processors, there are extensions to the instruction set to allow the application to instruct the cache that data will be used in the future, in which case the cache controller _may_ pre-load the data in anticipation of future use.
Something beyond prefetch instructions?
Most cache subsystems also include logic to anticipate future accesses and preload the data into the cache before the processor requires it based on historic patterns of access.
Sure - especially sequential access patterns are easy to detect.
Yeah. Only using half the cache at any one time would seem to decrease
performance. :-)
Of course, the smiley face indicates
that you are being facetious.
But just on the off chance that
someone wandering through the group
might take you seriously, let me
point out that re-purposing half of
a cache DOES NOT necessarily reduce
performance, and may in fact increase
it if the way that the "missing" half
is used somehow manages to increase
the overall hit rate ... such as
reducing a unified cache that's used
to store both code and data with a
separate i-cache for holding
instructions and a separate d-cache
for holding data which is _de rigueur_
on processor caches these days.
I think it should be clear from the
multiple layers of cache these days,
each layer being slower but larger
than the one above it, that the
further you go down (towards memory),
the more a given cache is supposed to
cache instructions/data that is "high
use", but not so much as what's in
the cache above it.
And even since the beginning of time
(well ... since real live multi-tasking
OS appeared), it has been obvious that
processors tend to spend most of their
time in supervisor mode (OS) code
rather than in user (program) code.
I don't want to get into an argument about caching with you, [...]
[...] but I am sure that the percentage of time spent in supervisor mode is very
workload dependent.
No, it wasn't a cache, which I thought
I made clear in my OP.
Nevertheless, I think one can reasonably
view/think of "primary" memory as if it
were a slower memory that just happened
to be cached where just by some accident,
the cache would always return a hit.
Perhaps this seems weird to you, but it
seems like a convenient tool to me to
see if there might be any advantage to
having separate supervisor mode and user
mode caches.
I agree that it sounds weird to me, but if it helps you, have at it.
I don't remember what parts of the Exec were where, but I suspect it was the same as for user programs. Of course, the interrupt vector instructions had to be in primary due to their hardware fixed addresses.
For me, life started with 36 level, by which time *BOOT1 et al. had given way to *BTBLK et al.
Whatever the old bootstrap did, the
new one tried to place the Exec I-
and D-banks at opposite ends of memory,
presumably so that concurrent accesses
stood a better chance of not blocking
one another due to being in a
physically different memory that was
often times interleaved.
IIRC, whether or not this was actually
useful, it didn't change until M-Series hit the fan with paging.
First of all, when I mentioned the interrupt vectors, I wasn't talking
about boot elements, but about the code starting at address 0200 (128 decimal) through 0377 on at least pre-1100/90 systems, which was a set of LMJ instructions, one per interrupt type, that were the first instructions executed after an interrupt. E.g. on an 1108, on an ISI External
Interrupt on CPU0 the hardware would transfer control to address 0200,
where the LMJ instruction would capture the address of the next
instruction to be executed in the interrupted program, then transfer
control to the ISI interrupt handler.
But you did jog my memory about Exec placement. On an 1108, the Exec
I-bank was loaded starting at address 0, and extended at far as needed.
The Exec D-bank was loaded at the end of memory i.e. ending at 262K for
a fully configured memory, extending "downward" as far as needed. This
left the largest contiguous space possible for user programs, as well as ensuring that the Exec I and D banks were in different memory banks, to guarantee overlapped timing for I fetch and data access. I guess that
the 1110 just did the same thing, as it didn't require changing another thing, and maximized the contiguous space available for user banks in
both primary and extended memory.
On Tuesday, August 15, 2023 at 7:07:23 AM UTC-7, Scott Lurndal wrote:
Stephen Fuld writes:
I don't want to get into an argument about caching with you, but I am
sure that the percentage of time spent in supervisor mode is very
workload dependent.
Indeed. On modern toy unix systems, the split is closer to 10% system, 90% user.
For example, a large parallel compilation job[*] (using half of the available 64 cores):
%Cpu(s): 28.6 us, 2.7 sy, 0.1 ni, 66.6 id, 1.8 wa, 0.0 hi, 0.3 si, 0.0 st
That's 28.6% in user mode, 2.7% in system (supervisor) mode.
That's nice.
But it doesn't speak to what would happen to system performance if the amount of time spent in the OS went down, does it?
Nor does it say anything about whether or not having a dedicated supervisor cache would help or hurt things.
I think that "programmable cache partitioning" is what the ARM folks call some of their processors' ability to partition one of their caches.
I think that the equivalent thing for x86-64 processors by Intel (which is the dominant x86-64 server processor maker) is called "Cache Allocation Technology" (CAT), and what it does is basically set limits on how much cache a thread/process/something-or-other can use so that said thread/process/something-or-other's cache usage doesn't impact other threads/processes/something-or-others.
It would be A Good Idea to dedicate one processor *CORE* to each simulated 2200 IP and basically ignore any "hyperthreaded" processors, as all they can do if they are allowed to execute is to disrupt the cache needed by the actual core to simulate the IP.
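For what it's worth, on Linux the usual way to drive CAT is through the resctrl pseudo-filesystem rather than raw MSRs. A rough sketch follows, assuming a resctrl-capable kernel with resctrl mounted at /sys/fs/resctrl; the group name, capacity bitmask, and PID are made up for illustration:

/* Sketch only: create a resctrl group, limit it to part of the L3, and
 * move a task into it.  Error handling is deliberately minimal. */
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

int main(void)
{
    FILE *f;

    /* Create a new resource control group. */
    if (mkdir("/sys/fs/resctrl/sim_ip0", 0755) != 0)
        perror("mkdir");

    /* Restrict the group to (say) the low 8 ways of L3 cache id 0.
     * The schemata line carries one capacity bitmask per L3 cache id;
     * only id 0 is shown here. */
    if ((f = fopen("/sys/fs/resctrl/sim_ip0/schemata", "w")) != NULL) {
        fprintf(f, "L3:0=ff\n");
        fclose(f);
    }

    /* Put a task (hypothetical PID 1234) into the group; its future cache
     * line allocations are then limited to the ways granted above. */
    if ((f = fopen("/sys/fs/resctrl/sim_ip0/tasks", "w")) != NULL) {
        fprintf(f, "1234\n");
        fclose(f);
    }
    return 0;
}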
If you are referring to the various PREFETCHxxx instructions, yes, they exist, but they are usually "hints" the last time I looked, and only load *DATA* into the L3 DATA cache in advance of its possible use.
So unless something's changed (and feel free to let me know if it has), you can't pre-fetch supervisor mode code for some service that user mode code might need Real Soon Now.
Most cache subsystems also include logic to anticipate future accesses and preload the data into the cache before the processor requires it based on historic patterns of access.
Again, data, not code.
So if you have a request for some OS-like service that would be useful to have done as quickly as possible, the actual code might not still be in an instruction cache if a user program has been running for a long time and so has filled the cache with its working set.
On Tuesday, August 15, 2023 at 12:06:53 AM UTC-7, Stephen Fuld wrote: <snip>
Yeah. Only using half the cache at any one time would seem to decrease performance. :-)
Of course, the smiley face indicates
that you are being facetious.
But just on the off chance that
someone wandering through the group
might take you seriously, let me
point out that re-purposing half of
a cache DOES NOT necessarily reduce
performance, and may in fact increase
it if the way that the "missing" half
is used somehow manages to increase
the overall hit rate ...
such as
reducing a unified cache that's used
to store both code and data with a
separate i-cache for holding
instructions and a separate d-cache
for holding data which is _de rigueur_
on processor caches these days.
I think it should be clear from the
multiple layers of cache these days,
each layer being slower but larger
than the one above it, that the
further you go down (towards memory),
the more a given cache is supposed to
cache instructions/data that is "high
use", but not so much as what's in
the cache above it.
And even since the beginning of time
(well ... since real live multi-tasking
OS appeared), it has been obvious that
processors tend to spend most of their
time in supervisor mode (OS) code
rather than in user (program) code.
I don't want to get into an argument about caching with you, [...]
I'm not sure what sort of argument you think I'm trying to get into WRT caching, but I assume that we both are familiar enough with it that there's really no argument to be had, so your comment makes no sense to me.
[...] but I am sure that the percentage of time spent in supervisor mode is very
workload dependent.
Agreed.
But to the extent that the results of the SS keyin were ... useful ... in The Good Old Days at Roseville, I recall seeing something in excess of 80+ percent of the time was spent in the Exec on a regular basis.
On 8/15/2023 11:48 AM, Lewis Cole wrote:
On Tuesday, August 15, 2023 at 12:06:53 AM UTC-7, Stephen Fuld wrote:
Agreed.
But to the extent that the results of the SS keyin were ... useful .. in The Good Old Days at Roseville, I recall seeing something in excess of 80+ percent of the time was spent in the Exec on regular basis.
It took me a while to respond to this, as I had a memory, but had to
find the manual to check. You might have had some non-released code
running in Roseville, but the standard SS keyin doesn't show what
percentage of time is spent in Exec. To me, and as supported by the
evidence Scott gave, 80% seems way too high.
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
On 8/15/2023 11:48 AM, Lewis Cole wrote:
On Tuesday, August 15, 2023 at 12:06:53 AM UTC-7, Stephen Fuld wrote:
Agreed.
But to the extent that the results of the SS keyin were ... useful .. in The Good Old Days at Roseville, I recall seeing something in excess of 80+ percent of the time was spent in the Exec on regular basis.
It took me a while to respond to this, as I had a memory, but had to
find the manual to check. You might have had some non-released code
running in Roseville, but the standard SS keyin doesn't show what
percentage of time is spent in Exec. To me, and as supported by the
evidence Scott gave, 80% seems way too high.
To be fair, one must consider functional differences in operating systems.
Back in the day, for example, the operating system was responsible for
record management, while in *nix that is all delegated to user mode code.
So for applications that heavily used files (and in the olden days
the lack of memory was compensated for by using temporary files on mass storage devices) there would likely be far more time spent in supervisor
code than in *nix/windows today.
In the Burroughs MCP, for example, a substantial portion of DMSII is part of the OS rather than purely usermode code as it would be with e.g. Oracle on *nix.
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
On 8/15/2023 7:07 AM, Scott Lurndal wrote:
Most modern server processors (intel, arm64) offer programmable cache partitioning
mechanisms that allow the OS to designate that a schedulable entity belongs to a
specific partition, and provides controls to designate portions of the cache are
reserved to those schedulable entities (threads, processes).
I wasn't aware of this. I will have to do some research. :-)
For the ARM64 version, look for Memory System Resource Partitioning and Monitoring
(MPAM).
https://developer.arm.com/documentation/ddi0598/latest/
Note that this only controls "allocation", not "access" - i.e. any application
can hit a line in any partition, but new lines are only allocated in partitions
associated with the entity that caused the fill to occur.
Resources include cache allocation and memory bandwidth.
On 8/15/2023 9:58 AM, Scott Lurndal wrote:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
On 8/15/2023 7:07 AM, Scott Lurndal wrote:
Most modern server processors (intel, arm64) offer programmable cache partitioning
mechanisms that allow the OS to designate that a schedulable entity belongs to a
specific partition, and provides controls to designate portions of the cache are
reserved to those schedulable entities (threads, processes).
I wasn't aware of this. I will have to do some research. :-)
For the ARM64 version, look for Memory System Resource Partitioning and Monitoring
(MPAM).
https://developer.arm.com/documentation/ddi0598/latest/
Note that this only controls "allocation", not "access" - i.e. any application
can hit a line in any partition, but new lines are only allocated in partitions
associated with the entity that caused the fill to occur.
Resources include cache allocation and memory bandwidth.
First, thanks for the link. I looked at it a little (I am not an ARM programmer). I can appreciate its utility, particularly in something
like a cloud server environment where it is useful to prevent one
application either inadvertently or on purpose, from "overpowering"
(i.e. monopolizing resources) another, so you can meet SLAs etc. This
is well explained in the "Overview" section of the manual.
However, I don't see its value in the situation that Lewis is talking
about, supervisor vs users. A user can't "overpower" the OS, as the OS
could simply not give it much CPU time. And if you can't rely on the OS not
to overpower or steal otherwise needed resources from the user programs,
then I think you have worse problems. :-(
On Tuesday, August 15, 2023 at 12:06:53AM UTC-7, Stephen Fuld wrote:
<snip>
Yeah. Only using half the cache at any one time would seem to decrease performance. :-)
Of course, the smiley face indicates
that you are being facetious.
No. I wasn't. See below.
But just on the off chance that
someone wandering through the group
might take you seriously, let me
point out that re-purposing half of
a cache DOES NOT necessarily reduce
performance, and may in fact increase
it if the way that the "missing" half
is used somehow manages to increase
the overall hit rate ...
Splitting a size X cache into two sized x/2 caches will almost certainly *reduce* hit rate.
Think of it this way. The highest hit rate is
obtained when the number of most likely to be used blocks are exactly
evenly split between the two caches.
That would make the contents of
the two half sized caches exactly the same as those of the full sized
cache.
Conversely, if one of the caches has a different, (which means
lesser used) block, then its hit rate would be lower.
There is no way
that splitting the caches would lead to a higher hit rate.
But hit rate
isn't the only thing that determines cache/system performance.
such as
reducing a unified cache that's used
to store both code and data with a
separate i-cache for holding
instructions and a separate d-cache
for holding data which is _de rigueur_
on processor caches these days.
Separating I and D caches has other advantages. Specifically, since
they have separate (duplicated) hardware logic both for addressing and
the actual data storage, the two caches can be accessed simultaneously,
which improves performance, as the instruction fetch part of a modern
CPU is totally asynchronous with the operand fetch/store part, and they
can be overlapped. This ability, to do an instruction fetch from cache simultaneously with handling a load/store, is enough to overcome the lower hit rate.
Note that this advantage doesn't apply to a
user/supervisor separation, as the CPU is in one mode or the other, not
both simultaneously.
I think it should be clear from the
multiple layers of cache these days,
each layer being slower but larger
than the one above it, that the
further you go down (towards memory),
the more a given cache is supposed to
cache instructions/data that is "high
use", but not so much as what's in
the cache above it.
True for an exclusive cache, but not for an inclusive one.
I'm up to my ass in alligators IRL and so I haven't (and likely won't for some time) have a lot of time to read and respond to posts here.
So I'm going to respond to only Mr. Fuld's posts rather than any others.
And since I am up to my ass in alligators, I'm going to break up my response to Mr. Fuld's post into two parts so that I can get SOMETHING out Real Soon Now.
So here is the first part:
On 8/15/2023 11:48 AM, Lewis Cole wrote:
On Tuesday, August 15, 2023 at 12:06:53AM UTC-7, Stephen Fuld wrote:
<snip>
Yeah. Only using half the cache at any one time would seem to decrease performance. :-)
Of course, the smiley face indicates
that you are being facetious.
No. I wasn't. See below.
I thought you didn't want to argue about caching? ;-)
Well, hopefully we both agree on pretty much everything and any disagreement is likely due to us not being on the same page WRT our working assumptions, so perhaps this is A Good Thing.
However, I apologize to you for any annoyance I caused you due to my assumptions about your reply.
But just on the off chance that
someone wandering through the group
might take you seriously, let me
point out that re-purposing half of
a cache DOES NOT necessarily reduce
performance, and may in fact increase
it if the way that the "missing" half
is used somehow manages to increase
the overall hit rate ...
Splitting a size X cache into two sized x/2 caches will almost certainly
*reduce* hit rate.
There are several factors that influence hit rate, one of them being cache size.
Others, such as the number of associativity ways, replacement policy, and line size, are also obvious influences.
In addition, there are other factors that can make up for a slightly reduced hit rate so that such a cache can still be competitive with a cache with a slightly higher hit rate, such as cycle time.
If two caches are literally identical in every way except for size, then you are *CORRECT* that the hit rate will almost certainly be lower for the smaller cache.
However, given two caches, one of which just happens to be half the size of the other, it does *NOT* follow that the smaller cache must necessarily have a lower hit rate than the other, as changes to some of the other factors that affect hit rate might just make up for what was lost due to the smaller size.
Think of it this way. The highest hit rate is
obtained when the number of most likely to be used blocks are exactly
evenly split between the two caches.
Ummm, no. I guess we are going to have an argument over caching after all ...
The highest hit rate is obtained when a cache manages to successfully anticipate, load up into its local storage, and then provide that which the processor needs *BEFORE* the processor actually makes a request to get it from memory. Period.
This is true regardless of whether or not we're talking about one cache or multiple caches.
From the point of view of performance, it's what makes caching work.
(Note that there may be other reasons for a cache, such as to reduce bus traffic, but let's ignore these reasons for the moment.)
Whether or not, say, an I-cache happens to have the same number of likely-to-be-used blocks as the D-cache is irrelevant.
They may have the same number. They may not have the same number. I suspect, for reasons that I'll wave my arms at shortly, that they usually aren't.
What matters is whether they have what's needed and can deliver it before the processor actually requests it.
Now if an I-cache is getting lots and lots of hits, then presumably it is likely filled with code loops that are being executed frequently.
The longer that the processor can continue to execute these loops, the more it will execute them at speeds that approach that which it would have if main memory were as fast as the cache memory.
And the more that this happens, the more this speed offsets the much slower speed of the processor when it isn't getting constant hits.
However, running in cached loops doesn't imply much about the data that these loops are accessing.
They may be marching through long arrays of data or they may be pounding away at a small portion of a data structure, such as at the front of a ring buffer. It's all very much application dependent.
About the only thing that we can infer is that because the code is executing loops, there is at least one instruction (the jump to the top of a loop) which doesn't have/need a corresponding piece of data in the D-cache.
IOW, there will tend to be one fewer piece of data in the D-cache than in the I-cache.
Whether or not this translates into equal numbers of cache lines rather than data words just depends.
(And note I haven't even touched on the effect of writes to the contents of the D-cache.)
So if you think that caches will have their highest hit rate when the "number of most likely to be used blocks are exactly evenly split between the two caches", you're going to have to provide a better argument/evidence to support your reasoning before I will accept this premise, either in the form of a citation or, better yet, a "typical" example.
That would make the contents of
the two half sized caches exactly the same as those of the full sized
cache.
No, it wouldn't. See above.
Conversely, if one of the caches has a different, (which means
lesser used) block, then its hit rate would be lower.
No, it wouldn't. See above.
There is no way
that splitting the caches would lead to a higher hit rate.
As I waved my arms at before, it is possible if more changes are made than just to its size.
For example, if a cache happens to be a direct mapped cache, then there's only one spot in the cache for a piece of data with a particular index.
If another piece of data with the same index is requested, then the old piece of data is lost/replaced with the new one.
This is basic direct mapped cache behavior 101.
OTOH, if a cache happens to be a set associative cache of any way greater than one (i.e. not direct mapped), then the new piece of data can end up in a different spot within the same set for the given index, from which it can be returned if it is not lost/replaced for some other reason.
This is basic set associativity cache behavior 101.
The result is that if the processor has a direct mapped cache and just happens to make alternating accesses to two pieces of data that have the same index, the direct mapped cache will *ALWAYS* take a miss on every access (i.e. have a hit rate of 0%), while the same processor with a set associative cache of any way greater than one will *ALWAYS* take a hit (i.e. have a hit rate of 100%).
And note that nowhere in the above description is there any mention of cache size.
Cache size DOES implicitly affect the likelihood of a collision, and so "typically" you will get more collisions, which will cause a direct mapped cache to perform worse than a set associative cache.
And you can theoretically (although not practically) go one step further by making a cache fully associative, which will eliminate conflict misses entirely.
In short, there most certainly is "a way" that the hit rate can be higher on a smaller size cache than a larger one, contrary to your claim.
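To make the alternating-access example concrete, here is a small self-contained sketch (cache geometry and addresses are made up for illustration) that replays the pattern against a direct mapped cache and a 2-way set associative cache of the same total size and counts hits:

/* Toy model of the conflict-miss example above: two addresses that map to
 * the same index, accessed alternately.  The direct mapped cache thrashes;
 * the 2-way cache of the same total size holds both lines. */
#include <stdio.h>

#define NSETS 16                 /* 16 lines in the direct mapped cache  */
#define LINE  64                 /* 64-byte lines                        */

static unsigned long dm_tag[NSETS];            /* direct mapped          */
static unsigned long sa_tag[NSETS / 2][2];     /* same size, 2-way       */
static int sa_lru[NSETS / 2];                  /* which way to evict     */

static int dm_access(unsigned long addr)
{
    unsigned long idx = (addr / LINE) % NSETS, tag = addr / (LINE * NSETS);
    if (dm_tag[idx] == tag) return 1;          /* hit                    */
    dm_tag[idx] = tag;                         /* miss: refill           */
    return 0;
}

static int sa_access(unsigned long addr)
{
    unsigned long idx = (addr / LINE) % (NSETS / 2);
    unsigned long tag = addr / (LINE * NSETS / 2);
    for (int w = 0; w < 2; w++)
        if (sa_tag[idx][w] == tag) { sa_lru[idx] = w ^ 1; return 1; }
    sa_tag[idx][sa_lru[idx]] = tag;            /* miss: evict LRU way    */
    sa_lru[idx] ^= 1;
    return 0;
}

int main(void)
{
    /* Two addresses exactly NSETS*LINE apart: same index, different tag. */
    unsigned long a = 0x10000, b = a + NSETS * LINE;
    int dm_hits = 0, sa_hits = 0, n = 1000;

    for (int i = 0; i < n; i++) {
        unsigned long addr = (i & 1) ? b : a;
        dm_hits += dm_access(addr);
        sa_hits += sa_access(addr);
    }
    printf("direct mapped: %d/%d hits, 2-way: %d/%d hits\n",
           dm_hits, n, sa_hits, n);
    return 0;
}

Compiled and run, the direct mapped count stays at zero while the 2-way count is essentially 100%, which is exactly the conflict-miss effect being described.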
But hit rate
isn't the only thing that determines cache/system performance.
Yes. Finally, we agree on something.
such as
reducing a unified cache that's used
to store both code and data with a
separate i-cache for holding
instructions and a separate d-cache
for holding data which is _de rigueur_
on processor caches these days.
Separating I and D caches has other advantages. Specifically, since
they have separate (duplicated) hardware logic both for addressing and
the actual data storage, the two caches can be accessed simultaneously,
which improves performance, as the instruction fetch part of a modern
CPU is totally asynchronous with the operand fetch/store part, and they
can be overlapped. This ability, to do an instruction fetch from cache
simultaneously with handling a load/store is enough to overcome the
lower hit rate.
Having a separate I-cache and D-cache may well have other advantages besides increased hit rate.
And increased concurrency may well be one of them.
However, my point in mentioning the existence of separate I-caches and D-caches was to point out that given a sufficiently Good Reason, splitting/replacing a single cache with smaller caches may be A Good Idea.
Increased concurrency doesn't change that argument in the slightest.
Simply replace any mention of "increased hit rate" with "increased concurrency" and the result is the same.
If you want to claim that increased concurrency was the *MAIN* reason for the existence of separate I-caches and D-caches, then I await with bated breath for you to present evidence and/or a better argument to show this was the case.
And if you're wondering why I'm not presenting -- and am not going to present -- any evidence or argument to support my claim that it was due to increased hit rate, that's because we both seem to agree on the basic premise I mentioned before, namely, given a sufficiently Good Reason, splitting/replacing a single cache with smaller caches may be A Good Idea.
Any argument that you present strengthens that premise without the need for me to do anything.
I will point out, however, that I think that increased concurrency seems like a pretty weak justification.
Yes, separate caches might well allow for increased concurrency, but you have to come up with finding those things that can be done during instruction execution that can be done in parallel.
And if you manage to find that parallelism, then you need to not only be able to issue separate operations in parallel, you have to make sure that these parallel operations don't interfere with one another, which is to say that your caches remain "coherent" despite doing things like modifying the code stream currently being executed (i.e. self modifying code).
Given the limited transistor budget In The Early Days, I doubt that dealing with these issues was something that designers were willing to mess with if they didn't have to.
(The first caches tended to be direct mapped because they were the simplest and therefore the cheapest to implement while also having the fastest access times.
Set associative caches performed better, but were more complicated and therefore more expensive, as well as having slower access times, and so came later.)
ISTM that a more plausible reason other than hit rate would be to further reduce bus traffic, which was one of the other big reasons that DEC (IIRC) got into using them In The Beginning.
Note that this advantage doesn't apply to a
user/supervisor separation, as the CPU is in one mode or the other, not
both simultaneously.
Bullshit.
Assuming that you have two separate caches that can be kept fed and otherwise operate concurrently, then All You Have To Do to make them both do something at the same time is to generate "select" signals for each so that they know that they should operate at the same time.
Obviously, a processor knows whether it is in user mode or supervisor mode when it performs an instruction fetch, and so it is trivially easy for a processor in either mode to generate the correct "select" signal for an instruction fetch from the correct instruction cache.
It should be equally obvious that a processor in user mode or supervisor mode knows (or can know) when it is executing an instruction that should operate on data that is in the same mode as the instruction it's executing.
And it should be obvious that you don't want a user mode instruction to ever be able to access supervisor mode data.
The only case this leaves to address when it comes to the generation of a "select" signal is when a processor running in supervisor mode wants to do something with user mode code or data.
But generating a "select" signal that will access instructions in a user mode instruction cache or data in a user mode data cache is trivially easy as well, at least conceptually, especially if one is willing to make use of/exploit that which is common practice in OSs these days.
In particular, since even before the Toy OSs grew up, there has been a fixation with dividing the logical address space into two parts, one part for user code and data and the other part for supervisor code and data.
When the logical space is divided exactly in half (as was the case for much of the time for 32-bit machines), the result was that the high order bit of the address indicates (and therefore could be used as a select line for) user space versus supervisor space cache access.
While things have changed a bit since 64-bit machines have become dominant, it is still at least conceptually possible to treat some part of the high order part of a logical address as such an indicator.
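As a purely illustrative sketch of the high order bit idea, assuming the classic 32-bit split at 0x80000000 (in real hardware this would be a wire off the address rather than code):

/* Sketch only: pick a cache based on the high order address bit. */
#include <stdint.h>

enum which_cache { USER_CACHE = 0, SUPERVISOR_CACHE = 1 };

static inline enum which_cache cache_select(uint32_t vaddr)
{
    /* Bit 31 of the address acts as the "select" line; on a 2200 the
     * equivalent indicator would presumably be derived from the ring/domain
     * field of the base (B) register instead, as described below. */
    return (vaddr >> 31) ? SUPERVISOR_CACHE : USER_CACHE;
}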
"But wait ... ," you might be tempted to say, "... something like that does= >n't work at all on a system like a 2200 ... the Exec has never had the same=
sort of placement fixation in either absolute or real space that the forme=
r Toy OSs had/have", which is true.
But the thing is that the logical address of any accessible word in memory is NOT "U", but rather "(B,U)" (both explicitly in Extended Mode and implicitly in Basic Mode) where B is the number of a base register, and each B-register contains an access lock field which in turn is made up of a "ring" and a "domain".
Supervisor mode and user mode are all about degrees of trust, which is a simplification of the more general "ring" and "domain" scheme where some collection of rings are supposedly for "supervisor" mode and the remaining collection are supposedly for "user" mode.
Whether or not this is actually the way things are used, it is at least conceptually possible that an address (B,U) can be turned into a supervisor or user mode indicator that can be concatenated with U, which can then be sent to the hardware to select a cache and then a particular word within that cache.
So once again, we're back to being able to identify supervisor mode code/data versus user mode code/data by its address.
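As a sketch only (the ring width, the supervisor/user split point, and the table layout below are invented for illustration, not the actual 2200 hardware encoding), the select could come straight out of the ring field in the named B-register's access lock:

    # Hypothetical sketch: turn a (B,U) reference into a cache select using
    # the ring field from the B-register's access lock.  Field widths and
    # the ring threshold are made up; this is not real 2200 hardware.
    RING_SUPERVISOR_MAX = 1  # assumed split: rings 0-1 "supervisor", 2-3 "user"

    def select_cache(base_registers, b, u):
        ring = base_registers[b]["ring"]          # ring from the B-register's access lock
        which = "supervisor" if ring <= RING_SUPERVISOR_MAX else "user"
        return which, u                           # (which cache to select, offset within it)

    bregs = {2: {"ring": 0}, 5: {"ring": 3}}
    print(select_cache(bregs, 2, 0o1234))   # ('supervisor', 668)
    print(select_cache(bregs, 5, 0o1234))   # ('user', 668)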
(And yes, I know about the Processor Privilege [PP] flags in the designator register, and reconciling their use with the ring bits might be a problem, but at least conceptually, PP does not -- or at least need not -- matter when it comes to selecting a particular cache.)
If you want to say no one in their right mind -- certainly no real live CPU designer -- would think in terms of using some part of an address as a "ring" together with an offset, I would point out to you that this is not the case: a real, live CPU designer *DID* choose to merge security modes with addressing and the result was a relatively successful computer.
It was called the Data General Eclipse and Kidder's book, "The Soul of a New Machine", mentions this being done.
What I find ... "interesting" ... here, however, is that you would try to make an argument at all about the possible lack of concurrency WRT a possible supervisor cache.
As I have indicated before, I assume that any such cache would be basically at the same level as current L3 caches, and it is my understanding that for the most part, they're not doing any sort of concurrent operations today.
It seems, therefore, that you're trying to present a strawman by suggesting a disadvantage that doesn't exist at all when compared to existing L3 caches.
I think it should be clear from the
multiple layers of cache these days,
each layer being slower but larger
than the one above it, that the
further you go down (towards memory),
the more a given cache is supposed to
cache instructions/data that is "high
use", but not so much as what's in
the cache above it.
True for an exclusive cache, but not for an inclusive one.
I don't know what you mean by an "exclusive" cache versus an "inclusive" one.
Please feel free to elaborate on what you mean.
In every multi-layered cache in a real live processor chip that I'm aware of, each line in the L1 cache is also represented by a larger line in the L2 cache that contains the L1 line as a subset, and each line in the L2 cache is also represented by a larger line in the L3 cache that contains the L2 line as a subset.
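If the distinction is the one I think it is -- every L1 line also present in L2 ("inclusive", which is what I just described) versus L1 and L2 holding disjoint lines ("exclusive") -- then a toy sketch with made-up line addresses would be:

    # Toy sketch of inclusive vs. exclusive hierarchies, using sets of
    # line addresses instead of real hardware.  Addresses are arbitrary.
    l1 = {0x100, 0x140}
    l2_inclusive = {0x100, 0x140, 0x200, 0x240}  # inclusive: L2 also holds everything in L1
    l2_exclusive = {0x200, 0x240}                # exclusive: L2 holds only lines L1 does NOT hold

    assert l1 <= l2_inclusive           # inclusion property: L1 is a subset of L2
    assert l1.isdisjoint(l2_exclusive)  # exclusion property: no line lives in both
    # Consequence: exclusion gives a total capacity of |L1| + |L2|, while
    # inclusion duplicates lines, so the useful capacity is closer to |L2| alone.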
At this point, I'm going to end my response to Mr. Fuld's post here and go off and do other things before I get back to making a final reply to the remaining part of his post.
So here's the second part of my reply to Mr. Fuld's last response to me.
Considering how quickly this reply has grown, I may end up breaking it up into a third part as well.
Just for giggles, though, let's say that that was then and this is now, meaning that the amount of time spent in the Exec is (and has been for some time) roughly the same as the figure that Mr. Lurndal cited ... so what?
Mr. Lurndal apparently wants to argue that since the *AVERAGE* amount of time that some systems (presumably those whose OSs' names end in the letters "ix") spend in the supervisor is "only" 20%, that means that it isn't worth having a separate supervisor cache.
After all, his reasoning goes, if the entire time spent in the supervisor were eliminated, that would mean an increase of only 20% more time to user programs.
Just on principle, this is a silly thing to say.
It obviously incorrectly equates time with useful work and then compounds that by treating time spent in user code as important while treating time spent in supervisor mode as waste.
It shouldn't take much to realize that this is nonsense.
Imagine a user program making a request to the OS to send a message somewhere that can't be delivered for some reason (e.g. an error or some programmatic limits being exceeded); the OS should return a bit more quickly than if it could send the message.
So the user program should get a bit more time and the OS should get a bit = >less time.
But no one in their right mind should automatically presume the user progra= >m should be able to do something more "useful" with the extra time it has.
So I'm going to end this part and go on to a new Part 3.
So here's the third part of my reply to Mr. Fuld's last response to me.
So about 10 years ago, the boys and girls at ETH Zurich along with the boys and girls at Microsoft decided to try to come up with an OS based on a new model which became known as a "multi-kernel".
The new OS they created, called Barrelfish, treated all CPUs as if they were networked even if they were on the same chip, sharing the same common memory.
Think of it this way. The highest hit rate is
obtained when the number of most likely to be used blocks are exactly
evenly split between the two caches.
Ummm, no. I guess we are going to have an argument over caching after all ...
The highest hit rate is obtained when a cache manages to successfully anticipate, load up into its local storage, and then provide that which the processor needs *BEFORE* the processor actually makes a request to get it from memory. Period.
Whether or not, say, an I-cache happens to have the same number of likely-to-be-used blocks as the D-cache is irrelevant.
They may have the same number. They may not have the same number. I suspect for reasons that I'll wave my arms at shortly that they usually aren't.
What matters is whether they have what's needed and can deliver it before the processor actually requests it.
Now if an I-cache is getting lots and lots of hits, then presumably it is likely filled with code loops that are being executed frequently.
The longer that the processor can continue to execute these loops, the more it will execute them at speeds that approach that which it would if main memory were as fast as the cache memory.
That would make the contents of
the two half sized caches exactly the same as those of the full sized
cache.
No, it wouldn't. See above.
Conversely, if one of the caches has a different, (which means
lesser used) block, then its hit rate would be lower.
No, it wouldn't. See above.
There is no way
that splitting the caches would lead to a higher hit rate.
As I waved my arms at before, it is possible if more changes are made than just to its size.
For example, if a cache happens to be a direct mapped cache, then there's only one spot in the cache for a piece of data with a particular index.
If another piece of data with the same index is requested, then the old piece of data is lost/replaced with the new one.
This is basic direct mapped cache behavior 101.
OTOH, if a cache happens to be a set associative cache of any way greater than one (i.e. not direct mapped), then the new piece of data can end up in a different spot within the same set for the given index, from which it can be returned if it is not lost/replaced for some other reason.
This is basic set associative cache behavior 101.
The result is that if the processor has a direct mapped cache and just happens to make alternating accesses to two pieces of data that have the same index, the direct mapped cache will *ALWAYS* take a miss on every access (i.e. have a hit rate of 0%), while the same processor with a set associative cache of any way greater than one will *ALWAYS* take a hit (i.e. have a hit rate of 100%).
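A toy simulation of exactly that alternating-access case (the addresses and the 64-line capacity are arbitrary; the point is the mechanism, not the numbers):

    def run(trace, num_sets, ways):
        # Each set is a small list of tags kept in LRU order (oldest first).
        sets = [[] for _ in range(num_sets)]
        hits = 0
        for addr in trace:
            index, tag = addr % num_sets, addr // num_sets
            if tag in sets[index]:
                hits += 1
                sets[index].remove(tag)        # refresh LRU position
            elif len(sets[index]) == ways:
                sets[index].pop(0)             # full set: evict least recently used
            sets[index].append(tag)
        return hits

    trace = [0x00, 0x40] * 50                  # two line addresses that land in the same set
    print(run(trace, num_sets=64, ways=1))     # direct mapped, 64 lines total: 0 hits
    print(run(trace, num_sets=32, ways=2))     # 2-way, same 64-line capacity: 98 hits (only the 2 cold misses)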
Having a separate I-cache and D-cache may well have other advantages besides increased hit rate.
And increased concurrency may well be one of them.
However, my point in mentioning the existence of separate I-caches and D-caches was to point out that given a sufficiently Good Reason, splitting/replacing a single cache with smaller caches may be A Good Idea.
I will point out, however, that I think that increased concurrency seems like a pretty weak justification.
Yes, separate caches might well allow for increased concurrency, but you have to find those things done during instruction execution that can be done in parallel.
So here's the second part of my reply to Mr. Fuld's last response to me. Considering how quickly this reply has grown, I may end up breaking it up into a third part as well.
On 8/15/2023 11:48 AM, Lewis Cole wrote:
On Tuesday, August 15, 2023 at 12:06:53AM UTC-7, Stephen Fuld wrote:
<snip>
And even since the beginning of time
(well ... since real live multi-tasking
OS appeared), it has been obvious that
processors tend to spend most of their
time in supervisor mode (OS) code
rather than in user (program) code.
I don't want to get into an argument about caching with you, [...]
I'm not sure what sort of argument you
think I'm trying to get into WRT caching,
but I assume that we both are familiar
enough with it so that there's really
no argument to be had so your
comment makes no sense to me.
[...] but I am sure that the percentage of time spent in supervisor mode is very
workload dependent.
Agreed.
But to the extent that the results of
the SS keyin were ... useful .. in The
Good Old Days at Roseville, I recall
seeing something in excess of 80+
percent of the time was spent in the
Exec on regular basis.
It took me a while to respond to this, as I had a memory, but had to
find the manual to check. You might have had some non-released code
running in Roseville, but the standard SS keyin doesn't show what
percentage of time is spent in Exec. To me, and as supported by the
evidence Scott gave, 80% seems way too high.
So let me get this straight: You don't believe the 80% figure I cite because it seems too high to you and it didn't come from a "standard" SS keyin of the time.
Meanwhile, you believe the figure cited by Mr. Lurndal because it seems more believable even though it comes from a system that's almost certainly running a different workload than the one I'm referring to which was from decades ago.
Did I get this right?
Seriously?
What happened to the bit where *YOU* were saying that the amount of time spent in an OS was probably workload dependent?
And since when does the credibility of local code written in Roseville (by people who are likely responsible for the care and feeding of the Exec that the local code is being written for) somehow become suspect just because said code didn't make it into a release ... whether its output is consistent with what you believe or not?
FWIW, I stand by the statement about seeing CPU utilization in excess of 80+% on a regular basis because that is what I recall seeing.
You can choose to believe me or not.
(And I would like to point out that I don't appreciate being called a liar no matter how politely it is done.)
I cannot provide direct evidence to support my statement.
I don't have any console listings or demand terminal session listings where I entered an "@@cons ss", for example.
However, I can point to an old (~1981) video that clearly suggests that the 20% figure cited by Mr. Lurndal almost certainly doesn't apply to the Exec at least in some environments from way back when.
And I can wave my arms at why it is most certainly possible for a much higher figure to show up, at least theoretically, even today.
So the video I want to draw your attention to is entitled, "19th Annual Sperry Univac Spring Technical Symposium - 'Proposed Memory Management Techniques for Sperry Univac 1100 Series Systems'", and can be found here:
< https://digital.hagley.org/VID_1985261_B110_ID05?solr_nav%5Bid%5D=88d187d912cfce1a5ad1&solr_nav%5Bpage%5D=0&solr_nav%5Boffset%5D=2 >
On 10/8/2023 7:30 PM, Lewis Cole wrote:
So here's the second part of my reply to Mr. Fuld's last response to me.
Considering how quickly this reply has grown, I may end up breaking it up into a third part as well.
On 8/15/2023 11:48 AM, Lewis Cole wrote:
On Tuesday, August 15, 2023 at 12:06:53AM UTC-7, Stephen Fuld wrote:
<snip>
And even since the beginning of time
(well ... since real live multi-tasking
OS appeared), it has been obvious that
processors tend to spend most of their
time in supervisor mode (OS) code
rather than in user (program) code.
I don't want to get into an argument about caching with you, [...]
I'm not sure what sort of argument you
think I'm trying to get into WRT caching,
but I assume that we both are familiar
enough with it so that there's really
no argument to be had so your
comment makes no sense to me.
[...] but I am sure that the percentage of time spent in supervisor mode is very
workload dependent.
Agreed.
But to the extent that the results of
the SS keyin were ... useful .. in The
Good Old Days at Roseville, I recall
seeing something in excess of 80+
percent of the time was spent in the
Exec on regular basis.
It took me a while to respond to this, as I had a memory, but had to
find the manual to check. You might have had some non-released code
running in Roseville, but the standard SS keyin doesn't show what
percentage of time is spent in Exec. To me, and as supported by the
evidence Scott gave, 80% seems way too high.
So let me get this straight: You don't believe the 80% figure I cite because it seems too high to you and it didn't come from a "standard" SS keyin of the time.
Meanwhile, you believe the figure cited by Mr. Lurndal because it seems more believable even though it comes from a system that's almost certainly running a different workload than the one I'm referring to which was from decades ago.
Did I get this right?
Seriously?
Basically right. Add to that the belief that I have that if OS
utilization were frequently 80%, then no customer would buy such a
system, as they would be losing 80% of it to the OS. And the fact that
I saw lots of customer systems when I was active in the community and
never saw anything like it.
But see below for a possible resolution to this issue.
What happened to the bit where *YOU* were saying about the amount of time spent in an OS was probably workload dependent?
I believe that. But 80% is way above any experience that I have had.
And since when does the credibility of local code written in Roseville (by people who are likely responsible for the care and feeding of the Exec that the local code is being written for) somehow become suspect just because said code didn't make it into a release ... whether its output is consistent with what you believe or not?
Wow! I never doubted the credibility of the Roseville Exec programmers.
They were exceptionally good. But you presented no evidence that such
code ever existed. I posited it as a possibility to explain what you recalled seeing. I actually doubt such code existed.
FWIW, I stand by the statement about seeing CPU utilization in excess of 80+% on a regular basis because that is what I recall seeing.
Ahhh! Here is the key. In this sentence, you say *CPU utilization*,
not *OS utilization*. The CPU utilization includes everything but idle
time, specifically including Exec plus all user (Batch, Demand, TIP,
RT,) time. BTW, this is readily calculated from the numbers in a
standard SS keyin. I certainly agree that this could be, and frequently was,
at 80% or higher. If you claim that you frequently saw CPU utilization
at 80%, I will readily believe you, and I suspect that Scott will too.
You can choose to believe me or not.
(And I would like to point out that I don't appreciate being called a liar no matter how politely it is done.)
Again, Wow! I never called you a liar. To be pedantic, a lie is
something that the originator knows is incorrect. I never said you were lying. At worst, I accused you of having bad recollection, not
intention, which, as I get older, I suffer from more and more. :-(
I cannot provide direct evidence to support my statement.
I don't have any console listings or demand terminal session listings where I entered an "@@cons ss", for example.
However, I can point to an old (~1981) video that clearly suggests that the 20% figure cited by Mr. Lurndal almost certainly doesn't apply to the Exec at least in some environments from way back when.
And I can wave my arms at why it is most certainly possible for a much higher figure to show up, at least theoretically, even today.
So the video I want to draw your attention to is entitled, "19th Annual Sperry Univac Spring Technical Symposium - 'Proposed Memory Management Techniques for Sperry Univac 1100 Series Systems'", and can be found here:
< https://digital.hagley.org/VID_1985261_B110_ID05?solr_nav%5Bid%5D=88d187d912cfce1a5ad1&solr_nav%5Bpage%5D=0&solr_nav%5Boffset%5D=2 >
Interesting video, thank you. BTW, the excessive time spent in memory allocation searching for the best fit, figuring out what to swap and minimizing fragmentation were probably motivating factors for going to a paging system.
But note that the allocation times getting up to 33% (as he said, due to larger memories being available and changing workload patterns) was such
a problem that they convened a task force to fix it, and it seems they put in patches pretty quickly. Assuming their changes were successful, it
should have substantially reduced memory allocation time.
But all of this about utilization and workload changing is not relevant
to the original question of whether having two caches, one of size X dedicated to Exec (supervisor) use and one of size Y dedicated to user use,
is better than a single cache of size X+Y available to both.
Since when in Exec mode, the effective cache size is smaller (does not include Y), and similarly for user, i.e. not including X, performance
will be worse for both workloads. This is different from a separate I
cache vs. D cache, as all programs use both simultaneously.
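As a toy illustration of that size argument -- the working sets, the 80/20 reference mix, and the capacities below are all made up just to show the mechanism, not measurements of any real system:

    from collections import OrderedDict
    import random

    def lru_hits(trace, capacity):
        # Fully associative LRU cache: count hits over a reference trace.
        cache, hits = OrderedDict(), 0
        for line in trace:
            if line in cache:
                hits += 1
                cache.move_to_end(line)
            else:
                if len(cache) >= capacity:
                    cache.popitem(last=False)   # evict the least recently used line
                cache[line] = True
        return hits

    random.seed(1)
    # Made-up workload: 80% supervisor references over a 48-line working set,
    # 20% user references over a 12-line working set.
    trace = [("sup", random.randrange(48)) if random.random() < 0.8
             else ("usr", random.randrange(12)) for _ in range(20000)]

    unified = lru_hits(trace, 64)                                  # one 64-line cache shared by both
    split = (lru_hits([r for r in trace if r[0] == "sup"], 32) +   # static 32/32 partition
             lru_hits([r for r in trace if r[0] == "usr"], 32))
    print(unified, split)   # the unified cache holds both working sets (60 lines < 64);
                            # the 32-line supervisor half can't hold its 48-line working set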
Feel free to respond, but ISTM that this thread has wandered so far from
its original topic that I am losing interest, and probably won't respond.
On Wed, 11 Oct 2023 08:58:51 -0700, Stephen Fuld <sf...@alumni.cmu.edu.invalid> wrote:
I know I'm going to regret responding to all of this...
On 10/8/2023 7:30 PM, Lewis Cole wrote:
So here's the second part of my reply to Mr. Fuld's last response to me.
Considering how quickly this reply has grown, I may end up breaking it up into a third part as well.
On 8/15/2023 11:48 AM, Lewis Cole wrote:
On Tuesday, August 15, 2023 at 12:06:53AM UTC-7, Stephen Fuld wrote:
<snip>
And even since the beginning of time
(well ... since real live multi-tasking
OS appeared), it has been obvious that
processors tend to spend most of their
time in supervisor mode (OS) code
rather than in user (program) code.
I don't want to get into an argument about caching with you, [...]
I'm not sure what sort of argument you
think I'm trying to get into WRT caching,
but I assume that we both are familiar
enough with it so that there's really
no argument to be had so your
comment makes no sense to me.
[...] but I am sure that the percentage of time spent in supervisor mode is very
workload dependent.
Agreed.
But to the extent that the results of
the SS keyin were ... useful .. in The
Good Old Days at Roseville, I recall
seeing something in excess of 80+
percent of the time was spent in the
Exec on regular basis.
I have to believe your memory is conflating two different things. Not surprising, given the timespan involved.
FWIW, the output from the SS keyin does not tell anyone how much time
was spent in the Exec. It tells the operator what percentage of the
possible Standard Units of Processing were consumed by Batch programs, Demand programs, and TIP transactions. SUPs are *not* accumulated by
the Exec. Note that the amount of possible SUPs in a measuring
interval is not particularly well-defined.
I have a vague memory from when I first worked at Univac facilities in Minnesota of seeing a sign describing how the system in the "Fishbowl"
had been instrumented to display performance numbers in real time (as opposed to Real Time performance numbers). I don't recall ever seeing
the system/display, so it's possible that the forty-some odd years
misspent in the employment of Univac and Unisys has left me with a
false memory.
Otherwise, the only way to see how much time is spent in the Exec
involves the use of SIP/OSAM, which you were almost certainly not
using from the operator's console.
On Wednesday, October 11, 2023 at 9:36:53 PM UTC-7, David W Schroth wrote:
On Wed, 11 Oct 2023 08:58:51 -0700, Stephen Fuld
<sf...@alumni.cmu.edu.invalid> wrote:
I know I'm going to regret responding to all of this...
On 10/8/2023 7:30 PM, Lewis Cole wrote:
So here's the second part of my reply to Mr. Fuld's last response to me.
Considering how quickly this reply has grown, I may end up breaking it up into a third part as well.
On 8/15/2023 11:48 AM, Lewis Cole wrote:
On Tuesday, August 15, 2023 at 12:06:53AM UTC-7, Stephen Fuld wrote:
<snip>
And even since the beginning of time
(well ... since real live multi-tasking
OS appeared), it has been obvious that
processors tend to spend most of their
time in supervisor mode (OS) code
rather than in user (program) code.
I don't want to get into an argument about caching with you, [...]
I'm not sure what sort of argument you
think I'm trying to get into WRT caching,
but I assume that we both are familiar
enough with it so that there's really
no argument to be had so your
comment makes no sense to me.
[...] but I am sure that the percentage of time spent in supervisor mode is very
workload dependent.
Agreed.
But to the extent that the results of
the SS keyin were ... useful .. in The
Good Old Days at Roseville, I recall
seeing something in excess of 80+
percent of the time was spent in the
Exec on regular basis.
I have to believe your memory is conflating two different things. Not surprising, given the timespan involved.
FWIW, the output from the SS keyin does not tell anyone how much time
was spent in the Exec. It tells the operator what percentage of the
possible Standard Units of Processing were consumed by Batch programs,
Demand programs, and TIP transactions. SUPs are *not* accumulated by
the Exec. Note that the amount of possible SUPs in a measuring
interval is not particularly well-defined.
I have a vague memory from when I first worked at Univac facilities in
Minnesota of seeing a sign describing how the system in the "Fishbowl"
had been instrumented to display performance numbers in real time (as
opposed to Real Time performance numbers). I don't recall ever seeing
the system/display, so it's possible that the forty-some odd years
misspent in the employment of Univac and Unisys has left me with a
false memory.
Otherwise, the only way to see how much time is spent in the Exec
involves the use of SIP/OSAM, which you were almost certainly not
using from the operator's console.
If Mr. Schroth says that I'm full of shit WRT being able to determine the amount of time spent in the Exec via an SS keyin, then I accept that I am full of shit.
I am/was wrong.
Thank you for the correction, Mr. Schroth.
On Wed, 11 Oct 2023 08:58:51 -0700, Stephen Fuld <sfuld@alumni.cmu.edu.invalid> wrote:
I know I'm going to regret responding to all of this...
So the video I want to draw your attention to is entitled, "19th Annual Sperry Univac Spring Technical Symposium - 'Proposed Memory Management Techniques for Sperry Univac 1100 Series Systems'", and can be found here:
< https://digital.hagley.org/VID_1985261_B110_ID05?solr_nav%5Bid%5D=88d187d912cfce1a5ad1&solr_nav%5Bpage%5D=0&solr_nav%5Boffset%5D=2 >
Interesting video, thank you. BTW, the excessive time spent in memory
allocation searching for the best fit, figuring out what to swap and
minimizing fragmentation were probably motivating factors for going to a
paging system.
Probably not so much. While I wasn't there when paging was
architected, I was there to design and implement it. The motivating
factor is almost certainly called out in the following quote - "There
is only one mistake in computer design that is difficult to recover
from - not having enough address bits for memory addressing and memory management."
On 10/11/2023 9:38 PM, David W Schroth wrote:
On Wed, 11 Oct 2023 08:58:51 -0700, Stephen Fuld
<sfuld@alumni.cmu.edu.invalid> wrote:
I know I'm going to regret responding to all of this...
I hope not. I am sure I am not alone in valuing your contributions here.
big snip
So the video I want to draw your attention to is entitled, "19th Annual Sperry Univac Spring Technical Symposium - 'Proposed Memory Management Techniques for Sperry Univac 1100 Series Systems'", and can be found here:
< https://digital.hagley.org/VID_1985261_B110_ID05?solr_nav%5Bid%5D=88d187d912cfce1a5ad1&solr_nav%5Bpage%5D=0&solr_nav%5Boffset%5D=2 >
Interesting video, thank you. BTW, the excessive time spent in memory
allocation searching for the best fit, figuring out what to swap and
minimizing fragmentation were probably motivating factors for going to a paging system.
Probably not so much. While I wasn't there when paging was
architected, I was there to design and implement it. The motivating
factor is almost certainly called out in the following quote - "There
is only one mistake in computer design that is difficult to recover
from - not having enough address bits for memory addressing and memory
management."
While I absolutely agree with the quotation, with all due respect, I
disagree that it was the motivation for implementing paging. A caveat, I
was not involved at all in either the architecture nor the
implementation. My argument is based primarily on logical analysis.
The reason that the ability for a program to address lots of memory
(i.e. more address bits) wasn't a factor in the decision is that Univac already had that problem solved!
I remember a conversation I had with Ron Smith at a Use conference
sometime probably in the late 1970s or early 1980s, when IBM had
implemented virtual memory/paging in the S/370 line. I can't remember
the exact quotation, but it was essentially that paging was sort of like multibanking, but turned "inside out".
That is, with virtual memory, multiple different, potentially large,
user program addresses get mapped to the same physical memory at
different times, whereas with multibanking, multiple smaller user
program addresses (i.e. bank relative addresses), get mapped at
different times (i.e. when the bank was pointed), to the same physical memory. In other words, both virtual memory/paging and multibanking
break the identity of program relative addresses with physical memory addresses.
Since you can have a large number (hundreds or thousands) of banks
defined within a program, by pointing different banks at different
times, you can address a huge amount of memory (far larger than any contemplated physical memory), and the limitation expressed in that
quotation doesn't apply.
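Purely as a conceptual sketch of that parallel (the structures and field sizes are invented; this is not the 2200 bank descriptor format or any real page table), both translations amount to a lookup in a table the OS controls:

    # Conceptual sketch only: bank translation and page translation both
    # map a program-relative address to a physical address via an OS table.
    def bank_translate(bank_table, b, offset):
        base, limit = bank_table[b]          # invented "descriptor": where the bank sits, how big it is
        if offset > limit:
            raise ValueError("limits violation")
        return base + offset                 # physical = bank base + bank-relative offset

    def page_translate(page_table, vaddr, page_size=4096):
        vpn, off = divmod(vaddr, page_size)  # split virtual address into page number + offset
        return page_table[vpn] * page_size + off

    banks = {1: (0o4000000, 0o777)}          # bank 1 based at an arbitrary physical address
    pages = {0: 7, 1: 3}                     # virtual pages 0 and 1 mapped to arbitrary frames
    print(oct(bank_translate(banks, 1, 0o123)))   # 0o4000123
    print(hex(page_translate(pages, 0x1234)))     # 0x3234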
Each solution (paging and multi banking) has advantages and
disadvantages, and one can argue the relative merits of the two
solutions (we can discuss that further if anyone cares), they both solve
the problem, so solving that problem shouldn't/couldn't be the
motivation for Unisys implementing paging in 2200s.
Obviously, I invite comments/questions/arguments, etc.
On Tue, 21 Nov 2023 10:47:02 -0800, Stephen Fuld <sfuld@alumni.cmu.edu.invalid> wrote:
On 10/11/2023 9:38 PM, David W Schroth wrote:
On Wed, 11 Oct 2023 08:58:51 -0700, Stephen Fuld
<sfuld@alumni.cmu.edu.invalid> wrote:
I know I'm going to regret responding to all of this...
I hope not. I am sure I am not alone in valuing your contributions here.
big snip
So the video I want to draw your attention to is entitled, "19th Annual Sperry Univac Spring Technical Symposium - 'Proposed Memory Management Techniques for Sperry Univac 1100 Series Systems'", and can be found here:
< https://digital.hagley.org/VID_1985261_B110_ID05?solr_nav%5Bid%5D=88d187d912cfce1a5ad1&solr_nav%5Bpage%5D=0&solr_nav%5Boffset%5D=2 >
Interesting video, thank you. BTW, the excessive time spent in memory
allocation searching for the best fit, figuring out what to swap and
minimizing fragmentation were probably motivating factors for going to a paging system.
Probably not so much. While I wasn't there when paging was
architected, I was there to design and implement it. The motivating
factor is almost certainly called out in the following quote - "There
is only one mistake in computer design that is difficult to recover
from - not having enough address bits for memory addressing and memory
management."
While I absolutely agree with the quotation, with all due respect, I disagree that it was the motivation for implementing paging. A caveat, I was not involved at all in either the architecture nor the
implementation. My argument is based primarily on logical analysis.
The reason that the ability for a program to address lots of memory
(i.e. more address bits) wasn't a factor in the decision is that Univac already had that problem solved!
I remember a conversation I had with Ron Smith at a Use conference
sometime probably in the late 1970s or early 1980s, when IBM had implemented virtual memory/paging in the S/370 line. I can't remember
the exact quotation, but it was essentially that paging was sort of like multibanking, but turned "inside out".
That is, with virtual memory, multiple different, potentially large,
user program addresses get mapped to the same physical memory at
different times, whereas with multibanking, multiple smaller user
program addresses (i.e. bank relative addresses), get mapped at
different times (i.e. when the bank was pointed), to the same physical memory. In other words, both virtual memory/paging and multibanking
break the identity of program relative addresses with physical memory addresses.
Since you can have a large number (hundreds or thousands) of banks
defined within a program, by pointing different banks at different
times, you can address a huge amount of memory (far larger than any contemplated physical memory), and the limitation expressed in that quotation doesn't apply.
I believe there are a couple of problems with that view.
The Exec depended very much on absolute addressing when managing
memory, which limited the systems to 2 ** 24 words of physical memory,
which was Not Enough.
And the amount of virtual space available to the system was limited by
the amount of swapfile space which, if I recall correctly, was limited
to 03400000000 words (less than half a GiW, although I am too lazy to
figure out the exact amount).
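For what it's worth, taking that figure as octal, a quick check gives:

    words = 0o3400000000      # the swapfile limit above, read as octal
    print(words)              # 469762048 words
    print(words / 2**30)      # 0.4375 GiW -- a bit under half, as said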
I think both of those problems could have been addressed in a swapping
system by applying some of the paging design (2200 paging supplied one
or more Working Set file(s) for each subsystem), but swapping Large
Banks (2**24 words max) or Very Large Banks (2**30 words max) would
take too long and consume too much I/O bandwidth.
I grant it would be
interesting to follow up on Nick McLaren's idea of using base and
bounds with swapping on systems with a lot of memory, but my
experiences fixing 2200 paging bugs suggests (to me) that the end
result would not be as satisfactory as Nick thought (even though he's probably much smarter than me).
My views are colored by my suspicion that I am one of the very few
people still working who has worked down in the bowels of memory
management by swapping and memory management by paging.
David W Schroth <davidschroth@harrietmanor.com> writes:
On Tue, 21 Nov 2023 10:47:02 -0800, Stephen Fuld <sfuld@alumni.cmu.edu.invalid> wrote:
On 10/11/2023 9:38 PM, David W Schroth wrote:
On Wed, 11 Oct 2023 08:58:51 -0700, Stephen Fuld
<sfuld@alumni.cmu.edu.invalid> wrote:
I know I'm going to regret responding to all of this...
I hope not. I am sure I am not alone in valuing your contributions here.
big snip
So the video I want to draw your attention to is entitled, "19th Annual Sperry Univac Spring Technical Symposium - 'Proposed Memory Management Techniques for Sperry Univac 1100 Series Systems'", and can be found here:
< https://digital.hagley.org/VID_1985261_B110_ID05?solr_nav%5Bid%5D=88d187d912cfce1a5ad1&solr_nav%5Bpage%5D=0&solr_nav%5Boffset%5D=2 >
Interesting video, thank you. BTW, the excessive time spent in memory allocation searching for the best fit, figuring out what to swap and minimizing fragmentation were probably motivating factors for going to a paging system.
Probably not so much. While I wasn't there when paging was
architected, I was there to design and implement it. The motivating
factor is almost certainly called out in the following quote - "There
is only one mistake in computer design that is difficult to recover
from - not having enough address bits for memory addressing and memory management."
While I absolutely agree with the quotation, with all due respect, I disagree that it was the motivation for implementing paging. A caveat, I was not involved at all in either the architecture nor the implementation. My argument is based primarily on logical analysis.
The reason that the ability for a program to address lots of memory
(i.e. more address bits) wasn't a factor in the decision is that Univac already had that problem solved!
I remember a conversation I had with Ron Smith at a Use conference sometime probably in the late 1970s or early 1980s, when IBM had implemented virtual memory/paging in the S/370 line. I can't remember the exact quotation, but it was essentially that paging was sort of like multibanking, but turned "inside out".
That is, with virtual memory, multiple different, potentially large,
user program addresses get mapped to the same physical memory at different times, whereas with multibanking, multiple smaller user
program addresses (i.e. bank relative addresses), get mapped at
different times (i.e. when the bank was pointed), to the same physical memory. In other words, both virtual memory/paging and multibanking break the identity of program relative addresses with physical memory addresses.
Since you can have a large number (hundreds or thousands) of banks defined within a program, by pointing different banks at different
times, you can address a huge amount of memory (far larger than any contemplated physical memory), and the limitation expressed in that quotation doesn't apply.
I believe there are a couple of problems with that view.
The Exec depended very much on absolute addressing when managing
memory, which limited the systems to 2 ** 24 words of physical memory, which was Not Enough.
And the amount of virtual space available to the system was limited by
the amount of swapfile space which, if I recall correctly, was limited
to 03400000000 words (less than half a GiW, although I am too lazy to
figure out the exact amount).
I think both of those problems could have been addressed in a swapping system by applying some of the paging design (2200 paging supplied one
or more Working Set file(s) for each subsystem), but swapping Large
Banks (2**24 words max) or Very Large Banks (2**30 words max) would
take too long and consume too much I/O bandwidth.
I grant it would be
interesting to follow up on Nick McLaren's idea of using base and
bounds with swapping on systems with a lot of memory, but my
experiences fixing 2200 paging bugs suggests (to me) that the end
result would not be as satisfactory as Nick thought (even though he's probably much smarter than me).
My experiences with MCP/VS on the Burroughs side, which
supported swapping (rollout/rollin) contiguous regions
of 1000 digit "pages" showed that checkerboarding of
memory was inevitable, leading to OS defragmentation
overhead and/or excessive swapping in any kind of
multiprogramming environment.
Swapping to solid state disk ameliorated the performance
overhead somewhat, but at a price.
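A tiny made-up illustration of that checkerboarding (the region sizes are arbitrary): the total free space can be plenty while no single hole is big enough, which is what forces the defragmentation or the extra swapping:

    # Toy picture of checkerboarded memory after programs come and go.
    memory = [("A", 30), ("free", 10), ("B", 20), ("free", 15), ("C", 25)]
    total_free   = sum(size for kind, size in memory if kind == "free")  # 25 units free in total
    largest_hole = max(size for kind, size in memory if kind == "free")  # but the largest hole is 15
    print(total_free, largest_hole)  # a 20-unit program won't fit without compaction or swapping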
My views are colored by my suspicion that I am one of the very few
people still working who has worked down in the bowels of memory
management by swapping and memory management by paging.
I'll take paging over any segmentation scheme anyday.
On Tue, 21 Nov 2023 10:47:02 -0800, Stephen Fuld <sfuld@alumni.cmu.edu.invalid> wrote:
On 10/11/2023 9:38 PM, David W Schroth wrote:
On Wed, 11 Oct 2023 08:58:51 -0700, Stephen Fuld
<sfuld@alumni.cmu.edu.invalid> wrote:
I know I'm going to regret responding to all of this...
I hope not. I am sure I am not alone in valuing your contributions here.
big snip
So the video I want to draw your attention to is entitled, "19th Annual Sperry Univac Spring Technical Symposium - 'Proposed Memory Management Techniques for Sperry Univac 1100 Series Systems'", and can be found here:
< https://digital.hagley.org/VID_1985261_B110_ID05?solr_nav%5Bid%5D=88d187d912cfce1a5ad1&solr_nav%5Bpage%5D=0&solr_nav%5Boffset%5D=2 >
Interesting video, thank you. BTW, the excessive time spent in memory
allocation searching for the best fit, figuring out what to swap and
minimizing fragmentation were probably motivating factors for going to a paging system.
Probably not so much. While I wasn't there when paging was
architected, I was there to design and implement it. The motivating
factor is almost certainly called out in the following quote - "There
is only one mistake in computer design that is difficult to recover
from - not having enough address bits for memory addressing and memory
management."
While I absolutely agree with the quotation, with all due respect, I
disagree that it was the motivation for implementing paging. A caveat, I
was not involved at all in either the architecture nor the
implementation. My argument is based primarily on logical analysis.
The reason that the ability for a program to address lots of memory
(i.e. more address bits) wasn't a factor in the decision is that Univac
already had that problem solved!
I remember a conversation I had with Ron Smith at a Use conference
sometime probably in the late 1970s or early 1980s, when IBM had
implemented virtual memory/paging in the S/370 line. I can't remember
the exact quotation, but it was essentially that paging was sort of like
multibanking, but turned "inside out".
That is, with virtual memory, multiple different, potentially large,
user program addresses get mapped to the same physical memory at
different times, whereas with multibanking, multiple smaller user
program addresses (i.e. bank relative addresses), get mapped at
different times (i.e. when the bank was pointed), to the same physical
memory. In other words, both virtual memory/paging and multibanking
break the identity of program relative addresses with physical memory
addresses.
Since you can have a large number (hundreds or thousands) of banks
defined within a program, by pointing different banks at different
times, you can address a huge amount of memory (far larger than any
contemplated physical memory), and the limitation expressed in that
quotation doesn't apply.
I believe there are a couple of problems with that view.
The Exec depended very much on absolute addressing when managing
memory, which limited the systems to 2 ** 24 words of physical memory,
which was Not Enough.
And the amount of virtual space available to the system was limited by
the amount of swapfile space which, if I recall correctly, was limited
to 03400000000 words (less than half a GiW, although I am too lazy to
figure out the exact amount).
I think both of those problems could have been addressed in a swapping
system by applying some of the paging design (2200 paging supplied one
or more Working Set file(s) for each subsystem), but swapping Large
Banks (2**24 words max) or Very Large Banks (2**30 words max) would
take too long and consume too much I/O bandwidth.
I grant it would be
interesting to follow up on Nick McLaren's idea of using base and
bounds with swapping on systems with a lot of memory, but my
experiences fixing 2200 paging bugs suggests (to me) that the end
result would not be as satisfactory as Nick thought (even though he's probably much smarter than me).
Each solution (paging and multi banking) has advantages and
disadvantages, and one can argue the relative merits of the two
solutions (we can discuss that further if anyone cares), they both solve
the problem, so solving that problem shouldn't/couldn't be the
motivation for Unisys implementing paging in 2200s.
Obviously, I invite comments/questions/arguments, etc.
My views are colored by my suspicion that I am one of the very few
people still working who has worked down in the bowels of memory
management by swapping and memory management by paging.
On Thu, 23 Nov 2023 20:25:03 GMT, scott@slp53.sl.home (Scott Lurndal)
wrote:
My experiences with MCP/VS on the Burroughs side, which
supported swapping (rollout/rollin) contiguous regions
of 1000 digit "pages" showed that checkerboarding of
memory was inevitable, leading to OS defragmentation
overhead and/or excessive swapping in any kind of
multiprogramming environment.
Swapping to solid state disk ameliorated the performance
overhead somewhat, but at a price.
My views are colored by my suspicion that I am one of the very few
people still working who has worked down in the bowels of memory
management by swapping and memory management by paging.
I'll take paging over any segmentation scheme anyday.
And my experience with OS 2200 memory management leaves me preferring
both paging and segmentation, each doing what they do best.
Paging for mapping virtual to physical and getting chunks of virtual
space into physical memory and out to backing store.
And segmentation for process/thread isolation and access control.
Which is how they have been used in OS 2200 since the early '90s...
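As a purely conceptual sketch of that division of labor (the structures and names below are invented for illustration, not OS 2200 internals): the segment check supplies the isolation and access control, and paging then maps the resulting virtual address onto a physical frame:

    # Conceptual two-stage translation: segment (bank) check, then paging.
    def translate(segments, page_table, seg, offset, write, page_size=4096):
        s = segments[seg]
        if offset > s["limit"] or (write and not s["writable"]):
            raise PermissionError("segment limits/access violation")  # isolation + access control
        vaddr = s["virtual_base"] + offset                            # segment -> virtual address
        vpn, off = divmod(vaddr, page_size)
        return page_table[vpn] * page_size + off                      # virtual -> physical frame

    segs  = {0: {"virtual_base": 0x10000, "limit": 0x1FFF, "writable": False}}
    pages = {0x10: 0x80, 0x11: 0x81}
    print(hex(translate(segs, pages, 0, 0x0234, write=False)))        # 0x80234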