• Re: Bypassing VM

    From Thomas Koenig@21:1/5 to Robert Finch on Thu Feb 22 17:14:01 2024
    Robert Finch <robfi680@gmail.com> schrieb:
    Wondering tonight how to get by without virtual memory. The page table
    walker has some sort of bug in it and I do not feel like debugging it :) Virtual memory support also uses up at least 6k LUTs. It is the virtual addressing part that may be made a config option. Memory would still be protected on a page basis with a set of keys.

    So, for a large number of pages, how do you access the keys? What else
    is there, apart from page walking?

    So, it is not possible to
    access the memory page without the correct key. IMO there is not
    that much hardware or instruction set required to support a system
    without VM. It is mainly software approaches. I added absolute address subroutine calls with a large address field to Q+, because code may be calling other code anywhere in memory if it is not mapped. It is a bit
    easier to relocate absolute addresses.
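    As a hedged illustration of the page-key idea quoted above (names and widths
    are my own, loosely modeled on S/360-style storage keys rather than on Q+
    itself):

    #include <stdint.h>
    #include <stdbool.h>

    #define PAGE_SHIFT 14                  /* hypothetical 16KB pages */
    #define NUM_FRAMES (1u << 18)          /* hypothetical physical frame count */

    static uint8_t frame_key[NUM_FRAMES];  /* one small key per physical page */

    /* A running task carries a key; key 0 is assumed to be a supervisor
       "master" key that matches everything. */
    static bool access_allowed(uint64_t paddr, uint8_t task_key)
    {
        uint8_t page_key = frame_key[(paddr >> PAGE_SHIFT) % NUM_FRAMES];
        return task_key == 0 || task_key == page_key;
    }

    The open question asked above is where such keys live so that a large number
    of pages can be checked without a table walk; a flat array indexed by
    physical frame number, as sketched, is one answer, at the price of a
    dedicated key memory or cache.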

    I wonder what would be required to dynamically re-locate programs at run time. And have programs such that the pages making up the program may be relocated.

    Define one or two base registers, and make all addresses relative
    to that base register. If Wikipedia is not mistaken, the Univac
    1108 had this feature (and the S/360 didn't).

    One problem is large BSS sections, which can be mapped to a single
    zero page with copy-on-write semantics on VM systems.
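    For concreteness, a hedged C sketch of the base-register scheme being
    suggested (all names hypothetical): every address the program generates is
    an offset that hardware adds to a per-process base, so the OS can move the
    whole image and update only the base and limit.

    #include <stdint.h>

    /* Per-process relocation state (hypothetical). */
    struct reloc {
        uint64_t base;    /* physical address the image currently sits at */
        uint64_t limit;   /* size of the image in bytes */
    };

    /* What the hardware would do on every access: add the base, check the
       limit. Returns 0 on a limit violation, where real hardware would trap. */
    static uint64_t translate(const struct reloc *r, uint64_t offset)
    {
        if (offset >= r->limit)
            return 0;
        return r->base + offset;
    }

    /* Moving the program is then a physical copy plus one register update; no
       pointer inside the program has to change, because the program only ever
       sees offsets. */
    static void move_image(struct reloc *r, uint64_t new_base)
    {
        /* copy r->limit bytes from r->base to new_base in physical memory */
        r->base = new_base;
    }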

  • From Scott Lurndal@21:1/5 to Thomas Koenig on Thu Feb 22 17:31:17 2024
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Robert Finch <robfi680@gmail.com> schrieb:
    Wondering tonight how to get by without virtual memory. The page table
    walker has some sort of bug in it and I do not feel like debugging it :)
    Virtual memory support also uses up at least 6k LUTs. It is the virtual
    addressing part that may be made a config option. Memory would still be
    protected on a page basis with a set of keys.

    So, for a large number of pages, how do you access the keys? What else
    is there, apart from page walking?

    So, it is not possible to
    access the memory page without the correct key. IMO there is not
    that much hardware or instruction set required to support a system
    without VM. It is mainly software approaches. I added absolute address
    subroutine calls with a large address field to Q+, because code may be
    calling other code anywhere in memory if it is not mapped. It is a bit
    easier to relocate absolute addresses.

    I wonder what would be required to dynamically re-locate programs at run
    time. And have programs such that the pages making up the program may be
    relocated.

    Define one or two base registers, and make all addresses relative
    to that base register. If Wikipedia is not mistaken, the Univac
    1108 had this feature (and the S/360 didn't).

    The Burroughs B3500 addressing was relative to a base register.

    When the base register was zero, that was considered "control"
    state; when the base register was non-zero, that was "normal"
    state. Applications ran in normal state, and in normal state
    privileged instructions would trap to the MCP (which operated
    in control state).

    Later generations added 7 more base-limit register pairs to support
    access to eight active segments at any one time by an application.
    A non-local procedure call instruction could load a different set
    of base-limit registers as part of the call (other than base 0,
    which was common to the application and contained the stack). The
    high-order digit of the address selected which base-limit pair
    to use.

    The problem with all segmentation schemes is memory fragmentation
    and the corresponding overhead to move stuff around during
    allocation.
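    For concreteness, a hedged sketch of the digit-selected base-limit
    addressing described above (names invented; the real machine worked in
    decimal digits, which I only approximate here):

    #include <stdint.h>

    struct base_limit {
        uint32_t base;    /* start of the segment, in digits */
        uint32_t limit;   /* segment length, in digits (up to 1,000,000) */
    };

    /* Eight active base-limit pairs; pair 0 holds the stack/data region. */
    static struct base_limit pairs[8];

    /* Translate a logical address: the high-order digit picks the pair,
       the remaining digits are an offset checked against the limit. */
    static int translate(uint32_t laddr, uint32_t *paddr)
    {
        uint32_t pair   = (laddr / 1000000u) % 10u;
        uint32_t offset =  laddr % 1000000u;

        if (pair >= 8 || offset >= pairs[pair].limit)
            return -1;                        /* would trap to the MCP */
        *paddr = pairs[pair].base + offset;
        return 0;
    }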

  • From John Levine@21:1/5 to All on Thu Feb 22 18:26:37 2024
    According to Scott Lurndal <slp53@pacbell.net>:
    I wonder what would be required to dynamically re-locate programs at run
    time. And have programs such that the pages making up the program may be
    relocated.

    Define one or two base registers, and make all addresses relative
    to that base register. If Wikipedia is not mistaken, the Univac
    1108 had this feature (and the S/360 didn't).

    The Burroughs B3500 addressing was relative to a base register.

    The PDP-6 had base and limit registers that relocated all addresses
    when a program was running in user mode. The KA-10 added a second pair
    so the high and low halves of the address space could be relocated
    separately and the high half read-only, allowing multiple copies of a
    running program to share the same code segment. The KI-10 and later
    models had paging.

    The base/limit stuff worked, but the operating system wasted a fair
    amount of time shuffling memory to make free space contiguous, and a
    segment could only be entirely swapped in or out, not partially
    resident as with paging.

    S/360 had no relocation at all other than a hack called prefixing
    which relocated the lowest addresses in multiprocessors so they could
    each handle their own interrupts. Reading between the lines in the
    S/360 architecture paper, it appears they thought that with lots of
    registers and a base register in every address, they could do it all
    by adjusting the registers. That was of course wrong since programs
    store addresses in memory and few programs were written with enough
    discipline for that to work. (The only one I know was APL\360.) Almost immediately the 360/67 added its own paging used in TSS and CP/67, and
    five years later S/370 added paging as a standard feature.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From MitchAlsup1@21:1/5 to Robert Finch on Thu Feb 22 19:39:50 2024
    Robert Finch wrote:

    Wondering tonight how to get by without virtual memory. The page table
    walker has some sort of bug in it and I do not feel like debugging it :)

    MIPS used a SW TLB reloader from a hashed <and big> table. You could make
    access to the hash table HW, and if the access works--great; if not, trap to
    SW. This is the simplest HW TLB reloader. SW determines what is in the table;
    HW determines whether what is in the table gets into the TLB. So, this is not
    a table walker, but a single hash-table probe.

    In general it works well when the size of the table is 1MB. The hash function
    and data layout are up to you.
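    A hedged sketch of that reload path (structure names and the hash are
    illustrative only): on a TLB miss the hardware makes a single probe of a big
    hashed table; a tag match refills the TLB, a mismatch traps to software,
    which maintains the table.

    #include <stdint.h>
    #include <stdbool.h>

    #define HPT_ENTRIES (1u << 16)   /* ~1MB of table at 16 bytes/entry */

    struct hpt_entry {
        uint64_t vpn_tag;    /* virtual page number this slot currently maps */
        uint64_t pte;        /* physical frame number + permissions */
    };

    static struct hpt_entry hpt[HPT_ENTRIES];

    static inline uint32_t hash_vpn(uint64_t vpn)
    {
        /* illustrative hash: fold and mix the VPN */
        vpn ^= vpn >> 17;
        return (uint32_t)((vpn * 0x9E3779B97F4A7C15ull) >> 48) % HPT_ENTRIES;
    }

    /* What the hardware would do on a TLB miss: one probe, no walking. */
    static bool hw_tlb_refill(uint64_t vpn, uint64_t *pte_out)
    {
        struct hpt_entry *e = &hpt[hash_vpn(vpn)];
        if (e->vpn_tag == vpn) {
            *pte_out = e->pte;   /* hit: load straight into the TLB */
            return true;
        }
        return false;            /* miss: trap to SW, which refills hpt[] */
    }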

    Virtual memory support also uses up at least 6k LUTs. It is the virtual addressing part that may be made a config option. Memory would still be protected on a page basis with a set of keys. So, it is not possible to access the memory page without the correct key. IMO there is not
    that much hardware or instruction set required to support a system
    without VM. It is mainly software approaches. I added absolute address subroutine calls with a large address field to Q+, because code may be calling other code anywhere in memory if it is not mapped. It is a bit
    easier to relocate absolute addresses.

    I wonder what would be required to dynamically re-locate programs at run time. And have programs such that the pages making up the program may be relocated.

    VM gives the ability to relocate bad memory pages, and make the memory
    look contiguous. What that gives is linear execution of instructions
    across page boundaries.
    To get some sort of emulation of the capability to relocate pages, a
    jump instruction to the next page of memory could be placed at the end
    of a memory page. Having a jump at the end of a page is not going to
    impact performance significantly. A table of addresses that need to be relocated for a given page could be placed at the end of the page of memory.
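    A hedged sketch of what that per-page arrangement might look like (the
    layout is my own invention, not Q+'s): each code page ends with a jump slot
    aimed at the next page plus a small table of offsets that hold absolute
    addresses, so a loader can fix them up when the page moves.

    #include <stdint.h>
    #include <stddef.h>

    #define PAGE_SIZE 8192u   /* hypothetical */

    /* Trailer stored in the last bytes of each code page (layout invented). */
    struct page_trailer {
        uint32_t next_page;     /* where the trailing jump currently points */
        uint16_t n_fixups;      /* number of absolute addresses in this page */
        uint16_t fixup_off[62]; /* byte offsets of those addresses */
    };

    /* Relocate one page: adjust every recorded absolute address by the move
       distance, then re-aim the trailing jump at wherever the following page
       now lives. */
    static void relocate_page(uint8_t *page, int64_t delta, uint32_t new_next)
    {
        struct page_trailer *t =
            (struct page_trailer *)(page + PAGE_SIZE - sizeof *t);

        for (uint16_t i = 0; i < t->n_fixups; i++) {
            uint32_t *slot = (uint32_t *)(page + t->fixup_off[i]);
            *slot = (uint32_t)(*slot + delta);   /* patch the absolute address */
        }
        t->next_page = new_next;                 /* the end-of-page jump target */
    }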

  • From Scott Lurndal@21:1/5 to BGB on Thu Feb 22 22:11:50 2024
    BGB <cr88192@gmail.com> writes:
    On 2/22/2024 1:39 PM, MitchAlsup1 wrote:
    Robert Finch wrote:

    Wondering tonight how to get by without virtual memory. The page table
    walker has some sort of bug in it and I do not feel like debugging it :)

    MIPS used a SW TLB reloader from a hashed <and big> table. You could make
    access to the hash table HW and if the access works--great, if not trap
    to SW. This is the simplest HW TLB reloader. SW determines what is in
    the table
    HW determines if what is in the table gets in the TLB. So, this is not a
    table walker, but a single hash table probe.


    Yes, I had considered this as a possible ISA feature I had called "VIPT".

    VIPT usually refers to cache organization (virtual index, physical tag).

  • From MitchAlsup1@21:1/5 to BGB on Thu Feb 22 22:41:31 2024
    BGB wrote:

    On 2/22/2024 1:39 PM, MitchAlsup1 wrote:
    Robert Finch wrote:

    Wondering tonight how to get by without virtual memory. The page table
    walker has some sort of bug in it and I do not feel like debugging it :)

    MIPS used a SW TLB reloader from a hashed <and big> table. You could make
    access to the hash table HW and if the access works--great, if not trap
    to SW. This is the simplest HW TLB reloader. SW determines what is in
    the table
    HW determines if what is in the table gets in the TLB. So, this is not a
    table walker, but a single hash table probe.


    Yes, I had considered this as a possible ISA feature I had called "VIPT".

    Though, thus far, VIPT has not been implemented.
    It is either this or a HW page walker; a case could be made for the tradeoffs either way.


    As can be noted, my existing 256x4-way TLB has a reasonably good average
    case hit-rate.

    A properly good hashing function is as good as sets of associativity
    most of the time and often better on average.

    Something like VIPT would mainly have merit if it could have higher associativity, but associativity is more expensive in this case than
    capacity (I had mostly assumed 8-way, if VIPT were implemented, likely
    used in combination with the 4-way in the main TLB).

    Higher associativity in VIPT could be possible if one assumes a
    state-machine and linear probing (makes sense, as the bus supports
    128-bit fetches, which are 1 TLBE in the 48-bit addressing mode; an
    8-way VIPT requiring 8 bus fetches / probes).

    Once the HW is making more than 1 access, you might as well walk the
    page access structure.

    Had considered hashing based on ASID, but this interferes with the
    ability to have global pages. Splitting up global pages into groups,
    with only ASIDs within a given group able to see each others' pages,
    does help here (then the group can be used to hash the entry into the TLB).

    ASID is only part of a good hashing function.

    One other considered idea was to split the TLB into separate ways for
    Global and Non-Global pages, but this has a similar cost issue to
    increasing associativity (because it is), say:
    4-way, non-global pages (index is hashed);
    2-way, global pages (index is strictly modulo).

    Though, another option being to simply not actually have global pages.


    Or, also possible (if VIPT were used):
    TLB remains 4-way, but purely ASID-local internally;
    6-way associativity for local pages in VIPT (hashed based on ASID);
    2-way associativity for global pages in VIPT (strict modulo indexing).


    Where, I had noted before that 4-way is seemingly near the minimum
    needed to get good results from the TLB (or entirely avoid deadlock
    scenarios in the case of software-managed TLB).
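    For concreteness, a hedged sketch of a 256-set, 4-way TLB lookup of the
    general shape discussed here (field widths, hash, and policies are
    assumptions, not BGB's actual design):

    #include <stdint.h>
    #include <stdbool.h>

    #define TLB_SETS 256
    #define TLB_WAYS 4

    struct tlbe {
        uint64_t vpn;      /* virtual page number */
        uint16_t asid;     /* address-space ID this entry belongs to */
        uint64_t pte;      /* translation + permissions */
        bool     valid;
        bool     global;   /* ignore the ASID match when set */
    };

    static struct tlbe tlb[TLB_SETS][TLB_WAYS];

    /* The index hash folds address bits only; keeping the ASID out of the
       index is what lets a global entry be found from any address space
       (the trade-off discussed above). */
    static unsigned tlb_index(uint64_t vpn)
    {
        return (unsigned)((vpn ^ (vpn >> 8) ^ (vpn >> 16)) & (TLB_SETS - 1));
    }

    static bool tlb_lookup(uint64_t vpn, uint16_t asid, uint64_t *pte_out)
    {
        struct tlbe *set = tlb[tlb_index(vpn)];
        for (int w = 0; w < TLB_WAYS; w++) {
            if (set[w].valid && set[w].vpn == vpn &&
                (set[w].global || set[w].asid == asid)) {
                *pte_out = set[w].pte;
                return true;
            }
        }
        return false;   /* miss: page walk or SW-refill trap, per the design */
    }

    Keeping the ASID out of the index sidesteps the global-page conflict noted
    above, at the cost of different address spaces mapping the same addresses
    into the same sets.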


    In general it works well when the size of the table is 1MB. The hash function
    and data layout are up to you.


    Probably depends on entry size and associativity.


    Virtual memory support also uses up at least 6k LUTs. It is the
    virtual addressing part that may be made a config option. Memory would
    still be protected on a page basis with a set of keys. So, it is not
    possible to access the memory page without the correct key. IMO
    there is not that much hardware or instruction set required to support
    a system without VM. It is mainly software approaches. I added
    absolute address subroutine calls with a large address field to Q+,
    because code may be calling other code anywhere in memory if it is not
    mapped. It is a bit easier to relocate absolute addresses.

    I wonder what would be required to dynamically re-locate programs at
    run time. And have programs such that the pages making up the program
    may be relocated.

    VM gives the ability to relocate bad memory pages, and make the memory
    look contiguous. What that gives is linear execution of instructions
    across page boundaries.
    To get some sort of emulation of the capability to relocate pages, a
    jump instruction to the next page of memory could be placed at the end
    of a memory page. Having a jump at the end of a page is not going to
    impact performance significantly. A table of addresses that need to be
    relocated for a given page could be placed at the end of the page of
    memory.

  • From Scott Lurndal@21:1/5 to Robert Finch on Fri Feb 23 15:27:33 2024
    Robert Finch <robfi680@gmail.com> writes:
    On 2024-02-22 2:55 p.m., BGB wrote:

    There may be a need to relocate programs due to the OS defragmenting
    memory. The OS might be able to create a larger free contiguous space
    if it can shift a program to one side or the other of the space that it is
    in the middle of. Even better if it is allowed to rearrange code into
    discontiguous blocks. Getting a program relocated is probably easier
    than ensuring that there is a large enough data space. Most programs are
    small compared to the size of memory, so maybe relocating in
    discontiguous blocks is not necessary. There are hundreds or thousands
    of pieces of code running in memory. I can see the code moving around
    periodically to open up free space for data. A background defrag task
    could be running, which would burn up CPU cycles, but it is cycles that
    would otherwise be burned up by VM. A trick would be to keep the program
    code runnable while it is being relocated. Lots of trampolines or
    memory-indirect calls.

    Having spent the large part of a decade working on an operating system
    on a multiple processor system with no paging:
    DON'T GO THERE!

  • From John Dallman@21:1/5 to Scott Lurndal on Fri Feb 23 18:35:00 2024
    In article <FP2CN.411932$PuZ9.183438@fx11.iad>, scott@slp53.sl.home
    (Scott Lurndal) wrote:

    Having spent the large part of a decade working on an operating system
    on a multiple processor system with no paging:
    DON'T GO THERE!

    I have not done that, but I have done programming on two related classes
    of operating systems:

    Virtual memory systems without swapping or demand paging. You are limited
    to physical memory, but the ability of the OS to re-map pages to avoid fragmentation makes life reasonably simple. But only reasonably: you
    really get to exercise your out-of-memory handlers.

    Non-virtual memory systems without swapping or demand paging. For small
    systems with only one application program running, this isn't too bad.
    But as soon as the OS needs to allocate memory in response to calls to it
    (say, if there's a GUI that needs bitblt buffers) it becomes a nightmare.
    Not because of code being shuffled, that can be done transparently, but
    because of *data* having to be moved around.

    That immediately means you can't store pointers to data in memory, so
    linked lists become very complicated and even slower. The way memory
    worked on the OSes where I did this (Classic MacOS and 16-bit Windows)
    involved a two-level allocation scheme: you allocate memory for data, and
    get back a "handle". If you want to access that memory, you have to "pin"
    the handle, which makes the memory block unmovable until you un-pin it.
    Stacks have to be pinned at all times, of course, and saving pointers to functions becomes ... complicated.
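    A hedged sketch of the handle scheme being described (API names invented,
    not the real Mac or Win16 calls): a handle points at a master record the
    allocator may rewrite when it compacts memory, and pinning temporarily
    forbids that.

    #include <stdlib.h>
    #include <string.h>

    /* A handle is a pointer to a master record owned by the allocator; the
       allocator may move the block and rewrite mem, unless it is pinned. */
    typedef struct block {
        void   *mem;
        size_t  size;
        int     pin_count;
    } *Handle;

    static Handle new_handle(size_t size)
    {
        Handle h = malloc(sizeof *h);
        if (!h)
            return NULL;
        h->mem = malloc(size);
        h->size = size;
        h->pin_count = 0;
        return h;
    }

    /* Pin before dereferencing; the pointer is only valid until unpin. */
    static void *pin(Handle h)   { h->pin_count++; return h->mem; }
    static void  unpin(Handle h) { h->pin_count--; }

    /* What the compactor does to an unpinned block: copy it elsewhere and
       rewrite the master pointer (returning the old space to the free pool
       is omitted here). */
    static int try_move(Handle h, void *dest)
    {
        if (h->pin_count > 0)
            return 0;
        memcpy(dest, h->mem, h->size);
        h->mem = dest;
        return 1;
    }

    Classic MacOS spelled pin/unpin as HLock/HUnlock; the hazard is exactly the
    one described above: any raw pointer obtained from pin() goes stale once
    the block is unpinned and the compactor runs.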

    It's normally impractical to port code written for conventional virtual
    memory operating systems to handle-based systems. It's pretty hard to
    move code in the opposite direction, too. Nobody who has experienced both
    forms wants anything to do with handle-based systems.

    John

  • From Scott Lurndal@21:1/5 to John Dallman on Fri Feb 23 18:54:05 2024
    jgd@cix.co.uk (John Dallman) writes:
    In article <FP2CN.411932$PuZ9.183438@fx11.iad>, scott@slp53.sl.home
    (Scott Lurndal) wrote:

    Having spent the large part of a decade working on an operating system
    on a multiple processor system with no paging:
    DON'T GO THERE!

    I have not done that, but I have done programming on two related classes
    of operating systems:

    Virtual memory systems without swapping or demand paging. You are limited
    to physical memory, but the ability of the OS to re-map pages to avoid
    fragmentation makes life reasonably simple. But only reasonably: you
    really get to exercise your out-of-memory handlers.

    Non-virtual memory systems without swapping or demand paging. For small
    systems with only one application program running, this isn't too bad.
    But as soon as the OS needs to allocate memory in response to calls to it
    (say, if there's a GUI that needs bitblt buffers) it becomes a nightmare.
    Not because of code being shuffled, that can be done transparently, but
    because of *data* having to be moved around.

    The mainframe system mentioned above had base-limit registers
    (with eight active at any one time) describing the application memory
    regions. Region zero was the primary data region (containing
    processor data (index registers, stack pointer, etc) in the first
    50 bytes) and user data (plus the stack) in the remaining space
    in the segment (max seg size 500KB, i.e. 1,000,000 4-bit digits).
    Region one contained code.
    Regions 2 through 7 were optional and contained application
    data (up to 500KB each).

    A collection of eight regions was called an Environment, of
    which one could be active at any time. A Virtual Enter (far call)
    instruction would switch to a new environment, loading a new
    set of regions into the base limit registers (region zero
    would remain the same as it contained the stack [which grew
    towards higher addresses] and would automatically be extended
    by the OS if the stack reached the current region limit, up to 500KB).

    A program could have up to 10,000 environments.
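    A hedged sketch of the environment switch as described (structures
    invented): a Virtual Enter selects one of the environment descriptors and
    reloads base-limit pairs 1..7 from it, leaving region 0, which holds the
    stack, in place.

    #include <stdint.h>

    struct region { uint32_t base, limit; };

    /* Eight base-limit registers visible to the running program. */
    static struct region active[8];

    /* Each environment records regions 1..7; region 0 (stack/data) is shared. */
    struct environment {
        struct region regions[8];
        uint32_t      entry_point;
    };

    #define MAX_ENVS 10000
    static struct environment envs[MAX_ENVS];

    /* Roughly what a Virtual Enter (far call) would do. */
    static uint32_t virtual_enter(uint32_t env_no)
    {
        const struct environment *e = &envs[env_no];
        for (int r = 1; r < 8; r++)       /* region 0 stays: it holds the stack */
            active[r] = e->regions[r];
        return e->entry_point;            /* continue execution in new region 1 */
    }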

    Three instructions were available to move data between
    the active regions and inactive regions (move string,
    compare string, hash string).

    Addresses were context-relative (unindexed data accesses would use
    region zero and code accesses (e.g. branches) would use region 1)
    or relative to one of the 8 base registers (selected
    by the high-order digit in an index register, of which there
    were three at the base of region zero, and four more stored
    in internal registers in the processor - these "Mobile" index
    registers were added (along with extensions to the address operand
    in the instruction stream) to the architecture 15 years after it was
    originally developed while maintaining compatibility with old binaries).

    Making space in fragmented memory would, in those days, roll the entire
    process out to drum/disk (and later SSD based on RAM chips). When
    the ability to support more than 1 region per program was added,
    the OS was updated to roll out individual regions. The MCP
    would also move regions for inactive processes to make space for
    new allocations. The roll-out/roll-in and move code was originally written
    in assembler and was called 'hiho'. As in, hi-ho, hi-ho it's off
    to work we go.... The assembler labels were the names of the dwarves,
    snow, white, prince, etc.

    Should have cross-posted to folklore :-)

  • From MitchAlsup1@21:1/5 to BGB on Fri Feb 23 19:37:14 2024
    BGB wrote:

    On 2/22/2024 4:41 PM, MitchAlsup1 wrote:
    BGB wrote:


    Where, say, one can interpret the ASID as:
    abcd-efxx-xxxx-xxxx

    And, then compose an index from (Addr(21:14)^{f,e,d,c,b,a,f,e}).

    A good HW hash often takes a field of bits and reverses the order--brain
    dead easy in Verilog::

    hash = addr[63..48] ^ addr[32..47] ...

    Where, in this case, no two consecutive addresses may land in the same
    place in the TLB.

    Indeed.
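    The same reverse-and-XOR idea, sketched in C rather than Verilog (widths
    and field choices are illustrative only):

    #include <stdint.h>

    /* Reverse the 16 bits of x: a software stand-in for wiring the bits
       backwards, which costs nothing in hardware. */
    static uint16_t rev16(uint16_t x)
    {
        uint16_t r = 0;
        for (int i = 0; i < 16; i++)
            r |= (uint16_t)(((x >> i) & 1u) << (15 - i));
        return r;
    }

    /* Illustrative TLB index hash: XOR one address field with the
       bit-reversed neighbouring field, as in the snippet above. */
    static uint16_t tlb_hash(uint64_t addr)
    {
        uint16_t hi = (uint16_t)(addr >> 48);
        uint16_t lo = (uint16_t)(addr >> 32);
        return hi ^ rev16(lo);
    }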

  • From EricP@21:1/5 to BGB on Fri Feb 23 14:44:52 2024
    BGB wrote:

    But, if the ability to dynamically move things seems needed, a segmented-addressing-like approach would make more sense.

    Or, one possibility even could have a sort of "segmented virtual flat addressing" scheme, say:
    (47:36): Index into a program segment table;
    (35:0): Address within segment.

    This is like the PDP-11 or 68000 MMU method: use some upper address
    bits as an index into a hardware table to select a mapping register.
    By setting base relocation register values OS can move whole logical
    segments around in a physical space and apply privilege checking.

    PDP-11 used 3 msb address bits plus the S/U privilege mode as an index
    into a 16-entry HW table to map into an 18-bit physical space.

    68000 had the 68451 which was similar but had 32 segments. https://en.wikipedia.org/wiki/Motorola_68451

    I've never used either of them but imagine most of their complexity
    would be due to the very small number of mapping registers forcing
    software to do things like segment overlays, segment swapping, etc.
    If hardware can handle say up to 4096 mapping registers
    then these complexities would be unlikely to occur.

    I believe both used logical relocation - substitution of high address bits.
    I would use *arithmetic* relocation whereby the selected register adds
    a base physical address to the selected segment offset.
    That allows much easier OS memory management as you don't need to be
    concerned with alignment boundaries, just segment size.
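    A hedged sketch contrasting the two relocation styles (field widths are
    assumptions; the split follows the (47:36)/(35:0) example quoted above):
    logical relocation substitutes the upper address bits, so segments must sit
    on power-of-two boundaries, while arithmetic relocation adds a
    byte-granular base.

    #include <stdint.h>

    #define SEG_BITS   12u                /* upper 12 bits pick a segment */
    #define OFF_BITS   (48u - SEG_BITS)
    #define OFF_MASK   ((1ull << OFF_BITS) - 1)

    static uint64_t seg_frame[1u << SEG_BITS];  /* logical: frame per segment */
    static uint64_t seg_base [1u << SEG_BITS];  /* arithmetic: byte base per segment */

    /* Logical relocation: replace the segment bits with a physical frame
       number. Segments can only start on (1 << OFF_BITS)-byte boundaries.
       Limit and privilege checks are omitted. */
    static uint64_t reloc_logical(uint64_t laddr)
    {
        uint64_t seg = laddr >> OFF_BITS;
        return (seg_frame[seg] << OFF_BITS) | (laddr & OFF_MASK);
    }

    /* Arithmetic relocation: add a byte-granular base, so the OS only has to
       find a hole of the right size, not of the right alignment. */
    static uint64_t reloc_arithmetic(uint64_t laddr)
    {
        uint64_t seg = laddr >> OFF_BITS;
        return seg_base[seg] + (laddr & OFF_MASK);
    }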

    Where, one could potentially have an x86 style segmented addressing
    scheme faking the appearance of a linear address space, as opposed to explicit segment registers.

    Neither of the above were like the x86 segmentation as the their
    segmentation was created and managed completely inside the OS
    (and thus none of the 16-bit x86 segmentation cruft).

  • From Lynn Wheeler@21:1/5 to John Dallman on Fri Feb 23 09:47:29 2024
    jgd@cix.co.uk (John Dallman) writes:
    Virtual memory systems without swapping or demand paging. You are limited
    to physical memory, but the ability of the OS to re-map pages to avoid fragmentation makes life reasonably simple. But only reasonably: you
    really get to exercise your out-of-memory handlers.

    In the 60s, Boeing Huntsville modified OS/360 MVT13 (real memory) to do
    just that. They had gotten a 360/67 multiprocessor for tss/360 with lots
    of 2250 graphics displays ... but tss/360 never came to production
    ... so they configured it as two 360/65s and ran os/360. Because of MVT
    storage management problems (exacerbated by long running 2250 graphics
    cad/cam programs), they modified MVT13 to run virtual memory using 360/67 hardware ... but w/o paging.

    a little more than a decade ago, I was asked to track down the decision to
    add virtual memory to all IBM 370s. Turns out MVT storage management was so
    bad that program execution "regions" had to be specified four times
    larger than used ... a typical one-mbyte 370/165 would only run four
    concurrently executing regions, insufficient to keep the machine busy and
    justify it. Running MVT in a 16mbyte virtual memory would allow
    increasing concurrently executing regions by four times with little or
    no paging (similar to running MVT in CP67 16mbyte virtual machine).

    --
    virtualization experience starting Jan1968, online at home since Mar1970

  • From MitchAlsup1@21:1/5 to BGB on Mon Feb 26 23:20:56 2024
    BGB wrote:

    On 2/26/2024 3:31 AM, Robert Finch wrote:
    On 2024-02-23 1:20 a.m., Robert Finch wrote:
    On 2024-02-22 2:55 p.m., BGB wrote:


    Fixed a couple more bugs in the PTW and now I think it works. So it's
    back to virtual memory support.


    Possible.

    Might make sense to consider a page-walker at some point, but at the
    moment TLB Miss handling is a relatively low percentage of the CPU time.

    Granted, my TLB is, from what I can gather, fairly large...
    Roughly on par with modern ARM64 cores;
    Significantly larger than most of the 90s-era RISCs.

    Where it seemed in the 90s, 32/64/128 entry fully associative designs
    were popular (as opposed to 256x 4-way, so 1024 TLBE's in the TLB).

    The general nomenclature is that a 64KB cache with 4-way set associativity
    has 64KB of total storage. You are using 256× 4-way as having 1024 entries
    of storage.

    And yes, that is a significant amount of PTEs.

    The fully associative ones were in the range of 32-64 entries and generally
    configured to produce a result that could be used in the same cycle.
    Once these got bigger (i.e., slower) it was/is natural to switch from
    CAM to SRAM, losing associativity while gaining capacity.

    As noted elsewhere, seem to have gotten a minor speedup by adding a
    small set-associative cache between the L1 and L2 caches, whose main job
    is mostly to try to absorb conflict misses (pros/cons; breaks a cheap
    but potentially questionable mechanism for inter-processor memory
    coherence).

    Did require a bit of debugging, mostly in the area of what to cache, and extra checks to exclude things from being cached. Say, for example, if
    one line in a way gets flushed, all of the other ways need to be flushed
    as well (along with excluding some areas outside the main RAM area, *, etc).

    *: There were some garbage writes into ROM areas, it seems though that
    ROM must be seen as ROM. With the direct-mapped caches, these garbage
    writes were being discarded, but with a set-associative intermediate
    cache, a garbage write into a ROM area could still see the written value
    on a later access.

    LoL.

    Had experimented with 4-way and 8-way:
    Disabled: L1 Bridge module disappears entirely;
    4-way: Exists, roughly 600 LUTs;
    8-way: Exists, roughly 2400 LUTs

    Why is this 2× more than simply 2× larger ?

    (also "timing slack" gets visibly worse).

    The 4-way vs 8-way difference makes a relatively smaller impact though
    on hit-rate, so 8-way is likely not worth it here. Both cases seem to
    give benefit more by causing the L2 cache to miss less often, rather
    than in getting data back to the L1 faster (it seems like many of the conflict-misses coming out of the direct-mapped L1 cache were also
    leading to conflict misses in the direct-mapped L2 cache).

    ....
