Wondering tonight how to get by without virtual memory. The page table
walker has some sort of bug in it and I do not feel like debugging it :)
Virtual memory support also uses up at least 6k LUTs. It is the virtual
addressing part that may be made a config option. Memory would still be
protected on a page basis with a set of keys, so it would not be
possible to access a memory page without the correct key. IMO there is
not that much hardware or instruction set support required for a system
without VM; it is mainly software approaches. I added absolute-address
subroutine calls with a large address field to Q+, because code may be
calling other code anywhere in memory if it is not mapped. It is a bit
easier to relocate absolute addresses.
I wonder what would be required to dynamically relocate programs at run
time, and to have programs such that the pages making up the program may
be relocated.
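The key-per-page protection described above might be modelled roughly as
follows (a minimal sketch; the page size, key width, `page_key` array, and
a key-0 "unprotected" wildcard are all assumptions, not Q+'s actual
scheme):

```c
#include <stdint.h>
#include <stdbool.h>

#define PAGE_SHIFT 14           /* assumed 16 KiB pages */
#define NUM_PAGES  1024         /* assumed physical page count */

/* One protection key per physical page (hypothetical layout). */
static uint16_t page_key[NUM_PAGES];

/* Access is allowed only when the task's key matches the page's key;
   key 0 is treated here as an assumed "no protection" wildcard. */
static bool key_allows(uint64_t addr, uint16_t task_key)
{
    uint64_t page = (addr >> PAGE_SHIFT) % NUM_PAGES;
    return page_key[page] == 0 || page_key[page] == task_key;
}
```

The point is that the check needs only an indexed key fetch and a
compare, with no address translation involved.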
Robert Finch <robfi680@gmail.com> wrote:
So, for a large number of pages, how do you access the keys? What else
is there, apart from page walking?
> I wonder what would be required to dynamically re-locate programs at run
> time. And have programs such that the pages making up the program may be
> relocated.
Define one or two base registers, and make all addresses relative
to that base register. If Wikipedia is not mistaken, the Univac
1108 had this feature (and the S/360 didn't).
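The base-register scheme can be sketched as follows (hypothetical names;
the point is that the program only ever uses offsets, so relocation is a
copy plus one register update):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* All program addresses are offsets from a base register; moving the
   program is then just a copy plus a new base value. */
typedef struct {
    uint8_t *base;              /* the "base register" */
} context_t;

static uint8_t load_byte(const context_t *ctx, uint32_t offset)
{
    return ctx->base[offset];
}

static void relocate(context_t *ctx, uint8_t *new_home, size_t len)
{
    memcpy(new_home, ctx->base, len);
    ctx->base = new_home;       /* all offsets remain valid unchanged */
}
```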
>> I wonder what would be required to dynamically re-locate programs at run
>> time. And have programs such that the pages making up the program may be
>> relocated.
> Define one or two base registers, and make all addresses relative
> to that base register. If Wikipedia is not mistaken, the Univac
> 1108 had this feature (and the S/360 didn't).
The Burroughs B3500 addressing was relative to a base register.
VM gives the ability to relocate bad memory pages and make the memory
look contiguous. What that gives is linear execution of instructions
across page boundaries.
To emulate the capability to relocate pages, a jump instruction to the
next page of memory could be placed at the end of each memory page.
Having a jump at the end of a page is not going to impact performance
significantly. A table of addresses that need to be relocated for a
given page could also be placed at the end of the page of memory.
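A relocation table at the end of a page, as described, might be applied
like this (a sketch; the table layout, 64-bit absolute addresses, and
the assumption that the caller has already copied the page contents are
all made up for illustration):

```c
#include <stdint.h>
#include <stddef.h>

#define PAGE_SIZE 4096          /* assumed page size */

/* Hypothetical trailing table: offsets within the page at which 64-bit
   absolute addresses live, each of which must be patched on a move. */
typedef struct {
    uint16_t count;             /* number of fixup offsets */
    uint16_t offset[8];         /* where the absolute addresses live */
} fixup_table_t;

/* Patch every absolute address in the page when the code it points at
   moves from old_base to new_base. */
static void relocate_page(uint8_t *page, const fixup_table_t *t,
                          uint64_t old_base, uint64_t new_base)
{
    for (size_t i = 0; i < t->count; i++) {
        uint64_t *slot = (uint64_t *)(page + t->offset[i]);
        *slot = *slot - old_base + new_base;
    }
}
```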
On 2/22/2024 1:39 PM, MitchAlsup1 wrote:
> Robert Finch wrote:
>> Wondering tonight how to get by without virtual memory. The page table
>> walker has some sort of bug in it and I do not feel like debugging it :)
> MIPS used a SW TLB reloader from a hashed <and big> table. You could make
> access to the hash table HW and if the access works--great, if not trap
> to SW. This is the simplest HW TLB reloader. SW determines what is in
> the table, HW determines if what is in the table gets in the TLB. So,
> this is not a table walker, but a single hash table probe.
Yes, I had considered this as a possible ISA feature I had called "VIPT".
Though, thus far, VIPT has not been implemented.
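The single hash-table probe described above might be modelled as follows
(a hypothetical sketch, not MIPS's actual scheme: the table size, entry
layout, and hash function are invented, and the trap-to-SW refill is
reduced to a boolean miss result):

```c
#include <stdint.h>
#include <stdbool.h>

#define TABLE_SLOTS 65536       /* ~1 MB table at 16 bytes per entry */

/* Hypothetical hashed table of TLB entries, maintained by SW. */
typedef struct {
    uint64_t vpn_tag;           /* virtual page number stored in slot */
    uint64_t ppn;               /* physical page number */
    bool     valid;
} tlbe_t;

static tlbe_t table[TABLE_SLOTS];

static uint32_t hash_vpn(uint64_t vpn)
{
    return (uint32_t)((vpn ^ (vpn >> 16)) % TABLE_SLOTS);
}

/* HW does exactly one probe: a hit fills *ppn and loads the TLB; a
   miss (false) would trap to the SW reloader. */
static bool probe(uint64_t vpn, uint64_t *ppn)
{
    tlbe_t *e = &table[hash_vpn(vpn)];
    if (e->valid && e->vpn_tag == vpn) { *ppn = e->ppn; return true; }
    return false;
}
```

The design point is that the hardware never walks anything: it computes
one index, compares one tag, and otherwise hands the miss to software.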
It is either this or a HW page walker; a case could be made for the
tradeoffs either way.
As can be noted, my existing 256x4-way TLB has a reasonably good average
case hit-rate.
Something like VIPT would mainly have merit if it could have higher
associativity, but associativity is more expensive in this case than
capacity (I had mostly assumed 8-way, if VIPT were implemented, likely
used in combination with the 4-way in the main TLB).
Higher associativity in VIPT could be possible if one assumes a
state-machine and linear probing (makes sense, as the bus supports
128-bit fetches, which are 1 TLBE in the 48-bit addressing mode; an
8-way VIPT requiring 8 bus fetches / probes).
Had considered hashing based on ASID, but this interferes with the
ability to have global pages. Splitting up global pages into groups,
with only ASIDs within a given group able to see each others' pages,
does help here (then the group can be used to hash the entry into the TLB).
One other considered idea was to split the TLB into separate ways for
Global and Non-Global pages, but this has a similar cost issue to
increasing associativity (because it is), say:
4-way, non-global pages (index is hashed);
2-way, global pages (index is strictly modulo).
Though, another option being to simply not actually have global pages.
Or, also possible (if VIPT were used):
TLB remains 4-way, but purely ASID-local internally;
6-way associativity for local pages in VIPT (hashed based on ASID);
2-way associativity for global pages in VIPT (strict modulo indexing).
Where, I had noted before that 4-way is seemingly near the minimum
needed to get good results from the TLB (or entirely avoid deadlock
scenarios in the case of software-managed TLB).
In general it works well when the size of the table is 1MB. The hash
function and data layout are up to you.
Probably depends on entry size and associativity.
On 2024-02-22 2:55 p.m., BGB wrote:
There may be a need to relocate programs due to the OS defragmenting
memory. The OS might be able to create a larger free contiguous space
if it can shift a program to one side or the other of the space it is
in the middle of. Even better if it is allowed to rearrange code into
discontiguous blocks. Getting a program relocated is probably easier
than ensuring that there is a large enough data space. Most programs are
small compared to the size of memory, so maybe relocating in
discontiguous blocks is not necessary. There are hundreds or thousands
of pieces of code running in memory. I can see the code moving around
periodically to open up free space for data. A background defrag task
could be running, which would burn up CPU cycles, but it is cycles that
would otherwise be burned up by VM. A trick would be to keep the program
code runnable while it is being relocated. Lots of trampolines or
memory-indirect calls.
Having spent the large part of a decade working on an operating system
for a multiple-processor system with no paging:
DON'T GO THERE!
In article <FP2CN.411932$PuZ9.183438@fx11.iad>, scott@slp53.sl.home
(Scott Lurndal) wrote:
> Having spent the large part of a decade working on operating system
> on a multiple processor system with no paging:
> DON'T GO THERE!
I have not done that, but I have done programming on two related classes
of operating systems:
Virtual memory systems without swapping or demand paging. You are limited
to physical memory, but the ability of the OS to re-map pages to avoid
fragmentation makes life reasonably simple. But only reasonably: you
really get to exercise your out-of-memory handlers.
Non-virtual-memory systems without swapping or demand paging. For small
systems with only one application program running, this isn't too bad.
But as soon as the OS needs to allocate memory in response to calls to it
(say, if there's a GUI that needs bitblt buffers) it becomes a nightmare.
Not because of code being shuffled, that can be done transparently, but
because of *data* having to be moved around.
On 2/22/2024 4:41 PM, MitchAlsup1 wrote:
BGB wrote:
Where, say, one can interpret the ASID as:
abcd-efxx-xxxx-xxxx
And, then compose an index from (Addr(21:14)^{f,e,d,c,b,a,f,e}).
Where, in this case, no two consecutive addresses may land in the same
place in the TLB.
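The index composition above might look like this in C (a sketch; which
ASID bit is taken as "a" versus "f" is an assumed convention, as is the
16-bit ASID width):

```c
#include <stdint.h>

/* ASID interpreted as abcd-efxx-xxxx-xxxx; index is Addr(21:14) XORed
   with the bit pattern {f,e,d,c,b,a,f,e}. */
static uint8_t tlb_index(uint64_t addr, uint16_t asid)
{
    unsigned a = (asid >> 15) & 1, b = (asid >> 14) & 1;
    unsigned c = (asid >> 13) & 1, d = (asid >> 12) & 1;
    unsigned e = (asid >> 11) & 1, f = (asid >> 10) & 1;
    uint8_t mix = (uint8_t)((f << 7) | (e << 6) | (d << 5) | (c << 4) |
                            (b << 3) | (a << 2) | (f << 1) | e);
    return (uint8_t)(((addr >> 14) & 0xFF) ^ mix);
}
```

With an all-zero ASID the index degenerates to plain modulo indexing on
Addr(21:14); nonzero ASIDs scatter different address spaces to different
TLB sets.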
But, if the ability to dynamically move things seems needed, a
segmented-addressing-like approach would make more sense.
Or, one possibility even could be a sort of "segmented virtual flat
addressing" scheme, say:
(47:36): Index into a program segment table;
(35: 0): Address within segment.
Where, one could potentially have an x86-style segmented addressing
scheme faking the appearance of a linear address space, as opposed to
explicit segment registers.
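Using the 47:36 / 35:0 split given above, the translation step might be
sketched like this (hypothetical names; the descriptor layout and the
all-ones fault return are assumptions for illustration):

```c
#include <stdint.h>

/* "Segmented virtual flat addressing": bits 47:36 select a segment
   descriptor, bits 35:0 are the offset within the segment. */
typedef struct {
    uint64_t base;              /* physical base of the segment */
    uint64_t limit;             /* segment size in bytes */
} segdesc_t;

static segdesc_t segtab[1 << 12];   /* 4096 entries for bits 47:36 */

/* Returns the translated address, or all-ones on a limit violation. */
static uint64_t translate(uint64_t vaddr)
{
    uint64_t seg = (vaddr >> 36) & 0xFFF;
    uint64_t off = vaddr & ((1ULL << 36) - 1);
    if (off >= segtab[seg].limit)
        return ~(uint64_t)0;
    return segtab[seg].base + off;
}
```

Relocating a program then means updating one `base` field rather than
re-mapping every page.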
On 2/26/2024 3:31 AM, Robert Finch wrote:
On 2024-02-23 1:20 a.m., Robert Finch wrote:
On 2024-02-22 2:55 p.m., BGB wrote:
Fixed a couple more bugs in the PTW and now I think it works. So it's
back to virtual memory support.
Possible.
Might make sense to consider a page-walker at some point, but at the
moment TLB Miss handling is a relatively low percentage of the CPU time.
Granted, my TLB is, from what I can gather, fairly large:
Roughly on par with modern ARM64 cores;
Significantly larger than most of the 90s-era RISCs.
Where, it seemed, in the 90s, 32/64/128-entry fully associative designs
were popular (as opposed to 256x 4-way, so 1024 TLBEs in the TLB).
As noted elsewhere, seem to have gotten a minor speedup by adding a
small set-associative cache between the L1 and L2 caches, whose main job
is mostly to try to absorb conflict misses (pros/cons; breaks a cheap
but potentially questionable mechanism for inter-processor memory
coherence).
Did require a bit of debugging, mostly in the area of what to cache, and extra checks to exclude things from being cached. Say, for example, if
one line in a way gets flushed, all of the other ways need to be flushed
as well (along with excluding some areas outside the main RAM area, *, etc).
*: There were some garbage writes into ROM areas; it seems, though, that
ROM must be treated as ROM. With the direct-mapped caches, these garbage
writes were being discarded, but with a set-associative intermediate
cache, a garbage write into a ROM area could still see the written value
on a later access.
Had experimented with 4-way and 8-way:
Disabled: L1 Bridge module disappears entirely;
4-way: Exists, roughly 600 LUTs;
8-way: Exists, roughly 2400 LUTs
(also "timing slack" gets visibly worse).
The 4-way vs 8-way difference makes a relatively smaller impact though
on hit-rate, so 8-way is likely not worth it here. Both cases seem to
give benefit more by causing the L2 cache to miss less often, rather
than in getting data back to the L1 faster (it seems like many of the conflict-misses coming out of the direct-mapped L1 cache were also
leading to conflict misses in the direct-mapped L2 cache).
....