• Re: Bypassing VM

    From Thomas Koenig@21:1/5 to Robert Finch on Thu Feb 22 17:14:01 2024
    Robert Finch <robfi680@gmail.com> schrieb:
    Wondering tonight how to get by without virtual memory. The page table
    walker has some sort of bug in it and I do not feel like debugging it :) Virtual memory support also uses up at least 6k LUTs. It is the virtual addressing part that may be made a config option. Memory would still be protected on a page basis with a set of keys.

    So, for a large number of pages, how do you access the keys? What else
    is there, apart from page walking?

    So, it is not possible to
    access the memory page without the correct key. IMO there is not
    that much hardware or instruction set required to support a system
    without VM. It is mainly software approaches. I added absolute address subroutine calls with a large address field to Q+, because code may be calling other code anywhere in memory if it is not mapped. It is a bit
    easier to relocate absolute addresses.
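    As a hedged illustration of the page-key idea quoted above (names and widths
    are my own, loosely modeled on S/360-style storage keys rather than on Q+
    itself):

    #include <stdint.h>
    #include <stdbool.h>

    #define PAGE_SHIFT 14                  /* hypothetical 16KB pages */
    #define NUM_FRAMES (1u << 18)          /* hypothetical physical frame count */

    static uint8_t frame_key[NUM_FRAMES];  /* one small key per physical page */

    /* A running task carries a key; key 0 is assumed to be a supervisor
       "master" key that matches everything. */
    static bool access_allowed(uint64_t paddr, uint8_t task_key)
    {
        uint8_t page_key = frame_key[(paddr >> PAGE_SHIFT) % NUM_FRAMES];
        return task_key == 0 || task_key == page_key;
    }

    The open question asked above is where such keys live so that a large number
    of pages can be checked without a table walk; a flat array indexed by
    physical frame number, as sketched, is one answer, at the price of a
    dedicated key memory or cache.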

    I wonder what would be required to dynamically re-locate programs at run time. And have programs such that the pages making up the program may be relocated.

    Define one or two base registers, and make all addresses relative
    to that base register. If Wikipedia is not mistaken, the Univac
    1108 had this feature (and the S/360 didn't).

    One problem is large BSS sections, which can be mapped to a single
    zero page with copy-on-write semantics on VM systems.
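    For concreteness, a hedged C sketch of the base-register scheme being
    suggested (all names hypothetical): every address the program generates is
    an offset that hardware adds to a per-process base, so the OS can move the
    whole image and update only the base and limit.

    #include <stdint.h>

    /* Per-process relocation state (hypothetical). */
    struct reloc {
        uint64_t base;    /* physical address the image currently sits at */
        uint64_t limit;   /* size of the image in bytes */
    };

    /* What the hardware would do on every access: add the base, check the
       limit. Returns 0 on a limit violation, where real hardware would trap. */
    static uint64_t translate(const struct reloc *r, uint64_t offset)
    {
        if (offset >= r->limit)
            return 0;
        return r->base + offset;
    }

    /* Moving the program is then a physical copy plus one register update; no
       pointer inside the program has to change, because the program only ever
       sees offsets. */
    static void move_image(struct reloc *r, uint64_t new_base)
    {
        /* copy r->limit bytes from r->base to new_base in physical memory */
        r->base = new_base;
    }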

  • From Scott Lurndal@21:1/5 to Thomas Koenig on Thu Feb 22 17:31:17 2024
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Robert Finch <robfi680@gmail.com> schrieb:
    Wondering tonight how to get by without virtual memory. The page table
    walker has some sort of bug in it and I do not feel like debugging it :)
    Virtual memory support also uses up at least 6k LUTs. It is the virtual
    addressing part that may be made a config option. Memory would still be
    protected on a page basis with a set of keys.

    So, for a large number of pages, how do you access the keys? What else
    is there, apart from page walking?

    So, it is not possible to
    access the memory page without the correct key. IMO there is not
    that much hardware or instruction set required to support a system
    without VM. It is mainly software approaches. I added absolute address
    subroutine calls with a large address field to Q+, because code may be
    calling other code anywhere in memory if it is not mapped. It is a bit
    easier to relocate absolute addresses.

    I wonder what would be required to dynamically re-locate programs at run
    time. And have programs such that the pages making up the program may be
    relocated.

    Define one or two base registers, and make all addresses relative
    to that base register. If Wikipedia is not mistaken, the Univac
    1108 had this feature (and the S/360 didn't).

    The Burroughs B3500 addressing was relative to a base register.

    When the base register was zero, that was considered "control"
    state; when the base register was non-zero, that was "normal"
    state. Applications ran in normal state, and in normal state
    privileged instructions would trap to the MCP (which operated
    in control state).

    Later generations added 7 more base-limit register pairs to support
    access to eight active segments at any one time by an application.
    A non-local procedure call instruction could load a different set
    of base-limit registers as part of the call (other than base 0,
    which was common to the application and contained the stack). The
    high-order digit of the address selected which base-limit pair
    to use.

    The problem with all segmentation schemes is memory fragmentation
    and the corresponding overhead to move stuff around during
    allocation.
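    For concreteness, a hedged sketch of the digit-selected base-limit
    addressing described above (names invented; the real machine worked in
    decimal digits, which I only approximate here):

    #include <stdint.h>

    struct base_limit {
        uint32_t base;    /* start of the segment, in digits */
        uint32_t limit;   /* segment length, in digits (up to 1,000,000) */
    };

    /* Eight active base-limit pairs; pair 0 holds the stack/data region. */
    static struct base_limit pairs[8];

    /* Translate a logical address: the high-order digit picks the pair,
       the remaining digits are an offset checked against the limit. */
    static int translate(uint32_t laddr, uint32_t *paddr)
    {
        uint32_t pair   = (laddr / 1000000u) % 10u;
        uint32_t offset =  laddr % 1000000u;

        if (pair >= 8 || offset >= pairs[pair].limit)
            return -1;                        /* would trap to the MCP */
        *paddr = pairs[pair].base + offset;
        return 0;
    }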

  • From John Levine@21:1/5 to All on Thu Feb 22 18:26:37 2024
    According to Scott Lurndal <slp53@pacbell.net>:
    I wonder what would be required to dynamically re-locate programs at run
    time. And have programs such that the pages making up the program may be
    relocated.

    Define one or two base registers, and make all addresses relative
    to that base register. If Wikipedia is not mistaken, the Univac
    1108 had this feature (and the S/360 didn't).

    The Burroughs B3500 addressing was relative to a base register.

    The PDP-6 had base and limit registers that relocated all addresses
    when a program was running in user mode. The KA-10 added a second pair
    so the high and low halves of the address space could be relocated
    separately and the high half read-only, allowing multiple copies of a
    running program to share the same code segment. The KI-10 and later
    models had paging.

    The base/limit stuff worked, but the operating system wasted a fair
    amount of time shuffling memory to make free space contiguous, and a
    segment could only be entirely swapped in or out, not partially
    resident as with paging.

    S/360 had no relocation at all other than a hack called prefixing
    which relocated the lowest addresses in multiprocessors so they could
    each handle their own interrupts. Reading between the lines in the
    S/360 architecture paper, it appears they thought that with lots of
    registers and a base register in every address, they could do it all
    by adjusting the registers. That was of course wrong since programs
    store addresses in memory and few programs were written with enough
    discipline for that to work. (The only one I know was APL\360.) Almost immediately the 360/67 added its own paging used in TSS and CP/67, and
    five years later S/370 added paging as a standard feature.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From MitchAlsup1@21:1/5 to Robert Finch on Thu Feb 22 19:39:50 2024
    Robert Finch wrote:

    Wondering tonight how to get by without virtual memory. The page table
    walker has some sort of bug in it and I do not feel like debugging it :)

    MIPS used a SW TLB reloader from a hashed <and big> table. You could make
    access to the hash table HW, and if the access works--great; if not, trap to
    SW. This is the simplest HW TLB reloader. SW determines what is in the table;
    HW determines whether what is in the table gets into the TLB. So, this is not
    a table walker, but a single hash-table probe.

    In general it works well when the size of the table is 1MB. The hash function
    and data layout are up to you.
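    A hedged sketch of that reload path (structure names and the hash are
    illustrative only): on a TLB miss the hardware makes a single probe of a big
    hashed table; a tag match refills the TLB, a mismatch traps to software,
    which maintains the table.

    #include <stdint.h>
    #include <stdbool.h>

    #define HPT_ENTRIES (1u << 16)   /* ~1MB of table at 16 bytes/entry */

    struct hpt_entry {
        uint64_t vpn_tag;    /* virtual page number this slot currently maps */
        uint64_t pte;        /* physical frame number + permissions */
    };

    static struct hpt_entry hpt[HPT_ENTRIES];

    static inline uint32_t hash_vpn(uint64_t vpn)
    {
        /* illustrative hash: fold and mix the VPN */
        vpn ^= vpn >> 17;
        return (uint32_t)((vpn * 0x9E3779B97F4A7C15ull) >> 48) % HPT_ENTRIES;
    }

    /* What the hardware would do on a TLB miss: one probe, no walking. */
    static bool hw_tlb_refill(uint64_t vpn, uint64_t *pte_out)
    {
        struct hpt_entry *e = &hpt[hash_vpn(vpn)];
        if (e->vpn_tag == vpn) {
            *pte_out = e->pte;   /* hit: load straight into the TLB */
            return true;
        }
        return false;            /* miss: trap to SW, which refills hpt[] */
    }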

    Virtual memory support also uses up at least 6k LUTs. It is the virtual addressing part that may be made a config option. Memory would still be protected on a page basis with a set of keys. So, it is not possible to access the memory page without the correct key. IMO there is not
    that much hardware or instruction set required to support a system
    without VM. It is mainly software approaches. I added absolute address subroutine calls with a large address field to Q+, because code may be calling other code anywhere in memory if it is not mapped. It is a bit
    easier to relocate absolute addresses.

    I wonder what would be required to dynamically re-locate programs at run time. And have programs such that the pages making up the program may be relocated.

    VM gives the ability to relocate bad memory pages, and make the memory
    look contiguous. What that gives is linear execution of instructions
    across page boundaries.
    To get some sort of emulation of the capability to relocate pages, a
    jump instruction to the next page of memory could be placed at the end
    of a memory page. Having a jump at the end of a page is not going to
    impact performance significantly. A table of addresses that need to be relocated for a given page could be placed at the end of the page of memory.
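    A hedged sketch of what that per-page arrangement might look like (the
    layout is my own invention, not Q+'s): each code page ends with a jump slot
    aimed at the next page plus a small table of offsets that hold absolute
    addresses, so a loader can fix them up when the page moves.

    #include <stdint.h>
    #include <stddef.h>

    #define PAGE_SIZE 8192u   /* hypothetical */

    /* Trailer stored in the last bytes of each code page (layout invented). */
    struct page_trailer {
        uint32_t next_page;     /* where the trailing jump currently points */
        uint16_t n_fixups;      /* number of absolute addresses in this page */
        uint16_t fixup_off[62]; /* byte offsets of those addresses */
    };

    /* Relocate one page: adjust every recorded absolute address by the move
       distance, then re-aim the trailing jump at wherever the following page
       now lives. */
    static void relocate_page(uint8_t *page, int64_t delta, uint32_t new_next)
    {
        struct page_trailer *t =
            (struct page_trailer *)(page + PAGE_SIZE - sizeof *t);

        for (uint16_t i = 0; i < t->n_fixups; i++) {
            uint32_t *slot = (uint32_t *)(page + t->fixup_off[i]);
            *slot = (uint32_t)(*slot + delta);   /* patch the absolute address */
        }
        t->next_page = new_next;                 /* the end-of-page jump target */
    }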

  • From Scott Lurndal@21:1/5 to BGB on Thu Feb 22 22:11:50 2024
    BGB <cr88192@gmail.com> writes:
    On 2/22/2024 1:39 PM, MitchAlsup1 wrote:
    Robert Finch wrote:

    Wondering tonight how to get by without virtual memory. The page table
    walker has some sort of bug in it and I do not feel like debugging it :)

    MIPS used a SW TLB reloader from a hashed <and big> table. You could make
    access to the hash table HW and if the access works--great, if not trap
    to SW. This is the simplest HW TLB reloader. SW determines what is in
    the table
    HW determines if what is in the table gets in the TLB. So, this is not a
    table walker, but a single hash table probe.


    Yes, I had considered this as a possible ISA feature I had called "VIPT".

    VIPT usually refers to cache organization (virtual index, physical tag).

  • From MitchAlsup1@21:1/5 to BGB on Thu Feb 22 22:41:31 2024
    BGB wrote:

    On 2/22/2024 1:39 PM, MitchAlsup1 wrote:
    Robert Finch wrote:

    Wondering tonight how to get by without virtual memory. The page table
    walker has some sort of bug in it and I do not feel like debugging it :)

    MIPS used a SW TLB reloader from a hashed <and big> table. You could make
    access to the hash table HW and if the access works--great, if not trap
    to SW. This is the simplest HW TLB reloader. SW determines what is in
    the table
    HW determines if what is in the table gets in the TLB. So, this is not a
    table walker, but a single hash table probe.


    Yes, I had considered this as a possible ISA feature I had called "VIPT".

    Though, thus far, VIPT has not been implemented.
    It is either this or a HW page walker; a case could be made for the tradeoffs either way.


    As can be noted, my existing 256x4-way TLB has a reasonably good average
    case hit-rate.

    A properly good hashing function is as good as sets of associativity
    most of the time and often better on average.

    Something like VIPT would mainly have merit if it could have higher associativity, but associativity is more expensive in this case than
    capacity (I had mostly assumed 8-way, if VIPT were implemented, likely
    used in combination with the 4-way in the main TLB).

    Higher associativity in VIPT could be possible if one assumes a
    state-machine and linear probing (makes sense, as the bus supports
    128-bit fetches, which are 1 TLBE in the 48-bit addressing mode; an
    8-way VIPT requiring 8 bus fetches / probes).

    Once the HW is making more than 1 access, you might as well walk the
    page access structure.

    Had considered hashing based on ASID, but this interferes with the
    ability to have global pages. Splitting up global pages into groups,
    with only ASIDs within a given group able to see each others' pages,
    does help here (then the group can be used to hash the entry into the TLB).

    ASID is only part of a good hashing function.

    One other considered idea was to split the TLB into separate ways for
    Global and Non-Global pages, but this has a similar cost issue to
    increasing associativity (because it is), say:
    4-way, non-global pages (index is hashed);
    2-way, global pages (index is strictly modulo).

    Though, another option being to simply not actually have global pages.


    Or, also possible (if VIPT were used):
    TLB remains 4-way, but purely ASID-local internally;
    6-way associativity for local pages in VIPT (hashed based on ASID);
    2-way associativity for global pages in VIPT (strict modulo indexing).


    Where, I had noted before that 4-way is seemingly near the minimum
    needed to get good results from the TLB (or entirely avoid deadlock
    scenarios in the case of software-managed TLB).
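    For concreteness, a hedged sketch of a 256-set, 4-way TLB lookup of the
    general shape discussed here (field widths, hash, and policies are
    assumptions, not BGB's actual design):

    #include <stdint.h>
    #include <stdbool.h>

    #define TLB_SETS 256
    #define TLB_WAYS 4

    struct tlbe {
        uint64_t vpn;      /* virtual page number */
        uint16_t asid;     /* address-space ID this entry belongs to */
        uint64_t pte;      /* translation + permissions */
        bool     valid;
        bool     global;   /* ignore the ASID match when set */
    };

    static struct tlbe tlb[TLB_SETS][TLB_WAYS];

    /* The index hash folds address bits only; keeping the ASID out of the
       index is what lets a global entry be found from any address space
       (the trade-off discussed above). */
    static unsigned tlb_index(uint64_t vpn)
    {
        return (unsigned)((vpn ^ (vpn >> 8) ^ (vpn >> 16)) & (TLB_SETS - 1));
    }

    static bool tlb_lookup(uint64_t vpn, uint16_t asid, uint64_t *pte_out)
    {
        struct tlbe *set = tlb[tlb_index(vpn)];
        for (int w = 0; w < TLB_WAYS; w++) {
            if (set[w].valid && set[w].vpn == vpn &&
                (set[w].global || set[w].asid == asid)) {
                *pte_out = set[w].pte;
                return true;
            }
        }
        return false;   /* miss: page walk or SW-refill trap, per the design */
    }

    Keeping the ASID out of the index sidesteps the global-page conflict noted
    above, at the cost of different address spaces mapping the same addresses
    into the same sets.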


    In general it works well when the size of the table is 1MB. The hash function
    and data layout are up to you.


    Probably depends on entry size and associativity.


    Virtual memory support also uses up at least 6k LUTs. It is the
    virtual addressing part that may be made a config option. Memory would
    still be protected on a page basis with a set of keys. So, it is not
    possible to access the memory page without the correct key. IMO
    there is not that much hardware or instruction set required to support
    a system without VM. It is mainly software approaches. I added
    absolute address subroutine calls with a large address field to Q+,
    because code may be calling other code anywhere in memory if it is not
    mapped. It is a bit easier to relocate absolute addresses.

    I wonder what would be required to dynamically re-locate programs at
    run time. And have programs such that the pages making up the program
    may be relocated.

    VM gives the ability to relocate bad memory pages, and make the memory
    look contiguous. What that gives is linear execution of instructions
    across page boundaries.
    To get some sort of emulation of the capability to relocate pages, a
    jump instruction to the next page of memory could be placed at the end
    of a memory page. Having a jump at the end of a page is not going to
    impact performance significantly. A table of addresses that need to be
    relocated for a given page could be placed at the end of the page of
    memory.

  • From Scott Lurndal@21:1/5 to Robert Finch on Fri Feb 23 15:27:33 2024
    Robert Finch <robfi680@gmail.com> writes:
    On 2024-02-22 2:55 p.m., BGB wrote:

    There may be a need to relocate programs due to the OS defragmenting
    memory. The OS might be able to create a larger free contiguous space
    if it can shift a program to one side or the other of the space that it is
    in the middle of. Even better if it is allowed to rearrange code into
    discontiguous blocks. Getting a program relocated is probably easier
    than ensuring that there is a large enough data space. Most programs are
    small compared to the size of memory, so maybe relocating in
    discontiguous blocks is not necessary. There are hundreds or thousands
    of pieces of code running in memory. I can see the code moving around
    periodically to open up free space for data. A background defrag task
    could be running, which would burn up CPU cycles, but it is cycles that
    would otherwise be burned up by VM. A trick would be to keep the program
    code runnable while it is being relocated. Lots of trampolines or
    memory-indirect calls.

    Having spent the large part of a decade working on an operating system
    on a multiple processor system with no paging:
    DON'T GO THERE!

  • From John Dallman@21:1/5 to Scott Lurndal on Fri Feb 23 18:35:00 2024
    In article <FP2CN.411932$PuZ9.183438@fx11.iad>, scott@slp53.sl.home
    (Scott Lurndal) wrote:

    Having spent the large part of a decade working on an operating system
    on a multiple processor system with no paging:
    DON'T GO THERE!

    I have not done that, but I have done programming on two related classes
    of operating systems:

    Virtual memory systems without swapping or demand paging. You are limited
    to physical memory, but the ability of the OS to re-map pages to avoid fragmentation makes life reasonably simple. But only reasonably: you
    really get to exercise your out-of-memory handlers.

    Non-virtual memory systems without swapping or demand paging. For small
    systems with only one application program running, this isn't too bad.
    But as soon as the OS needs to allocate memory in response to calls to it
    (say, if there's a GUI that needs bitblt buffers) it becomes a nightmare.
    Not because of code being shuffled, that can be done transparently, but
    because of *data* having to be moved around.

    That immediately means you can't store pointers to data in memory, so
    linked lists become very complicated and even slower. The way memory
    worked on the OSes where I did this (Classic MacOS and 16-bit Windows)
    involved a two-level allocation scheme: you allocate memory for data, and
    get back a "handle". If you want to access that memory, you have to "pin"
    the handle, which makes the memory block unmovable until you un-pin it.
    Stacks have to be pinned at all times, of course, and saving pointers to functions becomes ... complicated.
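    A hedged sketch of the handle scheme being described (API names invented,
    not the real Mac or Win16 calls): a handle points at a master record the
    allocator may rewrite when it compacts memory, and pinning temporarily
    forbids that.

    #include <stdlib.h>
    #include <string.h>

    /* A handle is a pointer to a master record owned by the allocator; the
       allocator may move the block and rewrite mem, unless it is pinned. */
    typedef struct block {
        void   *mem;
        size_t  size;
        int     pin_count;
    } *Handle;

    static Handle new_handle(size_t size)
    {
        Handle h = malloc(sizeof *h);
        if (!h)
            return NULL;
        h->mem = malloc(size);
        h->size = size;
        h->pin_count = 0;
        return h;
    }

    /* Pin before dereferencing; the pointer is only valid until unpin. */
    static void *pin(Handle h)   { h->pin_count++; return h->mem; }
    static void  unpin(Handle h) { h->pin_count--; }

    /* What the compactor does to an unpinned block: copy it elsewhere and
       rewrite the master pointer (returning the old space to the free pool
       is omitted here). */
    static int try_move(Handle h, void *dest)
    {
        if (h->pin_count > 0)
            return 0;
        memcpy(dest, h->mem, h->size);
        h->mem = dest;
        return 1;
    }

    Classic MacOS spelled pin/unpin as HLock/HUnlock; the hazard is exactly the
    one described above: any raw pointer obtained from pin() goes stale once
    the block is unpinned and the compactor runs.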

    It's normally impractical to port code written for conventional virtual
    memory operating systems to handle-based systems. It's pretty hard to
    move code in the opposite direction, too. Nobody who has experienced both
    forms wants anything to do with handle-based systems.

    John

  • From Scott Lurndal@21:1/5 to John Dallman on Fri Feb 23 18:54:05 2024
    jgd@cix.co.uk (John Dallman) writes:
    In article <FP2CN.411932$PuZ9.183438@fx11.iad>, scott@slp53.sl.home
    (Scott Lurndal) wrote:

    Having spent the large part of a decade working on an operating system
    on a multiple processor system with no paging:
    DON'T GO THERE!

    I have not done that, but I have done programming on two related classes
    of operating systems:

    Virtual memory systems without swapping or demand paging. You are limited
    to physical memory, but the ability of the OS to re-map pages to avoid
    fragmentation makes life reasonably simple. But only reasonably: you
    really get to exercise your out-of-memory handlers.

    Non-virtual memory systems without swapping or demand paging. For small
    systems with only one application program running, this isn't too bad.
    But as soon as the OS needs to allocate memory in response to calls to it
    (say, if there's a GUI that needs bitblt buffers) it becomes a nightmare.
    Not because of code being shuffled, that can be done transparently, but
    because of *data* having to be moved around.

    The mainframe system mentioned above had base-limit registers
    (with eight active at any one time) describing the application memory
    regions. Region zero was the primary data region (containing
    processor data (index registers, stack pointer, etc) in the first
    50 bytes) and user data (plus the stack) in the remaining space
    in the segment (max seg size 500KB, i.e. 1,000,000 4-bit digits).
    Region one contained code.
    Regions 2 through 7 were optional and contained application
    data (up to 500KB each).

    A collection of eight regions was called an Environment, of
    which one could be active at any time. A Virtual Enter (far call)
    instruction would switch to a new environment, loading a new
    set of regions into the base limit registers (region zero
    would remain the same as it contained the stack [which grew
    towards higher addresses] and would automatically be extended
    by the OS if the stack reached the current region limit, up to 500KB).

    A program could have up to 10,000 environments.
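    A hedged sketch of the environment switch as described (structures
    invented): a Virtual Enter selects one of the environment descriptors and
    reloads base-limit pairs 1..7 from it, leaving region 0, which holds the
    stack, in place.

    #include <stdint.h>

    struct region { uint32_t base, limit; };

    /* Eight base-limit registers visible to the running program. */
    static struct region active[8];

    /* Each environment records regions 1..7; region 0 (stack/data) is shared. */
    struct environment {
        struct region regions[8];
        uint32_t      entry_point;
    };

    #define MAX_ENVS 10000
    static struct environment envs[MAX_ENVS];

    /* Roughly what a Virtual Enter (far call) would do. */
    static uint32_t virtual_enter(uint32_t env_no)
    {
        const struct environment *e = &envs[env_no];
        for (int r = 1; r < 8; r++)       /* region 0 stays: it holds the stack */
            active[r] = e->regions[r];
        return e->entry_point;            /* continue execution in new region 1 */
    }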

    Three instructions were available to move data between
    the active regions and inactive regions (move string,
    compare string, hash string).

    Addresses were context-relative (unindexed data accesses would use
    region zero and code accesses (e.g. branches) would use region 1)
    or relative to one of the 8 base registers (selected
    by the high-order digit in an index register, of which there
    were three at the base of region zero, and four more stored
    in internal registers in the processor - these "Mobile" index
    registers were added (along with extensions to the address operand
    in the instruction stream) to the architecture 15 years after it was
    originally developed while maintaining compatibility with old binaries).

    Making space in fragmented memory would, in those days, roll the entire
    process out to drum/disk (and later SSD based on RAM chips). When
    the ability to support more than 1 region per program was added,
    the OS was updated to roll out individual regions. The MCP
    would also move regions for inactive processes to make space for
    new allocations. The roll-out/roll-in and move code was originally written
    in assembler and was called 'hiho'. As in, hi-ho, hi-ho it's off
    to work we go.... The assembler labels were the names of the dwarves,
    snow, white, prince, etc.

    Should have cross-posted to folklore :-)

  • From MitchAlsup1@21:1/5 to BGB on Fri Feb 23 19:37:14 2024
    BGB wrote:

    On 2/22/2024 4:41 PM, MitchAlsup1 wrote:
    BGB wrote:


    Where, say, one can interpret the ASID as:
    abcd-efxx-xxxx-xxxx

    And, then compose an index from (Addr(21:14)^{f,e,d,c,b,a,f,e}).

    A good HW hash often takes a field of bits and reverses the order--brain
    dead easy in Verilog::

    hash = addr[63..48] ^ addr[32..47] ...

    Where, in this case, no two consecutive addresses may land in the same
    place in the TLB.

    Indeed.
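    The same reverse-and-XOR idea, sketched in C rather than Verilog (widths
    and field choices are illustrative only):

    #include <stdint.h>

    /* Reverse the 16 bits of x: a software stand-in for wiring the bits
       backwards, which costs nothing in hardware. */
    static uint16_t rev16(uint16_t x)
    {
        uint16_t r = 0;
        for (int i = 0; i < 16; i++)
            r |= (uint16_t)(((x >> i) & 1u) << (15 - i));
        return r;
    }

    /* Illustrative TLB index hash: XOR one address field with the
       bit-reversed neighbouring field, as in the snippet above. */
    static uint16_t tlb_hash(uint64_t addr)
    {
        uint16_t hi = (uint16_t)(addr >> 48);
        uint16_t lo = (uint16_t)(addr >> 32);
        return hi ^ rev16(lo);
    }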

  • From EricP@21:1/5 to BGB on Fri Feb 23 14:44:52 2024
    BGB wrote:

    But, if the ability to dynamically move things seems needed, a segmented-addressing-like approach would make more sense.

    Or, one possibility even could have a sort of "segmented virtual flat addressing" scheme, say:
    (47:36): Index into a program segment table;
    (35:0): Address within segment.

    This is like the PDP-11 or 68000 MMU method: use some upper address
    bits as an index into a hardware table to select a mapping register.
    By setting base relocation register values OS can move whole logical
    segments around in a physical space and apply privilege checking.

    PDP-11 used 3 msb address bits plus the S/U privilege mode as an index
    into a 16-entry HW table to map into an 18-bit physical space.

    68000 had the 68451 which was similar but had 32 segments. https://en.wikipedia.org/wiki/Motorola_68451

    I've never used either of them but imagine most of their complexity
    would be due to the very small number of mapping registers forcing
    software to do things like segment overlays, segment swapping, etc.
    If hardware can handle say up to 4096 mapping registers
    then these complexities would be unlikely to occur.

    I believe both used logical relocation - substitution of high address bits.
    I would use *arithmetic* relocation whereby the selected register adds
    a base physical address to the selected segment offset.
    That allows much easier OS memory management as you don't need to be
    concerned with alignment boundaries, just segment size.
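    A hedged sketch contrasting the two relocation styles (field widths are
    assumptions; the split follows the (47:36)/(35:0) example quoted above):
    logical relocation substitutes the upper address bits, so segments must sit
    on power-of-two boundaries, while arithmetic relocation adds a
    byte-granular base.

    #include <stdint.h>

    #define SEG_BITS   12u                /* upper 12 bits pick a segment */
    #define OFF_BITS   (48u - SEG_BITS)
    #define OFF_MASK   ((1ull << OFF_BITS) - 1)

    static uint64_t seg_frame[1u << SEG_BITS];  /* logical: frame per segment */
    static uint64_t seg_base [1u << SEG_BITS];  /* arithmetic: byte base per segment */

    /* Logical relocation: replace the segment bits with a physical frame
       number. Segments can only start on (1 << OFF_BITS)-byte boundaries.
       Limit and privilege checks are omitted. */
    static uint64_t reloc_logical(uint64_t laddr)
    {
        uint64_t seg = laddr >> OFF_BITS;
        return (seg_frame[seg] << OFF_BITS) | (laddr & OFF_MASK);
    }

    /* Arithmetic relocation: add a byte-granular base, so the OS only has to
       find a hole of the right size, not of the right alignment. */
    static uint64_t reloc_arithmetic(uint64_t laddr)
    {
        uint64_t seg = laddr >> OFF_BITS;
        return seg_base[seg] + (laddr & OFF_MASK);
    }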

    Where, one could potentially have an x86 style segmented addressing
    scheme faking the appearance of a linear address space, as opposed to explicit segment registers.

    Neither of the above were like the x86 segmentation as the their
    segmentation was created and managed completely inside the OS
    (and thus none of the 16-bit x86 segmentation cruft).

  • From Lynn Wheeler@21:1/5 to John Dallman on Fri Feb 23 09:47:29 2024
    jgd@cix.co.uk (John Dallman) writes:
    Virtual memory systems without swapping or demand paging. You are limited
    to physical memory, but the ability of the OS to re-map pages to avoid fragmentation makes life reasonably simple. But only reasonably: you
    really get to exercise your out-of-memory handlers.

    In the 60s, Boeing Huntsville modified OS/360 MVT13 (real memory) to do
    just that. They had gotten a 360/67 multiprocessor for tss/360 with lots
    of 2250 graphics displays ... but tss/360 never came to production
    ... so they configured it as two 360/65s and ran os/360. Because of MVT
    storage management problems (exacerbated by long running 2250 graphics
    cad/cam programs), they modified MVT13 to run virtual memory using 360/67 hardware ... but w/o paging.

    a little more than a decade ago, I was asked to track down the decision to
    add virtual memory to all IBM 370s. Turns out MVT storage management was so
    bad that program execution "regions" had to be specified four times
    larger than used ... a typical one-mbyte 370/165 would only run four
    concurrently executing regions, insufficient to keep the machine busy and
    justify it. Running MVT in a 16mbyte virtual memory would allow
    increasing concurrently executing regions by four times with little or
    no paging (similar to running MVT in CP67 16mbyte virtual machine).

    --
    virtualization experience starting Jan1968, online at home since Mar1970

  • From MitchAlsup1@21:1/5 to BGB on Mon Feb 26 23:20:56 2024
    BGB wrote:

    On 2/26/2024 3:31 AM, Robert Finch wrote:
    On 2024-02-23 1:20 a.m., Robert Finch wrote:
    On 2024-02-22 2:55 p.m., BGB wrote:


    Fixed a couple more bugs in the PTW and now I think it works. So it's
    back to virtual memory support.


    Possible.

    Might make sense to consider a page-walker at some point, but at the
    moment TLB Miss handling is a relatively low percentage of the CPU time.

    Granted, my TLB is, from what I can gather, fairly large...
    Roughly on par with modern ARM64 cores;
    Significantly larger than most of the 90s-era RISCs.

    Where it seemed in the 90s, 32/64/128 entry fully associative designs
    were popular (as opposed to 256x 4-way, so 1024 TLBE's in the TLB).

    The general nomenclature is that a 64KB cache with 4-way set associativity
    has 64KB of total storage. You are using 256× 4-way as having 1024 entries
    of storage.

    And yes, that is a significant amount of PTEs.

    The fully associative ones were in the range of 32-64 entries and generally
    configured to produce a result that could be used in the same cycle.
    Once these got bigger (i.e., slower) it was/is natural to switch from
    CAM to SRAM, losing associativity while gaining capacity.

    As noted elsewhere, seem to have gotten a minor speedup by adding a
    small set-associative cache between the L1 and L2 caches, whose main job
    is mostly to try to absorb conflict misses (pros/cons; breaks a cheap
    but potentially questionable mechanism for inter-processor memory
    coherence).

    Did require a bit of debugging, mostly in the area of what to cache, and extra checks to exclude things from being cached. Say, for example, if
    one line in a way gets flushed, all of the other ways need to be flushed
    as well (along with excluding some areas outside the main RAM area, *, etc).

    *: There were some garbage writes into ROM areas, it seems though that
    ROM must be seen as ROM. With the direct-mapped caches, these garbage
    writes were being discarded, but with a set-associative intermediate
    cache, a garbage write into a ROM area could still see the written value
    on a later access.

    LoL.

    Had experimented with 4-way and 8-way:
    Disabled: L1 Bridge module disappears entirely;
    4-way: Exists, roughly 600 LUTs;
    8-way: Exists, roughly 2400 LUTs

    Why is this 2× more than simply 2× larger ?

    (also "timing slack" gets visibly worse).

    The 4-way vs 8-way difference makes a relatively smaller impact though
    on hit-rate, so 8-way is likely not worth it here. Both cases seem to
    give benefit more by causing the L2 cache to miss less often, rather
    than in getting data back to the L1 faster (it seems like many of the conflict-misses coming out of the direct-mapped L1 cache were also
    leading to conflict misses in the direct-mapped L2 cache).

    ....
