• About the full paper on the Swarm chip.. (2/2)

    From Wisdom90@21:1/5 to Bonita Montero on comp.programming. on Tue Feb 11 19:46:24 2020
    [continued from previous message]

    each has its own place. First (and most obviously), one can make locking entirely internal to the subsystem. For example, in concurrent operating systems, control never returns to user level with in-kernel locks held;
    the locks used to implement the system itself are entirely behind the
    system call interface that constitutes the interface to the system. More generally, this model can work whenever a crisp interface exists between software components: as long as control flow is never returned to the
    caller with locks held, the subsystem will remain composable.

    Second (and perhaps counterintuitively), one can achieve concurrency and composability by having no locks whatsoever. In this case, there must be
    no global subsystem state—subsystem state must be captured in
    per-instance state, and it must be up to consumers of the subsystem to
    assure that they do not access their instance in parallel. By leaving
    locking up to the client of the subsystem, the subsystem itself can be
    used concurrently by different subsystems and in different contexts. A
    concrete example of this is the AVL tree implementation used extensively
    in the Solaris kernel. As with any balanced binary tree, the
    implementation is sufficiently complex to merit componentization, but by
    not having any global state, the implementation may be used concurrently
    by disjoint subsystems—the only constraint is that manipulation of a
    single AVL tree instance must be serialized.

    Read more here:

    https://queue.acm.org/detail.cfm?id=1454462
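
    As a minimal sketch in Object Pascal of that second approach (no locks
    and no global state, with serialization left to the client), here is a
    hypothetical TIntStack class, not the Solaris AVL code itself: all of its
    state is per instance, so disjoint subsystems can each use their own
    instance concurrently, and a caller that shares one instance between
    threads must serialize access to it itself.

    // Sketch (hypothetical): a componentized data structure with no locks
    // and no global state; locking is left to the client of the subsystem.
    type
      TIntStack = class
      private
        FItems: array of Integer;   // per-instance state only, no globals
      public
        procedure Push(Value: Integer);
        function Pop: Integer;      // assumes the caller never pops when empty
      end;

    procedure TIntStack.Push(Value: Integer);
    begin
      SetLength(FItems, Length(FItems) + 1);
      FItems[High(FItems)] := Value;
    end;

    function TIntStack.Pop: Integer;
    begin
      Result := FItems[High(FItems)];
      SetLength(FItems, Length(FItems) - 1);
    end;

    // Two subsystems can use two TIntStack instances concurrently with no
    // locking at all; a client that shares one instance between threads must
    // serialize access itself, for example with its own TCriticalSection.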

    And about the Message Passing and Shared Memory process communication
    models:

    An advantage of the shared memory model is that memory communication is
    faster than the message passing model on the same machine.

    Read the following to see why:

    Why did Windows NT move away from the microkernel?

    "The main reason that Windows NT became a hybrid kernel is speed. A microkernel-based system puts only the bare minimum system components in
    the kernel and runs the rest of them as user mode processes, known as
    servers. A form of inter-process communication (IPC), usually message
    passing, is used for communication between servers and the kernel.

    Microkernel-based systems are more stable than others; if a server
    crashes, it can be restarted without affecting the entire system, which couldn't be done if every system component was part of the kernel.
    However, because of the overhead incurred by IPC and context-switching, microkernels are slower than traditional kernels. Due to the performance
    costs of a microkernel, Microsoft decided to keep the structure of a microkernel, but run the system components in kernel space. Starting in
    Windows Vista, some drivers are also run in user mode."


    More about message passing..

    Again, an advantage of the shared memory model is that memory
    communication is faster than the message passing model on the same
    machine.

    Read the following to see why:

    "One problem that plagues microkernel implementations is relatively poor performance. The message-passing layer that connects
    different operating system components introduces an extra layer of
    machine instructions. The machine instruction overhead introduced
    by the message-passing subsystem manifests itself as additional
    execution time. In a monolithic system, if a kernel component needs
    to talk to another component, it can make direct function calls
    instead of going through a third party."

    However, the shared memory model may create problems, such as
    synchronization and memory protection, that need to be addressed.

    Message passing’s major flaw is the inversion of control; it is the moral
    equivalent of gotos in unstructured programming (it’s about time somebody
    said that message passing is considered harmful).

    Also, some research shows that the total effort to write an MPI
    application is significantly higher than the effort required to write a
    shared-memory version of it.

    And more about my scalable reference counting with efficient support for
    weak references:

    My invention, scalable reference counting with efficient support for weak
    references, is now at version 1.37.

    Here I am again: I have just updated my scalable reference counting with
    efficient support for weak references to version 1.37 by adding
    TAMInterfacedPersistent, a scalable reference-counted version of
    TInterfacedPersistent, and I now think it is complete and powerful.

    Because I have just read the following web page:

    https://www.codeproject.com/Articles/1252175/Fixing-Delphis-Interface-Limitations

    But I don't agree with the author of the above web page, because I think
    you have to understand the "spirit" of Delphi. Here is why:

    A component is supposed to be owned and destroyed by something else,
    "typically" a form (and "typically" means in most cases, which is the
    most important thing to understand). In that scenario, reference counting
    is not used.

    If you pass a component as an interface reference, it would be very
    unfortunate if it was destroyed when the method returns.

    That is why reference counting in TComponent is disabled.

    This is also why I have just added TAMInterfacedPersistent to my invention.

    To use scalable reference counting with Delphi and FreePascal, just
    replace TInterfacedObject with my TAMInterfacedObject, the scalable
    reference-counted version, and replace TInterfacedPersistent with my
    TAMInterfacedPersistent, its scalable reference-counted counterpart. You
    will find both TAMInterfacedObject and TAMInterfacedPersistent inside the
    AMInterfacedObject.pas file. To learn how to use weak references, take a
    look at the included demo called example.dpr and at the tutorial about
    weak references inside my zip file, and to learn how to use delegation,
    take a look at the included demo called test_delegation.pas and at the
    tutorial about delegation inside my zip file, which teaches you how to
    use delegation.
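
    Below is a minimal usage sketch in Object Pascal. It assumes only what is
    stated above, namely that TAMInterfacedObject lives in the
    AMInterfacedObject.pas file and is a drop-in replacement for
    TInterfacedObject; the IMyCounter and TMyCounter names are hypothetical,
    and the weak-reference and delegation APIs are not shown here (see
    example.dpr, test_delegation.pas and the tutorials in the zip file for
    those).

    // Minimal sketch: compile with Delphi or with FreePascal in Delphi mode
    // (fpc -Mdelphi). Only TAMInterfacedObject and its unit come from the
    // library above; everything else is illustrative.
    program RefCountDemo;

    uses
      AMInterfacedObject;

    type
      IMyCounter = interface
        ['{7F1C2D34-5A6B-4C7D-8E9F-0A1B2C3D4E5F}']
        procedure Increment;
        function Value: Integer;
      end;

      // Derive from TAMInterfacedObject instead of TInterfacedObject to get
      // the scalable reference counting.
      TMyCounter = class(TAMInterfacedObject, IMyCounter)
      private
        FValue: Integer;
      public
        procedure Increment;
        function Value: Integer;
      end;

    procedure TMyCounter.Increment;
    begin
      Inc(FValue);
    end;

    function TMyCounter.Value: Integer;
    begin
      Result := FValue;
    end;

    var
      C: IMyCounter;
    begin
      C := TMyCounter.Create;  // the object is reference counted through C
      C.Increment;
      WriteLn(C.Value);
    end.  // the last interface reference is released and the object is freed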

    I think my scalable reference counting with efficient support for weak
    references is stable and fast. It works on both Windows and Linux, and it
    scales on multicore and NUMA systems. You will not find it in C++ or
    Rust, and I don't think you will find it anywhere else. This invention of
    mine solves the problem of dangling pointers and the problem of memory
    leaks, and my reference counting is "scalable".

    And please read the readme file inside the zip file, which I have just
    extended to help you understand more.

    You can download my new scalable reference counting with efficient
    support for weak references version 1.37 from:

    https://sites.google.com/site/scalable68/scalable-reference-counting-with-efficient-support-for-weak-references

    And now I will talk about data dependency and parallel loops..

    For a loop to be parallelized, every iteration must be independent of
    the others. One way to check this is to execute the loop in the direction
    of the incremented index and then in the direction of the decremented
    index and verify that the results are the same. A data dependency happens
    when memory is modified: a loop has a data dependency if an iteration
    writes a variable that is read or written in another iteration of the
    loop. There is no data dependency if only one iteration reads or writes a
    variable, or if many iterations read the same variable without modifying
    it. These are the general rules.
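
    As a small illustration of these rules in Object Pascal (A and B are
    hypothetical arrays), the first loop below has a data dependency between
    its iterations and cannot be parallelized as is, while the second has
    none and can be parallelized:

    // Has a data dependency: iteration I reads A[I-1], which is written by
    // the previous iteration, so the iterations are not independent.
    for I := 2 to N do
      A[I] := A[I-1] + B[I];

    // Has no data dependency: each iteration reads and writes only its own
    // elements, so the iterations are independent.
    for I := 1 to N do
      A[I] := 2 * B[I];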

    It remains to know how to construct the parallel for loop when there is
    an induction variable or when there is a reduction operation; I will give
    an example of each.

    If we have the following (the code looks like Algol or modern Object
    Pascal):


    IND := 0;

    for I := 1 to N do
    begin
      IND := IND + I;
      A[I] := B[IND];
    end;


    As you can notice, IND is an induction variable: after iteration I it
    always holds 1 + 2 + ... + I = I*(I+1)/2, a value that depends only on I.
    So to parallelize the loop you replace the running update with this
    closed form:


    for I := 1 to N do
    begin
      IND := (I * (I + 1)) div 2;
      A[I] := B[IND];
    end;
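
    Now each iteration computes IND from I alone, so as long as IND is a
    per-iteration (local) variable rather than a shared one, no iteration
    depends on another and the loop can safely be run in parallel, for
    example with the ParallelFor() described below.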


    Now for the reduction operation example: my invention, a Threadpool with
    priorities that scales very well (read about it below), provides a
    ParallelFor() that scales very well and supports a "grainsize". The
    grainsize can be used in the ParallelFor() with a reduction operation,
    and my following powerful scalable Adder is also used in this scenario;
    here it is:

    https://sites.google.com/site/scalable68/scalable-adder-for-delphi-and-freepascal


    So here is the example with a reduction operation in modern Object Pascal:


    TOTAL := 0.0;

    for I := 1 to N do
    begin
      TOTAL := TOTAL + A[I];
    end;


    So with my powerful scalable Adder and my powerful invention, my
    ParallelFor() that scales very well, you parallelize the above like this:

    procedure test1(j: integer; ptr: pointer);
    begin
      t.add(A[j]);  // "t" is my scalable Adder object
    end;

    // Let's suppose that N is 100000.
    // In the following call, 10000 is the grainsize.

    obj.ParallelFor(1, N, test1, 10000, pointer(0));

    TOTAL := t.get();
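
    Note that each parallel invocation of test1 adds its element into the
    scalable Adder instead of updating one shared TOTAL variable under a
    lock, and the final t.get() returns the combined sum; presumably this is
    what keeps the reduction itself scalable on multicore and NUMA systems.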


    And read the following to understand how to use the grainsize of my
    ParallelFor() that scales well:


    About my ParallelFor() that scales very well that uses my efficient
    Threadpool that scales very well:

    With ParallelFor() you have to:

    1- Ensure Sufficient Work

    Each iteration of a loop involves a certain amount of work, so you have
    to ensure that each task gets a sufficient amount of work; read below
    about the "grainsize" that I have implemented.

    2- In OpenMP we have the following:

    Static and Dynamic Scheduling

    One basic characteristic of a loop schedule is whether it is static or
    dynamic:

    • In a static schedule, the choice of which thread performs a particular iteration is purely a function of the iteration number and number of
    threads. Each thread performs only the iterations assigned to it at the beginning of the loop.

    • In a dynamic schedule, the assignment of iterations to threads can
    vary at runtime from one execution to another. Not all iterations are
    assigned to threads at the start of the loop. Instead, each thread
    requests more iterations after it has completed the work already
    assigned to it.

    But my ParallelFor() that scales very well uses my efficient Threadpool
    that scales very well, which does round-robin scheduling and also work
    stealing, so I think that this is sufficient.

    Read the rest:

    My Threadpool engine with priorities that scales very well is really
    powerful because it scales very well on multicore and NUMA systems; it
    also comes with a ParallelFor() that scales very well on multicore and
    NUMA systems.

    You can download it from:

    https://sites.google.com/site/scalable68/an-efficient-threadpool-engine-with-priorities-that-scales-very-well


    Here is the explanation of my ParallelFor() that scales very well; here
    is the method:

    procedure ParallelFor(nMin, nMax: integer; aProc: TParallelProc;
                          GrainSize: integer = 1; Ptr: pointer = nil;
                          pmode: TParallelMode = pmBlocking;
                          Priority: TPriorities = NORMAL_PRIORITY);

    The nMin and nMax parameters of ParallelFor() are the minimum and maximum
    integer values of the loop variable, the aProc parameter is the procedure
    to call, and the GrainSize integer parameter works as follows:

    The grainsize sets a minimum threshold for parallelization.

    A rule of thumb is that grainsize iterations should take at least
    100,000 clock cycles to execute.

    For example, if a single iteration takes 100 clocks, then the grainsize
    needs to be at least 1000 iterations. When in doubt, do the following experiment:

    1- Set the grainsize parameter higher than necessary. The grainsize is specified in units of loop iterations.

    If you have no idea of how many clock cycles an iteration might take,
    start with grainsize=100,000.

    The rationale is that each iteration normally requires at least one
    clock. In most cases, step 3 will guide you to a much smaller value.

    2- Run your algorithm.

    3- Iteratively halve the grainsize parameter and see how much the
    algorithm slows down or speeds up as the value decreases.

    A drawback of setting a grainsize too high is that it can reduce
    parallelism. For example, if the grainsize is 1000 and the loop has 2000 iterations, the ParallelFor() method distributes the loop across only
    two processors, even if more are available.
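
    As a sketch of the tuning experiment above, using the ParallelFor()
    signature shown earlier: here obj is the Threadpool object, work1 is a
    hypothetical worker procedure of the TParallelProc shape, and
    GetTickCount64 (from SysUtils in FreePascal; Delphi users can use
    TStopwatch instead) is used only to time each run. Nothing here comes
    from the library beyond the ParallelFor() call itself.

    // Tuning sketch: start with a grainsize that is higher than necessary,
    // run the algorithm, then iteratively halve the grainsize and compare
    // the elapsed times (steps 1 to 3 above).
    var
      GrainSize: integer;
      T0, T1: UInt64;
    begin
      GrainSize := 100000;                 // step 1: higher than necessary
      while GrainSize >= 1 do
      begin
        T0 := GetTickCount64;
        // default pmBlocking mode is assumed, so the call returns only when
        // all iterations are done and the timing below is meaningful
        obj.ParallelFor(1, N, work1, GrainSize, pointer(0));  // step 2: run
        T1 := GetTickCount64;
        WriteLn('grainsize = ', GrainSize, '  elapsed ms = ', T1 - T0);
        GrainSize := GrainSize div 2;      // step 3: halve and compare
      end;
    end;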

    And you can pass a parameter to ParallelFor() through Ptr as a pointer;
    you can set the pmode parameter to pmBlocking so that ParallelFor() is
    blocking, or to pmNonBlocking so that it is non-blocking; and the
    Priority parameter is the priority of ParallelFor(). Look inside the
    test.pas example to see how to use it.


    Thank you,
    Amine Moulay Ramdane.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)