• More about WaitAny() and WaitAll() and more.. (1/2)

    From World90@21:1/5 to All on Sat Apr 24 16:48:53 2021
    Hello..


    More about WaitAny() and WaitAll() and more..

    Look at the following concurrency abstractions of Microsoft:

    https://docs.microsoft.com/en-us/dotnet/api/system.threading.tasks.task.waitany?view=netframework-4.8

    https://docs.microsoft.com/en-us/dotnet/api/system.threading.tasks.task.waitall?view=netframework-4.8

    They resemble Delphi's WaitForAny() and WaitForAll(), documented
    here:

    http://docwiki.embarcadero.com/Libraries/Sydney/en/System.Threading.TTask.WaitForAny

    http://docwiki.embarcadero.com/Libraries/Sydney/en/System.Threading.TTask.WaitForAll

    WaitForAll() is easy, and I have implemented it in my Threadpool
    engine, which I invented and which scales very well. The HTML
    tutorial inside its zip file explains how to do it; you can
    download it from my website here:

    https://sites.google.com/site/scalable68/an-efficient-threadpool-engine-with-priorities-that-scales-very-well

    As for WaitForAny(), you can also implement it using my SemaMonitor,
    and I will soon give you an example of how to do it. You can
    download my SemaMonitor invention from my website here:

    https://sites.google.com/site/scalable68/semacondvar-semamonitor

    Here are my other new software inventions..

    I have just looked at the source code of the following multiplatform
    pevents library:

    https://github.com/neosmart/pevents

    Notice that its WaitForMultipleEvents() is implemented with pthreads,
    but it is not scalable on multicores. So I have just invented a
    WaitForMultipleObjects() that looks like the Windows
    WaitForMultipleObjects() and that is fully "scalable" on multicores.
    It works on Windows, Linux and MacOSX, and it blocks while waiting
    for the objects, as WaitForMultipleObjects() does, so it doesn't
    consume CPU cycles when waiting. It works with events, futures and
    tasks.
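
    To make the blocking wait-any and wait-all semantics concrete, here
    is a minimal C++ sketch. It is not the invention described above:
    the EventGroup class and its method names are hypothetical, and it
    uses a single shared mutex and condition variable, which shows the
    behaviour only and is not a scalable design.

```cpp
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical helper: a fixed group of manual-reset events sharing one
// mutex and one condition variable. Semantics only, not a scalable design.
class EventGroup {
public:
    explicit EventGroup(std::size_t n) : signaled_(n, false) {}

    // Mark event i as signaled and wake every blocked waiter.
    void set(std::size_t i) {
        {
            std::lock_guard<std::mutex> guard(mutex_);
            signaled_.at(i) = true;
        }
        changed_.notify_all();
    }

    // Return event i to the non-signaled state.
    void reset(std::size_t i) {
        std::lock_guard<std::mutex> guard(mutex_);
        signaled_.at(i) = false;
    }

    // Block until any event is signaled; return its index, or -1 on timeout.
    int wait_any(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lock(mutex_);
        int found = -1;
        changed_.wait_for(lock, timeout, [&] {
            found = first_signaled();
            return found >= 0;
        });
        return found;
    }

    // Block until all events are signaled; return false on timeout.
    bool wait_all(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lock(mutex_);
        return changed_.wait_for(lock, timeout, [&] { return all_signaled(); });
    }

private:
    int first_signaled() const {
        for (std::size_t i = 0; i < signaled_.size(); ++i)
            if (signaled_[i]) return static_cast<int>(i);
        return -1;
    }

    bool all_signaled() const {
        for (bool s : signaled_)
            if (!s) return false;
        return true;
    }

    std::mutex mutex_;
    std::condition_variable changed_;
    std::vector<bool> signaled_;
};
```

    A waiting thread sleeps inside wait_for() until set() notifies it,
    so it does not spin; every waiter still contends on the same mutex,
    which is exactly the bottleneck a scalable design has to avoid.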

    Here are more of my new software inventions..

    I have just invented a latch and a thread barrier that are both
    fully "scalable" on multicores; they are really powerful.

    Read here about the C++ latches and barriers, which are not scalable
    on multicores:

    https://www.modernescpp.com/index.php/latches-and-barriers
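
    For readers who have not used a latch, here is a minimal C++ sketch
    of the primitive itself. The SimpleLatch class is hypothetical and
    is built on one mutex and one condition variable, so it only shows
    what a latch does; it is not the scalable latch announced above.

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>

// Hypothetical single-use latch, similar in spirit to C++20 std::latch.
class SimpleLatch {
public:
    explicit SimpleLatch(std::size_t count) : remaining_(count) {}

    // Decrement the counter; the call that reaches zero releases all waiters.
    void count_down() {
        std::lock_guard<std::mutex> guard(mutex_);
        if (remaining_ > 0 && --remaining_ == 0) released_.notify_all();
    }

    // Non-blocking check: true once the counter has reached zero.
    bool try_wait() {
        std::lock_guard<std::mutex> guard(mutex_);
        return remaining_ == 0;
    }

    // Block (without spinning) until the counter reaches zero.
    void wait() {
        std::unique_lock<std::mutex> lock(mutex_);
        released_.wait(lock, [this] { return remaining_ == 0; });
    }

private:
    std::mutex mutex_;
    std::condition_variable released_;
    std::size_t remaining_;
};
```

    A thread barrier differs from a latch in that it resets itself after
    all threads have arrived, so it can be reused for the next phase.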


    Here are my other software inventions:


    More about my scalable math Linear System Solver Library...

    I describe my Linear System Solver Library below. Right now it
    scales very well, but I will soon make it "fully" scalable on
    multicores using one of the scalable algorithms that I have
    invented. I will also extend it to support efficient matrix
    operations that are scalable on multicores, and more. Since it will
    come with one of my scalable algorithms, I think I will sell it too.

    More about mathematics and about scalable Linear System Solver Libraries
    and more..

    I have just noticed that a software architect from Austria called
    Michael Rabatscher has designed and implemented the MrMath Library,
    which is also a parallelized library:

    Here he is:

    https://at.linkedin.com/in/michael-rabatscher-6821702b

    And here is his MrMath Library for Delphi and Freepascal:

    https://github.com/mikerabat/mrmath

    But I think that his MrMath Library is not scalable on multicores:
    notice that neither its Linear System Solver nor its threaded matrix
    operations are scalable on multicores. This is why I have invented a
    scalable on multicores Conjugate Gradient Linear System Solver
    Library for C++, Delphi and Freepascal. Read about it in my
    following thoughts (I will also soon extend my library to support
    scalable matrix operations):

    About SOR and Conjugate gradient mathematical methods..

    I have just looked at SOR (the Successive Over-Relaxation method),
    and I think it is much less powerful than the Conjugate Gradient
    method; read the following to notice it:

    COMPARATIVE PERFORMANCE OF THE CONJUGATE GRADIENT AND SOR METHODS
    FOR COMPUTATIONAL THERMAL HYDRAULICS

    https://inis.iaea.org/collection/NCLCollectionStore/_Public/19/055/19055644.pdf?r=1&r=1


    This is why I have implemented, in both C++ and Delphi, my Parallel
    Conjugate Gradient Linear System Solver Library that scales very
    well. Read my following thoughts about it to understand more:


    About the convergence properties of the conjugate gradient method

    The conjugate gradient method can theoretically be viewed as a direct
    method, as it produces the exact solution after a finite number of
    iterations, which is not larger than the size of the matrix, in the
    absence of round-off error. However, the conjugate gradient method is
    unstable with respect to even small perturbations, e.g., most
    directions are not in practice conjugate, and the exact solution is
    never obtained.

    Fortunately, the conjugate gradient method can be used as an
    iterative method, as it provides monotonically improving
    approximations to the exact solution, which may reach the required
    tolerance after a relatively small (compared to the problem size)
    number of iterations. The improvement is typically linear and its
    speed is determined by the condition number κ(A) of the system
    matrix A: the larger κ(A) is, the slower the improvement.
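
    The dependence on the condition number is usually stated as the
    following standard bound for a symmetric positive-definite matrix A.
    The symbols are notation assumed here and are not in the text above:
    x_* is the exact solution, x_k is the k-th iterate, and the norm is
    the energy norm induced by A.

```latex
\| x_k - x_* \|_A \;\le\; 2 \left( \frac{\sqrt{\kappa(A)} - 1}{\sqrt{\kappa(A)} + 1} \right)^{k} \| x_0 - x_* \|_A
```

    When κ(A) is close to 1 the factor in parentheses is close to 0 and
    the error shrinks quickly; as κ(A) grows the factor approaches 1 and
    each iteration gains less.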

    Read more here:

    http://pages.stat.wisc.edu/~wahba/stat860public/pdf1/cj.pdf


    So I think my Conjugate Gradient Linear System Solver Library, which
    scales very well, is still very useful; read about it in my writing
    below:

    Read the following interesting news:

    The finite element method finds its place in games

    Read more here:

    https://translate.google.com/translate?hl=en&sl=auto&tl=en&u=https%3A%2F%2Fhpc.developpez.com%2Factu%2F288260%2FLa-methode-des-elements-finis-trouve-sa-place-dans-les-jeux-AMD-propose-la-bibliotheque-FEMFX-pour-une-simulation-en-temps-reel-des-deformations%2F

    But you have to be aware that the finite element method often uses
    the Conjugate Gradient method to solve finite element problems; read
    here to notice it:

    Conjugate Gradient Method for Solution of Large Finite Element Problems
    on CPU and GPU

    https://pdfs.semanticscholar.org/1f4c/f080ee622aa02623b35eda947fbc169b199d.pdf


    This is why I have also designed and implemented my Parallel
    Conjugate Gradient Linear System Solver library that scales very
    well; here it is:

    My Parallel C++ Conjugate Gradient Linear System Solver Library
    that scales very well version 1.76 is here..

    Author: Amine Moulay Ramdane

    Description:

    This library contains a parallel implementation of a Conjugate
    Gradient Dense Linear System Solver that is NUMA-aware and
    cache-aware and that scales very well. It also contains a parallel
    implementation of a Conjugate Gradient Sparse Linear System Solver
    that is cache-aware and that scales very well.

    Sparse linear system solvers are ubiquitous in high performance
    computing (HPC) and are often the most computationally intensive
    parts of scientific computing codes. A few of the many applications
    relying on sparse linear solvers include fusion energy simulation,
    space weather simulation, climate modeling, environmental modeling,
    the finite element method, and large-scale reservoir simulations to
    enhance oil recovery by the oil and gas industry.

    In exact arithmetic, Conjugate Gradient is known to converge to the
    exact solution in at most n steps for a matrix of size n, and it was
    historically first seen as a direct method because of this. However,
    after a while people figured out that it works really well if you
    just stop the iteration much earlier: often you will get a very good
    approximation after far fewer than n steps. In fact, we can analyze
    how fast Conjugate Gradient converges. The end result is that
    Conjugate Gradient is used today as an iterative method for
    large linear systems today.
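
    To show what such an iteration looks like, here is a minimal serial
    C++ sketch of the textbook conjugate gradient method for a small
    dense symmetric positive-definite system. It is not the library's
    code: it is neither parallel, NUMA-aware nor cache-aware, and the
    names conjugate_gradient, tolerance and max_iterations are
    hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using Vector = std::vector<double>;
using Matrix = std::vector<Vector>;  // dense rows; assumed symmetric positive-definite

double dot(const Vector& a, const Vector& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) sum += a[i] * b[i];
    return sum;
}

Vector multiply(const Matrix& A, const Vector& x) {
    Vector y(A.size(), 0.0);
    for (std::size_t i = 0; i < A.size(); ++i) y[i] = dot(A[i], x);
    return y;
}

// Solve A x = b starting from x = 0. Stops early once the residual norm
// drops to the tolerance, which is what makes it an iterative method.
Vector conjugate_gradient(const Matrix& A, const Vector& b,
                          double tolerance = 1e-12,
                          std::size_t max_iterations = 1000) {
    Vector x(b.size(), 0.0);
    Vector r = b;           // residual b - A x for the initial x = 0
    Vector p = r;           // first search direction
    double rr = dot(r, r);  // squared residual norm
    for (std::size_t k = 0; k < max_iterations && std::sqrt(rr) > tolerance; ++k) {
        Vector Ap = multiply(A, p);
        double alpha = rr / dot(p, Ap);  // step length along p
        for (std::size_t i = 0; i < x.size(); ++i) {
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
        }
        double rr_next = dot(r, r);
        double beta = rr_next / rr;      // keeps the new direction A-conjugate
        for (std::size_t i = 0; i < p.size(); ++i) p[i] = r[i] + beta * p[i];
        rr = rr_next;
    }
    return x;
}
```

    For example, with A = [[4, 1], [1, 3]] and b = [1, 2] the exact
    solution is x = [1/11, 7/11], and the loop reaches it after two
    iterations, up to rounding. The matrix-vector product dominates the
    cost of each iteration, so it is the natural part to parallelize.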

    Please download the zip file and read the readme file inside it to
    learn how to use the library.

    You can download it from:

    https://sites.google.com/site/scalable68/scalable-parallel-c-conjugate-gradient-linear-system-solver-library

    Language: GNU C++, Visual C++ and C++Builder

    Operating Systems: Windows, Linux, Unix and Mac OS X (x86)


    --

    Thread Barrier for Delphi and Freepascal version 1.0 is here..

    I have added my condition variable implementation and my scalable
    lock, called MLock, both of which work on Windows and Linux, and I
    have made the Thread Barrier work on both Windows and Linux. You can
    now pass a parameter to the constructor of the Thread Barrier:
    ctMutex to use a Mutex, ctMLock to use my scalable lock MLock, or
    ctCriticalSection to use a Critical Section.

    You can download it from my website here:

    https://sites.google.com/site/scalable68/thread-barrier-for-delphi-and-freepascal

    More details about my SemaMonitor, SemaCondvar and Monitor
    inventions..

    My SemaMonitor and SemaCondvar inventions take a fast path when
    their count is greater than 0: in this case the wait() method stays
    in user mode and doesn't switch from user mode to kernel mode, a
    switch that costs around 1500 CPU cycles and is expensive. The
    signal() method also takes a fast path when there is no item in the
    queue and the count is less than MaximumCount. Read here about the
    cost (in CPU cycles) of switching between Windows user mode and
    kernel mode:

    https://stackoverflow.com/questions/1368061/whats-the-cost-in-cycles-to-switch-between-windows-kernel-and-user-mode#:~:text=1%20Answer&text=Switching%20from%20%E2%80%9Cuser%20mode%E2%80%9D%20to,rest%20is%20%22kernel%20overhead%22.
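
    The fast-path idea can be sketched in C++ as follows. This is an
    illustration in the well-known "benaphore" style and not the
    SemaMonitor source: the class names are hypothetical, it has no
    MaximumCount limit, and the slow path is an ordinary mutex and
    condition variable.

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>

// Slow path: an ordinary blocking semaphore that may enter the kernel.
class BlockingSemaphore {
public:
    void wait() {
        std::unique_lock<std::mutex> lock(mutex_);
        available_.wait(lock, [this] { return permits_ > 0; });
        --permits_;
    }

    void signal() {
        {
            std::lock_guard<std::mutex> guard(mutex_);
            ++permits_;
        }
        available_.notify_one();
    }

private:
    std::mutex mutex_;
    std::condition_variable available_;
    long permits_ = 0;
};

// Hypothetical fast-pathed semaphore: an atomic counter in front of the
// blocking semaphore. A negative count means that many threads are blocked.
class FastPathSemaphore {
public:
    explicit FastPathSemaphore(long initial = 0) : count_(initial) {}

    // Fast path: if the count was > 0, one atomic decrement and no locking.
    void wait() {
        if (count_.fetch_sub(1, std::memory_order_acquire) <= 0)
            slow_.wait();  // nothing was available: block on the slow path
    }

    // Fast path: if nobody is blocked, one atomic increment and no locking.
    void signal() {
        if (count_.fetch_add(1, std::memory_order_release) < 0)
            slow_.signal();  // a waiter is blocked: hand it a permit
    }

    // Non-blocking attempt; it succeeds only when the count is positive.
    bool try_wait() {
        long current = count_.load(std::memory_order_relaxed);
        while (current > 0) {
            if (count_.compare_exchange_weak(current, current - 1,
                                             std::memory_order_acquire,
                                             std::memory_order_relaxed))
                return true;
        }
        return false;
    }

    long count() const { return count_.load(std::memory_order_relaxed); }

private:
    std::atomic<long> count_;
    BlockingSemaphore slow_;
};
```

    When the count is positive, wait() costs one atomic operation in
    user mode; the mutex and condition variable are touched only when a
    thread really has to block or be woken.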

    You can read about and download my SemaMonitor and SemaCondvar
    inventions here:

    https://sites.google.com/site/scalable68/semacondvar-semamonitor

    The lightweight version is here:

    https://sites.google.com/site/scalable68/light-weight-semacondvar-semamonitor

    I have also implemented an efficient Monitor over my SemaCondvar.

    Here is the description of my efficient Monitor, from the
    Monitor.pas file that you will find inside the zip file:

    Description:

    This is my implementation of a Monitor over my SemaCondvar.

    You will find the Monitor class inside the Monitor.pas file inside the
    zip file.

    When you set the first parameter of the constructor to true, the
    signal will not be lost if no threads are waiting with the wait()
    method; when you set it to false and no threads are waiting with
    wait(), the signal will be lost.

    The second parameter of the constructor is the kind of lock: set it
    to ctMLock to use my scalable node-based lock called MLock, to
    ctMutex to use a Mutex, or to ctCriticalSection to use the
    TCriticalSection.

    Here are the methods of my efficient Monitor:

    TMonitor = class
    private
      cache0: typecache0;
      lock1: TSyncLock;
      obj: TSemaCondvar;
      cache1: typecache0;
    public
      constructor Create(bool: boolean = true; lock: TMyLocks = ctMLock);
      destructor Destroy; override;
      procedure Enter();
      procedure Leave();
      function Signal(): boolean; overload;
      function Signal(nbr: long; var remains: long): boolean; overload;
      procedure Signal_All();
      function Wait(const AMilliseconds: longword = INFINITE): boolean;
      function WaitersBlocked(): long;
    end;


    The Wait() method makes the calling thread wait on the Monitor
    object until it is signaled. If Wait() fails, it may be because the
    number of waiters is greater than high(longword).

    The Signal() method signals one waiting thread on the Monitor
    object; if Signal() fails, the returned value is false.

    The Signal_All() method signals all the threads waiting on the
    Monitor object.

    The Signal(nbr:long;var remains:long) method signals nbr waiting
    threads; if it fails, the number of signals that were not delivered
    is returned in the remains variable.

    WaitersBlocked() returns the number of threads waiting on the
    Monitor object.

    The Enter() and Leave() methods enter and leave the Monitor's lock.
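
    The difference between a kept and a lost signal, which the first
    constructor parameter controls, can be illustrated with a small C++
    sketch. It is built on std::condition_variable rather than on
    SemaCondvar, and the SignalBox class and its keep_signals flag are
    hypothetical stand-ins, not the Monitor.pas implementation.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical illustration of "signal kept" versus "signal lost".
class SignalBox {
public:
    explicit SignalBox(bool keep_signals = true) : keep_signals_(keep_signals) {}

    // Deliver one signal. Returns false when the signal is dropped because
    // no waiting thread can consume it and signals are not being kept.
    bool signal() {
        std::lock_guard<std::mutex> guard(mutex_);
        if (!keep_signals_ && pending_ >= waiters_) return false;
        ++pending_;
        wakeup_.notify_one();
        return true;
    }

    // Block until a signal is available or the timeout expires.
    bool wait(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lock(mutex_);
        ++waiters_;
        bool got = wakeup_.wait_for(lock, timeout, [this] { return pending_ > 0; });
        --waiters_;
        if (got) --pending_;
        return got;
    }

    // Number of threads currently inside wait().
    long waiters_blocked() {
        std::lock_guard<std::mutex> guard(mutex_);
        return waiters_;
    }

private:
    std::mutex mutex_;
    std::condition_variable wakeup_;
    const bool keep_signals_;
    long pending_ = 0;  // signals delivered but not yet consumed
    long waiters_ = 0;  // threads currently waiting
};
```

    With keep_signals set to true, a signal sent before any thread waits
    is consumed by the next wait(); with false, the same signal is
    dropped and the later wait() times out.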


    You can download the zip files from:

    https://sites.google.com/site/scalable68/semacondvar-semamonitor

    and the lightweight version is here:

    https://sites.google.com/site/scalable68/light-weight-semacondvar-semamonitor


    More about my powerful inventions of scalable reference counting
    algorithm and of my scalable algorithms..

    I invite you to read the following web page:

    Why is memory reclamation so important?

    https://concurrencyfreaks.blogspot.com/search?q=resilience+and+urcu

    Notice that it says the following about RCU:

    "Reason number 4, resilience

    Another reason to go with lock-free/wait-free data structures is
    because they are resilient to failures. On a shared memory system
    with multiple processes accessing the same data structure, even if
    one of the processes dies, the others will be able to progress in
    their work. This is the true gem of lock-free data structures:
    progress in the presence of failure. Blocking data structures
    (typically) do not have this property (there are exceptions though).
    If we add a blocking memory reclamation (like URCU) to a
    lock-free/wait-free data structure, we are losing this resilience
    because one dead process will prevent further memory reclamation and
    eventually bring down the whole system. There goes the resilience
    advantage out the window."

    So I think that RCU cannot be used as reference counting, since it
    is blocking on the writer side; it is therefore not resilient to
    failures, since it is not lock-free on the writer side.

    This is why I have invented my powerful Scalable reference counting
    with efficient support for weak references, which is lock-free for
    its scalable reference counting; here it is:

    https://sites.google.com/site/scalable68/scalable-reference-counting-with-efficient-support-for-weak-references

    My scalable reference counting algorithm belongs to the SCU(0,1)
    class of algorithms, so under scheduling conditions which
    approximate those found in commercial hardware architectures, it
    becomes wait-free, with a system latency of O(sqrt(k)) and an
    individual latency of O(k*sqrt(k)), where k is the number of
    threads.

    The proof is in the following paper:

    https://arxiv.org/pdf/1311.3200.pdf

    Its abstract says:

    "This paper suggests a simple solution to this problem. We show
    that, for a large class of lock-free algorithms, under scheduling
    conditions which approximate those found in commercial hardware
    architectures, lock-free algorithms behave as if they are wait-free.
    In other words, programmers can keep on designing simple lock-free
    algorithms instead of complex wait-free ones, and in practice, they
    will get wait-free progress."

    And in its analysis of the class SCU(q, s) it says:

    "Given an algorithm in SCU(q, s) on k correct processes under a
    uniform stochastic scheduler, the system latency is
    O(q + s*sqrt(k)), and the individual latency is
    O(k(q + s*sqrt(k)))."
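
    The latencies claimed above for SCU(0,1) follow from this result by
    direct substitution of q = 0 and s = 1:

```latex
O\!\left(q + s\sqrt{k}\right)\Big|_{q=0,\;s=1} = O\!\left(\sqrt{k}\right),
\qquad
O\!\left(k\,(q + s\sqrt{k})\right)\Big|_{q=0,\;s=1} = O\!\left(k\sqrt{k}\right)
```

    The first expression is the system latency and the second is the
    individual latency, with k the number of threads.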

    More details about my new inventions of scalable algorithms..

    Look at my powerful inventions below, LW_Fast_RWLockX and
    Fast_RWLockX: two powerful scalable RWLocks that are FIFO fair,
    starvation-free and costless on the reader side (that means no
    atomics and no fences on the reader side). They use the expedited
    sys_membarrier on Linux and FlushProcessWriteBuffers() on Windows.
    If you look at the source code of my LW_Fast_RWLockX.pas and
    Fast_RWLockX.pas inside the zip file, you will notice that on Linux
    they call two functions, membarrier1() and membarrier2():
    membarrier1() registers the process's intent to use
    MEMBARRIER_CMD_PRIVATE_EXPEDITED, and membarrier2() executes a
    memory barrier on each running thread belonging to the same process
    as the calling thread.

    Read more here to understand:

    https://man7.org/linux/man-pages/man2/membarrier.2.html

    Here are my new powerful inventions of scalable algorithms..

    I have just updated LW_Fast_RWLockX and Fast_RWLockX, my two
    scalable RWLocks described above, and they now work on both Linux
    and Windows. I think my inventions are really smart, since a PhD
    researcher says the following:

    "Until today, there is no known efficient reader-writer lock with
    starvation-freedom guarantees;"

    Read more here:

    http://concurrencyfreaks.blogspot.com/2019/04/onefile-and-tail-latency.html

    So I think that my above powerful inventions of scalable
    reader-writer locks are efficient, FIFO fair and starvation-free.

    LW_Fast_RWLockX is a lightweight scalable Reader-Writer Mutex that
    uses a technique that looks like Seqlock, but without looping on the
    reader side as Seqlock does, and this has allowed the reader side to
    be costless. It is fair, it is of course starvation-free, and it
    does spin-wait. Fast_RWLockX is also a lightweight scalable
    Reader-Writer Mutex that uses the same technique, so its reader side
    is costless too; it is fair and starvation-free, but it does not
    spin-wait: it waits on my SemaMonitor, so it is energy efficient.

    You can read about them and download them from my website here:

    https://sites.google.com/site/scalable68/scalable-rwlock

    My other inventions also include the following scalable RWLocks that
    are FIFO fair and starvation-free:

    Here is my invention of a scalable, starvation-free, FIFO fair and
    lightweight Multiple-Readers-Exclusive-Writer Lock called
    LW_RWLockX; it works across processes and threads:

    https://sites.google.com/site/scalable68/scalable-rwlock-that-works-accross-processes-and-threads

    And here are my inventions of new variants of scalable RWLocks that
    are FIFO fair and starvation-free:

    https://sites.google.com/site/scalable68/new-variants-of-scalable-rwlocks

    More about the energy efficiency of Transactional memory and more..

    I have just read the following PhD paper, which is also about the
    energy efficiency of Transactional memory:

    Techniques for Enhancing the Efficiency of Transactional Memory Systems

    http://kth.diva-portal.org/smash/get/diva2:1258335/FULLTEXT02.pdf

    I think it is the best known energy-efficient algorithm for
    Transactional memory, but I think it is not good, since for 64 cores
    its Beta parameter can be 16 cores. So I have just invented a much
    more energy-efficient and powerful scalable fast Mutex, and I have
    also just invented scalable RWLocks that are starvation-free and
    fair; read about them in my writing and thoughts below:

    More about deadlocks and lock-based systems and more..

    I have just read the following from a software engineer from Quebec,
    Canada:

    A deadlock-detecting mutex

    https://faouellet.github.io/ddmutex/

    I quickly understood his algorithm, but I think it is not efficient
    at all: we can find whether a graph has a strongly connected
    component in a time complexity of around O(V+E), so the algorithm
    above of the engineer from Quebec, Canada takes a time complexity of
    around O(n*(V+E)), so it is not good.
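
    The O(V+E) check mentioned above can be illustrated with a
    depth-first search for a cycle in a wait-for graph. This is a
    generic C++ sketch, not the algorithm of the linked article and not
    DelphiConcurrent; the WaitForGraph type and the function names are
    hypothetical.

```cpp
#include <cstddef>
#include <vector>

// waits_for[t] lists the threads that thread t is waiting on.
using WaitForGraph = std::vector<std::vector<std::size_t>>;

enum class Mark { Unvisited, InProgress, Done };

// Depth-first search from node; true if it reaches a node that is still
// on the current search path, which means the graph has a cycle.
bool reaches_cycle(const WaitForGraph& graph, std::size_t node,
                   std::vector<Mark>& marks) {
    marks[node] = Mark::InProgress;
    for (std::size_t next : graph[node]) {
        if (marks[next] == Mark::InProgress) return true;  // back edge
        if (marks[next] == Mark::Unvisited && reaches_cycle(graph, next, marks))
            return true;
    }
    marks[node] = Mark::Done;
    return false;
}

// Every node and edge is visited at most once, so one check is O(V + E).
bool has_deadlock_cycle(const WaitForGraph& graph) {
    std::vector<Mark> marks(graph.size(), Mark::Unvisited);
    for (std::size_t node = 0; node < graph.size(); ++node)
        if (marks[node] == Mark::Unvisited && reaches_cycle(graph, node, marks))
            return true;
    return false;
}
```

    For example, three threads where 0 waits on 1, 1 waits on 2 and 2
    waits on 0 form a cycle and are reported as deadlocked, while the
    chain 0, 1, 2 without the last edge is not.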

    So a much better way is to use my following way of detecting
    deadlocks:

    DelphiConcurrent and FreepascalConcurrent are here

    Read more on my website:

    https://sites.google.com/site/scalable68/delphiconcurrent-and-freepascalconcurrent

    I will soon enhance DelphiConcurrent and FreepascalConcurrent much
    more so that they support both communication deadlocks and resource
    deadlocks.

    About Transactional memory and locks..


    I have just read the following paper about Transactional memory and
    locks:

    http://sunnydhillon.net/assets/docs/concurrency-tm.pdf


    I don't agree with the above paper; read my following thoughts to
    understand why:

    I have just invented a new powerful scalable fast mutex, and it has
    the following characteristics:

    1- Starvation-free
    2- Tunable fairness
    3- It efficiently keeps its cache-coherence traffic very low
    4- Very good fast path performance
    5- Good preemption tolerance
    6- It is faster than the scalable MCS lock
    7- It solves the problem of lock convoying

    So my new invention also solves the following problem:

    The convoy phenomenon

    https://blog.acolyer.org/2019/07/01/the-convoy-phenomenon/

    And here is my other new invention, a Scalable RWLock that works
    across processes and threads and that is starvation-free and fair. I
    will soon enhance it much more, and it will become really powerful:

    https://sites.google.com/site/scalable68/scalable-rwlock-that-works-accross-processes-and-threads

    And about Lock-free versus Lock, read my following post:

    https://groups.google.com/forum/#!topic/comp.programming.threads/F_cF4ft1Qic

    As for deadlocks, here is also how I have solved them, and I will
    soon enhance DelphiConcurrent and FreepascalConcurrent much more:

    DelphiConcurrent and FreepascalConcurrent are here

    Read more on my website:

    https://sites.google.com/site/scalable68/delphiconcurrent-and-freepascalconcurrent


    So I think that with my above scalable fast mutex and my scalable
    RWLocks that are starvation-free and fair, and by reading the
    following about the composability of lock-based systems, you will
    notice that lock-based systems are still useful.


    "About composability of lock-based systems..


    Design your systems to be composable. Among the more galling claims
    of the detractors of lock-based systems is the notion that they are
    somehow uncomposable: “Locks and condition variables do not support
    modular programming,” reads one typically brazen claim, “building
    large programs by gluing together smaller programs[:] locks make
    this impossible.” The claim, of course, is incorrect. For evidence
    one need only point at the composition of lock-based systems such as
    databases and operating systems into larger systems that remain
    entirely unaware of lower-level locking.

    There are two ways to make lock-based systems completely composable,
    and each has its own place. First (and most obviously), one can make
    locking entirely internal to the subsystem. For example, in
    concurrent operating systems, control never returns to user level
    with in-kernel locks held; the locks used to implement the system
    itself are entirely behind the system call interface that
    constitutes the interface to the system. More generally, this model
    can work whenever a crisp interface exists between software
    components: as long as control flow is never returned to the caller
    with locks held, the subsystem will remain composable.

    Second (and perhaps counterintuitively), one can achieve concurrency
    and composability by having no locks whatsoever. In this case, there
    must be no global subsystem state—subsystem state must be captured
    in per-instance state, and it must be up to consumers of the
    subsystem to assure that they do not access their instance in
    parallel. By leaving locking up to the client of the subsystem, the
    subsystem itself can be used concurrently by different subsystems
    and in different contexts. A concrete example of this is the AVL
    tree implementation used extensively in the Solaris kernel. As with
    any balanced binary tree, the implementation is sufficiently complex
    to merit componentization, but by not having any global state, the
    implementation may be used concurrently by disjoint subsystems—the
    only constraint is that manipulation of a single AVL tree instance
    must be serialized."

    Read more here:

    https://queue.acm.org/detail.cfm?id=1454462


    About mathematics and about abstraction..


    I think my specialization is also that I have invented many software
    algorithms and scalable algorithms, and I am still inventing more.
    Inventing these algorithms is like inventing mathematical theorems
    that you prove and present at a higher level of abstraction. But not
    only that: my algorithms are also presented in the form of
    higher-level software abstractions that hide their complexity, and
    this is the part that interests me most.

    For example, notice how I construct a higher-level abstraction in my
    following tutorial. Its methodology first models the synchronization
    objects of parallel programs with If-Then-OR-AND logic primitives,
    so that they are easy to translate to Petri nets, which in turn
    makes it possible to detect deadlocks in parallel programs. Please
    take a look at it at the following web link, because this tutorial
    is a way of learning by higher-level abstraction:


    How to analyse parallel applications with Petri Nets


    https://sites.google.com/site/scalable68/how-to-analyse-parallel-applications-with-petri-nets

    So notice that my methodology is a generalization that detects both
    communication deadlocks and resource deadlocks in parallel programs:

    1- Communication deadlocks that result from incorrect use of
    event objects or condition variables (i.e. wait-notify
    synchronization).


    2- Resource deadlocks, a common kind of deadlock in which a set of
    threads blocks forever because each thread in the set is waiting to
    acquire a lock held by another thread in the set.


    This is what interests me in mathematics: I want to work efficiently
    in mathematics at a much higher level of abstraction. As an example
    of what I am doing, look at how I implement mathematics as parallel
    conjugate gradient linear system solvers that scale very well, and
    how I present them at a higher level of abstraction; this is how I
    abstract the mathematics behind them, as my writing about the
    conjugate gradient method in this post shows.

    I have written above about my Parallel C++ Conjugate Gradient Linear
    System Solver Library that scales very well; here are my Parallel
    Delphi and Freepascal Conjugate Gradient Linear System Solver
    libraries, which also scale very well:

    Parallel implementation of Conjugate Gradient Dense Linear System
    solver library that is NUMA-aware and cache-aware that scales very
    well

    https://sites.google.com/site/scalable68/scalable-parallel-implementation-of-conjugate-gradient-dense-linear-system-solver-library-that-is-numa-aware-and-cache-aware

    Parallel implementation of Conjugate Gradient Sparse Linear System
    solver library that scales very well

    https://sites.google.com/site/scalable68/scalable-parallel-implementation-of-conjugate-gradient-sparse-linear-system-solver

    More of my philosophy about Unix and Linux and more..

    I am a white Arab, and I think I am smart since I have also invented
    many algorithms and scalable algorithms..

    I invite you to look at the following interesting video:

    Unix vs Linux

    https://www.youtube.com/watch?v=jowCUo_UGts

    My diploma is a university-level diploma: my school in Morocco,
    where I studied and earned my university-level diploma in
    Microelectronics and Informatics, was under the control of the Paris
    Academy in France (we call it Académie de Paris), and here it is:

    https://translate.google.com/translate?hl=en&sl=auto&tl=en&u=https%3A%2F%2Ffr.wikipedia.org%2Fwiki%2FAcad%25C3%25A9mie_de_Paris

    I started my studies in Microelectronics and Informatics in 1986,
    and in my studies of informatics in my university-level school I have

    [continued in next message]
