Hello,
My new inventions of Scalable RWLocks that work across processes and
threads, version 5.0, that are starvation-free and fair are here. Read
my following description about them, because I have just enhanced them
more, and they work on both Windows and Linux:
Description:
These are my inventions: a fast, scalable, starvation-free, fair, and
lightweight Multiple-Readers-Exclusive-Writer lock that spin-waits,
called LW_RWLockX, and a fast, scalable, starvation-free, and fair
Multiple-Readers-Exclusive-Writer lock that does not spin-wait, called
RWLockX. Both work across processes and threads.
The parameters of the constructor of LW_RWLockX are the following: the
first parameter is the name of the scalable RWLock to be used across
processes; if the name is empty, it will only be used across threads.
The second parameter is the size of the array of the readers: if the
size of the array is equal to the number of parallel readers, it will
be scalable, but if the number of readers is greater than the size of
the array, you will start to have contention. The third parameter is
the size of the array of my scalable lock that is FIFO fair, called
AMLock; the number of threads can go beyond the size of the array of
the scalable AMLock. Please look at the source code of my scalable
algorithms to understand.
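For example, the constructor can be called like this (an illustrative
sketch only; the exact parameter names and defaults may differ, so
please look at the declaration in the source code inside the zip file):

scalable_rwlock.create('amine', // the name; an empty string means threads-only
                       16,      // size of the readers array: number of parallel readers
                       10);     // size of the array of the scalable AMLock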
The parameters of the constructor of RWLockX are the following: the
first parameter is the name of the scalable RWLock to be used across
processes; if the name is empty, it will only be used across threads.
The second parameter is the mode, that is, the input permission flags
of the named semaphore that is used as a named mutex. Here is how to
set it:
mode
(input) Permission flags.
The mode parameter value is either zero or is obtained by performing an
OR operation on one or more of the constants in the following list. For
another process to open the semaphore, the process's effective UID must
be able to open the semaphore in both read and write mode.
'0x0100' or S_IRUSR Permits the creator of the named semaphore to open
the semaphore in read mode.
'0x0080' or S_IWUSR Permits the creator of the named semaphore to open
the semaphore in write mode.
'0x0020' or S_IRGRP Permits the group associated with the named
semaphore to open the semaphore in read mode.
'0x0010' or S_IWGRP Permits the group associated with the named
semaphore to open the semaphore in write mode.
'0x0004' or S_IROTH Permits others to open the named semaphore in read mode.
'0x0002' or S_IWOTH Permits others to open the named semaphore in write mode.
The third parameter is the size of the array of the readers: if the
size of the array is equal to the number of parallel readers, it will
be scalable, but if the number of readers is greater than the size of
the array, you will start to have contention. Please look at the source
code of my scalable algorithms to understand.
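For example, the constructor can be called like this (an illustrative
sketch only; the exact parameter names may differ, so please look at
the declaration in the source code):

scalable_rwlock.create('amine',                 // the name; an empty string means threads-only
                       $0100 or $0080 or $0004, // mode: S_IRUSR or S_IWUSR or S_IROTH
                       16);                     // size of the readers array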
Here is how to use my new inventions, my scalable LW_RWLockX and
RWLockX, across processes:
Just create a scalable RWLock object by giving it a name in one
process, by calling the constructor like this:
scalable_rwlock.create('amine');
And you can use the scalable RWLock object from another process by
calling the constructor with the same name, like this:
scalable_rwlock.create('amine');
So as you are noticing, I have abstracted it efficiently.
I have also used my following implementation of the FNV-1a hash
function to make my new variants of RWLocks scalable (since FNV-1a is a
hash algorithm that has good dispersion):
function FNV1aHash(key: Int64): UInt64;
var
  i: Integer;
  key1: UInt64;
const
  // written in hexadecimal because the decimal literal
  // 14695981039346656037 exceeds High(Int64) and is rejected
  // by some compilers
  FNV_offset_basis: UInt64 = $CBF29CE484222325;
  FNV_prime: UInt64 = 1099511628211;
begin
  // FNV-1a: hash the 8 bytes of the Int64 key, least significant first
  Result := FNV_offset_basis;
  for i := 1 to 8 do
  begin
    key1 := (key shr ((i - 1) * 8)) and $FF; // extract byte i of the key
    Result := (Result xor key1) * FNV_prime;
  end;
end;
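For example, such a hash can be used to spread the threads over the
slots of the readers array (an illustrative sketch only, not the exact
mapping of my source code; ReadersArraySize stands here for the size of
the readers array given to the constructor, and GetCurrentThreadId is
the Windows API call, on Linux one can hash the thread id instead):

slot := Integer(FNV1aHash(Int64(GetCurrentThreadId())) mod UInt64(ReadersArraySize));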
You can download my Scalable RWLocks that work across processes and
threads, version 5.0, from:
https://sites.google.com/site/scalable68/scalable-rwlocks-that-work-accross-processes-and-threads
More precision about atomic reference counting and about my scalable
reference counting algorithm..
I think that Rust and C++ and Clang and Swift use atomic reference
counting, that is, an atomic increment and an atomic decrement, which
is not scalable; so my powerful invention of the scalable reference
counting algorithm below is much more powerful, since it is scalable
and wait-free. Read about it in my following thoughts:
More about my powerful inventions of my scalable reference counting
algorithm and of my scalable algorithms..
I invite you to read the following web page:
Why is memory reclamation so important?
https://concurrencyfreaks.blogspot.com/search?q=resilience+and+urcu
Notice that it is saying the following about RCU:
"Reason number 4, resilience
Another reason to go with lock-free/wait-free data structures is because
they are resilient to failures. On a shared memory system with multiple
processes accessing the same data structure, even if one of the
processes dies, the others will be able to progress in their work. This
is the true gem of lock-free data structures: progress in the presence
of failure. Blocking data structures (typically) do not have this
property (there are exceptions though). If we add a blocking memory
reclamation (like URCU) to a lock-free/wait-free data structure, we are
losing this resilience because one dead process will prevent further
memory reclamation and eventually bring down the whole system.
There goes the resilience advantage out the window."
So I think that RCU cannot be used for reference counting, since it is
blocking on the writer side, and thus it is not resilient to failures
because it is not lock-free on the writer side.
So this is why I have invented my powerful scalable reference counting
with efficient support for weak references, which is lock-free for its
scalable reference counting, and here it is:
https://sites.google.com/site/scalable68/scalable-reference-counting-with-efficient-support-for-weak-references
And my scalable reference counting algorithm is of the SCU(0,1) class
of algorithms, so under scheduling conditions which approximate those
found in commercial hardware architectures it becomes wait-free, with a
system latency of O(sqrt(k)) and an individual latency of O(k*sqrt(k)),
where k is the number of threads.
The proof is in the following PhD paper:
https://arxiv.org/pdf/1311.3200.pdf
This paper suggests a simple solution to this problem: it shows that,
for a large class of lock-free algorithms, under scheduling conditions
which approximate those found in commercial hardware architectures,
lock-free algorithms behave as if they are wait-free. In other words,
programmers can keep on designing simple lock-free algorithms instead
of complex wait-free ones, and in practice they will get wait-free
progress. On the analysis of the class SCU(q, s) it says:
"Given an algorithm in SCU(q, s) on k correct processes under a uniform
stochastic scheduler, the system latency is O(q + s*sqrt(k)), and the
individual latency is O(k(q + s*sqrt(k)))."
More precision about my new inventions of scalable algorithms..
And look at my powerful inventions below of LW_Fast_RWLockX and
Fast_RWLockX, two powerful scalable RWLocks that are FIFO fair and
starvation-free and costless on the reader side (that means with no
atomics and with no fences on the reader side). They use sys_membarrier
expedited on Linux and FlushProcessWriteBuffers() on Windows, and if
you look at the source code of my LW_Fast_RWLockX.pas and
Fast_RWLockX.pas inside the zip file, you will notice that on Linux
they call two functions, membarrier1() and membarrier2(): membarrier1()
registers the process's intent to use MEMBARRIER_CMD_PRIVATE_EXPEDITED,
and membarrier2() executes a memory barrier on each running thread
belonging to the same process as the calling thread.
Read more here to understand:
https://man7.org/linux/man-pages/man2/membarrier.2.html
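And here is a minimal FreePascal sketch of those two functions for
Linux on x86_64 (the syscall number and the command values below are
hardcoded assumptions for x86_64, so please verify them against your
kernel headers; the real code is inside the zip file):

uses SysCall;

const
  NR_membarrier = 324; // __NR_membarrier on Linux x86_64 (verify for your architecture)
  MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED = 16;
  MEMBARRIER_CMD_PRIVATE_EXPEDITED = 8;

// Registers the process's intent to use MEMBARRIER_CMD_PRIVATE_EXPEDITED
function membarrier1: PtrInt;
begin
  Result := Do_SysCall(NR_membarrier, MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0);
end;

// Executes a memory barrier on each running thread of the calling process
function membarrier2: PtrInt;
begin
  Result := Do_SysCall(NR_membarrier, MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0);
end;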
Here are my new powerful inventions of scalable algorithms..
I have just updated my powerful inventions of LW_Fast_RWLockX and
Fast_RWLockX, two powerful scalable RWLocks that are FIFO fair and
starvation-free and costless on the reader side (that means with no
atomics and with no fences on the reader side); they use sys_membarrier
expedited on Linux and FlushProcessWriteBuffers() on Windows, and now
they work on both Linux and Windows. And I think my inventions are
really smart, since the following PhD researcher says this:
"Until today, there is no known efficient reader-writer lock with
starvation-freedom guarantees;"
Read more here:
http://concurrencyfreaks.blogspot.com/2019/04/onefile-and-tail-latency.html
So, as you have just noticed, he says that there is no known efficient
reader-writer lock with starvation-freedom guarantees, and I think that
my above powerful inventions of scalable reader-writer locks are
efficient and FIFO fair and starvation-free.
LW_Fast_RWLockX is a lightweight scalable reader-writer mutex that uses
a technique that looks like Seqlock, but without looping on the reader
side like Seqlock does, and this has permitted the reader side to be
costless; it is fair, it is of course starvation-free, and it does
spin-wait. Fast_RWLockX is likewise a lightweight scalable
reader-writer mutex that uses a technique that looks like Seqlock
without looping on the reader side, and this has permitted the reader
side to be costless; it is fair, it is of course starvation-free, and
it does not spin-wait, but waits on my SemaMonitor, so it is energy
efficient.
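To contrast, here is a minimal FreePascal sketch of the classic Seqlock
reader-side retry loop (this is the well-known technique, not my
algorithm; my inventions above avoid this loop, and that is what makes
their reader side costless):

var
  Seq: LongWord;     // sequence counter; the writer makes it odd while writing
  SharedData: Int64; // the data protected by the Seqlock

function SeqlockRead: Int64;
var
  s1, s2: LongWord;
begin
  repeat
    s1 := Seq;            // snapshot the sequence before reading
    ReadBarrier;          // order the reads (FPC intrinsic)
    Result := SharedData; // read the protected data
    ReadBarrier;
    s2 := Seq;            // snapshot the sequence after reading
  until (s1 = s2) and ((s1 and 1) = 0); // retry if a writer overlapped
end;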
You can read about them and download them from my website here:
https://sites.google.com/site/scalable68/scalable-rwlock
And here are my other inventions, new variants of scalable RWLocks that
are FIFO fair and starvation-free:
https://sites.google.com/site/scalable68/new-variants-of-scalable-rwlocks
More about the energy efficiency of transactional memory and more..
I have just read the following PhD paper, which is also about the
energy efficiency of transactional memory; here it is:
Techniques for Enhancing the Efficiency of Transactional Memory Systems
http://kth.diva-portal.org/smash/get/diva2:1258335/FULLTEXT02.pdf
And I think it presents the best known energy-efficient algorithm for
transactional memory, but I think it is not good, since look at how,
for 64 cores, the Beta parameter can be 16 cores. So I think I am
smart: I have just invented a much more energy-efficient and powerful
scalable fast mutex, and I have also just invented scalable RWLocks
that are starvation-free and fair; read about them in my writing and
thoughts below:
More about deadlocks and lock-based systems and more..
I have just read the following from a software engineer from Quebec,
Canada:
A deadlock-detecting mutex
https://faouellet.github.io/ddmutex/
And I have rapidly understood his algorithm, but I think his algorithm
is not efficient at all, since we can detect a cycle (a strongly
connected component) in a graph in around O(V+E) time, whereas the
algorithm above of the engineer from Quebec, Canada takes around
O(n*(V+E)) time, so it is not good.
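For precision, here is a minimal FreePascal sketch of O(V+E) deadlock
detection by depth-first search on a wait-for graph (an illustrative
sketch with a representation of my choosing here, not the code of
DelphiConcurrent below):

type
  TColor = (cWhite, cGray, cBlack);

var
  Adj: array of array of Integer; // Adj[v] lists the threads that thread v waits for
  Color: array of TColor;

function HasCycleFrom(v: Integer): Boolean;
var
  i, w: Integer;
begin
  Color[v] := cGray; // v is on the current DFS path
  for i := 0 to High(Adj[v]) do
  begin
    w := Adj[v][i];
    if Color[w] = cGray then
      Exit(True); // back edge: a cycle, hence a deadlock
    if (Color[w] = cWhite) and HasCycleFrom(w) then
      Exit(True);
  end;
  Color[v] := cBlack; // fully explored, on no cycle
  Result := False;
end;

function HasDeadlock(NbThreads: Integer): Boolean;
var
  v: Integer;
begin
  SetLength(Color, NbThreads);
  for v := 0 to NbThreads - 1 do
    Color[v] := cWhite;
  Result := False;
  for v := 0 to NbThreads - 1 do
    if (Color[v] = cWhite) and HasCycleFrom(v) then
      Exit(True);
end;

Each vertex and each edge is visited at most once, hence the O(V+E)
bound.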
So a much better way is to use my following way of detecting deadlocks:
DelphiConcurrent and FreepascalConcurrent are here
Read more here on my website:
https://sites.google.com/site/scalable68/delphiconcurrent-and-freepascalconcurrent
And I will soon enhance DelphiConcurrent and FreepascalConcurrent much
more to support both communication deadlocks and resource deadlocks.
About Transactional memory and locks..
I have just read the following paper about Transactional memory and locks:
http://sunnydhillon.net/assets/docs/concurrency-tm.pdf
I don't agree with the above paper; read my following thoughts to
understand:
I have just invented a new powerful scalable fast mutex, and it has the following characteristics:
1- Starvation-free
2- Tunable fairness
3- It keeps its cache-coherence traffic efficiently very low
4- Very good fast-path performance
5- Good preemption tolerance
6- It is faster than the scalable MCS lock
7- It solves the problem of lock convoying
So my new invention also solves the following problem:
The convoy phenomenon
https://blog.acolyer.org/2019/07/01/the-convoy-phenomenon/
And here is my other new invention, a scalable RWLock that works across
processes and threads and that is starvation-free and fair; I will soon
enhance it much more, and it will become really powerful:
https://sites.google.com/site/scalable68/scalable-rwlock-that-works-accross-processes-and-threads
And about Lock-free versus Lock, read my following post:
https://groups.google.com/forum/#!topic/comp.programming.threads/F_cF4ft1Qic
And about deadlocks, here is also how I have solved them, and I will
soon enhance DelphiConcurrent and FreepascalConcurrent much more:
DelphiConcurrent and FreepascalConcurrent are here
Read more here on my website:
https://sites.google.com/site/scalable68/delphiconcurrent-and-freepascalconcurrent
So I think that with my above scalable fast mutex and my scalable
RWLocks that are starvation-free and fair, and by reading the following
about the composability of lock-based systems, you will notice that
lock-based systems are still useful.
"About composability of lock-based systems..
Design your systems to be composable. Among the more galling claims of
the detractors of lock-based systems is the notion that they are somehow uncomposable: “Locks and condition variables do not support modular programming,” reads one typically brazen claim, “building large programs
by gluing together smaller programs[:] locks make this impossible.”9 The claim, of course, is incorrect. For evidence one need only point at the composition of lock-based systems such as databases and operating
systems into larger systems that remain entirely unaware of lower-level locking.
There are two ways to make lock-based systems completely composable, and
each has its own place. First (and most obviously), one can make locking entirely internal to the subsystem. For example, in concurrent operating systems, control never returns to user level with in-kernel locks held;
the locks used to implement the system itself are entirely behind the
system call interface that constitutes the interface to the system. More generally, this model can work whenever a crisp interface exists between software components: as long as control flow is never returned to the
caller with locks held, the subsystem will remain composable.
Second (and perhaps counterintuitively), one can achieve concurrency and composability by having no locks whatsoever. In this case, there must be
no global subsystem state—subsystem state must be captured in
per-instance state, and it must be up to consumers of the subsystem to
assure that they do not access their instance in parallel. By leaving
locking up to the client of the subsystem, the subsystem itself can be
used concurrently by different subsystems and in different contexts. A
concrete example of this is the AVL tree implementation used extensively
in the Solaris kernel. As with any balanced binary tree, the
implementation is sufficiently complex to merit componentization, but by
not having any global state, the implementation may be used concurrently
by disjoint subsystems—the only constraint is that manipulation of a
single AVL tree instance must be serialized."
Read more here:
https://queue.acm.org/detail.cfm?id=1454462
About mathematics and about abstraction..
I think my specialization is also that I have invented many software
algorithms and scalable software algorithms, and I am still inventing
others. Those algorithms and scalable algorithms that I have invented
are like mathematical theorems that you prove and then present at a
higher level of abstraction, but not only that: those algorithms and
scalable algorithms of mine are presented in the form of a higher-level
software abstraction that hides the complexity of my scalable
algorithms and algorithms, and that is the part that interests me most.
For example, notice how I construct a higher-level abstraction in my
following tutorial, as a methodology that first permits modeling the
synchronization objects of parallel programs with the logic primitives
If-Then-OR-AND, so as to make them easy to translate to Petri nets, so
as to detect deadlocks in parallel programs. Please take a look at it
at my following web link, because this tutorial of mine is the way of
learning by higher-level abstraction:
How to analyse parallel applications with Petri Nets
https://sites.google.com/site/scalable68/how-to-analyse-parallel-applications-with-petri-nets
So notice that my methodology is a generalization that solves both of
the following kinds of deadlocks in parallel programs:
1- Communication deadlocks that result from incorrect use of
event objects or condition variables (i.e. wait-notify
synchronization).
2- Resource deadlocks, a common kind of deadlock in which a set of
threads blocks forever because each thread in the set is waiting to
acquire a lock held by another thread in the set.
This is what interests me in mathematics: I want to work efficiently in
mathematics at a much higher level of abstraction. I will give you an
example of what I am doing in mathematics so that you understand: look
at how I implement mathematics as software parallel conjugate gradient
linear system solvers that scale very well, and how I present them at a
higher level of abstraction; this is how I abstract their mathematics.
Read the following about it to notice it:
About the SOR and conjugate gradient mathematical methods..
I have just looked at SOR (the Successive Over-Relaxation method), and
I think it is much less powerful than the conjugate gradient method;
read the following to notice it:
COMPARATIVE PERFORMANCE OF THE CONJUGATE GRADIENT AND SOR METHODS
FOR COMPUTATIONAL THERMAL HYDRAULICS
https://inis.iaea.org/collection/NCLCollectionStore/_Public/19/055/19055644.pdf?r=1&r=1
This is why I have implemented, in both C++ and Delphi, my Parallel
Conjugate Gradient Linear System Solver Library that scales very well;
read my following thoughts about it to understand more:
About the convergence properties of the conjugate gradient method
The conjugate gradient method can theoretically be viewed as a direct
method, as it produces the exact solution after a finite number of
iterations, which is not larger than the size of the matrix, in the
absence of round-off error. However, the conjugate gradient method is
unstable with respect to even small perturbations, e.g., most directions
are not in practice conjugate, and the exact solution is never obtained. Fortunately, the conjugate gradient method can be used as an iterative
method as it provides monotonically improving approximations to the
exact solution, which may reach the required tolerance after a
relatively small (compared to the problem size) number of iterations.
The improvement is typically linear and its speed is determined by the
condition number κ(A) of the system matrix A: the larger κ(A) is, the
slower the improvement.
Read more here:
http://pages.stat.wisc.edu/~wahba/stat860public/pdf1/cj.pdf
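To be more precise, the well-known convergence bound of the conjugate
gradient method, stated in the A-norm of the error and in terms of the
condition number κ(A), is the following (a standard textbook result):

\| x_k - x_* \|_A \le 2 \left( \frac{\sqrt{\kappa(A)} - 1}{\sqrt{\kappa(A)} + 1} \right)^k \| x_0 - x_* \|_A

So the larger κ(A) is, the closer the contraction factor is to 1, and
the slower the convergence, as said above.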
So I think my Conjugate Gradient Linear System Solver Library that
scales very well is still very useful; read about it in my writing
below:
Read the following interesting news:
The finite element method finds its place in games
Read more here:
https://translate.google.com/translate?hl=en&sl=auto&tl=en&u=https%3A%2F%2Fhpc.developpez.com%2Factu%2F288260%2FLa-methode-des-elements-finis-trouve-sa-place-dans-les-jeux-AMD-propose-la-bibliotheque-FEMFX-pour-une-simulation-en-temps-reel-des-deformations%2F
But you have to be aware that the finite element method uses the
conjugate gradient method for the solution of finite element problems;
read here to notice it:
Conjugate Gradient Method for Solution of Large Finite Element Problems
on CPU and GPU
https://pdfs.semanticscholar.org/1f4c/f080ee622aa02623b35eda947fbc169b199d.pdf
This is why I have also designed and implemented my Parallel Conjugate
Gradient Linear System Solver Library that scales very well; here it
is:
My Parallel C++ Conjugate Gradient Linear System Solver Library
that scales very well version 1.76 is here..
Author: Amine Moulay Ramdane
Description:
This library contains a parallel implementation of a Conjugate Gradient
Dense Linear System Solver that is NUMA-aware and cache-aware and that
scales very well, and it also contains a parallel implementation of a
Conjugate Gradient Sparse Linear System Solver that is cache-aware and
scales very well.
Sparse linear system solvers are ubiquitous in high-performance
computing (HPC) and are often the most computationally intensive parts
of scientific computing codes. A few of the many applications relying
on sparse linear solvers include fusion energy simulation, space
weather simulation, climate modeling, environmental modeling, the
finite element method, and large-scale reservoir simulations to enhance
oil recovery in the oil and gas industry.
Conjugate gradient is known to converge to the exact solution in n
steps for a matrix of size n, and because of this it was historically
first seen as a direct method. However, after a while people figured
out that it works really well if you just stop the iteration much
earlier: often you will get a very good approximation after many fewer
than n steps. In fact, we can analyze how fast conjugate gradient
converges. The end result is that conjugate gradient is used today as
an iterative method for large linear systems.
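To illustrate, here is a minimal FreePascal sketch of the textbook
conjugate gradient iteration with early stopping (this is the standard
sequential algorithm for illustration only, not the parallel,
NUMA-aware and cache-aware code of my library):

type
  TVec = array of Double;
  TMat = array of array of Double;

function Dot(const a, b: TVec): Double;
var
  i: Integer;
begin
  Result := 0.0;
  for i := 0 to High(a) do
    Result := Result + a[i] * b[i];
end;

// Solves A*x = b for a symmetric positive-definite matrix A;
// x holds the initial guess on entry and the approximation on exit.
procedure ConjugateGradient(const A: TMat; const b: TVec; var x: TVec;
                            Tol: Double; MaxIter: Integer);
var
  r, p, Ap: TVec;
  alpha, rsold, rsnew: Double;
  i, j, k, n: Integer;
begin
  n := Length(b);
  SetLength(r, n); SetLength(p, n); SetLength(Ap, n);
  for i := 0 to n - 1 do // r := b - A*x ; p := r
  begin
    r[i] := b[i];
    for j := 0 to n - 1 do
      r[i] := r[i] - A[i][j] * x[j];
    p[i] := r[i];
  end;
  rsold := Dot(r, r);
  for k := 1 to MaxIter do
  begin
    for i := 0 to n - 1 do // Ap := A*p
    begin
      Ap[i] := 0.0;
      for j := 0 to n - 1 do
        Ap[i] := Ap[i] + A[i][j] * p[j];
    end;
    alpha := rsold / Dot(p, Ap);
    for i := 0 to n - 1 do
    begin
      x[i] := x[i] + alpha * p[i];  // improve the approximation
      r[i] := r[i] - alpha * Ap[i]; // update the residual
    end;
    rsnew := Dot(r, r);
    if Sqrt(rsnew) < Tol then
      Break; // typically reached after many fewer than n steps
    for i := 0 to n - 1 do
      p[i] := r[i] + (rsnew / rsold) * p[i]; // new search direction
    rsold := rsnew;
  end;
end;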
Please download the zip file and read the readme file inside the zip to
know how to use it.
You can download it from:
https://sites.google.com/site/scalable68/scalable-parallel-c-conjugate-gradient-linear-system-solver-library
Language: GNU C++ and Visual C++ and C++Builder
Operating Systems: Windows, Linux, Unix and Mac OS X on (x86)
--
As you have noticed, I have just written above about my Parallel C++
Conjugate Gradient Linear System Solver Library that scales very well,
but here are my Parallel Delphi and FreePascal Conjugate Gradient
Linear System Solver Libraries that scale very well:
Parallel implementation of Conjugate Gradient Dense Linear System solver library that is NUMA-aware and cache-aware that scales very well
https://sites.google.com/site/scalable68/scalable-parallel-implementation-of-conjugate-gradient-dense-linear-system-solver-library-that-is-numa-aware-and-cache-aware
Parallel implementation of Conjugate Gradient Sparse Linear System
Solver library that scales very well
https://sites.google.com/site/scalable68/scalable-parallel-implementation-of-conjugate-gradient-sparse-linear-system-solver
Thank you,
Amine Moulay Ramdane.