• About Lock elision and Transactional memory..

    From Wisdom91@21:1/5 to All on Thu Jun 18 12:12:29 2020
    Hello...


    About Lock elision and Transactional memory..

    I have just read the following:

    Lock elision in the GNU C library

    https://lwn.net/Articles/534758/

    So it says the following:

    "Lock elision uses the same programming model as normal locks, so it can
    be directly applied to existing programs. The programmer keeps using
    locks, but the locks are faster as they can use hardware transactional
    memory internally for more parallelism. Lock elision uses memory
    transactions as a fast path, while the slow path is still a normal lock.
    Deadlocks and other classic locking problems are still possible, because
    the transactions may fall back to a real lock at any time."

    So I think this is not good, because one of the benefits of
    transactional memory is that it solves the deadlock problem, but
    lock elision brings the deadlock problem back.
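
    Since an elided lock can fall back to a real lock at any time, the
    usual deadlock discipline still applies to code that uses it. A minimal
    sketch of one such discipline, ordered lock acquisition, in Python; the
    Account class and the transfer and run_demo functions are hypothetical
    names made up for this illustration and are not part of glibc or of my
    libraries:

```python
import threading


class Account:
    """Hypothetical example type: a balance guarded by its own lock."""

    def __init__(self, balance):
        self.balance = balance
        self.lock = threading.Lock()


def transfer(src, dst, amount):
    """Move amount from src to dst without risking a lock-order deadlock."""
    if src is dst:
        return  # nothing to do, and taking the same lock twice would hang
    # Always take the two locks in one global order (here: by id()), so two
    # opposite transfers can never each hold one lock and wait for the other.
    first, second = sorted((src, dst), key=id)
    with first.lock:
        with second.lock:
            src.balance -= amount
            dst.balance += amount


def run_demo(rounds=1000):
    """Run opposite transfers concurrently and return the final balances."""
    a, b = Account(100), Account(100)

    def a_to_b():
        for _ in range(rounds):
            transfer(a, b, 1)

    def b_to_a():
        for _ in range(rounds):
            transfer(b, a, 1)

    workers = [threading.Thread(target=a_to_b), threading.Thread(target=b_to_a)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return a.balance, b.balance
```

    If transfer instead locked src first and dst second, the two threads in
    run_demo could each hold one lock and wait forever for the other,
    whether or not the locks are elided on the fast path.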

    More about Locks and Transactional memory..

    I have just looked at the following webpage about understanding
    Transactional memory performance:

    https://www.cs.utexas.edu/users/witchel/pubs/porter10ispass-tm-slides.pdf

    As you can see, it says that in practice transactional memory performs
    worse than locks under high contention: about 40% worse than locks at
    100% contention.

    This is why I have invented scalable Locks and scalable RWLocks; read
    my following thoughts for the details:


    About beating Moore's Law with software..

    bmoore has responded to me with the following:

    https://groups.google.com/forum/#!topic/soc.culture.china/Uu15FIknU0s

    As you can see, he is asking me the following:

    "Are you talking about beating Moore's Law with software?"

    But I think there are constraints such as the following:

    "Modern programing environments contribute to the problem of software
    bloat by placing ease of development and portable code above speed or
    memory usage. While this is a sound business model in a commercial
    environment, it does not make sense where IT resources are constrained.
    Languages such as Java, C-Sharp, and Python have opted for code
    portability and software development speed above execution speed and
    memory usage, while modern data storage and transfer standards such as
    XML and JSON place flexibility and readability above efficiency."

    Read the following:

    https://smallwarsjournal.com/jrnl/art/overcoming-death-moores-law-role-software-advances-and-non-semiconductor-technologies

    There also remains the following way to beat Moore's Law:

    "Improved Algorithms

    Hardware improvements mean little if software cannot effectively use the
    resources available to it. The Army should shape future software
    algorithms by funding basic research on improved software algorithms to
    meet its specific needs. The Army should also search for new algorithms
    and techniques which can be applied to meet specific needs and develop a
    learning culture within its software community to disseminate this
    information."


    And about scalable algorithms: as you know, I am a white Arab who is
    an inventor of many scalable algorithms and their implementations;
    read my following thoughts for the details:

    About my new invention that is a scalable algorithm..

    I am a white Arab, and I think I am smart, like a genius, because I
    have again just invented a new scalable algorithm. But first I will
    briefly talk about the two best existing scalable reader-writer lock
    inventions. The first one is the following:

    Scalable Read-mostly Synchronization Using Passive Reader-Writer Locks

    https://www.usenix.org/system/files/conference/atc14/atc14-paper-liu.pdf

    You will notice that its first weakness is that it targets the TSO
    hardware memory model, and its second weakness is that writer latency
    is very expensive when there are few readers.

    And here is the other best scalable reader-writer lock invention, from
    Facebook:

    SharedMutex is a reader-writer lock. It is small, very fast, and
    scalable on multi-core.

    Read here:

    https://github.com/facebook/folly/blob/master/folly/SharedMutex.h


    But you will notice that the weakness of this scalable reader-writer
    lock is that the priority can only be configured as follows:

    SharedMutexReadPriority gives priority to readers,
    SharedMutexWritePriority gives priority to writers.


    So the weakness of this scalable reader-writer lock is that
    you can have starvation with it.

    So this is why I have just invented a scalable algorithm that is
    a scalable reader-writer lock that is better than the above: it is
    starvation-free, it is fair, and it has a small writer latency.
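
    To make the starvation problem concrete, here is a minimal sketch in
    Python of the textbook "turnstile" reader-writer lock, in which a
    waiting writer stops new readers from entering. This is NOT my scalable
    algorithm and it is not scalable; the class name NoStarveRWLock and the
    run_demo helper are hypothetical names chosen for the illustration.
    Strict FIFO fairness would additionally need a fair turnstile, which
    Python's threading.Lock does not promise.

```python
import threading


class NoStarveRWLock:
    """Textbook reader-writer lock where writers cannot be starved by readers."""

    def __init__(self):
        self._turnstile = threading.Lock()   # held by a waiting/active writer
        self._room_empty = threading.Lock()  # held while readers or a writer are inside
        self._count_lock = threading.Lock()  # protects the reader counter
        self._readers = 0

    def acquire_read(self):
        # Pass through the turnstile; a writer keeps it closed while it waits.
        with self._turnstile:
            pass
        with self._count_lock:
            self._readers += 1
            if self._readers == 1:
                self._room_empty.acquire()  # first reader locks writers out

    def release_read(self):
        with self._count_lock:
            self._readers -= 1
            if self._readers == 0:
                self._room_empty.release()  # last reader lets writers in

    def acquire_write(self):
        self._turnstile.acquire()   # stop new readers from entering
        self._room_empty.acquire()  # wait for the current readers to leave

    def release_write(self):
        self._room_empty.release()
        self._turnstile.release()


def run_demo(readers=4, writers=2, rounds=200):
    """Return (final value, number of times a reader saw an active writer)."""
    lock = NoStarveRWLock()
    state = {"value": 0, "writer_active": False, "violations": 0}

    def writer():
        for _ in range(rounds):
            lock.acquire_write()
            state["writer_active"] = True
            state["value"] += 1
            state["writer_active"] = False
            lock.release_write()

    def reader():
        for _ in range(rounds):
            lock.acquire_read()
            if state["writer_active"]:
                state["violations"] += 1
            lock.release_read()

    pool = [threading.Thread(target=writer) for _ in range(writers)]
    pool += [threading.Thread(target=reader) for _ in range(readers)]
    for thread in pool:
        thread.start()
    for thread in pool:
        thread.join()
    return state["value"], state["violations"]
```

    With a pure reader-priority lock, the writers in run_demo could wait for
    as long as new readers keep arriving; the turnstile removes that case.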

    So I think mine is the best, and I will sell many of my scalable
    algorithms to software companies such as Microsoft, Google, or Embarcadero.


    What is it to be an inventor of many scalable algorithms?

    The Holy Grail of parallel programming is to provide good speedup while
    hiding or avoiding the pitfalls of concurrency. You have to understand
    this to be able to understand what I am doing. I am an inventor of
    many scalable algorithms and their implementations, but how can we
    define the kind of inventor that I am? I think there are the following
    kinds of inventors: those who are PhD researchers and inventors, like
    Albert Einstein, and those who are engineers and inventors, like
    Nikola Tesla. I think that I am of the Nikola Tesla kind: I am not a
    PhD researcher like Albert Einstein, I am like an engineer who invented
    many scalable algorithms and their implementations. So I am like the
    following inventor, Nikola Tesla:

    https://en.wikipedia.org/wiki/Nikola_Tesla

    But I think that both the PhD researchers who are inventors and the
    engineers who are inventors are powerful.

    You have to understand deeply what it is to invent my scalable algorithms
    and their implementations in order to understand that it is powerful.
    I give you an example: I have invented a scalable algorithm that is a
    scalable Mutex that is remarkable and that is the Holy Grail of scalable
    Locks. It has the following characteristics; read my following thoughts
    to understand:

    About fair and unfair locking..

    I have just read the following from a lead engineer at Amazon:

    Highly contended and fair locking in Java

    https://brooker.co.za/blog/2012/09/10/locking.html

    As you can see, you can use unfair locking, which can have starvation,
    or fair locking, which is slower than unfair locking.
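
    As an illustration of what "fair" means here, a minimal sketch in Python
    of a ticket lock, the classic FIFO-fair lock: threads are served
    strictly in the order in which they took a ticket. This is not my Fast
    Mutex and it has no tunable fairness; TicketLock and run_demo are
    hypothetical names used only for this example.

```python
import threading


class TicketLock:
    """FIFO-fair lock: each thread takes a ticket and waits for its turn."""

    def __init__(self):
        self._cond = threading.Condition()
        self._next_ticket = 0  # next ticket number to hand out
        self._now_serving = 0  # ticket number allowed to hold the lock

    def acquire(self):
        with self._cond:
            ticket = self._next_ticket
            self._next_ticket += 1
            # Wait until every earlier ticket has been served.
            while self._now_serving != ticket:
                self._cond.wait()
            return ticket

    def release(self):
        with self._cond:
            self._now_serving += 1
            # Wake all waiters; only the holder of the next ticket proceeds.
            self._cond.notify_all()


def run_demo(threads=4, rounds=100):
    """Return the tickets in the order in which they entered the critical section."""
    lock = TicketLock()
    served = []

    def worker():
        for _ in range(rounds):
            ticket = lock.acquire()
            served.append(ticket)  # critical section
            lock.release()

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for thread in pool:
        thread.start()
    for thread in pool:
        thread.join()
    return served
```

    The list returned by run_demo is always in increasing ticket order, which
    is the fairness guarantee. The price is the one described above: every
    handoff has to wake the one specific thread whose turn it is, which is
    slower than letting whichever thread is already running grab the lock.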

    I think that Microsoft synchronization objects like the Windows critical
    section use unfair locking, so they can still have starvation.

    But I think that this is not the right way to do it, because I am an
    inventor and I have invented a scalable Fast Mutex that is much more
    powerful: with my Fast Mutex you can tune the "fairness" of the lock,
    and my Fast Mutex is capable of more than that; read about it in
    my following thoughts:

    More about research and software development..

    I have just looked at the following new video:

    Why is coding so hard...

    https://www.youtube.com/watch?v=TAAXwrgd1U8

    I understand this video, but I have to explain my work:

    I am not like the tech lead in the video above, because I am also an
    "inventor" who has invented many scalable algorithms and their
    implementations. I am also inventing effective abstractions; I give you
    an example:

    Read the following from the senior research scientist Dave Dice:

    Preemption tolerant MCS locks

    https://blogs.oracle.com/dave/preemption-tolerant-mcs-locks

    As you can see, he is trying to invent a new lock that is preemption
    tolerant, but his lock lacks some important characteristics. This is why
    I have just invented a new Fast Mutex that is adaptive and that is
    much, much better. I think mine is the "best", and I think you will
    not find it anywhere. My new Fast Mutex has the following characteristics:

    1- Starvation-free
    2- Tunable fairness
    3- It keeps its cache-coherence traffic very low
    4- Very good fast path performance
    5- Good preemption tolerance
    6- It is faster than the scalable MCS lock

    This is how I am an "inventor". I have also invented other scalable
    algorithms, such as scalable reference counting with efficient support
    for weak references, a fully scalable Threadpool, and a fully scalable
    FIFO queue, along with other scalable algorithms and their
    implementations, and I think I will sell some of them to Microsoft,
    Google, Embarcadero, or similar software companies.

    And here is my other previous new invention of a scalable algorithm:

    I have just read the following PhD paper about the invention that we
    call counting networks, which are better than software combining trees:

    Counting Networks

    http://people.csail.mit.edu/shanir/publications/AHS.pdf

    And i have read the following PhD paper:

    http://people.csail.mit.edu/shanir/publications/HLS.pdf

    As you can see, they say in the conclusion that:

    "Software combining trees and counting networks which are the only
    techniques we observed to be truly scalable"

    But I just found that this counting networks algorithm is not generally
    scalable, and I have the logical proof of it. This is why I have just
    come up with a new invention that enhances the counting networks
    algorithm to be generally scalable. And I think I will sell my new
    algorithm of generally scalable counting networks to Microsoft, Google,
    Embarcadero, or similar software companies.

    So you have to be careful with the current counting networks algorithm,
    which is not generally scalable.

    My other new invention is my scalable reference counting and here it is:

    https://sites.google.com/site/scalable68/scalable-reference-counting-with-efficient-support-for-weak-references

    And here is my newest invention of a scalable algorithm:

    My Scalable RWLock that works across processes and threads was updated
    to version 4.62

    I think it now works correctly on both Windows and Linux.

    You can download it from my website here:

    https://sites.google.com/site/scalable68/scalable-rwlock-that-works-accross-processes-and-threads

    More about me as an inventor of many scalable algorithms..

    I am a white Arab and I think I am like a genius, because I have
    invented many scalable algorithms and their implementations; look,
    for example, at my newest invention of a scalable algorithm here:

    https://sites.google.com/site/scalable68/scalable-rwlock-that-works-accross-processes-and-threads

    As you have noticed, you have to be like a genius to be able to invent
    my above scalable algorithm of a scalable RWLock, because it has the
    following characteristics:

    1- It is Scalable
    2- It is Starvation-free
    3- It is fair
    4- It can be used across processes and threads
    5- It can be used as a scalable Lock across processes and threads
    by using my scalable AMLock that is FIFO fair on the writers side,
    or it can be used as a scalable RWLock.

    On the writers side I am using my scalable FIFO-fair Lock, which is
    called scalable AMLock.

    Here is why scalable Locks are really important:

    https://queue.acm.org/detail.cfm?id=2698990

    So all in all it is a really good invention of mine.

    Read my previous thoughts:

    Here is how to use my new invention that is my scalable RWLock
    across processes:

    Just create a scalable rwlock object in one process by calling the
    constructor with a name, like this:

    scalable_rwlock.create('amine');


    And you can use the scalable rwlock object from another process by
    calling the constructor with the same name, like this:

    scalable_rwlock.create('amine');


    As you can see, I have abstracted it efficiently.


    Read the rest of my previous thoughts:

    My new invention of a Scalable RWLock that works across processes and
    threads is here, and now it works on both Windows and Linux..

    Please download my source code and take a look at how I make it
    work across processes by using the FNV1a hash on both the process ID
    and the thread ID. FNV1a has good dispersion, and the FNV1a hash also
    allows my RWLock to be scalable.


    You can download it from my website here:

    https://sites.google.com/site/scalable68/scalable-rwlock-that-works-accross-processes-and-threads

    Description:

    This is my invention of a fast, scalable, starvation-free, fair, and
    lightweight Multiple-Readers-Exclusive-Writer Lock called LW_RWLockX;
    it works across processes and threads.

    The parameters of the constructor are as follows. The first parameter is
    the name of the scalable RWLock to be used across processes; if the name
    is empty, it will only be used across threads. The second parameter is
    the size of the array of the readers: if the size of the array is equal
    to the number of parallel readers, it will be scalable, but if the
    number of readers is greater than the size of the array, you will start
    to have contention. The third parameter is the size of the array of my
    scalable Lock that is called AMLock; the number of threads can go beyond
    the size of the array of the scalable AMLock. Please look at the source
    code of my scalable algorithms to understand.
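
    To show why the size of the readers array matters, here is a minimal
    sketch in Python of a generic "striped" reader-writer lock: each reader
    is hashed to one slot of an array, and a writer has to take every slot.
    This is only an assumed, simplified design for illustration; it is not
    the LW_RWLockX source, it is neither fair nor starvation-free, and the
    names StripedRWLock and run_demo are hypothetical. Python's built-in
    hash() stands in for the FNV1a hash.

```python
import os
import threading


class StripedRWLock:
    """Readers lock one hashed slot; writers lock all slots in a fixed order."""

    def __init__(self, reader_slots):
        self._slots = [threading.Lock() for _ in range(reader_slots)]

    def _slot_index(self):
        # Stand-in hash of (process ID, thread ID), reduced to a slot number.
        key = (os.getpid(), threading.get_ident())
        return hash(key) % len(self._slots)

    def acquire_read(self):
        index = self._slot_index()
        # Two readers contend only if they hash to the same slot.
        self._slots[index].acquire()
        return index

    def release_read(self, index):
        self._slots[index].release()

    def acquire_write(self):
        # Taking every slot in the same order excludes all readers and
        # avoids deadlock between concurrent writers.
        for slot in self._slots:
            slot.acquire()

    def release_write(self):
        for slot in reversed(self._slots):
            slot.release()


def run_demo(reader_slots=4, readers=4, writers=2, rounds=200):
    """Return (a, b, torn): a and b are updated together; torn counts bad reads."""
    lock = StripedRWLock(reader_slots)
    state = {"a": 0, "b": 0, "torn": 0}

    def writer():
        for _ in range(rounds):
            lock.acquire_write()
            state["a"] += 1
            state["b"] += 1
            lock.release_write()

    def reader():
        for _ in range(rounds):
            index = lock.acquire_read()
            if state["a"] != state["b"]:
                state["torn"] += 1  # would mean a reader overlapped a writer
            lock.release_read(index)

    pool = [threading.Thread(target=writer) for _ in range(writers)]
    pool += [threading.Thread(target=reader) for _ in range(readers)]
    for thread in pool:
        thread.start()
    for thread in pool:
        thread.join()
    return state["a"], state["b"], state["torn"]
```

    In this sketch, with at least as many slots as parallel readers and a
    hash that spreads them well, readers rarely share a slot; with fewer
    slots, several readers map to the same slot and wait for each other.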

    I have also used my following implementation of FNV1a hash function to
    make my new variants of RWLocks scalable (since FNV1a is a hash
    algorithm that has good dispersion):

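    function FNV1aHash(key: Int64): UInt64;
    var
      i: Integer;
      key1: UInt64;
    const
      FNV_offset_basis: UInt64 = 14695981039346656037;
      FNV_prime: UInt64 = 1099511628211;
    begin
      // FNV-1a hash over the 8 bytes of key, lowest byte first
      Result := FNV_offset_basis;
      for i := 1 to 8 do
      begin
        // Extract byte number i of the key
        key1 := (key shr ((i-1)*8)) and $00000000000000ff;
        // XOR the byte in, then multiply by the FNV prime; the product
        // is meant to wrap modulo 2^64, so overflow checking must be off
        Result := (Result xor key1) * FNV_prime;
      end;
    end;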
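
    For readers who want to cross-check the hash outside Pascal, here is a
    minimal Python sketch of the same 64-bit FNV-1a loop. The function
    slot_for is a hypothetical helper: the way it combines the process ID
    and the thread ID (XOR of the two hashes) is only an assumption for the
    example, not necessarily what my source code does.

```python
FNV_OFFSET_BASIS = 0xCBF29CE484222325  # 14695981039346656037
FNV_PRIME = 0x100000001B3              # 1099511628211
MASK64 = 0xFFFFFFFFFFFFFFFF            # keeps results in 64 bits


def fnv1a_64(data: bytes) -> int:
    """Standard 64-bit FNV-1a over a byte string."""
    result = FNV_OFFSET_BASIS
    for byte in data:
        # XOR the byte in first, then multiply (the "1a" order),
        # wrapping modulo 2^64 like unsigned 64-bit arithmetic.
        result = ((result ^ byte) * FNV_PRIME) & MASK64
    return result


def fnv1a_hash_int64(key: int) -> int:
    """Hash the 8 bytes of a 64-bit key, lowest byte first, like the Pascal loop."""
    return fnv1a_64((key & MASK64).to_bytes(8, "little"))


def slot_for(process_id: int, thread_id: int, slots: int) -> int:
    """Hypothetical helper: map a (process, thread) pair to an array slot."""
    combined = fnv1a_hash_int64(process_id) ^ fnv1a_hash_int64(thread_id)
    return combined % slots


if __name__ == "__main__":
    print(hex(fnv1a_64(b"a")))      # published FNV-1a test vector: 0xaf63dc4c8601ec8c
    print(slot_for(1234, 5678, 8))  # some slot number in range(8)
```

    Checking fnv1a_64 against a published test vector such as the one for
    "a" is a quick way to confirm that the constants and the XOR-then-
    multiply order are right before relying on the dispersion of the hash.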

    - Platform: Windows, Unix and Linux on x86

    Required FPC switches: -O3 -Sd

    -Sd for Delphi mode.

    Required Delphi switches: -$H+ -DDelphi

    For Delphi XE-XE7 and Delphi Tokyo use the -DXE switch

    You can configure it as follows from inside the defines.inc file:

    {$DEFINE CPU32} and {$DEFINE Windows32} for 32 bit systems
    {$DEFINE CPU64} and {$DEFINE Windows64} for 64 bit systems


    I am a white Arab, and why have I invented scalable RWLocks and scalable
    Locks?

    Because there is a disadvantage with transactional memory, and
    here it is:

    About Hardware Transactional Memory:

    "As someone who has used TSX to optimize synchronization primitives, you
    can expect to see a ~15-20% performance increase, if (big if) your
    program is heavy on disjoint data access, i.e. a lock is needed for
    correctness, but conflicts are rare in practice. If you have a lot of
    threads frequently writing the same cache lines, you are probably going
    to see worse performance with TSX as opposed to traditional locking. It
    helps to think about TSX as transparently performing optimistic
    concurrency control, which is actually pretty much how it is implemented
    under the hood."

    Read more here:

    https://news.ycombinator.com/item?id=8169697


    As you can see, HTM (hardware transactional memory) and TM in general
    cannot replace locks when doing I/O or when we have a highly contended
    critical section.
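
    The "optimistic concurrency control" view in the quote can be sketched
    in software. The following Python example is only an analogy: it is not
    how TSX works in hardware, and VersionedCell and run_demo are
    hypothetical names. An update is computed without holding the lock and
    committed only if nobody else changed the value in the meantime; after a
    few failed attempts it falls back to doing the whole update under the
    lock, like the slow path of lock elision.

```python
import threading


class VersionedCell:
    """A value with a version number, updated optimistically with a lock fallback."""

    def __init__(self, value):
        self._lock = threading.Lock()
        self._value = value
        self._version = 0

    def snapshot(self):
        with self._lock:
            return self._value, self._version

    def _try_commit(self, expected_version, new_value):
        with self._lock:
            if self._version != expected_version:
                return False  # conflict: someone else committed first
            self._value = new_value
            self._version += 1
            return True

    def update(self, fn, max_attempts=3):
        """Apply fn to the value; return which path performed the update."""
        for _ in range(max_attempts):
            value, version = self.snapshot()
            new_value = fn(value)  # computed without holding the lock
            if self._try_commit(version, new_value):
                return "optimistic"
        # Slow path: too many conflicts, so do the whole update under the lock.
        with self._lock:
            self._value = fn(self._value)
            self._version += 1
            return "fallback"


def run_demo(threads=4, rounds=500):
    """Increment one shared cell from several threads; return the final value."""
    cell = VersionedCell(0)

    def worker():
        for _ in range(rounds):
            cell.update(lambda v: v + 1)

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for thread in pool:
        thread.start()
    for thread in pool:
        thread.join()
    return cell.snapshot()[0]
```

    When all the threads update the same cell, as in run_demo, attempts can
    conflict and the work done before a failed commit is thrown away. This
    is the same reason the quote gives for worse performance when many
    threads write the same cache lines.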


    Read the rest:


    I have just read the following article that appeared in C/C++ Users
    Journal, 23(3), March 2005

    The Trouble With Locks

    http://gotw.ca/publications/mill36.htm


    And here are my thoughts about how to avoid deadlocks and race conditions
    in lock-based systems:

    https://community.idera.com/developer-tools/general-development/f/getit-and-third-party/71464/about-turing-completeness-and-parallel-programming

    Also, I don't agree with him about the composability of lock-based
    systems; read the following to understand:

    "About composability of lock-based systems now:

    Design your systems to be composable. Among the more galling claims of
    the detractors of lock-based systems is the notion that they are somehow
    uncomposable:

    “Locks and condition variables do not support modular programming,”
    reads one typically brazen claim, “building large programs by gluing
    together smaller programs[:] locks make this impossible.” The claim, of
    course, is incorrect. For evidence one need only point at the
    composition of lock-based systems such as databases and operating
    systems into larger systems that remain entirely unaware of lower-level
    locking.

    There are two ways to make lock-based systems completely composable, and
    each has its own place. First (and most obviously), one can make locking
    entirely internal to the subsystem. For example, in concurrent operating
    systems, control never returns to user level with in-kernel locks held;
    the locks used to implement the system itself are entirely behind the
    system call interface that constitutes the interface to the system. More
    generally, this model can work whenever a crisp interface exists between
    software components: as long as control flow is never returned to the
    caller with locks held, the subsystem will remain composable.

    Second (and perhaps counterintuitively), one can achieve concurrency and
    composability by having no locks whatsoever. In this case, there must be
    no global subsystem state—subsystem state must be captured in
    per-instance state, and it must be up to consumers of the subsystem to
    assure that they do not access their instance in parallel. By leaving
    locking up to the client of the subsystem, the subsystem itself can be
    used concurrently by different subsystems and in different contexts. A
    concrete example of this is the AVL tree implementation used extensively
    in the Solaris kernel. As with any balanced binary tree, the
    implementation is sufficiently complex to merit componentization, but by
    not having any global state, the implementation may be used concurrently
    by disjoint subsystems—the only constraint is that manipulation of a
    single AVL tree instance must be serialized."
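
    The two styles in the quote can be sketched together in a few lines of
    Python. All the names here (Counter, Registry, run_demo) are
    hypothetical and only for illustration: Counter has no locks and no
    global state, so its client must serialize access to each instance,
    while Registry keeps its locking entirely internal and never returns to
    the caller with the lock held.

```python
import threading


class Counter:
    """Second style: no locks, no global state; the client serializes access."""

    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1


class Registry:
    """First style: locking is internal and never visible to the caller."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counters = {}

    def bump(self, name):
        # The lock is taken and released inside the call; Registry acts as
        # the client that serializes access to its Counter instances.
        with self._lock:
            counter = self._counters.setdefault(name, Counter())
            counter.increment()
            return counter.value

    def total(self):
        with self._lock:
            return sum(c.value for c in self._counters.values())


def run_demo(threads=4, rounds=250):
    """Bump two named counters from several threads; return the grand total."""
    registry = Registry()

    def worker(name):
        for _ in range(rounds):
            registry.bump(name)

    pool = [
        threading.Thread(target=worker, args=("even" if i % 2 == 0 else "odd",))
        for i in range(threads)
    ]
    for thread in pool:
        thread.start()
    for thread in pool:
        thread.join()
    return registry.total()
```

    A larger program can call Registry from any number of threads without
    knowing that a lock exists, and Counter can be reused elsewhere under a
    different locking scheme, which is the composability the quote
    describes.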

    Read more here:

    https://queue.acm.org/detail.cfm?id=1454462



    Thank you,
    Amine Moulat Ramdane.
