• Re: Graphics cards are different

    From Stefan Monnier@21:1/5 to All on Wed Jan 24 18:38:46 2024
    MitchAlsup1 [2024-01-24 23:18:56] wrote:
    And all of this stuff only works efficiently when there is an embarrassingly large amount of parallelism.

    Not only that: this parallelism needs to be "static".
    An important part of the scheduling needs to be done at compilation time.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Thomas Koenig on Wed Jan 24 23:18:56 2024
    Thomas Koenig wrote:

    Quadibloc <quadibloc@servername.invalid> schrieb:

    Above a certain performance level, _all_ cores are out-of-order.

    That is true for general-purpose CPUs, but not for GPUs - these
    are in-order. I think AMD and NVIDIA differ in their handling
    of register hazards - AMD handles them, NVIDIA depends on the
    compiler (well, whatever you want to call the piece of software
    that translates the intermediate PTX into whatever the graphics
    card itself understands) to do this.
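
    As a rough illustration of what "the compiler handles register hazards" could mean, here is a toy sketch in Python. Everything in it is an assumption made for illustration: the opcode latencies, the instruction tuple format, and the function name static_stalls are hypothetical and do not describe any vendor's actual toolchain. It only tracks read-after-write hazards on an in-order core with fixed, known latencies and computes how many stall cycles would have to be encoded before each instruction.

```python
# Toy model: software, not hardware, resolves register hazards.
# Latencies and the instruction format are made up for illustration.
LATENCY = {"add": 1, "mul": 4, "load": 20}  # hypothetical cycle counts

def static_stalls(program):
    """Return, per instruction, the stall cycles to insert before it.

    program: list of (opcode, dest_register, source_registers).
    The core is in-order and issues one instruction per cycle unless
    it is told to stall. Only read-after-write hazards are modelled.
    """
    ready_at = {}   # register -> cycle at which its value is available
    cycle = 0       # earliest cycle the next instruction could issue
    stalls = []
    for op, dest, srcs in program:
        # Earliest cycle at which every source operand is available.
        earliest = max([ready_at.get(r, 0) for r in srcs] + [cycle])
        stalls.append(earliest - cycle)
        ready_at[dest] = earliest + LATENCY[op]
        cycle = earliest + 1
    return stalls

program = [
    ("load", "r1", ["r0"]),
    ("mul",  "r2", ["r1", "r1"]),   # needs r1: waits for the load
    ("add",  "r3", ["r0", "r0"]),   # independent of r1 and r2
    ("add",  "r4", ["r2", "r3"]),   # needs r2: waits for the mul
]
print(static_stalls(program))  # [0, 19, 0, 2]
```

    A real compiler would also reorder the independent add into the load's shadow instead of just stalling; the sketch leaves the order alone to keep the hazard bookkeeping visible.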


    In general, GPUs handle hazards by context switching to a different
    WARP. In some weird sense, this is maximally OoO; in another sense, it
    is completely in order.

    GPUs context switch between instructions, and any instruction which
    is not 1-cycle* uses the context switch mechanism. (*) instructions
    which are 1 cycle can still context switch to a different WARP every
    cycle.
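
    The switching described above can be sketched with a toy round-robin model in Python. This is an illustration under stated assumptions, not a description of real hardware: every instruction is given the same fixed latency, a warp cannot issue again until its previous instruction has completed, at most one instruction issues per cycle, and the function name utilization and its parameters are hypothetical.

```python
def utilization(num_warps, latency, instructions_per_warp=100):
    """Fraction of cycles in which some warp issues an instruction.

    Toy model: every instruction takes `latency` cycles, a warp cannot
    issue again until its previous instruction completes, and the
    scheduler issues at most one instruction per cycle.
    """
    ready_at = [0] * num_warps                  # cycle each warp can next issue
    remaining = [instructions_per_warp] * num_warps
    total = num_warps * instructions_per_warp
    cycle = issued = 0
    nxt = 0                                     # round-robin pointer
    while issued < total:
        # Scan the warps round-robin for one that is ready this cycle.
        for i in range(num_warps):
            w = (nxt + i) % num_warps
            if remaining[w] > 0 and ready_at[w] <= cycle:
                ready_at[w] = cycle + latency   # warp is busy until then
                remaining[w] -= 1
                issued += 1
                nxt = (w + 1) % num_warps
                break
        cycle += 1                              # idle if no warp was ready
    return issued / cycle

for warps in (1, 2, 4, 8, 16):
    print(warps, round(utilization(warps, latency=8), 3))
```

    In this model the issue slot only stays busy once the number of resident warps reaches the instruction latency, which is one way to see why so much parallelism is needed.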

    Also, all 32 (or 64) threads per WARP execute the same instruction.
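
    A minimal Python sketch of what "the same instruction for every thread in the warp" implies, assuming a simple per-lane mask for lanes that should not keep a result. The function warp_execute, the branch it models, and the step counting are hypothetical and only meant to show the lockstep idea.

```python
WARP_SIZE = 32  # the post gives 32 (or 64) threads per warp

def warp_execute(values):
    """Run `x = x*2 if x is even else x+1` over one warp in lockstep.

    Each step is one instruction applied across all lanes; a per-lane
    mask decides which lanes keep the result. Returns (values, steps).
    """
    assert len(values) == WARP_SIZE
    steps = 0
    mask = [v % 2 == 0 for v in values]       # compare, all lanes at once
    steps += 1
    if any(mask):                             # "then" side, masked
        values = [v * 2 if m else v for v, m in zip(values, mask)]
        steps += 1
    if not all(mask):                         # "else" side, masked
        values = [v if m else v + 1 for v, m in zip(values, mask)]
        steps += 1
    return values, steps

uniform, uniform_steps = warp_execute([2] * WARP_SIZE)
mixed, mixed_steps = warp_execute(list(range(WARP_SIZE)))
print(uniform_steps, mixed_steps)  # 2 3
```

    When all lanes agree on the branch, only one side is issued; when they disagree, both sides are issued one after the other with some lanes masked off.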

    And all of this stuff only works efficiently when there is an
    embarrassingly large amount of parallelism.

  • From EricP@21:1/5 to All on Wed Jan 24 23:42:55 2024
    MitchAlsup1 wrote:
    Thomas Koenig wrote:

    Quadibloc <quadibloc@servername.invalid> schrieb:

    Above a certain performance level, _all_ cores are out-of-order.

    That is true for general-purpose CPUs, but not for GPUs - these
    are in-order. I think AMD and NVIDIA differ in their handling
    of register hazards - AMD handles them, NVIDIA depends on the
    compiler (well, whatever you want to call the piece of software
    that translates the intermediate PTX into whatever the graphics
    card itself understands) to do this.


    In general, GPUs handle hazards by context switching to a different
    WARP. In some weird sense, this is maximally OoO; in another sense, it
    is completely in order.

    It's barrel switching.
    A rose by any other name would smell as sweet.

    GPUs context switch between instructions, and any instruction which
    is not 1-cycle* uses the context switch mechanism. (*) instructions
    which are 1 cycle can still context switch to a different WARP every
    cycle.

    Also, all 32 (or 64) threads per WARP execute the same instruction.

    Ok, it's Illiac IV with barrel switching.

    And all of this stuff only works efficiently when there is an embarrassingly large amount of parallelism.

  • From MitchAlsup1@21:1/5 to EricP on Thu Jan 25 16:40:02 2024
    EricP wrote:

    MitchAlsup1 wrote:
    Thomas Koenig wrote:

    Quadibloc <quadibloc@servername.invalid> schrieb:

    Above a certain performance level, _all_ cores are out-of-order.

    That is true for general-purpose CPUs, but not for GPUs - these
    are in-order. I think AMD and NVIDIA differ in their handling
    of register hazards - AMD handles them, NVIDIA depends on the
    compiler (well, whatever you want to call the piece of software
    that translates the intermediate PTX into whatever the graphics
    card itself understands) to do this.


    In general, GPUs handle hazards by context switching to a different
    WARP. In some weird sense, this is maximally OoO; in another sense, it
    is completely in order.

    It's barrel switching.
    A rose by any other name would smell as sweet.

    GPUs context switch between instructions, and any instruction which
    is not 1-cycle* uses the context switch mechanism. (*) instructions
    which are 1 cycle can still context switch to a different WARP every
    cycle.

    Also, all 32 (or 64) threads per WARP execute the same instruction.

    Ok, it's Illiac IV with barrel switching.

    Or BSP with barrel switching.

    And all of this stuff only works efficiently when there is an
    embarrassingly large amount of parallelism.
