• Re: ROCm installation

    From Maxime Chambonnet@21:1/5 to M. Zhou on Wed Jan 12 21:10:01 2022
    XPost: linux.debian.devel.mentors

    On 1/12/22 20:17, M. Zhou wrote:
    Hi,

    Thanks for the updates.

    On Wed, 2022-01-12 at 18:14 +0100, Maxime Chambonnet wrote:
    "Native" Debian packages are starting to cover a significant portion of
    the
    stack [2], and it would be great to figure out the installation topic

    The word "native" is ambiguous to some developers, as it may
    also refer to a native (debian/source/format) package.
    For other readers: it means "official Debian package" in contrast to
    "third-party Debian packages by upstream".


    on how to install ROCm today.

    After skimming through the mail I realize what you actually meant
    is the "ROCm file installation layout" right?
    Yes, totally, I forgot a bit to extract myself from my point of view!
    The installation options and paths generally looked for by CMake Lists/configs
    are currently:
    - various cmake project-specific flags for the install paths of the components
       HIP_CLANG_PATH, HIP_DEVICE_LIB_PATH, HIP_PATH, ROCM_PATH, ... see [5]


    Headers and libraries should be installed under the standard paths,
    so that the compiler and linker can find them without additional flags.
    Installing everything under /usr should be enough.
    Currently, for example, rocm-hipamd installs to /usr/hip, and
    lintian yells a lot. "Everything to /usr" is not quite clear enough.
    - /opt/rocm as a default backup

    There is no way to use `/opt` in an official Debian package. If any component
    breaks without a specific file under /opt, then that is a bug to fix.
    Right!
    I see at least three choices, and sub-decisions to be made:
    - Multi-arch or not
       nvidia toolkit supports aarch64 and a few others.
       Cross-compiling ROCm from Debian could be interesting in the near future.

    The rocm libraries and binary executables are architecture dependent.
    Most of them should have Architecture: any in d/control.

    Cross-compiling ROCm is not worth looking at, IMHO.
    ROCm targets high-performance computing. A hardware architecture
    really capable of "high-performance computing" can't be too weak
    to compile ROCm itself.
    That said, making the installation layout Multi-Arch aware is a
    good practice. Most of the packages may have Multi-Arch: same
    as long as they contain architecture-dependent files.

    - Nested or not
       Other stacks and relatively important projects, such as postgresql or
       llvm, go nested (there is a central /usr/lib/{llvm-13, postgresql}
       directory, often with a sub ./bin, ...)

    I did not understand this question. Do you mean something like
    /usr/lib/rocm-{4.5.2,5.0.0}, or /usr/lib/rocm-4.5.2/llvm ?
    Rather the first, though I am not sure I see a difference: in both cases it
    looks nested under "rocm-something" to me. And further down we agree
    that nested is probably not the way to go.
    - Where to install machine-readable GPU code
       There are at least three types of device-side (aka GPU) binary files -
         .bc for bitcode,
         .hsaco for HSA code object and
         .co for code object.

    How are these files read by ROCm? Is there anything like
    "PYTHONPATH" for the GPU code files? We should choose a
    supported path compatible with Debian policy.
    There is a cmake flag / environment variable for now,
    HIP_DEVICE_LIB_PATH :<
    The current preferred layout is /usr/amdgcn/*.bc
    BTW, are these files architecture-independent? Namely,
    can arm64 and amd64 produce exactly the same (e.g.
    md5sum-identical) output?
    I don't know, we discussed it at the last Jitsi meeting and
    I believe that no one has tried yet :)
       Bitcode files are the machine-readable form of the LLVM intermediate
       representation. HSA (Heterogeneous System Architecture) and other code
       object files are AMD containers for GPU machine code. PostgreSQL does use
       LLVM bitcode files: since the install path is nested, they are in
       /usr/lib/postgresql/14/lib/bitcode.
       Since GPU code is arch-independent in the sense of the CPU architecture,
       it has been suggested to me that such code should reside in /usr/share.

    The nested layout for llvm and postgresql is intended to allow multiple
    versions of the software to co-exist on the same system. For example,
    llvm-{11,12,13} may be installed simultaneously on Debian.

    We, the Debian ROCm team, do not have enough contributors to support
    multiple versions. Let's just do it the simplest way we can.

    The official repackaged nvidia-cuda-toolkit is not relevant
    to such a nested layout.
    Agreed

    What I tried to keep in mind is that:
    - shared libraries should be easily discoverable in the paths listed in
       /etc/ld.so.conf
    - there are only so many paths that cmake find_package in config mode
       searches [8].

    Shared objects from Multi-Arch aware library packages should be
    put in /usr/lib/<multiarch-triplet>/ as long as they are intended
    for public use.

    Don't be misled by complicated setups such as llvm, postgresql or
    the upstream non-standard installation path. In the standard setup
    everything is likely to become simpler. If you find yourself thinking
    about ld.so.conf for a regular official Debian shlib package, I
    suspect something has gone wrong.

    Gentoo has basically finished their ROCm packaging. Feel free to
    borrow from it as its license permits.
    Will look further at it!
    I attached as an image a direct comparison between some arbitrary combinations
    of these decisions. The directories are bundled in the attached archive
    too.
    - install_layout_proposal_v1 goes
       multi-arch, flattened, and with GPU code in /usr/share
    - install_layout_proposal_v2 goes
       "ante-multi-arch", nested, and with GPU code in /usr/lib

    1. header.

    The installation path of architecture-dependent headers should contain
    the multiarch triplet (e.g. x86_64-linux-gnu). In this case:
    Architecture: any, Multi-Arch: same

    If the headers are identical across all architectures, the multiarch
    triplet should be dropped:
    Architecture: all, Multi-Arch: no (default)
    I am not sure, maybe Cordell could help.
    2. shared objects.

    No need to nest as /usr/lib/rocm/lib. Just install all shared objects
    to /usr/lib/<multi-arch-triplet>/ . Private shared objects (such as
    plugins) may go to /usr/lib/<multi-arch-triplet>/rocm/ .

    Nested installation layout is really pointless unless you are
    determined to support the co-existence of multiple ROCm versions
    on Debian.

    My vote on "maintaining co-existence of multiple versions of ROCm"
    is against.

    Understood and agreed!

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From M. Zhou@21:1/5 to All on Wed Jan 12 20:20:02 2022
    XPost: linux.debian.devel.mentors

    Hi,

    Thanks for the updates.

    On Wed, 2022-01-12 at 18:14 +0100, Maxime Chambonnet wrote:
    "Native" Debian packages are starting to cover a significant portion of
    the
    stack [2], and it would be great to figure out the installation topic

    The word "native" is ambiguous to some developers, as it may
    also refer to a native (debian/source/format) package.
    For other readers: it means "official Debian package" in contrast to
    "third-party Debian packages by upstream".


    on how to install ROCm today.

    After skimming through the mail I realize what you actually meant
    is the "ROCm file installation layout" right?

    The installation options and paths generally looked for by CMake Lists/configs
    are currently:
    - various cmake project-specific flags for the install paths of the components
       HIP_CLANG_PATH, HIP_DEVICE_LIB_PATH, HIP_PATH, ROCM_PATH, ... see [5]


    Headers and libraries should be installed under the standard paths,
    so that the compiler and linker can find them without
    additional flags. Installing everything under /usr should be enough.

    - /opt/rocm as a default backup

    There is no way to use `/opt` in an official Debian package. If any component
    breaks without a specific file under /opt, then that is a bug to fix.


    I see at least three choices, and sub-decisions to be made:
    - Multi-arch or not
       nvidia toolkit supports aarch64 and a few others.
       Cross-compiling ROCm from Debian could be interesting in the near future.

    The rocm libraries and binary executables are architecture dependent.
    Most of them should have Architecture: any in d/control.

    Cross-compiling ROCm is not worth looking at, IMHO.
    ROCm targets high-performance computing. A hardware architecture
    really capable of "high-performance computing" can't be too weak
    to compile ROCm itself.

    That said, making the installation layout Multi-Arch aware is a
    good practice. Most of the packages may have Multi-Arch: same
    as long as they contain architecture-dependent files.

    - Nested or not
       Other stacks and relatively important projects, such as postgresql or
       llvm, go nested (there is a central /usr/lib/{llvm-13, postgresql}
       directory, often with a sub ./bin, ...)

    I did not understand this question. Do you mean something like
    /usr/lib/rocm-{4.5.2,5.0.0}, or /usr/lib/rocm-4.5.2/llvm ?

    - Where to install machine-readable GPU code
       There are at least three types of device-side (aka GPU) binary files -
         .bc for bitcode,
         .hsaco for HSA code object and
         .co for code object.

    How are these files read by ROCm? Is there anything like
    "PYTHONPATH" for the GPU code files? We should choose a
    supported path compatible with Debian policy.

    BTW, are these files architecture-independent? Namely,
    can arm64 and amd64 produce exactly the same (e.g.
    md5sum-identical) output?

       Bitcode files are the machine-readable form of the LLVM intermediate
       representation. HSA (Heterogeneous System Architecture) and other code
       object files are AMD containers for GPU machine code. PostgreSQL does use
       LLVM bitcode files: since the install path is nested, they are in
       /usr/lib/postgresql/14/lib/bitcode.
       Since GPU code is arch-independent in the sense of the CPU architecture,
       it has been suggested to me that such code should reside in /usr/share.

    The nested layout for llvm and postgresql is intended to allow multiple
    versions of the software to co-exist on the same system. For example,
    llvm-{11,12,13} may be installed simultaneously on Debian.

    We, the Debian ROCm team, do not have enough contributors to support
    multiple versions. Let's just do it the simplest way we can.

    The official repackaged nvidia-cuda-toolkit is not relevant
    to such a nested layout.

    What I tried to keep in mind is that:
    - shared libraries should be easily discoverable in the paths listed in
       /etc/ld.so.conf
    - there are only so many paths that cmake find_package in config mode
       searches [8].

    Shared objects from Multi-Arch aware library packages should be
    put in /usr/lib/<multiarch-triplet>/ as long as they are intended
    for public use.

    Don't be misled by complicated setups such as llvm, postgresql or
    the upstream non-standard installation path. In the standard setup
    everything is likely to become simpler. If you find yourself thinking
    about ld.so.conf for a regular official Debian shlib package, I
    suspect something has gone wrong.

    Gentoo has basically finished their ROCm packaging. Feel free to
    borrow from it as its license permits.

    I attached as an image a direct comparison between some arbitrary
    combinations
    of these decisions. The directories are bundled in the attached archive
    too.
    - install_layout_proposal_v1 goes
       multi-arch, flattened, and with GPU code in /usr/share
    - install_layout_proposal_v2 goes
       "ante-multi-arch", nested, and with GPU code in /usr/lib

    1. header.

    The installation path of architecture-dependent headers should contain
    the multiarch triplet (e.g. x86_64-linux-gnu). In this case:
    Architecture: any, Multi-Arch: same

    If the headers are identical across all architectures, the multiarch
    triplet should be dropped:
    Architecture: all, Multi-Arch: no (default)

    2. shared objects.

    No need to nest as /usr/lib/rocm/lib. Just install all shared objects
    to /usr/lib/<multi-arch-triplet>/ . Private shared objects (such as
    plugins) may go to /usr/lib/<multi-arch-triplet>/rocm/ .
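
    The header and shared-object rules above can be sketched as a small
    lookup, written in Python purely for illustration. The function name
    install_dir and the "kind" labels are hypothetical; they are not part of
    any Debian or ROCm tooling, and the placement of arch-dependent headers
    under /usr/include/<triplet> is assumed from the usual multiarch
    convention.

```python
from pathlib import PurePosixPath


def install_dir(kind: str, triplet: str, arch_dependent: bool = True) -> PurePosixPath:
    """Return the install directory for one file kind (illustrative only).

    kind is one of the hypothetical labels "header", "public-shlib"
    or "private-shlib"; triplet is a multiarch triplet such as
    "x86_64-linux-gnu".
    """
    usr = PurePosixPath("/usr")
    if kind == "header":
        # Arch-dependent headers carry the triplet; identical headers do not.
        return usr / "include" / triplet if arch_dependent else usr / "include"
    if kind == "public-shlib":
        # Public shared objects go straight into the multiarch library dir.
        return usr / "lib" / triplet
    if kind == "private-shlib":
        # Private shared objects (e.g. plugins) get a rocm/ subdirectory.
        return usr / "lib" / triplet / "rocm"
    raise ValueError(f"unknown kind: {kind!r}")


if __name__ == "__main__":
    for kind in ("header", "public-shlib", "private-shlib"):
        print(kind, install_dir(kind, "x86_64-linux-gnu"))
```

    Nothing here is nested under a versioned rocm-X.Y.Z directory, which
    matches the flattened layout argued for in this message.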

    Nested installation layout is really pointless unless you are
    determined to support the co-existence of multiple ROCm versions
    on Debian.

    My vote on "maintaining co-existence of multiple versions of ROCm"
    is against.

  • From M. Zhou@21:1/5 to Maxime Chambonnet on Wed Jan 12 22:20:01 2022
    XPost: linux.debian.devel.mentors

    On Wed, 2022-01-12 at 21:06 +0100, Maxime Chambonnet wrote:

    Headers and libraries should be installed under the standard paths,
    so that the compiler and linker can find them without additional flags.
    Installing everything under /usr should be enough.
    Currently, for example, rocm-hipamd installs to /usr/hip, and
    lintian yells a lot. "Everything to /usr" is not quite clear enough.

    Then it sounds like the upstream CMake installation targets
    are primarily written for somewhere like /opt instead of /usr.
    I looked into one Gentoo ebuild for ROCm and the problem is
    rather distinct.

    https://github.com/gentoo/gentoo/blob/2ed748a3b6412f99bc249e089e9221e38417a8f8/dev-util/hip/hip-4.1.0.ebuild

    If shlibs are installed to somewhere like /usr/lib/rocm/lib/,
    we are still able to tamper with ld.so.conf.
    If binary executables are installed to /usr/lib/rocm/bin/,
    then we are messing with the default shell PATH.
    This is a dead end because we are not going to patch all
    POSIX and non-POSIX shell configs. Nor will we introduce weird
    scripts for the user to source.
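
    To illustrate the PATH problem, here is a minimal Python sketch. It
    builds a throwaway directory tree, drops a fake executable into a nested
    usr/lib/rocm/bin, and looks it up once with only usr/bin on the search
    path and once with the nested directory appended. The function name
    lookup_in_both_paths is hypothetical, and "hipcc" is used only as a
    stand-in file name; no real ROCm tool is invoked.

```python
import os
import shutil
import tempfile


def lookup_in_both_paths(tool_name="hipcc"):
    """Search for a fake tool with a standard and an extended search path.

    Returns a pair: the result with only usr/bin on the path, and the
    result (relative to the temporary root) once the nested bin directory
    is appended.
    """
    with tempfile.TemporaryDirectory() as root:
        std_bin = os.path.join(root, "usr", "bin")
        nested_bin = os.path.join(root, "usr", "lib", "rocm", "bin")
        os.makedirs(std_bin)
        os.makedirs(nested_bin)

        # Create an executable placeholder in the nested directory only.
        tool = os.path.join(nested_bin, tool_name)
        with open(tool, "w") as fh:
            fh.write("#!/bin/sh\nexit 0\n")
        os.chmod(tool, 0o755)

        # A default-style PATH does not include the nested directory.
        default_hit = shutil.which(tool_name, path=std_bin)

        # The tool is found only after the user (or a script) extends PATH.
        extended = os.pathsep.join([std_bin, nested_bin])
        extended_hit = shutil.which(tool_name, path=extended)
        if extended_hit is not None:
            extended_hit = os.path.relpath(extended_hit, root)
        return default_hit, extended_hit


if __name__ == "__main__":
    print(lookup_in_both_paths())
```

    The first lookup fails because nothing adds the nested directory to the
    search path, which is exactly the extra configuration we do not want to
    ship.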

    Standardizing the upstream install target is inevitable
    to some extent.
    A flag can be introduced for the upstream cmake file along
    with some code, which by default installs things to /usr/local
    like most other existing software.



    I did not understand this question. Do you mean something like
    /usr/lib/rocm-{4.5.2,5.0.0}, or /usr/lib/rocm-4.5.2/llvm ?
    Rather the first, though I am not sure I see a difference: in both cases it
    looks nested under "rocm-something" to me. And further down we agree
    that nested is probably not the way to go.

    Yes. We should just stay away from nesting things.

    How are these files read by ROCm? Is there anything like
    "PYTHONPATH" for the GPU code files? We should choose a
    supported path compatible with Debian policy.
    There is a cmake flag / environment variable for now,
    HIP_DEVICE_LIB_PATH :<
    The current preferred layout is /usr/amdgcn/*.bc

    Anything like
    /usr/share/amdgcn/ (in case they are arch-indep)
    or
    [/usr/lib/amdgcn, /var/lib/amdgcn, /var/cache/amdgcn]
    (in case they are arch-dep) could be better.

    BTW, are these files architecture-independent? Namely,
    can arm64 and amd64 produce exactly the same (e.g.
    md5sum-identical) output?
    I don't know, we discussed it at the last Jitsi meeting and
    I believe that no one has tried yet :)

    Then we regard them as architecture-dependent for initial
    debian packaging.
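
    Should someone want to test the open question later, a rough Python
    sketch of the comparison could look like the following. It assumes two
    local directories holding the device-code files produced by builds on
    two host architectures; the function names are hypothetical, SHA-256 is
    used here in place of md5sum, and the two suggested paths are simply the
    candidates named above.

```python
import hashlib
from pathlib import Path

# Device-code suffixes mentioned earlier in the thread.
DEVICE_SUFFIXES = (".bc", ".hsaco", ".co")


def digest_tree(root):
    """Map each device-code file (path relative to root) to its SHA-256."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file() and p.suffix in DEVICE_SUFFIXES
    }


def differing_files(build_a, build_b):
    """List files that are missing on one side or differ in content."""
    a, b = digest_tree(build_a), digest_tree(build_b)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))


def suggested_dir(build_a, build_b):
    """Pick a candidate directory based on whether the builds are identical."""
    if differing_files(build_a, build_b):
        return "/usr/lib/amdgcn"    # treat as architecture-dependent
    return "/usr/share/amdgcn"      # byte-identical across both builds
```

    Called as suggested_dir("out-amd64", "out-arm64") with two placeholder
    directory names, it returns the arch-independent path only when every
    file matches byte for byte. Identical output from one pair of builds
    would be evidence rather than proof, so the conservative default above
    still makes sense for the initial packaging.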

    I looked around in the Gentoo ebuild repository,
    https://github.com/gentoo/gentoo/search?q=hip&type=commits
    https://github.com/gentoo/gentoo/search?q=rocm&type=commits
    from which we can borrow a lot. That is, starting from
    scratch by ourselves is not necessary.


