• [vim] Jumping from current Unicode string to next/prev appearance

    From Janis Papanagnou@21:1/5 to All on Thu Dec 28 02:52:50 2023
    In Vim I frequently jump from string to the next equal string using the commands '*' (forward search'n'jump) and '#' (backward search'n'jump).

    With Unicode characters that doesn't seem to always work (at least not
    per default).

    In the following (UTF-8 encoded) test sample there is one subset of
    Omega words where * and # works correctly and one where it doesn't
    (starting with the cursor on the first letter of any word)

    Ωmega Ωmega Ωmega Ωmega Ωmega Ωmega Ωmega Ωmega

    The difference is only the encoding of the first character of that
    word ('\x03A9' versus '\x2126'). For words with Ω=\x03A9 it works but
    not for words with Ω=\x2126.

    Is there a way to fix or achieve that function for all UTF-8 encoded
    words?

    Janis

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Eli the Bearded@21:1/5 to janis_papanagnou+ng@hotmail.com on Thu Dec 28 02:36:58 2023
    In comp.editors, Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    In Vim I frequently jump from string to the next equal string using the commands '*' (forward search'n'jump) and '#' (backward search'n'jump).

    With Unicode characters that doesn't seem to always work (at least not
    per default).

    In the following (UTF-8 encoded) test sample there is one subset of
    Omega words where * and # works correctly and one where it doesn't
    (starting with the cursor on the first letter of any word)

    Ωmega Ωmega Ωmega Ωmega Ωmega Ωmega Ωmega Ωmega

    This is like complaining that a search for "MISS" does not also match
    "МІЅЅ". They are completely different strings that just happen to look
    alike with certain font choices.

    Some of those are "ohm sign", "Latin small letter m", "Latin small
    letter e", "Latin small letter g", "Latin small letter a" and the
    others are "Greek capital letter omega", "Latin small letter m",
    "Latin small letter e", "Latin small letter g", "Latin small letter
    a".

    Your "difference is only the encoding" fails to grasp that Unicode is
    semiotics aware, even if users might not be.

    Elijah
    ------
    https://www.unicode.org/reports/tr36/#visual_spoofing

  • From Julieta Shem@21:1/5 to Eli the Bearded on Wed Dec 27 23:45:07 2023
    Eli the Bearded <*@eli.users.panix.com> writes:

    In comp.editors, Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    In Vim I frequently jump from string to the next equal string using the
    commands '*' (forward search'n'jump) and '#' (backward search'n'jump).

    With Unicode characters that doesn't seem to always work (at least not
    per default).

    In the following (UTF-8 encoded) test sample there is one subset of
    Omega words where * and # works correctly and one where it doesn't
    (starting with the cursor on the first letter of any word)

    Ωmega Ωmega Ωmega Ωmega Ωmega Ωmega Ωmega Ωmega

    This is like complaining that a search for "MISS" does not also match "МІЅЅ". They are completely different strings that just happen to look alike with certain font choices.

    It looks very much alike with Google's ``Fira Code''.

    Some of those are "ohm sign", "Latin small letter m", "Latin small
    letter e", "Latin small letter g", "Latin small letter a" and the
    others are "Greek capital letter omega", "Latin small letter m",
    "Latin small letter e", "Latin small letter g", "Latin small letter
    a".

    Your "difference is only the encoding" fails to grasp that Unicode is semiotics aware, even if users might not be.

    There's a package for the GNU EMACS that implements the search as the OP desires. You can invoke it with saying

    C-u 42 S E M I O T I C A W A R E RET C-c A I RET A W Y E A H RET

    to the minibuffer. (Then press * and # as you wish.)

  • From Janis Papanagnou@21:1/5 to Janis Papanagnou on Thu Dec 28 04:55:59 2023
    On 28.12.2023 04:40, Janis Papanagnou wrote:

    Try to copy/paste the line into a Vim session, then move the cursor
    onto the first character of the first word, then type * repeatedly.
    Then do the same starting with the first character of the third word,
    and observe the difference! - Tell me what you think about that.

    Here's the effect visualized, where ^ indicates the cursor position
    after a '*' operation


    Case 1 (cursor starting at first character of the _third_ word):

    Ωmega Ωmega Ωmega Ωmega Ωmega Ωmega Ωmega Ωmega
    ^ ^ ^ ^

    (All okay, the four matching words are addressed correctly.)


    Case 2 (cursor starting at first character of the _first_ word):

    Ωmega Ωmega Ωmega Ωmega Ωmega Ωmega Ωmega Ωmega
    ^ ^ ^ ^ first turn
    ^ ^ ^ ^ second turn

    (Not okay: in all subsequent words the first character is skipped.)


    This is what annoys me and where I am looking for a solution (or a
    hint that this is, maybe, an unavoidable flaw).

    Janis

  • From Janis Papanagnou@21:1/5 to Eli the Bearded on Thu Dec 28 04:40:44 2023
    On 28.12.2023 03:36, Eli the Bearded wrote:
    In comp.editors, Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    In Vim I frequently jump from string to the next equal string using the
    commands '*' (forward search'n'jump) and '#' (backward search'n'jump).

    With Unicode characters that doesn't seem to always work (at least not
    per default).

    In the following (UTF-8 encoded) test sample there is one subset of
    Omega words where * and # works correctly and one where it doesn't
    (starting with the cursor on the first letter of any word)

    Ωmega Ωmega Ωmega Ωmega Ωmega Ωmega Ωmega Ωmega

    This is like complaining that a search for "MISS" does not also match "МІЅЅ". They are completely different strings that just happen to look alike with certain font choices.

    No, unfortunately you seem to have MISSed the point. It's not about
    same looking but different strings. It's about different behavior of
    the same Vim operations (* and #) on _two types_ of words.

    Try to copy/paste the line into a Vim session, then move the cursor
    onto the first character of the first word, then type * repeatedly.
    Then do the same starting with the first character of the third word,
    and observe the difference! - Tell me what you think about that.

    (You can adjust the test-case to use these two letters in different
    contexts, or work on single characters.)

    Janis

    Some of those are "ohm sign", "Latin
    small letter m", "Latin small letter e", "Latin small letter g", "Latin
    small letter a" and the others are "Greek capital letter omega",
    "Latin small letter m", "Latin small letter e", "Latin small letter g", "Latin small letter a".

    Your "difference is only the encoding" fails to grasp that Unicode is semiotics aware, even if users might not be.

    Elijah
    ------
    https://www.unicode.org/reports/tr36/#visual_spoofing


  • From Janis Papanagnou@21:1/5 to Janis Papanagnou on Thu Dec 28 05:14:16 2023
    On 28.12.2023 02:52, Janis Papanagnou wrote:
    In Vim I frequently jump from string to the next equal string using the commands '*' (forward search'n'jump) and '#' (backward search'n'jump).

    With Unicode characters that doesn't seem to always work (at least not
    per default).

    In the following (UTF-8 encoded) test sample there is one subset of
    Omega words where * and # works correctly and one where it doesn't
    (starting with the cursor on the first letter of any word)

    Ωmega Ωmega Ωmega Ωmega Ωmega Ωmega Ωmega Ωmega

    The difference is only the encoding of the first character of that
    word ('\x03A9' versus '\x2126'). For words with Ω=\x03A9 it works but
    not for words with Ω=\x2126.

    Is there a way to fix or achieve that function for all UTF-8 encoded
    words?

    I noticed that the effect is not depending on Unicode characters but
    behaves similar to this ASCII-only test-case

    'help' 'help' 'help'

    If the cursor starts at the first quote we see the same effect

    'help' 'help' 'help'
    ^ ^ ^ first turn
    ^ ^ ^ second turn

    The quote seems to be excluded from consideration of the * command,
    and the cursor jumps to the next word part. - Can this be explained?

    So one of the Unicode characters mentioned above is not considered
    part of the word while the other one is. And only words seem to be
    considered, at least in this case.

    But on the other hand, I can navigate with * also within non-alpha
    characters like

    §%" §%" §%" §%"
    ^ ^ ^ ^

    So this also works.

    I'm not pleased by that behavior. Looks also inconsistent to me.

    Janis

  • From Eli the Bearded@21:1/5 to janis_papanagnou+ng@hotmail.com on Thu Dec 28 08:13:21 2023
    In comp.editors, Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    Case 2 (cursor starting at first character of the _first_ word):

    Ωmega Ωmega Ωmega Ωmega Ωmega Ωmega Ωmega Ωmega
    ^ ^ ^ ^ first turn
    ^ ^ ^ ^ second turn


    :help *

                                            *star* *E348* *E349*
    *           Search forward for the [count]'th occurrence of the
                word nearest to the cursor. The word used for the
                search is the first of:
                    1. the keyword under the cursor |'iskeyword'|
                    2. the first keyword after the cursor, in the
                       current line
                ...

    :help iskeyword
                                            *'iskeyword'* *'isk'*
    'iskeyword' 'isk'   string (Vim default for MS-DOS and Win32:
                                    "@,48-57,_,128-167,224-235"
                                otherwise: "@,48-57,_,192-255"
                                Vi default: "@,48-57,_")
                        local to buffer
        Keywords are used in searching and recognizing with many commands:
        "w", "*", "[i", etc. It is also used for "\k" in a |pattern|. See
        'isfname' for a description of the format of this option. For '@'
        characters above 255 check the "word" character class.
        For C programs you could use "a-z,A-Z,48-57,_,.,-,>".
        ...

    I think it is a bug that "word" is not a link to somewhere in pattern.txt

    In any case, it is clear that # and * recognize alphabetic characters
    like Greek capital *letter* omega differently from non-alphabet symbol
    characters like ohm *sign*. If you move along the line with "w" to jump
    between "words" you see the differences. The # and * searches use word
    boundaries, so word definitions are very important there.

    You are still looking at an ohm sign and thinking of a letter which is
    the trap of Unicode "look alikes", not something vim is doing wrong.

    Elijah
    ------
    has vim's * remapped to _ and nearly used that writing this
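
    The classification described above can be checked directly. This is a
    minimal sketch, assuming a UTF-8 Vim session with the cursor on the
    first character of one of the sample words. The results depend on the
    buffer's 'iskeyword' value and on Vim's built-in character classes, so
    no output is shown. The last command is optional and only makes sense
    if the ohm signs were meant to be Greek letters in the first place.

```vim
" Print the code point of the character under the cursor (same as ga).
:ascii

" Show the keyword definition in effect for this buffer.
:setlocal iskeyword?

" Test each character against the keyword class \k (prints 1 or 0).
:echo nr2char(0x03A9) =~# '\k'
:echo nr2char(0x2126) =~# '\k'

" Optional: replace every OHM SIGN (U+2126) in the buffer with
" GREEK CAPITAL LETTER OMEGA (U+03A9); the e flag suppresses the
" error message when nothing matches.
:%s/\%u2126/\=nr2char(0x03A9)/ge
```

    After the substitution both kinds of sample word are the same string,
    so * and # treat them alike.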

  • From Janis Papanagnou@21:1/5 to Eli the Bearded on Thu Dec 28 16:54:19 2023
    On 28.12.2023 09:13, Eli the Bearded wrote:
    [snip]

    In any case, it is clear that # and * recognize alphabetic characters
    like Greek capital *letter* omega differently from non-alphabet symbol characters like ohm *sign*. If you move along the line with "w" to jump between "words" you see the differences. The # and * searches use word boundaries, so word definitions are very important there.

    Right.


    You are still looking at an ohm sign and thinking of a letter which is
    the trap of Unicode "look alikes", not something vim is doing wrong.

    Erm, no. (I already explained elsethread that it's not about characters
    that are looking alike; the issue turned out to not be about Unicode,
    although it got apparent there. That's why I changed the test sample to
    a plain ASCII test case.)

    Your quotes (from the Vim help) help explain the behavior with the
    'help' sample I posted: 'help' 'help' 'help'

    I still think the behavior of Vim's * command is counterintuitive and inconsistent. See this example (a file with two lines):

    §%" §%" *+*+ §%" §%"
    §%" a §%" a *+*+ §%" a §%" a

    Starting from the first character of the first word we see the command
    '*' jump words as depicted by the ^ symbols:

    §%" §%" *+*+ §%" §%"
    ^ ^ ^ ^ # search-jumps on first line
    §%" a §%" a *+*+ §%" a §%" a
    ^ ^ ^ ^ # continuing/changing on second line
    ^ ^ ^ ^

    It means that * is first identifying the §%" string, and it continues
    the search on the next line. But after it located the first §%" on the
    second line it ad hoc changes the search pattern. - I would call that
    undesired and inconsistent behavior.

    We can "explain" (sort of) what happens. As in, say,
    "If no alpha character is on the line * tries to match the next string
    that matches the current one, but as soon as this search reaches or is
    on a line that contains an alpha character the search pattern changes
    and * jumps to the next alpha character on that line."

    Okay, it is as it is. But shouldn't that feature be straightened? It's
    not the first time that I missed a more coherent behavior in contexts
    of non-alpha character strings, and I think that it would be generally
    useful. - Is there, on the other hand, some sensible use-case for that
    current [inconsistent] behavior (of ad hoc changing the pattern)?

    Janis

  • From Eli the Bearded@21:1/5 to janis_papanagnou+ng@hotmail.com on Fri Dec 29 01:53:33 2023
    In comp.editors, Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    Is there, on the other hand, some sensible use-case for that
    current [inconsistent] behavior (of ad hoc changing the pattern)?

    It is a keyword search tool, not a random object search tool. The word boundaries should be the indicator.

    Elijah
    ------
    printf, eg, is different than sprintf

  • From Janis Papanagnou@21:1/5 to Eli the Bearded on Fri Dec 29 16:36:39 2023
    On 29.12.2023 02:53, Eli the Bearded wrote:
    In comp.editors, Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    Is there, on the other hand, some sensible use-case for that
    current [inconsistent] behavior (of ad hoc changing the pattern)?

    It is a keyword search tool, not a random object search tool.

    Yes, obviously. And that's IMO an unnecessary restriction.
    YMMV, of course.

    And even as an artificially restricted "keyword search tool"
    it's not working consistently when applied to the two lines of
    test data that I posted.

    I suppose there's little use to discuss that since it won't
    change if not widely accepted as a useful generalization of
    the * and # command.

    In my book it was certainly often a nuisance in the restricted
    and inconsistent form and I would have appreciated if it works
    also on other (non-alphanumeric) keywords (i.e. on strings).

    The word boundaries should be the indicator.

    Janis

    PS: Historically (IIRC), in Vi, there was just the # command
    (but not the * which I saw later in Vim). A typical use was to
    jump from a C function call backwards to find its declaration.
    Application of Vi(m) broadened since then, and yet more useful
    features and changes entered the Vim command base.

  • From Eli the Bearded@21:1/5 to janis_papanagnou+ng@hotmail.com on Sat Dec 30 07:00:12 2023
    In comp.editors, Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    PS: Historically (IIRC), in Vi, there was just the # command
    (but not the * which I saw later in Vim).

    I do not believe you. For starters, nvi has a completely different
    function bound to #, and nvi tries to be backwards compatible with vi.

    jump from a C function call backwards to find its declaration.
    Application of Vi(m) broadened since then, and yet more useful
    features and changes entered the Vim command base.

    It occurs to me that you may like the boundary free versions of * and #:
    prefix them with a g.

    :noremap * g*
    :noremap # g#
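
    The difference between the two command families shows up in the search
    register after each command. A small illustration, assuming the cursor
    sits on the word help:

```vim
" After pressing * on help, the last search pattern carries word
" boundaries; this prints \<help\>
:echo @/

" After pressing g* on help instead, the boundaries are omitted;
" this prints help
:echo @/
```

    In both cases the text searched for is still chosen by the 'iskeyword'
    rules; only the \< and \> anchors differ.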

    Elijah
    ------
    uses very few of the g_ library of commands

  • From Janis Papanagnou@21:1/5 to Eli the Bearded on Sat Dec 30 19:35:45 2023
    On 30.12.2023 08:00, Eli the Bearded wrote:
    In comp.editors, Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    PS: Historically (IIRC), in Vi, there was just the # command
    (but not the * which I saw later in Vim).

    I do not believe you. For starters, nvi has a completely different
    function bound to #, and nvi tries to be backwards compatible with vi.

    I don't think that the '#' command (with the current semantics) was in
    the _original_ Vi. (If that is how you interpreted "historically"). I
    observed the command # with the current behavior when I regularly used
    Vi starting around 1990 on AIX (and HPUX). And I'm positive - since I
    recall having looked for it - that in those days there was no
    '*' (as counterpart that matches in the opposite direction). - But
    please correct me if I am wrong.


    jump from a C function call backwards to find its declaration.
    Application of Vi(m) broadened since then, and yet more useful
    features and changes entered the Vim command base.

    It occurs to me that you may like the boundary free versions of * and #: prefix them with a g.

    :noremap * g*
    :noremap # g#

    I didn't know of the 'g' variants, but 'g*' seems to behave equivalent
    to '*' on my two-line test sample; i.e. when reaching the second line
    it jumps from the punctuation character block to the letter a.

    §%" §%" *+*+ §%" §%"
    ^ ^ ^ ^
    §%" a §%" a *+*+ §%" a §%" a
    ^ ^ ^ ^
    ^ ^ ^ ^

    So while 'g*' doesn't address the issue it is actually even worse since
    without the \< and \> it then also matches other appearing 'a' in the
    text.


    I want to provide two more examples to explain my desire for a "better" behavior with non-alpha character blocks.[*]

    1) Matching (non-alpha) shell keywords (or other non-alpha constructs
    that are so typical in shells)

    f() {
    : ${1:?}
    }
    : ${1:?}
    echo "a: b"

    Positioning at the first colon I want to find other standalone ones.

    2) Matching ASN.1 identifiers (or other not pure-alpha identifiers)

    direct-reference OBJECT IDENTIFIER OPTIONAL,
    indirect-reference INTEGER OPTIONAL,

    Positioning it in one of the "reference" substrings I want to find
    the whole identifier (e.g. "direct-reference"), but not any string
    with the substring reference.

    In other words, a keyword and an identifier (beyond C and alike) has a
    broader definition generally, and a quick-match for non-alpha strings
    would be very convenient as I regularly observe in various editing
    contexts.

    I am aware that we cannot cover all matching combinations - e.g. how
    should "an-id: 'a value'" be parsed; it might get non-trivial - but
    a quick-search for space-separated entities would already be very
    convenient as I've often experienced in my editing contexts.

    Vim already supports a lot of such settings (breakat, isfname, isident, iskeyword, and yet more even language specifics), so maybe there's a
    not too complex way to achieve that.

    Janis

    [*] Note: Of course all searching can be done with regular search/regexp
    but as I use * for quick match convenience I'd like to have it not only
    for alpha sequences.
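
    One possible way to get a quick-search for space-separated entities is
    sketched below. It is an untested-in-thread suggestion, not a built-in
    Vim feature: the <Leader>* key choice is hypothetical, and the mapping
    relies only on standard pieces (expand('<cWORD>'), the \V "very
    nomagic" prefix, and the \@<! and \@! look-around items). The second
    command is an alternative for the ASN.1 case only.

```vim
" Hypothetical mapping: search for the whitespace-delimited WORD under
" the cursor, taken literally (\V), and only where it is neither
" preceded nor followed by a non-blank character.
nnoremap <silent> <Leader>* :let @/ = '\V\S\@<!' . escape(expand('<cWORD>'), '\') . '\S\@!'<CR>n

" Alternative for ASN.1-style identifiers: treat '-' as a keyword
" character in the current buffer, so that plain * on "reference"
" picks up the whole of direct-reference.
setlocal iskeyword+=-
```

    With the mapping, starting on the standalone colon in the shell sample
    skips the colons inside ${1:?} and "a: b", because those are preceded
    by a non-blank character. Note that n follows the direction of the
    previous search, and that a WORD such as "OPTIONAL," includes the
    trailing comma.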
