• :keywords metadata item?

    From Julien ÉLIE@21:1/5 to All on Wed Aug 10 10:29:29 2022
    XPost: news.software.readers

    Hi all,

    INN (and perhaps other servers) can provide keywords in overview data.
    It advertises "Keywords:full" in response to LIST OVERVIEW.FMT and then
    adds "Keywords: a,b,c,d" in OVER responses. No Keywords header field is
    added to the articles, and the content of an existing one is kept at
    the beginning of the one generated for the overview.

    I'm wondering whether:
    - it shouldn't be advertised as ":keywords" instead of "Keywords:full"
    as the header field is not in the original article.

    I am unsure though if such a change would break implementations that
    look for it in overview (but is there any such news client? ...)


    - and naturally before that, the question of whether the feature should
    remain in INN has to be raised.
    Currently, it only takes all the words, removes punctuation, removes a
    list of "known" words (like pronouns...), strips non-ASCII characters
    (sic) and lists the result sorted by number of appearances.
    This should obviously be improved to be smarter (but is there any need
    for that?)
    So I would suggest that the code generating such basic keywords be
    removed unless there is a real current use case behind it. (That would
    not prevent reintegrating it in the future with a smarter algorithm.)


    Here are examples of what is currently generated:

    [from latest discussion "Re: naming concept of newsgroups"]
    Keywords: newsgroup,news,net,eagle,eyrie,used,org,hierarchies,questions,wondering,messages,general,periods,always,https,names,taken,think,back,don,dot,www,configuration,distinction,distributed,punctuation,introduced,presumably,processing,lowercase,sethhurst,
    although,choosing,directly,explains,original,predates,renaming,software,truscott,allbery,analogy,control,current,however,mailing,prevent,colons,daniel,dashes,domain,insead,levels,naming,picked,please,prefix,rather,scheme,trivia,aware,based

    [from news.lists.filters!]
    Keywords: message,ncm,begin,body,spam,notice,pgp,googlegroups,spamassassin,signature,pasdenom,signed,usenet,info,com,end,est,lkabxolsadsxuurahpalo,trfvzkfamlybfeacgkqie,fxquozybxsows,hfcozufhlkorn,pothgfqhddwoc,tcbhokunxbviy,probablement,xkcmgcjvkghy,wzykxofxigf,
    xjjqrzmwsth,iezppdjfhb,referenced,tememrtmfh,utzjdlgunu,vznwaqwahg,akefkllaj,cyimrhktz,following,plmxkyvqo,satellite,xijwqvhwm,zzpginwcr,detected,ethernet,followup,koxyluau,pikaxokm,probably,english,gzkzeqt,headers,reseaux,tdczttb,version

    [from a spam...]
    Keywords: drug,www,channel,running,https,com,ucdtdenqhwst,xfzsllvprc,bitchute,exorcist,military,brendon,connell,talpiot,youtube,zealand,mafias,anzus,below,dsfug,endtx,world,best,html,http,runs,bet,new,ops,org

    [from an article written in French]
    Keywords: crit,nous,pas,recommencer,comptons,magicien,chaines,chanson,declara,oubliez,cessit,gestes,ubuntu,actes,barri,faire,acte,gump,joli,mage,mais,marc,pour,vous,cet,des,les,lou,non,res,sur,une


    Obviously, for messages written in a language other than English, the
    generation is totally wrong and unusable. And even for English, I am
    unsure the generated keywords are really usable (there are too many of
    them, and they are not specific enough).

    --
    Julien ÉLIE

    "I have a friend who is a test pilot... Well, he isn't one yet; for
    now, he is trying to be a pilot!" (Raymond Devos)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Franck@21:1/5 to All on Sat Aug 20 11:23:57 2022
    XPost: news.software.readers

    Hello Julien,


    > INN (and perhaps other servers) has the possibility to provide keywords

    Not mine.

    > I'm wondering whether:
    > - it shouldn't be advertised as ":keywords" instead of "Keywords:full"
    > as the header field is not in the original article.

    Article "metadata" is data about articles that does not occur within the article itself, so I will go for ":keywords".

    BUT only two metadata items are defined in RFC 3977, ":lines" and
    ":bytes". RFC 3977 says: "To avoid the risk of a clash with a future
    registered extension, the names of METADATA items defined by private
    extensions SHOULD begin with ":x-"."

    So, perhaps it's better to name it ":x-keywords"?

    For LIST OVERVIEW.FMT, I think you have a choice, because RFC 3977 says
    that the metadata items ":bytes" and ":lines" MAY instead be advertised
    as "Bytes" and "Lines", even though they refer to metadata, not headers.

    So, I'd go for: ":x-keywords" MAY instead be "Keywords", even though it
    refers to metadata, not a header :-)


    For OVER, I think the field for this metadata item SHOULD consist of the
    metadata name, a single space, and then the value, as explained in RFC
    3977: "For all subsequent fields that contain headers, the content MUST
    be the entire header line other than the trailing CRLF. For all
    subsequent fields that contain metadata, the field consists of the
    metadata name, a single space, and then the value."
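
    To make the quoted rule concrete, here is a minimal Python sketch of how
    an extra overview field could be formatted. The function name and the
    sample values are hypothetical, and this is not how any particular
    server implements it.

```python
def format_overview_field(name: str, value: str) -> str:
    """Format one non-mandatory field of an OVER response line.

    A name starting with ":" denotes a metadata item (name, one space, value).
    Any other name is treated as a header field and rendered as the header
    line without its trailing CRLF (name, colon, space, value).
    """
    if name.startswith(":"):
        return f"{name} {value}"
    return f"{name}: {value}"


if __name__ == "__main__":
    print(format_overview_field(":keywords", "a,b,c,d"))  # metadata form
    print(format_overview_field("Keywords", "a,b,c,d"))   # header form
```

    The only visible difference in the OVER output is therefore the leading
    colon and the absence of a colon after the name.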


    But, of course, I may be wrong!

    Franck

  • From Julien ÉLIE@21:1/5 to All on Thu Sep 1 20:40:54 2022
    XPost: news.software.readers

    Hi Franck,

    >> I'm wondering whether:
    >> - it shouldn't be advertised as ":keywords" instead of "Keywords:full"
    >> as the header field is not in the original article.
    >
    > Article "metadata" is data about articles that does not occur within
    > the article itself, so I will go for ":keywords".

    Agreed as, like you say, it is not in the article itself.


    > BUT only two metadata items are defined in RFC 3977, ":lines" and
    > ":bytes". RFC 3977 says: "To avoid the risk of a clash with a future
    > registered extension, the names of METADATA items defined by private
    > extensions SHOULD begin with ":x-"."
    >
    > So, perhaps it's better to name it ":x-keywords"?

    Since RFC 3977, there has been RFC 6648, which deprecates the use of the
    "X-" prefix and similar constructs in application protocols. That's why
    I did not propose that name but directly ":keywords".


    > So, I'll go for ":x-keywords" MAY be instead "Keywords"

    I would rather not provide two different ways to advertise it,
    especially as it has never been standardized in overview data before :-)


    > For OVER, I think the value of this metadata item SHOULD consist of the
    > metadata name, a single space, and then the value, as explained in RFC
    > 3977: "For all subsequent fields that contain headers, the content MUST
    > be the entire header line other than the trailing CRLF. For all
    > subsequent fields that contain metadata, the field consists of the
    > metadata name, a single space, and then the value."

    Oh you're right, I missed that and thought the rule only applied to
    header names, not metadata.


    I am half-tempted to advertise ":keywords" instead of Keywords in the
    next release so as to comply with the protocol (the keywords are not
    present in the article itself), and properly handle "HDR :keywords" vs
    "HDR Keywords" results, the same way "HDR Lines" returns the real header
    field if present.
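
    The distinction between the two HDR requests could look like the
    following Python sketch. The function name and the two dictionaries
    (real header fields of the article, and server-generated metadata) are
    hypothetical stand-ins for whatever storage a server really uses.

```python
from typing import Dict, Optional


def hdr_value(field: str, headers: Dict[str, str], metadata: Dict[str, str]) -> Optional[str]:
    """Answer an HDR request for a single article.

    ":name" requests are served from generated metadata only, while plain
    names are served from the header fields really present in the article.
    """
    if field.startswith(":"):
        return metadata.get(field.lower())
    wanted = field.lower()  # header field names are case-insensitive
    for name, value in headers.items():
        if name.lower() == wanted:
            return value
    return None  # the article has no such header field


if __name__ == "__main__":
    article_headers = {"Subject": "test", "Keywords": "usenet"}
    generated = {":keywords": "usenet,news,overview"}
    print(hdr_value("Keywords", article_headers, generated))   # real header field
    print(hdr_value(":keywords", article_headers, generated))  # generated keywords
```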

    --
    Julien ÉLIE

    "Never slap a deaf man: he loses half the pleasure. He feels the slap
    but does not hear it." (Georges Courteline)

  • From Franck@21:1/5 to All on Sat Sep 3 21:21:05 2022
    XPost: news.software.readers

    Hello Julien,

    >> So, perhaps it's better to name it ":x-keywords"?
    >
    > Since RFC 3977, there has been RFC 6648 which deprecates the use of "X-"
    > prefix and similar constructs in application protocols. That's why I
    > did not propose that name but directly ":keywords".

    You're right!

    > I am half-tempted to advertise ":keywords" instead of Keywords in the
    > next release so as to comply with the protocol (the keywords are not
    > present in the article itself), and properly handle "HDR :keywords" vs
    > "HDR Keywords" results, the same way "HDR Lines" returns the real header
    > field if present.

    I think it's the right choice even if I don't see how this header can be
    useful in any way (because the words are totally unusable).

    Perhaps it would be better to encode the words rather than remove the
    non-ASCII characters?

    Franck

  • From Julien ÉLIE@21:1/5 to All on Sat Sep 10 12:19:28 2022
    XPost: news.software.readers

    Hello Franck,

    >> I am half-tempted to advertise ":keywords" instead of Keywords in the
    >> next release so as to comply with the protocol (the keywords are not
    >> present in the article itself), and properly handle "HDR :keywords" vs
    >> "HDR Keywords" results, the same way "HDR Lines" returns the real
    >> header field if present.
    >
    > I think it's the right choice even if I don't see how this header can
    > be useful in any way (because the words are totally unusable).

    Indeed, there are either too many words or truncated words (in
    languages that are not fully ASCII).


    > Perhaps it would be better to encode the words rather than remove the
    > non-ASCII characters?

    Having MIME-encoded words in this overview field could indeed be a
    solution, or a UTF-8 encoding. However, it would imply extra complexity
    in the server code to handle that encoding: find out the encoding of the
    word (using the right Content-Type in headers or multipart messages...),
    and convert it for the overview field.
    It is a bit of work, and besides I am unsure clients are currently
    using Keywords when present; otherwise I guess the problem of
    internationalized messages would already have popped up!
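
    As an illustration of the MIME-encoded words option, here is a small
    Python sketch based on the standard email.header module. The helper
    names are hypothetical, and it assumes the word has already been decoded
    to Unicode, which is precisely the hard part mentioned above.

```python
from email.header import Header, decode_header


def encode_keyword(word: str) -> str:
    """Return the word unchanged if it is pure ASCII, else as a MIME encoded-word."""
    if word.isascii():
        return word
    return Header(word, charset="utf-8").encode()


def decode_keyword(token: str) -> str:
    """Reverse encode_keyword(), as a news client would have to do."""
    parts = []
    for chunk, charset in decode_header(token):
        if isinstance(chunk, bytes):
            parts.append(chunk.decode(charset or "ascii"))
        else:
            parts.append(chunk)  # plain ASCII tokens come back as str
    return "".join(parts)


if __name__ == "__main__":
    keywords = ["ubuntu", "écrit", "nécessité"]
    encoded = ",".join(encode_keyword(k) for k in keywords)
    print(encoded)
    print([decode_keyword(t) for t in encoded.split(",")])
```

    Pure-ASCII keywords are left untouched, so clients that ignore encoded
    words would still see the same output as today for English text.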


    As a side note, emitting only ASCII characters, as the keywords
    generation currently does, remains compatible with a possible future use
    of MIME-encoded words or direct UTF-8, if we ever do that in a
    standardized :keywords metadata item.
    So, in order to comply with the NNTP protocol, :keywords would already
    be a better choice (instead of Keywords), and I could just leave a note
    in the INN documentation of keywords generation that it is still
    experimental code, essentially usable for messages using only ASCII
    characters, as other characters are stripped by the algorithm.

    --
    Julien ÉLIE

    "I like wrong calculations because they give more accurate results."
    (Jean Arp)

  • From Russ Allbery@21:1/5 to iulius@nom-de-mon-site.com.invalid on Mon Sep 19 15:32:56 2022
    XPost: news.software.readers

    Julien ÉLIE <iulius@nom-de-mon-site.com.invalid> writes:

    > INN (and perhaps other servers) has the possibility to provide keywords
    > in overview data. It advertises "Keywords:full" in response to LIST
    > OVERVIEW.FMT and then adds "Keywords: a,b,c,d" in OVER responses. No
    > Keywords header field is added in the articles, and the contents of an
    > existing one is kept at the beginning of the generated one in overview.
    >
    > I'm wondering whether:
    > - it shouldn't be advertised as ":keywords" instead of "Keywords:full" as
    > the header field is not in the original article.

    I believe that's correct. Keywords:full would imply that it's a copy of a header in the article named Keywords.
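
    A client reading LIST OVERVIEW.FMT could tell the two forms apart along
    these lines. This Python sketch uses a hypothetical function name and
    covers only the syntax discussed in this thread, including the legacy
    "Bytes:" and "Lines:" spellings that RFC 3977 allows for the two
    standard metadata items.

```python
from typing import Tuple


def classify_overview_entry(entry: str) -> Tuple[str, str]:
    """Classify one line of a LIST OVERVIEW.FMT response.

    Returns ("metadata", ":name") or ("header", "Name").
    """
    entry = entry.strip()
    if entry.startswith(":"):
        # Leading colon: data about the article, not a field inside it.
        return ("metadata", entry.lower())
    name, _, suffix = entry.partition(":")
    # Legacy spellings of the two standard metadata items.
    if suffix == "" and name.lower() in ("bytes", "lines"):
        return ("metadata", ":" + name.lower())
    # "Name:" (mandatory fields) or "Name:full" (additional header fields).
    return ("header", name)


if __name__ == "__main__":
    for line in ("Subject:", "Bytes:", ":lines", "Xref:full", "Keywords:full", ":keywords"):
        print(line, "->", classify_overview_entry(line))
```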

    Astonishingly, we don't seem to have set up an IANA registry for metadata
    names in LIST OVERVIEW.FMT, which would have been the normal way of doing
    it, so I think we can just use :keywords without telling anybody.

    > I am unsure though if such a change would break implementations that look
    > for it in overview (but is there any such news client? ...)

    My guess is that no one uses this. It's been in INN for eons, but I think
    it was added in the early days of more open development by one person who
    was enthused about it. It tends to go untouched for long periods of time
    until someone else finds it, thinks it might solve some problems for them,
    and sends in a few fixes. My subjective impression is that most of the
    people who try it end up not continuing to use it. I've periodically
    unbroken it or done some refactoring at various points, but just because
    the code was there, not because anyone was asking for it.

    It's kind of an interesting idea, but text tokenization is a lot more
    complicated than that code, as you're discovering with its total lack of
    understanding of anything other than English. If the body is
    base64-encoded (or even quoted-printable), I suspect it will similarly
    collapse like a house of cards, since I doubt it understands MIME
    structure. And let's not even mention trying to tokenize languages that
    are farther afield from English.
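
    For comparison, this is roughly what understanding MIME structure would
    involve before any tokenization, sketched with Python's standard email
    package. The function name is hypothetical and the sketch says nothing
    about how INN parses articles; it only shows the decoding steps
    (transfer encoding, then charset) that a keyword generator would need.

```python
from email import message_from_bytes, policy


def extract_plain_text(raw_article: bytes) -> str:
    """Return the decoded text/plain body of an article, or "" if there is none."""
    msg = message_from_bytes(raw_article, policy=policy.default)
    # Picks the message itself if it is text/plain, or the first suitable
    # text/plain part of a multipart message.
    part = msg.get_body(preferencelist=("plain",))
    if part is None:
        return ""
    # get_content() undoes base64 or quoted-printable and applies the charset.
    return part.get_content()


if __name__ == "__main__":
    raw = (
        b"Subject: test\r\n"
        b"MIME-Version: 1.0\r\n"
        b"Content-Type: text/plain; charset=utf-8\r\n"
        b"Content-Transfer-Encoding: quoted-printable\r\n"
        b"\r\n"
        b"Un message en fran=C3=A7ais\r\n"
    )
    print(extract_plain_text(raw))
```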

    I'm honestly not sure it's worth the effort of trying to fix, although of course now that we've talked about it someone will probably wonder if it
    will solve their problems and experiment with it again. :)

    --
    Russ Allbery (eagle@eyrie.org) <https://www.eyrie.org/~eagle/>

    Please post questions rather than mailing me directly.
    <https://www.eyrie.org/~eagle/faqs/questions.html> explains why.

  • From Julien ÉLIE@21:1/5 to All on Sun Sep 25 09:34:22 2022
    XPost: news.software.readers

    Hi Russ,

    >> I'm wondering whether:
    >> - it shouldn't be advertised as ":keywords" instead of "Keywords:full" as
    >> the header field is not in the original article.
    >
    > I believe that's correct. Keywords:full would imply that it's a copy of
    > a header in the article named Keywords.
    >
    > Astonishingly, we don't seem to have set up an IANA registry for metadata
    > names in LIST OVERVIEW.FMT, which would have been the normal way of doing
    > it, so I think we can just use :keywords without telling anybody.

    There's indeed no IANA registry for overview metadata items. It should
    be created next time a new metadata item is standardized (if that time
    ever comes for a new useful need).


    > It's kind of an interesting idea, but text tokenization is a lot more
    > complicated than that code, as you're discovering with its total lack of
    > understanding of anything other than English. If the body is
    > base64-encoded (or even quoted-printable), I suspect it will similarly
    > collapse like a house of cards, since I doubt it understands MIME
    > structure. And let's not even mention trying to tokenize languages that
    > are farther afield from English.
    >
    > I'm honestly not sure it's worth the effort of trying to fix, although of
    > course now that we've talked about it someone will probably wonder if it
    > will solve their problems and experiment with it again. :)

    Switching to ":keywords" and adapting the code to properly distinguish
    "OVER Keywords" vs "OVER :keywords" requests will demand a bit of
    effort, as well as testing. This would be used only on the very few
    news servers, if any, which generate keywords. As you say, it's not
    worth it yet, and fixing text tokenization even less so.

    I'll just mention in the documentation of that keyword generation
    that it is experimental and works only on plain-text ASCII, and add a
    bullet in the nnrpd documentation to note that deviation from the
    standard.

    --
    Julien ÉLIE

    "I would rather slip my skin under the sheets for the pleasure of the
    senses than risk it under the flag for the price of petrol." (Raymond
    Devos)
