Forum: >>> Magnum BBS <<<

:keywords metadata item?

From Julien ÉLIE@21:1/5 to All on Wed Aug 10 10:29:29 2022

XPost: news.software.readers

Hi all,

INN (and perhaps other servers) can provide keywords in overview data.
It advertises "Keywords:full" in response to LIST OVERVIEW.FMT and then
adds "Keywords: a,b,c,d" in OVER responses.
No Keywords header field is added to the articles, and the contents of
an existing one are kept at the beginning of the generated one in overview.
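
To make the naming question concrete, here is a minimal Python sketch of how a client might classify the entries of a LIST OVERVIEW.FMT response. The sample list and the function name `parse_overview_fmt` are assumptions made for illustration; they are not taken from INN or from any real client.

```python
def parse_overview_fmt(lines):
    """Classify LIST OVERVIEW.FMT entries as header fields or metadata items."""
    fields = []
    for line in lines:
        entry = line.strip()
        if entry.startswith(":"):
            # A leading colon marks a metadata item, e.g. ":bytes".
            fields.append({"name": entry, "kind": "metadata", "full": False})
        else:
            # "Name:" or "Name:full"; the suffix after the colon is kept as a flag.
            name, _, suffix = entry.partition(":")
            fields.append({"name": name, "kind": "header", "full": suffix == "full"})
    return fields


# Hypothetical server response, ending with the entry discussed in this thread.
SAMPLE_FMT = [
    "Subject:", "From:", "Date:", "Message-ID:", "References:",
    ":bytes", ":lines", "Xref:full", "Keywords:full",
]

for field in parse_overview_fmt(SAMPLE_FMT):
    print(field["kind"], field["name"])
```

With "Keywords:full", such a client would treat the field as a copy of a header present in the article, which is exactly the assumption questioned below.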

I'm wondering whether:
- it shouldn't be advertised as ":keywords" instead of "Keywords:full"
as the header field is not in the original article.

I am unsure though if such a change would break implementations that
look for it in overview (but is there any such news client? ...)

- and naturally before that, the question of whether the feature should
remain in INN has to be raised.
Currently, it only takes all the words, removes punctuation, removes a
list of "known" words (like pronouns...), strips non ASCII characters
(sic) and lists the sorted result by number of appearances.
This should obviously be improved to be smarter (but is there any need
for that?)
So I would suggest that the code generating such basic keywords be removed
unless there's a real current use case behind it. (That would not prevent
a possible reintegration in the future with a smarter algorithm.)
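
The steps listed above can be approximated in a few lines of Python. This is only a sketch of the described behaviour, not INN's actual code: the stop list, the minimum word length, the limit of 30 keywords and the alphabetical tie-break are all assumptions chosen for the example.

```python
import re
from collections import Counter

# Tiny illustrative stop list; the real list of "known" words is not given here.
STOP_WORDS = {"the", "and", "you", "they", "that", "this", "with", "for", "are", "was"}


def generate_keywords(body, max_keywords=30, min_length=3):
    """Return a comma-separated keyword list built by naive word counting."""
    # Strip non-ASCII characters; this is what turns "écrit" into "crit".
    ascii_only = body.encode("ascii", errors="ignore").decode("ascii")
    # Keep only runs of letters, which drops punctuation and digits.
    words = re.findall(r"[a-z]+", ascii_only.lower())
    counts = Counter(
        word for word in words
        if len(word) >= min_length and word not in STOP_WORDS
    )
    # Most frequent first; ties broken alphabetically (an assumption).
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ",".join(word for word, _ in ranked[:max_keywords])


print(generate_keywords("Spam, spam and more spam! The filter écrit écrit."))
# prints: spam,crit,filter,more
```

The truncated "crit" in the output shows the same effect as in the French example further down.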

Here are examples of what is currently generated:

[from latest discussion "Re: naming concept of newsgroups"]
Keywords: newsgroup,news,net,eagle,eyrie,used,org,hierarchies,questions,wondering,messages,general,periods,always,https,names,taken,think,back,don,dot,www,configuration,distinction,distributed,punctuation,introduced,presumably,processing,lowercase,sethhurst,
although,choosing,directly,explains,original,predates,renaming,software,truscott,allbery,analogy,control,current,however,mailing,prevent,colons,daniel,dashes,domain,insead,levels,naming,picked,please,prefix,rather,scheme,trivia,aware,based

[from news.lists.filters!]
Keywords: message,ncm,begin,body,spam,notice,pgp,googlegroups,spamassassin,signature,pasdenom,signed,usenet,info,com,end,est,lkabxolsadsxuurahpalo,trfvzkfamlybfeacgkqie,fxquozybxsows,hfcozufhlkorn,pothgfqhddwoc,tcbhokunxbviy,probablement,xkcmgcjvkghy,wzykxofxigf,
xjjqrzmwsth,iezppdjfhb,referenced,tememrtmfh,utzjdlgunu,vznwaqwahg,akefkllaj,cyimrhktz,following,plmxkyvqo,satellite,xijwqvhwm,zzpginwcr,detected,ethernet,followup,koxyluau,pikaxokm,probably,english,gzkzeqt,headers,reseaux,tdczttb,version

[from a spam...]
Keywords: drug,www,channel,running,https,com,ucdtdenqhwst,xfzsllvprc,bitchute,exorcist,military,brendon,connell,talpiot,youtube,zealand,mafias,anzus,below,dsfug,endtx,world,best,html,http,runs,bet,new,ops,org

[from an article written in French]
Keywords: crit,nous,pas,recommencer,comptons,magicien,chaines,chanson,declara,oubliez,cessit,gestes,ubuntu,actes,barri,faire,acte,gump,joli,mage,mais,marc,pour,vous,cet,des,les,lou,non,res,sur,une

Obviously, for messages written in a language other than English, the
generated keywords are wrong and unusable. And even for English, I am
unsure the generated keywords are really usable (there are too many of
them, and they are not specific enough).

--
Julien ÉLIE

« J'ai un copain, il est pilote d'essai… Enfin, il ne l'est pas encore ;
pour l'instant, il essaie d'être pilote ! » (Raymond Devos)

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Franck@21:1/5 to All on Sat Aug 20 11:23:57 2022

XPost: news.software.readers

Hello Julien,

INN (and perhaps other servers) has the possibility to provide keywords

Not mine.

I'm wondering whether:
- it shouldn't be advertised as ":keywords" instead of "Keywords:full"
as the header field is not in the original article.

Article "metadata" is data about articles that does not occur within the article itself, so I will go for ":keywords".

BUT only two metadata items are defined in RFC 3977, ":lines" and
":bytes". RFC 3977 says: "To avoid the risk of a clash with a future
registered extension, the names of METADATA items defined by private
extensions SHOULD begin with ":x-"."

So, perhaps it's better to name it ":x-keywords"?

For LIST OVERVIEW.FMT, I think you have a choice, because RFC 3977 says that
the metadata items ":bytes" and ":lines" MAY instead be "Bytes" and "Lines",
even though they refer to metadata, not headers.

So, I'll go for ":x-keywords", which MAY instead be "Keywords", even though it
refers to metadata, not a header :-)

For OVER, I think the field for this metadata item SHOULD consist of the
metadata name, a single space, and then the value, as explained in RFC
3977: "For all subsequent fields that contain headers, the content MUST
be the entire header line other than the trailing CRLF. For all
subsequent fields that contain metadata, the field consists of the
metadata name, a single space, and then the value."

But, of course, I may be wrong!
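
The rule quoted above from RFC 3977 can be illustrated with a short Python sketch that builds one OVER response line. The helper names and all sample values are hypothetical; only the field layout (tab-separated fields, header line versus "name, space, value" for metadata) follows the quoted text.

```python
def over_extra_field(name, value):
    """Format one OVER field that follows the seven mandatory ones."""
    if name.startswith(":"):
        # Metadata item: the name, a single space, then the value.
        return f"{name} {value}"
    # Header field: the entire header line, without the trailing CRLF.
    return f"{name}: {value}"


def over_line(number, mandatory, extras):
    """Join the article number, the mandatory fields and the extra fields with tabs."""
    fields = [str(number), *mandatory]
    fields.extend(over_extra_field(name, value) for name, value in extras)
    return "\t".join(fields)


# Made-up article data: subject, from, date, message-id, references, bytes, lines.
mandatory = [
    "Test article", "poster@example.invalid", "10 Aug 2022 10:29:29 +0000",
    "<id1@example.invalid>", "", "1234", "17",
]
extras = [
    ("Xref", "news.example.invalid misc.test:3000"),
    (":keywords", "newsgroup,news,overview"),
]
print(over_line(3000, mandatory, extras))
```

Advertised as "Keywords:full", the last field would instead read "Keywords: newsgroup,news,overview", as if it were copied from the article.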

Franck


From Julien ÉLIE@21:1/5 to All on Thu Sep 1 20:40:54 2022

XPost: news.software.readers

Hi Franck,

I'm wondering whether:
- it shouldn't be advertised as ":keywords" instead of "Keywords:full"
as the header field is not in the original article.

Article "metadata" is data about articles that does not occur within the article itself, so I will go for ":keywords".

Agreed as, like you say, it is not in the article itself.

BUT only two metadata items are defined in RFC 3977, ":lines" and
":bytes". RFC 3977 says: "To avoid the risk of a clash with a future
registered extension, the names of METADATA items defined by private extensions SHOULD begin with ":x-"."

So, perhaps it's better to name it ":x-keywords"?

Since RFC 3977, there has been RFC 6648 which deprecates the use of "X-"
prefix and similar constructs in application protocols. That's why I
did not propose that name but directly ":keywords".

So, I'll go for ":x-keywords" MAY be instead "Keywords"

I would prefer not to provide two different ways to advertise that, especially as it has never been standardized in overview data before :-)

For OVER, I think the field for this metadata item SHOULD consist of the metadata name, a single space, and then the value, as explained in RFC
3977: "For all subsequent fields that contain headers, the content MUST
be the entire header line other than the trailing CRLF. For all
subsequent fields that contain metadata, the field consists of the
metadata name, a single space, and then the value."

Oh you're right, I missed that and thought the rule only applied to
header names, not metadata.

I am half-tempted to advertise ":keywords" instead of Keywords in the
next release so as to comply with the protocol (the keywords are not
present in the article itself), and properly handle "HDR :keywords" vs
"HDR Keywords" results, the same way "HDR Lines" returns the real header
field if present.
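
The distinction between "HDR Keywords" and "HDR :keywords" could look like the following Python sketch. The function `hdr_lookup`, the dictionaries and the choice of an empty string for a missing field are assumptions for illustration; they do not describe nnrpd's implementation.

```python
def hdr_lookup(field, article_headers, metadata):
    """Return the value an HDR-style request would report for one article."""
    if field.startswith(":"):
        # Metadata item: generated by the server, never read from the article.
        return metadata.get(field, "")
    # Real header field: header names are matched case-insensitively.
    for name, value in article_headers.items():
        if name.lower() == field.lower():
            return value
    return ""


# Hypothetical article that already carries its own Keywords header field.
article_headers = {"Subject": "naming concept", "Keywords": "hierarchy, naming"}
metadata = {":lines": "42", ":keywords": "hierarchy,naming,newsgroup,news,net"}

print(hdr_lookup("Keywords", article_headers, metadata))
print(hdr_lookup(":keywords", article_headers, metadata))
```

The first call reports only what the poster wrote, while the second reports the generated list, which mirrors how "Lines" and ":lines" are kept apart.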

--
Julien ÉLIE

« Il ne faut jamais gifler un sourd : il perd la moitié du plaisir. Il
sent la gifle mais il ne l'entend pas. » (Georges Courteline)


From Franck@21:1/5 to All on Sat Sep 3 21:21:05 2022

XPost: news.software.readers

Hello Julien,

So, perhaps it's better to name it ":x-keywords"?

Since RFC 3977, there has been RFC 6648 which deprecates the use of "X-" prefix and similar constructs in application protocols. That's why I
did not propose that name but directly ":keywords".

You're right!

I am half-tempted to advertise ":keywords" instead of Keywords in the
next release so as to comply with the protocol (the keywords are not
present in the article itself), and properly handle "HDR :keywords" vs
"HDR Keywords" results, the same way "HDR Lines" returns the real header
field if present.

I think it's the right choice even if I don't see how this header can be
useful in any way (because the words are totally unusable).

Perhaps it would be better to encode the words rather than remove the
non-ASCII characters?

Franck


From Julien ÉLIE@21:1/5 to All on Sat Sep 10 12:19:28 2022

XPost: news.software.readers

Bonjour Franck,

I am half-tempted to advertise ":keywords" instead of Keywords in the
next release so as to comply with the protocol (the keywords are not
present in the article itself), and properly handle "HDR :keywords" vs
"HDR Keywords" results, the same way "HDR Lines" returns the real
header field if present.

I think it's the right choice even if I don't see how this header can be useful in any way (because the words are totally unusable).

Indeed, there are either too many words or truncated ones (in languages that are not fully covered by ASCII).

Perhaps it would be better to encode the words rather than remove the non-ASCII characters?

Having MIME-encoded words in this overview field could indeed be a
solution, or a UTF-8 encoding. However, it would imply extra complexity
in the server code to handle that encoding: find out the encoding of the
word (using the right Content-Type in headers or multipart messages...),
and convert it for the overview field.
It is a bit of work, and besides I am unsure clients are currently using Keywords when present; otherwise I guess the problem of
internationalized messages would already have popped up!
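
As a rough idea of what MIME-encoded words would involve, here is a small Python sketch using the standard `email.header` module. The helper names are hypothetical, and always encoding from UTF-8 is an assumption; the harder part mentioned above (finding the article's real charset) is not covered.

```python
from email.header import Header, decode_header, make_header


def encode_keyword(word):
    """Return an ASCII word unchanged, or an RFC 2047 encoded-word otherwise."""
    if word.isascii():
        return word
    return Header(word, charset="utf-8").encode()


def decode_keyword(token):
    """Turn a possibly encoded keyword back into readable text."""
    return str(make_header(decode_header(token)))


keywords = ["news", "écrit", "chanson"]
encoded = ",".join(encode_keyword(word) for word in keywords)
print(encoded)
print([decode_keyword(token) for token in encoded.split(",")])
```

Plain ASCII keywords pass through untouched, so a list generated today would remain valid under such a scheme.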

As a side note, only having ASCII chars as is currently done in the
keywords generation is compatible with a possible use of future
MIME-encoded words or direct UTF-8, if we ever do that in a standardized :keywords metadata item.
So, in order to comply with the NNTP protocol, :keywords would already
be a better choice (instead of Keywords), and I could just leave a note
in the INN documentation of keywords generation that it is still
experimental code, essentially usable for messages using only ASCII
characters as other characters are stripped by the algorithm.

--
Julien ÉLIE

« J'aime les calculs faux car ils donnent des résultats plus justes. »
(Jean Arp)


From Russ Allbery@21:1/5 to iulius@nom-de-mon-site.com.invalid on Mon Sep 19 15:32:56 2022

XPost: news.software.readers

Julien ÉLIE <iulius@nom-de-mon-site.com.invalid> writes:

INN (and perhaps other servers) has the possibility to provide keywords
in overview data. It advertises "Keywords:full" in response to LIST OVERVIEW.FMT and then adds "Keywords: a,b,c,d" in OVER responses. No Keywords header field is added in the articles, and the contents of an existing one is kept at the beginning of the generated one in overview.

I'm wondering whether:
- it shouldn't be advertised as ":keywords" instead of "Keywords:full" as
the header field is not in the original article.

I believe that's correct. Keywords:full would imply that it's a copy of a header in the article named Keywords.

Astonishingly, we don't seem to have set up an IANA registry for metadata
names in LIST OVERVIEW.FMT, which would have been the normal way of doing
it, so I think we can just use :keywords without telling anybody.

I am unsure though if such a change would break implementations that look
for it in overview (but is there any such news client? ...)

My guess is that no one uses this. It's been in INN for eons, but I think
it was added in the early days of more open development by one person who
was enthused about it. It tends to go untouched for long periods of time
until someone else finds it, thinks it might solve some problems for them,
and sends in a few fixes. My subjective impression is that most of the
people who try it end up not continuing to use it. I've periodically
unbroken it or done some refactoring at various points, but just because
the code was there, not because anyone was asking for it.

It's kind of an interesting idea, but text tokenization is a lot more complicated than that code, as you're discovering with its total lack of understanding of anything other than English. If the body is
base64-encoded (or even quoted-printable), I suspect it will similarly collapse like a house of cards, since I doubt it understands MIME
structure. And let's not even mention trying to tokenize languages that
are farther afield from English.
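
For comparison, a tokenizer that understood MIME structure would first have to undo the transfer encoding and charset of each text part. The following Python sketch does that with the standard `email` package; the function name `extract_text` is hypothetical, and this is not what the INN code does.

```python
from email import message_from_bytes, policy


def extract_text(raw_article):
    """Return the decoded text/plain parts of a raw article given as bytes."""
    message = message_from_bytes(raw_article, policy=policy.default)
    chunks = []
    for part in message.walk():
        if part.get_content_type() != "text/plain":
            continue
        try:
            # get_content() undoes base64 or quoted-printable and decodes the charset.
            chunks.append(part.get_content())
        except LookupError:
            # Unknown charset label: skip the part rather than guess.
            continue
    return "\n".join(chunks)
```

Only after this step would word counting see "écrit" rather than a run of base64 characters.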

I'm honestly not sure it's worth the effort of trying to fix, although of course now that we've talked about it someone will probably wonder if it
will solve their problems and experiment with it again. :)

--
Russ Allbery (eagle@eyrie.org) <https://www.eyrie.org/~eagle/>

Please post questions rather than mailing me directly.
<https://www.eyrie.org/~eagle/faqs/questions.html> explains why.


From Julien ÉLIE@21:1/5 to All on Sun Sep 25 09:34:22 2022

XPost: news.software.readers

Hi Russ,

I'm wondering whether:
- it shouldn't be advertised as ":keywords" instead of "Keywords:full" as
the header field is not in the original article.

I believe that's correct. Keywords:full would imply that it's a copy of a header in the article named Keywords.

Astonishingly, we don't seem to have set up an IANA registry for metadata names in LIST OVERVIEW.FMT, which would have been the normal way of doing
it, so I think we can just use :keywords without telling anybody.

There's indeed no IANA registry for overview metadata items. It should
be created next time a new metadata item is standardized (if that time
ever comes for a new useful need).

It's kind of an interesting idea, but text tokenization is a lot more complicated than that code, as you're discovering with its total lack of understanding of anything other than English. If the body is
base64-encoded (or even quoted-printable), I suspect it will similarly collapse like a house of cards, since I doubt it understands MIME
structure. And let's not even mention trying to tokenize languages that
are farther afield from English.

I'm honestly not sure it's worth the effort of trying to fix, although of course now that we've talked about it someone will probably wonder if it
will solve their problems and experiment with it again. :)

Switching to ":keywords" and adapting the code to properly distinguish
"HDR Keywords" vs "HDR :keywords" requests will demand a bit of
effort, as well as testing. This would be used only on the very few
news servers, if any, which generate keywords. As you say, it's not
worth it yet, and fixing text tokenization still less.

I'll just mention in the documentation of that keyword generation
that it is experimental and works only on plain-text ASCII, and add a
bullet in the nnrpd documentation to note that difference
from the standard protocol.

--
Julien ÉLIE

« Je préfère glisser ma peau sous des draps pour le plaisir des sens que
de la risquer sous les drapeaux pour le prix de l'essence. » (Raymond
Devos)

