Hi all
My problem is not directly related to Ada; it is about how to solve it in general.
Also, I am writing via the web interface of Google Groups :(
I have to do statistics on the results of antimicrobial susceptibility tests.
I have to keep only one strain per patient: the most resistant one.
Until now I have been doing it manually by staring for hours at Excel sheets. I am trying to get it automated but I don't know how to solve my problem.
Sorry, but I found your problem description impossible to understand.
Try to describe more clearly the experiment that is done, the structure
of the data the experiment provides (the meaning of the Excel rows and columns), and the statistic you want to compute.
Also, if you do not intend to implement the solution in Ada, this is not
the right group to discuss it.
On Monday, 27 December 2021 at 12:16:27 UTC+1, Niklas Holsti wrote:
> Sorry, but I found your problem description impossible to understand.
> Try to describe more clearly the experiment that is done, the structure
> of the data the experiment provides (the meaning of the Excel rows and
> columns), and the statistic you want to compute.
Sorry, I tried to keep it short; it was too short.
Columns are the antimicrobial drugs.
Rows are the microorganisms.
So every cell contains a result of S, I, R or is simply empty.
S = Susceptible
I = Intermediate
R = Resistant
empty cell < S < I < R
If a patient has 3 strains of the same microorganism but with
different resistance profiles I have to find the most resistant
one. Or if they are different I keep them all.
I have no idea how to explain what I am doing to the compiler.
Why I would choose result from strain B over the result from strain A.
strain A: SSSRSS
strain B: SSRRRS
Simply counting the number of S, I and R doesn't work. Checksum with/without weight for the column number doesn't
work either.
Even if I get a correct result, I still have the same problem as before: why result B over result A?
Laurent <lut...@icloud.com> writes:
> Until now I have been doing it manually by staring for hours at Excel
> sheets. I am trying to get it automated but I don't know how to solve
> my problem.
You must go through some mental process while staring at the
spreadsheets; what's that process? It can't involve checksums!
In a post below, you said you had to choose the most resistant, or if different all of them, which doesn't make sense. Are you perhaps
thinking of ties? in which case you must have some notion of scoring
profiles so you can determine which profiles come equal-first.
Does RSSSSS score higher or lower than SIIIII?
Laurent <lut...@icloud.com> writes:
> I have no idea how to explain what I am doing to the compiler.
I think when you can explain it to people, you'll be able to code it. I
am still struggling to understand what you need.
> Why I would choose result from strain B over the result from strain A.
> strain A: SSSRSS
> strain B: SSRRRS
Let's space it out
          drug 1  drug 2  drug 3  drug 4  drug 5  drug 6
strain A    S       S       S       R       S       S
strain B    S       S       R       R       R       S
You want to choose B because it is resistant to more drugs, yes?
I think, from the ordering you give, you need a measure that treats an R
as "more important" than any "I" which is "more important" than an "S".
(We will come to empty cells later.)
I think you need to treat the number of Rs, Is and Ss like digits in a number. In base 10, the strains score
          R  S  I
strain A  1  5  0  = 150
strain B  3  3  0  = 330
Now, in fact, you don't need to use base 10. The smallest base you can
use is one more than the maximum number of test results. If there can
be up to 16 tests (say) the score is
n(R)*17*17 + n(S)*17 + n(I).
If this suits your needs, we can consider empty cells later on. It's
not at all clear to me how to compare
strain C R____
strain D RRSSSS
Strain C is "less resistant" but only because there is not enough information. In fact it seems more serious as it is resistant to all
tested drugs.
And then what about
strain D SR
strain E RS
Do you need to weight the drugs to break ties? I.e. is drug x more
important than drug y if x < y?
--
Ben.
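A minimal sketch of the counts-as-digits score described above, in Python for brevity. The function name `count_score` and its `order` parameter are made up for illustration. The default order R, S, I follows the formula as posted; given the stated ranking S < I < R, `order="RIS"` may be what is actually wanted.

```python
def count_score(profile, base=17, order="RSI"):
    """Treat the counts of each result letter as digits of a base-`base` number.

    `base` must exceed the maximum number of tests per profile, otherwise a
    count could spill into the next digit. Empty cells are simply not counted.
    """
    score = 0
    for letter in order:
        score = score * base + profile.count(letter)
    return score


# The two strains from the example, scored in base 10 as in the post.
print(count_score("SSSRSS", base=10))  # 150
print(count_score("SSRRRS", base=10))  # 330

# Profiles with the same letter counts always tie, whatever the base.
print(count_score("SR") == count_score("RS"))  # True
```

The last line shows the limitation raised at the end of the post: SR and RS cannot be told apart by counts alone.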
On Mon, 27 Dec 2021 04:29:06 -0800 (PST), Laurent
<lutgenl@icloud.com> declaimed the following:
> Why I would choose result from strain B over the result from strain
> A.
> strain A: SSSRSS strain B: SSRRRS
> Simply counting the number of S, I and R doesn't work. Checksum
> with/without weight for the column number doesn't work either.
I wouldn't expect a checksum to be of any use, since the idea of
most checksums (and CRCs) is to be able to verify that a data
sequence has not been corrupted. Checksums don't "rank" data.
On 2021-12-27 19:41, Dennis Lee Bieber wrote:
> I wouldn't expect a checksum to be of any use, since the idea of
> most checksums (and CRCs) is to be able to verify that a data
> sequence has not been corrupted. Checksums don't "rank" data.
I believe that Laurent does not mean "checksum" in its usual meaning,
but a numerical "score" computed as a sum of terms multiplied by
weights. Whether such a score can solve Laurent's problem is not clear.
Yes, those are the cases which are annoying me.
That's why I came up with the idea of multiplying the value of the result (S=1, I=2 and R=3) by the position of the value.
I tried it with triplets, but there will still be cases where different results give the same numeric value.
Ignoring empty cells for the moment.
Strain F: SSR (1*1 + 2*1 + 3*3) = 12 and Strain G: RRS (1*3 + 2*3 + 3*1) = 12 give the same numerical value, but they are different resistance profiles.
In this case I would keep both.
The results are much longer than only 3 values, so the chance of collisions is higher.
R R R R R S R R R S S S R S S => numeric: 1812180608
R R R R R S R R R R S S S S S => numeric: 1812180806
I have to keep both, and that was an easy one: only 2 to compare, not 5.
Yes, there is a hierarchy in the drugs, but that information is not available in the exported results I work with.
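The collision can be reproduced with a short Python sketch. The names `weighted_sum` and `triplet_code` are illustrative, and the triplet scheme (one position-weighted sum per group of three results, written as two digits) is inferred from the two 15-result examples above rather than stated anywhere.

```python
VALUE = {"S": 1, "I": 2, "R": 3}


def weighted_sum(profile):
    """Sum of (1-based position) * (value of the result at that position)."""
    return sum(pos * VALUE[c] for pos, c in enumerate(profile, start=1))


def triplet_code(profile):
    """Concatenate the two-digit weighted sums of consecutive groups of three."""
    groups = [profile[i:i + 3] for i in range(0, len(profile), 3)]
    return "".join(f"{weighted_sum(g):02d}" for g in groups)


# Different profiles, same score: the sum cannot tell them apart.
print(weighted_sum("SSR"), weighted_sum("RRS"))  # 12 12

# The two 15-result rows from the post.
print(triplet_code("RRRRRSRRRSSSRSS"))  # 1812180608
print(triplet_code("RRRRRSRRRRSSSSS"))  # 1812180806
```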
On Monday, 27 December 2021 at 14:14:42 UTC+1, Ben Bacarisse wrote:
> You want to choose B because it is resistant to more drugs, yes?
Yes indeed
> strain C R____
> strain D RRSSSS
> Strain C is "less resistant" but only because there is not enough
> information. In fact it seems more serious as it is resistant to all
> tested drugs.
Strain C is probably garbage and I would remove it. With a bit of luck I will have the result with the same sample Id which would be complete.
> And then what about
> strain D SR
> strain E RS
Yes, those are the cases which are annoying me.
That's why I came up with the idea of multiplying the value of the result
(S=1, I=2 and R=3) by the position of the value. I tried it with
triplets, but there will still be cases where different results
give the same numeric value. Ignoring empty cells for the moment.
Strain F: SSR (1*1 + 2*1 + 3*3) = 12 and Strain G: RRS (1*3 + 2*3 + 3*1) = 12
give the same numerical value, but they are different resistance
profiles. In this case I would keep both.
How to prevent that from happening?
Laurent <lut...@icloud.com> writes:
> How to prevent that from happening.
Can you first say why the suggestion I made is not helpful?
--
Ben.
On Monday, 27 December 2021 at 21:49:18 UTC+1, Ben Bacarisse wrote:
> Can you first say why the suggestion I made is not helpful?
You mean that one:
> I think you need to treat the number of Rs, Is and Ss like digits in a
> number. In base 10, the strains score
>           R  S  I
> strain A  1  5  0  = 150
> strain B  3  3  0  = 330
Different resistance profiles same result:
> Also, if you do not intend to implement the solution in Ada, this is not
> the right group to discuss it.
I would very much prefer to solve it in Ada, but at work I am stuck with
Excel and VBA, which is better than doing it manually. After a few hours
staring at a screen with thousands of rows of results... If I get an Ada
solution I can adapt it. I am just limited to no access/pointers in VBA,
which shouldn't be required?
"Laurent" <lut...@icloud.com> wrote in message news:49538254-21ed-4fd0...@googlegroups.com...
> On Monday, 27 December 2021 at 12:16:27 UTC+1, Niklas Holsti wrote:
> ...
>> Also, if you do not intend to implement the solution in Ada, this is not
>> the right group to discuss it.
> I would very much prefer to solve it in Ada but at work I am stuck with
> Excel and VBA which is better than doing it manually. After a few hours
> staring at a screen with thousands of rows of results... If I get an Ada
> solution I can adapt it. Just limited to no access/pointers in VBA which
> shouldn't be required?
Hybrid Ada-spreadsheet solutions are possible. It's quite easy to read/write .csv files in Ada, and those can be easily imported/exported from any spreadsheet program (I've been using LibreOffice Calc, but Excel is similar).
For an example, the ACATS grading tools essentially work by expecting the vendor (or a third party) to provide a tool that converts compilation
results into a .csv file. The .csv file(s) are then read by the grading tool and compared to required results to provide a grade. But it also can be read into a spreadsheet for sanity checking as well as additional analysis.
Similarly (and probably more useful to you), I've used spreadsheet data for various traffic in AdaIC (retrieved from Google) as input to Ada programs that analyze the data to provide information that Google is unable to (in particular, usage of the various Ada standards, which are split up into
usage of several hundred separate files). I then take the results of the Ada program (which is also a .csv file), open that, and paste the results into a previously created spreadsheet that generates charts for showing to management. (Even highly skilled programmers don't like looking through columns of numbers for trends. :-)
But you do have to be able to describe the results that you are looking for. Having read the entire thread, I'm more confused than I started. :-) I suspect when you can describe your problem algorithmically, the solution
will be obvious. Good luck finding a solution.
Randy.
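As a rough illustration of the .csv route (shown in Python rather than Ada only to keep it short), the sketch below reads an exported sheet and groups the result rows by patient and microorganism. The column layout, the sample data and the name `load_profiles` are assumptions; the real export will differ.

```python
import csv
import io
from collections import defaultdict

# Hypothetical export: two identifying columns, then one column per drug.
SAMPLE = """patient,organism,drug1,drug2,drug3
P1,E. coli,S,R,S
P1,E. coli,R,R,S
P2,E. coli,S,,R
"""


def load_profiles(text):
    """Return (drug names, {(patient, organism): [result rows]})."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    groups = defaultdict(list)
    for row in reader:
        patient, organism, *results = row
        groups[(patient, organism)].append(results)
    return header[2:], dict(groups)


drugs, groups = load_profiles(SAMPLE)
for key, rows in groups.items():
    # Each group holds all strains of one organism for one patient.
    print(key, rows)
```

Empty cells come through as empty strings, so they can be ranked below S in whatever comparison is applied to each group afterwards.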
Laurent <lut...@icloud.com> writes:
> You mean that one:
>> I think you need to treat the number of Rs, Is and Ss like digits in a
>> number. In base 10, the strains score
>>           R  S  I
>> strain A  1  5  0  = 150
>> strain B  3  3  0  = 330
> Different resistance profiles same result:
I don't yet understand the requirements so I am taking it in stages.
The first requirement seemed to be "more or less resistant". To do that
you can use digits in a large enough base but this will make the number
of Rs, Ss and Is paramount. Is that acceptable as a first step?
In order to help people to be able to make further suggestions, maybe
you could give the relative ordering you would like to see between the following sets of profiles. For example, between SSR, SRS and RSS, I
think the order you want is RSS > SRS > SSR.
1: SSR, SRS, RSS
2: RSI, RIS, SRI, SIR, IRS, ISR
3: SSSR, SSRS, SRSS, RSSS
4: RRSSS, RSSSR, RIIII, SRIII, RSIII, IIIRS, IIISR
It's possible you could make do with an extra field (or digits) that
gives some measure of the relative ordering between otherwise similar
sequences. For example, using base 10 (for convenience of arithmetic)
both RRSSI and RSRSI would score 212xx but the last xx would reflect the
positioning of the results in the sequence. There are lots of ways to do
this. One way would be to use, as you were thinking, some sort of weighted
count. Using S=0, I=1 and R=2 with weights
54321
RRSSI scores 2*10000 + 1*1000 + 2*100 + 2*(5+4) + 0*(3+2) + 1*1 = 21219
RSRSI scores 2*10000 + 1*1000 + 2*100 + 2*(5+3) + 0*(4+2) + 1*1 = 21217
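The arithmetic above can be checked with a small Python sketch. `combined_score` is a hypothetical name; the base-10 layout assumes fewer than 10 results of each kind and a positional part below 100, as in the five-result example.

```python
VALUE = {"S": 0, "I": 1, "R": 2}


def combined_score(profile):
    """Counts of R, I and S as leading digits, then a position-weighted tail."""
    n = len(profile)
    head = (profile.count("R") * 10000
            + profile.count("I") * 1000
            + profile.count("S") * 100)
    # Weights run n, n-1, ..., 1 from the first result to the last.
    tail = sum(VALUE[c] * (n - i) for i, c in enumerate(profile))
    return head + tail


print(combined_score("RRSSI"))  # 21219
print(combined_score("RSRSI"))  # 21217
```

For longer rows the tail can exceed two digits and run into the count digits, so the multipliers would need to grow with the row length.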
If you absolutely must never get duplicate numbers, but you still want
to preserve a strict specified ordering, I think you will have much more
work to do.
Getting a unique number for each case is trivial (but the ordering will
be wrong) and getting an ordering that rates every R > every S > every I
is also trivial, but there will be lots of duplicates. It's finding the balance that's going to be hard.
--
Ben.
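To illustrate the first "trivial" case, a unique number per profile, here is a hedged Python sketch that reads a profile as a base-4 number. The digit assignment (empty = 0, S = 1, I = 2, R = 3) and the name `positional_code` are assumptions, and uniqueness only holds when all profiles cover the same number of drugs.

```python
DIGIT = {"_": 0, "S": 1, "I": 2, "R": 3}


def positional_code(profile):
    """Read the profile as a base-4 number, first drug most significant."""
    code = 0
    for c in profile:
        code = code * 4 + DIGIT[c]
    return code


# SSR and RRS no longer collide.
print(positional_code("SSR"), positional_code("RRS"))  # 23 61

# But the first column now outweighs everything after it.
print(positional_code("RS") > positional_code("SR"))  # True
```

The second comparison is the sense in which the ordering is "wrong": RS outranks SR purely because of column order, not because it is more resistant.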
On Tuesday, 28 December 2021 at 01:29:57 UTC+1, Ben Bacarisse wrote:
>> Different resistance profiles same result:
> I don't yet understand the requirements so I am taking it in stages.
> The first requirement seemed to be "more or less resistant". To do that
> you can use digits in a large enough base but this will make the number
> of Rs, Ss and Is paramount. Is that acceptable as a first step?
The requirements are: one strain of a given microorganism per patient.
Either the most resistant one, or all of them if they have different profiles:
SRS vs RRS => last one, more Rs
SRS vs RSR => both, different profiles
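One possible reading of these two examples is a column-by-column comparison: a profile is dropped only when another profile is at least as resistant for every drug and strictly more resistant for at least one. The Python sketch below follows that reading; it is an interpretation, not a rule stated in the thread, and the names `RANK`, `dominates` and `keep_most_resistant` are illustrative. Empty cells are written as `_` and ranked lowest, following empty < S < I < R.

```python
RANK = {"_": 0, "S": 1, "I": 2, "R": 3}


def dominates(a, b):
    """True if a is at least as resistant as b for every drug and differs."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same drugs")
    pairs = [(RANK[x], RANK[y]) for x, y in zip(a, b)]
    return all(x >= y for x, y in pairs) and any(x > y for x, y in pairs)


def keep_most_resistant(profiles):
    """Drop exact duplicates, then drop every profile dominated by another."""
    unique = list(dict.fromkeys(profiles))
    return [p for p in unique if not any(dominates(q, p) for q in unique)]


print(keep_most_resistant(["SRS", "RRS"]))        # ['RRS']
print(keep_most_resistant(["SRS", "RSR"]))        # ['SRS', 'RSR']
print(keep_most_resistant(["SSSRSS", "SSRRRS"]))  # ['SSRRRS']
```

No numeric score is needed under this reading, and the comparison uses only loops over characters, so it should carry over to VBA without pointers.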
> In order to help people to be able to make further suggestions, maybe
> you could give the relative ordering you would like to see between the
> following sets of profiles. For example, between SSR, SRS and RSS, I
> think the order you want is RSS > SRS > SSR.
> 1: SSR, SRS, RSS
> 2: RSI, RIS, SRI, SIR, IRS, ISR
> 3: SSSR, SSRS, SRSS, RSSS
> 4: RRSSS, RSSSR, RIIII, SRIII, RSIII, IIIRS, IIISR
The order of the results is given by the ID of the drug in the extraction tool.
I could probably order them by family and hierarchy of potency, but
would that make a difference?
> Using S=0, I=1 and R=2 with weights
> 54321
> RRSSI scores 2*10000 + 1*1000 + 2*100 + 2*(5+4) + 0*(3+2) + 1*1 = 21219
> RSRSI scores 2*10000 + 1*1000 + 2*100 + 2*(5+3) + 0*(4+2) + 1*1 = 21217
So to be sure that I am following:
2*(5+4) = value of R (=2) * position of R(@5 and @4)
2*(5+3) = value of R (=2) * position of R(@5 and @3)
0*(3+2) = value of S (=0) * position of S(@3 and @2)
0*(4+2) = value of S (=0) * position of S(@4 and @2)
1*1 = value of I (=1) * position of I (@1)
Is 2*10000 + 1*1000 + 2*100 just used as padding? So 212 could be any other number?
But in this example I would have to keep both as drug 5,2 and 1 are common
to both results but 4 and 3 are unique.
The score would be completely misleading.
So if my table has a width of 20 columns the first column would be
10^20, the next 10^19,.... +/- a few 0s off?
I would have to implement it and see what I get as result.
> If you absolutely must never get duplicate numbers, but you still want
> to preserve a strict specified ordering, I think you will have much more
> work to do.
> Getting a unique number for each case is trivial (but the ordering will
> be wrong) and getting an ordering that rates every R > every S > every I
> is also trivial, but there will be lots of duplicates. It's finding the
> balance that's going to be hard.
I have prepared a cleaned-up Excel workbook with only the duplicates which pose problems. The ones I would keep have an orange ID.
I could upload it to GitHub, if that helps in understanding the different cases.
Thanks for your patience
Laurent
On Tuesday, 28 December 2021 at 08:48:32 UTC+1, Laurent wrote:
On Tuesday, 28 December 2021 at 01:29:57 UTC+1, Ben Bacarisse wrote:
Laurent <lut...@icloud.com> writes:
On Monday, 27 December 2021 at 21:49:18 UTC+1, Ben Bacarisse wrote:
Laurent <lut...@icloud.com> writes:
On Monday, 27 December 2021 at 14:14:42 UTC+1, Ben Bacarisse wrote:
Laurent <lut...@icloud.com> writes:
On Monday, 27 December 2021 at 12:16:27 UTC+1, Niklas Holsti wrote:
Sorry, but I found your problem description impossible to understand.
Try to describe more clearly the experiment that is done, the structure
of the data the experiment provides (the meaning of the Excel rows and
columns), and the statistic you want to compute.
Sorry tried to keep it short, was too short.
Columns are the antimicrobial drugs
Rows are the microorganism.
So every cell contains a result of S, I, R or simply an empty cell
S = Sensible
I = Intermediate
R = Resistant
empty cell < S < I < R
If a patient has 3 strains of the same microorganism but with
different resistance profiles I have to find the most resistant
one. Or if they are different I keep them all.
I have no idea how to explain what I am doing to the compiler.
I think when you can explain it to people, you'll be able to code it. I
am still struggling to understand what you need.
Why I would choose result from strain B over the result from strain A.
strain A: SSSRSS
strain B: SSRRRS
Let's space it out
          drug 1  drug 2  drug 3  drug 4  drug 5  drug 6
strain A    S       S       S       R       S       S
strain B    S       S       R       R       R       S
You want to choose B because it is resistant to more drugs, yes?
Yes indeed
I think, from the ordering you give, you need a measure that treats an R
as "more important" than any "I" which is "more important" than an "S".
(We will come to empty cells later.)
I think you need to treat the number of Rs, Is and Ss like digits in a
number. In base 10, the strains score
          R S I
strain A  1 5 0 = 150
strain B  3 3 0 = 330
Now, in fact, you don't need to use base 10. The smallest base you can
use is one more than the maximum number of test results. If there can
be up to 16 tests (say) the score is
n(R)*17*17 + n(S)*17 + n(I).
If this suits your needs, we can consider empty cells later on. It's
not at all clear to me how to compare
strain C R____
strain D RRSSSS
Strain C is "less resistant" but only because there is not enough
information. In fact it seems more serious as it is resistant to all
tested drugs.
Strain C is probably garbage and I would remove it. With a bit of luck I will have the result with the same sample Id which would be complete.
And then what about
strain D SR
strain E RS
Yes those are the cases which are annoying me.
That's why I came up with the idea of multiplying the value of the result
(S=1, I=2 and R=3) with the position of the value. Tried it with
triplets but there will still be cases where different results will
give the same numeric value. Ignoring empty cells for the moment.
Strain F: SSR (1*1+2*1+3*3) = 12 and Strain G: RRS (1*3+2*3+3*1) = 12
will be the same numerical value but they are different resistance
profiles. I would in this case keep both.
How to prevent that from happening.
--
Ben.
Can you first say why the suggestion I made is not helpful?
You mean that one:
I think you need to treat the number of Rs, Is and Ss like digits in a
number. In base 10, the strains score
R S I
strain A 1 5 0 = 150
strain B 3 3 0 = 330
Different resistance profiles same result:
I don't yet understand the requirements so I am taking it in stages.
The first requirement seemed to be "more or less resistant". To do that you can use digits in a large enough base but this will make the number of Rs, Ss and Is paramount. Is that acceptable as a first step?
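As a rough illustration of the count-as-digits idea quoted above, here is a minimal Python sketch. The name `count_score` and the `max_tests` parameter are made up for this example; the digit order R, S, I follows the quoted formula n(R)*17*17 + n(S)*17 + n(I), and ignoring empty cells ("_") is an assumption, since the thread leaves that open.

```python
from collections import Counter


def count_score(profile: str, max_tests: int = 16) -> int:
    """Score a profile such as "SSRRRS" from its R, S and I counts.

    The counts are treated as digits in base (max_tests + 1), in the
    R, S, I digit order of the quoted formula. Any other character,
    for example "_" for an empty cell, is ignored (an assumption).
    """
    counts = Counter(profile.upper())
    tested = counts["R"] + counts["S"] + counts["I"]
    if tested > max_tests:
        raise ValueError("profile has more results than max_tests")
    base = max_tests + 1
    return counts["R"] * base * base + counts["S"] * base + counts["I"]


print(count_score("SSSRSS", max_tests=9))        # base 10: 150
print(count_score("SSRRRS", max_tests=9))        # base 10: 330
print(count_score("SRS") == count_score("RSS"))  # True
```

The last line shows the limitation raised in the thread: two profiles with the same counts get the same score, whatever the positions of the results.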
The requirements are one strain of a certain microorganism/patient
The most resistant one or if they have different profiles
SRS vs RRS => last one, more Rs
SRS vs RSR = both, different profiles
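One possible reading of these two examples, offered only as an assumption because the thread never settles the rule, is a drug-by-drug comparison: a strain is dropped only if another strain of the same patient is at least as resistant for every drug. The sketch below is a pairwise dominance check, not a single numeric score. The names `RANK`, `dominates` and `strains_to_keep` are hypothetical, and the ranking of "_" below S is taken from "empty cell < S < I < R".

```python
# Rank of each result, following "empty cell < S < I < R".
RANK = {"_": 0, "S": 1, "I": 2, "R": 3}


def dominates(a: str, b: str) -> bool:
    """True if profile a is at least as resistant as b for every drug."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same drugs")
    return all(RANK[x] >= RANK[y] for x, y in zip(a, b))


def strains_to_keep(profiles):
    """Drop exact duplicates, then drop every profile that another one dominates."""
    unique = list(dict.fromkeys(profiles))  # keeps the first of each duplicate
    return [p for p in unique
            if not any(q != p and dominates(q, p) for q in unique)]


print(strains_to_keep(["SRS", "RRS"]))        # ['RRS']
print(strains_to_keep(["SRS", "RSR"]))        # ['SRS', 'RSR']
print(strains_to_keep(["SSSRSS", "SSRRRS"]))  # ['SSRRRS']
```

Under this reading SRS vs RRS keeps only RRS, while SRS vs RSR keeps both, because each is more resistant than the other for at least one drug. Whether that is the intended rule is for the requirements to decide.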
In order to help people to be able to make further suggestions, maybe
you could give the relative ordering you would like to see between the following sets of profiles. For example, between SSR, SRS and RSS, I think the order you want is RSS > SRS > SSR.
1: SSR, SRS, RSS
2: RSI, RIS, SRI, SIR, IRS, ISR
3: SSSR, SSRS, SRSS, RSSS
4: RRSSS, RSSSR, RIIII, SRIII, RSIII, IIIRS, IIISR
The order of the results is given by the ID of the drug in the extraction tool.
I could probably order them by family and hierarchy of potency but
would that make a difference?
It's possible you could make do with an extra field (or digits) that gives some measure of the relative ordering between otherwise similar sequences. For example, using base 10 (for convenience of arithmetic) both RRSSI and RSRSI would score 212xx but the last xx would reflect the positioning of the results in the sequence. There are lots of ways to do this. One way would be to use, as you were thinking, some sort of weighted count. Using S=0, I=1 and R=2 with weights
54321
RRSSI scores 2*10000 + 1*1000 + 2*100 + 2*(5+4) + 0*(3+2) + 1*1 = 21219
RSRSI scores 2*10000 + 1*1000 + 2*100 + 2*(5+3) + 0*(4+2) + 1*1 = 21217
So to be sure that I am following:
2*(5+4) = value of R (=2) * position of R(@5 and @4)
2*(5+3) = value of R (=2) * position of R(@5 and @3)
0*(3+2) = value of S (=0) * position of S(@3 and @2)
0*(4+2) = value of S (=0) * position of S(@4 and @2)
1*1 = value of I (=1) * position of I (@1)
2*10000 + 1*1000 + 2*100 Is just used as padding? So 212 could be any other number?
Eh forget the last sentence, brain fart: I have 2 R's so 2*10000, 1 I so 1*1000 and 2 S's so 2*100
But in this example I would have to keep both as drug 5,2 and 1 are common to both results but 4 and 3 are unique.
The score would be completely misleading.
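For anyone wanting to check the arithmetic of the weighted score discussed above, here is a minimal Python sketch. The names `VALUE` and `weighted_score` are invented for this example. It assumes at most 9 results per profile, so that each count fits in one decimal digit and the position-weighted part stays below 100; it does not handle empty cells.

```python
from collections import Counter

# Result values from the quoted example: S=0, I=1, R=2.
VALUE = {"S": 0, "I": 1, "R": 2}


def weighted_score(profile: str) -> int:
    """Counts of R, I and S as leading digits, plus a position-weighted sum.

    The leftmost result gets weight len(profile) and the rightmost gets
    weight 1, matching the "54321" weights for a five-drug profile.
    Only valid for profiles of up to 9 results (see the note above).
    """
    counts = Counter(profile)
    n = len(profile)
    tiebreak = sum(VALUE[r] * (n - i) for i, r in enumerate(profile))
    return (counts["R"] * 10000 + counts["I"] * 1000
            + counts["S"] * 100 + tiebreak)


print(weighted_score("RRSSI"))  # 21219
print(weighted_score("RSRSI"))  # 21217
print(weighted_score("RSSRI") == weighted_score("SRRSI"))  # True
```

The last line shows that this scheme can still give two different profiles the same number: RSSRI and SRRSI both come out as 21215.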
So if my table has a width of 20 columns the first column would be
10^20, the next 10^19,.... +/- a few 0s off?
I would have to implement it and see what I get as result.
If you absolutely must never get duplicate numbers, but you still want
to preserve a strict specified ordering, I think you will have much more work to do.
Getting a unique number for each case is trivial (but the ordering will be wrong) and getting an ordering that rates every R > every S > every I is also trivial, but there will be lots of duplicates. It's finding the balance that's going to be hard.
--
Ben.
I have prepared a cleaned up Excel workbook with only the duplicates which pose problems. The ones I would keep have an orange ID.
I could upload it to Github, if that helps understanding the different cases.
Thanks for your patience
Laurent
Ben,
I have implemented your solution but I don't understand the reason why S would have a value of 0?
I then don't need to take care of the S's because the result will always be 0. Not that it changes a lot,
because I still couldn't choose the profile of interest based only on the numbers.
Profile      Ben's solution   Mine
R R S S I    212 11           212 1205
R S R S I    212 13           212 1405
R R R S I    311 17           311 1805
R S R R I    311 21           311 1407
S R R R I    311 23           311 1607
311 17 and 311 23 being the most likely but unclear where the
difference might be.
I have adapted my current solution to include the number of R,I,S
weight of the results: S=1, I=2, R=3
weight of the position in the triplet: 1st=1, 2nd=2, 3rd=3
i.e.: R R R => first triplet: 1*3+2*3+3*3 = 18
S I => second triplet: 1*1+2*2 = 05
RIS count: 311
Append 1st triplet: 311 18
Append 2nd triplet: 311 18 05
311 18 05 and 311 16 07 being the most likely with some clues which
triplet is different.
Am I not somehow introducing a bias by multiplying the value with the position in the triplet?
And then there is still the case where SSR (1*1+2*1+3*3=12) and RRS (1*3+2*3+3*1=12)
will both resolve to the same value.
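The triplet scheme described above can be written down as a short Python sketch, which reproduces the "Mine" column of the table. The names `RESULT_WEIGHT` and `triplet_key` are made up for this example, the key is returned as a string with spaces between the fields, and empty cells are not handled (an assumption).

```python
from collections import Counter

# Weights of the results as given above: S=1, I=2, R=3.
RESULT_WEIGHT = {"S": 1, "I": 2, "R": 3}


def triplet_key(profile: str) -> str:
    """Return the R/I/S counts followed by one two-digit value per triplet."""
    counts = Counter(profile)
    parts = [f"{counts['R']}{counts['I']}{counts['S']}"]
    for start in range(0, len(profile), 3):
        triplet = profile[start:start + 3]
        # Position in the triplet (1, 2, 3) times the weight of the result.
        value = sum(pos * RESULT_WEIGHT[r]
                    for pos, r in enumerate(triplet, start=1))
        parts.append(f"{value:02d}")
    return " ".join(parts)


for p in ("RRSSI", "RSRSI", "RRRSI", "RSRRI", "SRRRI"):
    print(p, triplet_key(p))

print(triplet_key("RSI") == triplet_key("IRS"))  # True
```

The final line shows a collision even when the counts are included: RSI and IRS both give "111 11".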
Wouldn't I need some sort of Traveling Salesman Problems algorithm to find the profile
with the highest number of resistances and the highest number of
triplets with high values.
Laurent <lut...@icloud.com> writes:
> Ben,
Posts crossed. You should probably ignore my last as it was written
before I saw this one.
> I have implemented your solution but I don't understand the reason why S would have a value of 0?
> I then don't need to take care of the S'es because the result will always be 0. Not that it changes a lot
> Because I still couldn't choose the profile of interest only based on the numbers.
> R R S S I Ben's Solution: 212 11 Mine: 212 1205
> R S R S I 212 13 212 1405
> R R R S I 311 17 311 1805
> R S R R I 311 21 311 1407
> S R R R I 311 23 311 1607
> 311 17 and 311 23 being the most likely but unclear where the
> difference might be.
This is what is so frustrating for me. What do you mean, most likely?
What do you mean by what the difference might be? Can you describe to
me, as a human being, which you would choose and tell me how you
decided. If you can't do that then all you are doing is trying random
schemes until something pops up that looks right for some specific set of
data!
> I have adapted my current solution to include the number of R,I,S
> weight of the results: S=1, I=2, R=3
> weight of the position in the triplet: 1st=1, 2nd=2, 3rd=3
> ie.: R R R => First triplet: 1*3+2*3+3*3 = 18
> S I => Second triplet 1*1+2*2 = 05
> RIS count: 311
> Append 1st triplet: 311 18
> Append 2nd triplet: 311 18 05
> 311 18 05 and 311 16 07 being the most likely with some clues which
> triplet is different.
It sounds like you want the result to "give some clues". Why not just
return the string of letters? SRRRI tells you everything about the
tests. What more could you want? If you want these ordered by the
R, I and S counts, put these first always using two digits:
"030101SRRRI"
This string will sort the important, highly resistant strains to the top
and also gives all the information about the individual tests.
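A minimal Python sketch of such a string key, for illustration only. The name `sort_key` is invented here; the two-digit fields assume fewer than 100 results per profile, and empty cells are not considered.

```python
from collections import Counter


def sort_key(profile: str) -> str:
    """Two-digit counts of R, I and S, followed by the profile itself."""
    counts = Counter(profile)
    return f"{counts['R']:02d}{counts['I']:02d}{counts['S']:02d}{profile}"


print(sort_key("SRRRI"))  # 030101SRRRI

profiles = ["RRSSI", "SRRRI", "RSRSI", "RRRSI"]
# Plain string comparison, largest key first.
for key in sorted((sort_key(p) for p in profiles), reverse=True):
    print(key)
```

Sorting in descending order lists the profiles with the most Rs first. Among keys with identical counts the order is simply alphabetical on the letters, so it says nothing about which of those profiles is "more resistant".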
> Am I not somehow introducing a bias by multiplying the value with the position in the triplet?
> And then there is still the case where SSR (1*1+2*1+3*3=12) and RRS (1*3+2*3+3*1=12)
> will both resolve to the same value.
Eh? Don't you want more Rs to get high scores? That's what the counts
are for.
>> The requirements are one strain of a certain microorganism/patient
>> The most resistant one or if they have different profiles
>> SRS vs RRS => last one, more Rs
>> SRS vs RSR = both, different profiles
> Wouldn't I need some sort of Traveling Salesman Problems algorithm to find the profile
> with the highest number of resistances and the highest number of
> triplets with high values.
I don't understand the triplets idea. Sorry.
--
Ben.
The problem is not that I don't want to use Ada. We are using Citrix so I
am stuck with the programs the IT department allows me to use. It was already a chore to get MS Access made available.
On Mon, 27 Dec 2021 23:48:31 -0800 (PST), Laurent <lutgenl@icloud.com> declaimed the following:
The requirements are one strain of a certain microorganism/patient
The most resistant one or if they have different profiles
SRS vs RRS => last one, more Rs
SRS vs RSR = both, different profiles
Which is still inconclusive (at least as I view it) -- your second
example ALSO fits the "last one, more Rs" constraint. You have yet to define
how the first doesn't qualify as "different profiles". Both examples are
"1R, 2S" vs "2R, 1S".