Finding this was trial and error; I don't recall it exactly, but it probably came from simply trying x16 and /16 until it gave a perfect fit:
interpolation table for scaling 16 shades to 256 shades with 16 intermediate shades.
total count = 256
input count = 16
output count = 16 x multiplier(16) = 256
intermediate count = 16
interpolation space  0:    0    1 ..   16   17
interpolation space  1:   17   18 ..   33   34
interpolation space  2:   34   35 ..   50   51
interpolation space  3:   51   52 ..   67   68
interpolation space  4:   68   69 ..   84   85
interpolation space  5:   85   86 ..  101  102
interpolation space  6:  102  103 ..  118  119
interpolation space  7:  119  120 ..  135  136
interpolation space  8:  136  137 ..  152  153
interpolation space  9:  153  154 ..  169  170
interpolation space 10:  170  171 ..  186  187
interpolation space 11:  187  188 ..  203  204
interpolation space 12:  204  205 ..  220  221
interpolation space 13:  221  222 ..  237  238
interpolation space 14:  238  239 ..  254  255
Visualizing the interpolation table/spaces above was tricky, because the end point of interpolation space 0 has to be repeated as the begin point of interpolation space 1; otherwise it would lead to a duplicate, and this is not desired.
So in reality these interpolation spaces are connected and form a single dimension/line of slots as follows:
0..17..34..51..68..85..102..119..136..153..170..187..204..221..238..255
Which indeed gives 15 interpolation spaces counting the dots; each double dot is one interpolation space.
So the input range 0..15 is mapped to output range 0..255
0 ends up on 0
1 ends up on 17
2 ends up on 34
3 ends up on 51
4 ends up on 68
5 ends up on 85
6 ends up on 102
7 ends up on 119
8 ends up on 136
9 ends up on 153
10 ends up on 170
11 ends up on 187
12 ends up on 204
13 ends up on 221
14 ends up on 238
15 ends up on 255
Now the mission for the programmer is to find instructions which can efficiently convert the input range 0..15 to 0..255, and back again from 0..255 to 0..15.
This would be a start but is not yet the end of the story.
For real-time interpolation the programmer would have to convert from 0..255 down to 0..15, but also compute an intermediate value, which would range from 1 to 16 in the first row example.
To avoid duplicate calculated interpolation entries... the interpolation calculation would then probably be:
(InterpolationValue is then the interpolated color):
InterpolationValue = (Left * (1-T)) + (Right * T)
This interpolation formula may look a bit unusual with the T parameters arranged this way, but it is in fact the standard linear-interpolation (lerp) form: at T = 0 it yields Left and at T = 1 it yields Right, so it interpolates correctly from a low value/range to a high value/range.
T should not start or end at 0 or 1, because that would cause duplicate entries/interpolation colors.
Thus it seems to me computing T will have to be done as follows:
T = IntermediateValue / 17
where the intermediate value would range from 1 to 16, so T would never be zero or one, which is a bit interesting and again unusual. I have not tested it yet, but it seems sound! ;)
So as you can see, there are many values which are not powers of 2.
This will make it difficult to find efficient instructions to compute this fast and in real time.
Hence the idea of using a lookup table for "palette" interpolation seems interesting. Except at higher resolutions, and especially at higher color depths (say R,G,B,A increased from 8 bits per component to 10 or even 16 bits per component), such look-up tables consume a lot of memory!
The current lookup table would consume 16 rows x 256 entries x 3 rgb x 1 byte = 12288 bytes,
and current L1 data caches seem limited to 32 kilobytes, so this takes quite a byte out of it.
256 entries would be enough to shade from black to red. However, some shades/gradients are more complex, where the shading goes from red to yellow, or from green to blue to purple. Such more complex gradients would benefit from even more entries to make the gradient smoother and prevent any visible banding.
So I would like 16 rows x 512 entries x 3 rgb x 1 byte = 24576 bytes
to make absolutely sure that no banding would be visible. Maybe this is pushing it a little, but I do believe it may be necessary.
Now I have not yet looked into 48-bit monitors, however I know someone who purchased a monitor from Sony which claims to do 10 bits of precision per R,G,B component, so that is interesting.
I am not yet sure how to program for such monitors; Adobe Photoshop seems to use 16 bits per component, and accounting for an alpha value leads to 64 bits per color.
Now I can live without the alpha channel for now, but let's compute both. However, I am no longer sure whether 512 entries would prevent all banding, but let's assume so for now:
(entry = palette index = a color slot for a mix of r,g,b)
16 rows x 512 entries x 3 rgb x 2 bytes = 49152 bytes
So at 48 KB this surpasses the L1 data cache size of 32 KB.
Now if alpha channel would be included as well, perhaps for 8 byte alignment purposes or so:
16 rows x 512 entries x 4 rgba x 2 bytes = 65536 bytes, which would be 64 KB.
So here again it takes up the full L1 data cache of my old DreamPC from 2006; if I recall correctly, the AMD X2 3800+ processor had 64 KB of L1 data cache per core, lol.
(Not that it matters, because PCI Express 1.0 would be too limited to drive a 4K monitor anyway! ;) Even PCI Express 2.0 will probably struggle without compression at 120 Hz, because that is what the panel can do too; 60 Hz might still be doable, but 120 Hz would be 8 GB/sec at 3840x2160, if I recall correctly.)
Anyway so seeing this large lookup table requirement makes it interesting to try and find a real-time computational solution with some kind of computer instructions.
So basically the situation would be as follows:
Let's assume a beautiful situation:
a 4K monitor at 120 Hz at 10-bit color precision. Making it 16 bits per component would be a bit more future-ready, but 30-bit color packed into 32 bits per pixel makes more realistic sense for now I guess, so no alpha channel; those bits are repurposed for the 10-bit precision/color depth:
3840x2160x120x4 bytes = 3,981,312,000 bytes per second of bandwidth required/bytes to process.
Now with, say, a 3 to 5 gigahertz processor it becomes apparent that the number of instructions that can be spent per byte or pixel is "low"... also in the sense of the number of "compute cycles" spent.
SIMD/SSE might be able to process more bytes per "compute cycle" ("clock cycle").
So now the final question becomes:
How to "scale up VGA colors" / "interpolate VGA colors" such that it can be done in real-time to provide more bit depth.
A lookup table might not be the solution, because a lookup that misses the cache can require 100 or more compute cycles?! ;)
Multi-threading might also be part of a solution.
Also the assumption may be made that the source code to the VGA computer game is available, so "interpolation" can be implemented in the game to make the shading between VGA colors smoother.
So the VGA colors themselves could first be "value scaled" to 10 bits per component, and then later "slot/gradient/palette scaled" up further by interpolating more slots, thus increasing the final palette size for smoother gradients/shading etc.