On today's x86 CPUs there is a prefetch instruction which loads a cache line into a cache level chosen by a parameter of the instruction.
Bonita Montero <Bonita.Montero@gmail.com> writes:
> On today's x86 CPUs there is a prefetch instruction which loads a
> cache line into a cache level chosen by a parameter of the instruction.
Modern CPUs for the last decade have included automatic prefetchers
in the cache subsystems. Usually a mix of stride-based and/or predictive fetchers.
It's very seldom necessary for an application to provide an explicit
prefetching hint except in very unusual circumstances. And most
programmers trying to insert hints manually will get it wrong.
The behavior of such is also heavily microarchitecture dependent,
so what works on one chip may really slow things down on another.
Note that they are, after all, hints. The processor need not
actually do anything for a prefetch instruction.
Let the hardware handle it.
> Modern CPUs for the last decade have included automatic prefetchers
> in the cache subsystems. Usually a mix of stride-based and/or
> predictive fetchers.
If they were better, my program would give the best result with zero
prefetching. And there would be no prefetch instructions at all.
> It's very seldom necessary for an application to provide an
> explicit prefetching hint except in very unusual circumstances.
Automatic prefetchers are dumb.
On 10/1/2021 10:09 PM, Bonita Montero wrote:
> If they were better, my program would give the best result with zero
> prefetching. And there would be no prefetch instructions at all.
> [...]
> Automatic prefetchers are dumb.
Oh.... shit. You make me feel like a full blown moron for even
responding to you, Bonita. YIKES! Let me guess, you agree with me, and
say I am stupid for responding to you. ;^)
> Automatic prefetchers are dumb.
> Oh.... shit. You make me feel like a full blown moron for even
> responding to you, Bonita. YIKES! Let me guess, you agree with me, and
> say I am stupid for responding to you. ;^)
lol.
On 01.10.21 at 20:36, Scott Lurndal wrote:
> It's very seldom necessary for an application to provide an explicit
> prefetching hint except in very unusual circumstances. And most
> programmers trying to insert hints manually will get it wrong.
> The behavior of such is also heavily microarchitecture dependent,
> so what works on one chip may really slow things down on another.
I can confirm this.
I did several tests with __builtin_prefetch to reduce the collision rate
in lock-free algorithms. ...
There's the Unix command wc which counts words and lines. The
wc implementation in the current GNU core utilities contains an
optional, very tricky AVX implementation. This improves the speed
of wc on my Linux computer by a factor of 29.
I improved this algorithm further by partitioning the data into three
parts which I handle interleaved, i.e. 32-byte chunks handled
synchronously:
#include <fstream>
#include <vector>
#include <algorithm>
using namespace std;

// Read a file once and tile its contents repeatedly into a block of
// blockSize bytes, separating repetitions with '\n' where necessary.
static
vector<char> readFileRepeated( char const *fileName, size_t blockSize )
{
    if( !blockSize )
        return vector<char>();
    ifstream ifs;
    ifs.exceptions( ifstream::failbit | ifstream::badbit );
    ifs.open( fileName, ifstream::binary );
    ifs.seekg( 0, ios_base::end );
    streampos fileSize = ifs.tellg();
    if( !fileSize || (unsigned long long)fileSize > (size_t)-1 )
        return vector<char>();
    ifs.seekg( 0, ios_base::beg );
    vector<char> block( blockSize, 0 );
    size_t repSize = (size_t)fileSize <= blockSize ? (size_t)fileSize : blockSize;
    ifs.read( &*block.begin(), repSize );
    bool lastNewline = block[repSize - 1] == '\n';
    size_t remaining = block.size() - repSize;
    do
    {
        // if the file doesn't end with a newline, insert one as a
        // separator before appending the next repetition
        if( !lastNewline && remaining )
            block.end()[-(ptrdiff_t)remaining--] = '\n';
        size_t cpy = remaining >= repSize ? repSize : remaining;
        copy( block.begin(), block.begin() + cpy, block.end() - remaining );
        remaining -= cpy;
    } while( remaining );
    return block;
}
On 2021-10-04, Bonita Montero <Bonita.Montero@gmail.com> wrote:
> There's the Unix command wc which counts words and lines. The
> wc implementation in the current GNU core utilities contains an
> optional, very tricky AVX implementation. This improves the speed
> of wc on my Linux computer by a factor of 29.
> [readFileRepeated code snipped]
Talking about efficiency :P
Who will pay you for overcomplicating simple things?
On 03.10.2021 at 15:33, Marcel Mueller wrote:
> I did several tests with __builtin_prefetch to reduce the collision
> rate in lock-free algorithms. ...
Why should a lock-free algorithm employ prefetching?
On 04.10.2021 at 18:36, Branimir Maksimovic wrote:
> Who will pay you for overcomplicating simple things?
Why should this be overcomplicated? I repeatedly copy the
file into a buffer until it is full; maybe not even once
fully if the file doesn't fit into the buffer's maximum size.
That's the most direct way.
Prefetch can access invalid memory without faulting. So prefetching a shared
memory area behind a pointer can significantly decrease the probability of a
failed CAS when implementing strong thread safety. But on some platforms I
observed excessive cache line hopping with this strategy.
On 2021-10-04, Bonita Montero <Bonita.Montero@gmail.com> wrote:
> Why should this be overcomplicated? I repeatedly copy the
> file into a buffer until it is full; maybe not even once
> fully if the file doesn't fit into the buffer's maximum size.
> That's the most direct way.
Take a look at this simple and professionally done program that does
all that :P (Ian Collins is the author, I think :P)
#include <map>
#include <unordered_map>
#include <iostream>
#include <fstream>
#include <algorithm>
#include <iomanip>
#include <string>
#include <cctype>
using namespace std;
using Pairs = unordered_map<string,int>;
void fill( Pairs& pairs, char c )
{
    static string word;
    // cast to unsigned char: the <cctype> functions have undefined
    // behavior for negative char values
    if( ispunct((unsigned char)c) ) return;
    if( isspace((unsigned char)c) )
    {
        if( word.size() )
        {
            pairs[word]++;
            word.clear();
        }
    }
    else
    {
        word += tolower((unsigned char)c);
    }
}
int main()
{
    ifstream bible {"bible.txt"};
    using citerator = istreambuf_iterator<char>;
    Pairs pairs;
    for_each( citerator(bible.rdbuf()), citerator(),
              [&pairs]( char c ){ fill( pairs, c ); } );
    multimap<unsigned,string> sorted;
    // Sort the {word, count} pairs.
    //
    for_each( pairs.begin(), pairs.end(),
              [&sorted]( const Pairs::value_type& p )
              { sorted.insert(make_pair(p.second,p.first)); } );
    // Print the top 20 (or fewer, if the input has fewer words).
    //
    auto item = sorted.rbegin();
    for( auto n = 0; n < 20 && item != sorted.rend(); ++n, ++item )
    {
        cout << "Position " << setw(2) << n+1
             << ": count = " << setw(6) << item->first
             << " " << item->second << '\n';
    }
    return 0;
}
On 04.10.2021 at 21:59, Marcel Mueller wrote:
> Prefetch can access invalid memory without faulting. So prefetching a
> shared memory area behind a pointer can significantly decrease the
> probability of a failed CAS when implementing strong thread safety. But
> on some platforms I observed excessive cache line hopping with this
> strategy.
That doesn't make sense. When you prefetch, you usually process a lot of
data before the point you prefetched. With CASes you repeatedly process
the same data; prefetching here is nonsense.
Prefetching is nonsense in HLL :P