head 1.1; branch 1.1.1; access; symbols netbsd-11-0-RC4:1.1.1.5 netbsd-11-0-RC3:1.1.1.5 netbsd-11-0-RC2:1.1.1.5 netbsd-11-0-RC1:1.1.1.5 gcc-14-3-0:1.1.1.5 perseant-exfatfs-base-20250801:1.1.1.5 netbsd-11:1.1.1.5.0.10 netbsd-11-base:1.1.1.5 gcc-12-5-0:1.1.1.5 netbsd-10-1-RELEASE:1.1.1.5 perseant-exfatfs-base-20240630:1.1.1.5 gcc-12-4-0:1.1.1.5 perseant-exfatfs:1.1.1.5.0.8 perseant-exfatfs-base:1.1.1.5 netbsd-8-3-RELEASE:1.1.1.2 netbsd-9-4-RELEASE:1.1.1.4 netbsd-10-0-RELEASE:1.1.1.5 netbsd-10-0-RC6:1.1.1.5 netbsd-10-0-RC5:1.1.1.5 netbsd-10-0-RC4:1.1.1.5 netbsd-10-0-RC3:1.1.1.5 netbsd-10-0-RC2:1.1.1.5 netbsd-10-0-RC1:1.1.1.5 gcc-12-3-0:1.1.1.5 gcc-10-5-0:1.1.1.5 netbsd-10:1.1.1.5.0.6 netbsd-10-base:1.1.1.5 netbsd-9-3-RELEASE:1.1.1.4 gcc-10-4-0:1.1.1.5 cjep_sun2x-base1:1.1.1.5 cjep_sun2x:1.1.1.5.0.4 cjep_sun2x-base:1.1.1.5 cjep_staticlib_x-base1:1.1.1.5 netbsd-9-2-RELEASE:1.1.1.4 cjep_staticlib_x:1.1.1.5.0.2 cjep_staticlib_x-base:1.1.1.5 gcc-10-3-0:1.1.1.5 netbsd-9-1-RELEASE:1.1.1.4 gcc-9-3-0:1.1.1.5 gcc-7-5-0:1.1.1.4 phil-wifi-20200421:1.1.1.4 phil-wifi-20200411:1.1.1.4 is-mlppp:1.1.1.4.0.4 is-mlppp-base:1.1.1.4 phil-wifi-20200406:1.1.1.4 netbsd-8-2-RELEASE:1.1.1.2 gcc-8-4-0:1.1.1.4 netbsd-9-0-RELEASE:1.1.1.4 netbsd-9-0-RC2:1.1.1.4 netbsd-9-0-RC1:1.1.1.4 phil-wifi-20191119:1.1.1.4 gcc-8-3-0:1.1.1.4 netbsd-9:1.1.1.4.0.2 netbsd-9-base:1.1.1.4 phil-wifi-20190609:1.1.1.4 netbsd-8-1-RELEASE:1.1.1.2 netbsd-8-1-RC1:1.1.1.2 pgoyette-compat-merge-20190127:1.1.1.2.14.2 pgoyette-compat-20190127:1.1.1.4 gcc-7-4-0:1.1.1.4 pgoyette-compat-20190118:1.1.1.3 pgoyette-compat-1226:1.1.1.3 pgoyette-compat-1126:1.1.1.3 gcc-6-5-0:1.1.1.3 pgoyette-compat-1020:1.1.1.2 pgoyette-compat-0930:1.1.1.2 pgoyette-compat-0906:1.1.1.2 netbsd-7-2-RELEASE:1.1.1.1 pgoyette-compat-0728:1.1.1.2 netbsd-8-0-RELEASE:1.1.1.2 phil-wifi:1.1.1.2.0.16 phil-wifi-base:1.1.1.2 pgoyette-compat-0625:1.1.1.2 netbsd-8-0-RC2:1.1.1.2 pgoyette-compat-0521:1.1.1.2 pgoyette-compat-0502:1.1.1.2 pgoyette-compat-0422:1.1.1.2 netbsd-8-0-RC1:1.1.1.2 pgoyette-compat-0415:1.1.1.2 pgoyette-compat-0407:1.1.1.2 pgoyette-compat-0330:1.1.1.2 pgoyette-compat-0322:1.1.1.2 pgoyette-compat-0315:1.1.1.2 netbsd-7-1-2-RELEASE:1.1.1.1 pgoyette-compat:1.1.1.2.0.14 pgoyette-compat-base:1.1.1.2 gcc-6-4-0:1.1.1.2 netbsd-7-1-1-RELEASE:1.1.1.1 gcc-5-5-0:1.1.1.2 matt-nb8-mediatek:1.1.1.2.0.12 matt-nb8-mediatek-base:1.1.1.2 perseant-stdc-iso10646:1.1.1.2.0.10 perseant-stdc-iso10646-base:1.1.1.2 netbsd-8:1.1.1.2.0.8 netbsd-8-base:1.1.1.2 prg-localcount2-base3:1.1.1.2 prg-localcount2-base2:1.1.1.2 prg-localcount2-base1:1.1.1.2 prg-localcount2:1.1.1.2.0.6 prg-localcount2-base:1.1.1.2 pgoyette-localcount-20170426:1.1.1.2 bouyer-socketcan-base1:1.1.1.2 pgoyette-localcount-20170320:1.1.1.2 netbsd-7-1:1.1.1.1.0.14 netbsd-7-1-RELEASE:1.1.1.1 netbsd-7-1-RC2:1.1.1.1 netbsd-7-nhusb-base-20170116:1.1.1.1 bouyer-socketcan:1.1.1.2.0.4 bouyer-socketcan-base:1.1.1.2 pgoyette-localcount-20170107:1.1.1.2 netbsd-7-1-RC1:1.1.1.1 pgoyette-localcount-20161104:1.1.1.2 netbsd-7-0-2-RELEASE:1.1.1.1 localcount-20160914:1.1.1.2 netbsd-7-nhusb:1.1.1.1.0.12 netbsd-7-nhusb-base:1.1.1.1 pgoyette-localcount-20160806:1.1.1.2 pgoyette-localcount-20160726:1.1.1.2 pgoyette-localcount:1.1.1.2.0.2 pgoyette-localcount-base:1.1.1.2 gcc-5-4-0:1.1.1.2 netbsd-7-0-1-RELEASE:1.1.1.1 gcc-5-3-0:1.1.1.2 netbsd-7-0:1.1.1.1.0.10 netbsd-7-0-RELEASE:1.1.1.1 gcc-4-8-5-pre-gcc-old-import:1.1.1.1 netbsd-7-0-RC3:1.1.1.1 netbsd-7-0-RC2:1.1.1.1 post-gcc-4-8-5-merge:1.1.1.1 gcc-4-8-5:1.1.1.1 netbsd-7-0-RC1:1.1.1.1 gcc-4-8-4:1.1.1.1 gcc-4-8-20141009:1.1.1.1 tls-maxphys-base:1.1.1.1 tls-maxphys:1.1.1.1.0.8 netbsd-7:1.1.1.1.0.6 netbsd-7-base:1.1.1.1 gcc-4-8-3:1.1.1.1 yamt-pagecache:1.1.1.1.0.4 yamt-pagecache-base9:1.1.1.1 tls-earlyentropy:1.1.1.1.0.2 tls-earlyentropy-base:1.1.1.1 riastradh-xf86-video-intel-2-7-1-pre-2-21-15:1.1.1.1 riastradh-drm2-base3:1.1.1.1 gcc-4-8-3-pre-r208254:1.1.1.1 gcc-4-8-3-pre-r206687:1.1.1.1 FSF:1.1.1; locks; strict; comment @# @; 1.1 date 2014.03.01.08.41.30; author mrg; state Exp; branches 1.1.1.1; next ; commitid TtaB91QNTknAoYqx; 1.1.1.1 date 2014.03.01.08.41.30; author mrg; state Exp; branches 1.1.1.1.4.1 1.1.1.1.8.1; next 1.1.1.2; commitid TtaB91QNTknAoYqx; 1.1.1.2 date 2016.01.24.06.05.43; author mrg; state Exp; branches 1.1.1.2.14.1 1.1.1.2.16.1; next 1.1.1.3; commitid uWWfbLp08zOK79Sy; 1.1.1.3 date 2018.11.04.00.12.37; author mrg; state Exp; branches; next 1.1.1.4; commitid bulspy67pMB6EyYA; 1.1.1.4 date 2019.01.19.10.14.11; author mrg; state Exp; branches; next 1.1.1.5; commitid VQ8OwWIg5RS9kn8B; 1.1.1.5 date 2020.09.05.07.52.18; author mrg; state Exp; branches; next ; commitid ZRYA7IOuwfMjAPmC; 1.1.1.1.4.1 date 2014.03.01.08.41.30; author yamt; state dead; branches; next 1.1.1.1.4.2; commitid DX8bafDLmqEbpyBx; 1.1.1.1.4.2 date 2014.05.22.16.37.45; author yamt; state Exp; branches; next ; commitid DX8bafDLmqEbpyBx; 1.1.1.1.8.1 date 2014.03.01.08.41.30; author tls; state dead; branches; next 1.1.1.1.8.2; commitid jTnpym9Qu0o4R1Nx; 1.1.1.1.8.2 date 2014.08.19.23.54.46; author tls; state Exp; branches; next ; commitid jTnpym9Qu0o4R1Nx; 1.1.1.2.14.1 date 2018.11.26.01.50.57; author pgoyette; state Exp; branches; next 1.1.1.2.14.2; commitid Zj4q5SspGdKXto1B; 1.1.1.2.14.2 date 2019.01.26.21.59.32; author pgoyette; state Exp; branches; next ; commitid JKpcmvSjdT25dl9B; 1.1.1.2.16.1 date 2019.06.10.21.54.49; author christos; state Exp; branches; next ; commitid jtc8rnCzWiEEHGqB; desc @@ 1.1 log @Initial revision @ text @ Diagnostics

Diagnostics

The table below presents all the diagnostics we intend to implement. Each diagnostic has a corresponding compile time switch -D_GLIBCXX_PROFILE_<diagnostic>. Groups of related diagnostics can be turned on with a single switch. For instance, -D_GLIBCXX_PROFILE_LOCALITY is equivalent to -D_GLIBCXX_PROFILE_SOFTWARE_PREFETCH -D_GLIBCXX_PROFILE_RBTREE_LOCALITY.

The benefit, cost, expected frequency and accuracy of each diagnostic was given a grade from 1 to 10, where 10 is highest. A high benefit means that, if the diagnostic is accurate, the expected performance improvement is high. A high cost means that turning this diagnostic on leads to high slowdown. A high frequency means that we expect this to occur relatively often. A high accuracy means that the diagnostic is unlikely to be wrong. These grades are not perfect. They are just meant to guide users with specific needs or time budgets.

Table 19.2. Profile Diagnostics

GroupFlagBenefitCostFreq.Implemented 
CONTAINERS HASHTABLE_TOO_SMALL101 10yes
  HASHTABLE_TOO_LARGE51 10yes
  INEFFICIENT_HASH73 10yes
  VECTOR_TOO_SMALL81 10yes
  VECTOR_TOO_LARGE51 10yes
  VECTOR_TO_HASHTABLE77 10no
  HASHTABLE_TO_VECTOR77 10no
  VECTOR_TO_LIST85 10yes
  LIST_TO_VECTOR105 10no
  ORDERED_TO_UNORDERED105 10only map/unordered_map
ALGORITHMS SORT78 7no
LOCALITY SOFTWARE_PREFETCH88 5no
  RBTREE_LOCALITY48 5no
  FALSE_SHARING810 10no

Diagnostic Template

  • Switch: _GLIBCXX_PROFILE_<diagnostic>.

  • Goal: What problem will it diagnose?

  • Fundamentals:. What is the fundamental reason why this is a problem

  • Sample runtime reduction: Percentage reduction in execution time. When reduction is more than a constant factor, describe the reduction rate formula.

  • Recommendation: What would the advise look like?

  • To instrument: What stdlibc++ components need to be instrumented?

  • Analysis: How do we decide when to issue the advice?

  • Cost model: How do we measure benefits? Math goes here.

  • Example:

    program code
    ...
    advice sample
    

Containers

Switch: _GLIBCXX_PROFILE_CONTAINERS.

Hashtable Too Small

  • Switch: _GLIBCXX_PROFILE_HASHTABLE_TOO_SMALL.

  • Goal: Detect hashtables with many rehash operations, small construction size and large destruction size.

  • Fundamentals: Rehash is very expensive. Read content, follow chains within bucket, evaluate hash function, place at new location in different order.

  • Sample runtime reduction: 36%. Code similar to example below.

  • Recommendation: Set initial size to N at construction site S.

  • To instrument: unordered_set, unordered_map constructor, destructor, rehash.

  • Analysis: For each dynamic instance of unordered_[multi]set|map, record initial size and call context of the constructor. Record size increase, if any, after each relevant operation such as insert. Record the estimated rehash cost.

  • Cost model: Number of individual rehash operations * cost per rehash.

  • Example:

    1 unordered_set<int> us;
    2 for (int k = 0; k < 1000000; ++k) {
    3   us.insert(k);
    4 }
    
    foo.cc:1: advice: Changing initial unordered_set size from 10 to 1000000 saves 1025530 rehash operations.
    

Hashtable Too Large

  • Switch: _GLIBCXX_PROFILE_HASHTABLE_TOO_LARGE.

  • Goal: Detect hashtables which are never filled up because fewer elements than reserved are ever inserted.

  • Fundamentals: Save memory, which is good in itself and may also improve memory reference performance through fewer cache and TLB misses.

  • Sample runtime reduction: unknown.

  • Recommendation: Set initial size to N at construction site S.

  • To instrument: unordered_set, unordered_map constructor, destructor, rehash.

  • Analysis: For each dynamic instance of unordered_[multi]set|map, record initial size and call context of the constructor, and correlate it with its size at destruction time.

  • Cost model: Number of iteration operations + memory saved.

  • Example:

    1 vector<unordered_set<int>> v(100000, unordered_set<int>(100)) ;
    2 for (int k = 0; k < 100000; ++k) {
    3   for (int j = 0; j < 10; ++j) {
    4     v[k].insert(k + j);
    5  }
    6 }
    
    foo.cc:1: advice: Changing initial unordered_set size from 100 to 10 saves N
    bytes of memory and M iteration steps.
    

Inefficient Hash

  • Switch: _GLIBCXX_PROFILE_INEFFICIENT_HASH.

  • Goal: Detect hashtables with polarized distribution.

  • Fundamentals: A non-uniform distribution may lead to long chains, thus possibly increasing complexity by a factor up to the number of elements.

  • Sample runtime reduction: factor up to container size.

  • Recommendation: Change hash function for container built at site S. Distribution score = N. Access score = S. Longest chain = C, in bucket B.

  • To instrument: unordered_set, unordered_map constructor, destructor, [], insert, iterator.

  • Analysis: Count the exact number of link traversals.

  • Cost model: Total number of links traversed.

  • Example:

    class dumb_hash {
     public:
      size_t operator() (int i) const { return 0; }
    };
    ...
      unordered_set<int, dumb_hash> hs;
      ...
      for (int i = 0; i < COUNT; ++i) {
        hs.find(i);
      }
    

Vector Too Small

  • Switch: _GLIBCXX_PROFILE_VECTOR_TOO_SMALL.

  • Goal:Detect vectors with many resize operations, small construction size and large destruction size..

  • Fundamentals:Resizing can be expensive. Copying large amounts of data takes time. Resizing many small vectors may have allocation overhead and affect locality.

  • Sample runtime reduction:%.

  • Recommendation: Set initial size to N at construction site S.

  • To instrument:vector.

  • Analysis: For each dynamic instance of vector, record initial size and call context of the constructor. Record size increase, if any, after each relevant operation such as push_back. Record the estimated resize cost.

  • Cost model: Total number of words copied * time to copy a word.

  • Example:

    1 vector<int> v;
    2 for (int k = 0; k < 1000000; ++k) {
    3   v.push_back(k);
    4 }
    
    foo.cc:1: advice: Changing initial vector size from 10 to 1000000 saves
    copying 4000000 bytes and 20 memory allocations and deallocations.
    

Vector Too Large

  • Switch: _GLIBCXX_PROFILE_VECTOR_TOO_LARGE

  • Goal:Detect vectors which are never filled up because fewer elements than reserved are ever inserted.

  • Fundamentals:Save memory, which is good in itself and may also improve memory reference performance through fewer cache and TLB misses.

  • Sample runtime reduction:%.

  • Recommendation: Set initial size to N at construction site S.

  • To instrument:vector.

  • Analysis: For each dynamic instance of vector, record initial size and call context of the constructor, and correlate it with its size at destruction time.

  • Cost model: Total amount of memory saved.

  • Example:

    1 vector<vector<int>> v(100000, vector<int>(100)) ;
    2 for (int k = 0; k < 100000; ++k) {
    3   for (int j = 0; j < 10; ++j) {
    4     v[k].insert(k + j);
    5  }
    6 }
    
    foo.cc:1: advice: Changing initial vector size from 100 to 10 saves N
    bytes of memory and may reduce the number of cache and TLB misses.
    

Vector to Hashtable

  • Switch: _GLIBCXX_PROFILE_VECTOR_TO_HASHTABLE.

  • Goal: Detect uses of vector that can be substituted with unordered_set to reduce execution time.

  • Fundamentals: Linear search in a vector is very expensive, whereas searching in a hashtable is very quick.

  • Sample runtime reduction:factor up to container size.

  • Recommendation:Replace vector with unordered_set at site S.

  • To instrument:vector operations and access methods.

  • Analysis: For each dynamic instance of vector, record call context of the constructor. Issue the advice only if the only methods called on this vector are push_back, insert and find.

  • Cost model: Cost(vector::push_back) + cost(vector::insert) + cost(find, vector) - cost(unordered_set::insert) + cost(unordered_set::find).

  • Example:

    1  vector<int> v;
    ...
    2  for (int i = 0; i < 1000; ++i) {
    3    find(v.begin(), v.end(), i);
    4  }
    
    foo.cc:1: advice: Changing "vector" to "unordered_set" will save about 500,000
    comparisons.
    

Hashtable to Vector

  • Switch: _GLIBCXX_PROFILE_HASHTABLE_TO_VECTOR.

  • Goal: Detect uses of unordered_set that can be substituted with vector to reduce execution time.

  • Fundamentals: Hashtable iterator is slower than vector iterator.

  • Sample runtime reduction:95%.

  • Recommendation:Replace unordered_set with vector at site S.

  • To instrument:unordered_set operations and access methods.

  • Analysis: For each dynamic instance of unordered_set, record call context of the constructor. Issue the advice only if the number of find, insert and [] operations on this unordered_set are small relative to the number of elements, and methods begin or end are invoked (suggesting iteration).

  • Cost model: Number of .

  • Example:

    1  unordered_set<int> us;
    ...
    2  int s = 0;
    3  for (unordered_set<int>::iterator it = us.begin(); it != us.end(); ++it) {
    4    s += *it;
    5  }
    
    foo.cc:1: advice: Changing "unordered_set" to "vector" will save about N
    indirections and may achieve better data locality.
    

Vector to List

  • Switch: _GLIBCXX_PROFILE_VECTOR_TO_LIST.

  • Goal: Detect cases where vector could be substituted with list for better performance.

  • Fundamentals: Inserting in the middle of a vector is expensive compared to inserting in a list.

  • Sample runtime reduction:factor up to container size.

  • Recommendation:Replace vector with list at site S.

  • To instrument:vector operations and access methods.

  • Analysis: For each dynamic instance of vector, record the call context of the constructor. Record the overhead of each insert operation based on current size and insert position. Report instance with high insertion overhead.

  • Cost model: (Sum(cost(vector::method)) - Sum(cost(list::method)), for method in [push_back, insert, erase]) + (Cost(iterate vector) - Cost(iterate list))

  • Example:

    1  vector<int> v;
    2  for (int i = 0; i < 10000; ++i) {
    3    v.insert(v.begin(), i);
    4  }
    
    foo.cc:1: advice: Changing "vector" to "list" will save about 5,000,000
    operations.
    

List to Vector

  • Switch: _GLIBCXX_PROFILE_LIST_TO_VECTOR.

  • Goal: Detect cases where list could be substituted with vector for better performance.

  • Fundamentals: Iterating through a vector is faster than through a list.

  • Sample runtime reduction:64%.

  • Recommendation:Replace list with vector at site S.

  • To instrument:vector operations and access methods.

  • Analysis: Issue the advice if there are no insert operations.

  • Cost model: (Sum(cost(vector::method)) - Sum(cost(list::method)), for method in [push_back, insert, erase]) + (Cost(iterate vector) - Cost(iterate list))

  • Example:

    1  list<int> l;
    ...
    2  int sum = 0;
    3  for (list<int>::iterator it = l.begin(); it != l.end(); ++it) {
    4    sum += *it;
    5  }
    
    foo.cc:1: advice: Changing "list" to "vector" will save about 1000000 indirect
    memory references.
    

List to Forward List (Slist)

  • Switch: _GLIBCXX_PROFILE_LIST_TO_SLIST.

  • Goal: Detect cases where list could be substituted with forward_list for better performance.

  • Fundamentals: The memory footprint of a forward_list is smaller than that of a list. This has beneficial effects on memory subsystem, e.g., fewer cache misses.

  • Sample runtime reduction:40%. Note that the reduction is only noticeable if the size of the forward_list node is in fact larger than that of the list node. For memory allocators with size classes, you will only notice an effect when the two node sizes belong to different allocator size classes.

  • Recommendation:Replace list with forward_list at site S.

  • To instrument:list operations and iteration methods.

  • Analysis: Issue the advice if there are no backwards traversals or insertion before a given node.

  • Cost model: Always true.

  • Example:

    1  list<int> l;
    ...
    2  int sum = 0;
    3  for (list<int>::iterator it = l.begin(); it != l.end(); ++it) {
    4    sum += *it;
    5  }
    
    foo.cc:1: advice: Change "list" to "forward_list".
    

Ordered to Unordered Associative Container

  • Switch: _GLIBCXX_PROFILE_ORDERED_TO_UNORDERED.

  • Goal: Detect cases where ordered associative containers can be replaced with unordered ones.

  • Fundamentals: Insert and search are quicker in a hashtable than in a red-black tree.

  • Sample runtime reduction:52%.

  • Recommendation: Replace set with unordered_set at site S.

  • To instrument: set, multiset, map, multimap methods.

  • Analysis: Issue the advice only if we are not using operator ++ on any iterator on a particular [multi]set|map.

  • Cost model: (Sum(cost(hashtable::method)) - Sum(cost(rbtree::method)), for method in [insert, erase, find]) + (Cost(iterate hashtable) - Cost(iterate rbtree))

  • Example:

    1  set<int> s;
    2  for (int i = 0; i < 100000; ++i) {
    3    s.insert(i);
    4  }
    5  int sum = 0;
    6  for (int i = 0; i < 100000; ++i) {
    7    sum += *s.find(i);
    8  }
    

Algorithms

Switch: _GLIBCXX_PROFILE_ALGORITHMS.

Sort Algorithm Performance

  • Switch: _GLIBCXX_PROFILE_SORT.

  • Goal: Give measure of sort algorithm performance based on actual input. For instance, advise Radix Sort over Quick Sort for a particular call context.

  • Fundamentals: See papers: A framework for adaptive algorithm selection in STAPL and Optimizing Sorting with Machine Learning Algorithms.

  • Sample runtime reduction:60%.

  • Recommendation: Change sort algorithm at site S from X Sort to Y Sort.

  • To instrument: sort algorithm.

  • Analysis: Issue the advice if the cost model tells us that another sort algorithm would do better on this input. Requires us to know what algorithm we are using in our sort implementation in release mode.

  • Cost model: Runtime(algo) for algo in [radix, quick, merge, ...]

  • Example:

Data Locality

Switch: _GLIBCXX_PROFILE_LOCALITY.

Need Software Prefetch

  • Switch: _GLIBCXX_PROFILE_SOFTWARE_PREFETCH.

  • Goal: Discover sequences of indirect memory accesses that are not regular, thus cannot be predicted by hardware prefetchers.

  • Fundamentals: Indirect references are hard to predict and are very expensive when they miss in caches.

  • Sample runtime reduction:25%.

  • Recommendation: Insert prefetch instruction.

  • To instrument: Vector iterator and access operator [].

  • Analysis: First, get cache line size and page size from system. Then record iterator dereference sequences for which the value is a pointer. For each sequence within a container, issue a warning if successive pointer addresses are not within cache lines and do not form a linear pattern (otherwise they may be prefetched by hardware). If they also step across page boundaries, make the warning stronger.

    The same analysis applies to containers other than vector. However, we cannot give the same advice for linked structures, such as list, as there is no random access to the n-th element. The user may still be able to benefit from this information, for instance by employing frays (user level light weight threads) to hide the latency of chasing pointers.

    This analysis is a little oversimplified. A better cost model could be created by understanding the capability of the hardware prefetcher. This model could be trained automatically by running a set of synthetic cases.

  • Cost model: Total distance between pointer values of successive elements in vectors of pointers.

  • Example:

    1 int zero = 0;
    2 vector<int*> v(10000000, &zero);
    3 for (int k = 0; k < 10000000; ++k) {
    4   v[random() % 10000000] = new int(k);
    5 }
    6 for (int j = 0; j < 10000000; ++j) {
    7   count += (*v[j] == 0 ? 0 : 1);
    8 }
    
    foo.cc:7: advice: Insert prefetch instruction.
    

Linked Structure Locality

  • Switch: _GLIBCXX_PROFILE_RBTREE_LOCALITY.

  • Goal: Give measure of locality of objects stored in linked structures (lists, red-black trees and hashtables) with respect to their actual traversal patterns.

  • Fundamentals:Allocation can be tuned to a specific traversal pattern, to result in better data locality. See paper: Custom Memory Allocation for Free.

  • Sample runtime reduction:30%.

  • Recommendation: High scatter score N for container built at site S. Consider changing allocation sequence or choosing a structure conscious allocator.

  • To instrument: Methods of all containers using linked structures.

  • Analysis: First, get cache line size and page size from system. Then record the number of successive elements that are on different line or page, for each traversal method such as find. Give advice only if the ratio between this number and the number of total node hops is above a threshold.

  • Cost model: Sum(same_cache_line(this,previous))

  • Example:

     1  set<int> s;
     2  for (int i = 0; i < 10000000; ++i) {
     3    s.insert(i);
     4  }
     5  set<int> s1, s2;
     6  for (int i = 0; i < 10000000; ++i) {
     7    s1.insert(i);
     8    s2.insert(i);
     9  }
    ...
          // Fast, better locality.
    10    for (set<int>::iterator it = s.begin(); it != s.end(); ++it) {
    11      sum += *it;
    12    }
          // Slow, elements are further apart.
    13    for (set<int>::iterator it = s1.begin(); it != s1.end(); ++it) {
    14      sum += *it;
    15    }
    
    foo.cc:5: advice: High scatter score NNN for set built here.  Consider changing
    the allocation sequence or switching to a structure conscious allocator.
    

Multithreaded Data Access

The diagnostics in this group are not meant to be implemented short term. They require compiler support to know when container elements are written to. Instrumentation can only tell us when elements are referenced.

Switch: _GLIBCXX_PROFILE_MULTITHREADED.

Data Dependence Violations at Container Level

  • Switch: _GLIBCXX_PROFILE_DDTEST.

  • Goal: Detect container elements that are referenced from multiple threads in the parallel region or across parallel regions.

  • Fundamentals: Sharing data between threads requires communication and perhaps locking, which may be expensive.

  • Sample runtime reduction:?%.

  • Recommendation: Change data distribution or parallel algorithm.

  • To instrument: Container access methods and iterators.

  • Analysis: Keep a shadow for each container. Record iterator dereferences and container member accesses. Issue advice for elements referenced by multiple threads. See paper: The LRPD test: speculative run-time parallelization of loops with privatization and reduction parallelization.

  • Cost model: Number of accesses to elements referenced from multiple threads

  • Example:

False Sharing

  • Switch: _GLIBCXX_PROFILE_FALSE_SHARING.

  • Goal: Detect elements in the same container which share a cache line, are written by at least one thread, and accessed by different threads.

  • Fundamentals: Under these assumptions, cache protocols require communication to invalidate lines, which may be expensive.

  • Sample runtime reduction:68%.

  • Recommendation: Reorganize container or use padding to avoid false sharing.

  • To instrument: Container access methods and iterators.

  • Analysis: First, get the cache line size. For each shared container, record all the associated iterator dereferences and member access methods with the thread id. Compare the address lists across threads to detect references in two different threads to the same cache line. Issue a warning only if the ratio to total references is significant. Do the same for iterator dereference values if they are pointers.

  • Cost model: Number of accesses to same cache line from different threads.

  • Example:

    1     vector<int> v(2, 0);
    2 #pragma omp parallel for shared(v, SIZE) schedule(static, 1)
    3     for (i = 0; i < SIZE; ++i) {
    4       v[i % 2] += i;
    5     }
    
    OMP_NUM_THREADS=2 ./a.out
    foo.cc:1: advice: Change container structure or padding to avoid false
    sharing in multithreaded access at foo.cc:4.  Detected N shared cache lines.
    

Statistics

Switch: _GLIBCXX_PROFILE_STATISTICS.

In some cases the cost model may not tell us anything because the costs appear to offset the benefits. Consider the choice between a vector and a list. When there are both inserts and iteration, an automatic advice may not be issued. However, the programmer may still be able to make use of this information in a different way.

This diagnostic will not issue any advice, but it will print statistics for each container construction site. The statistics will contain the cost of each operation actually performed on the container.

@ 1.1.1.1 log @import GCC 4.8 branch at r206687. highlights from: http://gcc.gnu.org/gcc-4.6/changes.html GCC now has stricter checks for invalid command-line options New -Wunused-but-set-variable and -Wunused-but-set-parameter warnings Many platforms have been obsoleted Link-time optimization improvements A new switch -fstack-usage has been added A new function attribute leaf was introduced A new warning, enabled by -Wdouble-promotion Support for selectively enabling and disabling warnings via #pragma GCC diagnostic has been added There is now experimental support for some features from the upcoming C1X revision of the ISO C standard Improved experimental support for the upcoming C++0x ISO C++ standard G++ now issues clearer diagnostics in several cases Updates for ARM, x86, MIPS, PPC/PPC64, SPARC Darwin, FreeBSD, Solaris 2, MinGW and Cygwin now all support __float128 on 32-bit and 64-bit x86 targets. [*1] highlights from: http://gcc.gnu.org/gcc-4.7/changes.html The -fconserve-space flag has been deprecated Support for a new parameter --param case-values-threshold=n was added Interprocedural and Link-time optimization improvements A new built-in, __builtin_assume_aligned, has been added A new warning option -Wunused-local-typedefs was added A new experimental command-line option -ftrack-macro-expansion was added Support for atomic operations specifying the C++11/C11 memory model has been added There is support for some more features from the C11 revision of the ISO C standard Improved experimental support for the new ISO C++ standard, C++11 Updates for ARM, x86, MIPS, PPC/PPC64, SH, SPARC, TILE* A new option (-grecord-gcc-switches) was added highlights from: http://gcc.gnu.org/gcc-4.8/changes.html GCC now uses C++ as its implementation language. This means that to build GCC from sources, you will need a C++ compiler that understands C++ 2003 DWARF4 is now the default when generating DWARF debug information A new general optimization level, -Og, has been introduced A new option -ftree-partial-pre was added The option -fconserve-space has been removed The command-line options -fipa-struct-reorg and -fipa-matrix-reorg have been removed Interprocedural and Link-time optimization improvements AddressSanitizer, a fast memory error detector, has been added [*2] A new -Wsizeof-pointer-memaccess warning has been added G++ now supports a -std=c++1y option for experimentation with features proposed for the next revision of the standard, expected around 2014 Improved experimental support for the new ISO C++ standard, C++11 A new port has been added to support AArch64 Updates for ARM, x86, MIPS, PPC/PPC64, SH, SPARC, TILE* [*1] we should support this too! [*2] we should look into this. https://code.google.com/p/address-sanitizer/ @ text @@ 1.1.1.2 log @import GCC 5.3.0. see these urls for details which are too large to include here: http://gcc.gnu.org/gcc-4.9/changes.html http://gcc.gnu.org/gcc-5/changes.html (note that GCC 5.x is a release stream like GCC 4.9.x, 4.8.x, etc.) the main issues we will have are: The default mode for C is now -std=gnu11 instead of -std=gnu89. ARM: The deprecated option -mwords-little-endian has been removed. The options -mapcs, -mapcs-frame, -mtpcs-frame and -mtpcs-leaf-frame which are only applicable to the old ABI have been deprecated. MIPS: The o32 ABI has been modified and extended. The o32 64-bit floating-point register support is now obsolete and has been removed. It has been replaced by three ABI extensions FPXX, FP64A, and FP64. The meaning of the -mfp64 command-line option has changed. It is now used to enable the FP64A and FP64 ABI extensions. @ text @d20 1 a20 1

Table 19.2. Profile Diagnostics

GroupFlagBenefitCostFreq.Implemented 
@ 1.1.1.2.16.1 log @Sync with HEAD @ text @d2 1 a2 1 Diagnostics

Diagnostics

d20 1 a20 1

Table 19.2. Profile Diagnostics

GroupFlagBenefitCostFreq.Implemented 
d379 1 a379 1 d444 2 a445 2 Custom Memory Allocation for Free by Jula and Rauchwerger. @ 1.1.1.2.14.1 log @Sync with HEAD, resolve a couple of conflicts @ text @d2 1 a2 1 Diagnostics

Diagnostics

d20 1 a20 1

Table 19.2. Profile Diagnostics

GroupFlagBenefitCostFreq.Implemented 
@ 1.1.1.2.14.2 log @Sync with HEAD @ text @d379 1 a379 1 d444 2 a445 2 Custom Memory Allocation for Free by Jula and Rauchwerger. @ 1.1.1.3 log @import GCC 6.5.0. this is largely a maint release with no particularly features listed here: http://gcc.gnu.org/gcc-6/changes.html this fixes over 250 PRs in the GCC bugzilla: https://gcc.gnu.org/bugzilla/buglist.cgi?bug_status=RESOLVED&resolution=FIXED&target_milestone=6.5 @ text @d2 1 a2 1 Diagnostics

Diagnostics

d20 1 a20 1

Table 19.2. Profile Diagnostics

GroupFlagBenefitCostFreq.Implemented 
@ 1.1.1.4 log @import GCC 7.4.0. main changes include: The non-standard C++0x type traits has_trivial_default_constructor, has_trivial_copy_constructor and has_trivial_copy_assign have been removed. On ARM targets (arm*-*-*), a bug introduced in GCC 5 that affects conformance to the procedure call standard (AAPCS) has been fixed. Many optimiser improvements DWARF-5 support. Many new and enhanced warnings. Warnings about format strings now underline the pertinent part of the string, and can offer suggested fixes. Several new warnings related to buffer overflows and buffer truncation. New __builtin_add_overflow_p, __builtin_sub_overflow_p, __builtin_mul_overflow_p built-ins added that test for overflow. The C++ front end has experimental support for all of the current C++17 draft. The -fverbose-asm option has been expanded to prints comments showing the source lines that correspond to the assembly. The gcc and g++ driver programs will now provide suggestions for misspelled arguments to command-line options. AArch64 specific: GCC has been updated to the latest revision of the procedure call standard (AAPCS64) to provide support for parameter passing when data types have been over-aligned. The ARMv8.2-A and ARMv8.3-A architecture are now supported. ARM specific: Support for the ARMv5 and ARMv5E architectures has been deprecated (which have no known implementations). A new command-line option -mpure-code has been added. It does not allow constant data to be placed in code sections. x86 specific: Support for the AVX-512 4FMAPS, 4VNNIW, VPOPCNTDQ and Software Guard Extensions (SGX) ISA extensions has been added. PPC specific: GCC now diagnoses inline assembly that clobbers register r2. RISC-V specific: Support for the RISC-V instruction set has been added. SH specific: Support for SH5/SH64 has been removed. Support for SH2A has been enhanced. @ text @d379 1 a379 1 d444 2 a445 2 Custom Memory Allocation for Free by Jula and Rauchwerger. @ 1.1.1.5 log @initial import of GCC 9.3.0. changes include: - live patching support - shell completion help - generally better diagnostic output (less verbose/more useful) - diagnostics and optimisation choices can be emitted in json - asan memory usage reduction - many general, and specific to switch, inter-procedure, profile and link-time optimisations. from the release notes: "Overall compile time of Firefox 66 and LibreOffice 6.2.3 on an 8-core machine was reduced by about 5% compared to GCC 8.3" - OpenMP 5.0 support - better spell-guesser - partial experimental support for c2x and c++2a - c++17 is no longer experimental - arm AAPCS GCC 6-8 structure passing bug fixed, may cause incompatibility (restored compat with GCC 5 and earlier.) - openrisc support @ text @d379 1 a379 1 @ 1.1.1.1.8.1 log @file profile_mode_diagnostics.html was added on branch tls-maxphys on 2014-08-19 23:54:46 +0000 @ text @d1 557 @ 1.1.1.1.8.2 log @Rebase to HEAD as of a few days ago. @ text @a0 557 Diagnostics

Diagnostics

The table below presents all the diagnostics we intend to implement. Each diagnostic has a corresponding compile time switch -D_GLIBCXX_PROFILE_<diagnostic>. Groups of related diagnostics can be turned on with a single switch. For instance, -D_GLIBCXX_PROFILE_LOCALITY is equivalent to -D_GLIBCXX_PROFILE_SOFTWARE_PREFETCH -D_GLIBCXX_PROFILE_RBTREE_LOCALITY.

The benefit, cost, expected frequency and accuracy of each diagnostic was given a grade from 1 to 10, where 10 is highest. A high benefit means that, if the diagnostic is accurate, the expected performance improvement is high. A high cost means that turning this diagnostic on leads to high slowdown. A high frequency means that we expect this to occur relatively often. A high accuracy means that the diagnostic is unlikely to be wrong. These grades are not perfect. They are just meant to guide users with specific needs or time budgets.

Table 19.2. Profile Diagnostics

GroupFlagBenefitCostFreq.Implemented 
CONTAINERS HASHTABLE_TOO_SMALL101 10yes
  HASHTABLE_TOO_LARGE51 10yes
  INEFFICIENT_HASH73 10yes
  VECTOR_TOO_SMALL81 10yes
  VECTOR_TOO_LARGE51 10yes
  VECTOR_TO_HASHTABLE77 10no
  HASHTABLE_TO_VECTOR77 10no
  VECTOR_TO_LIST85 10yes
  LIST_TO_VECTOR105 10no
  ORDERED_TO_UNORDERED105 10only map/unordered_map
ALGORITHMS SORT78 7no
LOCALITY SOFTWARE_PREFETCH88 5no
  RBTREE_LOCALITY48 5no
  FALSE_SHARING810 10no

Diagnostic Template

  • Switch: _GLIBCXX_PROFILE_<diagnostic>.

  • Goal: What problem will it diagnose?

  • Fundamentals:. What is the fundamental reason why this is a problem

  • Sample runtime reduction: Percentage reduction in execution time. When reduction is more than a constant factor, describe the reduction rate formula.

  • Recommendation: What would the advise look like?

  • To instrument: What stdlibc++ components need to be instrumented?

  • Analysis: How do we decide when to issue the advice?

  • Cost model: How do we measure benefits? Math goes here.

  • Example:

    program code
    ...
    advice sample
    

Containers

Switch: _GLIBCXX_PROFILE_CONTAINERS.

Hashtable Too Small

  • Switch: _GLIBCXX_PROFILE_HASHTABLE_TOO_SMALL.

  • Goal: Detect hashtables with many rehash operations, small construction size and large destruction size.

  • Fundamentals: Rehash is very expensive. Read content, follow chains within bucket, evaluate hash function, place at new location in different order.

  • Sample runtime reduction: 36%. Code similar to example below.

  • Recommendation: Set initial size to N at construction site S.

  • To instrument: unordered_set, unordered_map constructor, destructor, rehash.

  • Analysis: For each dynamic instance of unordered_[multi]set|map, record initial size and call context of the constructor. Record size increase, if any, after each relevant operation such as insert. Record the estimated rehash cost.

  • Cost model: Number of individual rehash operations * cost per rehash.

  • Example:

    1 unordered_set<int> us;
    2 for (int k = 0; k < 1000000; ++k) {
    3   us.insert(k);
    4 }
    
    foo.cc:1: advice: Changing initial unordered_set size from 10 to 1000000 saves 1025530 rehash operations.
    

Hashtable Too Large

  • Switch: _GLIBCXX_PROFILE_HASHTABLE_TOO_LARGE.

  • Goal: Detect hashtables which are never filled up because fewer elements than reserved are ever inserted.

  • Fundamentals: Save memory, which is good in itself and may also improve memory reference performance through fewer cache and TLB misses.

  • Sample runtime reduction: unknown.

  • Recommendation: Set initial size to N at construction site S.

  • To instrument: unordered_set, unordered_map constructor, destructor, rehash.

  • Analysis: For each dynamic instance of unordered_[multi]set|map, record initial size and call context of the constructor, and correlate it with its size at destruction time.

  • Cost model: Number of iteration operations + memory saved.

  • Example:

    1 vector<unordered_set<int>> v(100000, unordered_set<int>(100)) ;
    2 for (int k = 0; k < 100000; ++k) {
    3   for (int j = 0; j < 10; ++j) {
    4     v[k].insert(k + j);
    5  }
    6 }
    
    foo.cc:1: advice: Changing initial unordered_set size from 100 to 10 saves N
    bytes of memory and M iteration steps.
    

Inefficient Hash

  • Switch: _GLIBCXX_PROFILE_INEFFICIENT_HASH.

  • Goal: Detect hashtables with polarized distribution.

  • Fundamentals: A non-uniform distribution may lead to long chains, thus possibly increasing complexity by a factor up to the number of elements.

  • Sample runtime reduction: factor up to container size.

  • Recommendation: Change hash function for container built at site S. Distribution score = N. Access score = S. Longest chain = C, in bucket B.

  • To instrument: unordered_set, unordered_map constructor, destructor, [], insert, iterator.

  • Analysis: Count the exact number of link traversals.

  • Cost model: Total number of links traversed.

  • Example:

    class dumb_hash {
     public:
      size_t operator() (int i) const { return 0; }
    };
    ...
      unordered_set<int, dumb_hash> hs;
      ...
      for (int i = 0; i < COUNT; ++i) {
        hs.find(i);
      }
    

Vector Too Small

  • Switch: _GLIBCXX_PROFILE_VECTOR_TOO_SMALL.

  • Goal:Detect vectors with many resize operations, small construction size and large destruction size..

  • Fundamentals:Resizing can be expensive. Copying large amounts of data takes time. Resizing many small vectors may have allocation overhead and affect locality.

  • Sample runtime reduction:%.

  • Recommendation: Set initial size to N at construction site S.

  • To instrument:vector.

  • Analysis: For each dynamic instance of vector, record initial size and call context of the constructor. Record size increase, if any, after each relevant operation such as push_back. Record the estimated resize cost.

  • Cost model: Total number of words copied * time to copy a word.

  • Example:

    1 vector<int> v;
    2 for (int k = 0; k < 1000000; ++k) {
    3   v.push_back(k);
    4 }
    
    foo.cc:1: advice: Changing initial vector size from 10 to 1000000 saves
    copying 4000000 bytes and 20 memory allocations and deallocations.
    

Vector Too Large

  • Switch: _GLIBCXX_PROFILE_VECTOR_TOO_LARGE

  • Goal:Detect vectors which are never filled up because fewer elements than reserved are ever inserted.

  • Fundamentals:Save memory, which is good in itself and may also improve memory reference performance through fewer cache and TLB misses.

  • Sample runtime reduction:%.

  • Recommendation: Set initial size to N at construction site S.

  • To instrument:vector.

  • Analysis: For each dynamic instance of vector, record initial size and call context of the constructor, and correlate it with its size at destruction time.

  • Cost model: Total amount of memory saved.

  • Example:

    1 vector<vector<int>> v(100000, vector<int>(100)) ;
    2 for (int k = 0; k < 100000; ++k) {
    3   for (int j = 0; j < 10; ++j) {
    4     v[k].insert(k + j);
    5  }
    6 }
    
    foo.cc:1: advice: Changing initial vector size from 100 to 10 saves N
    bytes of memory and may reduce the number of cache and TLB misses.
    

Vector to Hashtable

  • Switch: _GLIBCXX_PROFILE_VECTOR_TO_HASHTABLE.

  • Goal: Detect uses of vector that can be substituted with unordered_set to reduce execution time.

  • Fundamentals: Linear search in a vector is very expensive, whereas searching in a hashtable is very quick.

  • Sample runtime reduction:factor up to container size.

  • Recommendation:Replace vector with unordered_set at site S.

  • To instrument:vector operations and access methods.

  • Analysis: For each dynamic instance of vector, record call context of the constructor. Issue the advice only if the only methods called on this vector are push_back, insert and find.

  • Cost model: Cost(vector::push_back) + cost(vector::insert) + cost(find, vector) - cost(unordered_set::insert) + cost(unordered_set::find).

  • Example:

    1  vector<int> v;
    ...
    2  for (int i = 0; i < 1000; ++i) {
    3    find(v.begin(), v.end(), i);
    4  }
    
    foo.cc:1: advice: Changing "vector" to "unordered_set" will save about 500,000
    comparisons.
    

Hashtable to Vector

  • Switch: _GLIBCXX_PROFILE_HASHTABLE_TO_VECTOR.

  • Goal: Detect uses of unordered_set that can be substituted with vector to reduce execution time.

  • Fundamentals: Hashtable iterator is slower than vector iterator.

  • Sample runtime reduction:95%.

  • Recommendation:Replace unordered_set with vector at site S.

  • To instrument:unordered_set operations and access methods.

  • Analysis: For each dynamic instance of unordered_set, record call context of the constructor. Issue the advice only if the number of find, insert and [] operations on this unordered_set are small relative to the number of elements, and methods begin or end are invoked (suggesting iteration).

  • Cost model: Number of .

  • Example:

    1  unordered_set<int> us;
    ...
    2  int s = 0;
    3  for (unordered_set<int>::iterator it = us.begin(); it != us.end(); ++it) {
    4    s += *it;
    5  }
    
    foo.cc:1: advice: Changing "unordered_set" to "vector" will save about N
    indirections and may achieve better data locality.
    

Vector to List

  • Switch: _GLIBCXX_PROFILE_VECTOR_TO_LIST.

  • Goal: Detect cases where vector could be substituted with list for better performance.

  • Fundamentals: Inserting in the middle of a vector is expensive compared to inserting in a list.

  • Sample runtime reduction:factor up to container size.

  • Recommendation:Replace vector with list at site S.

  • To instrument:vector operations and access methods.

  • Analysis: For each dynamic instance of vector, record the call context of the constructor. Record the overhead of each insert operation based on current size and insert position. Report instance with high insertion overhead.

  • Cost model: (Sum(cost(vector::method)) - Sum(cost(list::method)), for method in [push_back, insert, erase]) + (Cost(iterate vector) - Cost(iterate list))

  • Example:

    1  vector<int> v;
    2  for (int i = 0; i < 10000; ++i) {
    3    v.insert(v.begin(), i);
    4  }
    
    foo.cc:1: advice: Changing "vector" to "list" will save about 5,000,000
    operations.
    

List to Vector

  • Switch: _GLIBCXX_PROFILE_LIST_TO_VECTOR.

  • Goal: Detect cases where list could be substituted with vector for better performance.

  • Fundamentals: Iterating through a vector is faster than through a list.

  • Sample runtime reduction:64%.

  • Recommendation:Replace list with vector at site S.

  • To instrument:vector operations and access methods.

  • Analysis: Issue the advice if there are no insert operations.

  • Cost model: (Sum(cost(vector::method)) - Sum(cost(list::method)), for method in [push_back, insert, erase]) + (Cost(iterate vector) - Cost(iterate list))

  • Example:

    1  list<int> l;
    ...
    2  int sum = 0;
    3  for (list<int>::iterator it = l.begin(); it != l.end(); ++it) {
    4    sum += *it;
    5  }
    
    foo.cc:1: advice: Changing "list" to "vector" will save about 1000000 indirect
    memory references.
    

List to Forward List (Slist)

  • Switch: _GLIBCXX_PROFILE_LIST_TO_SLIST.

  • Goal: Detect cases where list could be substituted with forward_list for better performance.

  • Fundamentals: The memory footprint of a forward_list is smaller than that of a list. This has beneficial effects on memory subsystem, e.g., fewer cache misses.

  • Sample runtime reduction:40%. Note that the reduction is only noticeable if the size of the forward_list node is in fact larger than that of the list node. For memory allocators with size classes, you will only notice an effect when the two node sizes belong to different allocator size classes.

  • Recommendation:Replace list with forward_list at site S.

  • To instrument:list operations and iteration methods.

  • Analysis: Issue the advice if there are no backwards traversals or insertion before a given node.

  • Cost model: Always true.

  • Example:

    1  list<int> l;
    ...
    2  int sum = 0;
    3  for (list<int>::iterator it = l.begin(); it != l.end(); ++it) {
    4    sum += *it;
    5  }
    
    foo.cc:1: advice: Change "list" to "forward_list".
    

Ordered to Unordered Associative Container

  • Switch: _GLIBCXX_PROFILE_ORDERED_TO_UNORDERED.

  • Goal: Detect cases where ordered associative containers can be replaced with unordered ones.

  • Fundamentals: Insert and search are quicker in a hashtable than in a red-black tree.

  • Sample runtime reduction:52%.

  • Recommendation: Replace set with unordered_set at site S.

  • To instrument: set, multiset, map, multimap methods.

  • Analysis: Issue the advice only if we are not using operator ++ on any iterator on a particular [multi]set|map.

  • Cost model: (Sum(cost(hashtable::method)) - Sum(cost(rbtree::method)), for method in [insert, erase, find]) + (Cost(iterate hashtable) - Cost(iterate rbtree))

  • Example:

    1  set<int> s;
    2  for (int i = 0; i < 100000; ++i) {
    3    s.insert(i);
    4  }
    5  int sum = 0;
    6  for (int i = 0; i < 100000; ++i) {
    7    sum += *s.find(i);
    8  }
    

Algorithms

Switch: _GLIBCXX_PROFILE_ALGORITHMS.

Sort Algorithm Performance

  • Switch: _GLIBCXX_PROFILE_SORT.

  • Goal: Give measure of sort algorithm performance based on actual input. For instance, advise Radix Sort over Quick Sort for a particular call context.

  • Fundamentals: See papers: A framework for adaptive algorithm selection in STAPL and Optimizing Sorting with Machine Learning Algorithms.

  • Sample runtime reduction:60%.

  • Recommendation: Change sort algorithm at site S from X Sort to Y Sort.

  • To instrument: sort algorithm.

  • Analysis: Issue the advice if the cost model tells us that another sort algorithm would do better on this input. Requires us to know what algorithm we are using in our sort implementation in release mode.

  • Cost model: Runtime(algo) for algo in [radix, quick, merge, ...]

  • Example:

Data Locality

Switch: _GLIBCXX_PROFILE_LOCALITY.

Need Software Prefetch

  • Switch: _GLIBCXX_PROFILE_SOFTWARE_PREFETCH.

  • Goal: Discover sequences of indirect memory accesses that are not regular, thus cannot be predicted by hardware prefetchers.

  • Fundamentals: Indirect references are hard to predict and are very expensive when they miss in caches.

  • Sample runtime reduction:25%.

  • Recommendation: Insert prefetch instruction.

  • To instrument: Vector iterator and access operator [].

  • Analysis: First, get cache line size and page size from system. Then record iterator dereference sequences for which the value is a pointer. For each sequence within a container, issue a warning if successive pointer addresses are not within cache lines and do not form a linear pattern (otherwise they may be prefetched by hardware). If they also step across page boundaries, make the warning stronger.

    The same analysis applies to containers other than vector. However, we cannot give the same advice for linked structures, such as list, as there is no random access to the n-th element. The user may still be able to benefit from this information, for instance by employing frays (user level light weight threads) to hide the latency of chasing pointers.

    This analysis is a little oversimplified. A better cost model could be created by understanding the capability of the hardware prefetcher. This model could be trained automatically by running a set of synthetic cases.

  • Cost model: Total distance between pointer values of successive elements in vectors of pointers.

  • Example:

    1 int zero = 0;
    2 vector<int*> v(10000000, &zero);
    3 for (int k = 0; k < 10000000; ++k) {
    4   v[random() % 10000000] = new int(k);
    5 }
    6 for (int j = 0; j < 10000000; ++j) {
    7   count += (*v[j] == 0 ? 0 : 1);
    8 }
    
    foo.cc:7: advice: Insert prefetch instruction.
    

Linked Structure Locality

  • Switch: _GLIBCXX_PROFILE_RBTREE_LOCALITY.

  • Goal: Give measure of locality of objects stored in linked structures (lists, red-black trees and hashtables) with respect to their actual traversal patterns.

  • Fundamentals:Allocation can be tuned to a specific traversal pattern, to result in better data locality. See paper: Custom Memory Allocation for Free.

  • Sample runtime reduction:30%.

  • Recommendation: High scatter score N for container built at site S. Consider changing allocation sequence or choosing a structure conscious allocator.

  • To instrument: Methods of all containers using linked structures.

  • Analysis: First, get cache line size and page size from system. Then record the number of successive elements that are on different line or page, for each traversal method such as find. Give advice only if the ratio between this number and the number of total node hops is above a threshold.

  • Cost model: Sum(same_cache_line(this,previous))

  • Example:

     1  set<int> s;
     2  for (int i = 0; i < 10000000; ++i) {
     3    s.insert(i);
     4  }
     5  set<int> s1, s2;
     6  for (int i = 0; i < 10000000; ++i) {
     7    s1.insert(i);
     8    s2.insert(i);
     9  }
    ...
          // Fast, better locality.
    10    for (set<int>::iterator it = s.begin(); it != s.end(); ++it) {
    11      sum += *it;
    12    }
          // Slow, elements are further apart.
    13    for (set<int>::iterator it = s1.begin(); it != s1.end(); ++it) {
    14      sum += *it;
    15    }
    
    foo.cc:5: advice: High scatter score NNN for set built here.  Consider changing
    the allocation sequence or switching to a structure conscious allocator.
    

Multithreaded Data Access

The diagnostics in this group are not meant to be implemented short term. They require compiler support to know when container elements are written to. Instrumentation can only tell us when elements are referenced.

Switch: _GLIBCXX_PROFILE_MULTITHREADED.

Data Dependence Violations at Container Level

  • Switch: _GLIBCXX_PROFILE_DDTEST.

  • Goal: Detect container elements that are referenced from multiple threads in the parallel region or across parallel regions.

  • Fundamentals: Sharing data between threads requires communication and perhaps locking, which may be expensive.

  • Sample runtime reduction:?%.

  • Recommendation: Change data distribution or parallel algorithm.

  • To instrument: Container access methods and iterators.

  • Analysis: Keep a shadow for each container. Record iterator dereferences and container member accesses. Issue advice for elements referenced by multiple threads. See paper: The LRPD test: speculative run-time parallelization of loops with privatization and reduction parallelization.

  • Cost model: Number of accesses to elements referenced from multiple threads

  • Example:

False Sharing

  • Switch: _GLIBCXX_PROFILE_FALSE_SHARING.

  • Goal: Detect elements in the same container which share a cache line, are written by at least one thread, and accessed by different threads.

  • Fundamentals: Under these assumptions, cache protocols require communication to invalidate lines, which may be expensive.

  • Sample runtime reduction:68%.

  • Recommendation: Reorganize container or use padding to avoid false sharing.

  • To instrument: Container access methods and iterators.

  • Analysis: First, get the cache line size. For each shared container, record all the associated iterator dereferences and member access methods with the thread id. Compare the address lists across threads to detect references in two different threads to the same cache line. Issue a warning only if the ratio to total references is significant. Do the same for iterator dereference values if they are pointers.

  • Cost model: Number of accesses to same cache line from different threads.

  • Example:

    1     vector<int> v(2, 0);
    2 #pragma omp parallel for shared(v, SIZE) schedule(static, 1)
    3     for (i = 0; i < SIZE; ++i) {
    4       v[i % 2] += i;
    5     }
    
    OMP_NUM_THREADS=2 ./a.out
    foo.cc:1: advice: Change container structure or padding to avoid false
    sharing in multithreaded access at foo.cc:4.  Detected N shared cache lines.
    

Statistics

Switch: _GLIBCXX_PROFILE_STATISTICS.

In some cases the cost model may not tell us anything because the costs appear to offset the benefits. Consider the choice between a vector and a list. When there are both inserts and iteration, an automatic advice may not be issued. However, the programmer may still be able to make use of this information in a different way.

This diagnostic will not issue any advice, but it will print statistics for each container construction site. The statistics will contain the cost of each operation actually performed on the container.

@ 1.1.1.1.4.1 log @file profile_mode_diagnostics.html was added on branch yamt-pagecache on 2014-05-22 16:37:45 +0000 @ text @d1 557 @ 1.1.1.1.4.2 log @sync with head. for a reference, the tree before this commit was tagged as yamt-pagecache-tag8. this commit was splitted into small chunks to avoid a limitation of cvs. ("Protocol error: too many arguments") @ text @a0 557 Diagnostics

Diagnostics

The table below presents all the diagnostics we intend to implement. Each diagnostic has a corresponding compile time switch -D_GLIBCXX_PROFILE_<diagnostic>. Groups of related diagnostics can be turned on with a single switch. For instance, -D_GLIBCXX_PROFILE_LOCALITY is equivalent to -D_GLIBCXX_PROFILE_SOFTWARE_PREFETCH -D_GLIBCXX_PROFILE_RBTREE_LOCALITY.

The benefit, cost, expected frequency and accuracy of each diagnostic was given a grade from 1 to 10, where 10 is highest. A high benefit means that, if the diagnostic is accurate, the expected performance improvement is high. A high cost means that turning this diagnostic on leads to high slowdown. A high frequency means that we expect this to occur relatively often. A high accuracy means that the diagnostic is unlikely to be wrong. These grades are not perfect. They are just meant to guide users with specific needs or time budgets.

Table 19.2. Profile Diagnostics

GroupFlagBenefitCostFreq.Implemented 
CONTAINERS HASHTABLE_TOO_SMALL101 10yes
  HASHTABLE_TOO_LARGE51 10yes
  INEFFICIENT_HASH73 10yes
  VECTOR_TOO_SMALL81 10yes
  VECTOR_TOO_LARGE51 10yes
  VECTOR_TO_HASHTABLE77 10no
  HASHTABLE_TO_VECTOR77 10no
  VECTOR_TO_LIST85 10yes
  LIST_TO_VECTOR105 10no
  ORDERED_TO_UNORDERED105 10only map/unordered_map
ALGORITHMS SORT78 7no
LOCALITY SOFTWARE_PREFETCH88 5no
  RBTREE_LOCALITY48 5no
  FALSE_SHARING810 10no

Diagnostic Template

  • Switch: _GLIBCXX_PROFILE_<diagnostic>.

  • Goal: What problem will it diagnose?

  • Fundamentals:. What is the fundamental reason why this is a problem

  • Sample runtime reduction: Percentage reduction in execution time. When reduction is more than a constant factor, describe the reduction rate formula.

  • Recommendation: What would the advise look like?

  • To instrument: What stdlibc++ components need to be instrumented?

  • Analysis: How do we decide when to issue the advice?

  • Cost model: How do we measure benefits? Math goes here.

  • Example:

    program code
    ...
    advice sample
    

Containers

Switch: _GLIBCXX_PROFILE_CONTAINERS.

Hashtable Too Small

  • Switch: _GLIBCXX_PROFILE_HASHTABLE_TOO_SMALL.

  • Goal: Detect hashtables with many rehash operations, small construction size and large destruction size.

  • Fundamentals: Rehash is very expensive. Read content, follow chains within bucket, evaluate hash function, place at new location in different order.

  • Sample runtime reduction: 36%. Code similar to example below.

  • Recommendation: Set initial size to N at construction site S.

  • To instrument: unordered_set, unordered_map constructor, destructor, rehash.

  • Analysis: For each dynamic instance of unordered_[multi]set|map, record initial size and call context of the constructor. Record size increase, if any, after each relevant operation such as insert. Record the estimated rehash cost.

  • Cost model: Number of individual rehash operations * cost per rehash.

  • Example:

    1 unordered_set<int> us;
    2 for (int k = 0; k < 1000000; ++k) {
    3   us.insert(k);
    4 }
    
    foo.cc:1: advice: Changing initial unordered_set size from 10 to 1000000 saves 1025530 rehash operations.
    

Hashtable Too Large

  • Switch: _GLIBCXX_PROFILE_HASHTABLE_TOO_LARGE.

  • Goal: Detect hashtables which are never filled up because fewer elements than reserved are ever inserted.

  • Fundamentals: Save memory, which is good in itself and may also improve memory reference performance through fewer cache and TLB misses.

  • Sample runtime reduction: unknown.

  • Recommendation: Set initial size to N at construction site S.

  • To instrument: unordered_set, unordered_map constructor, destructor, rehash.

  • Analysis: For each dynamic instance of unordered_[multi]set|map, record initial size and call context of the constructor, and correlate it with its size at destruction time.

  • Cost model: Number of iteration operations + memory saved.

  • Example:

    1 vector<unordered_set<int>> v(100000, unordered_set<int>(100)) ;
    2 for (int k = 0; k < 100000; ++k) {
    3   for (int j = 0; j < 10; ++j) {
    4     v[k].insert(k + j);
    5  }
    6 }
    
    foo.cc:1: advice: Changing initial unordered_set size from 100 to 10 saves N
    bytes of memory and M iteration steps.
    

Inefficient Hash

  • Switch: _GLIBCXX_PROFILE_INEFFICIENT_HASH.

  • Goal: Detect hashtables with polarized distribution.

  • Fundamentals: A non-uniform distribution may lead to long chains, thus possibly increasing complexity by a factor up to the number of elements.

  • Sample runtime reduction: factor up to container size.

  • Recommendation: Change hash function for container built at site S. Distribution score = N. Access score = S. Longest chain = C, in bucket B.

  • To instrument: unordered_set, unordered_map constructor, destructor, [], insert, iterator.

  • Analysis: Count the exact number of link traversals.

  • Cost model: Total number of links traversed.

  • Example:

    class dumb_hash {
     public:
      size_t operator() (int i) const { return 0; }
    };
    ...
      unordered_set<int, dumb_hash> hs;
      ...
      for (int i = 0; i < COUNT; ++i) {
        hs.find(i);
      }
    

Vector Too Small

  • Switch: _GLIBCXX_PROFILE_VECTOR_TOO_SMALL.

  • Goal:Detect vectors with many resize operations, small construction size and large destruction size..

  • Fundamentals:Resizing can be expensive. Copying large amounts of data takes time. Resizing many small vectors may have allocation overhead and affect locality.

  • Sample runtime reduction:%.

  • Recommendation: Set initial size to N at construction site S.

  • To instrument:vector.

  • Analysis: For each dynamic instance of vector, record initial size and call context of the constructor. Record size increase, if any, after each relevant operation such as push_back. Record the estimated resize cost.

  • Cost model: Total number of words copied * time to copy a word.

  • Example:

    1 vector<int> v;
    2 for (int k = 0; k < 1000000; ++k) {
    3   v.push_back(k);
    4 }
    
    foo.cc:1: advice: Changing initial vector size from 10 to 1000000 saves
    copying 4000000 bytes and 20 memory allocations and deallocations.
    

Vector Too Large

  • Switch: _GLIBCXX_PROFILE_VECTOR_TOO_LARGE

  • Goal:Detect vectors which are never filled up because fewer elements than reserved are ever inserted.

  • Fundamentals:Save memory, which is good in itself and may also improve memory reference performance through fewer cache and TLB misses.

  • Sample runtime reduction:%.

  • Recommendation: Set initial size to N at construction site S.

  • To instrument:vector.

  • Analysis: For each dynamic instance of vector, record initial size and call context of the constructor, and correlate it with its size at destruction time.

  • Cost model: Total amount of memory saved.

  • Example:

    1 vector<vector<int>> v(100000, vector<int>(100)) ;
    2 for (int k = 0; k < 100000; ++k) {
    3   for (int j = 0; j < 10; ++j) {
    4     v[k].insert(k + j);
    5  }
    6 }
    
    foo.cc:1: advice: Changing initial vector size from 100 to 10 saves N
    bytes of memory and may reduce the number of cache and TLB misses.
    

Vector to Hashtable

  • Switch: _GLIBCXX_PROFILE_VECTOR_TO_HASHTABLE.

  • Goal: Detect uses of vector that can be substituted with unordered_set to reduce execution time.

  • Fundamentals: Linear search in a vector is very expensive, whereas searching in a hashtable is very quick.

  • Sample runtime reduction:factor up to container size.

  • Recommendation:Replace vector with unordered_set at site S.

  • To instrument:vector operations and access methods.

  • Analysis: For each dynamic instance of vector, record call context of the constructor. Issue the advice only if the only methods called on this vector are push_back, insert and find.

  • Cost model: Cost(vector::push_back) + cost(vector::insert) + cost(find, vector) - cost(unordered_set::insert) + cost(unordered_set::find).

  • Example:

    1  vector<int> v;
    ...
    2  for (int i = 0; i < 1000; ++i) {
    3    find(v.begin(), v.end(), i);
    4  }
    
    foo.cc:1: advice: Changing "vector" to "unordered_set" will save about 500,000
    comparisons.
    

Hashtable to Vector

  • Switch: _GLIBCXX_PROFILE_HASHTABLE_TO_VECTOR.

  • Goal: Detect uses of unordered_set that can be substituted with vector to reduce execution time.

  • Fundamentals: Hashtable iterator is slower than vector iterator.

  • Sample runtime reduction:95%.

  • Recommendation:Replace unordered_set with vector at site S.

  • To instrument:unordered_set operations and access methods.

  • Analysis: For each dynamic instance of unordered_set, record call context of the constructor. Issue the advice only if the number of find, insert and [] operations on this unordered_set are small relative to the number of elements, and methods begin or end are invoked (suggesting iteration).

  • Cost model: Number of .

  • Example:

    1  unordered_set<int> us;
    ...
    2  int s = 0;
    3  for (unordered_set<int>::iterator it = us.begin(); it != us.end(); ++it) {
    4    s += *it;
    5  }
    
    foo.cc:1: advice: Changing "unordered_set" to "vector" will save about N
    indirections and may achieve better data locality.
    

Vector to List

  • Switch: _GLIBCXX_PROFILE_VECTOR_TO_LIST.

  • Goal: Detect cases where vector could be substituted with list for better performance.

  • Fundamentals: Inserting in the middle of a vector is expensive compared to inserting in a list.

  • Sample runtime reduction:factor up to container size.

  • Recommendation:Replace vector with list at site S.

  • To instrument:vector operations and access methods.

  • Analysis: For each dynamic instance of vector, record the call context of the constructor. Record the overhead of each insert operation based on current size and insert position. Report instance with high insertion overhead.

  • Cost model: (Sum(cost(vector::method)) - Sum(cost(list::method)), for method in [push_back, insert, erase]) + (Cost(iterate vector) - Cost(iterate list))

  • Example:

    1  vector<int> v;
    2  for (int i = 0; i < 10000; ++i) {
    3    v.insert(v.begin(), i);
    4  }
    
    foo.cc:1: advice: Changing "vector" to "list" will save about 5,000,000
    operations.
    

List to Vector

  • Switch: _GLIBCXX_PROFILE_LIST_TO_VECTOR.

  • Goal: Detect cases where list could be substituted with vector for better performance.

  • Fundamentals: Iterating through a vector is faster than through a list.

  • Sample runtime reduction:64%.

  • Recommendation:Replace list with vector at site S.

  • To instrument:vector operations and access methods.

  • Analysis: Issue the advice if there are no insert operations.

  • Cost model: (Sum(cost(vector::method)) - Sum(cost(list::method)), for method in [push_back, insert, erase]) + (Cost(iterate vector) - Cost(iterate list))

  • Example:

    1  list<int> l;
    ...
    2  int sum = 0;
    3  for (list<int>::iterator it = l.begin(); it != l.end(); ++it) {
    4    sum += *it;
    5  }
    
    foo.cc:1: advice: Changing "list" to "vector" will save about 1000000 indirect
    memory references.
    

List to Forward List (Slist)

  • Switch: _GLIBCXX_PROFILE_LIST_TO_SLIST.

  • Goal: Detect cases where list could be substituted with forward_list for better performance.

  • Fundamentals: The memory footprint of a forward_list is smaller than that of a list. This has beneficial effects on memory subsystem, e.g., fewer cache misses.

  • Sample runtime reduction:40%. Note that the reduction is only noticeable if the size of the forward_list node is in fact larger than that of the list node. For memory allocators with size classes, you will only notice an effect when the two node sizes belong to different allocator size classes.

  • Recommendation:Replace list with forward_list at site S.

  • To instrument:list operations and iteration methods.

  • Analysis: Issue the advice if there are no backwards traversals or insertion before a given node.

  • Cost model: Always true.

  • Example:

    1  list<int> l;
    ...
    2  int sum = 0;
    3  for (list<int>::iterator it = l.begin(); it != l.end(); ++it) {
    4    sum += *it;
    5  }
    
    foo.cc:1: advice: Change "list" to "forward_list".
    

Ordered to Unordered Associative Container

  • Switch: _GLIBCXX_PROFILE_ORDERED_TO_UNORDERED.

  • Goal: Detect cases where ordered associative containers can be replaced with unordered ones.

  • Fundamentals: Insert and search are quicker in a hashtable than in a red-black tree.

  • Sample runtime reduction:52%.

  • Recommendation: Replace set with unordered_set at site S.

  • To instrument: set, multiset, map, multimap methods.

  • Analysis: Issue the advice only if we are not using operator ++ on any iterator on a particular [multi]set|map.

  • Cost model: (Sum(cost(hashtable::method)) - Sum(cost(rbtree::method)), for method in [insert, erase, find]) + (Cost(iterate hashtable) - Cost(iterate rbtree))

  • Example:

    1  set<int> s;
    2  for (int i = 0; i < 100000; ++i) {
    3    s.insert(i);
    4  }
    5  int sum = 0;
    6  for (int i = 0; i < 100000; ++i) {
    7    sum += *s.find(i);
    8  }
    

Algorithms

Switch: _GLIBCXX_PROFILE_ALGORITHMS.

Sort Algorithm Performance

  • Switch: _GLIBCXX_PROFILE_SORT.

  • Goal: Give measure of sort algorithm performance based on actual input. For instance, advise Radix Sort over Quick Sort for a particular call context.

  • Fundamentals: See papers: A framework for adaptive algorithm selection in STAPL and Optimizing Sorting with Machine Learning Algorithms.

  • Sample runtime reduction:60%.

  • Recommendation: Change sort algorithm at site S from X Sort to Y Sort.

  • To instrument: sort algorithm.

  • Analysis: Issue the advice if the cost model tells us that another sort algorithm would do better on this input. Requires us to know what algorithm we are using in our sort implementation in release mode.

  • Cost model: Runtime(algo) for algo in [radix, quick, merge, ...]

  • Example:

Data Locality

Switch: _GLIBCXX_PROFILE_LOCALITY.

Need Software Prefetch

  • Switch: _GLIBCXX_PROFILE_SOFTWARE_PREFETCH.

  • Goal: Discover sequences of indirect memory accesses that are not regular, thus cannot be predicted by hardware prefetchers.

  • Fundamentals: Indirect references are hard to predict and are very expensive when they miss in caches.

  • Sample runtime reduction:25%.

  • Recommendation: Insert prefetch instruction.

  • To instrument: Vector iterator and access operator [].

  • Analysis: First, get cache line size and page size from system. Then record iterator dereference sequences for which the value is a pointer. For each sequence within a container, issue a warning if successive pointer addresses are not within cache lines and do not form a linear pattern (otherwise they may be prefetched by hardware). If they also step across page boundaries, make the warning stronger.

    The same analysis applies to containers other than vector. However, we cannot give the same advice for linked structures, such as list, as there is no random access to the n-th element. The user may still be able to benefit from this information, for instance by employing frays (user level light weight threads) to hide the latency of chasing pointers.

    This analysis is a little oversimplified. A better cost model could be created by understanding the capability of the hardware prefetcher. This model could be trained automatically by running a set of synthetic cases.

  • Cost model: Total distance between pointer values of successive elements in vectors of pointers.

  • Example:

    1 int zero = 0;
    2 vector<int*> v(10000000, &zero);
    3 for (int k = 0; k < 10000000; ++k) {
    4   v[random() % 10000000] = new int(k);
    5 }
    6 for (int j = 0; j < 10000000; ++j) {
    7   count += (*v[j] == 0 ? 0 : 1);
    8 }
    
    foo.cc:7: advice: Insert prefetch instruction.
    

Linked Structure Locality

  • Switch: _GLIBCXX_PROFILE_RBTREE_LOCALITY.

  • Goal: Give measure of locality of objects stored in linked structures (lists, red-black trees and hashtables) with respect to their actual traversal patterns.

  • Fundamentals:Allocation can be tuned to a specific traversal pattern, to result in better data locality. See paper: Custom Memory Allocation for Free.

  • Sample runtime reduction:30%.

  • Recommendation: High scatter score N for container built at site S. Consider changing allocation sequence or choosing a structure conscious allocator.

  • To instrument: Methods of all containers using linked structures.

  • Analysis: First, get cache line size and page size from system. Then record the number of successive elements that are on different line or page, for each traversal method such as find. Give advice only if the ratio between this number and the number of total node hops is above a threshold.

  • Cost model: Sum(same_cache_line(this,previous))

  • Example:

     1  set<int> s;
     2  for (int i = 0; i < 10000000; ++i) {
     3    s.insert(i);
     4  }
     5  set<int> s1, s2;
     6  for (int i = 0; i < 10000000; ++i) {
     7    s1.insert(i);
     8    s2.insert(i);
     9  }
    ...
          // Fast, better locality.
    10    for (set<int>::iterator it = s.begin(); it != s.end(); ++it) {
    11      sum += *it;
    12    }
          // Slow, elements are further apart.
    13    for (set<int>::iterator it = s1.begin(); it != s1.end(); ++it) {
    14      sum += *it;
    15    }
    
    foo.cc:5: advice: High scatter score NNN for set built here.  Consider changing
    the allocation sequence or switching to a structure conscious allocator.
    

Multithreaded Data Access

The diagnostics in this group are not meant to be implemented short term. They require compiler support to know when container elements are written to. Instrumentation can only tell us when elements are referenced.

Switch: _GLIBCXX_PROFILE_MULTITHREADED.

Data Dependence Violations at Container Level

  • Switch: _GLIBCXX_PROFILE_DDTEST.

  • Goal: Detect container elements that are referenced from multiple threads in the parallel region or across parallel regions.

  • Fundamentals: Sharing data between threads requires communication and perhaps locking, which may be expensive.

  • Sample runtime reduction:?%.

  • Recommendation: Change data distribution or parallel algorithm.

  • To instrument: Container access methods and iterators.

  • Analysis: Keep a shadow for each container. Record iterator dereferences and container member accesses. Issue advice for elements referenced by multiple threads. See paper: The LRPD test: speculative run-time parallelization of loops with privatization and reduction parallelization.

  • Cost model: Number of accesses to elements referenced from multiple threads

  • Example:

False Sharing

  • Switch: _GLIBCXX_PROFILE_FALSE_SHARING.

  • Goal: Detect elements in the same container which share a cache line, are written by at least one thread, and accessed by different threads.

  • Fundamentals: Under these assumptions, cache protocols require communication to invalidate lines, which may be expensive.

  • Sample runtime reduction:68%.

  • Recommendation: Reorganize container or use padding to avoid false sharing.

  • To instrument: Container access methods and iterators.

  • Analysis: First, get the cache line size. For each shared container, record all the associated iterator dereferences and member access methods with the thread id. Compare the address lists across threads to detect references in two different threads to the same cache line. Issue a warning only if the ratio to total references is significant. Do the same for iterator dereference values if they are pointers.

  • Cost model: Number of accesses to same cache line from different threads.

  • Example:

    1     vector<int> v(2, 0);
    2 #pragma omp parallel for shared(v, SIZE) schedule(static, 1)
    3     for (i = 0; i < SIZE; ++i) {
    4       v[i % 2] += i;
    5     }
    
    OMP_NUM_THREADS=2 ./a.out
    foo.cc:1: advice: Change container structure or padding to avoid false
    sharing in multithreaded access at foo.cc:4.  Detected N shared cache lines.
    

Statistics

Switch: _GLIBCXX_PROFILE_STATISTICS.

In some cases the cost model may not tell us anything because the costs appear to offset the benefits. Consider the choice between a vector and a list. When there are both inserts and iteration, an automatic advice may not be issued. However, the programmer may still be able to make use of this information in a different way.

This diagnostic will not issue any advice, but it will print statistics for each container construction site. The statistics will contain the cost of each operation actually performed on the container.

@