
Some elementary operations, even when conceptually as simple as others, are much faster for the processor. A clever programmer can choose the faster instructions for the job.

However, every optimizing compiler is already able to choose the fastest instructions for the target processor, so some of these techniques are useless with some compilers.

In addition, some techniques may even worsen performance on some processors.

In this section some techniques are presented that may improve performance on some compiler/processor combinations.

Structure field order

Arrange the member variables of classes and structures in such a way that the most used variables are in the first 128 bytes, and then sorted from the longest object to the shortest.

If, in the following structure, the msg member is used only for error messages, while the other members are used for computations:

struct { char msg[400]; double d; int i; };

you can speed up the computation by replacing the structure with the following one:

struct { double d; int i; char msg[400]; };

On some processors, the addressing of a member is more efficient if its distance from the beginning of the structure is less than 128 bytes.

In the first example, to address the d and i fields using a pointer to the beginning of the structure, an offset of at least 400 bytes is required.

Instead, in the second example, containing the same fields in a different order, the offsets to address d and i are only a few bytes, which allows the use of more compact instructions.

Now, let's assume you wrote the following structure:

struct { bool b; double d; short s; int i; };

Because of field alignment, it typically occupies 1 (bool) + 7 (padding) + 8 (double) + 2 (short) + 2 (padding) + 4 (int) = 24 bytes.

The following structure is obtained from the previous one by sorting the fields from the longest to the shortest:

struct { double d; int i; short s; bool b; };

It typically occupies 8 (double) + 4 (int) + 2 (short) + 1 (bool) + 1 (padding) = 16 bytes. The sorting minimized the paddings (or holes) caused by the alignment requirements, and so generates a more compact structure.

Floating point to integer conversion

Exploit non-standard routines to round floating point numbers to integer numbers.

The C++ language does not provide a primitive operation to round floating point numbers. The simplest technique to convert a floating point number to the nearest integer is the following statement:

n = int(floor(x + 0.5f));

Using such a technique, if x is exactly equidistant between two integers, n will be the upper integer (for example, 0.5 generates 1, 1.5 generates 2, -0.5 generates 0, and -1.5 generates -1).

Unfortunately, on some processors (in particular, the Pentium family), such an expression is compiled into very slow machine code. Some processors have specific instructions to round numbers.

In particular, the Pentium family has the fistp instruction, which, used as in the following code, gives much faster, albeit not exactly equivalent, code:

#if defined(__unix__) || defined(__GNUC__)
    // For 32-bit Linux, with GNU/AT&T syntax
    __asm("fldl %1 \n fistpl %0 " : "=m"(n) : "m"(x) : "memory");
#else
    // For 32-bit Windows, with Intel/MASM syntax
    __asm fld qword ptr x;
    __asm fistp dword ptr n;
#endif

The above code rounds x to the nearest integer, but if x is exactly equidistant between two integers, n will be the nearest even integer (for example, 0.5 generates 0, 1.5 generates 2, -0.5 generates 0, and -1.5 generates -2).

If this result is tolerable or even desired, and you are allowed to use embedded assembly, then use this code. Obviously, it is not portable to other processor families.

Integer bit twiddling

Twiddle the bits of integer numbers exploiting your knowledge of their representation.

A well-known collection of hacks of this kind is the "Bit Twiddling Hacks" page. Some of these tricks are actually already used by some compilers, others are useful to solve rare problems, and others are useful only on some platforms.

Array cell size

Ensure that the size (resulting from the sizeof operator) of non-large cells of arrays or of vectors is a power of two, and that the size of large cells of arrays or of vectors is not a power of two.

The direct access to an array cell is performed by multiplying the index by the cell size, which is a constant. If the second factor of this multiplication is a power of two, the operation is much faster, as it is implemented as a bit shift. Analogously, in multidimensional arrays, all the sizes, except at most the first one, should be powers of two.

This sizing is obtained by adding unused fields to structures and unused cells to arrays. For example, if every cell is a 3-tuple of objects, it is enough to add a fourth dummy object to every cell.

However, when accessing the cells of a multidimensional array whose last dimension is a large enough power of two, you can run into the data cache contention phenomenon (also known as data cache conflict), which may slow down the computation by a factor of 10 or more. This phenomenon happens only when the array cells exceed a certain size, which depends on the data cache but is roughly 1 to 8 KB. Therefore, when an algorithm has to process an array whose cells have, or could have, a size that is a power of two greater than or equal to 1024 bytes, you should first detect whether data cache contention occurs, and if so, avoid it.

For example, a matrix of 100 x 512 four-byte objects (such as floats) is an array of 100 arrays of 512 such objects. Every cell of the first-level array has a size of 512 x 4 = 2048 bytes, and is therefore at risk of data cache contention.

To detect the contention, it is enough to add an elementary cell (for example, a float) to every last-level array, while keeping the processing on the same cells as before, and to measure whether the processing time decreases substantially (by at least 20%). If it does, you have to ensure that the improvement is stable. To that end, you can employ one of the following techniques:

  • Add one or more unused cells at the end of every last-level array. For example, the 100 x 512 array could become 100 x 513, even if the computation still processes the array only up to the previous sizes.
  • Keep the array sizes, but partition it in rectangular blocks, and process all the cells in one block at a time.

Prefix vs. postfix operators

Prefer prefix operators over postfix operators.

When dealing with primitive types, the prefix and postfix arithmetic operations are likely to have identical performance. With objects, however, postfix operators can cause the object to create a copy of itself to preserve its initial state (to be returned as a result of the operation), as well as causing the side-effect of the operation. Consider the following example:

class IntegerIncreaser
{
    int m_Value;

public:
    /* Postfix operator. */
    IntegerIncreaser operator++ (int)
    {
        IntegerIncreaser tmp (*this); // copy to preserve the initial state
        ++m_Value;
        return tmp;
    }

    /* Prefix operator. */
    IntegerIncreaser& operator++ ()
    {
        ++m_Value;
        return *this; // no copy needed
    }
};

Because the postfix operators are required to return an unaltered version of the value being incremented (or decremented) — regardless of whether the result is actually being used — they will most likely make a copy. STL iterators (for example) are more efficient when altered with the prefix operators.

Explicit inlining

If you don't use the whole-program optimization compiler options that allow the compiler to inline any function, try to move the functions called in bottlenecks to header files, and declare them inline.

As explained in the guideline "Inlined functions" in section 3.1, every inlined function is faster, but many inlined functions slow down the whole program.

Try declaring inline a couple of functions at a time, as long as you get significant speed improvements (at least 10%).

Operations with powers of two

If you have to choose an integer constant by which you have to multiply or divide often, choose a power of two.

The multiplication, division, and modulo operations between integer numbers are much faster if the second operand is a constant power of two, as in such case they are implemented as bit shifts or bit maskings.

Integer division by a constant

When you divide an integer (that is known to be positive or zero) by a constant, convert the integer to unsigned.

If s is a signed integer, u is an unsigned integer, and C is a constant integer expression (positive or negative), the operation s / C is slower than u / C, and s % C is slower than u % C. This is most significant when C is a power of two, but in all cases, the sign must be taken into account during division.

The conversion from signed to unsigned, however, is free of charge, as it is only a reinterpretation of the same bits. Therefore, if s is a signed integer that you know to be positive or zero, you can speed up its division using the following (equivalent) expressions: (unsigned)s / C and (unsigned)s % C.

Processors with a reduced data bus

If the data bus of the target processor is smaller than the processor registers, if possible, use integer types not larger than the data bus for all the variables except for function parameters and for the most used local variables.

The types int and unsigned int are the most efficient, once they have been loaded into processor registers. However, with some processor families, they may not be the most efficient types to access in memory.

For example, there are processors having 16-bit registers but an 8-bit data bus, and other processors having 32-bit registers but a 16-bit data bus. For processors whose data bus is smaller than the internal registers, the types int and unsigned int usually match the size of the registers.

For such systems, loading and storing an int object in memory takes longer than for an integer type not larger than the data bus.

The function arguments and the most used local variables are usually allocated in registers, and therefore do not cause memory access.

Rearrange an array of structures as several arrays

Instead of processing a single array of aggregate objects, process in parallel two or more arrays having the same length.

For example, instead of the following code:

const int n = 10000;
struct { double a, b, c; } s[n];

for (int i = 0; i < n; ++i)
{
    s[i].a = s[i].b + s[i].c;
}

the following code may be faster:

const int n = 10000;
double a[n], b[n], c[n];

for (int i = 0; i < n; ++i)
{
    a[i] = b[i] + c[i];
}

Using this rearrangement, "a", "b", and "c" may be processed by array processing instructions that are significantly faster than scalar instructions. This optimization may have null or adverse results on some (simpler) architectures.

Alternatively, the arrays can be interleaved by hand; whether this helps depends on the access pattern, so measure both:

const int n = 10000;
double interleaved[n * 3];

for (int i = 0; i < n; ++i)
{
    const size_t idx = i * 3;
    interleaved[idx] = interleaved[idx + 1] + interleaved[idx + 2];
}

Remember to test everything! And don't optimise prematurely.

Bit fields

Declares a class data member with an explicit size, in bits. Adjacent bit field members may be packed to share and straddle the individual bytes.

A bit field declaration is a class data member declaration which uses the following declarator:

identifier (optional) attr (optional) : size (1)
identifier (optional) attr (optional) : size brace-or-equal-initializer (2) (since C++20)

The type of the bit field is introduced by the decl-specifier-seq of the declaration syntax.

attr(C++11) - optional sequence of any number of attributes
identifier - the name of the bit field that is being declared. The name is optional: nameless bitfields introduce the specified number of bits of padding
size - an integral constant expression with a value greater than or equal to zero. When greater than zero, this is the number of bits that this bit field will occupy. The value zero is only allowed for nameless bitfields and has special meaning: it specifies that the next bit field in the class definition will begin at an allocation unit's boundary.
brace-or-equal-initializer - default member initializer to be used with this bit field

Explanation

The number of bits in a bit field sets the limit to the range of values it can hold:


#include <iostream>

struct S
{
    // three-bit unsigned field,
    // allowed values are 0...7
    unsigned int b : 3;
};

int main()
{
    S s = {7};
    ++s.b;                    // unsigned overflow (guaranteed wrap-around)
    std::cout << s.b << '\n'; // output: 0
}

Multiple adjacent bit fields are usually packed together (although this behavior is implementation-defined):


#include <iostream>

struct S
{
    // will usually occupy 2 bytes:
    // 3 bits: value of b1
    // 2 bits: unused
    // 6 bits: value of b2
    // 2 bits: value of b3
    // 3 bits: unused
    unsigned char b1 : 3, : 2, b2 : 6, b3 : 2;
};

int main()
{
    std::cout << sizeof(S) << '\n'; // usually prints 2
}

The special unnamed bit field of size zero can be forced to break up padding. It specifies that the next bit field begins at the beginning of its allocation unit:


#include <iostream>

struct S
{
    // will usually occupy 2 bytes:
    // 3 bits: value of b1
    // 5 bits: unused
    // 6 bits: value of b2
    // 2 bits: value of b3
    unsigned char b1 : 3;
    unsigned char : 0; // start a new byte
    unsigned char b2 : 6;
    unsigned char b3 : 2;
};

int main()
{
    std::cout << sizeof(S) << '\n'; // usually prints 2
}

If the specified size of the bit field is greater than the size of its type, the value is limited by the type: a std::uint8_t b : 1000; would still hold values between 0 and 255. The extra bits become unused padding.

Because bit fields do not necessarily begin at the beginning of a byte, the address of a bit field cannot be taken. Pointers and non-const references to bit fields are not possible. When initializing a const reference from a bit field, a temporary is created (its type is the type of the bit field), copy-initialized with the value of the bit field, and the reference is bound to that temporary.

The type of a bit field can only be integral or enumeration type.

A bit field cannot be a static data member.

There are no bit field prvalues: lvalue-to-rvalue conversion always produces an object of the underlying type of the bit field.

There are no default member initializers for bit fields: int b : 1 = 0; and int b : 1 {0}; are ill-formed. (until C++20)

In case of ambiguity between the size of the bit field and the default member initializer, the longest sequence of tokens that forms a valid size is chosen:

int a;
const int b = 0;

struct S
{
    // simple cases
    int x1 : 8 = 42;              // OK; "= 42" is brace-or-equal-initializer
    int x2 : 8 {42};              // OK; "{ 42 }" is brace-or-equal-initializer
    // ambiguities
    int y1 : true ? 8 : a = 42;   // OK; brace-or-equal-initializer is absent
    int y2 : true ? 8 : b = 42;   // error: cannot assign to const int
    int y3 : (true ? 8 : b) = 42; // OK; "= 42" is brace-or-equal-initializer
    int z : 1 || new int{0};      // OK; brace-or-equal-initializer is absent
};
(since C++20)

Notes

The following properties of bit fields are implementation-defined

  • The value that results from assigning or initializing a signed bit field with a value out of range, or from incrementing a signed bit field past its range.
  • Everything about the actual allocation details of bit fields within the class object
  • For example, on some platforms, bit fields don't straddle bytes, on others they do
  • Also, on some platforms, bit fields are packed left-to-right, on others right-to-left
  • Whether char, short, int, long, and long long bit fields that aren't explicitly signed or unsigned are signed or unsigned.
  • For example, int b:3; may have the range of values 0..7 or -4..3.
(until C++14)

In the C programming language, the width of a bit field cannot exceed the width of the underlying type.

References

  • C++11 standard (ISO/IEC 14882:2011):
  • 9.6 Bit-fields [class.bit]
  • C++98 standard (ISO/IEC 14882:1998):
  • 9.6 Bit-fields [class.bit]
