27/3/2019, 22:11

## Rounding errors and compiler optimizations

In a previous post, we saw how careful one must be to reason about floating-point numbers and how $$(a / b) \cdot b$$ does not in general equal $$a$$ in the context of computer hardware.

Indeed, the first thing everybody learns about floating-point numbers is that any attempt to compare them is likely fruitless as unexpected results show up all over the place, a typical example being that $$0.1 + 0.2 \not= 0.3$$. In Python:

```
$ python
>>> 0.1 + 0.2 == 0.3
False
```

In this post we will see two non-obvious sources of rounding errors that conspired to slow down time itself by 3 seconds in a particularly nasty bug in a .NET application, and we will see how one would go about debugging such issues.

### Context

At Ørsted we're building the world's biggest offshore wind farms. These are massive construction projects requiring specialized vessels whose chartering necessitates tight planning. The construction work moreover depends heavily on the often adverse weather conditions, and as the operation itself is rather costly, our budgeting relies on precise mathematical models for how construction takes place.

At the same time, we have several machines running the built models to ensure that their outputs always match expectations. One day, one of these machines reported that a construction project that ought to take about a year all of a sudden took about a year and three seconds. A three second delay of a wind farm construction probably won't ruin the budget, but any such discrepancy warrants further investigation.

As you have probably guessed from the introduction of this post, rounding errors were to blame, and the only question is how such rounding errors could show up on a single build machine while all other machines reported no issues.

### Background

Before jumping into the concrete example, we recall a bit of terminology on floating-point numbers and on .NET in particular.

First of all, floating-point numbers come with a given precision that determines how many numbers can be represented: While arbitrary accuracy representations of real numbers can be achieved with enough computer memory, all floating-point representations come with a predetermined set of numbers that can be represented. In practice, one usually encounters single-precision and double-precision formats in which a given number takes up 32 bits or 64 bits of memory respectively.
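As a quick illustration of the two formats having different sets of representable numbers, we can use Python (whose built-in float is double-precision) and round a value through the 32-bit format with the struct module; the helper name `to_single` is just for this sketch:

```python
import struct

def to_single(x):
    """Round a double-precision number through the 32-bit single-precision format."""
    return struct.unpack('f', struct.pack('f', x))[0]

# 0.1 is not exactly representable in either format, and the two formats
# round it to different nearby values, so the round-trip changes the number:
print(0.1 == to_single(0.1))  # False
```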

When writing .NET code, and C# in particular, the developer can choose between Debug and Release configurations; the former allows them to break into running code at any given line, whereas the latter allows the compiler to perform certain optimizations that improve runtime performance at the cost of debugging sometimes becoming impossible.

Finally, the C# compiler and associated runtime can target different platforms: The x86 platform offers compatibility with older systems whereas compiling with x64 as a target platform allows the user to make use of a larger virtual memory space, while at the same time changing several runtime characteristics of the application: The calling convention that governs how function calls are turned into zeros and ones is different, and different sets of hardware registers and instructions are used for each platform. In the context of floating-point arithmetic this makes a particularly large difference as the x86 platform relies on x87 instructions for floating-point math, while the x64 platform relies on SSE instructions.

### The example

With that out of the way, we can look at the example that ended up causing the 3-second delay of our construction project; it boils down to a few lines of C#:

```csharp
Single f1 = 0.00000000002f;
Single f2 = 1 / f1;
Double d = f2;
Console.WriteLine(d);
```

Here we simply take the single-precision floating-point number closest to 0.00000000002, find its reciprocal, convert that to a double-precision number, and print the result.
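The arithmetic itself can be mimicked in Python, whose built-in float is double-precision, by explicitly rounding through the 32-bit format with the struct module. This sketches the two possible results of the computation, not what the C# compiler actually emits; the helper name `to_single` is ours:

```python
import struct

def to_single(x):
    """Round a number to the nearest single-precision (32-bit) value."""
    return struct.unpack('f', struct.pack('f', x))[0]

f1 = to_single(0.00000000002)  # the single-precision number closest to 2e-11
exact = 1 / f1                 # the reciprocal, computed at higher (double) precision
print(exact)                   # ≈ 50000000199.79014
print(to_single(exact))        # 49999998976.0 once rounded back to single precision
```

The two printed values are exactly the two outputs in the table below: which one a given build produces depends on whether the intermediate result is ever squeezed back into 32 bits.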

It turns out that what's printed depends both on whether or not one makes use of the compiler's release mode optimizations, and on the choice of target platform. Concretely, the output is as follows:

|     | Debug       | Release          |
|-----|-------------|------------------|
| x86 | 49999998976 | 50000000199.7901 |
| x64 | 49999998976 | 49999998976      |

### The explanation

Using Visual Studio, it is possible to investigate the machine code generated by the .NET Framework runtime when executing the above example (Debug → Windows → Disassembly in Visual Studio).

Doing so allows us to immediately get some hints about what is going on, and why the behavior is what it is.

Let us first focus on what happens in the x86 case and consider the generated machine code for both the debug and release configurations. In the below table we show the code generated in debug mode on the left, and in release mode on the right, matching up instructions that they have in common. Each line corresponds to a single instruction, which consists of a mnemonic naming the operation and a number of operands operated upon by the instruction.

```
Debug                                     | Release
mov    dword ptr [ebp-40h],2DAFEBFFh      | mov    dword ptr [ebp-4],2DAFEBFFh
fld    dword ptr [ebp-40h]                | fld    dword ptr [ebp-4]
fld1                                      | fld1
fdivrp st(1),st                           | fdivrp st(1),st
fstp   dword ptr [ebp-44h]                |
fld    dword ptr [ebp-44h]                |
fstp   qword ptr [ebp-4Ch]                |
fld    qword ptr [ebp-4Ch]                |
sub    esp,8                              | sub    esp,8
fstp   qword ptr [esp]                    | fstp   qword ptr [esp]
call   6B9783BC                           | call   6B9783BC
```

In general, all of these instructions and what they do to the operands can be looked up in the instruction set manuals; see for instance the reference entry for fdivrp for a detailed description. Here we just focus on a few examples.

As a simple example, the instruction sub esp,8 takes whatever integer number is in the CPU register labelled esp, subtracts 8, and stores the result back in esp.

A slightly more complicated example is the instruction fld dword ptr [ebp-40h], which loads the value stored at the memory address ebp-40h into an x87-specific register. Here, ebp is itself a 32-bit register containing a memory address, and ebp-40h is then simply that address offset by 40 (hexadecimal) bytes. In other words, fld dword ptr [ebp-40h] moves a floating-point number from memory to the CPU.

The opposite operation, moving a floating-point number from the CPU back to memory, is provided by fstp.

The actual division of floating-point numbers is handled by fdivrp. What ends up making the difference for us is that even though we provide single-precision, i.e. 32-bit, numbers, the division itself is carried out in an 80-bit register, so the result can have much higher precision than what comes in, and this additional precision may or may not be lost when transferring back the result to memory.

Comparing the unoptimized debug code with the optimized release code, we see a bunch of seemingly redundant "store the value from the floating point register in memory, then immediately load it back from memory into the floating point register" operations that have been optimized away. However, the two instructions

```
fstp        dword ptr [ebp-44h]
fld         dword ptr [ebp-44h]
```

are enough to change the value in the x87 register from 50000000199.790138 to 49999998976. That is, moving our 80-bit floating-point to a 32-bit (dword) memory address, then back into the 80-bit register will cost us some precision. This seemingly innocent back-and-forth dance has been optimized away in release mode where we move the result of the division to memory as a double-precision (qword) number directly, and this then ends up causing a discrepancy in outputs at the end of the day.
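The effect of that store/load pair can be reproduced directly: rounding the extended-precision quotient through a 32-bit value is precisely what discards the fractional part. A small sketch using Python's struct module:

```python
import struct

# Round the higher-precision quotient through a 32-bit memory location,
# just as the debug-mode fstp/fld pair does:
quotient = 50000000199.790138
rounded = struct.unpack('f', struct.pack('f', quotient))[0]
print(rounded)  # 49999998976.0
```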

The way to debug such matters in Visual Studio is to step through the disassembly while investigating the values of the registers directly (Debug → Windows → Registers, then right click and check "Floating point").

The story for the x64 platform is wildly different. We still see the same optimization removing a few instructions, but this time around, all instructions have turned into SSE instructions which rely on a different set of hardware registers:

```
Debug                                       | Release
vmovss    xmm0,dword ptr [7FF7D0E104F8h]    | vmovss    xmm0,dword ptr [7FF7D0E304C8h]
vmovss    dword ptr [rbp+34h],xmm0          | vmovss    dword ptr [rbp-4],xmm0
vmovss    xmm0,dword ptr [7FF7D0E104FCh]    | vmovss    xmm0,dword ptr [7FF7D0E304CCh]
vdivss    xmm0,xmm0,dword ptr [rbp+34h]     | vdivss    xmm0,xmm0,dword ptr [rbp-4]
vmovss    dword ptr [rbp+30h],xmm0          |
vcvtss2sd xmm0,xmm0,dword ptr [rbp+30h]     | vcvtss2sd xmm0,xmm0,xmm0
vmovsd    qword ptr [rbp+28h],xmm0          |
vmovsd    xmm0,qword ptr [rbp+28h]          |
call      00007FF81C9343F0                  | call      00007FF81C9343F0
```

Where previously, the x87 floating-point operations could take place at a higher precision than the inputs given, in the case of SSE the compiler is given full control over what precision to use at any given time. For instance, the instructions vmovss and vmovsd both move floating-point numbers around, but one is used for single-precision numbers, and the other for double-precision numbers. Similarly, the floating-point division is done in vdivss which stays within the space of single-precision numbers. Indeed, one finds (after enabling the SSE registers in the Visual Studio Registers overview) that after the vdivss instruction, the SSE register XMM0 contains hexadecimal 0000000000000000-00000000513A43B7 which is exactly the binary representation of the number 49999998976 from before, so it is no surprise that this ends up being the output in both debug and release mode.
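One can double-check that reading of the register contents: packing 49999998976 as a 32-bit float reproduces exactly the bit pattern seen in the low bytes of XMM0. A one-line sketch in Python:

```python
import struct

# The 32-bit (big-endian) bit pattern of the single-precision number 49999998976:
bits = struct.pack('>f', 49999998976.0).hex()
print(bits)  # 513a43b7
```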

### Closing remarks

First of all, this example makes painfully clear what we already knew: Checking floating-point numbers for equality can only go wrong. No matter how careful one might be, compilers will find a way to make it all fall apart.

At the same time, the example shows that even though we were dealing with a high-level language such as C#, in order to properly debug issues caused by floating-point rounding errors we had to dig all the way down to assembly. No matter how many software abstractions we put on top of our hardware, implementation details will leak through, and having a basic understanding of assembly can go a long way when debugging otherwise nasty bugs.

Finally, we should ask ourselves if the discrepancy above is in fact to be expected, or if we somehow encountered a bug in the compiler. The C# specification offers the following explanation:

Floating-point operations may be performed with higher precision than the result type of the operation. For example, some hardware architectures support an "extended" or "long double" floating-point type with greater range and precision than the double type, and implicitly perform all floating-point operations using this higher precision type. Only at excessive cost in performance can such hardware architectures be made to perform floating-point operations with less precision, and rather than require an implementation to forfeit both performance and precision, C# allows a higher precision type to be used for all floating-point operations.

This explains the behavior we see in the x86 debug case, where prior to the final conversion, our single-precision division provides a higher precision result. We do, however, explicitly coerce the result into a single-precision number in our example, and it is not clear to me whether the lack of such a conversion in the release mode case falls under the general caveat given by the specification above. However, the runtime specification (Section I.12.1.3) goes on to state that:

Storage locations for floating-point numbers (statics, array elements, and fields of classes) are of fixed size. The supported storage sizes are float32 and float64. Everywhere else (on the evaluation stack, as arguments, as return types, and as local variables) floating-point numbers are represented using an internal floating-point type. In each such instance, the nominal type of the variable or expression is either float32 or float64, but its value can be represented internally with additional range and/or precision. The size of the internal floating-point representation is implementation-dependent, can vary, and shall have precision at least as great as that of the variable or expression being represented.

In particular, we are dealing with local variables, and according to the above, whenever we say Single the runtime can take this to mean "Single or more precise". Indeed, if we move each of our local variables to fields of a class, the x86 build will give the same results in both debug and release mode.
