Knocking my socks off!


As I mentioned in my previous entry, I migrated a library from Java to C++ to make it easier to use from Python. I tested its performance on both Linux and macOS, and it was on par with the Java version. Sample file S266.txt would take around 45 seconds on my computer with the Java version, while the C++ version needed thirty-something seconds on an M1 MacBook Pro running macOS, and a bit less than that on an i9-13900 server running Linux. From that, I could not claim there was a huge difference in performance compared to the original Java runtime.

But I wanted to make a fair comparison, so I took on the trouble of building the library for Windows too. It was not exactly a challenge, but it took a while. Let me summarize all the steps I needed to cover:

  1. Install Microsoft's Visual Studio (Community Edition).
  2. Install Microsoft's vcpkg (C++ package manager).
  3. Install CMake.
  4. Install the dependencies (TBB, GTest, Boost, pybind11).
  5. Clone the project repository.
  6. Build the project.
It took me about an hour to complete these steps: step 1 alone is a 2.9 GB download (and a 10 GB install), the Boost library took 20 minutes to install, and TBB another 7. You'd better not be in a hurry :-)
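For reference, steps 2 to 6 above boil down to commands roughly like these. This is only a sketch: the vcpkg path is my assumption, and `<repo-url>` is a placeholder for the project repository linked in this post.

```shell
# Steps 2-4: bootstrap vcpkg and install the dependencies
# (assumes vcpkg is cloned to C:\vcpkg; Boost alone took ~20 minutes here)
git clone https://github.com/microsoft/vcpkg C:\vcpkg
C:\vcpkg\bootstrap-vcpkg.bat
C:\vcpkg\vcpkg install tbb gtest boost pybind11

# Steps 5-6: clone and build, pointing CMake at the vcpkg toolchain file
git clone <repo-url> project
cd project
cmake -B build -DCMAKE_TOOLCHAIN_FILE=C:\vcpkg\scripts\buildsystems\vcpkg.cmake
cmake --build build --config Release
```

Passing the vcpkg toolchain file is what lets CMake find the vcpkg-installed libraries; without it, you end up hunting for them by hand, as described below.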

I thought the pain was over when I started building the project, but CMake did not find all the libraries automatically. I had to locate them in the filesystem and create environment variables to point the CMake command at them. The CMakeLists.txt on GitHub specified version 1.71.0 of the Boost library, but on Windows version 1.88.0 had been installed instead. That caused a build error, which I summarily fixed by editing line 17 of the file and changing it to the version installed on Windows (an innocent change, I thought). Eventually, everything was sorted out and the project built.
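The edit in question amounts to this kind of change to a `find_package` call. A hypothetical reconstruction, since I haven't reproduced the project's actual CMakeLists.txt here:

```cmake
# Hypothetical reconstruction of line 17 (the real file may differ):
# find_package(Boost 1.71.0 EXACT REQUIRED)  # fails when 1.88.0 is installed

# The one-line fix: ask for the version vcpkg actually installed
find_package(Boost 1.88.0 REQUIRED)
```

Worth noting: without the `EXACT` keyword, `find_package` treats the version argument as a minimum, so a plain `find_package(Boost 1.71.0 REQUIRED)` should normally accept 1.88.0. An exact-version pin, or an older `FindBoost` module that doesn't recognize newer releases, is the usual reason this kind of edit becomes necessary.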

Finally, the moment of truth arrived, and I tried running a few of the sample files with the freshly built Windows binary. Instead of waiting half a minute or more, the output was spat out just a second after I pressed Enter. I thought something was wrong, perhaps I had selected the wrong sample file, but no: after checking carefully, the same sample now required only 1.3 seconds!!!

I know, I know: what if the output is fast but wrong? But no, I checked the result, and it is correct. The program seems to be doing the same job as before, just almost 30x faster. How is that possible? Other computers running other operating systems do not show such a speed-up.

What I think is the root cause is the newer version of the Boost library used for the Windows build (something I did not plan at all). My guess is that some of the library functions the program uses have been optimized in a way that boosts (is there a pun here?) the performance of the library ☺.

Update

I checked the build on the Mac, and I discovered that Boost library version 1.88.0 was already installed there (and CMake did not complain). So my theory that the amazing speed-up experienced on Windows was due to the version of the Boost library used there does not hold.

Let me know in the comments if you have a clue, and remember, the GitHub repository is linked above if you want to experiment yourself.

Update2

I updated the Linux build to Boost version 1.88.0, and the program's performance did not change. Then, chatting with the Claude 2 bot, I read the following:

Windows (vcpkg/MSVC): Automatically uses aggressive optimization flags
  • Equivalent to -O3 -march=native -mtune=native 
  • MSVC's /O2 or /Ox with architecture-specific optimizations

Linux/macOS default: Uses basic optimization (probably -O1 or -O2 without CPU-specific flags)
  • Missing the crucial -march=native that tells GCC to use your i9-13900's specific instruction sets
  • Missing -mtune=native for optimal instruction scheduling

What -march=native -mtune=native does:
  • Enables AVX2/AVX-512 instructions on your i9-13900
  • Uses optimal instruction scheduling for your specific CPU
  • Enables all the SIMD optimizations for geometric calculations

The lesson:
This is a perfect example of why build system defaults matter. vcpkg/MSVC was being "smart" and using optimal flags automatically, while the Linux build was being "safe" with generic optimization.

For 2D packing algorithms with lots of geometric calculations, those CPU-specific optimizations make an enormous difference!
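Following that hint, the gap on Linux/macOS can be closed by asking CMake for an optimized, host-tuned build explicitly. A minimal sketch, assuming the project builds with GCC or Clang (MSVC uses `/O2`-style flags instead and does not understand `-march`):

```shell
# Configure an optimized build with CPU-specific instructions enabled
cmake -B build \
      -DCMAKE_BUILD_TYPE=Release \
      -DCMAKE_CXX_FLAGS="-O3 -march=native -mtune=native"
cmake --build build
```

One detail worth knowing: if `CMAKE_BUILD_TYPE` is left unset, single-config CMake generators (the default on Linux and macOS) build with no optimization flags at all, and that alone can account for an order-of-magnitude slowdown in numeric code.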

Now I can go to bed knowing why this happened.

