Skip to main content

Print() and std::cout are extremely expensive

· 3 mins
Ryan Gibson
Author
Ryan Gibson
Quantitative Analyst | Computer Scientist
Table of Contents

Print statements are far more expensive than most junior developers realize. They’re usually very surprised when you profile their code and show them how slow they are.

Look at this program snippet. It runs in about 3 seconds, but what amount of time do you think each part takes?

double x = 0;
for (size_t i = 0; i < ONE_MILLION; i++) {
    // (1): several thousand arithmetic, bitwise, and logical operations
    for (size_t j = 0; j < 1500; j++) {
        x += (i * j) + (i + j) + (i << 3) + (j >> 2) +
             (i > j) + (i < j) + (i & j) + (i | j);
    }
    // (2): a print statement
    std::cout << "A relatively short sentence." << std::endl;
}

On your typical machine printing to a terminal, (2) will take about ~1-2x the time of (1).1 And Windows is ~15x slower than that!

In other words, printing a single line takes as long as tens of thousands to millions of arithmetic operations!

This happens in pretty much every language
#

You’ll see the same thing in Python. Try running

$ python3 -m timeit "print('A relatively short sentence.')"
...
100000 loops, best of 5: 2.03 usec per loop

Two microseconds! That’s roughly 10,000 cycles on a modern processor.

You can construct a list of about 500 elements and sum it in a similar amount of time.

$ python3 -m timeit "sum([1] * 500)"
100000 loops, best of 5: 2.41 usec per loop

What is print() doing?
#

Think about what print() actually has to do here. In short,

  1. Copy the string into to a user-space buffer2 and then flush it (because of the line terminator)
  2. Perform some standard library processing and run the associated write() syscall
  3. Switch over to kernel mode and copy the data to a kernel buffer
  4. Pass the write request along to the terminal through the virtual file system
  5. Wait until the terminal driver has transferred the data to its own buffers (though not necessarily render it)
  6. Return back to user space and finalize some standard library bookkeeping before continuing

That’s an enormous amount of work!

Keep this in mind the next time someone dumps a huge amount of data to stdout or writes a simple “print statement progress bar”3 and then tries to claim that the task truly needs to run that slowly.

The bottleneck very well may be the print statements themselves.

See also and references
#

If you’re interested, feel free to glance through


  1. I’m cheating a bit here by using more of the CPU’s execution units than you’d see in a typical loop and glossing over a few small compiler optimizations. Replacing the loop body with x += i * j; does not change the runtime. ↩︎

  2. This copy could probably be optimized out in some cases but not universally. ↩︎

  3. E.g., something like this in a critical loop.

    std::cout << "\rRunning trial " << trial << " and working on operation " << i << "...";
    

    or

    print(f"\rRunning trial {trial} and working on operation {i}...", end='')
    

    These ones will (probably) avoid the excessive flushing, but the impact can still be quite significant. ↩︎

Related