Tuesday, October 27, 2015

PHP array de-duplication trick that'll save you lots of CPU (* YMMV)

Always code as if the person who ends up maintaining your code is a violent psychopath who knows where you live.

Today, I discovered this curious gem in legacy code:

<?php
// ✂ ...snip... wall of code that creates array $ids
$ids = array_flip($ids);
$ids = array_flip($ids);
// ✂ ...snip... wall of code using $ids

It looks like a cut-and-paste error, and my gut reaction was to delete the second line. But a sneaking suspicion stalled me: the developer-that-no-longer-works-here who wrote that code had a talent for writing clever, uncommented code. Perhaps this was another instance of that pattern, and I'd better check myself.

PHP arrays map keys to values. array_flip spins the mapping around, so that each value becomes a key and each key becomes its value. If the code had intended to flip the array, I'd expect the later logic to treat the old values as keys. What I observed instead was iteration over keys to values, as if the flip had never happened. So what was this code doing?

It's a de-duplication trick, first seen in a 2002 comment about array_unique. Purportedly, the double flip is significantly faster than the equivalent array_unique call for large arrays. Before you rush off and change all your code to use this trick, keep these things in mind:

  • array_unique() and array_flip(array_flip()) produce different results where keys are concerned: for each unique value, array_unique keeps the first key it saw, while the double flip keeps the last.
  • array_unique([0, false, 0]) keeps both 0 and false, as you'd expect. The double flip does not: array_flip can only flip integer and string values, so false is dropped (and a warning is raised to boot).
  • For small arrays, the performance difference is invisible.
  • For very large arrays of numbers, the difference between array_unique($a, SORT_NUMERIC) and array_flip(array_flip()) is negligible. On a medium Amazon EC2 instance, 0.27s vs 0.4s for 10M integers.
  • For very large arrays of strings, the difference is significant. On a medium Amazon EC2 instance, 1.2s vs 10.9s for 10M strings of random length between 3 and 5 ASCII characters.
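
The first two caveats are easy to see on a small array. Here's a minimal sketch; the sample data is made up for illustration, and the @ is there only to keep the demo's output free of the warning:

```php
<?php
// Made-up sample data: 'a' and 'c' share the value 10.
$ids = ['a' => 10, 'b' => 20, 'c' => 10];

// array_unique keeps the first key seen for each value.
$viaUnique = array_unique($ids);          // ['a' => 10, 'b' => 20]

// The double flip keeps the last key seen for each value.
$viaFlip = array_flip(array_flip($ids));  // ['c' => 10, 'b' => 20]

// false cannot become an array key, so array_flip skips it with a warning.
$mixed = [0, false, 0];
$mixedUnique = array_unique($mixed);             // [0 => 0, 1 => false]
$mixedFlip   = array_flip(@array_flip($mixed));  // [2 => 0]

var_dump($viaUnique, $viaFlip, $mixedUnique, $mixedFlip);
```

If later code depends on which key survives, or the array can hold anything other than integers and strings, array_unique is the safer call.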

That's a real savings, and so this trick definitely has a place in the developer's toolbox. But please, please, for the love of all that is holy, comment the trick so that it's clear what's going on. My preferred way of seeing this trick deployed is:

$ids = array_flip(array_flip($ids)); // want unique values, don't care about keys
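
The timings above come from one machine and one shape of data, so it's worth measuring on your own workload before reaching for the trick. Here's a rough sketch of how that might look; the helper time_it, the array size, and the way the random strings are generated are all my assumptions for illustration, not the setup used for the numbers above:

```php
<?php
// Hypothetical helper: wall-clock time of a single call, in seconds.
function time_it(callable $fn): float
{
    $start = microtime(true);
    $fn();
    return microtime(true) - $start;
}

// Made-up workload: short random strings, 3 to 5 characters each.
$n = 200000;
$data = [];
for ($i = 0; $i < $n; $i++) {
    $data[] = substr(md5((string) mt_rand()), 0, mt_rand(3, 5));
}

$tUnique = time_it(function () use ($data) { array_unique($data); });
$tFlip   = time_it(function () use ($data) { array_flip(array_flip($data)); });

printf("array_unique: %.3fs\ndouble flip:  %.3fs\n", $tUnique, $tFlip);
```

A single run like this is noisy, and results can differ between PHP versions and data sets, so treat the output as a hint, not a verdict.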
