When preg_replace just isn't enough

Introduction

Sometimes, you want to replace a word in a string with another word. You may want to replace all swear words with more neutral statements, replace BBCode tags with the corresponding HTML tags or use placeholders for data in a templating system.

In PHP, there are several functions to replace something, among them str_replace, preg_replace and preg_replace_callback.

Often, you want to replace a pattern. Consider our templating system in which we have placeholders for data:

Hello {planet}. My name is {name}.
In this case, we want to replace {anything} with some other string, which depends on the string between the curly braces. We may have an array which contains the data to replace the placeholders with.

We can make a regular expression to match our placeholders: /{[^}]*}/ The regex is written between slashes. It matches an opening curly brace, then anything that is not a closing curly brace, and then a closing curly brace. The [^}] means "anything but a }", which may repeat zero or more times because of the asterisk.
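To check what this pattern matches, a minimal sketch with preg_match_all could look like the following. The variable names $count and $found are only chosen for this illustration.

```php
<?php
// Illustrative check of the placeholder pattern; variable names are assumptions.
$template = 'Hello {planet}. My name is {name}.';

// preg_match_all returns the number of full matches and stores them in $found[0].
$count = preg_match_all('/{[^}]*}/', $template, $found);

echo $count, "\n";                    // number of placeholders found
echo implode(', ', $found[0]), "\n";  // the placeholders themselves
```

For this template the pattern finds the two placeholders {planet} and {name}, curly braces included.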

Now, if we use preg_replace, we can replace all our placeholders. However, preg_replace replaces every match with the same replacement string. That string may contain backreferences to the match, but it cannot look up a value in our data array. We want the replacement to vary with the text between the curly braces, and preg_replace alone cannot do that.
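A small sketch of this limitation, using an arbitrary replacement word chosen for the illustration:

```php
<?php
$template = 'Hello {planet}. My name is {name}.';

// Every placeholder receives the same fixed replacement string.
$result = preg_replace('/{[^}]*}/', 'something', $template);

echo $result, "\n"; // Hello something. My name is something.
```

Both placeholders end up with the same text, whatever was written between the curly braces.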

If we want to replace a regex depending on the content of the match, we have three options:

  • Use preg_replace_callback, where the callback determines the replacement string.
  • First search for all patterns using preg_match_all, then replace all matches with str_replace.
  • Use only str_replace, which does not work in all cases.

Preg_replace_callback

With preg_replace_callback, a user-supplied function is called for each match. This function, called the callback function, is passed the $matches array which contains information about the match. The callback function also determines the replacement string.

In the introduction, we invented a regex to match our placeholder tags: /{[^}]*}/. We use that regex with preg_replace_callback, with one modification. We want the text between the curly braces to become available to our callback function. Therefore, we mark that piece as a group by using parentheses: /{([^}]*)}/.

To see how preg_replace_callback behaves, we use a function which simply prints its argument:

function print_callback($matches)
{
        print_r($matches);
}
$template = 'Hello {planet}. My name is {name}.';
echo preg_replace_callback('/{([^}]*)}/', 'print_callback', $template);
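The regex matches twice, so print_callback is called twice. Because print_callback returns nothing, both placeholders are replaced with an empty string in the echoed result. The interesting part is the two arrays it prints: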
Array
(
    [0] => {planet}
    [1] => planet
)
Array
(
    [0] => {name}
    [1] => name
)
Just as with preg_match, the array contains the whole match as its first element, and any groups as subsequent elements. It has the contents of our placeholder as its second element. We can use that to replace the placeholder with something useful. To do so, the callback has to return the replacement string:
$data = array('planet' => 'World', 'name' => 'Wiebe');
$template = 'Hello {planet}. My name is {name}.';

function array_callback($matches)
{
        global $data;
        $key = $matches[1];
        return $data[$key];
}

echo preg_replace_callback('/{([^}]*)}/', 'array_callback', $template);
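As a variation, the same idea can be sketched with an anonymous function (available since PHP 5.3), which receives $data through "use" instead of "global". The {food} placeholder and the choice to leave unknown placeholders untouched are assumptions made for this example, not part of the method above.

```php
<?php
$data = array('planet' => 'World', 'name' => 'Wiebe');
// {food} is a hypothetical placeholder without a value in $data.
$template = 'Hello {planet}. My name is {name}. I like {food}.';

$result = preg_replace_callback(
    '/{([^}]*)}/',
    function ($matches) use ($data) {
        $key = $matches[1];
        // Assumption: a placeholder without data is left as it is.
        return array_key_exists($key, $data) ? $data[$key] : $matches[0];
    },
    $template
);

echo $result, "\n"; // Hello World. My name is Wiebe. I like {food}.
```

Returning $matches[0] puts the original placeholder back, which avoids a notice about an undefined array key.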

Preg_match_all & str_replace

Instead of relying on preg_replace_callback, we can also do the matching and replacing ourselves. In this method, we first search for all placeholders, determine what to replace them with and then replace all of them. This method is slightly less efficient and elegant than the preg_replace_callback method.

To search for all placeholder patterns, we use the following code:

preg_match_all('/{([^}]*)}/', $template, $matches);
This will put an array in $matches which contains all the matches. The contents of this array are a little strange: The first element of the array contains an array with all full matches. The second element contains the first group of each match, and so on. So the match $matches[0][1] contains the group $matches[1][1]. An example:
Array
(
    [0] => Array
        (
            [0] => {planet}
            [1] => {name}
        )

    [1] => Array
        (
            [0] => planet
            [1] => name
        )

)

To determine the replacement string, we step through $matches[1] and put the original string and the replacement string in an array. We then pass that array to str_replace, in a form that it can understand:

$data = array('planet' => 'World', 'name' => 'Wiebe');
$template = 'Hello {planet}. My name is {name}.';

preg_match_all('/{([^}]*)}/', $template, $matches);
// Start with an empty array, so a template without placeholders also works.
$replacements = array();
foreach ($matches[1] as $key)
{
        $replacements['{'.$key.'}'] = $data[$key];
}
echo str_replace(array_keys($replacements), array_values($replacements), $template);
The array $replacements contains placeholder tags as keys and replacement strings as data. We pass that to str_replace, thus replacing all placeholders with the corresponding data.
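One detail that matters for the performance discussion: because the placeholder tag is used as array key, a placeholder that occurs several times in the template is stored only once. A short sketch, with a hypothetical template in which {name} appears twice:

```php
<?php
$data = array('planet' => 'World', 'name' => 'Wiebe');
// Hypothetical template in which {name} occurs twice.
$template = 'Hello {planet}. My name is {name}. Yes, {name}.';

preg_match_all('/{([^}]*)}/', $template, $matches);

$replacements = array();
foreach ($matches[1] as $key) {
    // A repeated placeholder overwrites its own entry, so it is stored once.
    $replacements['{' . $key . '}'] = $data[$key];
}

$result = str_replace(array_keys($replacements), array_values($replacements), $template);

echo count($matches[1]), "\n";   // 3 matches found
echo count($replacements), "\n"; // 2 different placeholders
echo $result, "\n";              // Hello World. My name is Wiebe. Yes, Wiebe.
```

So the work done by str_replace depends on the number of different placeholders, not on the number of matches.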

Str_replace alone

If we know beforehand what the curly braces may contain, we can do the replacing the other way around: instead of searching for the pattern and replacing the value, we simply replace all values we can think of. With this method, we loop through our array of possible values, generate the placeholder for each of them and replace that:

$data = array('planet' => 'World', 'name' => 'Wiebe');
$template = 'Hello {planet}. My name is {name}.';

// Start with an empty array, so an empty $data array also works.
$replacements = array();
foreach ($data as $key => $value)
{
        $pattern = '{' . $key . '}';
        $replacements[$pattern] = $value;
}

echo str_replace(array_keys($replacements), array_values($replacements), $template);

With this method, we do not use regular expressions, but only simple replacements. It may be that we try to replace a placeholder which does not occur in our template, but that is no big deal.
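To illustrate that an unused key does no harm, here is a sketch that wraps the loop in a function. The function name fill_template and the extra key 'colour' are made up for this example.

```php
<?php
// Hypothetical helper name; it wraps the loop shown above.
function fill_template($template, array $data)
{
    $replacements = array();
    foreach ($data as $key => $value) {
        $replacements['{' . $key . '}'] = $value;
    }
    return str_replace(array_keys($replacements), array_values($replacements), $template);
}

// 'colour' has no placeholder in the template and is simply ignored.
$data = array('planet' => 'World', 'name' => 'Wiebe', 'colour' => 'blue');
$result = fill_template('Hello {planet}. My name is {name}.', $data);

echo $result, "\n"; // Hello World. My name is Wiebe.
```

The search for {colour} finds nothing, so the template is returned with only the two existing placeholders filled in.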

This method is not always available. Sometimes you have very many possible placeholder values or it may be expensive to get the values for placeholders. However, if it is possible, this method is by far the simplest.

Performance comparison

I have laid out three algorithms to replace placeholder tags in a string. These algorithms have different performance, and as everybody in the PHP scene seems to think that performance matters, let me look into it.

There is no "best" method in every case. Instead, which method to use depends on the template used and the data which is available. The following graph nicely shows this:

In this graph, the number of available data keys is plotted against the time that each method takes. The number of available data keys is the size of the $data array in our example. Of the keys in the data array, approximately 50% were actually used in the template.

In the str_replace method, str_replace() does one search-and-replace pass for each data key. The time this method takes is directly dependent on the size of the data array. The preg_match method does one such pass for each data key that is actually used in the template. Because approximately 50% of our keys are used in the template, it does about half as many passes as the str_replace method. These algorithms are both O(n), where n is the number of items in the data array. This means that they become linearly slower when the array becomes bigger.

The preg_replace_callback method, by contrast, is not influenced by the size of the data array. It simply searches for our placeholders and only then looks up the value in the data array. Looking up a value in an array takes some time which is not dependent on the size of that array, thus giving an O(1) algorithm: the speed is independent of the size of the data array.
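A rough way to try such a comparison yourself is sketched below. This is not the benchmark used for the graphs: the array size, the number of runs, the generated template and the function names are all assumptions for illustration, and the timings will differ per machine and PHP version.

```php
<?php
// Rough timing harness; sizes and helper names are assumptions for illustration.
$keys = 1000; // number of entries in $data
$data = array();
$template = '';
for ($i = 0; $i < $keys; $i++) {
    $data['key' . $i] = 'value' . $i;
    if ($i % 2 === 0) { // use about half of the keys in the template
        $template .= 'Text {key' . $i . '} ';
    }
}

function with_callback($template, array $data)
{
    return preg_replace_callback('/{([^}]*)}/', function ($m) use ($data) {
        return $data[$m[1]];
    }, $template);
}

function with_match_all($template, array $data)
{
    preg_match_all('/{([^}]*)}/', $template, $matches);
    $replacements = array();
    foreach ($matches[1] as $key) {
        $replacements['{' . $key . '}'] = $data[$key];
    }
    return str_replace(array_keys($replacements), array_values($replacements), $template);
}

function with_str_replace($template, array $data)
{
    $replacements = array();
    foreach ($data as $key => $value) {
        $replacements['{' . $key . '}'] = $value;
    }
    return str_replace(array_keys($replacements), array_values($replacements), $template);
}

$results = array();
foreach (array('with_callback', 'with_match_all', 'with_str_replace') as $method) {
    $start = microtime(true);
    for ($run = 0; $run < 20; $run++) {
        $results[$method] = $method($template, $data);
    }
    printf("%-18s %.4f s\n", $method, microtime(true) - $start);
}
```

Changing $keys while keeping the template the same should show how each method reacts to the size of the data array.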

In the next graph, we show what happens if you keep the number of data items equal but vary the number of different placeholders that show up in the template. At the left side of the graph, only a few different placeholders are used in the template. At the right side of the graph, all available placeholders from the data array are used.

As you can see, preg_replace_callback is again not impressed by our variations. It does not care whether the placeholders it has to replace are different, because it does a lookup for each of them.

The preg_match method gets slower when it has to replace more different placeholders. Because it replaces only the placeholders which are actually used in the template, str_replace has more work to do when more different placeholders appear.

Str_replace also gets slower when it has to replace more different items. This is an interesting detail of the implementation. After all, if str_replace() searched through the whole array and replaced all occurrences each time, each call to str_replace() would take approximately the same time, thus the speed would not be influenced by the number of different keys. However, str_replace() has a little optimization: before doing anything, it first searches whether the word you're looking for actually occurs in the text. If it doesn't, it returns. If it does, it allocates a bigger memory slot to fit the text with the replacements and does the actual replacements. This causes str_replace() to be slower when it actually has something to replace, giving the curved line in the graph.

Finally, we vary the size of the template and see what happens to our functions:

As you can see, all functions get linearly slower when the template size (and thus the number of placeholders) increases, which makes sense. However, str_replace() is a lot less impressed by a big template than the preg methods. This is because str_replace() is a much simpler algorithm than matching regular expressions.

Conclusion

There are several methods for replacing placeholders in a template. They differ in code readability and performance. However, we saw that there is not one best-performing method. Which method performs best depends on the input data.