Caching in PHP using the filesystem, APC and Memcached

Caching is very important and really pays off in big internet applications. When you cache the data you're fetching from the database, in a lot of cases the load on your servers can be reduced enormously.

One way of caching is simply storing the results of your database queries in files. Opening a file and unserializing its contents is often a lot faster than doing an expensive SELECT query with multiple joins.

Here's a simple file-based caching engine.

<?php

// Our class
class FileCache {

  // This is the function you store information with
  function store($key,$data,$ttl) {

    // Opening the file
    $h = fopen($this->getFileName($key),'w');
    if (!$h) throw new Exception('Could not write to cache');
    // Serializing along with the TTL
    $data = serialize(array(time()+$ttl,$data));
    if (fwrite($h,$data)===false) {
      throw new Exception('Could not write to cache');
    }
    fclose($h);

  }

  // General function to find the filename for a certain key
  private function getFileName($key) {

      return '/tmp/s_cache' . md5($key);

  }

  // The function to fetch data returns false on failure
  function fetch($key) {

      $filename = $this->getFileName($key);
      if (!file_exists($filename) || !is_readable($filename)) return false;

      $data = file_get_contents($filename);

      $data = @unserialize($data);
      if (!$data) {

         // Unlinking the file when unserializing failed
         unlink($filename);
         return false;

      }

      // checking if the data was expired
      if (time() > $data[0]) {

         // Unlinking
         unlink($filename);
         return false;

      }
      return $data[1];
    }

}

?>

Key strategies

All the data is identified by a key. Your keys have to be unique system-wide; it is therefore a good idea to namespace your keys. My personal preference is to name the key after the class that's storing the data, combined with, for example, an id.

Example

Your user-management class is called My_Auth, and all users are identified by an id. A sample key for cached user-data would then be "My_Auth:users:1234". '1234' is here the user id.
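To keep keys consistent, the naming scheme can be wrapped in a small function. This is only a sketch; makeCacheKey() is a hypothetical helper and not part of the cache class.

```php
<?php

// Hypothetical helper (not part of the cache class): builds a namespaced
// key from the class storing the data, the type of data and an id.
function makeCacheKey($class, $type, $id) {

    return $class . ':' . $type . ':' . $id;

}

echo makeCacheKey('My_Auth', 'users', 1234); // My_Auth:users:1234
```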

Some reasoning behind this code

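An earlier version of this class read the cache file in chunks of 4096 bytes. I chose that size because it is often the default block size on Linux filesystems, and reading in this size or a multiple of it is generally the fastest. Much later I found out file_get_contents() is actually faster, which is why the code above uses it.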

Lots of caching engines based on files actually don't specify the TTL (the time it takes before the cache expires) at the time of storing data in the cache, but while fetching it from the cache. This has one big advantage; you can check if a file is valid before actually opening the file, using the last modified time (filemtime()).
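For comparison, a rough sketch of that fetch-time approach could look like the function below. fetchWithTtl() and its arguments are hypothetical names, and it assumes the file contains only the serialized data, without a stored TTL.

```php
<?php

// Sketch of the alternative approach: the TTL is passed while fetching, and
// the file's last-modified time decides whether the cache is still valid.
// fetchWithTtl() is a hypothetical name, not part of the class in this article.
function fetchWithTtl($filename, $ttl) {

    if (!file_exists($filename)) return false;

    // Too old? We know this without opening the file
    if (time() - filemtime($filename) > $ttl) {
        unlink($filename);
        return false;
    }

    return unserialize(file_get_contents($filename));

}
```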

The reason I did not go with this approach is because most non-file based cache systems do specify the TTL on storing the data, and as you will see later in the article we want to keep things compatible. Another advantage of storing the TTL in the data, is that we can create a cleanup script later that will delete expired cache files.
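Such a cleanup script could be sketched as follows. cleanupCache() is a hypothetical function, meant to run from something like a cronjob. It assumes the cache files start with 's_cache', as in getFileName(), and that each holds a serialized array with the expiry time first.

```php
<?php

// Hypothetical cleanup function. Assumes cache files are named
// 's_cache' + md5 hash and contain serialize(array($expires, $data)).
function cleanupCache($dir) {

    $deleted = 0;
    $files = glob($dir . '/s_cache*');
    if (!$files) return 0;

    foreach ($files as $filename) {

        $data = @unserialize(file_get_contents($filename));

        // Delete files that are corrupt or past their expiry time
        if (!is_array($data) || !isset($data[0]) || time() > $data[0]) {
            if (@unlink($filename)) $deleted++;
        }

    }
    return $deleted;

}
```

The function returns the number of files it removed, which is handy for logging.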

Usage of this class

The number one place in web applications where caching is a good idea is on database queries. MySQL and others usually have a built-in query cache, but it is far from optimal, mainly because they have no awareness of the logic of your application (and they shouldn't have), and the cache is usually flushed whenever there's an update on a table. Here is a sample function that fetches user data and caches the result for 10 minutes.

<?php

 // constructing our cache engine
 $cache = new FileCache();

 function getUsers() {

    global $cache;

    // A somewhat unique key
    $key = 'getUsers:selectAll';

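    // check if the data is in the cache already; comparing with === false
    // makes sure an empty (but valid) result also counts as a cache hit
    if (($data = $cache->fetch($key)) === false) {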
       // there was no cache version, we are fetching fresh data

       // assuming there is a database connection
       $result = mysql_query("SELECT * FROM users");
       $data = array();

       // fetching all the data and putting it in an array
       while($row = mysql_fetch_assoc($result)) { $data[] = $row; }

       // Storing the data in the cache for 10 minutes
       $cache->store($key,$data,600);
    }
    return $data;
}

$users = getUsers();

?>

The reason I picked the mysql_ set of functions here is that most readers will probably know them. Personally I prefer PDO or another abstraction library, and since the mysql_ functions were removed in PHP 7, you will need one of those on current PHP versions anyway. This example assumes there's a database connection and a users table, and leaves out error handling.
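The fetch-or-query pattern from getUsers() tends to repeat for every cached query, so it can be wrapped in a helper. A minimal sketch, where cached() is a hypothetical function and $cache is assumed to be any object with the fetch() and store() methods shown earlier:

```php
<?php

// Hypothetical helper: returns the cached value for $key, or calls
// $callback to produce it and stores the result for $ttl seconds.
function cached($cache, $key, $ttl, $callback) {

    $data = $cache->fetch($key);
    if ($data === false) {

        // Cache miss, so we're fetching fresh data
        $data = call_user_func($callback);
        $cache->store($key, $data, $ttl);

    }
    return $data;

}
```

One limitation of this sketch: a callback that legitimately returns false is never treated as a cache hit, because false is also what fetch() returns on failure.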

Problems with the library

The first problem is simple: the library will only work on Linux and other Unix-like systems, because it hardcodes the /tmp folder. Luckily we can use the php.ini setting 'session.save_path' instead.

<?php

  private function getFileName($key) {

      return ini_get('session.save_path') . '/s_cache' . md5($key);

  }

?>
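One caveat: session.save_path is not always usable as-is. It can be empty, and it can carry a prefix such as "N;/path". A minimal sketch that handles those cases and falls back to PHP's sys_get_temp_dir(); the helper name getCacheDir() is hypothetical:

```php
<?php

// Hypothetical helper: picks a directory for the cache files.
function getCacheDir() {

    $path = (string) ini_get('session.save_path');

    // The setting may look like "N;/path" or "N;MODE;/path"; keep the last part
    $pos = strrpos($path, ';');
    if ($pos !== false) $path = (string) substr($path, $pos + 1);

    // Empty or unusable? Fall back to the system's temporary directory
    if ($path === '' || !is_dir($path) || !is_writable($path)) {
        $path = sys_get_temp_dir();
    }
    return rtrim($path, '/\\');

}
```

getFileName() could then return getCacheDir() . '/s_cache' . md5($key).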

The next problem is a little bit more complex. When one of our cache files is being read while another process is writing to it at the same time, you can get really unusual results. Caching bugs can be hard to find because they only occur in really specific circumstances. You might never see this issue happening yourself, but somewhere out there one of your users will.

PHP can lock files with flock(). flock() operates on an open file handle (opened by fopen()) and either locks a file for reading (shared lock, everybody can still read the file) or for writing (exclusive lock, everybody waits until the writing is done and the lock is released). Because file_get_contents() is the most efficient and we can only use flock() on file handles, we'll use a combination of both.

The updated store and fetch methods will look like this:

<?php
  // This is the function you store information with
  function store($key,$data,$ttl) {

    // Opening the file in read/write mode
    $h = fopen($this->getFileName($key),'a+');
    if (!$h) throw new Exception('Could not write to cache');

    flock($h,LOCK_EX); // exclusive lock, will get released when the file is closed

    fseek($h,0); // go to the beginning of the file

    // truncate the file
    ftruncate($h,0);

    // Serializing along with the TTL
    $data = serialize(array(time()+$ttl,$data));
    if (fwrite($h,$data)===false) {
      throw new Exception('Could not write to cache');
    }
    fclose($h);

  }

  function fetch($key) {

      $filename = $this->getFileName($key);
      if (!file_exists($filename)) return false;
      $h = fopen($filename,'r');

      if (!$h) return false;

      // Getting a shared lock
      flock($h,LOCK_SH);

      $data = file_get_contents($filename);
      fclose($h);

      $data = @unserialize($data);
      if (!$data) {

         // If unserializing somehow didn't work out, we'll delete the file
         unlink($filename);
         return false;

      }

      if (time() > $data[0]) {

         // Unlinking when the file was expired
         unlink($filename);
         return false;

      }
      return $data[1];
   }

?>

Well, that actually wasn't too hard. Only a handful of new lines. The next issue we're facing is updates of data. When somebody updates, say, a page in the CMS, they usually expect the respective page to update instantly. In those cases you can update the data using store(), but in some cases it is simply more convenient to flush the cache. So we need a delete method.

<?php

    function delete( $key ) {

        $filename = $this->getFileName($key);
        if (file_exists($filename)) {
            return unlink($filename);
        } else {
            return false;
        }

    }

?>
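As a usage sketch, the CMS case could look like this. updatePage() and the 'My_Cms:pages:' key prefix are made-up names for illustration, and $cache is assumed to be any object with the delete() method above.

```php
<?php

// Hypothetical example: updatePage() and the key prefix are made-up names.
function updatePage($cache, $id, $content) {

    // ... update the page in the database here ...

    // Throw away the cached copy; the next fetch() will miss
    // and the page will be loaded fresh from the database
    $cache->delete('My_Cms:pages:' . $id);

}
```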

Abstracting the code

This cache class is pretty straightforward. The only methods in there are delete, store and fetch. We can easily abstract that into the following base class. I'm also giving it a proper prefix (I tend to prefix everything with Sabre, name yours whatever you want). A good reason to prefix all your classes is that they will never collide with other class names if you need to include other code. The PEAR project made a stupid mistake by naming one of their classes 'Date'; by doing this and refusing to change it, they actually prevented an internal PHP date class from being named Date.

<?php

    abstract class Sabre_Cache_Abstract {

        abstract function fetch($key);
        abstract function store($key,$data,$ttl);
        abstract function delete($key);

    }

?>

The resulting FileCache (which I'll rename to Filesystem) is:

<?php

class Sabre_Cache_Filesystem extends Sabre_Cache_Abstract {

  // This is the function you store information with
  function store($key,$data,$ttl) {

    // Opening the file in read/write mode
    $h = fopen($this->getFileName($key),'a+');
    if (!$h) throw new Exception('Could not write to cache');

    flock($h,LOCK_EX); // exclusive lock, will get released when the file is closed

    fseek($h,0); // go to the start of the file

    // truncate the file
    ftruncate($h,0);

    // Serializing along with the TTL
    $data = serialize(array(time()+$ttl,$data));
    if (fwrite($h,$data)===false) {
      throw new Exception('Could not write to cache');
    }
    fclose($h);

  }

  // The function to fetch data returns false on failure
  function fetch($key) {

      $filename = $this->getFileName($key);
      if (!file_exists($filename)) return false;
      $h = fopen($filename,'r');

      if (!$h) return false;

      // Getting a shared lock 
      flock($h,LOCK_SH);

      $data = file_get_contents($filename);
      fclose($h);

      $data = @unserialize($data);
      if (!$data) {

         // If unserializing somehow didn't work out, we'll delete the file
         unlink($filename);
         return false;

      }

      if (time() > $data[0]) {

         // Unlinking when the file was expired
         unlink($filename);
         return false;

      }
      return $data[1];
   }

   function delete( $key ) {

      $filename = $this->getFileName($key);
      if (file_exists($filename)) {
          return unlink($filename);
      } else {
          return false;
      }

   }

  private function getFileName($key) {

      return ini_get('session.save_path') . '/s_cache' . md5($key);

  }

}

?>

There you go, a complete, proper OOP, file-based caching class... I hope I explained things well.

Memory based caching through APC

If files aren't fast enough for you, and you have enough memory to spare, memory-based caching might be the solution. Obviously, storing and retrieving stuff from memory is a lot faster. The APC extension not only does opcode caching (it speeds up your PHP scripts by caching the parsed PHP script), but it also provides a simple mechanism to store data in shared memory.

Using shared memory in APC is extremely simple. I'm not even going to explain it; the code should tell enough.

<?php

    class Sabre_Cache_APC extends Sabre_Cache_Abstract {

        function fetch($key) {
            return apc_fetch($key);
        }

        function store($key,$data,$ttl) {

            return apc_store($key,$data,$ttl);

        }

        function delete($key) {

            return apc_delete($key);

        }

    }

?>

My personal problem with APC was that it tended to break my code, so if you want to use it, give it a test run first. I have to admit that I haven't checked it again since they fixed 'my' bug. With that bug fixed, APC is amazing for single-server applications and for really often used data.

Memcached

Problems start when you are dealing with more than one webserver. Since there is no shared cache between the servers, situations can occur where data is updated on one server and it takes a while before the other servers are up to date. It can be really useful to have a really high TTL on your data and simply replace or delete the cache whenever there is an actual update. When you are dealing with multiple webservers, this scheme is simply not possible with the previous caching methods.

Introducing memcached. Memcached is a cache server originally developed by the LiveJournal people and now being used by sites like Digg, Facebook, Slashdot and Wikipedia.

How it works

  • Memcached consists of a server and a client part. The server is a standalone program that runs on your servers and the client is in this case a PHP extension.
  • If you have 3 webservers which all run Memcached, all webservers connect to all 3 memcached servers. The 3 memcached servers are all in the same 'pool'.
  • Each cache server only contains part of the cache, meaning the cache is not replicated between the memcached servers.
  • To find the server where the cache is stored (or should be stored), a hashing algorithm is applied to the key. This way the 'right' server is always picked.
  • Every memcached server has a memory limit. It will never consume more memory than the limit. If the limit is exceeded, older cache is automatically thrown out (whether or not its TTL has expired).
  • This means it cannot be used as a place to simply store data. The database does that part. Don't confuse the purpose of the two!
  • Memcached runs the fastest (like many other applications) on a Linux 2.6 kernel.
  • By default, memcached is completely open. Be sure to have a firewall in place to lock out outside IPs, because this can be a huge security risk.
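The idea of picking a server by hashing the key can be illustrated with a few lines of PHP. This is a deliberately simplified sketch and not the exact algorithm the Memcache extension uses; pickServer() is a hypothetical function.

```php
<?php

// Simplified illustration of picking a server by hashing the key.
// This is NOT the exact algorithm the Memcache extension uses.
function pickServer($key, array $servers) {

    // The bitmask keeps the hash positive on 32-bit systems
    $hash = crc32($key) & 0x7fffffff;
    return $servers[$hash % count($servers)];

}

$servers = array('www1', 'www2', 'www3');

// The same key maps to the same server, no matter which webserver asks
echo pickServer('My_Auth:users:1234', $servers);
```

Because the result depends only on the key and the server list, every webserver with the same list ends up at the same memcached server for a given key.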

Installing

When you are on debian/ubuntu, installing is easy:

apt-get install memcached

You are stuck with whatever version the distribution ships, though, and Debian tends to be slow with updates. Other distributions might also have a pre-built package for you. In any other case you might need to download Memcached from the site and compile it with the usual:

./configure
make
make install

There's probably a README in the package with better instructions.

After installation, you need the PECL extension. All you need to do for that (usually) is:

pecl install Memcache

You also need the zlib development library. For debian, you can get this by entering:

apt-get install zlib1g-dev

However, 99% of the time automatic PECL installation fails for me. Here are the alternative installation instructions:

pecl download Memcache
tar xfvz Memcache-2.1.0.tgz #version might be changed
cd Memcache-2.1.0
phpize
./configure
make
make install

Don't forget to enable the extension in php.ini by adding the line extension=memcache.so and restarting the webserver.

The good stuff

After the Memcached server is installed and running, and you have PHP running with the Memcache extension, you're off. Here's the Memcached class.

<?php

    class Sabre_Cache_MemCache extends Sabre_Cache_Abstract {

        // Memcache object
        public $connection;

        function __construct() {

            $this->connection = new MemCache;

        }

        function store($key, $data, $ttl) {

            return $this->connection->set($key,$data,0,$ttl);

        }

        function fetch($key) {

            return $this->connection->get($key);

        }

        function delete($key) {

            return $this->connection->delete($key);

        }

        function addServer($host,$port = 11211, $weight = 10) {

            $this->connection->addServer($host,$port,true,$weight);

        }

    }

?>

Now, the only thing you have to do in order to use this class is add servers. Add servers consistently! This means every webserver should add the exact same memcached servers, so the keys will be distributed in the same way from every webserver.

If a server has double the memory available for memcached, you can double the weight. The chance that data will be stored on that specific server will also be doubled.
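To get a feel for what a weight does, here is a simplified illustration in which every server gets as many slots as its weight. It is not the extension's real algorithm. pickWeightedServer() is a hypothetical function and assumes at least one server has a weight above zero.

```php
<?php

// Simplified illustration of weights, NOT the extension's real algorithm.
// $weights maps a hostname to its weight, e.g. array('www1' => 10).
function pickWeightedServer($key, array $weights) {

    // Every server gets as many slots as its weight
    $slots = array();
    foreach ($weights as $host => $weight) {
        for ($i = 0; $i < $weight; $i++) $slots[] = $host;
    }

    // Hash the key to a slot; the bitmask keeps the hash positive on 32-bit systems
    return $slots[(crc32($key) & 0x7fffffff) % count($slots)];

}

$weights = array('www1' => 10, 'www2' => 20, 'www3' => 10);
echo pickWeightedServer('my_key', $weights);
```

In this sketch www2 owns 20 of the 40 slots, so on average about half of all keys end up there.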

Example

<?php

    $cache = new Sabre_Cache_MemCache();
    $cache->addServer('www1');
    $cache->addServer('www2',11211,20); // this server has double the memory, and gets double the weight
    $cache->addServer('www3',11211);

    // Store some data in the cache for 10 minutes
    $cache->store('my_key','foobar',600);

    // Get it out of the cache again
    echo($cache->fetch('my_key'));

?>

Some final tips

  • Be sure to check out the docs for Memcache and APC and try to determine what's right for you.
  • Caching can help everywhere SQL queries are done. You'd be surprised how big the difference can be in terms of speed.
  • In some cases you might want the cross-server abilities of memcached, but you don't want to use up your memory or have your items automatically get flushed out. Wikipedia came across this problem and traded in fast memory caching for virtually infinite size file-based caching by creating a memcached-compatible engine called Tugela Cache. You can still use the PECL Memcache client with it, so switching should be pretty easy. I don't have experience with this, nor do I know how stable it is.
  • If you have different requirements for different parts of your cache, you can always consider using the different types alongside each other.
