Caching in PHP using the filesystem, APC and Memcached
Caching is very important and really pays off in big internet applications. When you cache the data you're fetching from the database, the load on your servers can in many cases be reduced enormously.
One way of caching is simply storing the results of your database queries in files. Opening a file and unserializing it is often a lot faster than running an expensive SELECT query with multiple joins.
Here's a simple file-based caching engine.
<?php

// Our class
class FileCache {

    // This is the function you store information with
    function store($key,$data,$ttl) {

        // Opening the file
        $h = fopen($this->getFileName($key),'w');
        if (!$h) throw new Exception('Could not write to cache');

        // Serializing along with the TTL
        $data = serialize(array(time()+$ttl,$data));
        if (fwrite($h,$data)===false) {
            throw new Exception('Could not write to cache');
        }
        fclose($h);

    }

    // General function to find the filename for a certain key
    private function getFileName($key) {
        return '/tmp/s_cache' . md5($key);
    }

    // The function to fetch data; returns false on failure
    function fetch($key) {

        $filename = $this->getFileName($key);
        if (!file_exists($filename) || !is_readable($filename)) return false;

        $data = file_get_contents($filename);
        $data = @unserialize($data);
        if (!$data) {
            // Unlinking the file when unserializing failed
            unlink($filename);
            return false;
        }

        // Checking if the data has expired
        if (time() > $data[0]) {
            // Unlinking
            unlink($filename);
            return false;
        }
        return $data[1];

    }

}

?>
Key strategies
All the data is identified by a key. Your keys have to be unique system-wide; it is therefore a good idea to namespace your keys. My personal preference is to name the key after the class that's storing the data, combined with, for example, an id.
Example
Your user-management class is called My_Auth, and all users are identified by an id. A sample key for cached user data would then be "My_Auth:users:1234", where '1234' is the user id.
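A small helper along these lines keeps the convention in one place. (The cacheKey() method here is my own illustration, not part of the article's classes.)

```php
<?php

// Hypothetical helper: builds namespaced cache keys like "My_Auth:users:1234".
// __CLASS__ supplies the namespace part automatically.
class My_Auth {

    public static function cacheKey($section, $id) {
        return __CLASS__ . ':' . $section . ':' . $id;
    }

}

echo My_Auth::cacheKey('users', 1234); // My_Auth:users:1234
```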
Some reasoning behind this code
I originally chose 4096 bytes per chunk, because this is often the default block size on Linux filesystems, and this or a multiple of it is generally the fastest. Much later I found out file_get_contents() is actually faster.
Lots of file-based caching engines actually don't specify the TTL (the time before the cache expires) at the time of storing data in the cache, but while fetching it. This has one big advantage: you can check whether a file is still valid before actually opening it, using the last-modified time (filemtime()).
The reason I did not go with this approach is that most non-file-based cache systems do specify the TTL when storing the data, and as you will see later in the article we want to keep things compatible. Another advantage of storing the TTL in the data is that we can create a cleanup script later that will delete expired cache files.
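Such a cleanup script could be as simple as this sketch (my own, not part of the article's code; it assumes the /tmp/s_cache filename prefix used by the class):

```php
<?php

// Sketch of a cleanup script: because store() serializes the expiry
// timestamp as the first array element, we can scan the cache directory
// and unlink everything that has expired or can't be unserialized.
foreach (glob('/tmp/s_cache*') as $filename) {
    $data = @unserialize(file_get_contents($filename));
    if (!$data || time() > $data[0]) {
        unlink($filename);
    }
}
```

You would typically run something like this from cron every few minutes.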
Usage of this class
The number one place in web applications where caching is a good idea is on database queries. MySQL and others usually have a built-in cache, but it is far from optimal, mainly because they have no awareness of the logic of your application (and they shouldn't have), and the cache is usually flushed whenever there's an update on a table. Here is a sample function that fetches user data and caches the result for 10 minutes.
<?php

// Constructing our cache engine
$cache = new FileCache();

function getUsers() {

    global $cache;

    // A somewhat unique key
    $key = 'getUsers:selectAll';

    // Check if the data is not in the cache already
    if (!$data = $cache->fetch($key)) {

        // There was no cached version; we are fetching fresh data
        // assuming there is a database connection
        $result = mysql_query("SELECT * FROM users");
        $data = array();

        // Fetching all the data and putting it in an array
        while($row = mysql_fetch_assoc($result)) { $data[] = $row; }

        // Storing the data in the cache for 10 minutes
        $cache->store($key,$data,600);
    }
    return $data;

}

$users = getUsers();

?>
The reason I picked the mysql_ set of functions here is that most readers will probably know them. Personally I prefer PDO or another abstraction library. This example assumes there's a database connection, a users table, and so on.
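For comparison, here is how the same function might look with PDO. This is a sketch: the getUsersPdo name is mine, the PDO connection and users table are assumed, and $cache can be any object with the fetch()/store() methods shown above.

```php
<?php

// Hypothetical PDO variant of getUsers(). $pdo is an existing PDO
// connection; $cache is any cache object exposing fetch() and store().
function getUsersPdo(PDO $pdo, $cache) {

    $key = 'getUsers:selectAll';

    // Check if the data is in the cache already
    if (!$data = $cache->fetch($key)) {

        // No cached version; fetch fresh data
        $stmt = $pdo->query('SELECT * FROM users');
        $data = $stmt->fetchAll(PDO::FETCH_ASSOC);

        // Store the data in the cache for 10 minutes
        $cache->store($key, $data, 600);
    }
    return $data;

}
```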
Problems with the library
The first problem is simple: the library will only work on Linux (or other Unix-like systems), because it uses the /tmp folder. Luckily we can use the php.ini setting 'session.save_path' instead.
<?php

private function getFileName($key) {
    return ini_get('session.save_path') . '/s_cache' . md5($key);
}

?>
The next problem is a bit more complex. When one of our cache files is being read and at the same time written by another process, you can get really unusual results. Caching bugs can be hard to find because they only occur under really specific circumstances; you might never see this issue yourself, but somewhere out there a user will.
PHP can lock files with flock(). flock() operates on an open file handle (opened by fopen()) and either locks a file for reading (shared lock: everybody can read the file) or for writing (exclusive lock: everybody waits until the writing is done and the lock is released). Because file_get_contents() is the most efficient, and we can only use flock() on file handles, we'll use a combination of both.
The updated store and fetch methods will look like this:
<?php

// This is the function you store information with
function store($key,$data,$ttl) {

    // Opening the file in read/write mode
    $h = fopen($this->getFileName($key),'a+');
    if (!$h) throw new Exception('Could not write to cache');

    flock($h,LOCK_EX); // exclusive lock, will get released when the file is closed
    fseek($h,0); // go to the beginning of the file

    // Truncate the file
    ftruncate($h,0);

    // Serializing along with the TTL
    $data = serialize(array(time()+$ttl,$data));
    if (fwrite($h,$data)===false) {
        throw new Exception('Could not write to cache');
    }
    fclose($h);

}

function fetch($key) {

    $filename = $this->getFileName($key);
    if (!file_exists($filename)) return false;

    $h = fopen($filename,'r');
    if (!$h) return false;

    // Getting a shared lock
    flock($h,LOCK_SH);

    $data = file_get_contents($filename);
    fclose($h);

    $data = @unserialize($data);
    if (!$data) {
        // If unserializing somehow didn't work out, we'll delete the file
        unlink($filename);
        return false;
    }

    if (time() > $data[0]) {
        // Unlinking when the file was expired
        unlink($filename);
        return false;
    }
    return $data[1];

}

?>
Well, that actually wasn't too hard; only 3 new lines. The next issue we're facing is updates of data. When somebody updates, say, a page in the CMS, they usually expect the corresponding page to update instantly. In those cases you can update the data using store(), but sometimes it is simply more convenient to flush the cache, so we need a delete method.
<?php

function delete( $key ) {
    $filename = $this->getFileName($key);
    if (file_exists($filename)) {
        return unlink($filename);
    } else {
        return false;
    }
}

?>
Abstracting the code
This cache class is pretty straightforward. The only methods in there are delete, store and fetch, so we can easily abstract that into the following base class. I'm also giving it a proper prefix (I tend to prefix everything with Sabre; name yours whatever you want). A good reason to prefix all your classes is that they will never collide with other class names if you need to include other code. The PEAR project made a mistake by naming one of their classes 'Date'; by doing this and refusing to change it, they actually prevented an internal PHP date class from being named Date.
<?php

abstract class Sabre_Cache_Abstract {

    abstract function fetch($key);
    abstract function store($key,$data,$ttl);
    abstract function delete($key);

}

?>
The resulting FileCache (which I'll rename to Filesystem) is:
<?php

class Sabre_Cache_Filesystem extends Sabre_Cache_Abstract {

    // This is the function you store information with
    function store($key,$data,$ttl) {

        // Opening the file in read/write mode
        $h = fopen($this->getFileName($key),'a+');
        if (!$h) throw new Exception('Could not write to cache');

        flock($h,LOCK_EX); // exclusive lock, will get released when the file is closed
        fseek($h,0); // go to the start of the file

        // Truncate the file
        ftruncate($h,0);

        // Serializing along with the TTL
        $data = serialize(array(time()+$ttl,$data));
        if (fwrite($h,$data)===false) {
            throw new Exception('Could not write to cache');
        }
        fclose($h);

    }

    // The function to fetch data; returns false on failure
    function fetch($key) {

        $filename = $this->getFileName($key);
        if (!file_exists($filename)) return false;

        $h = fopen($filename,'r');
        if (!$h) return false;

        // Getting a shared lock
        flock($h,LOCK_SH);

        $data = file_get_contents($filename);
        fclose($h);

        $data = @unserialize($data);
        if (!$data) {
            // If unserializing somehow didn't work out, we'll delete the file
            unlink($filename);
            return false;
        }

        if (time() > $data[0]) {
            // Unlinking when the file was expired
            unlink($filename);
            return false;
        }
        return $data[1];

    }

    function delete( $key ) {
        $filename = $this->getFileName($key);
        if (file_exists($filename)) {
            return unlink($filename);
        } else {
            return false;
        }
    }

    private function getFileName($key) {
        return ini_get('session.save_path') . '/s_cache' . md5($key);
    }

}

?>
There you go: a complete, properly OOP, file-based caching class. I hope I explained things well.
Memory based caching through APC
If files aren't fast enough for you, and you have enough memory to spare, memory-based caching might be the solution. Obviously, storing and retrieving data from memory is a lot faster. The APC extension not only provides an opcode cache (which speeds up your PHP scripts by caching the parsed script), but also a simple mechanism to store data in shared memory.
Using shared memory in APC is extremely simple; I'm not even going to explain it, the code should tell enough.
<?php

class Sabre_Cache_APC extends Sabre_Cache_Abstract {

    function fetch($key) {
        return apc_fetch($key);
    }

    function store($key,$data,$ttl) {
        return apc_store($key,$data,$ttl);
    }

    function delete($key) {
        return apc_delete($key);
    }

}

?>
My personal problem with APC was that it tended to break my code, so if you want to use it, give it a test run. I have to admit I haven't checked anymore since they fixed 'my' bug. Now that this bug is fixed, APC is amazing for single-server applications and for really frequently used data.
Memcached
Problems start when you are dealing with more than one web server. Since there is no shared cache between the servers, situations can occur where data is updated on one server and it takes a while before the other server is up to date. It can be really useful to give your data a really high TTL and simply replace or delete the cache whenever there is an actual update; with multiple web servers, this scheme is simply not possible using the previous caching methods.
Introducing memcached. Memcached is a cache server originally developed by the LiveJournal people and now being used by sites like Digg, Facebook, Slashdot and Wikipedia.
How it works
- Memcached consists of a server and a client part. The server is a standalone program that runs on your servers, and the client is in this case a PHP extension.
- If you have 3 web servers which all run memcached, all web servers connect to all 3 memcached servers. The 3 memcached servers are all in the same 'pool'.
- Each cache server only contains part of the cache; the cache is not replicated between the memcached servers.
- To find the server where the cache is stored (or should be stored), a hashing algorithm is used. This way the 'right' server is always picked.
- Every memcached server has a memory limit. It will never consume more memory than the limit; if the limit is exceeded, older cache entries are automatically thrown out (whether or not their TTL has expired).
- This means it cannot be used as a place to simply store data; the database does that part. Don't confuse the purpose of the two!
- Memcached runs fastest (like many other applications) on a Linux 2.6 kernel.
- By default, memcached is completely open. Be sure to have a firewall in place to lock out outside IPs, because an open memcached can be a huge security risk.
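The server-selection step above can be sketched like this. (This is a simplification of my own; the real Memcache client's hashing is more involved and also honours the server weights.)

```php
<?php

// Simplified illustration of key-to-server mapping. The actual Memcache
// client uses its own hashing strategy; this only shows why every web
// server must configure the exact same server list.
function pickServer($key, array $servers) {
    // Mask to a non-negative integer so the modulo is a valid index
    $hash = crc32($key) & 0x7fffffff;
    // Same key + same server list => same server, from every web server
    return $servers[$hash % count($servers)];
}

$servers = array('www1:11211', 'www2:11211', 'www3:11211');
$server = pickServer('My_Auth:users:1234', $servers);
```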
Installing
When you are on Debian/Ubuntu, installing is easy:
apt-get install memcached
You are stuck with whatever version they ship, though; Debian tends to be slow with updates. Other distributions might also have a pre-built package for you. In any other case you might need to download memcached from its site and compile it with the usual:
./configure
make
make install
There's probably a README in the package with better instructions.
After installation, you need the PECL extension. All you need to do for that (usually) is:
pecl install Memcache
You also need the zlib development library. For debian, you can get this by entering:
apt-get install zlib1g-dev
However, 99% of the time the automatic PECL installation fails for me. Here are the alternative installation instructions:
pecl download Memcache
tar xfvz Memcache-2.1.0.tgz #version might be changed
cd Memcache-2.1.0
phpize
./configure
make
make install
Don't forget to enable the extension in php.ini by adding the line extension=memcache.so, and restart the web server.
The good stuff
Once the memcached server is installed and running, and you have PHP running with the Memcache extension, you're set. Here's the memcached class:
<?php

class Sabre_Cache_MemCache extends Sabre_Cache_Abstract {

    // Memcache object
    public $connection;

    function __construct() {
        $this->connection = new MemCache;
    }

    function store($key, $data, $ttl) {
        return $this->connection->set($key,$data,0,$ttl);
    }

    function fetch($key) {
        return $this->connection->get($key);
    }

    function delete($key) {
        return $this->connection->delete($key);
    }

    function addServer($host,$port = 11211, $weight = 10) {
        $this->connection->addServer($host,$port,true,$weight);
    }

}

?>
Now, the only thing you have to do in order to use this class is add servers. Add servers consistently! Every web server should add the exact same memcache servers, so the keys will be distributed in the same way from every web server.
If a server has double the memory available for memcached, you can double its weight. The chance that data will be stored on that specific server is then also doubled.
Example
<?php

$cache = new Sabre_Cache_MemCache();
$cache->addServer('www1');
$cache->addServer('www2',11211,20); // this server has double the memory, and gets double the weight
$cache->addServer('www3',11211);

// Store some data in the cache for 10 minutes
$cache->store('my_key','foobar',600);

// Get it out of the cache again
echo($cache->fetch('my_key'));

?>
Some final tips
- Be sure to check out the docs for Memcache and APC, and try to determine what's right for you.
- Caching can help everywhere SQL queries are done. You'd be surprised how big the difference can be in terms of speed.
- In some cases you might want the cross-server abilities of memcached, but you don't want to use up your memory or have your items automatically flushed out. Wikipedia came across this problem and traded fast memory caching for virtually infinite file-based caching by creating a memcached-compatible engine called Tugela Cache. You can still use the PECL Memcache client with it, so it should be pretty easy to adopt; I don't have experience with it or know how stable it is.
- If you have different requirements for different parts of your cache, you can always consider using the different types side by side.
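Because every engine implements the same fetch/store/delete interface, code that consumes the cache never needs to know which engine it was given. A small bootstrap sketch (the createCache() helper and its engine names are mine, not part of the article's classes):

```php
<?php

// Hypothetical factory: picks a cache engine once, at bootstrap.
// Application code only ever sees the Sabre_Cache_Abstract interface.
function createCache($engine) {
    switch ($engine) {
        case 'apc':
            return new Sabre_Cache_APC();
        case 'memcache':
            $cache = new Sabre_Cache_MemCache();
            $cache->addServer('localhost'); // setup happens here, and only here
            return $cache;
        default:
            return new Sabre_Cache_Filesystem();
    }
}

// Usage: everything after this line works the same with any engine
// $cache = createCache('file');
// $cache->store('example', 'value', 60);
```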
Comments
Wez Furlong •
flock() is moot if you fopen($filename, 'w'), since that will truncate the file. You should use r+, which allows you to write; you should then flock() and ftruncate() the file.
Evert •
Thanks for the correction, Wez. I totally didn't think of that situation. I updated the code in the article.
tobozo •
What about the memory_limit value in php.ini?

// fetching all the data and putting it in an array
while($row = mysql_fetch_assoc($result)) { $data[] = $row; }

When $data gets bigger than memory_limit, depending on the server profile, the cache never gets stored and the script dies...
This cache system needs some improvement in memory management, or some sort of pagination functionality to handle big result sets.
Evert •
Tobozo, it's a good idea to have the pagination in the SQL query and also store the 'page info' in the key of the cache.

function getUsers($start,$count) {
    global $cache;
    // A somewhat unique key
    $key = 'getUsers:selectAll:' . $start . ':' . $count;
    .....
gavin hurley •
It's worth noting that the largest payload that memcached can store is 1MB. Admittedly, this is a lot of data, but I've seen serialized objects that are this large.
Evert •
Gavin, I'm going to see if this is really the case. I have not seen this limit myself, but I'll try it out.
Did you run into this problem yourself, or is there an online resource which confirms this?
Evert
Evert •
UPDATE: fixed the Filesystem cache class to use the a+ mode instead of r+. Before, it wouldn't write to the file if it didn't exist already.
And by the way, APC now works for my code! Yay!
Wez Furlong •
a+ is wrong; that is append only, meaning that you can't ftruncate and rewind to replace the content. You need to handle the case where the file didn't exist with some logic in your code, with appropriate locking. You might find the PHP-specific x+ mode useful for that purpose.
Wez Furlong •
Err, I'm confusing myself with open modes ;-)
Disregard my last comment about x+, but a+ is still not what you want to use.
Evert •
Thanks for your comments, Wez. I have to say that this worked for me on Linux. Maybe I am assuming incorrectly, but from the work I've seen from you, you seem to be a Windows guy for a big part (I could be way off with this one).
The simple solution is to do a file_exists()... but I'm going to have to do some more serious cross-platform testing with this one and post results tomorrow.
Again, thanks very much.
Henrik Jochumsen •
I don't like having to unserialize the file each time I need to check if the file is valid. What about making two files: one for TTL values and one for serialized data?
This will save disk and CPU time.
Evert •
That is a possibility. Perhaps it would be even easier to make the first line the TTL and the rest the serialized data; that would speed things up.
That is probably still faster than having multiple files.
cnjax •
It is tested that memcache can't save an array bigger than 1MB. I tried to save a 60,000-record array into memcache, but failed. I have found there is a memcache patch on the mailing list at the memcache homepage; I haven't tested the patch yet.
David •
while (feof($h)) $data.=fread($h,4096);
A slight (but show-stopping) error here; how about:
while (!feof($h)) $data.=fread($h,4096);
Evert •
Thanks David, I actually found out file_get_contents is a faster alternative. I probably misread somewhere that feof is better, but in this case file_get_contents will be faster.
I'll update this now.
Miglius •
Useful article, gonna try that and write what I really think about it later :)
Tjerk Meesters •
Great article!
About the use of a+ in fopen(): even though the behaviour of this open mode is different between Windows and Linux, the desired effect of automatic file creation is identical; you just need to make sure to fseek() before ftruncate() ;-)
Instead of using file_get_contents() after flock(), consider using stream_get_contents().
Bryan •
I completely missed the part about needing zlib-devel, and spent a good hour trying to figure out why I was getting a configure error when trying to install memcache. Doing a yum install memcache fixed that.
MikeFM •
I just wrapped memcache so that it's also backed by MySQL. It's a simple table with good indexing, and I haven't yet experienced any slow behavior. Anything too big or too long-lived to fit in memcache will just ignore memcache, and anything that dropped from memcache will just reload as needed from MySQL. Works pretty well.
Adam Flatau •
Thank you for this article, it was very helpful.
Great article, helps me a lot. Because of this, I will give back this snippet:

class S_FileCache_Output extends S_FileCache {

    private $_ttl;
    private $_key;

    function start($key, $ttl=0) {
        $this->_ttl = $ttl;
        $this->_key = $key;
        $data = $this->fetch($key);
        if ($data !== false) {
            echo($data);
            return true;
        }
        ob_start();
        ob_implicit_flush(false);
        return false;
    }

    function end() {
        $data = ob_get_contents();
        ob_end_clean();
        $this->store($this->_key, $data, $this->_ttl);
        echo($data);
    }

}
Now you can use it like:

$cache = new S_FileCache_Output();
$key = 'OutPut:Buffer';
if (!($cache->start($key))) {
    // Some data
    echo $data;
    $cache->end();
}
Aaron •
I was wondering if you could use filemtime rather than having to open the file and then read part of it to check. Not sure if this would be faster, but logic suggests it would be.

if ((time() - filemtime($filename)) >= $ttl) {
    // cache has expired
    // do something
} else {
    // read file and show
}
Not sure if this is a good idea or not, would love to hear some feed back.
Aaron •
Disregard my last post; I've just been doing some testing, and it performs better using the TTL in the file than using filemtime. I opened a two-column table with 10 fields using various expiry times, doing this 1000 times for each test. Using the TTL in the file averaged better in every run.
Evert •
That's good news. The biggest reason I put the TTL in the file instead of using filemtime is that I wanted to set the expiry upon storing the data, not while loading.
This allowed me to keep a consistent API across all cache stores.
I'm a little shocked that it would be faster though.
steve •
Hi Evert, I was looking for something like this class; glad I found it here. I have a couple of questions though: what happens if unlink is executed on a file that is currently locked? Will the call to unlink fail? (If it fails, then the delete method won't be able to flush the cache.)
Thanks in advance!
Evert •
Hey Steve, the unlink will succeed, and if you are on Linux and some other process currently has that file open, there will still be no issues.
Hope that helps,
Evert
vahur •
APC rocks. It's also very good on multiple servers.
When setting up APC on a Windows machine, be sure that you use the right DLL for your PHP version; otherwise you get a number of Apache/PHP fatal errors.
Sandeep Verma •
This is a very good resource for PHP caching using PECL and APC. It is amazing in web development.
I think it is extremely useful for speeding up a web application and improving performance.
Sandeep Verma
http://sandeepverma.wordpress.com
Joakim Berg •
@cnjax: I don't see any situation in which you really have to cache data larger than 1MB. Though, if you really need to, then just split the data into chunks, like $chunkname = $key.'_'.$chunkId;
But to be honest, I don't really think it's optimal to send >1MB items, even through your local network. If you have 100/100 on your local network, then you'll only be able to send about 12 items simultaneously per second. I don't know the size of usage of your app, but I do think one should remember that memcached is a caching daemon for fast-access data, and not a storage mechanism for harddrive-offloaded files...
Andres Santos •
Has someone used ADOdb for PHP with memcached? It just doesn't work on my server.
Ibrahim Benzer •
Hi, thank you for the great sharing...
Could you make some example code on how to use it? As I am a noob in OOP PHP :)
paullush •
I would serialize the data before opening/locking a file, just to keep down the time that the file is actually locked.
paullush •
You can also avoid loading the cache file to check its TTL validity by testing its file date: touch the cache file with time() + $ttl, like this:

function store($key,$data,$ttl) {

    // Opening the file in read/write mode
    $h = fopen($this->getFileName($key),'a+');
    if (!$h) throw new Exception('Could not write to cache');

    // Serializing along with the TTL
    $data = serialize(array(time()+$ttl,$data));

    flock($h,LOCK_EX); // exclusive lock, will get released when the file is closed
    fseek($h,0); // go to the beginning of the file

    // Truncate the file
    ftruncate($h,0);

    if (fwrite($h,$data)===false) {
        throw new Exception('Could not write to cache');
    }
    fclose($h);

    if (!touch($this->getFileName($key), time() + $ttl)) throw new Exception('Could not touch cache file');
}

function fetch($key) {

    $filename = $this->getFileName($key);
    if (!file_exists($filename)) return false;

    // If the 'touched' file time on the cache is older than the current time,
    // delete the file and return false
    if (time() > @filemtime($filename)) {
        unlink($filename);
        return false;
    }

    $h = fopen($filename,'r');
    if (!$h) return false;

    // Getting a shared lock
    flock($h,LOCK_SH);

    $data = file_get_contents($filename);
    fclose($h);

    $data = @unserialize($data);
    if (!$data) {
        // If unserializing somehow didn't work out, we'll delete the file
        unlink($filename);
        return false;
    }

    return $data[1];
}
Jason •
Thanks for this, I was about to build a class to handle many caching methods and you just did half of it for me =)
venugopal •
If the TTL is 10 seconds and I access the file at the 9th second, it gets deleted after the next second. Does that make sense? The TTL should always be 10 seconds from the last access time. I think we can achieve this if we store the TTL outside of the data in some way... any thoughts?
Is there a way I can refresh the file creation/access time without actually writing or modifying the data, so that I can use filemtime to identify the TTL?
venugopal •
touch() and filemtime() can be used to implement refreshing the TTL on every cache access. I should try to write a script which will remove all files whose creation/access time is less than the current time minus the TTL.
If the cache dir has more than 10 million files, listing and then deleting all expired files won't be a good idea, as that process is itself resource-consuming. Any comments please...
AintNoOne •
Thanks. That helped a lot. What complications might arise from using a file-based caching solution with a multiple web server setup?
Evert •
You're commenting on an 8-year-old post ;) But the main issue is that the files will only be on 1 machine, not on both. Unless you're using a network filesystem, which is slow and often a bad idea.
AintNoOne •
Thanks. Yes, I noticed the age of the article after I posted. Haha. Still just as relevant today (I think?).
My brain still struggles with understanding when and why one should abstract a class like is done above. For example, if you had defined a $cacheEngine variable in your config file to leverage the flexibility of abstraction, then wouldn't your call to the addServer method blow up for all of the objects except for Sabre_Cache_MemCache?
Evert •
That's a good question.
The best way I can explain this, is that you have to separate 'using the cache engine' and 'setting up the cache engine'.
'Using the cache engine' here means calling methods such as fetch, store and delete. This is identical for every engine.
Generally, with a lot of OOP modelling, 'how an object is set up' is not important. We only concern ourselves with abstraction after setup.
So the short answer is, objects that use 'any generic cache engine' to store information, never call addServer().
Ano •
Thanks Evert
AskAmN •
Wrong:
$cache = new Sabre_Cache_MemCache();
Should be:
$cache = new Sabre_Cache_MemCache;
:)
Evert •
Both actually work, and the former form is more common in most PHP code.
AskAmN •
I believe it's just like when we have a constructor,
$class = new class("constructor_value");
Thanks, I understood.
Andre de Andrade •
My /tmp is small, only 1GB. I need more space. How can I change the directory used for storage? In APC I am using /dev/shm as tmpfs, but it's not the ideal solution. Thanks for the support.
Majid •
thanks. really cool.
Meritei Covercloud •
Thanks. I hope I am going to be able to implement this successfully.