PHP, XHTML MIME type and Caching
May 2005 — Updated Dec 2005, July 2006
Introduction
Using PHP for your XHTML web site is great, but if you don’t consider the possible side effects you could be violating Internet standards and reducing usability for your users. You might not know that your validated and tested XHTML is treated as “tag soup” by browsers, that without the proper headers your pages’ cacheability is severely reduced, or that some browsers (due to bugs) won’t cache your page at all. When using PHP with XHTML, several issues need to be considered.
- XHTML and MIME type. You should be serving XHTML as application/xhtml+xml instead of text/html, but not all clients support it. How do you serve it properly to standards-compliant clients without making the site unusable for outdated ones?
- PHP and Caching. A small amount of work is required to improve the caching ability of your site, reduce bandwidth and improve response time.
- Default Caching of other static items (CSS, images, etc). This requires mod_expires and a few htaccess lines.
- Problems caused by buggy browsers. Internet Explorer has some nasty bugs that persist from 4.x all the way to 6.x. This complicates your job, as you’ll have to work around them. True, we shouldn’t have to do this (it’s the client’s fault, not ours, if it fails to follow standards), but in the real world IE is extremely common, and we don’t want to penalize those users just because a company in Washington can’t (or won’t) follow Internet standards.
This article assumes you have a background in Apache and some basic knowledge of HTTP; if you don’t know what ETag, 304, gzip, If-Modified-Since, If-None-Match, HTTP headers and MIME types are, see the references at the end of this article (as always, Google is your friend).
We’ll also assume you’re using Apache on *nix with a recent PHP (4.3.x) version, are interested in writing valid XHTML, following web standards, and improving bandwidth usage of your site.
Caching
Caching makes your site more responsive to the user, but caching involves two different functions. First, avoiding trips to the server if possible (using cache-control and expires), and second, if we must go to the server, avoid transferring the document (using validation and 304 responses).
Unfortunately, with PHP, caching generally does not work “out of the box”. We’ve got to do a little work to get browsers to cache our documents optimally. Then Apache should be configured so that static documents (CSS, JS, images, etc.) are cached as well. This requires mod_expires and a few simple .htaccess lines.
XHTML and mimetype
W3C says to serve XHTML as application/xhtml+xml, but IE can’t handle it, while modern browsers (Firefox, Opera, etc) can. We’ve got to change the MIME type for modern browsers, while falling back gracefully to outdated browsers like IE. However, we don’t want to “browser sniff” the User-Agent as this is unreliable at best.
Fortunately, the HTTP protocol has a means to do exactly what we want: the Accept header. Each request by a client has the option to specify to the server exactly what it wants. A few examples make this clear.
Firefox 1.0.3 text/xml,application/xml,application/xhtml+xml,text/html;q=0.9, text/plain;q=0.8,image/png,*/*;q=0.5
IE 6.0 SP1 image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, application/x-shockwave-flash, */*
Warning!
You should *never* try out code on your production server. Always have a test or development box for trying out new code. Experimenting on your live server is asking for trouble!
It’s easy to see that Firefox prefers application/xhtml+xml, while IE, which can’t handle it, doesn’t list it. Notice the q=0.9 in the Firefox header? That’s a q-value, which lets a client say what it prefers while also listing what it can accept. Q-values range from 0.000 to 1.000, although most clients use only one decimal place.
Firefox lists application/xhtml+xml with no q-value (which means 1), while text/html has a q-value of 0.9. Thus Firefox prefers application/xhtml+xml (since 1.0 > 0.9), although it can accept text/html. By using the Accept header the client sends us, we avoid problematic browser-sniffing and allow the client to tell us what MIME type it prefers.
Solution (The Code)
Here’s the code; we’ll explain it after you’ve looked at it. Save this as include-mime.php and later we’ll have an example that uses it. You can also grab the php_xhtml.tar.gz file available for download.
<?php
##############################################################################
# XHTML and mimetype script for PHP #
# Copyright (C) 2005 Darrin Yeager #
# All rights reserved. #
# http://www.dyeager.org #
# #
# Redistribution and use in source and binary forms, with or without #
# modification, are permitted provided that the following conditions #
# are met: #
# #
# 1. Redistributions of source code must retain the above copyright #
# notice, this list of conditions and the following disclaimer. #
# #
# 2. Redistributions in binary form must reproduce the above copyright #
# notice, this list of conditions and the following disclaimer in the #
# documentation and/or other materials provided with the distribution. #
# #
# 3. Redistributions of modified versions must carry prominent notices #
# stating that you changed the files and the date of any change. #
# #
# 4. Neither the name of Darrin Yeager nor the names of any contributors #
# may be used to endorse or promote products derived from this software #
# without specific prior written permission. #
# #
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS #
# "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT #
# LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR #
# A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT #
# OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, #
# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED #
# TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR #
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF #
# LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING #
# NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS #
# SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. #
# #
##############################################################################
$charset = "utf-8";
$mime = "text/html";
$is304 = false;
# NOTE: To allow for q-values with one space (text/html; q=0.5),
# use the following regex:
# "/text\/html;[\ ]{0,1}q=([0-1]{0,1}\.\d{0,4})/i"
if((isset($_SERVER["HTTP_ACCEPT"])) && (stristr($_SERVER["HTTP_ACCEPT"],"application/xhtml+xml"))) {
    if(preg_match("/application\/xhtml\+xml;q=([0-1]{0,1}\.\d{0,4})/i",$_SERVER["HTTP_ACCEPT"],$matches)) {
        $xhtml_q = $matches[1];
        if(preg_match("/text\/html;q=([0-1]{0,1}\.\d{0,4})/i",$_SERVER["HTTP_ACCEPT"],$matches)) {
            $html_q = $matches[1];
            if((float)$xhtml_q >= (float)$html_q)
                $mime = "application/xhtml+xml";
        }
    }
    else
        $mime = "application/xhtml+xml";
}
# Get the file stats and read the last-modified time.
$filestats = @stat($_SERVER["SCRIPT_FILENAME"]);
$lastmod = $filestats[9]; # mtime is a Unix timestamp, already UTC-based; no timezone shift needed
# ETag is "inode-lastmodtime-filesize" - See PHP stat function for more detail
$etag = '"' . dechex($filestats[1]) . "-" . dechex($lastmod) . "-" . dechex($filestats[7]) . '"';
# Check HTTP_IF_NONE_MATCH
# and report a 304 Not Modified header if they match.
if (isset ($_SERVER["HTTP_IF_NONE_MATCH"])) {
    if ($etag === stripslashes($_SERVER["HTTP_IF_NONE_MATCH"]))
        $is304 = true;
}
if ($is304) {
    if (isset($_SERVER["SERVER_PROTOCOL"]) && $_SERVER["SERVER_PROTOCOL"] == "HTTP/1.1")
        header("HTTP/1.1 304 Not Modified");
    else
        header("HTTP/1.0 304 Not Modified");
    header("ETag: " . $etag);
    header("Vary: Accept");
    header("Connection: close");
    exit;
}
header("Content-Type: $mime;charset=$charset");
header("Cache-Control: max-age=86400, s-maxage=86400");
header("Vary: Accept");
# If for some reason we didn't get a valid file modification time
# from the stat function, or it errored out, DO NOT send the ETag
# header as it will not be valid. Valid in this sense is defined
# as modified AFTER Dec 24, 1999.
if ($lastmod > 946080000) { # 946080000 = Dec 24, 1999 4PM
    header("ETag: " . $etag);
}
if (DOCTYPE == "strict") { ?>
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<?php } else if (DOCTYPE == "math") { ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1 plus MathML 2.0//EN"
"http://www.w3.org/TR/MathML2/dtd/xhtml-math11-f.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<?php } else { ?>
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<?php } ?>
<head>
<meta http-equiv="Content-Type" content="<?php echo $mime ?>;charset=<?php echo $charset ?>" />
Now that you’ve reviewed the code, let’s examine it line by line and see how it works.
$charset = "utf-8";
$mime = "text/html";
$is304 = false;
Set a few default variables. The default MIME type is text/html — unless it’s been changed by this script later to application/xhtml+xml. The default character set is utf-8, but if you use some other character set, change the variable here. The last variable is just a flag to indicate if we’re going to return a 304 Not Modified header. By default we won’t.
# NOTE: To allow for q-values with one space (text/html; q=0.5),
# use the following regex:
# "/text\/html;[\ ]{0,1}q=([0-1]{0,1}\.\d{0,4})/i"
if((isset($_SERVER["HTTP_ACCEPT"])) && (stristr($_SERVER["HTTP_ACCEPT"],"application/xhtml+xml"))) {
    if(preg_match("/application\/xhtml\+xml;q=([0-1]{0,1}\.\d{0,4})/i",$_SERVER["HTTP_ACCEPT"],$matches)) {
        $xhtml_q = $matches[1];
        if(preg_match("/text\/html;q=([0-1]{0,1}\.\d{0,4})/i",$_SERVER["HTTP_ACCEPT"],$matches)) {
            $html_q = $matches[1];
            if((float)$xhtml_q >= (float)$html_q)
                $mime = "application/xhtml+xml";
        }
    }
    else
        $mime = "application/xhtml+xml";
}
We’ve already discussed the Accept header and what it does, so now we just determine what it contains and whether the client wants application/xhtml+xml or just text/html. The regex is similar to other code on the web, but it captures the whole decimal q-value, not just the first digit after the decimal point, allowing up to 4 decimal places, and the values are then compared as floats, not integers. The code isn’t as bad as it looks; if you strip out the regex parts you get the following algorithm:
If Accept header was sent and application/xhtml+xml exists in it
    If q-value exists for application/xhtml+xml
        get qvalue for text/html (if it exists)
        if q-application/xhtml+xml >= q-text/html
            mimetype = application/xhtml+xml
    else
        mimetype = application/xhtml+xml
else
    mimetype = text/html
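The algorithm above can also be sketched as a small standalone function. This is an illustrative variation rather than the script's own code: the helper names accept_q and choose_mime are hypothetical, wildcards such as */* are ignored, and a type missing from the header is treated as q=0. Unlike the regexes in the script, which require a decimal point, it also reads a bare q=1 or q=0 and tolerates spaces around the semicolon.

```php
<?php
// Hypothetical helper (not part of include-mime.php): return the q-value a
// client assigns to $type in an Accept header, or 0.0 if the type is absent.
// Wildcards such as */* are deliberately ignored in this sketch.
function accept_q($accept, $type)
{
    foreach (explode(',', $accept) as $range) {
        $parts = array_map('trim', explode(';', $range));
        if (strcasecmp($parts[0], $type) !== 0) {
            continue;
        }
        $q = 1.0; // a media type listed without a q parameter means q=1
        foreach (array_slice($parts, 1) as $param) {
            if (preg_match('/^q\s*=\s*([01](?:\.\d{0,3})?)$/i', $param, $m)) {
                $q = (float) $m[1];
            }
        }
        return $q;
    }
    return 0.0;
}

// Choose XHTML only when the client lists it explicitly and rates it at
// least as high as text/html; otherwise fall back to text/html.
function choose_mime($accept)
{
    $xhtml = accept_q($accept, 'application/xhtml+xml');
    $html  = accept_q($accept, 'text/html');
    return ($xhtml > 0 && $xhtml >= $html) ? 'application/xhtml+xml' : 'text/html';
}
```

With the Firefox header quoted earlier this picks application/xhtml+xml; with the IE header it falls back to text/html.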
# Get the file stats and read the last-modified time.
$filestats = @stat($_SERVER["SCRIPT_FILENAME"]);
$lastmod = $filestats[9]; # mtime is a Unix timestamp, already UTC-based; no timezone shift needed
# ETag is "inode-lastmodtime-filesize" - See PHP stat function for more detail
$etag = '"' . dechex($filestats[1]) . "-" . dechex($lastmod) . "-" . dechex($filestats[7]) . '"';
Calculate the ETag that is returned to the client. If any part of the document changes, the ETag must change too. The ETag is an opaque string, so you’re free to build it any way you want. Here we use the *nix inode, the last-modified time, and the file size in bytes, all converted to hex, which is similar to the method Apache uses for static HTML files.
However, if this is not specific enough for your needs, you’ll need to find another way to generate a unique ETag. For example, sites built mainly on databases may use one script for many different pages, so the script’s own inode, size and modification time say nothing about the page being served. If so, you must find an ETag algorithm that is unique per page. This ETag will be returned to us if a client has a copy of the page but doesn’t know whether it’s current. That’s why it must be unique.
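For a database-driven page, one possible approach is to hash the values that identify the rendered page. The sketch below is only an illustration: make_etag is a hypothetical helper, and the article id and timestamp are made-up values standing in for whatever your own schema provides.

```php
<?php
// Hypothetical helper: build a quoted ETag from the values that identify one
// rendered variant of a page. Any change to a part yields a different tag.
function make_etag(array $parts)
{
    return '"' . md5(implode('|', $parts)) . '"';
}

// Illustrative values only: an article id, its last-update timestamp as
// stored in the database, and the MIME type chosen for this client.
$article_id = 42;
$updated_at = 1136073600;
$mime       = 'application/xhtml+xml';

$etag = make_etag(array('article', $article_id, $updated_at, $mime));
```

Including the MIME type keeps the text/html and application/xhtml+xml variants from sharing a tag, which seems prudent because the generated meta element differs between them.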
# Check HTTP_IF_NONE_MATCH
# and report a 304 Not Modified header if they match.
if (isset ($_SERVER["HTTP_IF_NONE_MATCH"])) {
    if ($etag === stripslashes($_SERVER["HTTP_IF_NONE_MATCH"]))
        $is304 = true;
}
if ($is304) {
    if (isset($_SERVER["SERVER_PROTOCOL"]) && $_SERVER["SERVER_PROTOCOL"] == "HTTP/1.1")
        header("HTTP/1.1 304 Not Modified");
    else
        header("HTTP/1.0 304 Not Modified");
    header("ETag: " . $etag);
    header("Vary: Accept");
    header("Connection: close");
    exit;
}
If the client sent us back an ETag for verification, check and see if it’s the same one we just calculated. If it is, we will not need to send the document, just a 304 Not Modified header, and exit the script. No body will be (or needs to be) sent. However, you should send the Vary header so the client knows the response depends on the original Accept header.
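The check above compares the header against a single ETag with ===. HTTP also allows a client to send several ETags separated by commas, a weak tag prefixed with W/, or a bare *. A minimal sketch of a more tolerant comparison follows; etag_matches and strip_weak are hypothetical names, and splitting on commas assumes the ETags themselves contain no commas (true for the hex-and-dash format used here).

```php
<?php
// Hypothetical helper: drop surrounding whitespace and a leading W/ so that
// weak and strong forms of the same tag compare equal.
function strip_weak($tag)
{
    $tag = trim($tag);
    return (strncmp($tag, 'W/', 2) === 0) ? substr($tag, 2) : $tag;
}

// Hypothetical helper: true when the raw If-None-Match value in $header
// lists $etag, or is the wildcard "*".
function etag_matches($header, $etag)
{
    if (trim($header) === '*') {
        return true;
    }
    foreach (explode(',', $header) as $candidate) {
        if (strip_weak($candidate) === strip_weak($etag)) {
            return true;
        }
    }
    return false;
}
```

In the script, a call such as etag_matches($_SERVER["HTTP_IF_NONE_MATCH"], $etag) could stand in for the === comparison.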
header("Content-Type: $mime;charset=$charset");
header("Cache-Control: max-age=86400, s-maxage=86400");
header("Vary: Accept");
# If for some reason we didn't get a valid file modification time
# from the stat function, or it errored out, DO NOT send the ETag
# header as it will not be valid. Valid in this sense is defined
# as modified AFTER Dec 24, 1999.
if ($lastmod > 946080000) { # 946080000 = Dec 24, 1999 4PM
    header("ETag: " . $etag);
}
We’ve now obtained a valid ETag, determined the client needs the full body of the page, and are ready to send it. We set a header with the correct MIME type we’ve determined earlier and a header for caching. The cache-control header allows clients and proxies to store this page for 1 day and use it as fresh, without checking with the server again.
The Vary header tells the client and any proxy that the content can vary depending on what the client can Accept. Unfortunately, IE has a long-standing bug with regard to the Vary header which prevents caching any content with a Vary header other than User-Agent. We’ll work around this bug by enabling gzip compression of the page later in Apache, which forces IE to work as it should. If you don’t enable compression, IE will never locally cache your page. This bug has existed from IE 4.x through 6.x. Some day Microsoft might fix it. (July 2006: see the update at the end for IE7.)
if (DOCTYPE == "strict") { ?>
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<?php } else if (DOCTYPE == "math") { ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1 plus MathML 2.0//EN"
"http://www.w3.org/TR/MathML2/dtd/xhtml-math11-f.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<?php } else { ?>
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<?php } ?>
<head>
<meta http-equiv="Content-Type" content="<?php echo $mime ?>;charset=<?php echo $charset ?>" />
Now we send a correct DOCTYPE and output a META tag with the MIME type and the correct character set. Which DOCTYPE is used depends on the DOCTYPE constant defined in your PHP page. The remainder of your script follows as normal.
Configure Apache
Our PHP page is now set up to be cached for a period of our choosing and to respond to validation requests. But we still must deal with static content. Apache will do the heavy lifting for us, as long as we tell it what we want. This is simple: to configure Apache for cache control of static content, add lines like the following to .htaccess or put them in your httpd.conf.
ExpiresActive On
ExpiresByType text/css A1209600
ExpiresByType image/png A2592000
ExpiresByType image/gif A2592000
ExpiresByType image/x-icon A5184000
php_flag zlib.output_compression On
php_value zlib.output_compression_level 1
You must have mod_expires in your Apache (most do) for this to work. Using mod_expires you can set default cache retention times for various types of static files. The A prefix means the lifetime is counted from the time the client accessed the file, and the number is that lifetime in seconds. So A86400 means the content is valid for one day; the client can just use it from the cache without checking the server again, and nothing needs to be sent over the network for one day. The values above give CSS 14 days, PNG and GIF images 30 days, and icons 60 days. Naturally, you’ll want to fine-tune these numbers for your own use.
The compression level, just as with gzip, can be set from 1 to 9. Level 1 is the lowest and 9 the highest. However, going from 1 to 9 dramatically increases the CPU time needed to compress while yielding little size improvement. For example, at level 1 you might get 60% compression, while level 9 might reach 67% or so, at the cost of much more CPU time. Since web servers can be busy, leave it at level 1 to make life easy for your CPU.
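To see the size trade-off on your own markup, a quick measurement like the following may help. It is only a sketch: gzip_sizes is a hypothetical helper, the sample string is made up, and it needs PHP's zlib extension, which provides gzencode. It measures size only, not CPU time.

```php
<?php
// Hypothetical helper: return the gzip-compressed size of $data, in bytes,
// for each requested compression level (1 = fastest, 9 = smallest).
function gzip_sizes($data, array $levels)
{
    $sizes = array();
    foreach ($levels as $level) {
        $sizes[$level] = strlen(gzencode($data, $level));
    }
    return $sizes;
}

// Made-up sample; substitute the output of one of your own pages.
$sample = str_repeat("<p>Some repetitive XHTML markup for testing.</p>\n", 200);
$sizes  = gzip_sizes($sample, array(1, 9));

printf("original: %d bytes, level 1: %d bytes, level 9: %d bytes\n",
       strlen($sample), $sizes[1], $sizes[9]);
```

Real pages are far less repetitive than this sample, so measure your own output before drawing conclusions.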
It’s worth mentioning again: IE has a bug with regard to the Vary header. We MUST enable gzip or IE will never cache our page. Other browsers (Firefox, Opera, etc.) handle this correctly; only IE has the difficulty. Gzip support is advertised in the Accept-Encoding header sent by the client. If that header indicates the client can handle gzip-compressed content, PHP does all the work, including sending the appropriate Vary: Accept-Encoding header. If the client indicates it can’t handle compressed content, PHP sends the page uncompressed.
Example
<?php
# to use XHTML 1.0 Strict DTD
define("DOCTYPE","strict");
# to use XHTML 1.1 MathML DTD
#define("DOCTYPE","math");
# if DOCTYPE is not defined, the transitional
# XHTML 1.0 DTD will be used.
require_once("include-mime.php");
?>
<title>Test Page</title>
</head>
<body>
<h1>Test Page</h1>
<h2>Server Variables</h2>
<pre>
<?php
foreach ($_SERVER as $key => $value)
print "$key: $value\n";
?>
</pre>
</body>
</html>
Save this file and try it! By changing the DOCTYPE constant, you can get the strict, transitional or MathML DTD.
Conclusion
That’s it. You’ll need to modify the cache times for your site, but with GZIP compression, appropriate cache-control and 304 Not Modified responses, you should see bandwidth reduction, and your users will see a snappier responding web site. We’ve also worked around Microsoft’s crippled browser.
Update on If-Modified-Since (Dec 2005)
Someone noticed I don’t mention If-Modified-Since in this code. The reason is two-fold. First, it’s the older protocol, and most modern browsers should support ETags. Second, it’s not as flexible. For example, suppose you’re dealing with a blog — how do you handle last modified date? You could use the last article date (which is reasonable), but that presents you with a problem. How do you deal with different pages being served depending on whether the user is logged in or not? The article dates are the same, yet the page served is different. This situation is difficult to deal with using only dates, but with ETags it’s easy — just add some reference (like user ID) to your ETag, and the browser cache will work correctly.
If you want to modify the code on this page to handle If-Modified-Since it’s not really that hard, and is left as an exercise for the interested reader.
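One possible starting point for that exercise is sketched below. It is an assumption-laden outline, not part of the original script: not_modified_since and format_http_date are hypothetical helper names, and $lastmod is taken to be a Unix timestamp such as the file's mtime.

```php
<?php
// Hypothetical helper: true when the client's cached copy is still current.
// $header is the raw If-Modified-Since value; $lastmod is a Unix timestamp.
function not_modified_since($header, $lastmod)
{
    // Some old clients append "; length=NNN"; drop anything after a semicolon.
    $semi = strpos($header, ';');
    if ($semi !== false) {
        $header = substr($header, 0, $semi);
    }
    $since = strtotime(trim($header));
    if ($since === false) { // unparseable date (PHP 5.1+ returns false)
        return false;
    }
    return $lastmod <= $since;
}

// Hypothetical helper: format a timestamp for the Last-Modified header.
function format_http_date($timestamp)
{
    return gmdate('D, d M Y H:i:s', $timestamp) . ' GMT';
}
```

In the script you would send header("Last-Modified: " . format_http_date($lastmod)) with the full response, and answer with a 304 when not_modified_since($_SERVER["HTTP_IF_MODIFIED_SINCE"], $lastmod) is true and the client sent no If-None-Match header, since the ETag check takes precedence when both are present.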
Update on IE7 (July 2006)
Unfortunately, testing with IE7 beta 3 reveals Microsoft still hasn’t fixed the caching bug (at least in the betas, maybe in the final release?). IE 7 won't have application/xhtml+xml support either. Or a truly fixed CSS implementation. Sigh. It appears IE7, instead of truly being a step forward (after 5 years since IE 6), should really be considered IE6 SP2.
We can only hope IE8 fixes these bugs and truly supports XHTML and CSS. After all, the standards have been around for a decade, and Microsoft certainly has the resources (if not the desire) to fully implement web standards.
License Change (September 2006)
Previously released under GPL, this code is now under a License similar to BSD. The actual license is in the code on this page.
The download file contains the GPL license. You may choose either the GPL license in the download file or the BSD-style license contained in the code on this page. It’s your option.
References
Caching Introduction
http://www.mnot.net/cache_docs/
Internet Explorer Vary bug
http://lists.over.net/pipermail/mod_gzip/2002-December/006826.html
http://www.sitepoint.com/forums/printthread.php?t=158442
Serving XHTML with the correct MIME type
http://www.w3.org/TR/xhtml-media-types/
http://hixie.ch/advocacy/xhtml
PHP Examples for mimetype
http://www.xml.com/pub/a/2003/03/19/dive-into-xml.html
http://keystonewebsites.com/articles/mime_type.php
PHP Examples for 304 header
http://simon.incutio.com/archive/2003/04/23/conditionalGet
http://alexandre.alapetite.net/doc-alex/php-http-304/index.en.html
Copyright © 1999-2008 Darrin Yeager. http://www.dyeager.org
This page is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License. In summary, you are free to share (copy and distribute) the work under the following conditions (see the actual license for more information):
- Attribution. You must attribute the work to the author (but not in any way that suggests that they endorse you or your use of the work). Attribution should refer back to this web page and include a copyright notice and the license terms.
- Noncommercial. You may not use this work for commercial purposes.
- No Derivative Works. You may not alter, transform, or build upon this work.