TN0007: How to create a search engine-friendly, very deep, virtual
tree of html documents using PHP
Making use of an unintuitive property of the CGI specification,
it is quite easy to create a virtual tree of HTML documents. The problem
is keeping search engines from trying to index this tree.
When a client accesses a web server, the web server generally maps the
URL to a local filename and serves that file back to the client. To be
more flexible, web servers also allow the creation of dynamic content:
certain files can be marked so that they are not served verbatim but are
executed as programs, and the output of these programs is served back to
the client. The CGI standard was established to provide these programs
with request data such as form input. In a nutshell, it defines that all the
important information about a request is stored in environment variables,
then the program is executed, and the output of the program is served back
as the result. Currently, the most popular languages for writing these kinds
of programs are Perl, PHP and Microsoft's ASP. The end user can often tell
whether a page is created dynamically, and in which language, by looking
at the end of the URL: URLs ending in ".pl" usually point to Perl scripts,
URLs ending in ".php" to PHP scripts, and URLs ending in ".asp" to
ASP pages.
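As a minimal sketch of that hand-over, the PHP fragment below treats a plain array as a stand-in for the environment the web server would set up. The function name form_fields and all the values are made up for illustration; a real PHP script would read the same names from $_SERVER or getenv().

```php
<?php
// Sketch: what a CGI-style program sees. The array stands in for the
// environment variables set by the web server (values are invented).
function form_fields(array $env): array
{
    $fields = [];
    // For GET requests, QUERY_STRING carries the URL-encoded form data.
    parse_str($env['QUERY_STRING'] ?? '', $fields);
    return $fields;
}

$env = [
    'REQUEST_METHOD' => 'GET',
    'SCRIPT_NAME'    => '/bomb/bomb.php',
    'QUERY_STRING'   => 'name=Ada&lang=php',
];

echo $env['REQUEST_METHOD'], ' ', $env['SCRIPT_NAME'], "\n";
foreach (form_fields($env) as $key => $value) {
    echo $key, ' = ', $value, "\n";
}
```

Whatever the program echoes is what the server sends back to the client as the response body.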
What is less commonly known is that the web server also accepts
URLs that contain superfluous path components at the end. For example, if
the URL http://www.kuckuk.com/bomb/bomb.php
properly executes the PHP file bomb.php, the URL http://www.kuckuk.com/bomb/bomb.php/garbage/and/more/garbage
will also succeed. The web server stops searching at the first path
component that matches a file, and passes the rest of the path to that
program as laid down in the CGI specification. The program can read it as
part of the environment variable REQUEST_URI.
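The following sketch shows one way a script could separate the superfluous components from its own name. The helper names extra_path and path_segments are hypothetical, and the script name is passed in explicitly (in a live request it would typically come from SCRIPT_NAME). The CGI specification also defines PATH_INFO for this trailing part, but whether it is populated depends on the server configuration, so the sketch sticks to REQUEST_URI.

```php
<?php
// Sketch: return the part of the request URI that follows the script name.
function extra_path(string $requestUri, string $scriptName): string
{
    // Ignore a query string, if there is one.
    $path = explode('?', $requestUri, 2)[0];
    if (strncmp($path, $scriptName, strlen($scriptName)) !== 0) {
        return ''; // the URI does not start with this script
    }
    return (string) substr($path, strlen($scriptName));
}

// Split the trailing path into its non-empty components.
function path_segments(string $extra): array
{
    return array_values(array_filter(
        explode('/', $extra),
        fn (string $segment): bool => $segment !== ''
    ));
}

$extra = extra_path('/bomb/bomb.php/garbage/and/more/garbage', '/bomb/bomb.php');
echo $extra, "\n";                       // /garbage/and/more/garbage
echo count(path_segments($extra)), "\n"; // 4
```

The number of components is the depth of the requested page in the virtual tree.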
Soon after the first search engines appeared, a need was felt for a standard
way to tell these search engines which parts of a web site they must
not access under any circumstances. This standard is "A
Standard for Robot Exclusion". Its main idea is to create a file "robots.txt"
in the root directory of each web server that specifies which subdirectories
must not be accessed. A sample robots.txt file might look like this:
# Site: http://www.kuckuk.com
# File: "robots.txt"
# Created: October 13, 1997
# Modified: October 27, 2001
# Only parts of this web-site are for public access.
For more details about writing robots.txt files please refer to the standard.
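To illustrate how a well-behaved robot applies such a file, here is a small sketch of the prefix matching the standard describes: a URL path is off limits if it starts with the value of any applicable Disallow line. The function name is_disallowed and the rule list are hypothetical; the list is an assumption for illustration, not the content of the real robots.txt on this site.

```php
<?php
// Sketch: prefix matching as described in the robot exclusion standard.
// $disallowPrefixes holds the values of the Disallow lines that apply
// to the robot in question.
function is_disallowed(string $path, array $disallowPrefixes): bool
{
    foreach ($disallowPrefixes as $prefix) {
        // An empty Disallow value means that nothing is disallowed.
        if ($prefix !== '' && strncmp($path, $prefix, strlen($prefix)) === 0) {
            return true;
        }
    }
    return false;
}

$rules = ['/bomb/']; // hypothetical rule set

var_dump(is_disallowed('/bomb/bomb.php/a/b', $rules)); // bool(true)
var_dump(is_disallowed('/index.html', $rules));        // bool(false)
```

Because the match is a plain prefix test, a single Disallow line for a directory covers every virtual page below it, however deep the tree gets.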
Robots Meta Tag
If a site is maintained by several people who each work in their own subdirectories,
it soon becomes necessary to have a standard that works in these kinds of
situations. The second most important standard is therefore the Robots
META tag standard. It lets each document give instructions to search
engines by means of a so-called meta tag.
This meta tag specifies whether the page is to be indexed (index) or not
(noindex), and whether the links contained in the document are to be followed
(follow) or not (nofollow). An example, as used in this page, is:
<META NAME="GENERATOR" CONTENT="Ten Fingers and one Keyboard">
<meta name="robots" content="index,follow">
<TITLE>TN0007: How to create a search engine-friendly, very deep, virtual
tree of html documents using PHP</TITLE>
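Since the tag has only four meaningful combinations, a script can build it from two flags. The helper below is a sketch; the name robots_meta is made up for this example.

```php
<?php
// Sketch: build the robots meta tag from two yes/no decisions.
function robots_meta(bool $index, bool $follow): string
{
    $content = ($index ? 'index' : 'noindex')
        . ','
        . ($follow ? 'follow' : 'nofollow');
    return '<meta name="robots" content="' . $content . '">';
}

echo robots_meta(true, true), "\n";   // <meta name="robots" content="index,follow">
echo robots_meta(false, false), "\n"; // <meta name="robots" content="noindex,nofollow">
```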
Putting it all together
As my server uses the Apache web server and offers PHP, I implemented my
program using PHP. In order to discourage search engines from indexing the
virtual pages, I have created a robots.txt file and put it in the proper
location at http://www.kuckuk.com/robots.txt.
Then I put the necessary "noindex,nofollow" robots meta tag right
at the top of the created pages. Then I wrote up a pretty boring message
incorporating links to virtual pages that begin with the URL of the original
page and end with additional parts of the form "/a".."/i".
I named the script bomb.php and put it on my server at http://www.kuckuk.com/bomb/bomb.php.
Here is the source for that script:
// File: bomb.php
// Start: 10/21/2001
// Current as of 10/28/2001
// (c) by Carsten Kuckuk, Ludwigsburg, Germany. E-Mail: firstname.lastname@example.org
// 10/21/2001 First Version
// 10/28/2001 Incorporated robots meta tags
// 04/22/2002 Added the creation of random e-mail addresses
<meta name="robots" content="noindex,nofollow">
<h1>This is a warning</h1>
If you are reading this page, you have ignored several warnings not to follow certain
<?php echo "<a href=\"http://".$HTTP_HOST.$REQUEST_URI."/a\">"; ?>
links</a>. You are specifically forbidden to follow this link
<?php echo "<a href=\"http://".$HTTP_HOST.$REQUEST_URI."/b\">"; ?>
here</a>, the link over
<?php echo "<a href=\"http://".$HTTP_HOST.$REQUEST_URI."/c\">"; ?>
here</a>, as well as
<?php echo "<a href=\"http://".$HTTP_HOST.$REQUEST_URI."/d\">"; ?>
this</a> one. If you follow any of them,
you will run deeper and deeper into my logical
<?php echo "<a href=\"http://".$HTTP_HOST.$REQUEST_URI."/e\">"; ?>
which is a virtual
<?php echo "<a href=\"http://".$HTTP_HOST.$REQUEST_URI."/f\">"; ?>
<?php echo "<a href=\"http://".$HTTP_HOST.$REQUEST_URI."/g\">"; ?>
pages. You have been warned! Do not follow
<?php echo "<a href=\"http://".$HTTP_HOST.$REQUEST_URI."/h\">"; ?>
<?php echo "<a href=\"http://".$HTTP_HOST.$REQUEST_URI."/i\">"; ?>
links</a>, especially if you are a spider, crawler, or any other automated
pest without intelligence.
Please also note, that it is not advisable to send e-mail to any of the following
for($i=0; $i<$nLetters; $i++)
$domainname = $domainname.$HTTP_HOST;
$nEmailAddresses = rand(3,10);
for($nEmails=0; $nEmails < $nEmailAddresses; $nEmails++)
for($i=0; $i<$AnzSilben; $i++)
echo "<a href=\"mailto:$email\">$email</a>, \n";
as these addresses do not exist.
It's not a beauty, but it works, and it should teach a lesson to the one
non-conforming spider that it was written for.
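The listing relies on $HTTP_HOST and $REQUEST_URI being available as global variables, which current PHP versions no longer provide; the same values are read from $_SERVER instead. The sketch below restates the two generated parts of the page, the nine child links and the throwaway addresses, under that assumption. The function names, the word lengths and the fallback host www.example.com are invented for illustration and are not taken from the original script.

```php
<?php
// Sketch: the nine child URLs "/a".."/i" below the current page.
function child_links(string $host, string $requestUri): array
{
    $base = 'http://' . $host . rtrim($requestUri, '/');
    $links = [];
    foreach (range('a', 'i') as $letter) {
        $links[] = $base . '/' . $letter;
    }
    return $links;
}

// Random lowercase word of $minLen to $maxLen letters.
function random_word(int $minLen, int $maxLen): string
{
    $length = random_int($minLen, $maxLen);
    $word = '';
    for ($i = 0; $i < $length; $i++) {
        $word .= chr(random_int(ord('a'), ord('z')));
    }
    return $word;
}

// Random address in a random subdomain of the serving host, so that
// no third-party domain is ever targeted.
function random_address(string $host): string
{
    return random_word(3, 8) . '@' . random_word(3, 8) . '.' . $host;
}

$host = $_SERVER['HTTP_HOST'] ?? 'www.example.com';
$uri  = $_SERVER['REQUEST_URI'] ?? '/bomb/bomb.php';

foreach (child_links($host, $uri) as $url) {
    echo '<a href="' . htmlspecialchars($url) . '">link</a>' . "\n";
}

$count = random_int(3, 10);
for ($n = 0; $n < $count; $n++) {
    $email = htmlspecialchars(random_address($host));
    echo '<a href="mailto:' . $email . '">' . $email . "</a>,\n";
}
```

Every generated page links to nine further pages, so a robot that ignores both robots.txt and the meta tag faces 9^n URLs at depth n.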
If you receive a lot of spam,
sooner or later you want to fight back.
A particularly time consuming problem is to find out the relevant
e-mail addresses of the people responsible for spam or of their
upstream ISPs. A friend of mine wrote a set of scripts automating
this aspect of the war on spam. He has made his solution available
First Version: October 28, 2001
Second Version: April 22, 2002, Added the creation of random e-mail addresses
Third Version: April 23, 2002, Wifey fixed the layout
Fourth Version: April 24, 2002, restricted the creation of e-mail addresses to subdomains of my domain
Fifth Version: November 27, 2002, added link to SCSSP
If you have any questions, please send e-mail to Carsten