Technical Notes

 TN0007: How to create a search engine-friendly, very deep, virtual tree of html documents using PHP

Making use of an unintuitive property of the CGI interface specification, it is quite easy to create a virtual tree of HTML documents. The problem is keeping search engines from trying to index this tree.

When a client accesses a web server, the server generally maps the URL to a local filename and serves that file back to the client. To be more flexible, web servers also support dynamic content: they offer means to specify that certain files should not be served verbatim, but should be executed as programs instead, with the output of these programs served back to the client. To provide such programs with data from forms, the CGI standard has been established. In a nutshell, it defines that all the important information about a request is stored in environment variables, then the program is executed, and the output of the program is served back as the result. Currently, the most popular languages for writing these kinds of programs are Perl, PHP and Microsoft's ASP. The end-user can often determine whether a page is dynamically created, and in which language, by looking at the end of the URL: URLs ending in ".pl" are written in Perl, URLs ending in ".php" are written in PHP, and URLs ending in ".asp" are written in ASP.
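To make this concrete, here is a minimal sketch (not part of bomb.php) of how a PHP program running under CGI can read the request data from the environment. The variable names are the ones defined by the CGI specification:

```php
<?php
// Request data arrives in environment variables; getenv() reads them.
// The names REQUEST_METHOD, QUERY_STRING and HTTP_HOST come from the
// CGI specification.
$method = getenv("REQUEST_METHOD"); // e.g. "GET" or "POST"
$query  = getenv("QUERY_STRING");   // the text after "?" in the URL
$host   = getenv("HTTP_HOST");      // the Host: header sent by the client
echo "method=$method query=$query host=$host\n";
?>
```

On the command line these variables are normally unset, so getenv() returns false and the echo prints empty values; under a web server they carry the live request data.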

What is not so commonly known is the fact that the web server also accepts URLs that contain superfluous path components at the end. For example, if the URL http://www.kuckuk.com/bomb/bomb.php properly executes the PHP file bomb.php, the URL http://www.kuckuk.com/bomb/bomb.php/garbage/and/more/garbage will also succeed. The web server stops looking at the first path component that matches a file and passes the rest of the path to the program as specified by CGI: the trailing part is placed in the environment variable PATH_INFO, and the complete request path, including it, can be accessed as part of the environment variable REQUEST_URI.
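As a sketch of what the server does with such a URL, the hypothetical helper below (not part of bomb.php) splits the virtual trailing path off a request URI, mimicking how the server separates the matched script name from the extra path it hands over as PATH_INFO:

```php
<?php
// Hypothetical helper: split the "extra" path components off a request
// URI, given the script name the server actually matched to a file.
function virtual_path($request_uri, $script_name) {
    // If the URI starts with the script name, everything after it is
    // the virtual trailing path.
    if (strpos($request_uri, $script_name) === 0) {
        return substr($request_uri, strlen($script_name));
    }
    return "";
}

echo virtual_path("/bomb/bomb.php/garbage/and/more/garbage", "/bomb/bomb.php");
// prints "/garbage/and/more/garbage"
?>
```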
Soon after the first search engines appeared, a need was felt for a standard on how to tell these search engines which parts of a web site they must not access under any circumstances. This standard is "A Standard for Robot Exclusion". Its main idea is to place a file "robots.txt" in the root directory of each web server that specifies which subdirectories must not be accessed. A sample robots.txt file looks like this:

# Site: http://www.kuckuk.com
# File: "robots.txt"
# Created: October 13, 1997
# Modified: October 27, 2001
# Only parts of this web-site are for public access. 
# ck@kuckuk.com
User-agent: *
Disallow: /cgi-bin
Disallow: /bomb

For more details about writing robots.txt files please refer to the standard.

Robots Meta Tag
If a site is maintained by several people who all work in their own subdirectories, it soon becomes necessary to find a standard that works in these kinds of situations. So the second most important standard is the Robots META tag standard. This standard allows giving information to search engines on a per-document level by adding a so-called meta tag to each document. This meta tag specifies whether a certain page is to be indexed (index) or not (noindex), and whether the links contained in this document are to be followed (follow) or not (nofollow). An example, as used in this page here, is:

<META NAME="GENERATOR" CONTENT="Ten Fingers and one Keyboard">
<meta name="robots" content="index,follow">
<TITLE>TN0007: How to create a search engine-friendly, very deep, virtual
tree of html documents using PHP</TITLE>

Putting it all together
As my server uses the Apache web server and offers PHP, I implemented my program in PHP. To discourage search engines from indexing the virtual pages, I created a robots.txt file and put it in the proper location at http://www.kuckuk.com/robots.txt. Then I put the necessary "noindex,nofollow" robots meta tag right at the top of the created pages. Then I wrote up a pretty boring message incorporating links to virtual pages that begin with the URL of the original page and end with additional parts of the form "/a".."/i". I named the script bomb.php and put it on my server at http://www.kuckuk.com/bomb/bomb.php. Here is the source for that script:

<?php
// File: bomb.php
// Start: 10/21/2001
// Current as of 10/28/2001
// (c) by Carsten Kuckuk, Ludwigsburg, Germany. E-Mail: ck@kuckuk.com
// History:
// 10/21/2001 First Version
// 10/28/2001 Incorporated robots meta tags
// 04/22/2002 Added the creation of random e-mail addresses
?>
<meta name="robots" content="noindex,nofollow">
<title>Logical bomb</title>
<h1>This is a warning</h1>
If you are reading this page, you have ignored several warnings not to follow certain
<?php echo "<a href=\"http://".$HTTP_HOST.$REQUEST_URI."/a\">"; ?>
links</a>. You are specifically forbidden to follow this link
<?php echo "<a href=\"http://".$HTTP_HOST.$REQUEST_URI."/b\">"; ?>
here</a>, the link over 
<?php echo "<a href=\"http://".$HTTP_HOST.$REQUEST_URI."/c\">"; ?>
here</a>, as well as 
<?php echo "<a href=\"http://".$HTTP_HOST.$REQUEST_URI."/d\">"; ?>
this</a> one. If you follow any of them,
you will run deeper and deeper into my logical 
<?php echo "<a href=\"http://".$HTTP_HOST.$REQUEST_URI."/e\">"; ?>
bomb</a>, which is a virtual 
<?php echo "<a href=\"http://".$HTTP_HOST.$REQUEST_URI."/f\">"; ?>
mesh</a> of 
<?php echo "<a href=\"http://".$HTTP_HOST.$REQUEST_URI."/g\">"; ?>
linked</a> pages. You have been warned! Do not follow 
<?php echo "<a href=\"http://".$HTTP_HOST.$REQUEST_URI."/h\">"; ?>
any</a> of these 
<?php echo "<a href=\"http://".$HTTP_HOST.$REQUEST_URI."/i\">"; ?>
links</a>, especially if you are a spider, crawler, or any other automated
pest without intelligence.

<br> <br>

Please also note that it is not advisable to send e-mail to any of the following
e-mail addresses:


<?php
// Build a random host name below our own domain, so the fake
// addresses cannot hit anybody else's mail server.
$nLetters = rand(4,8);
$domainname = "";
for($i=0; $i<$nLetters; $i++)
    $domainname = $domainname.chr(rand(ord('a'),ord('z')));
$domainname = $domainname.".".$HTTP_HOST;

// Emit between 3 and 10 random e-mail addresses at this host.
$nEmailAddresses = rand(3,10);
for($nEmails=0; $nEmails < $nEmailAddresses; $nEmails++)
{
    // Glue a local part together from random syllables ("Silben").
    $AnzSilben = rand(2,4);
    $email = "";
    for($i=0; $i<$AnzSilben; $i++)
        $email = $email.substr("bcdfghklmnprstvwz",rand(0,16),1).substr("aeiou",rand(0,4),1);
    $email = $email."@".$domainname;
    echo "<a href=\"mailto:$email\">$email</a>, \n";
}
?>

as these addresses do not exist.


It's not a beauty, but it works, and it should teach a lesson to the one non-conforming spider that it was written for.
If you receive a lot of spam, sooner or later you want to fight back. A particularly time-consuming problem is finding the relevant e-mail addresses of the people responsible for spam, or of their upstream ISPs. A friend of mine wrote a set of scripts automating this aspect of the war on spam. He has made his solution available at http://www.nowak-sys.de/SCSSP/.
Document History
First Version: October 28, 2001
Second Version: April 22, 2002, Added the creation of random e-mail addresses
Third Version: April 23, 2002, Wifey fixed the layout
Fourth Version: April 24, 2002, restricted the creation of e-mail addresses to subdomains of my domain
Fifth Version: November 27, 2002, added link to SCSSP
If you have any questions, please send e-mail to Carsten Kuckuk at ck@kuckuk.com.