Avoiding Pitfalls: A Comprehensive Guide To Firewalls


  

In the construction industry, a “firewall” is a specially-built wall designed to stop a fire from spreading between sections of a building. The term spread to other industries like car manufacturing, and in the late 1980s it made its way into computing. On one side of the wall is the seething electronic chaos of the Internet. On the other side is your powerful but vulnerable Web server.

These computer firewalls are actually more like fire doors because they have to let some stuff through. They monitor all the electronic traffic coming in and out of a network. They follow a strict set of rules to determine what is allowed and what is blocked.

This article explains in more detail how they work, the different types of firewalls available, what they are good at and not so good at, and how to configure them to protect a typical Web server.

Protocols And Ports

A computer is like a big housing complex. Every computer on the Internet has a numerical address, known as an IP address. At each address are two very large blocks of apartments. Each block contains 65,535 individual apartments. The vast majority of them are empty, but a few, especially the lower ones, have very communicative residents. All communication is by mail.

A computer is like two big blocks of apartments. Ten points if you can identify these apartment buildings.

The two blocks are called TCP and UDP. Residents of the TCP block only accept certified mail, and they are guaranteed to reply to you. Once you’ve started exchanging letters with someone in TCP, you can be sure they’ll see the conversation through to the end. The UDP block is a bit shabbier. Its residents only reply if they can be bothered. They usually do, but there are no guarantees.

Each resident has a different job. For example, the family in apartment 80 of the TCP block handles website inquiries. You can write to them with a request like “send me the home page for www.smashingmagazine.com” and they will duly reply and send back the data.

The couple in UDP apartment 53 are responsible for DNS. They translate domain names into IP addresses. Sometimes letters get lost, but that’s okay. The information they handle in UDP is not as critical. The sender can always ask again.

All addresses in the postal system therefore have three parts: computer IP address, block and apartment. So as an example: 80.72.139.101, TCP, 80. In reality, the block is the protocol, and the apartment is the port number. TCP stands for Transmission Control Protocol and UDP is User Datagram Protocol. Their main difference is that TCP creates and maintains a conversation (i.e. connection) between the two computers, whereas UDP does not. Therefore TCP is more reliable but slower.
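As a rough illustration, the three-part address can be modelled as a small object. The function name parseEndpoint and its field names are invented for this sketch and are not part of any real networking API.

```javascript
// Hypothetical helper: split "IP, PROTOCOL, PORT" into its three parts.
function parseEndpoint(text) {
  const parts = text.split(",").map(function (s) { return s.trim(); });
  if (parts.length !== 3) throw new Error("Expected: IP, protocol, port");

  const ip = parts[0];
  const protocol = parts[1].toUpperCase();
  const port = Number(parts[2]);

  // An IPv4 address is four numbers between 0 and 255.
  const octets = ip.split(".");
  const ipOk = octets.length === 4 && octets.every(function (o) {
    return /^\d+$/.test(o) && Number(o) <= 255;
  });
  if (!ipOk) throw new Error("Bad IP address: " + ip);

  // The block is the protocol.
  if (protocol !== "TCP" && protocol !== "UDP") {
    throw new Error("Bad block: " + protocol);
  }

  // The apartment is the port number, which fits in 16 bits.
  if (!/^\d+$/.test(parts[2]) || port > 65535) {
    throw new Error("Bad port: " + parts[2]);
  }

  return { ip: ip, protocol: protocol, port: port };
}

console.log(parseEndpoint("80.72.139.101, TCP, 80"));
```

Anything that does not have exactly these three parts is rejected, which mirrors the idea that a letter without a complete address cannot be delivered.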

Sender’s Address

In the example above, the web-serving family at 80.72.139.101, TCP, 80 are sitting around reading books, just waiting for letters. They never start their own conversations or connections. They are always on the receiving end.

But the sender also has an address. Whenever you browse for a Web page in your Web browser or on your smart phone, your computer assigns you an assistant in one of its apartments, generally one that is way up high in the building. A typical conversation between your Web browser and a Web server somewhere might look like this:

  • From 99.99.99.99, TCP, 63454: “Dear 80.72.139.101, TCP, 80, It has come to my attention that you handle enquiries relating to the website www.smashingmagazine.com. Can you please send me the /books/ page? Yours truly, A Web Browser.”
  • From 80.72.139.101, TCP, 80: “Sure no problem. Here it is: <!DOCTYPE html> <html>…”
  • From 99.99.99.99, TCP, 63454: “Thanks. That’s all. Bye.”

Apartment numbers below 1024 are more secure than the higher numbers. They have CCTV, and they are reserved for privileged tasks like website and FTP serving. The higher apartments are more ephemeral. People are moving in and out all the time. They are used for things like requesting Web pages and initiating FTP connections.

Danger

But not everyone is on the straight and narrow. Here is a sample conversation between a dodgy client living above a café and a trusting SSH resident on a vulnerable server. SSH is a method for connecting to a remote computer and running commands on it. The SSH resident always lives in the TCP block, usually in apartment 22:

  • From 88.88.88.88, TCP, 58123: “Dear SSH server, I want to establish a connection with you. I would like to log in as the user called root.”
  • From 80.72.139.101, TCP, 22: “Sure no problem, what’s your password?”
  • From 88.88.88.88: “smith”
  • From 80.72.139.101: “That’s incorrect. Try again.”
  • From 88.88.88.88: “jones”
  • From 80.72.139.101: “Still wrong. Try again.”
  • From 88.88.88.88: “bloggs”
  • From 80.72.139.101: “Great. Hi Mr. Root, long time, no see. What do you want to do today?”
  • From 88.88.88.88: “I think I’ll view the file /etc/passwd, the one with all the user names.”
  • From 80.72.139.101: “Here’s the information you requested… Anything else?”
  • From 88.88.88.88: “Yes, show me all files containing the words ‘credit card’.”
  • From 80.72.139.101: “Sorry. That took a while. But here you go:…”
  • From 88.88.88.88: “Thanks, that’s all, bye.”

The first part of the SSH conversation above: logging in as root, getting the password wrong twice, asking for a file.

Levels Of Protection

The server depicted above is vulnerable. Anybody can send a letter to 80.72.139.101, TCP, 22. If they can guess the password correctly, then they have full access to the server and all its files. This section discusses four basic levels of protection which you can apply to a server to prevent this, and to only allow trusted people to communicate with your server.

Although SSH is just one of many services that can run on a server, it is a good one to start with because it offers the most control over the server. If a hacker breaks into your FTP or SMTP server, they can do damage, but not as much as with SSH.

Good Passwords

SSH does have its own built-in protection because it requires a user name and password. A first basic step in securing a server is to choose complicated hard-to-guess passwords.

Whenever the resident in apartment 22 receives a letter, he first asks for the correct password. If you can’t get it right, then he won’t help you. He’s a trusting person though. You can usually have as many tries as you’d like. Every three tries, you might have to resend your initial opening letter, but he generally won’t mind.

Refusing Connections

The SSH conversation above was partly faked. If you really try to SSH to 80.72.139.101 (the IP address of Smashing Magazine’s Web server), you will get this:

SSH connection refused.

This means that the resident in 80.72.139.101, TCP, apartment 22, is receiving his mail, but is immediately sending it back. He has a piece of paper taped to the back of his door which lists all the people who he is allowed to correspond with. He checks the sender’s address against the list. If you’re not on it, he refuses to engage in any correspondence. But he does send a polite note back which says, “Sorry, connection refused.”

This kind of protection is provided by software like TCP Wrappers on UNIX servers. The blocking is done at an application level using the files /etc/hosts.deny and /etc/hosts.allow.

With this type of blocking, it is still conceivable that a particularly cunning sender could convince the resident in apartment 22 to open a letter and read it. Also, it relies on the individual residents to take note of and obey the lists. Although the guy in apartment 22 is very conscientious, other residents may be less so.

Software Firewall

A software firewall is like a concierge who filters all the incoming mail before it is even distributed to the residents. He has a similar piece of paper on his desk. He checks the sender’s address against the list. If your address isn’t on the list, your letter goes straight in the bin. No follow-up or polite apology. Nothing. You wait around for a couple weeks and then give up. He also filters outgoing mail, consulting his list to see which residents are allowed to write to the outside world.

Technically it looks like this:

SSH connection timed out: either there is nobody home or there is a firewall in the way.

This type of blocking is done at the operating system level. A software firewall makes sure the letter never even gets there. See this page for a nice explanation of the difference between TCP Wrappers and software firewalls.
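The difference between the two kinds of blocking can be sketched from the sender's point of view. This is a toy model, not how TCP Wrappers or any real firewall is implemented; the function names and the single trusted address are assumptions made for the example.

```javascript
// Hypothetical model: what a sender sees at each level of protection.
const trustedSenders = ["99.99.99.99"]; // the list taped to the door or kept on the desk

// Application-level blocking (TCP Wrappers style): the letter reaches
// apartment 22, and the resident sends back a polite refusal.
function residentCheck(senderIp) {
  return trustedSenders.includes(senderIp) ? "connected" : "connection refused";
}

// Software firewall: the concierge bins the letter before it reaches the
// resident, so the sender hears nothing and eventually gives up.
function conciergeCheck(senderIp) {
  if (!trustedSenders.includes(senderIp)) return "connection timed out";
  return residentCheck(senderIp);
}

console.log(residentCheck("88.88.88.88"));  // connection refused
console.log(conciergeCheck("88.88.88.88")); // connection timed out
console.log(conciergeCheck("99.99.99.99")); // connected
```

In the first case the untrusted sender at least learns that somebody is home. In the second, the sender cannot tell a firewall from an empty apartment.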

Hardware Firewall

A hardware firewall is like having a well-trained security expert in your local post office. She has a list similar to the concierge’s, detailing who is allowed to send letters to whom. She might be protecting just your IP address, or she might be working for dozens of addresses at the same time, or maybe even the whole community. She scrutinizes every single bit of mail going into and out of your town. Anything she doesn’t like, she chucks. As above, you’ll just get a “connection timed out” message.

This is the ultimate in firewall security. The letters don’t even make it to the front gate. There is no chance at all that the concierge will mistakenly pass on a letter to the resident in apartment 22.

Of course, you can implement all these levels of protection. And you can have multiple hardware firewalls. You could have one at your local post office, another at the bigger sorting office in the city and another that filters all mail coming into your county, state or country.

Note that in practice, a “hardware firewall” is just a computer which is wholly dedicated to being a firewall, and is physically separated from your Web server. The firewall itself is still a piece of software on that computer.

Stateless and Stateful Firewalls

The concierge and the security person filter incoming and outgoing mail. If they are new on the job, they are like stateless firewalls, i.e. packet filters. They treat all mail equally. When you send a letter to apartment 80, apartment 80 will send a letter back. That reply is checked against their lists. Your letter only gets through if mail from apartment 80 is allowed out.

After a while, they move up the pay scale and become stateful firewalls. They are trained to differentiate the outgoing mail between brand new outgoing letters and replies to previous letters. Consequently, their lists of rules can be a lot shorter. They can be told to deliver all letters that are part of an established correspondence, and throw away almost all new outgoing mail. So when you send a letter to apartment 80, the reply from apartment 80 is automatically allowed back through. But if on some lonely winter’s evening, apartment 80 suddenly decides to start up its own letter writing campaign, it won’t get through.
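A stateful firewall can be sketched as a filter that remembers which conversations it has already let in. The sketch below is a deliberate simplification: real firewalls also track things like connection state and timeouts, and the names used here (createStatefulFirewall, incoming, outgoing) are invented for illustration.

```javascript
// Hypothetical stateful filter for a server whose only rule is
// "new incoming letters to the listed TCP apartments are allowed".
function createStatefulFirewall(allowedIncomingPorts) {
  const established = new Set(); // conversations that have already been let in

  function key(remoteIp, remotePort, serverPort) {
    return remoteIp + ":" + remotePort + "->" + serverPort;
  }

  return {
    // A letter arriving from the outside world.
    incoming: function (remoteIp, remotePort, serverPort) {
      if (!allowedIncomingPorts.includes(serverPort)) return "drop";
      established.add(key(remoteIp, remotePort, serverPort));
      return "deliver";
    },
    // A letter leaving the server: only replies to known conversations pass.
    outgoing: function (serverPort, remoteIp, remotePort) {
      return established.has(key(remoteIp, remotePort, serverPort)) ? "deliver" : "drop";
    }
  };
}

const firewall = createStatefulFirewall([80]);
console.log(firewall.incoming("99.99.99.99", 63454, 80)); // request for a Web page
console.log(firewall.outgoing(80, "99.99.99.99", 63454)); // the reply
console.log(firewall.outgoing(80, "88.88.88.88", 58123)); // unsolicited letter
```

The request and its reply are delivered, while the unsolicited letter from apartment 80 is dropped, without any explicit rule about outgoing mail.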

The next section looks at how the piece of paper with all the rules is formatted.

Firewall Rules

Whether you use a software or hardware firewall, you will hopefully get some sort of visual management tool. There are many firewalls available, each with its own way of doing things, but they generally stick to the same concepts. They mainly vary in their complexity. Some give you complete control but also require a lot of knowledge and effort. Others are simple but far less flexible.

This section introduces the terminology and shows two examples of firewall configuration interfaces.

Firewall Terminology

This is what the piece of paper on the concierge’s desk might look like:

  1. Letters from any sender in any apartment to the TCP building, apartment 80 – deliver; replies to those letters from apartment 80 – deliver
  2. Letters from the address 99.99.99.99, any apartment number, to the TCP building, apartment 22 – deliver; replies to those letters from apartment 22 – deliver
  3. Any other letters to or from the TCP building – throw away
  4. Any other letters to or from the UDP building – throw away

Whenever the concierge receives a new letter, he compares the sender’s and recipient’s addresses to each of these rules in order. When he finds a matching rule, he follows the instruction, which is to either deliver the letter or throw it away.
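The concierge's procedure is a first-match search through an ordered list. Here is a minimal sketch of it that only looks at new incoming letters and assumes replies are handled as described in the rules. The field names are made up for this example.

```javascript
// The concierge's four rules, in order. "any" matches everything.
const rules = [
  { protocol: "TCP", remoteIp: "any",         serverPort: 80,    action: "deliver" },
  { protocol: "TCP", remoteIp: "99.99.99.99", serverPort: 22,    action: "deliver" },
  { protocol: "TCP", remoteIp: "any",         serverPort: "any", action: "throw away" },
  { protocol: "UDP", remoteIp: "any",         serverPort: "any", action: "throw away" }
];

function matches(ruleValue, actualValue) {
  return ruleValue === "any" || ruleValue === actualValue;
}

// Walk the list from the top and obey the first rule that matches.
function decide(letter) {
  for (const rule of rules) {
    if (matches(rule.protocol, letter.protocol) &&
        matches(rule.remoteIp, letter.remoteIp) &&
        matches(rule.serverPort, letter.serverPort)) {
      return rule.action;
    }
  }
  return "throw away"; // assumption: a letter matching no rule is discarded
}

console.log(decide({ protocol: "TCP", remoteIp: "88.88.88.88", serverPort: 80 })); // deliver
console.log(decide({ protocol: "TCP", remoteIp: "88.88.88.88", serverPort: 22 })); // throw away
console.log(decide({ protocol: "TCP", remoteIp: "99.99.99.99", serverPort: 22 })); // deliver
```

Order matters: if rule 3 were moved to the top, every TCP letter would be thrown away before rules 1 and 2 were ever consulted.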

The overall effect is that anybody in the world can correspond with the Web server in apartment 80, but only the trustworthy people living at 99.99.99.99 can write to the SSH server in apartment 22. All other attempted communication is thrown away. This is how that would translate to a real software firewall like IPTables:

Rule | Protocol | Direction | Type        | Remote IP   | Remote port | Server port | Action
1a   | TCP      | incoming  | any         | any         | any         | 80          | Allow
1b   | TCP      | outgoing  | established | any         | any         | 80          | Allow
2a   | TCP      | incoming  | any         | 99.99.99.99 | any         | 22          | Allow
2b   | TCP      | outgoing  | established | 99.99.99.99 | any         | 22          | Allow
3    | TCP      | any       | any         | any         | any         | any         | Deny
4    | UDP      | any       | any         | any         | any         | any         | Deny

Most firewalls, however, will hide the gory details of new and established connections. They will assume that all replies to established connections are allowed through and only deal with the new connections. The table can then be much shorter:

Rule | Protocol | Direction | Remote IP   | Remote port | Server port | Action
1    | TCP      | incoming  | any         | any         | 80          | Allow
2    | TCP      | incoming  | 99.99.99.99 | any         | 22          | Allow
3    | TCP      | any       | any         | any         | any         | Deny
4    | UDP      | any       | any         | any         | any         | Deny

If these rules were applied to a hardware firewall, there would also be a server IP address, as hardware firewalls usually work on behalf of more than one server. In that case, the table could just have source and destination and no direction, like this:

Rule | Protocol | Source IP   | Source port | Destination IP | Destination port | Action
1    | TCP      | any         | any         | 80.72.139.101  | 80               | Allow
2    | TCP      | 99.99.99.99 | any         | 80.72.139.101  | 22               | Allow
3    | TCP      | any         | any         | any            | any              | Deny
4    | UDP      | any         | any         | any            | any              | Deny

To allow outgoing requests, the source and destination would be switched, with the server’s IP address 80.72.139.101 appearing in the source IP address column. Also note that the remote and server protocol are always the same.

Firewall Example

Some firewall configurations provide tables and forms similar to the one above and you just have to fill in the blanks. The example below is the software firewall tool provided by Plesk 9.5.4. Plesk is a common server management application. To see the screen below, click “Modules” on the left, then “Firewall,” “Edit Firewall Configuration” and “Add Custom Rule.”

Adding a new rule to the firewall in Plesk 9.5.4.

This has most of the fields shown in the table above: direction, action of allow or deny, server port, protocol and remote/source IP address. The mini-table above blocks both directions within the same rule (rules 3 and 4). In many configurations, these have to be done separately by setting up the rule and choosing a direction of incoming or outgoing.

In this interface, there is another choice for direction — forwarding. Port forwarding allows letters to be forwarded from one port to another. It is commonly used in household broadband routers, which allow several computers to share a single IP address by mapping ports.

For example, your laptop might have a local IP address of 192.168.1.10. When you request a Web page (with your sender’s address like 192.168.1.10, TCP, 60000), your letter first goes to your broadband router, which puts your letter in a brand new envelope with a new sender’s address (such as 78.78.78.78, TCP, 12013) and forwards it to its destination (such as 80.72.139.101, TCP, 80). When the router gets the reply, it remembers that apartment 12013 is just a forwarding address, so it puts it in a new envelope and forwards it to you (at 192.168.1.10, TCP, 60000). A typical Web server would not need to do port forwarding and so the firewall would block all forwarding.
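The router's bookkeeping in this example can be sketched as a lookup table from forwarding ports to local addresses. This is only an illustration of the idea, using the addresses from the paragraph above; createRouter, sendOut and receiveReply are invented names, and a real router also keeps track of the destination and expires old entries.

```javascript
// Hypothetical broadband router with a single public IP address.
function createRouter(publicIp, firstPublicPort) {
  const forwarding = new Map(); // public port -> { ip, port } inside the house
  let nextPort = firstPublicPort;

  return {
    // Outgoing letter: put it in a new envelope with the router's own address.
    sendOut: function (localIp, localPort) {
      const publicPort = nextPort++;
      forwarding.set(publicPort, { ip: localIp, port: localPort });
      return { ip: publicIp, port: publicPort };
    },
    // Reply: look up which local computer the forwarding address belongs to.
    receiveReply: function (publicPort) {
      return forwarding.get(publicPort) || null; // null: nobody asked for this letter
    }
  };
}

const router = createRouter("78.78.78.78", 12013);
const envelope = router.sendOut("192.168.1.10", 60000);
console.log(envelope);                           // sender's address seen by the Web server
console.log(router.receiveReply(envelope.port)); // where the reply is forwarded
console.log(router.receiveReply(12999));         // null, no such conversation
```

The Web server only ever sees 78.78.78.78, TCP, 12013. The laptop's local address never leaves the house.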

The Plesk firewall is just a friendly interface for the UNIX firewall IPTables. The command iptables -L shows the current rules as above.

Another Firewall Example

The screenshot below is from the shared hardware firewall configuration tool provided by the host UK Fast. You have to pay extra for this service. In exchange, you can log into the UK Fast website and add, modify and remove rules:

A hardware firewall configuration interface for allowing incoming and outgoing traffic to TCP ports.

This interface is very simple. It already knows your server IP address and it assumes that everything not listed is denied. The only thing you can do is add allowed ports.

The ports relate to the server port for incoming connections and the remote port for outgoing connections. As above, this only deals with new connections. Established requests are allowed through automatically. So the first incoming rule on the left states that anybody in the world can access port 80 on the server (the web-crazy family).

The first outgoing rule on the right allows the server to request port 80 on other servers. You may wonder why your Web server would ever need to do this. Surely it is not spending its free time browsing the Internet. But it is necessary for things like PayPal Instant Payment Notification, where your server sends a request to PayPal to double check that a payment has been received. Similarly, most of the other outgoing rules allow your server to send emails, check email accounts and do FTP.

There is no mention of the remote IP address on this screen. That is configured separately by clicking on “Admin Config” and is shown in the next subsection.

Subnets

The UK Fast hardware firewall has a separate screen for entering remote IP addresses. They have a preset list of restricted ports (such as 22 for SSH and 8443 for Plesk), and only the IP addresses shown are allowed to communicate with those ports. All other ports (like 80 for Web serving) are allowed to correspond with anybody. You lose some flexibility this way, but it makes it very easy to configure.

Entering a list of source IP addresses which are allowed to connect to restricted ports.

This screen also introduces subnets. Up till now, all the rules have dealt with single IP addresses like 99.99.99.99, but you can also enter IP addresses in ranges.

IP addresses consist of four numbers between 0 and 255. (There are newer, much longer IPv6 addresses, written as eight groups of hexadecimal digits, but they are still relatively uncommon.) It would be nice to be able to enter ranges by putting things like 99.99.0-255.0-255. Unfortunately you can’t. Instead you enter a starting address like 99.99.0.0 and a mask like 255.255.0.0. Similarly the range 77.77.77.0-255 is entered as the address 77.77.77.0 and the subnet mask 255.255.255.0.

You can enter smaller ranges such as 77.77.77.8-15, which would be 77.77.77.8 with subnet mask 255.255.255.248. It’s confusing because it involves binary, but there are lots of detailed subnet explanations available.
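The binary behind the mask is less mysterious in code. An address belongs to a range if it agrees with the starting address in every bit position where the mask has a 1. The helper names below are made up for this sketch.

```javascript
// Convert a dotted IPv4 address into a single 32-bit number.
function toNumber(ip) {
  return ip.split(".").reduce(function (total, part) {
    return total * 256 + Number(part);
  }, 0);
}

// An address is in the range if it equals the starting address
// wherever the mask has a binary 1.
function inSubnet(ip, startAddress, mask) {
  const m = toNumber(mask);
  // ">>> 0" keeps the result of the bitwise AND unsigned.
  return ((toNumber(ip) & m) >>> 0) === ((toNumber(startAddress) & m) >>> 0);
}

// 248 is 11111000 in binary, so only the last three bits may vary (8 to 15).
console.log(inSubnet("77.77.77.12", "77.77.77.8", "255.255.255.248")); // true
console.log(inSubnet("77.77.77.16", "77.77.77.8", "255.255.255.248")); // false
console.log(inSubnet("99.99.200.1", "99.99.0.0", "255.255.0.0"));      // true
```

This also shows why arbitrary ranges such as 77.77.77.5-12 cannot be written as a single address and mask: the range has to line up with whole bit positions.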

All the Fields

Here is a summary of the different fields discussed in this section:

  • Rule number: remember that rules are checked and applied in order
  • Protocol: either TCP or UDP
  • Direction: incoming, outgoing or forwarding
  • Remote IP address: such as 77.77.77.0
  • Remote subnet mask: such as 255.255.255.0
  • Remote port or port range
  • Server IP address
  • Server subnet mask
  • Server port or port range
  • Action: allow or deny

Sometimes the interface will have source and destination instead of direction, remote and server. In this case, they are the same for incoming mail, but reversed for outgoing. In other words, remote equals source for incoming but remote equals destination for outgoing.
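The translation between the two vocabularies is mechanical, as this small sketch suggests. The rule object and its field names are assumptions made for the example, not the format of any particular firewall.

```javascript
// Translate a rule written as direction/remote/server into source/destination.
function toSourceDestination(rule) {
  const remote = { ip: rule.remoteIp, port: rule.remotePort };
  const server = { ip: rule.serverIp, port: rule.serverPort };

  if (rule.direction === "incoming") {
    // Incoming mail: the remote computer is the sender.
    return { source: remote, destination: server };
  }
  if (rule.direction === "outgoing") {
    // Outgoing mail: the server is the sender.
    return { source: server, destination: remote };
  }
  throw new Error("Unsupported direction: " + rule.direction);
}

console.log(toSourceDestination({
  direction: "incoming",
  remoteIp: "99.99.99.99", remotePort: "any",
  serverIp: "80.72.139.101", serverPort: 22
}));
```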

The next section discusses the ports in more detail. The concierge’s piece of paper is very limiting. A real Web server will look more like the hardware firewall example above.

Configuring A Firewall For A Web Server

The Web server family usually lives in apartment 80. Many other popular computer services have a preferred apartment too. This section lists and describes some of the most common ports used by a standard Web and email server, and then provides recommendations on setting up your own firewall.

Ports

These are some of the well-known official ports you may come across. Some firewalls don’t even show the port number, only the service which usually uses it.

Protocol | Port | Used by | Description
TCP | 20 | FTP (active mode) | FTP operates in active or passive mode, as requested by the person doing the FTPing. In both modes, commands are sent to the server on port 21. In active mode, the server initiates a new connection from port 20 to send data back. In passive mode, there is an extra incoming connection to an unprivileged port (>=1024). So for FTP to work, you have to allow incoming connections to port 21, plus either outgoing connections from port 20 (active) or incoming connections to ports >=1024 (passive).
TCP | 21 | FTP | FTP is the File Transfer Protocol. Restrict it by IP address so that only trusted people can FTP to your server, or turn it off completely and only allow Secure FTP, which runs as part of SSH over port 22.
TCP | 22 | SSH | Secure Shell allows people to log in and run commands on your server. This is very useful for server administration, but should be restricted by IP address.
TCP | 23 | Telnet | Telnet is an insecure version of SSH. It is usually completely denied.
TCP | 25 | SMTP | SMTP is the Simple Mail Transfer Protocol. It handles two tasks. As a Mail Transfer Agent, it receives email from other servers for the email accounts set up on your server. As a Mail Submission Agent, it accepts outgoing email from your users and forwards it to other servers. The submission tasks are now more often handled by port 587 and are usually configured to require a password. Relaying (forwarding email for accounts not even listed on the server) is turned off completely to stop spam relaying. Either way, port 25 needs to be allowed.
UDP and TCP | 53 | DNS | DNS stands for Domain Name System. It translates domain names like www.smashingmagazine.com into IP addresses. DNS data is transferred between servers using TCP, and DNS queries are handled with UDP. Unless your server is acting as a DNS server, only outgoing UDP needs to be allowed. This allows your server to look up domain names. For instance, if you have a website which does PayPal Instant Payment Notification, then your Web server probably POSTs a request to www.paypal.com, but first it needs to translate that into an IP address, so it needs to send a request to UDP port 53 on a DNS server. Alternatively, you can put the translation into your /etc/hosts file (on UNIX) so that the DNS request is not required.
TCP | 80 | HTTP | This is your Web server. Allow it so people can browse your websites.
TCP | 110 | POP3 | POP stands for Post Office Protocol. If your server hosts email accounts, this port allows people to check their email.
TCP | 143 | IMAP | IMAP is the Internet Message Access Protocol. It is a more sophisticated alternative to POP3, as it not only stores your incoming email but allows you to create folders too. The person who creates the email account is the one who decides whether it should be POP3 or IMAP.
TCP | 443 | HTTPS | This port is the default for a secure Web server, which requires an SSL certificate.
TCP | 465 | SMTPS | This port was originally used for a secure version of SMTP. This is now done using the normal SMTP port 25.
TCP | 587 | SMTP | Port 587 is the official port for submitting email messages to a server.
TCP | 3306 | MySQL | Deny or restrict only to trusted IP addresses that connect directly to your MySQL databases from across the Internet.
TCP | 8443 | Plesk | Plesk is a server management tool. It runs as a mini-Web server, accessed via your Web browser. If you have a hardware firewall or a definitely permanently static IP address, you can restrict this by IP address. If you use Plesk’s own software firewall, then restricting by IP address is dangerous in case your IP address ever changes, or your office burns down or you need to access it from somewhere else.
TCP | 10000 | Webmin | Webmin is another server management tool. The same caveats apply as with Plesk.
TCP | 1024-65535 | Outgoing requests | One of the first scenarios in this article was a person using a Web browser. In this case, a high numbered port on the client (1024 or above) connects to the Web server’s port 80. If your Web server looks up a Web page (as in PayPal IPN) it connects from one of its own high numbered ports. On many UNIX computers, the available range is set in the file /proc/sys/net/ipv4/ip_local_port_range. It is often from 1024-4099 or 32768-61000.
ICMP | n/a | Ping | Ping is used to check if a computer is reachable over a network. It uses a different port-less protocol called Internet Control Message Protocol. Some firewalls will also give you control over this.

Full Configuration

For a basic Web and email server with the Plesk management interface, you could configure your firewall to:

  • Allow incoming connections from anywhere to TCP ports 25 (SMTP), 80 (Web server), 110 (POP email accounts), 143 (IMAP email accounts), 443 (secure Web server), 587 (SMTP)
  • Restrict incoming connections to ports TCP 22 (SSH), 8443 (Plesk unless you use Plesk to configure the firewall)
  • Allow outgoing connections from any port on the server to the remote TCP ports 25 (SMTP), 80 (web), 443 (secure web), 587 (SMTP) and UDP 53 (DNS lookups)
  • Deny everything else

If you allow additional services, you will need to open up additional ports. This could include automatic backups, security scans or remote database access.
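The list above can be written down as data with a default of deny. The sketch below is one possible way to express it, not the format of Plesk or any other tool. The trusted address is the example 99.99.99.99 used earlier and would need to be replaced with your own.

```javascript
// The recommended configuration as data. All names here are hypothetical.
const config = {
  incomingTcpFromAnywhere: [25, 80, 110, 143, 443, 587],
  incomingTcpRestricted: { ports: [22, 8443], trustedIps: ["99.99.99.99"] },
  outgoingTcp: [25, 80, 443, 587],
  outgoingUdp: [53]
};

// Decide on a NEW connection; replies to established ones are assumed allowed.
// For incoming connections, port is the server port.
// For outgoing connections, port is the remote port.
function allowNewConnection(direction, protocol, remoteIp, port) {
  if (direction === "incoming" && protocol === "TCP") {
    if (config.incomingTcpFromAnywhere.includes(port)) return true;
    if (config.incomingTcpRestricted.ports.includes(port)) {
      return config.incomingTcpRestricted.trustedIps.includes(remoteIp);
    }
    return false;
  }
  if (direction === "outgoing" && protocol === "TCP") return config.outgoingTcp.includes(port);
  if (direction === "outgoing" && protocol === "UDP") return config.outgoingUdp.includes(port);
  return false; // deny everything else
}

console.log(allowNewConnection("incoming", "TCP", "88.88.88.88", 80));   // true
console.log(allowNewConnection("incoming", "TCP", "88.88.88.88", 22));   // false
console.log(allowNewConnection("incoming", "TCP", "99.99.99.99", 22));   // true
console.log(allowNewConnection("outgoing", "UDP", "8.8.8.8", 53));       // true
console.log(allowNewConnection("incoming", "TCP", "88.88.88.88", 3306)); // false
```

Opening an additional service then means adding one number to the right list rather than writing a new rule from scratch.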

Checking With Telnet

Most computers come with a Telnet client (on recent versions of Windows it may need to be enabled first) which allows you to connect to a Telnet server on port 23 and run commands. But the Telnet client can also connect to other ports, so it is a very useful way to check a firewall.

To use it, you’ll need to open a Terminal or Command Prompt. On a Mac with OS X, go to Applications > Utilities and run Terminal. On a PC with Windows, go to Start > All Programs > Accessories and select “Command Prompt.” If you use Ubuntu Linux, it’s under Applications > Accessories, and in a similar location for other flavors of Linux.

Once you’ve got the Terminal open, try typing something like this followed by “Enter”:

telnet www.smashingmagazine.com 21

First telnet will translate www.smashingmagazine.com into an IP address and then it will try to connect to its port 21. In this case the connection is refused, which means the FTP server isn’t running at all or is being blocked at the application level. Here are a few more examples:

Using telnet to probe www.smashingmagazine.com.

Ports 21 (FTP) and 25 (SMTP) refuse the connection. Port 23 (Telnet) doesn’t even reply, which may mean that it’s blocked by a firewall. Port 80 (Web server) connects successfully. I could then have issued an HTTP command like:

GET / HTTP/1.1
Host: www.smashingmagazine.com

Followed by a blank line. That would have given me Smashing Magazine’s home page. Instead I pressed Ctrl + ] to close the connection and then quit from Telnet.
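The same kind of check can be scripted. The sketch below assumes Node.js and its built-in net module rather than a browser. The interpret and probe functions are names invented here, and the wording of each outcome simply follows the explanations given above.

```javascript
const net = require("net");

// Turn the result of a connection attempt into one of the outcomes above.
function interpret(outcome) {
  if (outcome === "connected") return "open: something is listening";
  if (outcome === "ECONNREFUSED") return "refused: no service, or blocked at the application level";
  if (outcome === "ETIMEDOUT" || outcome === "timeout") return "no reply: possibly dropped by a firewall";
  return "other error: " + outcome;
}

// Try to open a TCP connection, much as "telnet host port" does.
function probe(host, port, timeoutMs, callback) {
  const socket = net.createConnection({ host: host, port: port });
  let finished = false;

  function finish(outcome) {
    if (finished) return; // report only the first result
    finished = true;
    socket.destroy();
    callback(interpret(outcome));
  }

  socket.setTimeout(timeoutMs);
  socket.on("connect", function () { finish("connected"); });
  socket.on("timeout", function () { finish("timeout"); });
  socket.on("error", function (err) { finish(err.code || err.message); });
}

// Example (needs network access, so it is left commented out):
// probe("www.smashingmagazine.com", 80, 5000, console.log);
```

As with Telnet, a timeout is only a hint. It may mean a firewall is dropping the packets, or simply that nothing is reachable at that address.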

Personal Firewalls

Firewalls are not just for servers. Many personal computers also have software firewalls, like the Windows Firewall on Windows Vista or the Mac Firewall in Mac OS X. Underneath, these probably operate in the same way as server firewalls, but a lot of the details are hidden. They often just have an on/off switch which blocks all incoming traffic, with the ability to add exceptions by application rather than by port. There are more sophisticated third-party products available which allow for port and IP address restrictions.

Configuration screen for Windows Firewall in Windows XP Service Pack 2.

Configuration screen for Mac OS X firewall which is a graphical interface for the underlying UNIX software firewall called ipfirewall.

Firewall Pitfalls

Firewalls are great for blocking traffic from unwanted sources. If you have access to a firewall, it is advisable to at least limit FTP and SSH to trusted IP addresses. In some senses though, they are very crude. They either allow or deny. They do not care about the contents of the letter.

Things That a Firewall Can’t Block

Therefore, a firewall cannot block SPAM, viruses or hacks.

If your SMTP port 25 is open, then email can be sent to it. There might be 10,000 emails an hour discussing the finer points of Viagra, but they all look legitimate to a firewall. They might contain attached EXE files which will take over your computer once opened, but the firewall won’t notice.

Similarly, all standard Web servers have port 80 or 443 open. The firewall cannot tell the difference between a valid request for your home page, or a piece of software canvassing WordPress installations to look for weak spots, so it can post a Trojan Horse PHP file containing a mini-shell which will notify a hacker somewhere and allow your server to be used for denial of service attacks.

Finally, if a clever hacker somewhere figures out how to gain control over a server by issuing specially formatted requests to your IMAP port 143, the firewall won’t lift a finger.

Therefore, even if you have a firewall, you still need to worry about SPAM and viruses in emails, and you still need to keep your server software and websites up-to-date with the latest security patches.

Locking Yourself Out

Software firewalls have an extra pitfall. It is entirely possible to lock yourself out of your own server. The example above showed a software firewall configured within Plesk, which usually runs on port 8443. If you mistakenly blocked port 8443 and saved the configuration, then you would not be able to log in again and undo it. This could also happen if you restricted Plesk by IP address and then your IP address changed.

You would then have to log in via SSH and manually figure out how to reverse the rule by editing IPTables. If you had also blocked SSH, your only recourse would be to call the hosting company. They might have to attach a physical keyboard and screen to your server in order to log in and remove the rule.

Conclusion

This article has covered the basics of firewalls, and has hopefully given you a clear idea of how they operate. The analogy to a pair of massive apartment complexes is not perfect, but provides an insight into the world of ports.

In conclusion, if you have access to a firewall, you should use it. For software firewalls, you can safely restrict SSH and FTP by IP address and block any other services you are not using. For hardware firewalls, you can reliably restrict access to Plesk or your chosen server management tool.

Credits of image on start page: Wikipedia.



© Paul Tero for Smashing Magazine, 2013.


All About Unicode, UTF8 & Character Sets


  

This is a story that dates back to the earliest days of computers. The story has a plot, well, sort of. It has competition and intrigue, as well as traversing oodles of countries and languages. There is conflict and resolution, and a happyish ending. But the main focus is the characters — 110,116 of them. By the end of the story, they will all find their own unique place in this world.

This story will follow a few of those characters more closely, as they journey from Web server to browser, and back again. Along the way, you’ll find out more about the history of characters, character sets, Unicode and UTF-8, and why question marks and odd accented characters sometimes show up in databases and text files.

Warning: This article contains lots of numbers, including a bit of binary — best approached after your morning cup of coffee.

ASCII

Computers only deal in numbers and not letters, so it’s important that all computers agree on which numbers represent which letters.

Let’s say my computer used the number 1 for A, 2 for B, 3 for C, etc., and yours used 0 for A, 1 for B, etc. If I sent you the message HELLO, then the numbers 8, 5, 12, 12, 15 would whiz across the wires. But for you, 8 means I, so you would receive and decode it as IFMMP. To communicate effectively, we would need to agree on a standard way of encoding the characters.
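The mismatch is easy to reproduce. The following is only a sketch: the helper names encodeWith and decodeWith are made up for this illustration, and they only handle the capital letters A to Z.

```javascript
// Hypothetical helpers: number the capital letters, starting at `offset` for A.
function encodeWith(offset, text) {
  return Array.from(text, function (ch) {
    return ch.charCodeAt(0) - 'A'.charCodeAt(0) + offset;
  });
}

function decodeWith(offset, numbers) {
  return numbers.map(function (n) {
    return String.fromCharCode(n - offset + 'A'.charCodeAt(0));
  }).join('');
}

var sent = encodeWith(1, 'HELLO');   // my computer: A is 1
var received = decodeWith(0, sent);  // your computer: A is 0
console.log(sent.join(', ') + ' -> ' + received);
```

The numbers travel intact; it is only the lookup table at the receiving end that turns HELLO into IFMMP.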

To this end, in the 1960s the American Standards Association created a 7-bit encoding called the American Standard Code for Information Interchange (ASCII). In this encoding, HELLO is 72, 69, 76, 76, 79 and would be transmitted digitally as 1001000 1000101 1001100 1001100 1001111. Using 7 bits gives 128 possible values from 0000000 to 1111111, so ASCII has enough room for all lower case and upper case Latin letters, along with each numerical digit, common punctuation marks, spaces, tabs and other control characters. In 1968, US President Lyndon Johnson made it official — all computers must use and understand ASCII.
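To check those numbers yourself, here is a small sketch that turns each letter into its ASCII number and then into 7 binary digits. The function names toAscii and toSevenBits are just labels chosen for this example.

```javascript
// Look up the ASCII number of every character in a piece of text.
function toAscii(text) {
  return Array.from(text, function (ch) { return ch.charCodeAt(0); });
}

// Write a number in binary, padded with leading zeros to 7 digits.
function toSevenBits(n) {
  return n.toString(2).padStart(7, '0');
}

var codes = toAscii('HELLO');
var bits = codes.map(toSevenBits);
console.log(codes.join(', '));
console.log(bits.join(' '));
```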

Trying It Yourself

There are plenty of ASCII tables available, displaying or describing the 128 characters. Or you can make one of your own with a little bit of CSS, HTML and Javascript, most of which is to get it to display nicely:

<html>
<body>
<style type="text/css">p {float: left; padding: 0 15px; margin: 0; font-size: 80%;}</style>
<script type="text/javascript">
for (var i=0; i<128; i++) document.writeln ((i%32?'':'<p>') + i + ': ' + String.fromCharCode (i) + '<br>');
</script>
</body>
</html>

This will display a table like this:

Do-It-Yourself Javascript ASCII table viewed in Firefox.

The most important bit of this is the Javascript String.fromCharCode function. It takes a number and turns it into a character. In fact, the following four lines of HTML and Javascript all produce the same result. They all get the browser to display character numbers 72, 69, 76, 76 and 79:

HELLO
&#72;&#69;&#76;&#76;&#79;
<script>document.write ("HELLO");</script>
<script>document.write (String.fromCharCode (72,69,76,76,79));</script>

Also notice how Firefox displays the unprintable characters (like backspace and escape) in the first column. Some browsers show blanks or question marks. Firefox squeezes four hexadecimal digits into a small box.

The Eighth Bit

Teleprinters and stock tickers were quite happy sending 7 bits of information to each other. But the newfangled microprocessors of the 1970s preferred to work with powers of 2. They could process 8 bits at a time and so used 8 bits (aka a byte or octet) to store each character, giving 256 possible values.

An 8 bit character can store a number up to 255, but ASCII only assigns up to 127. The other values from 128 to 255 are spare. Initially, IBM PCs used the spare slots to represent accented letters, various symbols and shapes and a handful of Greek letters. For instance, number 200 was the lower left corner of a box: ╚, and 224 was the Greek letter alpha in lower case: α. This way of encoding the letters was later given the name code page 437.

However, unlike ASCII, characters 128-255 were never standardized, and various countries started using the spare slots for their own alphabets. Not everybody agreed that 224 should display α, not even the Greeks. This led to the creation of a handful of new code pages. For example, in Russian IBM computers using code page 855, 224 represents the Cyrillic letter Я. And in Greek code page 737, it is lower case omega: ω.

Even then, there was disagreement. From the 1980s Microsoft Windows introduced its own code pages. In the Cyrillic code page Windows-1251, 224 represents the Cyrillic letter а, and Я is at 223.

Starting in the late 1980s, an attempt at standardization was made. Fifteen different 8 bit character sets were created to cover many different alphabets such as Cyrillic, Arabic, Hebrew, Turkish, and Thai. They are called ISO-8859-1 up to ISO-8859-16 (number 12 was abandoned). In the Cyrillic ISO-8859-5, 224 represents the letter р, and Я is at 207.

So if a Russian friend sends you a document, you really need to know what code page it uses. The document by itself is just a sequence of numbers. Character 224 could be Я, а or р. Viewed using the wrong code page, it will look like a bunch of scrambled letters and symbols.
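Modern browsers and Node.js let you watch a single number change its meaning. This sketch uses the built-in TextDecoder with a few standard encoding labels. It assumes your runtime ships those encodings (most do), and it leaves out the older IBM code pages 437, 855 and 737, which TextDecoder does not cover.

```javascript
// One byte, several interpretations.
var oneByte = new Uint8Array([224]);
var labels = ['windows-1251', 'iso-8859-5', 'iso-8859-7', 'windows-1252'];
var views = {};

labels.forEach(function (label) {
  // Each decoder applies its own table to the very same number.
  views[label] = new TextDecoder(label).decode(oneByte);
  console.log(label + ': ' + views[label]);
});
```

The byte never changes. Only the table used to look it up does.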

(The situation isn’t quite as bad when viewing Web pages — as Web browsers can usually detect a page’s character set based on frequency analysis and other such techniques. But this is a false sense of security — they can and do get it wrong.)

Trying It Yourself

Code pages are also known as character sets. You can explore these character sets yourself, but you have to use PHP or a similar server side language this time (roughly because the character needs to be in the page before it gets to the browser). Save these lines in a PHP file and upload it to your server:

<html>
<head>
<meta charset="ISO-8859-5">
</head>
<body>
<style type="text/css">p {float: left; padding: 0 15px; margin: 0; font-size: 80%;}</style>
<?php  for ($i=0; $i<256; $i++) echo ($i%32?'':'<p>') . $i . ': ' . chr ($i) . '<br>'; ?>
</body>
</html>

This will display a table like this:

Cyrillic character set ISO-8859-5 viewed in Firefox.

The PHP function chr does a similar thing to Javascript’s String.fromCharCode. For example, chr(224) embeds the number 224 into the Web page before sending it to the browser. As we’ve seen above, 224 can mean many different things. So, the browser needs to know which character set to use to display the 224. That’s what the meta line above is for. It tells the browser to use the Cyrillic character set ISO-8859-5:

<meta charset="ISO-8859-5">

If you exclude the charset line, then it will display using the browser’s default. In countries with Latin-based alphabets (like the UK and US), this is probably ISO-8859-1, in which case 224 is an a with grave accent: à. Try changing this line to ISO-8859-7 or Windows-1251 and refresh the page. You can also override the character set in the browser. In Firefox, go to View > Character Encoding. Swap between a few to see what effect it has. If you try to display more than 256 characters, the sequence will repeat.

Summary Circa 1990

This is the situation in about 1990. Documents can be written, saved and exchanged in many languages, but you need to know which character set they use. There is also no easy way to use two or more non-English alphabets in the same document, and alphabets with more than 256 characters like Chinese and Japanese have to use entirely different systems.

Finally, the Internet is coming! Internationalization and globalization are about to make this a much bigger issue. A new standard is required.

Unicode To The Rescue

Starting in the late 1980s, a new standard was proposed – one that would assign a unique number (officially known as a code point) to every letter in every language, one that would have way more than 256 slots. It was called Unicode. It is now in version 6.1 and consists of over 110,000 code points. If you have a few hours to spare you can watch them all whiz past.

The first 128 Unicode code points are the same as ASCII. The range 128-255 contains currency symbols and other common signs and accented characters (aka characters with diacritical marks), and much of it is borrowed from ISO-8859-1. After 256 there are many more accented characters. After 880 it gets into Greek letters, then Cyrillic, Hebrew, Arabic, Indic scripts, and Thai. Chinese, Japanese and Korean start from 11904 with many others in between.

This is great – no more ambiguity – each letter is represented by its own unique number. Cyrillic Я is always 1071 and Greek α is always 945. 224 is always à, and H is still 72. Note that these Unicode code points are officially written in hexadecimal preceded by U+. So the Unicode code point for H is usually written as U+0048 rather than 72 (to convert from hexadecimal to decimal: 4*16+8=72).
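Converting between the decimal numbers used in this article and the official U+ notation takes only a couple of lines. This is a sketch; toUPlus and fromUPlus are names invented for the example.

```javascript
// Format a character's code point as U+ followed by at least four hex digits.
function toUPlus(ch) {
  var hex = ch.codePointAt(0).toString(16).toUpperCase();
  return 'U+' + hex.padStart(4, '0');
}

// Go the other way: read the hexadecimal part back as a decimal number.
function fromUPlus(notation) {
  return parseInt(notation.slice(2), 16);
}

console.log(toUPlus('H'), toUPlus('Я'), toUPlus('α'));
console.log(fromUPlus('U+0048'), fromUPlus('U+042F'), fromUPlus('U+03B1'));
```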

The major problem is that there are more than 256 of them. The characters will no longer fit into 8 bits. However Unicode is not a character set or code page. So officially that is not the Unicode Consortium’s problem. They just came up with the idea and left someone else to sort out the implementation. That will be discussed in the next two sections.

Unicode Inside The Browser

Unicode does not fit into 8 bits, not even into 16. Although only 110,116 code points are in use, it has the capability to define up to 1,114,112 of them, which would require 21 bits.

However, computers have advanced since the 1970s. An 8 bit microprocessor is a bit out of date. New computers now have 64 bit processors, so why can’t we move beyond an 8 bit character and into a 32 bit or 64 bit character?

The first answer is: we can!

A lot of software is written in C or C++, which supports a “wide character”. This is called wchar_t and is 32 bits wide on most Linux and Mac systems (16 bits on Windows). It is an extension of C’s 8 bit char type. Internally, modern Web browsers use these wide characters (or something similar) and can theoretically quite happily deal with over 4 billion distinct characters. This is plenty for Unicode. So – internally, modern Web browsers use Unicode.

Trying It Yourself

The Javascript code below is similar to the ASCII code above, except it goes up to a much higher number. For each number, it tells the browser to display the corresponding Unicode code point:

<html>
<body>
<style type="text/css">p {float: left; padding: 0 15px; margin: 0; font-size: 80%;}</style>
<script type="text/javascript">
for (var i=0; i<2096; i++)
  document.writeln ((i%256?'':'<p>') + i + ': ' + String.fromCharCode (i) + '<br>');
</script>
</body>
</html>

It will output a table like this:

A selection of Unicode code points viewed in Firefox

The screenshot above only shows a subset of the first few thousand code points output by the Javascript. The selection includes some Cyrillic and Arabic characters, displayed right-to-left.

The important point here is that Javascript runs completely in the Web browser where 32 bit characters are perfectly acceptable. The Javascript function String.fromCharCode(1071) outputs the Unicode code point 1071 which is the letter Я.

Similarly if you put the HTML entity &#1071; into an HTML page, a modern Web browser would display Я. Numerical HTML entities also refer to Unicode.

On the other hand, the PHP function chr(1071) would output a forward slash / because the chr function only deals with 8 bit numbers up to 255 and repeats itself after that, and 1071%256=47 which has been a / since the 1960s.
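The wrap-around can be imitated in Javascript. In this sketch, phpLikeChr is a made-up stand-in for PHP's chr, not the real thing. As an aside, String.fromCharCode has a limit of its own: it works in 16 bit units and wraps at 65536, so String.fromCodePoint is the safer choice for higher code points.

```javascript
// A rough stand-in for PHP's chr: only the lowest 8 bits of the number count.
function phpLikeChr(n) {
  return String.fromCharCode(n % 256);
}

var cyrillic = String.fromCharCode(1071);      // the browser's view of 1071
var slash = phpLikeChr(1071);                  // 1071 % 256 is 47
var wrapped = String.fromCharCode(65536 + 72); // wraps round to 72
console.log(cyrillic, slash, wrapped);
```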

UTF-8 To The Rescue

So if browsers can deal with Unicode in 32 bit characters, where is the problem? The problem is in the sending and receiving, and reading and writing of characters.

The problem remains because:

  1. A lot of existing software and protocols send/receive and read/write 8 bit characters
  2. Using 32 bits to send/store English text would quadruple the amount of bandwidth/space required

Although browsers can deal with Unicode internally, you still have to get the data from the Web server to the Web browser and back again, and you need to save it in a file or database somewhere. So you still need a way to make 110,000 Unicode code points fit into just 8 bits.

There have been several attempts to solve this problem such as UCS2 and UTF-16. But the winner in recent years is UTF-8, which stands for Universal Character Set Transformation Format 8 bit.

UTF-8 is clever. It works a bit like the Shift key on your keyboard. Normally when you press the H on your keyboard a lower case “h” appears on the screen. But if you press Shift first, a capital H will appear.

UTF-8 treats numbers 0-127 as ASCII, 192-247 as Shift keys, and 128-191 as the key to be shifted. For instance, characters 208 and 209 shift you into the Cyrillic range. 208 followed by 175 is character 1071, the Cyrillic Я. The exact calculation is (208%32)*64 + (175%64) = 1071. Characters 224-239 are like a double shift. 226 followed by 190 and then 128 is character 12160: ⾀. 240 and over is a triple shift.
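The same arithmetic extends to the double and triple shifts. Here is a sketch of it as a function. The name decodeSequence is invented for this example, and it does no validation, so it trusts you to hand it one well-formed sequence.

```javascript
// Turn one UTF-8 sequence (an array of numbers) into a Unicode code point,
// using the modulo arithmetic described above.
function decodeSequence(bytes) {
  var first = bytes[0];
  if (first < 128) return first;        // plain ASCII, no shift
  if (first < 224) {                    // 192-223: single shift, 2 numbers
    return (first % 32) * 64 + (bytes[1] % 64);
  }
  if (first < 240) {                    // 224-239: double shift, 3 numbers
    return (first % 16) * 4096 + (bytes[1] % 64) * 64 + (bytes[2] % 64);
  }
  // 240-247: triple shift, 4 numbers
  return (first % 8) * 262144 + (bytes[1] % 64) * 4096 +
         (bytes[2] % 64) * 64 + (bytes[3] % 64);
}

console.log(decodeSequence([72]));
console.log(decodeSequence([208, 175]));
console.log(decodeSequence([226, 190, 128]));
```

Each number after the first contributes 6 bits (hence the %64), and the first contributes whatever is left over after the marker bits that say how long the sequence is.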

UTF-8 is therefore a multi-byte variable-width encoding. Multi-byte because a single character like Я takes more than one byte to specify it. Variable-width because some characters like H take only 1 byte and some up to 4.

Best of all it is backward compatible with ASCII. Unlike some of the other proposed solutions, any document written only in ASCII, using only characters 0-127, is perfectly valid UTF-8 as well – which saves bandwidth and hassle.
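Both properties can be seen with the built-in TextEncoder, which always produces UTF-8. This is a small sketch; utf8Length is a helper name chosen for the example.

```javascript
var encoder = new TextEncoder(); // TextEncoder only ever outputs UTF-8

// Count how many bytes a piece of text needs in UTF-8.
function utf8Length(text) {
  return encoder.encode(text).length;
}

// One sample each for 1, 2, 3 and 4 bytes (the last is an emoji, U+1F600).
['H', 'Я', '⾀', '\u{1F600}'].forEach(function (ch) {
  console.log(ch + ' takes ' + utf8Length(ch) + ' byte(s)');
});

// ASCII-only text comes out as exactly the same numbers as plain ASCII.
var helloBytes = Array.from(encoder.encode('HELLO'));
console.log(helloBytes.join(', '));
```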

Trying It Yourself

This is a different experiment. PHP embeds the 6 numbers mentioned above into an HTML page: 72, 208, 175, 226, 190, 128. The browser interprets those numbers as UTF-8, and internally converts them into Unicode code points. Then Javascript outputs the Unicode values. Try changing the character set from UTF-8 to ISO-8859-1 and see what happens:

<html>
<head>
<meta charset="UTF-8">
</head>
<body>
<p>Characters embedded in the page:<br>
<span id="chars"><?php echo chr(72).chr(208).chr(175).chr(226).chr(190).chr(128); ?></span>
<p>Character values according to Javascript:<br>
<script type="text/javascript">
function ShowCharacters (s) {var r=''; for (var i=0; i<s.length; i++)
  r += s.charCodeAt (i) + ': ' + s.substr (i, 1) + '<br>'; return r;}
document.writeln (ShowCharacters (document.getElementById('chars').innerHTML));
</script>
</body>
</html>

If you are in a hurry, this is what it will look like:

The sequence of numbers above shown using the UTF-8 character set

Same sequence of numbers shown using the ISO-8859-1 character set

If you display the page using the UTF-8 character set, you will see only 3 characters: HЯ⾀. If you display it using the character set ISO-8859-1, you will see six separate characters: HÐ¯â¾€. This is what is happening:

  1. On your Web server, PHP is embedding the numbers 72, 208, 175, 226, 190 and 128 into a Web page
  2. The Web page whizzes across the Internet from the Web server to your Web browser
  3. The browser receives those numbers and interprets them according to the character set
  4. The browser internally represents the characters using their Unicode values
  5. Javascript outputs the corresponding Unicode values

Notice that when viewed as ISO-8859-1 the first 5 numbers are the same (72, 208, 175, 226, 190) as their Unicode code points. This is because Unicode borrowed heavily from ISO-8859-1 in that range. The last number however, the euro symbol €, is different. Browsers treat ISO-8859-1 as Windows-1252, which puts € at position 128, and its Unicode value is 8364.
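Steps 3 to 5 can also be reproduced without a Web server. This sketch feeds the same six numbers to the built-in TextDecoder twice. It uses the label windows-1252 for the second pass on the assumption that this matches what browsers do for ISO-8859-1, in which case the last value should come out as 8364.

```javascript
var pageBytes = new Uint8Array([72, 208, 175, 226, 190, 128]);

// Decode the bytes with the given character set and list the code points.
function describe(label) {
  var text = new TextDecoder(label).decode(pageBytes);
  return Array.from(text, function (ch) { return ch.codePointAt(0); });
}

var asUtf8 = describe('utf-8');
var asLatin = describe('windows-1252');
console.log('UTF-8: ' + asUtf8.join(', '));
console.log('ISO-8859-1: ' + asLatin.join(', '));
```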

Summary Circa 2003

UTF-8 is becoming the most popular international character set on the Internet, superseding the older single-byte character sets like ISO-8859-5. When you view or send a non-English document, you still need to know what character set it uses. For widest interoperability, website administrators need to make sure all their web pages use the UTF-8 character set.

Perhaps the Ð looks familiar – it will sometimes show up if you try to view Russian UTF-8 documents. The next section describes how character sets get confused and end up storing things wrongly in a database.

Lots Of Problems

As long as everybody is speaking UTF-8, this should all work swimmingly. If they aren’t, then characters can get mangled. To explain why, imagine a typical interaction with a website, such as a user making a comment on a blog post:

  1. A Web page displays a comment form
  2. The user types a comment and submits
  3. The comment is sent back to the server and saved in a database
  4. The comment is later retrieved from the database and displayed on a Web page

This simple process can go wrong in lots of ways and produce the following types of problems:

HTML Entities

Pretend for a moment that you don’t know anything about character sets – erase the last 30 minutes from your memory. The form on your blog will probably display itself using the character set ISO-8859-1. This character set doesn’t know any Russian or Thai or Chinese, and only a little bit of Greek. If you attempt to copy and paste any into the form and press Submit, a modern browser will try to convert it into HTML numerical entities like &#1071; for Я.

That’s what will get saved in your database, and that’s what will be output when the comment is displayed – which means it will display fine on a Web page, but cause problems when you try to output it to a PDF or email, or run text searches for it in a database.

Confused Characters

How about if you operate a Russian website, and you have not specified a character set in your Web page? Imagine a Russian user whose default character set is ISO-8859-5. To say “hi”, they might type Привет. When the user presses Submit, the characters are encoded according to the character set of the sending page. In this case, Привет is encoded as the numbers 191, 224, 216, 210, 213 and 226. Those numbers will get sent across the Internet to the server, and saved like that into a database.

If somebody later views that comment using ISO-8859-5, they will see the correct text. But if they view using a different Russian character set like Windows-1251, they will see їаШТХв. It’s still Russian, but makes no sense.

Accented Characters with Lots of Vowels

If someone views the same comment using ISO-8859-1, they will see ¿àØÒÕâ instead of Привет. A longer phrase like Я тоже рада Вас видеть (“nice to see you” in a formal way to a female), submitted as ISO-8859-5, will show up in ISO-8859-1 as Ï âÞÖÕ àÐÔÐ ²Ðá ÒØÔÕâì. It looks like that because the 128-255 range of ISO-8859-1 contains lots of vowels with accents.

So if you see this sort of pattern, it’s probably because text has been entered in a single byte character set (one of the ISO-8859s or Windows ones) and is being displayed as ISO-8859-1. To fix the text, you’ll need to figure out which character set it was entered as, and resubmit it as UTF-8 instead.
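Here is a sketch of that repair using the built-in TextDecoder and TextEncoder. It assumes you have the raw stored numbers to hand and have already worked out that they were entered as ISO-8859-5. The variable names are only illustrative.

```javascript
// The six numbers saved in the database for the Russian greeting.
var stored = new Uint8Array([191, 224, 216, 210, 213, 226]);

// Viewed with the wrong character set: accented vowels.
var wrong = new TextDecoder('windows-1252').decode(stored);

// Viewed with the character set it was actually entered in.
var right = new TextDecoder('iso-8859-5').decode(stored);

// Re-encode the recovered text as UTF-8, ready to be saved again.
var utf8Bytes = Array.from(new TextEncoder().encode(right));

console.log(wrong);
console.log(right);
console.log(utf8Bytes.join(', '));
```

The tricky part in practice is the guess in the middle. Nothing in the six numbers says which character set they came from.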

Alternating Accented Characters

What if the user submitted the comment in UTF-8? In that case the Cyrillic characters which make up the word Привет would each get sent as 2 numbers: 208/159, 209/128, 208/184, 208/178, 208/181 and 209/130. If you viewed that in ISO-8859-1 it would look like: ÐŸÑ€Ð¸Ð²ÐµÑ‚.

Notice that every other character is a Ð or Ñ. Those characters are numbers 208 and 209, and they tell UTF-8 to switch to the Cyrillic range. So if you see a lot of Ð and Ñ, you can assume that you are looking at Russian text entered in UTF-8, viewed as ISO-8859-1. Similarly, Greek will have lots of Î and Ï, 206 and 207. And Hebrew has alternating ×, number 215.
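You can list these tell-tale “Shift” numbers for any piece of text. In this sketch, leadBytes is a made-up helper that encodes the text as UTF-8 and collects every distinct number of 192 or more. The sample words are just greetings picked for the example.

```javascript
// Collect the distinct lead bytes (192 and over) in the UTF-8 form of a text.
function leadBytes(text) {
  var seen = [];
  new TextEncoder().encode(text).forEach(function (b) {
    if (b >= 192 && seen.indexOf(b) === -1) seen.push(b);
  });
  return seen.sort(function (a, b) { return a - b; });
}

console.log('Russian: ' + leadBytes('Привет').join(', '));
console.log('Greek: ' + leadBytes('γειά σου').join(', '));
console.log('Hebrew: ' + leadBytes('שלום').join(', '));
```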

Vowels Before a Pound and Copyright Sign

A very common issue in the UK is the currency symbol £ getting converted into Â£. This is exactly the same issue as above with a coincidence thrown in to add confusion. The £ symbol has the Unicode and ISO-8859-1 value of 163. Recall that in UTF-8 any character over 127 is represented by a sequence of two or more numbers. In this case, the UTF-8 sequence is 194/163. Mathematically, this is because (194%32)*64 + (163%64) = 163.

Visually it means that if you view the UTF-8 sequence using ISO-8859-1, it appears to gain a Â, which is character 194 in ISO-8859-1. The same thing happens for all Unicode code points 161-191, which includes © and ® and ¥.

So if your £ or © suddenly inherit a Â, it is because they were entered as UTF-8.
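The whole journey of the pound sign fits in a few lines. This sketch encodes it as UTF-8 and then reads the two numbers back one at a time, the way an ISO-8859-1 page would.

```javascript
// Encode the pound sign as UTF-8: two numbers.
var poundBytes = Array.from(new TextEncoder().encode('£'));

// Read those two numbers as if each were its own character.
var misread = new TextDecoder('windows-1252').decode(new Uint8Array(poundBytes));

// The arithmetic from the text: the pair 194/163 decodes back to 163.
var check = (194 % 32) * 64 + (163 % 64);

console.log(poundBytes.join('/') + ' is displayed as ' + misread);
console.log(check);
```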

Black Diamond Question Marks

How about the other way around? If you enter Привет as ISO-8859-5, it will get saved as the numbers shown above: 191, 224, etc. If you then try to view this as UTF-8, you may well see lots of question marks inside black diamonds: �. The browser displays these when it can’t make sense of the numbers it is reading.

UTF-8 is self-synchronizing. Unlike other multi-byte character encodings, you always know where you are with UTF-8. If you see a number 192-247, you know you are at the beginning of a multi-byte sequence. If you see 128-191 you know you are in the middle of one. There’s no danger of missing the first number and garbling the rest of the text.
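Those ranges translate directly into code. In this sketch, classifyByte and nextCharacterStart are hypothetical helpers: the first labels a number using the ranges above, and the second shows how a reader dropped into the middle of a sequence can find its feet again.

```javascript
// Label a single number according to the UTF-8 ranges.
function classifyByte(b) {
  if (b < 128) return 'ascii';
  if (b < 192) return 'continuation'; // the middle of a sequence
  if (b < 248) return 'lead';         // the beginning of a sequence
  return 'invalid';
}

// From any position, skip forward to where the next character begins.
function nextCharacterStart(bytes, position) {
  var i = position;
  while (i < bytes.length && classifyByte(bytes[i]) === 'continuation') i++;
  return i;
}

// Start reading in the middle of the first character of 208/159, 209/128.
console.log(nextCharacterStart([208, 159, 209, 128], 1));
```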

This means that in UTF-8, the sequence 191 followed by 224 will never occur naturally, so the browser doesn’t know what to do with it and displays �� instead.

This can also cause £ and © related problems. £50 in ISO-8859-1 is the numbers 163, 53 and 48. The 53 and 48 cause no issues, but in UTF-8, 163 can never occur by itself, so this will show up as �50. Similarly if you see �2012, it is probably because ©2012 was input as ISO-8859-1 but is being displayed as UTF-8.
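The built-in TextDecoder behaves the same way as the browser's page renderer here, which makes this easy to try. A small sketch, with viewAsUtf8 as an invented helper name:

```javascript
// Decode numbers as UTF-8 and list the resulting code points.
// Anything that makes no sense comes out as 65533, the black diamond.
function viewAsUtf8(numbers) {
  var text = new TextDecoder('utf-8').decode(new Uint8Array(numbers));
  return Array.from(text, function (ch) { return ch.codePointAt(0); });
}

var price = viewAsUtf8([163, 53, 48]);             // £50 saved as ISO-8859-1
var copyright = viewAsUtf8([169, 50, 48, 49, 50]); // ©2012 saved as ISO-8859-1
console.log(price.join(', '));
console.log(copyright.join(', '));
```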

Blanks, Question Marks and Boxes

Even if it is fully up to speed with UTF-8 and Unicode, a browser still may not know how to display a character. The first few ASCII characters 0-31 are mostly control sequences for teleprinters (things like Acknowledge and Stop). If you try to display them, a browser might show a ? or a blank or a box with tiny numbers inside it.

Also, Unicode defines over 110,000 characters. Your browser may not have the correct font to display all of them. Some of the more obscure characters may also get shown as ? or blank or a small box. In older browsers, even fairly common non-English characters may show as boxes.

Older browsers may also behave differently for some of the issues above, showing ? and blank boxes more often.

Databases

The discussion above has avoided the middle step in the process – saving data to a database. Databases like MySQL can also specify a character set for a database, table or column. But it is less important than the Web page’s character set.

When saving and retrieving data, MySQL deals just with numbers. If you tell it to save number 163, it will. If you give it 208/159 it will save those two numbers. And when you retrieve the data, you’ll get the same two numbers back.

The character set becomes more important when you use database functions to compare, convert and measure the data. For example, the LENGTH of a field may depend on its character set, as do string comparisons using LIKE and =. The method used to compare strings is called a collation.
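The measuring problem is not specific to databases. In MySQL, LENGTH counts bytes while CHAR_LENGTH counts characters, and the same gap can be shown in Javascript. This is only an analogy sketched in the browser's language, not MySQL code.

```javascript
var word = 'Привет';

// How many characters a person would count.
var characters = Array.from(word).length;

// How many numbers are needed to store it as UTF-8.
var bytesUtf8 = new TextEncoder().encode(word).length;

console.log(characters + ' characters, ' + bytesUtf8 + ' bytes');
```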

Character sets and collations in MySQL are an in-depth subject. It’s not simply a case of changing the character set of a table to UTF-8. There are further SQL commands to take into account to make sure the data goes in and out in the right format as well. This blog is a good starting point.

Trying It Yourself

The following PHP and Javascript code allows you to experiment with all these issues. You can specify which character set is used to input and output text, and you can see what the browser thinks about it too.

<?php
$charset = $_POST['charset']; if (!$charset) $charset = 'ISO-8859-1';
$string = $_POST['string'];
if ($string) {
        echo '<p>This is what PHP thinks you entered:<br>';
        for ($i=0; $i<strlen($string); $i++) {$c=substr ($string,$i,1); echo ord ($c).': '.$c.' <br/>';}
}       
?>      
<html>
<head>
<meta charset="<?=$charset?>">
</head>
<body>
<form method="post">
<input name="lastcharset" type="hidden" value="<?php echo $charset?>"/>
Form was submitted as: <?php echo $_POST['lastcharset']?><br/>
Text is displayed as: <?php echo $charset?><br/>
Text will be submitted as: <?php echo $charset?><br/>
Copy and paste or type here:
<input name="string" type="text" size="20" value="<?php echo $string?>"/><br/>
Next page will display as:
<select name="charset"><option>ISO-8859-1<option>ISO-8859-5
<option>Windows-1251<option>ISO-8859-7<option>UTF-8</select><br/>
<input type="submit" value="Submit" onclick="ShowCharacters (this.form.string.value); return 1;"/>
</form>
<script type="text/javascript">
function ShowCharacters (s) {
  var r='You entered:';
  for (var i=0; i<s.length; i++) r += '\n' + s.charCodeAt (i) + ': ' + s.substr (i, 1);
  alert (r);
}
</script>
</body>
</html>

This is an example of the code in action. The numbers at the top are the numerical values of each of the characters and their representation (when viewed individually) in the current character set:

Example of inputting and output in different character sets. This shows a £ sign turning into a � in Google Chrome.

The page above shows the previous, current and future character sets. You can use this code to quickly see how text can get really mangled. For example, if you pressed Submit again above, the � has Unicode code point 65533 which is 239/191/189 in UTF-8 and will be displayed as ï¿½50 in ISO-8859-1. So if you ever get £ symbols turning into ï¿½, that is probably the route they took.
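That second round of mangling can be checked directly. This sketch encodes the black diamond itself as UTF-8 and then reads the three numbers back as single byte characters.

```javascript
// The replacement character, code point 65533, encoded as UTF-8.
var diamondBytes = Array.from(new TextEncoder().encode('\uFFFD'));

// The same three numbers viewed one at a time, as ISO-8859-1 would.
var seenAsLatin = new TextDecoder('windows-1252').decode(new Uint8Array(diamondBytes));

console.log(diamondBytes.join('/') + ' is displayed as ' + seenAsLatin);
```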

Note that the select box at the bottom will change back to ISO-8859-1 each time.

One Solution

All the encoding problems above are caused by text being submitted in one character set and viewed in another. The solution is to make sure that every page on your website uses UTF-8. You can do this with one of these lines immediately after the <head> tag:

<meta charset="UTF-8">
<meta http-equiv="Content-type" content="text/html; charset=UTF-8">

It has to be one of the first things in your Web page, as it will cause the browser to look again at the page in a whole new light. For speed and efficiency, it should do this as soon as possible.

You can also specify UTF-8 in your MySQL tables, though to fully use this feature, you’ll need to delve deeper.

Note that users can still override the character set in their browsers. This is rare, but does mean that this solution is not guaranteed to work. For extra safety, you could implement a back-end check to ensure data is arriving in the correct format.
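What might such a check look like? The sketch below is one possibility, written in Javascript (for a Node.js back end) to stay consistent with the other examples. The function name isValidUtf8 is made up, and it assumes you can get at the submitted data as raw bytes. In PHP, mb_check_encoding does a similar job.

```javascript
// Return true if the bytes form valid UTF-8, false otherwise.
// With fatal set to true, TextDecoder throws instead of inserting black diamonds.
function isValidUtf8(bytes) {
  try {
    new TextDecoder('utf-8', { fatal: true }).decode(bytes);
    return true;
  } catch (e) {
    return false;
  }
}

console.log(isValidUtf8(new Uint8Array([208, 159]))); // a UTF-8 Cyrillic letter
console.log(isValidUtf8(new Uint8Array([163, 53, 48]))); // £50 sent as ISO-8859-1
```

A check like this only tells you the data is well-formed UTF-8. It cannot tell you what the user meant to type, so rejected data still needs a sensible error message or a fallback.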

Existing Websites

If your website has already been collecting text in a variety of languages, then you will also need to convert your existing data into UTF-8. If there is not much of it, you can use a PHP page like the one above to figure out the original character set, and use the browser to convert the data into UTF-8.

If you have lots of data in various character sets, you’ll need to first detect the character set and then convert it. In PHP you can use mb_detect_encoding to detect and iconv to convert. Reading the comments for mb_detect_encoding, it looks like quite a fussy function, so be sure to experiment to make sure you are using it properly and getting the right results.

A potentially misleading function is utf8_decode. It turns UTF-8 into ISO-8859-1. Any characters not available in ISO-8859-1 (like Cyrillic, Greek, Thai, etc) are turned into question marks. It’s misleading because you might have expected more from it, but it does the best it can.

Summary

This article has relied heavily on numbers and has tried to leave no stone unturned. Hopefully it has provided an exhaustive understanding of character sets, Unicode, UTF-8 and the various problems that can arise. The morals of the story are:

  • You need to know the character set in order to make sense of non-Latin text
  • Internally, browsers use Unicode to represent characters
  • Make sure all your Web pages specify the UTF-8 character set

For a slightly different approach to this subject, this 2003 character set article is excellent. Thank you for sticking with this epic journey.

(il)

Image credits (front page): nevsred.


© Paul Tero for Smashing Magazine, 2012.


Introduction To Linux Commands


  

At the heart of every modern Mac and Linux computer is the “terminal.” The terminal evolved from the text-based computer terminals of the 1960s and ’70s, which themselves replaced punch cards as the main way to interact with a computer. It’s also known as the command shell, or simply “shell.” Windows has one, too, but it’s called the “command prompt” and is descended from the MS-DOS of the 1980s.

Mac, Linux and Windows computers today are mainly controlled through user-friendly feature-rich graphical user interfaces (GUIs), with menus, scroll bars and drag-and-drop interfaces. But all of the basic stuff can still be accomplished by typing text commands into the terminal or command prompt.

Using Finder or Explorer to open a folder is akin to the cd command (for “change directory”). Viewing the contents of a folder is like ls (short for “list,” or dir in Microsoft’s command prompt). And there are hundreds more for moving files, editing files, launching applications, manipulating images, backing up and restoring stuff, and much more.

So, why would anyone want to bother with these text commands when you can use the mouse instead? The main reason is that they are very useful for controlling remote computers on which a GUI is not available, particularly Web servers, and especially Linux Web servers that have been stripped of all unnecessary graphical software.

Sometimes these lean Linux servers are managed through a Web browser interface, such as cPanel or Plesk, letting you create databases, email addresses and websites; but sometimes that is not enough. This article provides a broad introduction to text commands and the situations in which they are useful. We’ll cover the following:

  • Why knowing a few commands is useful;
  • Issuing commands on your own computer;
  • Using SSH to log into your Web server;
  • Getting your bearings: pwd, cd, ls;
  • Viewing and moving files: cat, more, head, tail, mv, cp, rm;
  • Searching for files: find;
  • Looking through and editing files: grep, vi;
  • Backing up and restoring files and databases: tar, zip, unzip, mysqldump, mysql;
  • File permissions: chmod.

Why Knowing A Few Linux Commands Is Useful

As a website developer or server administrator, you would gain a big asset in becoming comfortable with these commands: for website emergencies, to configure a server and for your CV. It can also save you money. Many hosting companies offer fully managed servers but at a high monthly premium. Or else they charge by the hour for technical support.

Perhaps you need to archive some big files or make a change to the httpd.conf file or figure out why your website’s images have suddenly stopped loading. You might not want to pay $50 to your server’s administrator for a five-minute job. This article gives you the tools to make such changes yourself.

And why “Linux” commands? Two main types of servers are available today: Windows and UNIX. UNIX-based servers include Linux (which split off in 1991), Mac OS X (2002) and several more traditional UNIX systems, such as BSD, Solaris and HP-UX. Linux commands are basically UNIX commands and so will run on all of them. In fact, I use the term “Linux” here only because it is more common and less frightening than “UNIX.” Windows servers, on the other hand, have a much smaller market share and are more often controlled through GUIs, such as Remote Desktop and VNC, rather than the command line.

In fact, a November 2011 survey showed that Apache accounted for about 65% of all Web servers. Apache usually runs in the popular LAMP configuration: Linux, Apache, MySQL and PHP. Microsoft was a distant second, with 15%. Third place nginx runs on Linux, UNIX, Mac and Windows. So, the commands in this article will work on at least two thirds of all servers.

Issuing Commands To Your Own Computer

You can quickly experiment with text commands on your own computer. On Mac with OS X, go to Applications → Utilities, and run Terminal. On a PC with Windows, go to Start → All Programs → Accessories, and choose “Command Prompt.” On Ubuntu Linux, go to Applications → Accessories, and choose Terminal.

On Windows you should see this:

The Windows command prompt

This is the command line (i.e. shell, prompt or terminal) on your own computer. You can type dir on Windows or ls on Linux or Mac followed by “Enter” to see a list of the files in the current “directory” (i.e. folder, location or path).

All we will be doing for the rest of this article is opening up one of these terminals on a remote computer: your Web server.

You may have used VNC or Remote Desktop, which allow you to actually view the desktop on someone else’s computer: your screen shows their screen, your mouse controls their mouse, your keyboard mimics their keyboard.

The terminal is similar to this but without the fancy menus or scroll bars. If you were to plug a keyboard and screen into your Web server, sitting in a fireproof basement somewhere, you would probably see one of these terminals, waiting patiently for your user name and password.

Using SSH To Log Into Your Web Server

The application SSH, or Secure Shell, is used to log into Web servers. It often takes the same user name and password as FTP, but it has to be allowed by your host. If you have a dedicated Web server, it is probably already allowed. If you use cloud hosting, then you might need to request it first. If you are on shared hosting, you’ll definitely need to request it, and the administrator may refuse.

On Linux or Mac, open up Terminal as described above and type the following:

ssh -l username www.myserver.com

The -l stands for “log in as,” and your user name goes after it. If SSH is allowed, then it will ask for a password. If not, you’ll get an error message, like this one:

Running the ssh command and being denied access

On Windows, you will need to download some SSH software. Putty is a popular and easy choice. It downloads as a single EXE file, which you can save to your desktop and run right away. Type your website as the host name, check the SSH box under “Connection Type,” and click “Open.” It will ask for your user name and then your password.

Running Putty on Windows in order to SSH to your Web server

Once successfully logged in, you will usually see a welcome message. After that, you will be presented with a few letters and a $ sign (or a # sign if you have logged in as root). The letters often represent your user name and where you’ve come from, or the name of the server. A ~ indicates that you are in your home directory. The $ is the prompt; it indicates that you can start typing commands now, something like:

A successful SSH log-in to a Web server. The $ means we can start typing commands.

The next section introduces a few basic commands.

Getting Your Bearings

On Windows, when you go to “My Documents” from the Start menu, it opens your “My Documents” directory in Windows Explorer and shows the contents. If some nosy colleague walked by and asked “What directory are you in?” you could say “I’m in my documents.”

If you SSH’ed to a server as the user “admin,” you would land in admin’s home directory, probably /home/admin. You can verify this by typing the command pwd, which shows your current location (i.e. folder, directory or path).

The pwd command tells you where you are, cd changes the directory and ls shows the contents of a directory.

To change to another directory, use the cd command with the destination, like so:

cd /

This will change the directory to /, the top of the whole UNIX directory structure. The command ls lists the contents of the current directory, in this case /.

In the screenshot above, the terminal is color-coded. Dark-blue entries are subdirectories, and black entries are files. A lot of the interesting stuff on Web servers happens in the /etc, /home and /var directories. Using just cd and ls, you can explore your server and find out where stuff is.

When using cd, you can specify the new directory absolutely (beginning with a slash, like /var/www) or relative to your current location (without the initial slash). You can also go up a directory with two dots. Practice with the sequence below, pressing “Enter” after each command. Can you guess what the last command will tell you?

cd /var
ls
cd www
ls
cd ..
pwd
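
To check your guess without depending on how your server is laid out, here is a small sketch that builds a throwaway directory tree and walks through it in the same way. The names sandbox, var and www are made up for this illustration, and mktemp -d simply creates an empty temporary directory.

```shell
# Build a scratch tree so the example works on any machine (names are illustrative).
sandbox=$(mktemp -d)
sandbox=$(cd "$sandbox" && pwd -P)   # resolve symbolic links so the paths compare cleanly
mkdir -p "$sandbox/var/www"

cd "$sandbox/var"     # absolute path: begins with a slash
cd www                # relative path: resolved from the current directory
inside=$(pwd -P)      # now in .../var/www
cd ..                 # two dots: go up one level
after_up=$(pwd -P)    # back in .../var

echo "$inside"
echo "$after_up"
```

The second line printed ends in /var, because cd .. undoes the relative cd www.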

Viewing And Moving Files

On many Linux servers, websites are located in /var/www/vhosts. You can check on your server by doing the following:

cd /var/www/vhosts
ls

If you see a list of websites, you can move into one of them. Within the website’s main directory, you will probably see the same files that you see when you FTP to the website, things such as httpdocs (where your website’s files are), httpsdocs (if you have a separate secure website), conf (configuration files), statistics (logs and compiled statistics), error_docs, private and more.

You can then change into your website’s public-facing directory, which is myserver.com/httpdocs in this example:

cd myserver.com
ls
cd httpdocs
ls

Now you have arrived, and you can run a new command, cat, which displays the contents of a file. For instance, if you have an index.html file, run:

cat index.html

If your index.html file is more than a few lines long, it will rush past in a blur. You can use the more command to show it slowly, one page at a time. After you type the command below, it will show you the first page. You can press the space bar to show the next page, “Enter” to show the next line, and Q to quit.

more index.html

You can also show just the first few or last few lines of a file with the head and tail commands. They show 10 lines by default, but you can pass in any number:

head index.html
tail -20 index.html
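
The -20 form is a long-standing shorthand. The spelling with -n does the same thing and is the one described in the manual pages. Here is a small sketch using a throwaway 12-line file (a stand-in for index.html, created only for this illustration):

```shell
# Create a scratch file with 12 numbered lines (illustrative stand-in for index.html).
page=$(mktemp)
printf '%s\n' 1 2 3 4 5 6 7 8 9 10 11 12 > "$page"

first_three=$(head -n 3 "$page" | tr '\n' ' ')      # first 3 lines, joined with spaces
last_two=$(tail -n 2 "$page" | tr '\n' ' ')         # last 2 lines, joined with spaces
default_count=$(head "$page" | wc -l | tr -d ' ')   # no number given, so head stops after 10 lines

echo "first: $first_three"
echo "last: $last_two"
echo "default: $default_count lines"
```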

If you would like to rename this file, use the mv command, short for “move”:

mv index.html indexold.html

Similarly, cp is the copy command, and rm removes files.

cp index.html indexold.html
rm indexold.html

Below is a string of commands in action. In order, it confirms the current directory with pwd, looks at the contents with ls, views index.html with cat, then renames it with mv, and finally removes it with rm, with a lot of ls in between to show the changes.

The cat, mv and rm commands in action, for displaying, moving and then removing a file.

More Advanced Tip: Changing the Prompt

Note that in our initial examples, the full prompt included the current directory. For instance, in [admin@myserver /]$, the / indicated that the user was in the / directory. In the example directly above, it was removed, or else it would have crowded the screenshot by constantly saying [admin@myserver /var/www/vhosts/myserver.com/httpdocs]$.

You can change the prompt to whatever you want by setting the PS1 environment variable. Here are a couple of examples, the latter including the user, host and current directory:

PS1="[woohoooo ]$ "
PS1='[${USER}@${HOSTNAME} ${PWD}]$ '

Searching For Files

On big websites, files can get lost. Perhaps you vaguely remember uploading a new version of your client’s logo about four months ago, but it has since fallen out of favor and been replaced. Now, out of the blue, the client wants it back. You could download everything from the server using FTP and search the files using Finder or Explorer. Or you could log in and search using the command line.

The find command can search through files by name, size and modified time. If you just give it a directory, it will list everything that the directory contains. Try this:

find /var/www

You will probably see lots and lots of file names whizzing past. If you have many websites, it could continue for a couple of minutes. You can stop it by hitting Control + C (i.e. holding down the Control key on your keyboard and pressing the letter C). That’s the way to interrupt a Linux command. A more useful command would be:

find /var/www | more

The pipe symbol (|) takes the output of one command (in this case, the long list of files produced by find) and passes it to another command (in this case, more, which shows you one page of files at a time). As above, press the space bar to show the next page, and Q to quit.

To search for a specific file name, add -name and the file name. You can use \* as a wild card (the backslash is not always necessary but is good practice with the find command). You can combine searches using -o (for “or”). If you leave out the -o, it becomes an “and.”

find /var/www -name logo.gif
find /var/www -name \*.gif
find /var/www -name \*.gif -o -name \*.jpg

You can also search by size by adding -size. So, you could look for all GIFs between 5 and 10 KB:

find /var/www -name \*.gif -size +5k -size -10k

Similarly, to find a file that was last changed between 90 and 180 days ago, you can use -ctime:

find /var/www -name \*.gif -ctime +90 -ctime -180

In both of these cases, you will probably also want to know the actual file size and date last changed. For this, you can add -printf, which is similar to the C function printf in that you use the % sign to output various information. This command outputs the file size (up to 10 characters wide), the date and time changed (down to the nanosecond) and the file name:

find /var/www -name \*.gif -size +5k -size -10k -ctime +90 -ctime -180 -printf "%10s  %c  %p\n"

With that whopper, you have hopefully found the missing file. Here is an example:

Searching for all GIFs within a single website, and displaying the file sizes, changed times and file names.

Another useful parameter is -cmin, which lets you see files that have changed in the last few minutes. So, if something goes wrong on a website, you can run this to see everything that has changed in the last 10 minutes:

find /var/www -cmin -10 -printf "%c %p\n"

This will show files and directories that have changed. Thus, it won’t show files that have been removed (because they are no longer there), but it will show the directories that they were removed from. To show only files, add -type f:

find /var/www -cmin -10 -type f -printf "%c %p\n"
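
One thing to watch when mixing -o with other tests: the implied “and” binds more tightly than -o, so a size or time test written after an “or” applies only to the last name. Escaped parentheses group the names together. The sketch below demonstrates this in a scratch directory. The file names and sizes are invented for the illustration.

```shell
# Scratch directory with three files (names and sizes are illustrative).
site=$(mktemp -d)
printf 'x' > "$site/logo.gif"                  # 1 byte
head -c 8192 /dev/zero > "$site/photo.jpg"     # 8 KB
head -c 8192 /dev/zero > "$site/banner.gif"    # 8 KB

# Without parentheses, -size applies only to the *.jpg test, so the tiny GIF still matches.
ungrouped=$(find "$site" -name \*.gif -o -name \*.jpg -size +5k | wc -l | tr -d ' ')

# With escaped parentheses, the size test applies to both kinds of file.
grouped=$(find "$site" \( -name \*.gif -o -name \*.jpg \) -size +5k | wc -l | tr -d ' ')

echo "ungrouped: $ungrouped, grouped: $grouped"
```

The ungrouped command counts all three files, while the grouped one counts only the two 8 KB files.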

More Advanced Tip: Reading the Manual

I didn’t have to remember all of the variations above. I consulted the manual several times, like so:

man find

While reading a manual page, the controls are the same as more: space bar for paging, “Enter” to go forward one line and Q to quit. The up and down arrows also work. You can search within a page of the manual by typing / and a keyword, such as /printf. This will jump you to the next occurrence of that term. You can search backwards with ?printf, and you can repeat the search by pressing N.

Looking Through And Editing Files

Most visual code editors allow you to search through many files when you’re looking for a particular variable or bit of HTML. You can also do this directly on the server using the command grep. This is useful when something goes wrong on a complex website with hundreds of files and you have to find the error and fix it fast.

Let’s say you view the HTML source and see that the error happens right after <div id="left">. You can let grep do the searching for you. Give it the thing to be searched for and the files to search in. These commands change to the website directory and grep through all files ending in php. You need to put quotes around the HTML because it contains spaces, and the inner quotes have to be escaped with backslashes:

cd /var/www/vhosts/myserver.com/httpdocs/
grep "<div id=\"left\">" *.php

This will tell you which files in the current directory contain that bit of HTML. If you want to search in subdirectories, you can use the -r option with a directory at the end, instead of a list of files. The single dot tells it to start in the current directory.

grep -r "<div id=\"left\">" .

Alternatively, you could use the find command from above to tell it which files to look in. To put a command within a command, enclose it in backticks (back apostrophes). The following searches for the HTML only in PHP files changed in the last 14 days:

grep "<div id=\"left\">" `find . -name \*.php -ctime -14`

You can also add -n to show the line numbers, as in this example:

Searching for a bit of HTML within the PHP files in the current directory
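
The backtick approach splits the list of files on spaces, so it can trip over file names that contain them. A hedged alternative is to let find run grep itself with -exec. The sketch below sets up a scratch directory (the directory and file names are invented for the illustration) and uses grep -l, which prints only the names of the files that contain the text.

```shell
# Scratch website with awkward file names (all names are illustrative).
docs=$(mktemp -d)
mkdir "$docs/old pages"
printf '<div id="left">menu</div>\n' > "$docs/old pages/side bar.php"
printf '<p>nothing here</p>\n' > "$docs/index.php"

cd "$docs"
# -exec ... {} + hands the file names straight to grep, so spaces in names are safe.
# -l lists only the files that contain the text.
matches=$(find . -name \*.php -ctime -14 -exec grep -l "<div id=\"left\">" {} +)

echo "$matches"
```

Swapping -l for -n would show the matching lines with their line numbers instead.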

And how do you quickly fix an error when you find it? To do that, you will need to start up a Linux text editor. Different editors are available, such as pico and emacs, but the one that is guaranteed to be there is vi. To edit a file, type vi and the file name:

vi index.php

vi is a complex editor. It can do most of the amazing things that a fully featured visual editor can do, but without the mouse. In brief, you can use the arrow keys to get around the file (or H, J, K and L on very basic terminals where even the arrow keys don’t work). To delete a character, press X. To delete a whole line, press DD. To insert a new character, press I. This takes you into “insert mode,” and you can start typing. Press the Escape key when finished to go back to “command mode.” Within command mode, type :w to save (i.e. write) the file and :q to quit, or :wq to do both at the same time.

The vi editor also supports copying and pasting, undoing and redoing, searching and replacing, opening multiple files and copying between them, etc. To find out how, look for a good vi tutorial (such as “Mastering the VI Editor”). Note also that on many computers, vi is just a shortcut to vim, which stands for “vi improved,” so you can follow vim tutorials, too.

Editing files with the vi text editor

More Advanced Tip: Tab Completion

When changing directories and editing files, you might get tired of having to type the file names in full over and over again. The Terminal loses some of its shine this way. This can be avoided with command-line completion, performed using tabs.

It works like this: start typing the name of a file or a command, and then press Tab. If there is only one possibility, Linux will fill in as much as it can. If nothing happens, it means there is more than one possibility. Press Tab again to show all of the possibilities.

For example, if above I had typed…

vi i

… And then pressed Tab, it would have filled in the rest for me…

vi index.php

… Unless several files started with the letter i. In that case, I would have had to press Tab again to see the options.

Backing Up And Restoring Files And Databases

Some Linux servers do support the zip command, but all of them support tar, whose original purpose was to archive data to magnetic tapes. To back up a directory, specify the backup file name and the directory to back up, such as:

cd /var/www/vhosts/myserver.com/httpdocs/
tar czf /tmp/backup.tgz .

The czf means “create zipped file.” The single dot stands for the current directory. You can also back up individual files. To back up just things changed in the last day, add the find command:

tar cfz /tmp/backup.tgz `find . -type f -ctime -1`

Both of these commands put the actual backup file in the temporary /tmp directory — if the backup file is in the same directory that you are backing up, it will cause an error. You can move the file to where you need it afterwards. To see what is in an archive, use the tzf options instead:

tar tfz /tmp/backup.tgz

Creating and showing the contents of a backup file using tar

To restore things, use xzf, for “extract from zipped file.” First, run a listing as above to check what’s in there, and then restore one or more of the files. The second command restores all of the files from the archive into the current directory:

tar xfz /tmp/backup.tgz ./index.php ./test.php
tar xfz /tmp/backup.tgz
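
If you would like to rehearse a backup and restore before trusting it with a real website, the round trip can be sketched in scratch directories. Everything here (the site, restore and archive locations and the two PHP files) is invented for the illustration. The tar options are the same ones used above.

```shell
# Scratch locations: a pretend website, a place to restore into, and the archive itself.
site=$(mktemp -d)
restore=$(mktemp -d)
archive=$(mktemp -d)/backup.tgz
printf 'home page\n' > "$site/index.php"
printf 'test page\n' > "$site/test.php"

cd "$site"
tar czf "$archive" .               # create the zipped archive of the current directory
listing=$(tar tzf "$archive")      # list the contents without extracting anything

cd "$restore"
tar xzf "$archive" ./index.php     # restore a single file, named exactly as in the listing
restored=$(cat index.php)

echo "$listing"
echo "restored: $restored"
```

Restoring into a separate directory first lets you inspect the files before copying them over the live ones.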

If your server has the zip command, then run these commands to do the same thing:

cd /var/www/vhosts/myserver.com/httpdocs/
zip -r /tmp/backup.zip .
zip -r /tmp/backup.zip `find . -type f -ctime -1`
unzip -l /tmp/backup.zip
unzip /tmp/backup.zip test.php
unzip /tmp/backup.zip

If your Web server uses MySQL, then you might want to regularly back up your data. For this, there is the mysqldump command. The format of the command is:

mysqldump --user="username" --password="password" --add-drop-table database

Replace the user name, password and database with your values. Instead of specifying a database, you can use -A to dump all databases. If you get errors about table locking, you can add --single-transaction. Once you submit the user name and password, this will output a load of SQL in a long blur. To save the output to a file, you will need to use the > symbol. This sends the output of a command to a file.

mysqldump --user="username" --password="password" --add-drop-table database > /tmp/db.sql

To restore a database backup, you can use the mysql command. This command lets you run SQL statements from the command line. For example, the following command gets you into the database:

mysql --user="username" --password="password" dbname

At the mysql> prompt, you can type an SQL statement such as:

mysql> SHOW TABLES;
mysql> SELECT * FROM customers;

For restoring, you’ll need to use the pipe (|), which will send the output from one command into another. In this case, cat will output the database backup file and send it into the mysql command:

cat /tmp/db.sql | mysql --user="username" --password="password" dbname
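
The > and | symbols work with any command, not just the MySQL ones, so you can get a feel for them without touching a database. In this sketch, printf stands in for mysqldump and grep stands in for mysql. Both stand-ins, and the made-up SQL, are assumptions for the sake of a harmless example.

```shell
# Stand-in for mysqldump: any command that writes text to standard output will do.
dump=$(mktemp)
printf 'DROP TABLE IF EXISTS customers;\nCREATE TABLE customers (id INT);\n' > "$dump"   # > saves the output to a file

# Stand-in for mysql: grep -c reads what cat sends down the pipe and counts the lines containing a semicolon.
piped=$(cat "$dump" | grep -c ';')

# The < symbol feeds a file to a command directly, without needing cat.
redirected=$(grep -c ';' < "$dump")

echo "piped: $piped, redirected: $redirected"
```

Both counts come out the same, which is why you will often see restores written with < and the backup file name at the end of the mysql command instead of cat and a pipe.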

If people are looking over your shoulder while you’re doing this, you might not want to type the password directly into the command. In this case, just leave it out, and mysql or mysqldump will ask for it instead.

cat /tmp/db.sql | mysql --user="username" --password dbname

Once you’ve created the database backup file, you can include it in the backups we did above:

tar czf /tmp/backup.tgz . /tmp/db.sql

More Advanced Tip: Hidden Files and Wildcards

Many websites use a file called .htaccess to implement URL rewriting and password protection. In UNIX, all files starting with a single dot are hidden. They won’t show up when you do ls, and they won’t get backed up if you do this:

tar czf /tmp/backup.tgz *

The * is a wildcard. Before the command executes, the * is replaced with all non-hidden files in the current directory. To include hidden files as well, it’s better to back up the whole directory as above using a single dot:

tar czf /tmp/backup.tgz .

To show hidden files when doing a directory listing, add -a to the command:

ls -a
ls -la
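
You can see the wildcard skipping hidden files for yourself in a scratch directory. The directory and its two files are created just for this sketch.

```shell
# Scratch directory with one hidden file and one normal file (illustrative).
site=$(mktemp -d)
printf 'RewriteEngine On\n' > "$site/.htaccess"
printf 'home page\n' > "$site/index.html"
cd "$site"

with_star=$(echo *)                  # the shell replaces * before echo even runs
with_all=$(ls -a | tr '\n' ' ')      # -a lists hidden files too, plus . and ..

echo "star sees: $with_star"
echo "ls -a sees: $with_all"
```

Only index.html appears in the first line of output, while .htaccess shows up in the second.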

File Permissions

If you use FTP regularly to upload files to websites, then you might be familiar with permissions. All files and directories on Linux (and on Mac and other UNIX systems) have an owner, a group and a set of flags specifying who can read, write and execute them.

The list of user names (and, thus, potential file owners) on a UNIX system is stored in the file /etc/passwd. You can try:

more /etc/passwd

The Apache Web server is started by a command when the Web server boots up. But the user who starts Apache is often a restricted and unprivileged user, such as nobody or apache or www-data. This is for security reasons, to prevent someone from hacking into the website and then gaining control of the whole server. You can find out who that user is by running the command below and looking in the first column. The ps aux command shows all of the processes running on the server, and grep shows only processes that contain the word “apache.”

ps aux | grep apache

This can cause conflicts, though. If you upload a file to a website via FTP and log in as admin, then the file will be owned by admin. If Apache was started by the user named nobody, then Apache might not be able to read that file and won’t be able to send it to any users who request it when viewing the website. Instead, users will see a broken image or a message such as “403 Forbidden. You don’t have permission to access that file.”

A subtler and more common problem is when an image can be viewed but not overwritten or removed via the website’s content management system (CMS). In this case, the user nobody can read the file but can’t write to it.

You can view permissions using the ls command with an extra -l, like so:

ls -l

The command ls -l shows information about permissions, owners, size and date.

This directory contains three files, with three subdirectories shown in green. The first letter on each line indicates the type: d for directory and - for normal file. The next nine letters are the permissions; they indicate the read, write and execute permissions for the owner, group and everyone else. After the number (which is the count of hard links to the entry) come the owner and group for the file. These files are all owned by admin. This is followed by the file size (less useful for directories) and the date and time of the last modification.

Below is another example of three files in an images subdirectory. Two of the files were uploaded by admin via FTP, and Apache was started by the user www-data. One of the files will be unviewable through a Web browser. Which do you think it is?

Bad permissions

The answer is bg.jpg. Both bg.jpg and logo2.gif have the same permissions: only the owner can read and write them. The logo2.gif file is OK because the owner is www-data, so that file can be accessed, read and returned by Apache. The logo.gif file is also OK because it has r in all three permissions (i.e. owner, group and everyone else). But bg.jpg will fail because only the user admin can read it, not the user who started Apache. If you were to access that file in a Web browser, you would see something like this:

What happens when you try to access a file without the correct permissions in a browser.

These sorts of errors can be resolved with the chmod command, which changes file permissions. The three sets of permissions are represented in commands with u (“user” or owner), g (“group”), o (“other” or everyone else) or a (“all”). So, to enable all users to read bg.jpg, either of these would work:

chmod go+r images/bg.jpg
chmod a+r images/bg.jpg

If this file were also part of a CMS, then you’d have to also add write permissions before the CMS could overwrite it:

chmod a+rw images/bg.jpg

You can also make these changes to all files in all of the subdirectories by adding -R. This recursive operation is not supported by some FTP programs and so is a useful command-line tool:

chmod -R a+rw images/

Directories also need the x (“execute”) permission, but files generally don’t (unless they are in a cgi-bin). So, you can give everything rwx (read, write and execute) permissions and then take away the x from the files:

chmod -R a+rwx images/
chmod -R a-x `find images/ -type f`

However, this does leave everything rather open, making it easier for hackers to gain a foothold. Ideally, your set of permissions should be as restrictive as possible. Files should be writable by the Apache user only when needed by the CMS.
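
As a rough sketch of a more restrictive starting point, the commands below give directories rwx for the owner and r-x for everyone else, and give files rw- for the owner and r-- for everyone else. The = form of chmod sets the permissions exactly rather than adding to them, and -exec ... {} + lets find pass the names directly to chmod. The images directory here is a scratch copy made up for the illustration. Any files that the CMS has to overwrite would still need write access for the Apache user on top of this.

```shell
# Scratch copy of an images directory (names are illustrative).
site=$(mktemp -d)
mkdir -p "$site/images/icons"
printf 'gif' > "$site/images/logo.gif"
printf 'gif' > "$site/images/icons/star.gif"
cd "$site"

# Directories: everyone can list and enter them, only the owner can change them.
find images -type d -exec chmod u=rwx,go=rx {} +
# Files: everyone can read them, only the owner can write, nobody executes.
find images -type f -exec chmod u=rw,go=r {} +

# Keep just the type letter and the nine permission letters from ls -l.
dir_mode=$(ls -ld images/icons | cut -c1-10)
file_mode=$(ls -l images/icons/star.gif | cut -c1-10)

echo "$dir_mode $file_mode"
```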

More Advanced Tip: Chown and the Superuser

Another useful permissions command is chown. It changes the owner of a file. However, you have to be logged in as a user with sufficient privileges (such as root) in order to run it. To make www-data the owner of bg.jpg, run this:

chown www-data images/bg.jpg

This will probably return “Permission denied.” You have to run the command as the superuser. For this, put sudo in front of the command. It will ask for a password (on most systems your own rather than root’s), and it works only if your account has been granted sudo rights:

sudo chown www-data images/bg.jpg

You will definitely need to be the superuser if you want to edit configuration files, such as Apache’s:

sudo vi /etc/httpd/conf/httpd.conf

If you want to become the superuser for every command, run this:

su

This is dangerous, though, because you could easily accidentally remove things — especially if you are using the rm command, and particularly if you’re using it in recursive mode (rm -r), and most especially if you also force the changes and ignore any warnings (rm -r -f).

Conclusion

This article has introduced some very useful Linux commands, a potential asset for any aspiring Web worker and a surefire way to impress a dinner date.

For a few more commands related specifically to website crashes, check out the Smashing Magazine article “What to Do When Your Website Goes Down.” For a broader view, try this list of Linux commands. And the “Mastering the VI Editor” tutorial mentioned above explains vi well.

Hopefully, you now have the tools and confidence to pitch in the next time one of your websites has a problem.

(al)


© Paul Tero for Smashing Magazine, 2012.


Introduction To URL Rewriting

Many Web companies spend hours and hours agonizing over the best domain names for their clients. They try to find a domain name that is relevant and appropriate, sounds professional yet is distinctive, is easy to spell and remember and read over the phone, looks good on business cards and is available as a dot-com.

Or else they spend thousands of dollars to purchase the one they really want, which just happened to be registered by a forward-thinking and hard-to-find squatter in 1998.

They go through all that trouble with the domain name but neglect the rest of the URL, the element after the domain name. It, too, should be relevant, appropriate, professional, memorable, easy to spell and readable. And for the same reasons: to attract customers and improve in search ranking.

Fortunately, there is a technique called URL rewriting that can turn unsightly URLs into nice ones — with a lot less agony and expense than picking a good domain name. It enables you to fill out your URLs with friendly, readable keywords without affecting the underlying structure of your pages.

This article covers the following:

  1. What is URL rewriting?
  2. How can URL rewriting help your search rankings?
  3. Examples of URL rewriting, including regular expressions, flags and conditionals;
  4. URL rewriting in the wild, such as on Wikipedia, WordPress and shopping websites;
  5. Creating friendly URLs;
  6. Changing page names and URLs;
  7. Checklist and troubleshooting.

What Is URL Rewriting?

If you were writing a letter to your bank, you would probably open your word processor and create a file named something like lettertobank.doc. The file might sit in your Documents directory, with a full path like C:\Windows\users\julie\Documents\lettertobank.doc. One file path = one document.

Similarly, if you were creating a banking website, you might create a page named page1.html, upload it, and then point your browser to http://www.mybanksite.com/page1.html. One URL = one resource. In this case, the resource is a physical Web page, but it could be a page or product drawn from a CMS.

URL rewriting changes all that. It allows you to completely separate the URL from the resource. With URL rewriting, you could have http://www.mybanksite.com/aboutus.html taking the user to …/page1.html or to …/about-us/ or to …/about-this-website-and-me/ or to …/youll-never-find-out-about-me-hahaha-Xy2834/. Or to all of these. It’s a bit like shortcuts or symbolic links on your hard drive. One URL = one way to find a resource.

With URL rewriting, the URL and the resource that it leads to can be completely independent of each other. In practice, they’re usually not wholly independent: the URL usually contains some code or number or name that enables the CMS to look up the resource. But in theory, this is what URL rewriting provides: a complete separation.

How Does URL Rewriting Help?

Can you guess what this Web page sells?

http://www.diy.com/diy/jsp/bq/nav.jsp?action=detail&fh_secondid=11577676

B&Q went to all the trouble and expense of acquiring diy.com and implementing a stock-controlled e-commerce website, but left its URLs indecipherable. If you guessed “brown guttering,” you might want to consider playing the lottery.

Even when you search directly for “miniflow gutter brown” on Google UK, B&Q’s page comes up only seventh in the organic search results, below much smaller companies, such as a building supplier with a single outlet in Stirlingshire. B&Q has 300+ branches and so is probably much bigger in budget, size and exposure, so why is it not doing as well for this search term? Perhaps because the other search results have URLs like http://www.prof…co.uk/products/brown-miniflo-gutter-148/; that is, the URL itself contains the words in the search term.

Almost all of these results on Google have the search term in their URLs (highlighted in green). The one at the bottom does not.

Looking at the URL from B&Q, you would (probably correctly) assume that a file named nav.jsp within the directory /diy/jsp/bq/ is used to display products when given their ID number, 11577676 in this case. That is the resource intimately tied to this URL.

So, how would B&Q go about turning this into something more recognizable, like http://www.diy.com/products/miniflow-gutter-brown/11577676, without restructuring its whole website? The answer is URL rewriting.

Another way to look at URL rewriting is as a thin layer that sits on top of a website, translating human- and search-engine-friendly URLs into actual URLs. Doing it is easy because it requires hardly any changes to the website’s underlying structure: no moving files around or renaming things.

URL rewriting basically tells the Web server that
/products/miniflow-gutter-brown/11577676 should show the Web page at: /diy/jsp/bq/nav.jsp?action=detail&fh_secondid=11577676,
without the customer or search engine knowing about it.

Many factors (or “signals”), of course, determine the search ranking for a particular term, over 200 of them according to Google. But friendly and readable URLs are consistently ranked as one of the most important of those factors. They also help humans to quickly figure out what a page is about.

The next section describes how this is done.

How To Rewrite URLs

Whether you can implement URL rewriting on a website depends on the Web server. Apache usually comes with the URL rewriting module, mod_rewrite, already installed. The set-up is very common and is the basis for all of the examples in this article. ISAPI Rewrite is a similar module for Windows IIS but requires payment (about $100 US) and installation.

The Simplest Case

The simplest case of URL rewriting is to rename a single static Web page, and this is far easier than the B&Q example above. To use Apache’s URL rewriting function, you will need to create or edit the .htaccess file in your website’s document root (or, less commonly, in a subdirectory).

For instance, if you have a Web page about horses named Xu8JuefAtua.htm, you could add these lines to .htaccess:

RewriteEngine On
RewriteRule   horses.htm   Xu8JuefAtua.htm

Now, if you visit http://www.mywebsite.com/horses.htm, you’ll actually be shown the Web page Xu8JuefAtua.htm. Furthermore, your browser will remain at horses.htm, so visitors and search engines will never know that you originally gave the page such a cryptic name.

Introducing Regular Expressions

In URL rewriting, you need only match the path of the URL, not including the domain name or the first slash. The rule above essentially tells Apache that if the path contains horses.htm, then show the Web page Xu8JuefAtua.htm. This is slightly problematic, because you could also visit http://www.mywebsite.com/reallyfasthorses.html, and it would still work. So, what we really need is this:

RewriteEngine On
RewriteRule   ^horses.htm$   Xu8JuefAtua.htm

The ^horses.htm$ is not just a search string, but a regular expression, in which special characters, such as ^ . + * ? ( ) [ ] { } and $, have extra significance. The ^ matches the beginning of the URL’s path, and the $ matches the end. This says that the path must begin and end with horses.htm. So, only horses.htm will work, and not reallyfasthorses.htm or horses.html. This is important for search engines like Google, which can penalize what they view as duplicate content, that is, identical pages that can be reached via multiple URLs.
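
You can try out the effect of the anchors from the command line before touching .htaccess. The sketch below uses grep -E as a stand-in for Apache’s pattern matching. That substitution is an assumption made for convenience: Apache uses Perl-compatible regular expressions, but for simple patterns like these the two behave alike. The list of paths is invented for the illustration.

```shell
# Candidate URL paths, one per line (made up for this illustration).
paths='horses.htm
reallyfasthorses.htm
horses.html'

# Without anchors, the pattern may match anywhere inside the path.
unanchored=$(printf '%s\n' "$paths" | grep -c -E 'horses.htm')
# With ^ and $, the pattern has to account for the whole path.
anchored=$(printf '%s\n' "$paths" | grep -c -E '^horses.htm$')

echo "unanchored: $unanchored, anchored: $anchored"
```

All three paths match the unanchored pattern, but only horses.htm matches the anchored one.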

Without File Endings

You can make this even better by ditching the file ending altogether, so that you can visit either http://www.mywebsite.com/horses or http://www.mywebsite.com/horses/:

RewriteEngine On
RewriteRule   ^horses/?$   Xu8JuefAtua.htm  [NC]

The ? indicates that the preceding character is optional. So, in this case, the URL would work with or without the slash at the end. These would be considered duplicate URLs by a search engine, but accepting both helps prevent confusion if people (or link checkers) accidentally add a slash. The stuff in brackets at the end of the rule gives Apache some further pointers. [NC] is a flag that means that the rule is case insensitive, so http://www.mywebsite.com/HoRsEs would also work.
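
The optional slash and the case-insensitive flag can be tried out the same way. In this sketch, grep -E with -i is a stand-in for the rule with [NC] (an approximation chosen for convenience, not how Apache itself is tested), and the list of paths is invented.

```shell
# Candidate URL paths, one per line (made up for this illustration).
urls='horses
horses/
HoRsEs
horses//
horseshoes'

# -i ignores case, much as [NC] does; /? allows zero or one trailing slash.
matched=$(printf '%s\n' "$urls" | grep -i -E '^horses/?$' | tr '\n' ' ')
matched_count=$(printf '%s\n' "$urls" | grep -i -c -E '^horses/?$')

echo "matched: $matched"
echo "count: $matched_count"
```

The first three paths match. The double slash and horseshoes do not, because the $ leaves no room for anything after the optional slash.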

Wikipedia Example

We can now look at a real-world example. Wikipedia appears to use URL rewriting, passing the title of the page to a PHP file. For instance…

http://en.wikipedia.org/wiki/Barack_obama

… is rewritten to:

http://en.wikipedia.org/w/index.php?title=Barack_obama

This could well be implemented with an .htaccess file, like so:

RewriteEngine On
#Look for the word "wiki" followed by a slash, and then the article title
RewriteRule   ^wiki/(.+)$   w/index.php?title=$1   [L]

The previous rule had /?, which meant zero or one slashes. If it had said /+, it would have meant one or more slashes, so even http://www.mywebsite.com/horses//// would have worked. In this rule, the dot (.) matches any character, so .+ matches one or more of any character (that is, essentially anything). And the parentheses, ( ), ask Apache to remember what the .+ matched. The rule above, then, tells Apache to look for wiki/ followed by one or more of any character and to remember what it is. The remembered text is then inserted wherever $1 appears in the substitution. So, when the rewriting is finished, wiki/Barack_obama becomes w/index.php?title=Barack_obama.

Thus, the page w/index.php is called, passing Barack_obama as a parameter. The w/index.php is probably a PHP page that runs a database lookup — like SELECT * FROM articles WHERE title='Barack obama' — and then outputs the HTML.
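The remember-and-substitute step can be sketched in a few lines of JavaScript, whose replace method happens to use the same $1 notation. The function name rewriteWikiPath is hypothetical, and this is only a model of the rule, not Wikipedia's or Apache's actual code.

```javascript
// Hypothetical helper: applies the wiki rule to a path, or returns null if it does not match.
function rewriteWikiPath(path) {
  const rule = /^wiki\/(.+)$/;          // the parentheses capture the article title
  if (!rule.test(path)) return null;    // rule does not apply to this path
  return path.replace(rule, 'w/index.php?title=$1'); // $1 is the captured title
}

console.log(rewriteWikiPath('wiki/Barack_obama'));
console.log(rewriteWikiPath('about'));
```

A path of just wiki/ is not rewritten, because .+ needs at least one character after the slash.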

You can also view Wikipedia entries directly, without the URL rewriting.

Comments and Flags

The example above also introduced comments. Anything after a # is ignored by Apache, so it’s a good idea to explain your rewriting rules so that future generations can understand them. The [L] flag means that if this rule matches, Apache can stop now. Otherwise, Apache would continue applying subsequent rules, which is a powerful feature but unnecessary for all but the most complex rule sets.

Implementing the B&Q Example

The recommendation for B&Q above could be implemented with an .htaccess file, like so:

RewriteEngine On
#Look for the word "products" followed by slash, product title, slash, id number
RewriteRule  ^products/.*/([0-9]+)$   diy/jsp/bq/nav.jsp?action=detail&fh_secondid=$1 [NC,L]

Here, the .* matches zero or more of any character, so nothing or anything. And the [0-9] matches a single numerical digit, so [0-9]+ matches one or more numbers.

The next section covers a couple of more complex conditional examples. You can also read the Apache rewriting guide for much more information on all that URL rewriting has to offer.

Conditional Rewriting

URL rewriting can also include conditions and make use of environment variables. These two features make for an easy way to redirect requests from one domain alias to another. This is especially useful if a website changes its domain, from mywebsite.co.uk to mywebsite.com for example.

Domain Forwarding

Most domain registrars allow for domain forwarding, which redirects all requests from one domain to another domain, but which might send requests for www.mywebsite.co.uk/horses to the home page at www.mywebsite.com and not to www.mywebsite.com/horses. You can achieve this with URL rewriting instead:

RewriteEngine On
RewriteCond   %{HTTP_HOST}   !^www.mywebsite.com$         [NC]
RewriteRule   (.*)           http://www.mywebsite.com/$1  [L,R=301]

The second line in this example is a RewriteCond, rather than a RewriteRule. It is used to compare an Apache environment variable on the left (such as the host name in this case) with a regular expression on the right. Only if this condition is true will the rule on the next line be considered.

In this case, %{HTTP_HOST} represents www.mywebsite.co.uk, the host (i.e. domain) that the browser is trying to visit. The ! means “not.” This tells Apache, if the host does not begin and end with www.mywebsite.com, then remember and rewrite zero or more of any character to www.mywebsite.com/$1. This converts www.mywebsite.co.uk/anything-at-all to www.mywebsite.com/anything-at-all. And it will work for all other aliases as well, like www.mywebsite.biz/anything-at-all and mywebsite.com/anything-at-all.

The flag [R=301] is very important. It tells Apache to do a 301 (i.e. permanent) redirect. Apache will send the new URL back to the browser or search engine, and the browser or search engine will have to request it again. Unlike all of the examples above, the new URL will now appear in the browser’s location bar. And search engines will take note of the new URL and update their databases. [R] by itself is the same as [R=302] and signifies a temporary redirect.
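The decision that the condition and rule make together can be modelled as a small function. This is a sketch under assumptions: canonicalRedirect and CANONICAL_HOST are hypothetical names, and the path is passed without its leading slash, as it would be in an .htaccess file.

```javascript
// Hypothetical model of the RewriteCond/RewriteRule pair above.
const CANONICAL_HOST = 'www.mywebsite.com';

function canonicalRedirect(host, path) {
  // [NC]: host names are compared case-insensitively
  if (host.toLowerCase() === CANONICAL_HOST) {
    return null; // the condition fails, so the rule is skipped and nothing is redirected
  }
  // Any other host: permanent redirect to the same path on the canonical host
  return { status: 301, location: `http://${CANONICAL_HOST}/${path}` };
}

console.log(canonicalRedirect('www.mywebsite.co.uk', 'horses'));
console.log(canonicalRedirect('www.mywebsite.com', 'horses'));
```

Because the canonical host never triggers the redirect, a request that has already been redirected is not redirected again.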

File Existence and WordPress

Smashing Magazine runs on the popular blogging software WordPress. WordPress enables the author to choose their own URL, called a “slug.” Then, it automatically prepends the date, such as http://coding.smashingmagazine.com/2011/09/05/getting-started-with-the-paypal-api/. In your pre-URL rewriting days, you might have assumed that Smashing Magazine’s Web server was actually serving up a file located at …/2011/09/05/getting-started-with-the-paypal-api/index.html. In fact, WordPress uses URL rewriting extensively.

WordPress enables the author to choose their own URL for an article.

WordPress’ .htaccess file looks like this:

RewriteEngine On
RewriteBase /  
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]

The -f means “this is a file” and -d means “this is a directory.” This tells Apache, if the requested file name is not a file, and the requested file name is not a directory, then rewrite everything (i.e. any path containing any character) to the page index.php. If you are requesting an existing image or the log-in page wp-login.php, then the rule is not triggered. But if you request anything else, like /2011/09/05/getting-started-with-the-paypal-api/, then the file index.php jumps into action.

Internally, index.php (probably) looks at the environment variable $_SERVER['REQUEST_URI'] and extracts the information that it needs to find out what it is looking for. This gives it even more flexibility than Apache’s rewrite rules and enables WordPress to mimic some very sophisticated URL rewriting rules. In fact, when administering a WordPress blog, you can go to Settings → Permalink on the left side, and choose the type of URL rewriting that you would like to mimic.

WordPress’ permalink settings, letting you choose the type of URL rewriting that you would like to mimic.
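To give a feel for what a front controller might do with the requested URI, here is a minimal sketch that pulls the date and slug out of a permalink. It is not WordPress's actual code; the function parsePermalink and the date-based URL shape are assumptions for illustration.

```javascript
// Hypothetical front-controller logic, not WordPress's real implementation.
function parsePermalink(requestUri) {
  const path = requestUri.split('?')[0]; // ignore any query string
  // Expect /YYYY/MM/DD/slug with an optional trailing slash
  const match = /^\/(\d{4})\/(\d{2})\/(\d{2})\/([^\/]+)\/?$/.exec(path);
  if (!match) return null;               // not a dated permalink
  const [, year, month, day, slug] = match;
  return { year: Number(year), month: Number(month), day: Number(day), slug };
}

console.log(parsePermalink('/2011/09/05/getting-started-with-the-paypal-api/'));
```

With the slug and date in hand, the script could look up the matching article in its database.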

Rewriting Query Strings

If you are hired to recreate an existing website from scratch, you might use URL rewriting to redirect the 20 most popular URLs on the old website to the locations on the new website. This could involve redirecting things like prod.php?id=20 to products/great-product/2342, which itself gets redirected to the actual product page.

Apache’s RewriteRule applies only to the path in the URL, not to parameters like id=20. To do this type of rewriting, you will need to refer to the Apache environment variable %{QUERY_STRING}. This can be accomplished like so:

RewriteEngine On
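RewriteCond   %{QUERY_STRING}           ^id=20$
#The trailing ? drops the old query string (id=20) from the redirect target
RewriteRule   ^prod\.php$               products/great-product/2342?   [L,R=301]
#$2 is the second remembered group: the ID number
RewriteRule   ^products/(.*)/([0-9]+)$  productview.php?id=$2          [L]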

In this example, the first RewriteRule triggers a permanent redirect from the old website’s URL to the new website’s URL. The second rule rewrites the new URL to the actual PHP page that displays the product.
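If there are many old URLs to redirect, the mapping can also be kept in a lookup table in application code. The following is a minimal sketch of that idea; the table OLD_TO_NEW, the function redirectOldUrl and the base URL are all hypothetical.

```javascript
// Hypothetical lookup table of old product IDs to new friendly paths.
const OLD_TO_NEW = new Map([
  ['20', 'products/great-product/2342'],
]);

function redirectOldUrl(oldUrl) {
  // Parse the path and the query string separately, as Apache does
  const url = new URL(oldUrl, 'http://www.mywebsite.com/');
  if (url.pathname !== '/prod.php') return null;        // not an old product URL
  const target = OLD_TO_NEW.get(url.searchParams.get('id'));
  if (target === undefined) return null;                // unknown ID: leave it alone
  return { status: 301, location: '/' + target };       // old query string is dropped
}

console.log(redirectOldUrl('prod.php?id=20'));
```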

Examples Of URL Rewriting On Shopping Websites

For complex content-managed websites, there is still the issue of how to map friendly URLs to underlying resources. The simple examples above did that mapping by hand, manually associating a URL like horses.htm with the file or resource Xu8JuefAtua.htm. Wikipedia looks up the resource based on the title, and WordPress applies some complex internal rule sets. But what if your data is more complex, with thousands of products in hundreds of categories? This section shows the approach that Amazon and many other shopping websites take.

If you’ve ever come across a URL like this on Amazon, http://www.amazon.co.uk/High-Voltage-AC-DC/dp/B00008AJL3, you might have assumed that Amazon’s website has a subdirectory named /High-Voltage-AC-DC/dp/ that contains a file named B00008AJL3.

This is very unlikely. You could try changing the name of the top-level “directory” and you would still arrive on the same page, http://www.amazon.co.uk/Test-Voltage-AC-DC/dp/B00008AJL3.

The bit at the end is what really matters. Looking down the page, you’ll see that B00008AJL3 is this AC/DC album’s ASIN (Amazon Standard Identification Number). If you change that, you’ll get a “Page not found” or an entirely different product: http://www.amazon.co.uk/High-Voltage-AC-DC/dp/B003BEZ7HI.

The /dp/ also matters. Changing this leads to a “Page not found.” So, the B00008AJL3 probably tells Amazon what to display, and the dp tells the website how to display it. This is URL rewriting in action, with the original URL possibly ending up getting rewritten to something like:
http://www.amazon.co.uk/displayproduct.php?asin=B00008AJL3.

Features of an Amazon URL

This introduces some important features of Amazon’s URLs that can be applied to any website with a complex set of resources. It shows that the URL can be automatically generated and can include up to three parts:

  1. The words. In this case, the words are based on the album and artist, and all non-alphanumeric characters are replaced. So, the slash in AC/DC becomes a hyphen. This is the bit that helps humans and search engines.
  2. An ID number. Or something that tells the website what to look up, such as B00008AJL3.
  3. An identifier. Or something that tells the website where to look for it and how to display it. If dp tells Amazon to look for a product, then somewhere along the line, it probably triggers a database statement such as SELECT * FROM products WHERE id='B00008AJL3'.
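The three parts listed above can be separated with a single pattern. This sketch assumes the URL shape /words/identifier/id seen in the Amazon example; the function parseProductPath is hypothetical and says nothing about how Amazon actually processes its URLs.

```javascript
// Hypothetical parser for paths shaped like /words/identifier/id.
function parseProductPath(path) {
  const match = /^\/([^\/]+)\/([a-z]+)\/([A-Z0-9]+)$/.exec(path);
  if (!match) return null;
  return {
    words: match[1],      // for humans and search engines; ignored by the lookup
    identifier: match[2], // where to look and how to display, e.g. "dp"
    id: match[3],         // what to look up, e.g. the ASIN
  };
}

console.log(parseProductPath('/High-Voltage-AC-DC/dp/B00008AJL3'));
console.log(parseProductPath('/Test-Voltage-AC-DC/dp/B00008AJL3'));
```

Both calls yield the same identifier and ID, which is why changing the words still leads to the same product.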

Other Shopping Examples

Many other shopping websites have URLs like this. Each URL in the list below contains an ID number and a (suspected) identifier:

  • http://www.ebay.co.uk/itm/Ian-Rankin-Set-Darkness-Rebus-Novel-/140604842997
  • http://www.kelkoo.com/c-138201-lighting/brand/caravan
  • http://www.ciao.co.uk/Fridge_Freezers_5266430_3
  • http://www.gumtree.com/p/for-sale/boys-bmx-bronx-blaze/97669042
  • http://www.comet.co.uk/c/Televisions/LCD-Plasma-LED-TVs/1844

A significant benefit of this type of URL is that the actual words can be changed, as shown below. As long as the ID number stays the same, the URL will still work. So products can be renamed without breaking old links. More sophisticated websites (like Ciao above) will redirect the changed URL back to the real one and thus avoid creating the appearance of duplicate content (see below for more on this topic).

Websites that use URL rewriting are more flexible with their URLs — the words can change but the page will still be found.

Friendly URLs

Now you know how to map nice friendly URLs to their underlying Web pages, but how should you create those friendly URLs in the first place?

If we followed the current advice, we would separate words with hyphens rather than underscores and capitalize consistently. Lowercase might be preferable because most people search in lowercase. Punctuation such as dots and commas should also be turned into hyphens, otherwise they would get turned into things like %2C, which look ugly and might break the URL when copied and pasted. You might want to remove apostrophes and parentheses entirely for the same reason.

Whether to replace accented characters is debatable. URLs with accents (or any non-Roman characters) might look bad or break when rendered in a different character format. But replacing them with their non-accented equivalents might make the URLs harder for search engines to find (and even harder if replaced with hyphens). If your website is for a predominately French audience, then perhaps leave the French accents in. But substitute them if the French words are few and far between on a mainly English website.

This PHP function succinctly handles all of the above suggestions:

function GenerateUrl ($s) {
  //Convert accented characters, and remove parentheses and apostrophes
  $from = explode (',', "ç,æ,œ,á,é,í,ó,ú,à,è,ì,ò,ù,ä,ë,ï,ö,ü,ÿ,â,ê,î,ô,û,å,e,i,ø,u,(,),[,],'");
  $to = explode (',', 'c,ae,oe,a,e,i,o,u,a,e,i,o,u,a,e,i,o,u,y,a,e,i,o,u,a,e,i,o,u,,,,,,');
  //Do the replacements, and convert all other non-alphanumeric characters to hyphens
  $s = preg_replace ('~[^\w\d]+~', '-', str_replace ($from, $to, trim ($s)));
  //Remove a - at the beginning or end and make lowercase
  return strtolower (preg_replace ('/^-/', '', preg_replace ('/-$/', '', $s)));
}

This would generate URLs like this:

echo GenerateUrl ("Pâtisserie (Always FRESH!)"); //returns "patisserie-always-fresh"

Or, if you wanted a link to a $product variable to be pulled from a database:

$product = array ('title'=>'Great product', 'id'=>100);
echo '<a href="' . GenerateUrl ($product['title']) . '/' . $product['id'] . '">';
echo $product['title'] . '</a>';
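If the friendly URLs also need to be generated in the browser or in server-side JavaScript, a comparable function might look like the sketch below. It is an approximation rather than a line-for-line port: it strips accents with Unicode normalization instead of a replacement table, and unlike the PHP version it also turns underscores into hyphens.

```javascript
// A sketch of a JavaScript counterpart to GenerateUrl (not an exact port).
function generateUrl(title) {
  return title
    .trim()
    .replace(/æ/gi, 'ae').replace(/œ/gi, 'oe').replace(/ø/gi, 'o') // letters that do not decompose
    .normalize('NFD')                // split accented letters into base letter + combining mark
    .replace(/[\u0300-\u036f]/g, '') // drop the combining marks
    .replace(/[()\[\]']/g, '')       // remove parentheses, brackets and apostrophes
    .replace(/[^A-Za-z0-9]+/g, '-')  // everything else becomes a single hyphen
    .replace(/^-|-$/g, '')           // trim a hyphen at the beginning or end
    .toLowerCase();
}

console.log(generateUrl('Pâtisserie (Always FRESH!)')); // patisserie-always-fresh
```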

Changing Page Names

Search engines generally ignore duplicate content (i.e. multiple pages with the same information). But if they think they are being manipulated, search engines will actively penalize the website, so avoid this where possible. Google recommends using 301 redirects to send users from old pages to new ones.

When a URL-rewritten page is renamed, the old URL and new URL should both still work. Furthermore, to avoid any risk of duplication, the old URL should automatically redirect to the new one, as WordPress does.

Doing this in PHP is relatively easy. The following function looks at the current URL, and if it’s not the same as the desired URL, it redirects the user:

function CheckUrl ($s) {
  // Get the current URL without the query string, with the initial slash
  $myurl = preg_replace ('/\?.*$/', '', $_SERVER['REQUEST_URI']);
  //If it is not the same as the desired URL, then redirect
  if ($myurl != "/$s") {Header ("Location: /$s", true, 301); exit;}
}

This would be used like so:

$producturl = GenerateUrl ($product['title']) . '/' . $product['id'];
CheckUrl ($producturl); //redirects the user if they are at the wrong place

If you would like to use this function, be sure to test it in your environment first and with your rewrite rules, to make sure that it does not cause any infinite redirects. This is what that would look like:

This is what happens when Google Chrome visits a page that redirects to itself.
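One way to reason about redirect loops is to separate the decision from the side effect: a function that only returns what should happen can be called again on its own result to confirm that the second answer is "do nothing". The sketch below is a hypothetical JavaScript model of that check, not a replacement for CheckUrl.

```javascript
// Hypothetical model: decide whether a request needs a 301 to its canonical URL.
function checkUrl(requestUri, desiredPath) {
  const currentPath = requestUri.split('?')[0]; // compare without the query string
  const canonical = '/' + desiredPath;
  if (currentPath === canonical) return null;   // already in the right place: no redirect
  return { status: 301, location: canonical };
}

const first = checkUrl('/old-product-name/100', 'great-product/100');
console.log(first);                                         // a redirect to /great-product/100
console.log(checkUrl(first.location, 'great-product/100')); // null, so there is no loop
```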

Checklist And Troubleshooting

Use the following checklist to implement URL rewriting.

1. Check That It’s Supported

Not all Web servers support URL rewriting. If you put up your .htaccess file on one that doesn’t, it will be ignored or will throw up a “500 Internal Server Error.”

2. Plan Your Approach

Figure out what will get mapped to what, and how the correct information will still get found. Perhaps you want to introduce new URLs, like my-great-product/p/123, to replace your current product URLs, like product.php?id=123, and to substitute new-category/c/12 for category.php?id=12.

3. Create Your Rewrite Rules

Create an .htaccess file for your new rules. You can initially do this in a /testing/ subdirectory and using the [R] flag, so that you can see where things go:

RewriteEngine On
RewriteRule   ^.+/p/([0-9]+)   product.php?id=$1    [NC,L,R]
RewriteRule   ^.+/c/([0-9]+)   category.php?id=$1    [NC,L,R]

Now, if you visit www.mywebsite.com/testing/my-great-product/p/123, you should be sent to www.mywebsite.com/testing/product.php?id=123. You’ll get a “Page not found” because product.php is not in your /testing/ subdirectory, but at least you’ll know that your rules work. Once you’re satisfied, move the .htaccess file to your document root and remove the [R] flag. Now www.mywebsite.com/my-great-product/p/123 should work.

4. Check Your Pages

Test that your new URLs bring in all the correct images, CSS and JavaScript files. For example, the Web browser now believes that your Web page is named 123 in a directory named my-great-product/p/. If the HTML refers to a file named images/logo.jpg, then the Web browser would request the image from www.mywebsite.com/my-great-product/p/images/logo.jpg and would come up with a “File not found.”

You would need to also rewrite the image locations or make the references absolute (like <img src="/images/logo.jpg"/>) or put a base href at the top of the <head> of the page (<base href="/product.php"/>). But if you do that, you would need to fully specify any internal links that begin with # or ? because they would now go to something like product.php#details.
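The way a browser resolves these references can be reproduced with the standard URL class, which follows the same resolution rules. This is only a demonstration using the example addresses from this checklist.

```javascript
// The page URL as the browser sees it after rewriting
const page = 'http://www.mywebsite.com/my-great-product/p/123';

// A relative reference is resolved against the page's apparent directory
const relative = new URL('images/logo.jpg', page).href;

// A reference starting with / is resolved against the site root
const absolute = new URL('/images/logo.jpg', page).href;

// With <base href="/product.php"/>, a bare #fragment link points at product.php
const withBase = new URL('#details', 'http://www.mywebsite.com/product.php').href;

console.log(relative); // http://www.mywebsite.com/my-great-product/p/images/logo.jpg
console.log(absolute); // http://www.mywebsite.com/images/logo.jpg
console.log(withBase); // http://www.mywebsite.com/product.php#details
```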

5. Change Your URLs

Now find all references to your old URLs, and replace them with your new URLs, using a function such as GenerateUrl to consistently create the new URLs. This is the only step that might require looking deep into the underlying code of your website.

6. Automatically Redirect Your Old URLs

Now that the URL rewriting is in place, you probably want Google to forget about your old URLs and start using the new ones. That is, when a search result brings up product.php?id=123, you’d want the user to be visibly redirected to my-great-product/p/123, which would then be internally rewritten back to product.php?id=123.

This is the reverse of what your URL rewriting already does. In fact, you could add another rule to .htaccess to achieve this, but if you get the rules in the wrong order, then the browser would go into a redirect loop.

Another approach is to do the first redirect in PHP, using something like the CheckUrl function above. This has the added advantage that if you rename the product, the old URL will immediately become invalid and redirect to the newest one.

7. Update and Resubmit Your Site Map

Make sure to carry through your new URLs to your site map, your product feeds and everywhere else they appear.

Conclusion

URL rewriting is a relatively quick and easy way to improve your website’s appeal to customers and search engines. We’ve tried to explain some real examples of URL rewriting and to provide the technical details for implementing it on your own website. Please leave any comments or suggestions below.

(al)


© Paul Tero for Smashing Magazine, 2011.


My Favorite Programming Mistakes

Over my programming career, I have made a lot of mistakes in several different languages. In fact, if I write 10 or more lines of code and it works the first time, I’ll get a bit suspicious and test it more rigorously than usual. I would expect to find a syntax error or a bad array reference or a misspelled variable or something.

Coastline near Mwnt on the west coast of Wales. Read on to find out why this is halfway to being a very special place.

I like to classify these mistakes into three broad groups: cock-ups (or screw-ups in American English), errors and oversights.

A cock-up is when you stare blankly at the screen and whisper “Oops”: things like deleting a database or website, or overwriting three days’ worth of work, or accidentally emailing 20,000 people.

Errors cover everything, from simple syntax errors like forgetting a } to fatal errors and computational errors.

When an error is so subtle and hard to find that it is almost beautiful, I would call it an oversight. This happens when a block of code is forced to handle a completely unforeseen and very unlikely set of circumstances. It makes you sit back and think “Wow”: like seeing a bright rainbow or shooting star, except a bit less romantic and not quite as impressive when described to one’s partner over a candlelit dinner.

This article discusses some of the spectacular and beautiful mistakes I have made, and the lessons learned from them. The last three are my favorites.

Leaving Debug Mode On

The first two mistakes in this article were full-fledged cock-ups.

When I first started freelancing, I wrote a set of PHP libraries for handling database queries, forms and page templating. I built a debugging mode into the libraries at a fairly deep level, which depended on a global variable called $DEBUG.

I also kept a local copy of every major website I worked on, for developing, debugging and testing. So, whenever a problem occurred, I could set $DEBUG=1; at the top of the page, and it would tell me various things, such as all the database statements it was running. I rarely used this debug method on live websites; it was for local usage only.

Except for one day when I was working late at night, debugging a minor problem on a popular e-commerce website. I put $DEBUG=1; at the top of several pages and was switching between them. It was all a tired midnight blur, but in the end I somehow added the debugging variable to the most important page on the website, the one after the user clicks “Pay now,” and I uploaded it to the live website.

The next morning, I went out early for the whole day. I got home at 9:00 pm to find 12 increasingly frustrated messages on my answering machine and a lot more emails. For about 20 hours, whenever a customer clicked pay, they saw something like this:

What customers saw when they clicked “Pay.”

It took me about 10 seconds to fix, but a lot longer to apologize to my client for a day’s worth of lost orders.

Lessons Learned

I held an internal inquiry into this issue and established the following:

  1. Avoid working late at night;
  2. Make a full test order whenever I make a change to the order processing, however minor;
  3. Make sure debug statements never see the light of day on a live website;
  4. Provide some emergency contact details for me and/or a back-up programmer.

Thoughtful Debugging

For the third requirement, I implemented a couple of functions like this, to make sure that debugging messages are outputted only when I am looking at the website:

function CanDebug() {
 global $DEBUG;
 $allowed = array ('127.0.0.1', '81.1.1.1');
 if (in_array ($_SERVER['REMOTE_ADDR'], $allowed)) return $DEBUG;
 else return 0;
}
function Debug ($message) {
  if (!CanDebug()) return;
  echo '<div style="background:yellow; color:black; border: 1px solid black;';
  echo 'padding: 5px; margin: 5px; white-space: pre;">';
  if (is_string ($message)) echo $message;
  else var_dump ($message);
  echo '</div>';
}

Then, whenever I want to output something for debugging, I call the Debug function. This calls CanDebug to check the requesting IP address and the $DEBUG variable. The $allowed array contains my IP address for local testing (127.0.0.1) and my broadband IP address, which I can get from WhatIsMyIPAddress.com.

Then I can output things like this:

$DEBUG = 1;
Debug ("The total is now $total"); //output a debugging message
Debug ($somevariable); //output a variable
Debug ("About to run: $query"); //before running any database query
mysql_query ($query);

And I can be confident that no one but me (or anyone sharing my IP address, such as my boss) will ever see any debugging messages. Assuming that the variables above were set, the above code would look like this:

Outputting debugging statements.

For extra safety, I could have also put the error messages inside HTML comments, but then I would have had to sift through the HTML source to find the bit I was looking for.

I have another related useful bit of code that I can put at the top of a page or configuration file to ensure that all PHP notices, warnings and errors will be shown to me and only me. If the person is not me, then errors and warnings will be outputted to the error log but not shown on screen:

if (CanDebug()) {ini_set ('display_errors', 1); error_reporting (E_ALL);}
else {ini_set ('display_errors', 0); error_reporting (E_ALL & ~E_NOTICE);}

Debuggers

The method above is useful for quickly finding errors in very specific bits of code. There are also various debugging tools, such as FirePHP and Xdebug, that can provide a huge amount of information about a PHP script. They can also run invisibly, outputting a list of every function call to a log file with no output to the user.

Xdebug can be used like this:

ini_set ('xdebug.collect_params', 1);
xdebug_start_trace ('/tmp/mytrace');
echo substr ("This will be traced", 0, 10);
xdebug_stop_trace();

This bit of code logs all function calls and arguments to the file /tmp/mytrace.xt, which will look like this:

Contents of an Xdebug stack trace showing every function call.

Xdebug also displays much more information whenever there is a PHP notice, warning or error. However, it needs to be installed on the server, so it is probably not possible in most live hosting environments.

FirePHP, on the other hand, works as a PHP library that interacts with an add-on to Firebug, a plug-in for Firefox. You can output stack traces and debugging information directly from PHP to the Firebug console — again, invisible to the user.

For both of these methods, a function like CanDebug above is still useful for making sure that not everyone with Firebug can view the stack traces or generate big log files on the server.

Turning Debug Mode Off

Debugging emailing scripts is more involved. Definitively testing whether a script is sending an email properly is hard without actually sending the email. Which I once did by mistake.

A few years ago, I was asked to create a bulk emailing script to send daily emails to over 20,000 subscribed users. During development, I used something similar to the CanDebug function above, so that I could test the emailing script without actually sending an email. The function to send emails looked something like this:

function SendEmail ($to, $from, $subject, $message) {
  if (CanDebug() >= 10) Debug ("Would have emailed $to:\n$message");
  else {
    if (CanDebug()) {$subject = "Test to $to: $subject"; $to = "test@test.com";}
    mail ($to, $subject, $message, "From: $from");
  }
}

If I set $DEBUG=1, it would send the emails (all 20,000 of them) to a test address that I could check. If I set $DEBUG=10, it would tell me that it was trying to send an email but not actually send anything.

Soon after launch, a problem arose with the script. I think it ran out of memory from doing some inefficient processing 20,000 times. At some point, I went in to fix something, forgot to set my $DEBUG variable (or else my broadband IP address had inconveniently changed) and mistakenly emailed 20,000 people.

I apologized to the agency I was working for, but thankfully nothing much came of it. I guess that spam filters blocked many of the messages. Or perhaps the recipients were merely pleased that the email did not contain anything for them to do or read.

Lessons Learned

I was very glad that I just put “test” in the subject and message of the test email, and not some statement reflecting how frustrated I was getting at that particular bug. I learned a few lessons:

  1. Be extra careful when testing bulk emailing scripts — check that the debug mode is working.
  2. Send test emails to as few people as possible.
  3. Always send polite test messages, like “Please ignore, just testing.” Don’t say something like “My client is a ninny,” in case it gets sent to 20,000 unsuspecting investors.
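One way to act on the first lesson is to make the safe behaviour the default, so that forgetting to set a flag results in nothing being sent rather than everything. The sketch below illustrates that design in JavaScript; the function, its mode option and the transport callback are hypothetical stand-ins, not the code from the original script.

```javascript
// Hypothetical sketch: `transport` stands in for whatever actually delivers the mail.
function sendEmail(message, { mode = 'dry-run', testAddress = 'test@test.com', transport } = {}) {
  if (mode === 'dry-run') {
    // Default: report what would happen, but send nothing
    return { sent: false, note: `Would have emailed ${message.to}` };
  }
  if (mode === 'test') {
    // Divert the message to a test mailbox and record the real recipient in the subject
    message = { ...message, to: testAddress, subject: `Test to ${message.to}: ${message.subject}` };
  } else if (mode !== 'live') {
    throw new Error(`Unknown mode: ${mode}`);
  }
  transport(message);
  return { sent: true, to: message.to };
}

// Usage with a fake transport that just collects messages
const outbox = [];
const result = sendEmail(
  { to: 'reader@example.com', subject: 'Daily news', body: 'Hello' },
  { transport: (m) => outbox.push(m) }
);
console.log(result.sent, outbox.length);
```

Real recipients are only reached when the caller explicitly passes mode: 'live'.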

PHP Blank Page

Now we’re in the realm of hard-to-spot errors, rather than cock-ups. If you’d like to see a hard-to-debug error in PHP, bury the following somewhere deep in your code:

function TestMe() {TestMe();}
TestMe();

Depending on the browser and the server’s Apache and PHP versions, you might get a blank page, a “This Web page is not available,” a fatal error due to running out of memory, or the option to “Save” or “Open” the page, like this:

Infinite recursion, as dealt with by Firefox 3.6.

It basically causes infinite recursion, which can cause a Web server thread to run out of memory and/or crash. If it crashes, a small trace may or may not be left in the error log:

[Mon Jun 06 18:24:10 2011] [notice] child pid 7192
  exit signal Segmentation fault (11)

But this gives little indication of where or why the error occurred. And all of the quick debugging techniques of adding lines of output here or there may not help much, because as long as the offending code gets executed, the page will seem to fail in its entirety. This is mainly because PHP only periodically sends the HTML it generates to the browser. So, adding a lot of flush(); statements will at least show you what your script was doing immediately before the recursive error.

Of course, the code that leads to this error might be much more convoluted than the above. It could involve classes calling methods in other classes that refer back to the original classes. And it might only happen in certain hard-to-duplicate circumstances and only because you’ve changed something else somewhere else.

Lessons Learned

  1. Know the locations of error log files, in case something gets recorded there.
  2. This is where stack-tracing debuggers such as Xdebug can be really handy.
  3. Otherwise, set aside plenty of time to go through the code line by line, commenting out bits until it works.
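Where recursion is intentional, an explicit depth limit can turn a silent crash into an error message that names the problem. The sketch below shows the idea in JavaScript, where runaway recursion raises a catchable error instead of crashing the process (so the symptoms differ from the PHP case above). The names countNodes and MAX_DEPTH are hypothetical.

```javascript
// Runaway recursion, as in the TestMe example
function testMe() { testMe(); }

let caught = null;
try {
  testMe();
} catch (error) {
  caught = error; // the engine gives up once its call stack is exhausted
}
console.log(caught !== null);

// A guarded recursive function: fails early with a descriptive message
const MAX_DEPTH = 100; // assumed limit; pick one that suits the data

function countNodes(node, depth = 0) {
  if (depth > MAX_DEPTH) {
    throw new Error(`Recursion deeper than ${MAX_DEPTH}: possible cycle`);
  }
  const children = node.children || [];
  return 1 + children.reduce((sum, child) => sum + countNodes(child, depth + 1), 0);
}

console.log(countNodes({ children: [{}, {}] })); // 3
```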

Wrong Variable Type

This error often happens with databases. Given the following SQL statements…

CREATE TABLE products (
  id INT PRIMARY KEY AUTO_INCREMENT,
  name VARCHAR(60),
  category VARCHAR(10),
  price DECIMAL(6,2)
);
INSERT INTO products VALUES (1, 'Great Expectations', 'book', 12.99);
INSERT INTO products VALUES (2, 'Meagre Expectations', 'cd', 2.50);
INSERT INTO products VALUES (3, 'Flared corduroys', 'retro clothing', 25);

… can you guess what is returned when you run the following?

SELECT * FROM products WHERE category='retro clothing';

The answer is nothing, because the category column is only 10 characters long, and so the category of the last product is cut off at retro clot. Recently edited products or new menu items suddenly disappearing can create a lot of confusion. But fixing this is generally very easy:

ALTER TABLE products MODIFY category VARCHAR(30);
UPDATE products SET category='retro clothing' WHERE category='retro clot';

The category has been cut off after 10 characters, as shown in phpMyAdmin.
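A cheap safeguard is to check the length of a value against the column size before saving it, so that the problem surfaces as a validation message instead of a silently shortened value. The sketch below simulates the truncation described above; the helper names are hypothetical and no database is involved.

```javascript
const CATEGORY_MAX_LENGTH = 10; // mirrors VARCHAR(10) in the table definition above

// Simulates what the database did: keep only the first maxLength characters
function truncateLikeColumn(value, maxLength) {
  return value.slice(0, maxLength);
}

// Validation to run before saving
function fitsColumn(value, maxLength) {
  return value.length <= maxLength;
}

console.log(truncateLikeColumn('retro clothing', CATEGORY_MAX_LENGTH)); // retro clot
console.log(fitsColumn('retro clothing', CATEGORY_MAX_LENGTH));         // false
console.log(fitsColumn('book', CATEGORY_MAX_LENGTH));                   // true
```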

I made a more serious error with the first major e-commerce website that I worked on. At the end of the ordering process, the website would ask the customer for their credit card details and then call a Java program, which would send a request to Barclays ePDQ system to take the payment. The amount was sent as the number of pence. Not being very familiar with Java, I based the code on an example I found, which represented the total as a short integer:

short total;

The Java program was called on the command line. If it returned nothing, then the transaction was considered successful, emails were sent, and the order was fulfilled. If there was an error in processing the card, the program returned something like “Card not authorized” or “Card failed fraud checks.”

Short integers can store a value between -32768 and +32767. This seemed plenty to me. But I overlooked the fact that this was in pence, not pounds, so the highest possible total was actually £327.67. And the really bad news was that if the amount was higher than that, then the Java program simply crashed and returned nothing, which looked exactly like a successful order and was processed as normal.
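The limits of a 16-bit integer are easy to see with a typed array, which stores values in the same range as a Java short. The sketch below shows what happens when a larger number is forced into 16 bits (it wraps around, which is one possible failure mode and not necessarily what the original Java program did) and a simple range check that could be run first. The helper names are hypothetical.

```javascript
const SHORT_MIN = -32768;
const SHORT_MAX = 32767;

// Forces a number into 16 bits, the way a narrowing conversion would
function toShort(value) {
  return new Int16Array([value])[0];
}

// Range check to run before trusting a 16-bit field with an amount in pence
function fitsInShort(value) {
  return Number.isInteger(value) && value >= SHORT_MIN && value <= SHORT_MAX;
}

console.log(toShort(32767));     // 32767, i.e. £327.67 still fits
console.log(toShort(32768));     // -32768, one penny more wraps around
console.log(fitsInShort(50000)); // false: £500.00 in pence is out of range
```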

It took a few months and several large unpaid transactions before the error was spotted, either by the accounting department or a vigilant and honest customer. I believe they recovered all of the payments in the end.

Lessons Learned

  1. When assigning a type to a database column or variable, be generous and flexible, and try to plan ahead.
  2. Make sure that a program succeeding responds differently from a program crashing.
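The second lesson might look something like the sketch below. It is not the original integration: the `OK` token and the `interpretPaymentOutput` function are assumptions made for illustration. The idea is that only an explicit success message counts as success, so the silence of a crashed program is treated as a failure.

```javascript
// Hypothetical convention: the payment program prints exactly "OK" on success.
const SUCCESS_TOKEN = 'OK';

// Decide what to do with whatever the payment program printed.
function interpretPaymentOutput(output) {
  const text = (output || '').trim();
  if (text === SUCCESS_TOKEN) {
    return { paid: true, reason: '' };
  }
  if (text === '') {
    // Silence is what a crash looks like, so never treat it as success.
    return { paid: false, reason: 'No response from payment program' };
  }
  // Anything else is an error message such as "Card not authorized".
  return { paid: false, reason: text };
}

console.log(interpretPaymentOutput('OK\n'));
console.log(interpretPaymentOutput(''));
console.log(interpretPaymentOutput('Card not authorized'));
```

Checking the program's exit code as well would be a natural addition.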

1p Errors

Among my favorite mistakes are those that cause a discrepancy of just 1 penny (or cent, öre or other denomination). I like them because they are usually very subtle and hard to trace and often boil down to a rounding error. I have to become a mathematical detective, a job that I would readily do if enough work was available.

For a website a few years ago, I needed to create a quick JavaScript function to output a monetary amount. I used this:

<script type="text/javascript">
function GetMoney (amount) {return Math.round (amount * 100) / 100;}
</script>

However, it was quickly discovered that amounts like 1.20 were displayed as 1.2, which looks unprofessional. So, I changed it to this:

<script type="text/javascript">
function GetMoney (amount) {
  var pounds = Math.floor (amount);
  var pence = Math.round (amount * 100) % 100;
  return pounds + '.' + (pence < 10 ? '0' : '') + pence;
}
</script>

The main difference is the extra 0 in the last line. But now that the pence is computed separately, the modulus % operator is needed to get the remainder when the amount is divided by 100. Try to spot the unlikely circumstances under which this code would cause an error.

It happened on a website that sold beads. I have since learned that beads can be sold in a huge range of amounts and configurations, including customized mixes containing fractional quantities. Once, a customer bought 1.01 of an item costing £4.95, and ended up paying just £4.00. This is because the amount was passed as 4.9995. Multiplied by 100 and rounded, that gave 500 pence, and % 100 left 0 pence. Meanwhile, the pounds were floored to 4.


A subtle rounding error on Beads Unlimited’s website, where 101 beads sold at £4.95 per 100 were billed as £4 instead of £5.
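To see exactly where the pound went missing, the sketch below replays the second version of GetMoney on the amount 4.9995 and exposes its intermediate values. `traceGetMoney` is a name made up for this illustration.

```javascript
// Same arithmetic as the second GetMoney, but returning every intermediate value.
function traceGetMoney(amount) {
  const pounds = Math.floor(amount);             // whole pounds, always rounded down
  const roundedPence = Math.round(amount * 100); // total pence, rounded to nearest
  const pence = roundedPence % 100;              // pence left after removing whole pounds
  const text = pounds + '.' + (pence < 10 ? '0' : '') + pence;
  return { pounds: pounds, roundedPence: roundedPence, pence: pence, text: text };
}

console.log(traceGetMoney(4.9995));
// { pounds: 4, roundedPence: 500, pence: 0, text: '4.00' }
```

The pounds and the pence are rounded in different directions, so they disagree whenever the pence round up to a full pound.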

This is still just a rounding error, a superset of 1p errors. I made a quick change to fix it:

<script type="text/javascript">
function GetMoney (amount) {
  var pounds = Math.floor (amount);
  var pence = Math.floor (amount * 100) % 100;
  return pounds + '.' + (pence < 10 ? '0' : '') + pence;
}
</script>

This wasn’t a great fix, though, because it rounded £4.9995 down to £4.99, which put it out of sync with any corresponding server-side calculations. But even more dramatically, when someone ordered 0.7 of something costing £1.00, it ended up displaying 69p instead of 70p! This is because floating-point numbers like 0.7 are represented in binary as a number more like 0.6999999999999999 (as described in a recent Smashing Magazine article), which would then be floored to 69 instead of rounded up to 70.

This is a true 1p error. To fix this, I added another rounding at the beginning:

<script type="text/javascript">
function GetMoney (amount) {
  var pence = Math.round (100 * amount);
  var pounds = Math.floor (pence / 100);
  pence %= 100;
  return pounds + '.' + (pence < 10 ? '0' : '') + pence;
}
</script>

Now, I had four fairly complicated lines of code to do one very simple thing. Today, while writing this article, I discovered a built-in JavaScript function to handle all of this for me:

<script type="text/javascript">
function GetMoney (amount) {return amount.toFixed (2);}
alert (GetMoney (4.9995) + ' ' + GetMoney (0.1 * 0.7));
</script>
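To compare the versions side by side, a small harness along these lines can help. The function names are invented for this sketch, and the two test amounts are the ones from the alert call in the last example.

```javascript
// Third version from the article: floors the pence.
function getMoneyFloor(amount) {
  const pounds = Math.floor(amount);
  const pence = Math.floor(amount * 100) % 100;
  return pounds + '.' + (pence < 10 ? '0' : '') + pence;
}

// Fourth version: round to whole pence first, then split into pounds and pence.
function getMoneyRoundFirst(amount) {
  let pence = Math.round(100 * amount);
  const pounds = Math.floor(pence / 100);
  pence %= 100;
  return pounds + '.' + (pence < 10 ? '0' : '') + pence;
}

// Built-in version.
function getMoneyFixed(amount) {
  return amount.toFixed(2);
}

for (const amount of [4.9995, 0.1 * 0.7]) {
  console.log(getMoneyFloor(amount), getMoneyRoundFirst(amount), getMoneyFixed(amount));
}
// 4.99 5.00 5.00
// 0.06 0.07 0.07
```

Note that `toFixed` rounds the binary value that is actually stored, so an input such as 1.005 comes out as 1.00. It tidies up display, but it is not a substitute for keeping money in whole pence.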

Discounting With PayPal

PayPal is a 1p error waiting to happen. Many websites offer voucher codes that give a percentage off each order, computed at the end of the order. With a 10% voucher, if you ordered two items costing 95p each, the subtotal would be £1.90, and you would receive a 19p discount, for a total of £1.71.

However, PayPal does not support this type of discounting. If you want PayPal to display the items in your shopping basket, you have to pass each one separately with a price and quantity:

<input name="item_name_1" type="hidden" value="My Difficult Product" />
<input name="amount_1" type="hidden" value="0.99" />
<input name="quantity_1" type="hidden" value="1" />

Thus, you have to discount each item separately. 10% off of 95p leaves 85.5p. PayPal doesn’t accept fractional amounts, so you have to round up to 86p, for a grand total of £1.72 in PayPal, or round down to 85p, for a total of £1.70.

To solve this, I had to also make the website discount each item individually. Instead of just doing 10% × £1.90, it accumulates the discount item by item, using a whole amount of pence each time. Assuming $items is a PHP array of order item objects:

$discount = 0; $discountpercent = 10;
foreach ($items as $item) {
 $mydiscount = floor ($item->price * $discountpercent) / 100;
 $item->priceforpaypal = $item->price - $mydiscount;
 $discount += $mydiscount * $item->quantity;
}
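The difference between the two approaches can be checked with the 95p example. The following is a sketch in JavaScript using whole pence rather than the PHP objects above; the `pricePence` field and both function names are assumptions made for illustration.

```javascript
// Prices are whole pence so that the arithmetic stays exact.
const items = [{ name: 'My Difficult Product', pricePence: 95, quantity: 2 }];
const discountPercent = 10;

// Discount computed once, on the order subtotal.
function orderLevelTotal(orderItems, percent) {
  const subtotal = orderItems.reduce(
    (sum, item) => sum + item.pricePence * item.quantity, 0);
  return subtotal - Math.round(subtotal * percent / 100);
}

// Discount computed per item and rounded down to whole pence,
// so the website and PayPal can be given the same unit price.
function perItemTotal(orderItems, percent) {
  return orderItems.reduce((sum, item) => {
    const discount = Math.floor(item.pricePence * percent / 100);
    return sum + (item.pricePence - discount) * item.quantity;
  }, 0);
}

console.log(orderLevelTotal(items, discountPercent)); // 171
console.log(perItemTotal(items, discountPercent));    // 172
```

Neither total is wrong in itself. What matters is that the website and PayPal use the same one.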

Lessons Learned

  1. Don’t reinvent the wheel, even very small wheels that look easy from the outside.
  2. If you get a 1p discrepancy, check where and how numbers are rounded.
  3. Avoid representing prices using floats when possible. Instead, store the pence or cents as integers; and in databases, use a fixed-point type like DECIMAL.
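As a rough sketch of the third lesson, prices can be converted to whole pence as soon as they enter the program and only turned back into text for display. The helpers `toPence` and `formatPence` are hypothetical and only handle plain amounts such as "4.95".

```javascript
// Parse a price such as "4.95" into a whole number of pence, without using floats.
function toPence(text) {
  const match = /^(\d+)(?:\.(\d{1,2}))?$/.exec(text);
  if (!match) {
    throw new Error('Not a valid amount: ' + text);
  }
  let fraction = match[2] || '';
  while (fraction.length < 2) {
    fraction += '0'; // "1.2" means 1 pound 20 pence
  }
  return parseInt(match[1], 10) * 100 + parseInt(fraction, 10);
}

// Format a whole number of pence for display, always with two decimals.
function formatPence(pence) {
  const sign = pence < 0 ? '-' : '';
  const abs = Math.abs(pence);
  const rest = abs % 100;
  return sign + Math.floor(abs / 100) + '.' + (rest < 10 ? '0' : '') + rest;
}

console.log(toPence('4.95'));                // 495
console.log(formatPence(toPence('1.2')));    // 1.20
console.log(formatPence(2 * toPence('0.95'))); // 1.90
```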

Daylight Savings

I would not call the last two mistakes in this list “errors.” They require a very specific set of fairly rare circumstances, so they are more “oversights” on the programmer’s part. Oversights are like the acts of terrorism that are excluded by home insurance policies. They go beyond what a programmer could reasonably be expected to think of in advance.

Can you guess what is wrong with the following seemingly innocuous line of code, which selects orders that were completed more than one week ago?

mysql_query ("SELECT * FROM orders WHERE completeddate < '" .
  date ('Y-m-d H:i:s', (time() - 7 * 86400 + 600)) . "'")

I used a similar line in a system for a weekly repeating order. It looked up orders that were completed last week, duplicated them, and processed them for the current week. 86,400 is the number of seconds in a day, so time() - 7 * 86400 was exactly one week ago, and +600 gives it a leeway of 10 minutes.

This was a low-budget method of implementing repeating orders. Given more time, I would have created a separate table and/or shopping basket to differentiate between repeating and non-repeating items. As it happened, this code worked well for several months and then mysteriously failed in late March.

It took ages to recover from the oversight and to process those orders manually. And even longer to find the reason, especially because I had to fool the whole website into thinking that it was a different date.

I’ve pretty much given the trick away in the title of the section: I forgot to account for daylight savings, when one week is less than 7*86400 seconds.
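The size of the gap can be worked out with plain UTC timestamps. This sketch assumes UK time, where the clocks went forward one hour on Sunday 27 March 2011. The date matches the one used in the PHP comparison in this section, and the variable names are made up for the example.

```javascript
const SECONDS_PER_DAY = 86400;

// Assumption: UK time. After the clocks go forward, 00:01 on 28 March (BST)
// is 23:01 UTC on 27 March, while 00:01 on 21 March (GMT) is 00:01 UTC.
const mondayAfter = Date.UTC(2011, 2, 27, 23, 1); // months are zero-based
const mondayBefore = Date.UTC(2011, 2, 21, 0, 1);

const actualSeconds = (mondayAfter - mondayBefore) / 1000;
const naiveSeconds = 7 * SECONDS_PER_DAY;

console.log(actualSeconds);                // 601200
console.log(naiveSeconds - actualSeconds); // 3600

// Subtracting a fixed 7 * 86400 seconds lands one hour too early,
// which is 23:01 on the previous day in local (GMT) time.
const naiveWeekAgo = new Date(mondayAfter - naiveSeconds * 1000);
console.log(naiveWeekAgo.toISOString());   // 2011-03-20T23:01:00.000Z
```

That one missing hour is far more than the 10-minute leeway in the query.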

Compare the following three ways of getting the date exactly one week ago. The last is the most elegant. I only recently discovered it:

$time = strtotime ('28 March 2011 00:01');
echo date ('Y-m-d H:i:s', ($time - 7 * 86400)) . '<br/>';
echo date ('Y-m-d H:i:s', mktime (date ('H', $time), date ('i', $time), 0,
  date ('n', $time), date ('j', $time) - 7, date ('Y', $time))) . '<br/>';
echo date ('Y-m-d H:i:s', (strtotime ('-1 week', $time))) . '<br/>';

Lessons Learned

Drawing general lessons from a mistake like this is difficult, but there are some specific lessons here:

  1. On websites that repeat things, remember to consider time zones and daylight savings.
  2. Consider storing all times and dates in UTC (Coordinated Universal Time).
  3. Don’t reinvent the time wheel either: strtotime is a powerful function.

The next time I do a website for repeating orders, I won’t make that mistake.

Spam Error

My favorite mistake of all time is an even subtler oversight. Can you spot what is unusual about these made-up email addresses?

  • beckyrsmythe@somewhere.com
  • glynnfrenk@someplace.co.uk

A few years ago, spammers began targeting contact forms on websites, injecting headers and forcing the forms to send millions of messages to harvested addresses and later just to the form’s usual recipient.

This necessitated anti-spam filtering directly on the Web page that processed the form. When I was first asked to do this, I combined a few anti-spam scripts that I found on the Internet. Spammers now often put blocks of random letters in their messages to try to fool spam filters. So, one anti-spam technique is to check for these random letters by looking for certain consonants in a row.

I read somewhere that words with more than six consonants in a row are extremely rare in Latin-alphabet languages. The longest run of consonants in an English word is six, as in “latchstring.” Other languages like Polish have many more consonant digraphs than English (dz, sz, cz), so I used seven to be on the safe side. The PHP code uses a regular expression and looks something like this:

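foreach ($_POST as $key=>$val) {
        // Seven or more consonants in a row is treated as spam
        if (is_string ($val) && preg_match ('/[bcdfghjklmnpqrstvwxyz]{7,}/i', $val))
                die ("<h1>Spam Detected</h1><p>Too many consonants in " . htmlspecialchars ($val) . "</p>");
}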

I had to revisit the script when it blocked someone with an email address like the ones above:


A customer whose email address had seven or more consonants in a row would have received this upon submitting a form.

Based on a small sample of 10,000, I found that approximately 0.2% of all email addresses would be filtered as spam, according to the rule above. One valid email address had nine consonants in a row. Increasing the number of allowed consonants from seven to ten decreases the script’s usefulness significantly, so instead I considered the letter “y” a vowel.

This worked well, until a customer from Cwmtwrch near Swansea attempted to place an order. According to my sample, only 1 in 5000 customers has a name, email or address like this. Small but important, especially if you are one of them. So, I allowed “w” as a vowel, too. You can check for this in your own customer database with a MySQL query like the following:

SELECT CONCAT_WS(' ',firstname,lastname,email,city,address1,address2) AS thefields
FROM visitors HAVING LENGTH(thefields)>20 AND thefields RLIKE '[bcdfghjklmnpqrstvwxz]{7,}'
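The three stages of the rule can be tried out on the examples from this section. The sketch below is JavaScript rather than the PHP used on the site, and `makeConsonantFilter` is a made-up helper that builds the same kind of regular expression.

```javascript
// Build a test for `limit` or more letters from `consonants` in a row.
function makeConsonantFilter(consonants, limit) {
  const pattern = new RegExp('[' + consonants + ']{' + limit + ',}', 'i');
  return (text) => pattern.test(text);
}

const original = makeConsonantFilter('bcdfghjklmnpqrstvwxyz', 7);  // y and w are consonants
const yAsVowel = makeConsonantFilter('bcdfghjklmnpqrstvwxz', 7);   // y treated as a vowel
const ywAsVowels = makeConsonantFilter('bcdfghjklmnpqrstvxz', 7);  // y and w treated as vowels

const samples = ['beckyrsmythe@somewhere.com', 'glynnfrenk@someplace.co.uk', 'Cwmtwrch'];
for (const sample of samples) {
  console.log(sample, original(sample), yAsVowel(sample), ywAsVowels(sample));
}
// beckyrsmythe@somewhere.com true false false
// glynnfrenk@someplace.co.uk true false false
// Cwmtwrch true true false
```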

Lessons Learned

I learned that my anti-spam script was blocking potential customers only once my client forwarded me their complaints. When I received the first one (an email address containing a couple of “y”s for vowels), I was amazed. It seemed so unlikely. A couple of weeks later, when shoppers in a small Welsh village were still mysteriously unable to place an order, I almost didn’t believe it. It seems that if a piece of code has a hole, someone somewhere will fall into it. So, I’ve learned to do the following:

  1. Take all error reports and complaints seriously. They may uncover something amazing like this.
  2. Jot down the really unlikely mistakes. You will impress other programmers… or me, at least.

More specifically, logging everything that is processed by a spam filter is useful, because you can then try to spot any false positives or false negatives and use them to improve the filter.
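A minimal sketch of that kind of logging might look like this. The in-memory `spamLog` array and the `checkSubmission` function are assumptions made for the example; a real site would more likely write to a file or a database table.

```javascript
// Hypothetical in-memory log; a real site would write to a file or table.
const spamLog = [];

// The consonant rule from this section, with y and w treated as vowels.
const CONSONANT_RUN = /[bcdfghjklmnpqrstvxz]{7,}/i;

// Check every submitted field and record the decision either way,
// so that false positives and false negatives can be reviewed later.
function checkSubmission(fields) {
  let blocked = false;
  for (const [name, value] of Object.entries(fields)) {
    const text = String(value);
    const hit = CONSONANT_RUN.test(text);
    spamLog.push({ time: new Date().toISOString(), field: name, value: text, blocked: hit });
    if (hit) {
      blocked = true;
    }
  }
  return blocked;
}

console.log(checkSubmission({ email: 'glynnfrenk@someplace.co.uk', city: 'Cwmtwrch' })); // false
console.log(checkSubmission({ message: 'xkcdqrstvb' }));                                 // true
console.log(spamLog.length);                                                             // 3
```

Recording the accepted submissions as well as the rejected ones is what makes it possible to spot the spam that got through.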

Conclusion

Programming mistakes come in many shapes and sizes. This article has ranged from the very obvious cock-ups to the extremely subtle oversights. And it looks like they all support Murphy’s Law: if something can go wrong, it will.

However, for every mistake found, reported and fixed, probably a few more aren’t. Either they aren’t found (because they are so incredibly subtle that the set of circumstances that would cause them has never happened) or they aren’t reported (because most users don’t bother reporting errors, which is why any error reports that do come in should be taken seriously) or they aren’t fixed (because doing so would be too time-consuming or expensive).

Mistakes are also more likely to be found on popular websites, mainly because so many more people are putting those websites to work, but partly because fixing one mistake could cause another somewhere else.

The best lessons, therefore, are to plan ahead and to debug thoughtfully.



© Paul Tero for Smashing Magazine, 2011.

