For those who don't want to read a lengthy blog post:
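A minimal sketch of the technique this post builds up to, assuming a browser environment where `document` is available; the name `escapeHtml` is illustrative:

```javascript
// Assumes a browser DOM. `escapeHtml` is an illustrative name.
function escapeHtml(str) {
    var div = document.createElement('div');
    // A text node stores the string as literal text, never as markup.
    div.appendChild(document.createTextNode(String(str)));
    // Serializing the element returns that text with &, < and > escaped.
    return div.innerHTML;
}
```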
Note: This technique is only appropriate for body HTML, not for HTML attributes. Hack #3 below is the best approach in those cases.
Users can be malicious
As any good web developer knows, it's important to be constantly vigilant with the handling of user data. We avoid buffer overflows and format string exploits (remember those?) by using safer languages or being careful in our C. To avoid SQL injection, we never build database queries by concatenating user-supplied data. These measures protect the integrity of the data on our servers, but what about our (non-malicious) users?
Escaping user input
Many developers have opinions on the proper place to escape user input. I'll review three options:
- Before storage
- On the backend, while building the HTML
- On the frontend
Before storage
One approach is to escape data as soon as you receive it. This is widely used because it's one of the most foolproof methods: just filter the input once and forget about it. The core problem with this approach is that in its usual implementation, it's essentially a one-way operation; the data that the user originally submitted is no longer retrievable.
You've probably noticed clumsy implementations on various discussion sites that allow editing: your edit window shows your text all mangled. Every < has been converted to &lt;. Furthermore, if you don't change it back, the next edit will turn it into &amp;lt;. This highlights another weakness of this method: it is susceptible to double-escaping, since there is often no bookkeeping to indicate that the data has been escaped already.
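The double-escaping problem is easy to reproduce. A small sketch, where `escapeOnce` is a hypothetical stand-in for whatever filter runs before storage:

```javascript
// Hypothetical stand-in for an escape-before-storage filter.
function escapeOnce(str) {
    return str.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

const submitted = 'a < b';
const stored = escapeOnce(submitted);   // 'a &lt; b'
// The user opens the edit form, sees the stored text, and saves again.
const resaved = escapeOnce(stored);     // 'a &amp;lt; b'
```

Nothing in the stored string records that it has already been escaped, so the filter cannot tell the two cases apart.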
I prefer to maintain the data in its original form and handle it with care elsewhere.
On the backend
For years, the languages commonly used for web development have included libraries that properly handle HTML escaping. Good developers clearly indicate in the code and documentation where user-created data exists, and they use appropriate libraries to escape all such data as it is converted into an HTML page. Barring problems in the library implementation or lapses in vigilance, this is a solid approach, and it allows a lot more flexibility than the previously-discussed method. For example, it is now possible to echo the input as-is back to the creator for editing purposes.
On the frontend
The unsafe way
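A minimal sketch of the pattern in question, assuming a `container` element supplied by the caller; `showCommentUnsafe` is a hypothetical name. The user's text is concatenated straight into markup, so any tags in it are parsed as HTML:

```javascript
// UNSAFE: shown only to illustrate the problem.
// `container` is assumed to be a DOM element; `userText` is untrusted input.
function showCommentUnsafe(container, userText) {
    container.innerHTML = '<p>' + userText + '</p>';
}
```

With an input such as `<img src=x onerror="alert(1)">`, the browser creates the element and runs its handler.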
This approach is vulnerable to every problem I outlined at the outset. Don't do this in your code! It's not much more difficult to do it in a safe way.
The safe way
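A minimal sketch, assuming a browser environment and a `container` element supplied by the caller; `showCommentSafe` is a hypothetical name. The user's text goes into a text node, so the browser treats it as data rather than markup:

```javascript
// Assumes a browser DOM; `container` is a DOM element, `userText` is untrusted.
function showCommentSafe(container, userText) {
    var p = document.createElement('p');
    // createTextNode never parses its argument as HTML.
    p.appendChild(document.createTextNode(userText));
    container.appendChild(p);
}
```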
This uses the browser's own knowledge of which characters are sensitive to properly escape the string. It's fast, and according to quirksmode, it's supported by every browser out there. But when we're concatenating strings or building some widget via a class hierarchy, this isn't always possible. Sometimes we need to escape the string way before we add it to a DOM node. Enter various hacks to make that happen.
Hack #1: inline
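A sketch of what the inline version tends to look like; the variable names are illustrative:

```javascript
// Illustrative only: escaping done ad hoc at the point of use.
var userText = 'if (a < b && b > c)';
var html = '<p>' +
    userText.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;') +
    '</p>';
// html is now '<p>if (a &lt; b &amp;&amp; b &gt; c)</p>'
```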
I see this all the time, and I've been guilty of similar methods on both the backend and the frontend. You know it's inefficient, but it's trivial and it works and it's basically a one-liner. Then you notice a random bug where you converted < to &gt; by mistake, or maybe you forgot the pesky semicolon, so you decide to create a canonical escape function. Then it turns out that you sometimes need to escape part of an HTML tag attribute. You eventually settle on something like the following.
Hack #2: the catchall
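A sketch of such a function, assuming the five characters that matter in body text and in quoted attribute values; `escapeHtmlCatchall` is an illustrative name:

```javascript
// One replace() call per character: five passes, five intermediate strings.
function escapeHtmlCatchall(str) {
    return String(str)
        .replace(/&/g, '&amp;')   // must run first, or later entities get re-escaped
        .replace(/</g, '&lt;')
        .replace(/>/g, '&gt;')
        .replace(/"/g, '&quot;')
        .replace(/'/g, '&#39;');
}
```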
This works pretty well. It handles the most important cases, but you know in the back of your mind that it's wasteful. You've traversed the string five times (creating five new strings!) just to return the escaped version, and you have &quot; entities all over the page where they aren't necessary. Your programming aesthetic takes over and one evening you convert it.
Hack #3: more efficient catchall
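A sketch of a single-pass version. The `forAttribute` flag is an assumption introduced here so that quotes are only escaped where they matter; the names are illustrative:

```javascript
var HTML_ESCAPES = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#39;' };
var BODY_CHARS = /[&<>]/g;          // enough for text between tags
var ATTRIBUTE_CHARS = /[&<>"']/g;   // also covers quoted attribute values

// One pass over the string: each match is looked up in the table.
function escapeHtmlSinglePass(str, forAttribute) {
    var pattern = forAttribute ? ATTRIBUTE_CHARS : BODY_CHARS;
    return String(str).replace(pattern, function (ch) {
        return HTML_ESCAPES[ch];
    });
}
```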
Now you're pretty happy. You only traverse the string once. You handle escaping both within and outside of attributes. But eventually you want to un-escape the strings you've escaped. In the process of writing that function, you learn that there are 252 named entities in HTML 4, in addition to numeric entities like &#dddd; (decimal) and &#xhhhh; (hex). Wow, you think, there must be a better way. And then you think back to "the safe way":
Can't we leverage that? We can!
The browser already knows how to escape strings; the