Input versus Data, Validation versus Sanitization


Oct 8, 2007

Reading articles, browsing marketing materials, and listening to presentations about application security, you hear variations on a theme:

"Input validation is absolutely critical to application security, and most application risks involve tainted input at some level" - OWASP

While I don't think authors overstate the importance of problems stemming from invalid data, I have noticed these discussions gloss over two key points.

  1. Input validation is only part of the problem. Output validation is important as well.
  2. Validation (in the general sense) has two distinct concerns: validation and sanitization.

Input validation is only part of the problem. Output validation is important as well.

All data input from an untrusted source should be validated. If you enter a blog comment, I want to make sure it isn’t empty, it is fewer than 500 words, it isn’t spam, and it won’t get my readers RickRolled. However, when that data is output from the web application, it should be validated as well. Here’s why:

Think about cross-site scripting - what we really want to prevent is tainted data exiting the system into the web page rendered on the client. That happens when the data is output, not when it is input. SQL injection is likewise tainted data exiting the system (through a SQL query), and parameterized queries are a form of output validation. Since different validation rules may apply to the same data (say, a blog comment that is reflected in a subsequent page for approval and then stored in the database), it makes more sense to validate the data on exit rather than on entrance.
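To make the point concrete, here is a minimal Python sketch of one comment leaving the system through two different exits, each with its own rule. The function names, the `comments` table, and the in-memory SQLite database are assumptions made for illustration, not part of any particular application.

```python
import html
import sqlite3


def render_comment(comment: str) -> str:
    """Exit 1: the HTML page. Encode for an HTML body context at output time."""
    return "<p>" + html.escape(comment) + "</p>"


def store_comment(conn: sqlite3.Connection, comment: str) -> None:
    """Exit 2: the SQL query. Bind the value so it is never parsed as SQL."""
    conn.execute("INSERT INTO comments (body) VALUES (?)", (comment,))


# Hypothetical schema, in memory, just for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (body TEXT)")

comment = "<script>alert('hi')</script>'); DROP TABLE comments; --"

page = render_comment(comment)   # markup characters are encoded for HTML
store_comment(conn, comment)     # the raw text is stored unchanged
stored = conn.execute("SELECT body FROM comments").fetchone()[0]
```

The same string is handled differently at each exit: the page gets an encoded copy, while the database keeps the original text intact. Neither rule could have been applied correctly at input time without knowing where the data would end up.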

It’s like international air travel - you pass through customs at your arrival airport (output), because at your departure airport (input), they don’t know the rules for what’s allowed and what isn’t.

Thus, I propose using “Data Validation” instead of “Input Validation” as a more accurate (although less precise) term that covers both input and output validation.

Validation (in the general sense) has two distinct concerns: validation and sanitization.

Validation is a Boolean operation which gives a yes or no answer to the question “Is this data acceptable in the current context?”

Sanitization (which includes encoding, escaping, and stripping) refers to transforming data in some manner so as to make it acceptable in the current context.

These two approaches can be used independently or in concert and the correct way to perform these operations from a security perspective is highly dependent on the context in which they are used.
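As a rough illustration of the difference, the Python sketch below keeps the two concerns in separate functions and then uses them in concert. The function names and the choice of an HTML body as the output context are assumptions for the example; the 500-word limit comes from the blog comment scenario earlier in this post.

```python
import html
from typing import Optional

MAX_WORDS = 500  # limit from the blog comment example


def is_valid_comment(text: str) -> bool:
    """Validation: a yes/no answer. The data itself is never changed."""
    word_count = len(text.split())
    return 0 < word_count < MAX_WORDS


def sanitize_for_html(text: str) -> str:
    """Sanitization: transform the data so it is acceptable in an HTML body."""
    return html.escape(text.strip())


def process_comment(text: str) -> Optional[str]:
    """Both in concert: reject unacceptable data, then encode what remains."""
    if not is_valid_comment(text):
        return None
    return sanitize_for_html(text)
```

Here `is_valid_comment` can only accept or reject, while `sanitize_for_html` always returns something usable in its one target context. A different context (a SQL query, a URL, a shell command) would need a different sanitizer.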

So “validation” in the general sense is (usually) both validation in the strict sense and sanitization.

Another issue that often comes up in discussions of validation is canonicalization, which is a separate topic that warrants its own future blog post.

Just some food for thought when designing validation mechanisms - it’s not all yes or no decisions, and it’s not all at the front door.