URL encoding/decoding = worst thing EVER.

What's worse is that every stack thinks it's somehow responsible for decoding. For example, let's say you're trying to access this URI:

/F%20

If you put that into your address bar, Firefox will decode it to "/F " (with a trailing space). So when you construct the URL, you need to do an extra urlencode:

<a href="/F%2520">

Unfortunately, if the request gets passed through Apache's mod_rewrite or mod_proxy, that layer does a urldecode of its own. Send a single-encoded "/F%20" through it and by the time it reaches PHP, you'll actually get "F " instead of "F%20".
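
To make that concrete, here's a minimal PHP sketch of the double-encode trick; rawurlencode() and rawurldecode() are real PHP functions, but the hop count (exactly one "smart" urldecode between the browser and the script) is my assumption:

    <?php
    // A sketch of the double-encode trick. Assumption: exactly one
    // extra urldecode happens between the browser and this script
    // (the mod_rewrite/mod_proxy hop).
    $title = 'F ';                      // raw title, trailing space and all

    $once  = rawurlencode($title);      // "F%20"   -- what PHP should receive
    $twice = rawurlencode($once);       // "F%2520" -- survives Apache's extra urldecode

    echo '<a href="/' . $twice . '">F</a>';

    // mod_proxy/mod_rewrite peels "F%2520" back down to "F%20", so the
    // script can do the one decode it's actually responsible for:
    $received = 'F%20';                 // what would show up in REQUEST_URI
    var_dump(rawurldecode($received));  // string(2) "F "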

My advice to anybody who is thinking of creating a CMS:

  • Deal with internationalization early on. Most languages/frameworks give you this for free, but most PHP apps have pretty shitty internationalization support, since MySQL and PHP only added really solid support for it recently. UTF-8 is your friend (UTF-16 if you want to get real fancy). Know that MySQL stores in latin1 by default; MAKE SURE YOU SET THIS UP CORRECTLY FROM THE GET-GO (see the first sketch after this list)!
  • Deal with how you're going to store data from the get-go, taking into account special characters. You'd be surprised how much MediaWiki converts page titles (it's not even consistent urlencoding/decoding; it's this halfway encoding/decoding keyed on certain special characters, probably due to the very limitations I listed above). My suggestion: always store what the user sends you. No magic. No encoding. No special string magic. Just store it.
  • If you have multiple stacks, figure out how you're going to transport data between the stacks. JSON? Through GET? Through POST? Write test cases that check both non-latin1 characters AND urlencoded characters (or strings that look urlencoded but really aren't); see the second sketch after this list. Having to go back and deal with those kinds of issues on an existing system will be your worst nightmare.
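
For the MySQL point above, here's a rough sketch of what "set it up from the get-go" looks like; the database name, table, and credentials are made up, but the charset declarations are standard MySQL/mysqli:

    <?php
    // A sketch only. The point: declare utf8 at every layer (database,
    // table, and connection) on day one, because each one will otherwise
    // fall back to latin1 defaults.
    $db = new mysqli('localhost', 'user', 'pass');

    $db->query("CREATE DATABASE cms CHARACTER SET utf8 COLLATE utf8_general_ci");
    $db->select_db('cms');
    $db->query("CREATE TABLE pages (
                    id    INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
                    title VARCHAR(255) NOT NULL
                ) CHARACTER SET utf8");

    // The connection has its own charset, separate from the tables:
    $db->set_charset('utf8');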
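
And for the cross-stack testing point, a minimal sketch; json_encode()/json_decode() stand in for whatever hop your data actually takes between stacks, and the evil input strings are the real point:

    <?php
    // A sketch of the round-trip tests I mean: whatever crosses a stack
    // boundary must come back byte-for-byte identical.
    $inputs = array(
        'F%20',      // looks urlencoded, but is a literal title
        'F ',        // its decoded twin; must stay distinct from the above
        '日本語',    // non-latin1 characters (UTF-8)
    );

    foreach ($inputs as $in) {
        $out = json_decode(json_encode($in));   // one simulated hop
        if ($out !== $in) {
            echo 'FAIL: ' . var_export($in, true) . ' came back as '
               . var_export($out, true) . "\n";
        }
    }
    echo "done\n";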

The biggest annoyance for me is when a seemingly dumb stack does something smart, like mod_proxy or mod_rewrite. All I want it to do is receive the request and hand it over to this script. I don't want it doing magic parsing on the string! Argh!!!!

Because of this, we have to double-encode titles when we know they're going to be passed through mod_proxy.

And then we gotta replicate that logic in C#, PHP, and JavaScript.

*head explodes*

Posted by roy on June 14, 2007 at 12:06 PM in Web Development | 3 Comments

Comment posted on June 14th, 2007 at 02:19 PM
i understand your pain... but here's a reason why you wouldn't want to store everything the user gives you verbatim:

security

users will send you all sorts of crap and you need to make sure that it's safe to store those strings. therefore, you encode them into "safe strings"... and then we run into problems, but at least your system doesn't explode. just your head ;)
Comment posted on June 14th, 2007 at 02:28 PM
yes, one needs to validate strings (tabulas is particularly picky about where you can input HTML and JS, for example), but once those strings have been validated, storing encoded versions of those strings = BAD

if i have a page called "F%20", it should be stored as such in the database. in mediawiki's case, it actually converts it to "F%2520" in the database store itself.

sad pandas...
Comment posted on June 14th, 2007 at 05:08 PM
sad pandas?