“this is my sweet string” == “dGhpcyBpcyBteSBzd2VldCBzdHJpbmcZ”?
Those two are actually equivalent, as proven by Amazon SimpleDB. We started seeing these mysterious strings in our SimpleDB data, which is supposed to be a direct upload of SQL data for use in a UI. I automatically assumed it had something to do with special characters and proper encoding, as we have seen in our processes before. But this case was unusual: instead of just mangling the special character, it managed to blow out the entire string… WTF
The culprit was an “End of Medium” ASCII control character (0x19) sitting at the end of the string. ASCII control characters are all non-printable. Once I had this figured out, some more googling led me to the reason the whole string was unrecognizable, base64 encoding:
What’s happening/changed: http://www.dibonafide.com/?p=25
The official documentation: http://docs.amazonwebservices.com/AmazonSimpleDB/latest/DeveloperGuide/index.html?InvalidCharacters.html
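To see it for yourself, here is a quick Python sketch that decodes the mystery string from the title. Python is just my pick for illustration and the variable names are made up; nothing here is SimpleDB-specific.

```python
import base64

# The "mysterious" value as it showed up in the SimpleDB data.
encoded = "dGhpcyBpcyBteSBzd2VldCBzdHJpbmcZ"

# Decoding gives back the original text plus one extra trailing byte.
decoded = base64.b64decode(encoded)
print(decoded)  # b'this is my sweet string\x19'

# That last byte is 0x19, the ASCII "End of Medium" control character.
trailing = decoded[-1]
print(hex(trailing))  # 0x19
```

So the text itself survives intact; the single control character at the end is what tips the whole value over into base64.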
In my case the characters were more of an anomaly and I just wanted to be rid of them. I could have taken care of it either during the upload to SimpleDB or at the DB level. The easier route was throwing this handy function on my DB and calling it in the view I use to pull the data for upload:
http://iso30-sql.blogspot.com/2010/10/remove-non-printable-unicode-characters.html
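If you would rather clean things up during the upload instead of in the database, the same idea can be sketched in Python. This is not the SQL function linked above: it is a rough alternative that drops anything outside the character ranges XML 1.0 allows, and the name `strip_invalid_xml_chars` is hypothetical.

```python
import re

# Characters allowed by the XML 1.0 "Char" production:
# tab, LF, CR, and the ranges U+0020-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF.
# The negated class matches everything else.
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text: str) -> str:
    """Return text with characters that are not valid in XML 1.0 removed."""
    return _INVALID_XML_CHARS.sub("", text)


cleaned = strip_invalid_xml_chars("this is my sweet string\x19")
print(repr(cleaned))  # 'this is my sweet string'
```

Tabs and line breaks are left alone because they are valid XML; only characters like the End of Medium one get removed.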
Per the official documentation above, the response for the item will actually indicate what type of encoding it’s using. A few sites have mentioned that they just base64 encode on upload and then decode when it’s being used for display. I think that, with the encoding specified in the item itself, you can probably just design your UI (or whatever is consuming the data) to check for base64 encoded strings, then decode and remove any invalid characters there.
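A consumer-side version of that idea might look something like the sketch below. It assumes you have already pulled the value and its encoding marker out of the response as plain strings, that the marker reads `"base64"`, and that the decoded bytes are UTF-8; the function name `read_value` is made up, so check the documentation for the exact response shape.

```python
import base64
import unicodedata
from typing import Optional


def read_value(value: str, encoding: Optional[str] = None) -> str:
    """Decode a value if it is flagged as base64, then drop control characters."""
    if encoding == "base64":
        # Assumption: the original text was UTF-8 before it was encoded.
        value = base64.b64decode(value).decode("utf-8")
    # Remove control characters (Unicode category "Cc"), keeping ordinary whitespace.
    return "".join(
        ch for ch in value
        if ch in "\t\n\r" or unicodedata.category(ch) != "Cc"
    )


print(read_value("dGhpcyBpcyBteSBzd2VldCBzdHJpbmcZ", "base64"))  # this is my sweet string
print(read_value("plain value"))  # plain value
```

Values that come back without the flag pass straight through, so the UI does not need to care which items were affected.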
Now if I could just get WordPress to stop using invalid XML characters in its markup… that’s a topic for another day!