Some issues we face in development are hard. The internet has a myriad of component parts that we have the unenviable task of cobbling together into what the end user sees. With so many layers between a user firing up a browser or interface and the result they see (frontend, backend, database, server, network and so on), troubleshooting can be difficult. Sometimes the bugs introduced are obvious and take moments to identify and, hopefully, fix; other times tracking down an issue can take hours or days.
I’d like to share one such issue my development team recently had the displeasure of tracking down. We had just completed a project to launch an iPhone application. In theory this should be easy to test. There’s a limited number of iPhone devices and a relatively small number of variables. If it works on the iPhone 4, 3GS, 3G and so on, it’s likely to be good. This particular app, however, introduced another tier. Data for the application was to be updated periodically, so the sensible design decision was taken that data would come from a web service. Now the app, which was previously self-contained, had a number of extra variables. We could potentially have problems in the network, server, database or back-end tiers. More testing to be done, but nothing insurmountable. A few development months later we had a polished app pulling data from a JSON web service, all running nicely. The app was submitted into the wild and the fun began.
Although the app had been tested numerous times on numerous iPhones on the various networks with no problems, we started getting reports of the app not pulling in data. We installed the app again on multiple devices, but our testing couldn’t easily reproduce the error. Was it an issue with the app itself? Or was it an issue in one of the other tiers? Given that the app worked well most of the time, we decided to concentrate our efforts on determining the conditions under which it was failing. Let’s take a quick look at the variables we had in play:
- iPhone model – 4, 3GS, 3, 2, 1
- Connectivity – WIFI, 3G
- Networks – O2, Orange, T-Mobile, Three, Vodafone
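For anyone wanting to work through a matrix like this methodically, here is a minimal sketch of how the combinations could be enumerated. The values are copied from the list above; the variable and function names are purely illustrative and not part of our actual test tooling.

```python
from itertools import product

# Values taken verbatim from the list above; names are illustrative only.
MODELS = ["4", "3GS", "3", "2", "1"]
CONNECTIVITY = ["WIFI", "3G"]
NETWORKS = ["O2", "Orange", "T-Mobile", "Three", "Vodafone"]


def build_test_matrix():
    """Return every (model, connectivity, network) combination to test."""
    return list(product(MODELS, CONNECTIVITY, NETWORKS))


matrix = build_test_matrix()
print(len(matrix))  # 5 models * 2 connection types * 5 networks = 50
for model, connectivity, network in matrix[:3]:
    print(model, connectivity, network)
```

Ticking off each row as it passes or fails makes it much easier to spot which variable the failures have in common.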
Lots of combinations here, but a finite set. Working through the combinations methodically, we discovered that the app was failing to load data only when it was being run over 3G, intermittently, on some networks. The key here was the intermittent nature over 3G. Something was going wrong with the data, sometimes, over 3G, on some networks. That’s a lot of “some”s (never a good thing). Here’s what we did know. The intermittent issue returned the following error over 3G:
kCFErrorDomainCFNetwork error 303
Googling this did not return much helpful information, and the error certainly didn’t relate to the HTTP 303 status code. We were further on, but not really close to a solution. The error pointed to something to do with the network, and therefore the transport of the data. Was there something wrong with the data?
As I said before, the data was being sent as JSON, so what could possibly be wrong with the data over 3G? Reading up a little on 3G suggested that the packet size for data transport is smaller over 3G. Could this be what was causing the issue? We embarked on a number of avenues relating to the data. Was it:
- Data encoding – could some of the UTF-8 encoded characters be going wrong over 3G?
- Data length – was the length of the data (46K) too long?
- Data Transport – was the fact the output was being transmitted gzipped a problem on 3G?
- URL structure – was 3G getting confused by the URLs themselves?
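The first three suspects above can be checked mechanically once you have captured a response. The following is only a sketch of that kind of check, not the code we actually used: `describe_response` is a hypothetical helper, and it assumes you already have the response headers as a dictionary and the raw body as bytes.

```python
import gzip
import json


def describe_response(headers, body):
    """Summarise the response properties that were under suspicion."""
    # HTTP header names are case-insensitive, so normalise them first.
    normalised = {name.lower(): value for name, value in headers.items()}
    gzipped = normalised.get("content-encoding", "").lower() == "gzip"
    raw = gzip.decompress(body) if gzipped else body

    # Check the encoding hypothesis: does the payload decode as UTF-8?
    try:
        text = raw.decode("utf-8")
        valid_utf8 = True
    except UnicodeDecodeError:
        text = None
        valid_utf8 = False

    # Check that the decoded payload is well-formed JSON.
    valid_json = False
    if text is not None:
        try:
            json.loads(text)
            valid_json = True
        except ValueError:
            valid_json = False

    return {
        "content_type": normalised.get("content-type"),
        "gzipped": gzipped,
        "transferred_length": len(body),
        "decoded_length": len(raw),
        "valid_utf8": valid_utf8,
        "valid_json": valid_json,
    }


# Example with a made-up payload standing in for the real feed.
payload = json.dumps({"title": "Example", "items": [1, 2, 3]}).encode("utf-8")
summary = describe_response(
    {"Content-Type": "application/json", "Content-Encoding": "gzip"},
    gzip.compress(payload),
)
print(summary)
```

Comparing this summary for a request that worked against one that failed would show quickly whether any of these properties actually differed.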
All of the above were tried, each with some apparent success. As I said, the issue was intermittent: sometimes it worked, other times it didn’t, and just as we thought we’d found the solution, continued testing revealed we hadn’t. Two fruitless days later we were getting frustrated, and of course our client wasn’t too happy either. Finally we had the idea to look at how the data was being returned. JSON should be returned with an “application/json” Content-Type header, and that was what was happening. Could it be that some networks, or even some network transmitters, were having issues with this Content-Type? Could changing it from “application/json” to the more common “text/html” make a difference? Indeed it could. After much testing we finally had a fix!
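To make the change concrete, here is a rough sketch of the idea using only the Python standard library. It is not our production web service: the handler, the payload and the `CONTENT_TYPE` setting are all made up for illustration. The point it shows is that the body is still JSON and the client still parses it as JSON; only the Content-Type label on the response changes.

```python
import json
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical setting: the label sent with the JSON body.
# Before the fix this would have been "application/json".
CONTENT_TYPE = "text/html"


class FeedHandler(BaseHTTPRequestHandler):
    """Serves a small made-up JSON payload with the configured Content-Type."""

    def do_GET(self):
        body = json.dumps({"items": [1, 2, 3]}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", CONTENT_TYPE)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def fetch_feed():
    """Start a local server, request the feed once, and shut the server down."""
    server = HTTPServer(("127.0.0.1", 0), FeedHandler)  # port 0 = any free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        connection = HTTPConnection("127.0.0.1", server.server_address[1], timeout=5)
        connection.request("GET", "/")
        response = connection.getresponse()
        content_type = response.getheader("Content-Type")
        # The client ignores the label and parses the body as JSON regardless.
        data = json.loads(response.read().decode("utf-8"))
        connection.close()
    finally:
        server.shutdown()
        server.server_close()
    return content_type, data


content_type, data = fetch_feed()
print(content_type, data)
```

Note that this only works if the client parses the body as JSON without checking the Content-Type first, so it is worth confirming that your app behaves that way before changing the header.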
I’m posting this so that hopefully others who come across this error don’t have to go through what we did, and also to highlight just how difficult a job developers have sometimes. A special shout out to @MozMorris who went through two days of pain to find this solution!