The past couple of days I’ve had one of those real head-scratching problems you get as a software developer from time to time – the classic intermittent problem that apparently has no explanation.
It came up quite by chance – we have a customer who for one reason or another massively underestimated the volume of transactions they were going to put through their system, as a result some parts of their system have been throwing intermittent time-out errors when they’ve been querying big datasets. As a result I had to do a quick fix to one of their modules to increase the time-out on the ADO.Net command objects to allow the queries to successfully return.
Once I’d done that, I did what we would usually do in our team and did some quick informal testing to make sure nothing else was affected, and unfortunately in those tests another part of the application started throwing intermittent SQLServerconnection errors – even more strangely these weren’t coming from the point in the code where the connection was being made, but from deep down in the ExecuteReader method. As far as the code was concerned, it should be using an existing connection – why was it trying to make a new connection now?
At this point it is probably worth discussing what the code was actually doing. In essence it is two nested database queries – the code runs through the results set of the first, and based on the contents of each row does various different things, including running other database queries. All of this was being done using multiple command objects, all created using our core database access library. Digging in to what the library was doing, it was providing a single connection, and generating the commands from that single connection. It all worked fine, except when the size of the dataset started to get big, at which point the errors started to occur.
How the underlying database actually copes with this varies. In SQLServer 2005 parlance, the code is using Multiple Active Result Sets or MARS, Oracle has no problem doing it, and it is supported against all versions of Oracle in ADO.Net. But it was only supported from SQLServer 2005 – attempting to do the same in SQLServer 2000 using the SQLServer specific provider would result in an error being thrown. However even with a SQLServer 2000 back end, if you used the ADO.Net 2.0 OleDB provider it would work, with the OleDB working some magic in between to get around the problem.
All the code in this particular application was written using the generic ADO.Net objects, the idea being that it will work with any of the ADO.Net providers. Although the database being used is SQLServer 2005 – which supports MARS – the factory object that was being created was using the OleDB provider, to connect to the SQLServer 2005 back end. After a bit of digging around I came across this useful blog posting from 2004 with a question and answer over MARS – the relevant bits are towards the end – the OleDB magic bit is that provider fakes the MARS functionality by connecting and disconnecting multiple connection objects in the background – only in very specific situations does it use the SQLServer 2005 functionality – this was what was making the connection that was causing the error! Not surprisingly, this has a big impact on performance, and it was pretty obvious that something else was going awry here as well. Interestingly the article finishes off by advising only using the SqlClient library to talk to SQLServer – the warning is pretty stark:
IMHO it means that you should ONLY use SqlClient to talk to Sql Server, the risk of running into this fake MARS behavior is too great. I have seen this â€œfeatureâ€? (fake MARS) cost hundreds of thousands of dollars an hour in lost sales as the server was inundated with unnecessary non-pooled connection open requests.
Understanding that, it looked like something was going wrong in the MARS support with in the OleDB provider. Quite apart from that, the algorithm it was using was going to perform badly with what we were trying to do – the ideal solution would just be to drop in the SqlClient instead, but unfortunately ditching the OleDB provider and going for the SqlClient instead isn’t really an option in this situation, as the application is slotting in with a whole suite of other applications that use OleDB, so the only option was coding around the restriction to ensure that the fake MARS functionality doesn’t operate. So out goes using our database library to manage connections, and it was back to a solution where I’m manually creating a connection object for each command and cleaning them up afterwards, and this seems to have got around the problem – in order to avoid this situation I just need the fine grain control over when things are being created.