I recently read the post Your Developers Aren't Bricklayers, They're Writers and I must say that I loved it.
That post references claims like "The best programmers are up to 28 times better than the worst programmers" and, even if this might be true in some cases, I am of the opinion that the best programmers simply can't be compared to the worst ones. That's because the worst developers will never get the job done and will probably destroy the work of the other developers. That is, if they are really the worst, the more they "work", the worse the result, and they will never achieve the level of code quality of the good developers, much less of the best ones.
But maybe the worst developers being compared are those that are just good enough to get the job done. By default they don't care about quality, but they can write somewhat better code when it is really demanded by standards, code reviews or something similar. I still won't say they can compare themselves to the "best" when dealing with complex problems, but maybe the comparison holds for simple development. I say this because the more complex the problem is, the bigger the chance that "bad" developers will never be able to solve it. So, that "28 times better" may hold some truth depending on the case.
Another View on the Worst
Maybe there's a different kind of "worst developer": those who "almost know" something and are able to convince others that they actually know what they are doing, but create more problems than solutions. That is, to be the "worst" they need enough knowledge to do real damage. Someone who is not a developer at all and tries to pass as one can't be seen as the "worst developer", because he will be identified as a non-developer and kept from doing bad things (or, even if he is not, he will not be able to hide the bad things he did).
So, being able to actually deliver something that passes local tests and "works" for some time is the "way to go" to become one of the worst. When things work fine, there's nothing to worry about. When things simply don't work at all, customers and managers stop paying and cancel contracts. But when things "seem to work", yet the application crashes randomly, data is lost or corrupted with a certain frequency, and every new change introduces lots of regressions because of terrible programming practices, that's when a lot of money gets wasted trying to fix these "small things". And, if the developer was really "good" at doing the worst job, after years of trying to fix things it gets to the point where everything must be redone from scratch, as there's nothing to reuse from that code.
Real Cases
Well, this post is about things developers aren't supposed to do... yet the worst developers will do them. The cases I will describe are real and went into production.
In my opinion, any developer will immediately see how wrong these things are. Yet, if you are not a developer and you trust that the developer knows what he is doing, you may never see the code and believe it is fine, as these things pass some basic tests (in some cases even unit tests written by other developers). But they usually fail in production and are the source of lots of stress and wasted money.
Catching Exceptions
Good exception handling is great. Some things may go wrong (like the network going down) and dealing with them appropriately is necessary. But what should you do if, when inserting records, some users see an exception saying there's a duplicate key?
For example, you look at the code and it does something like:
int id = GetMaxId() + 1;
Then, it does an insert using that new id. Is this code wrong?
Well... it works locally with a single user, but when running concurrently the exceptions start to happen. Two users may get the MaxId at the same time, before any new record is inserted, so both compute the same value and try to insert a record with that Id + 1. Only one of them will succeed.
How would you fix the issue?
1) Use a different id generator?
2) Do a try/catch, retrying the action a few times if it fails?
2.1) If it keeps failing, show the error after some failed retries?
2.2) Continue retrying forever?
3) Completely eat the exception and show that the action completed successfully?
I asked what you would do, but what actually matters is what the worst developer would do. Remember, he is "smart" enough to fool people. He will not let the code retry forever if something is wrong, as that might fail some tests too easily. Eating the exception directly would be lazy and might also fail some other tests. So he uses a mix of items 2.1 and 3. That is, he retries only a few times (not forever)... yet, after the last retry, he shows that the operation completed successfully.
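Just to make that concrete, here is a minimal sketch of the pattern, with hypothetical method and type names (GetMaxId, Insert and InsertResult are illustrations, not code from the real system):

// Anti-pattern sketch: retry a few times, then pretend everything worked.
public InsertResult InsertRecord(Record record)
{
    for (int attempt = 0; attempt < 3; attempt++)
    {
        try
        {
            int id = GetMaxId() + 1;
            Insert(id, record);
            return InsertResult.Success(id);
        }
        catch (Exception)
        {
            // Swallow the duplicate-key exception and try again.
        }
    }

    // All retries failed... report success anyway, with an id that
    // probably belongs to somebody else's record.
    return InsertResult.Success(GetMaxId());
}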
Considering this is an operation where users don't see the inserted record immediately (like many of those "Contact Us" pages), or where such a record still seems to show for some time thanks to some kind of cache, users will see that the operation completed and will believe everything is fine at that moment. They may even see the ID that was generated, yet it doesn't mean anything, as that ID belongs to another record.
Some days later, if they don't receive an answer (considering a "Contact Us" page) or their record disappears, they may complain, but then the developer can always say they registered the wrong id (if they have the id at all) or that they forgot to save.
The motivation behind this attitude is that a site that shows error messages is seen as bad. When no error messages are shown, things "are great". Of course, other developers may not agree but, if they don't know this is how things work, it is always possible to blame the database (or its administrator) for the "data corruption".
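For contrast, a minimal sketch of item 1 from the list above, assuming SQL Server, System.Data.SqlClient and a hypothetical Messages table whose Id column is an IDENTITY column, so the database generates the key itself and concurrent inserts can no longer collide:

using System.Data.SqlClient;

public static int InsertMessage(string connectionString, string subject, string body)
{
    using (var connection = new SqlConnection(connectionString))
    using (var command = new SqlCommand(
        "INSERT INTO Messages (Subject, Body) OUTPUT INSERTED.Id VALUES (@subject, @body)",
        connection))
    {
        command.Parameters.AddWithValue("@subject", subject);
        command.Parameters.AddWithValue("@body", body);
        connection.Open();
        // If this fails, let the exception propagate (or log it and show a
        // real error); don't pretend the insert succeeded.
        return (int)command.ExecuteScalar();
    }
}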
Solving "Connection Timeouts"
Locally the site always worked fine. In production, many users were facing "database connection time-outs". This was intermittent: the site usually became slower and slower, then users started to see that message, then it worked again.
Looking at the site configuration, it had a connection pool of 100 connections. The site sometimes had more than 100 active sessions, but sessions remained active for 20 minutes after users disconnected, which means it was almost impossible that 100 users were doing database requests at the same time for this to be the issue.
So, what was the right thing to do?
- Look if the connections were being kept alive per session?
- Look if there were places that weren't disposing the connections (so they are not returning to the pool)?
- Increase the connection pool size to 9,999,999?
Actually, the real case I had to deal with was not caused by connections being stored in the session. The developer was aware that he should not do that. He also knew that every time he opened a connection to be used by a single method, he had to close it (and he correctly used the using statement). The problem was that he created some methods that instantiated a connection, opened it and returned it (so those methods couldn't close the connection) and, in his head, a connection created by another method could never be closed by the caller; it was the garbage collector's responsibility to do it.
What happened is that those connections weren't returned to the pool until the next garbage collection, and garbage collections only happen when .NET considers it is running short on memory. If all 100 connections are in use but .NET sees the computer has plenty of memory, there's no garbage collection, and new connection requests time out after 60 seconds because the pool is full.
Yet, continuously generating page objects, instantiating connection objects and throwing the timeout exceptions consumes memory, so eventually the garbage collector ran and the site started working again.
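A minimal sketch of that kind of leak, with hypothetical names (the real code was different, but followed the same idea), and the trivial fix:

private static string connectionString; // assume this comes from configuration

// The helper opens and returns a connection, but nothing ever disposes it,
// so it only goes back to the pool when the finalizer runs after a
// garbage collection.
private static SqlConnection OpenConnection()
{
    var connection = new SqlConnection(connectionString);
    connection.Open();
    return connection;
}

public static int CountUsers()
{
    var connection = OpenConnection(); // never closed or disposed
    using (var command = new SqlCommand("SELECT COUNT(*) FROM Users", connection))
        return (int)command.ExecuteScalar();
}

// The fix: dispose whatever was opened, even if a helper created it.
public static int CountUsersFixed()
{
    using (var connection = OpenConnection())
    using (var command = new SqlCommand("SELECT COUNT(*) FROM Users", connection))
        return (int)command.ExecuteScalar();
}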
What's the problem with the connection pool of 9,999,999? The database server started to die. But then the developer considered that to be the database administrator's problem (and this case only came to me because the database administrator asked me for help, as he was convinced it was a problem in the site's code).
Maybe part of the problem is that the connection pool doesn't open all connections at once. If it tried to do that, the site wouldn't start and the developer who made that change would be forced to use a more realistic connection pool size (or maybe he would find the perfect balance that keeps screwing with the database server while still getting the timeout exceptions, who knows?).
Improving Overall Performance and Availability
The site had always been slow. It was known that there were many bad practices that made the site much slower than it needed to be and, ideally, every part would be "replaced with time".
I think every developer with some experience knows cases like this. The promise of replacing bad parts with better ones never materializes, in particular because developers are always required to do other things and don't have time to deal with the old code, which is spaghetti, so it's impossible to replace a single part without affecting the entire site.
One guy read on the internet that, to make sites faster and more available/dependable, we must use web farms. That's great when there are many computers involved, but personally I can't say that putting a farm on a single computer will really make things faster. In many cases it is even the opposite. It can give some benefit to memory allocation and garbage collection, but there are so many other issues that we must really evaluate each case.
So, what do you think he did?
- Explained that he believed using a Web Farm would improve performance?
- Started testing a Web Farm on his computer or on the test computers, to see if the performance really improved or if there were other problems?
- Simply changed the server's configuration in production directly?
If you think he went for item 3 directly, you are wrong. The guy told everybody in a meeting that the site could work better if a farm was used. I explained to him that the site had already worked as a farm in the past and that it didn't anymore because there had been too many corruptions caused by the simple fact that the site was not made to run in a farm. For example, there were too many in-memory locks, local caches and the like that don't work in a farm. It would require a major refactoring to make it work properly, but that was already the purpose of the new site (it already existed, it simply didn't have all the functionality yet).
Well... for two months everything was OK and then, one day, the site started to throw exceptions saying the ViewState was corrupted. My first impression was that someone had put the site into a farm, so I asked:
- Did anyone put the server in a farm?
- Yes, I did. This improves performance.
- Sure... and it also causes these errors.
- Impossible! IIS is prepared to run in a farm and it handles all concurrency issues.
- IIS is prepared to run in a farm, but it doesn't solve issues in our site. Our site isn't ready for the farm... also, I didn't know that you put the server into a farm, but this was my first guess when I saw the error.
The guy tried to argue a little more, but by then the manager understood that it was a bad idea, remembering the problems faced in the past, so he asked me to disable the farm.
In this particular case, the problem was not even in how the site was programmed. IIS was simply generating a different per-process key to encrypt view states. Every time a postback ended up in a different process, there was a ViewState corruption. This error would have happened even in single-computer tests (and it was simple to fix), but the guy "knew" that a farm would never cause problems, so no tests were necessary and he changed the production server's configuration directly... I don't know if he didn't remember that meeting or if he was trying to prove that I was wrong and that the farm was the best solution of all time.
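The usual fix for that particular symptom, assuming a classic Web Forms web.config, is to stop relying on the auto-generated, per-process keys and configure one fixed machine key that every process (or server) shares, something along these lines (the key values below are placeholders, not real keys):

<system.web>
  <!-- One fixed, shared key pair instead of auto-generated per-process keys,
       so any process in the farm can validate and decrypt the ViewState. -->
  <machineKey validationKey="PUT-YOUR-GENERATED-VALIDATION-KEY-HERE"
              decryptionKey="PUT-YOUR-GENERATED-DECRYPTION-KEY-HERE"
              validation="HMACSHA256"
              decryption="AES" />
</system.web>

The keys themselves must be generated once and shared by all nodes; generating them is a separate step and they should be treated as secrets.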
Note that I only identified the issue immediately because that particular site had already had that problem in the past. If that were not the case, how much time would the team have spent trying to understand what was wrong? Nobody aside from that guy knew there had been changes to the server, so why did the server start to have issues? The worst part is that redeploying a working site into production would not help, as the farm configuration is not part of the site deployment.
Thread Synchronization (or not)
I will start this case with this pseudo-code:
double total = CalculateSomethingSlow() + CalculateSomethingElseSlow();
As you can see, to get the total we are doing two slow calculations. They are really CPU intensive tasks and I can assure you that calculating them in parallel is perfectly possible.
That means we can use threads, tasks or any kind of resource that allows us to parallelize the code. But let's be smart: we don't need to put both methods on secondary threads or tasks. Only one of them needs to go to a secondary thread, while the other keeps running on the current thread.
That is, logically speaking:
- We start CalculateSomethingSlow() on another thread;
- Then we do CalculateSomethingElseSlow() on the current thread;
- Finally, we sum the value we just calculated with the result coming from the other thread.
And that "result coming from the other thread" is the source of the problem.
What if the other thread didn't finish its work yet?
If we use Task<double> in .NET, trying to read its Result property will block the current thread until a result is available. But the developer did something like:
double firstResult = 0;
ThreadPool.QueueUserWorkItem((ignored) => firstResult = CalculateSomethingSlow());
double secondResult = CalculateSomethingElseSlow();
return firstResult + secondResult;
And by doing this, as soon as CalculateSomethingElseSlow() finished, the sum of firstResult and secondResult was returned, even if the ThreadPool thread hadn't finished its work yet. In that case, zero was seen in firstResult and the wrong total was returned.
This case actually had many variants. Some of them even used a Thread.Sleep(100) or similar just before the return.
Honestly, I believe thread synchronization using timers is one of the most common flaws in many different applications and systems.
Passing redirect parameters "correctly"
This case is specific to ASP.NET Web Forms. There are many reasons to make one page actually get its contents from another page or simply redirect to another page, using methods like Response.Redirect, Server.Transfer etc.
Each of these methods has its pros and cons, in particular regarding parameter passing. So, why not avoid all the complexity?
Forget about changing the query string (things like ?id=Something&sort=Name added after the page URL), forget about using untyped contexts and use something "much better": simply declare the variables correctly typed (so, no cast needed)... as static! Static variables are accessible independently of the page that's currently running, so when Page1 sets a static variable and then redirects to Page2, Page2 can read that static variable directly. There's no query string manipulation, no confusion over whether this is the first parameter (so the ? character must be used) or an extra parameter (so the & must be used), no need to encode strings, no casts or anything.
Well, I went too fast this time and forgot to present the three alternatives, so let me do it now. Page1 was opened and actually finished its work, so the user must receive the contents of Page2. What should the developer do?
- Use Response.Redirect() to Page2, maybe having to manipulate the query string if parameters are needed?
- Use Server.Transfer(), with the possibility of manipulating the query string or using the HttpContext to set variables seen by both pages?
- Use either Response.Redirect() or Server.Transfer() and pass correctly typed contents from one page to the other by using static variables?
OK, I already said that item 3 was used. So, why is it bad?
Because it doesn't matter which thread, which user or anything else: a static variable is shared amongst everybody. So a single user may test the site and everything works (data from Page1 arrives at Page2 correctly). But if two or more users are connected to the site, it is possible that two users request Page1, both are redirected to Page2, yet only the parameters of one of the users are seen by Page2 (and used for both).
This is a normal concurrency issue: two users request Page1. Both instances set the same static variable, so only one value remains. Then both instances of Page2 see that single value. If it is a search page, I may search for "cat", another user for "dog", and both of us may receive the results either for "cat" or for "dog". That can be much worse when security-critical information is involved (which was the real case).
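A minimal sketch of why, with hypothetical page, control and method names (SearchBox, ResultsGrid and Search are illustrations only):

public partial class Page1 : System.Web.UI.Page
{
    // Static: one single value shared by ALL users and ALL requests.
    public static string SearchTerm;

    protected void Search_Click(object sender, EventArgs e)
    {
        SearchTerm = SearchBox.Text;      // user A writes "cat", user B writes "dog"
        Response.Redirect("Page2.aspx");
    }
}

public partial class Page2 : System.Web.UI.Page
{
    protected void Page_Load(object sender, EventArgs e)
    {
        // Both users read whichever value was written last.
        ResultsGrid.DataSource = Search(Page1.SearchTerm);
        ResultsGrid.DataBind();
    }
}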
STM
Software Transactional Memory, also known as STM, is a very interesting concept. It means that you can start a transaction, modify in-memory data and, if you roll back the transaction, the memory modifications are gone without any special effort on your part.
There was even an STM.NET being developed by Microsoft for some time. I really don't know what happened; if I remember well, the technology was simply abandoned because solving all the small details made things so complicated that the benefits of STM didn't pay off in a language like C#.
Yet, this case happened even before STM.NET was announced. After all, why wait? TransactionScope was already there!
No kidding: what about creating an entire system using TransactionScope and "knowing" that, if you don't commit a TransactionScope, all changes are reverted, even in-memory changes?
No need to test anything. Simply assume that all the in-memory structures that are modified in parallel with the database will be restored as soon as the database transaction is rolled back.
I don't know how to present multiple alternatives in this case. The biggest problem here is that all the tests covered the cases where the transaction completed successfully (in theory transactions were never supposed to fail, "that would only happen if the database was down"). Yet, when they failed, the in-memory structures (caches or anything else) still held the modified data, while the database held the previous version of the data. This caused lots of data corruption after the first failure happened.
And, even after showing that the contents of a list or an array weren't recovered after finishing a TransactionScope block without a commit, the guy was still not convinced. According to him, "I must have done something wrong when using the TransactionScope".
Making Things Faster and Safer
This case didn't happen in the same company as the previous ones (yes, all the others happened at the same place... they were built that way over many years, by many different people, and when I joined the company I had to deal with all of them at once), yet this particular case is the one that amazed me the most.
I can't even say this was a "programmer's fault". It was the entire architectural idea, supported by many architects!
Maybe I should draw this to explain, but I will try to explain with text.
Old Architecture
- Clients connect by http (non-secure) to the server.
"Faster" and "Safer" Architecture
- Clients connect by http (non-secure) to a middle-tier server;
- This server connects via https (secure) to another server, in the same building.
Their explanation:
- As there's an https server, all the communication becomes secure;
- As there are two servers per request instead of only one, we have 2x performance.
If you didn't see the problem, let me explain:
Any attacker would attack the communication between the clients and the first server, which is unsecured. No one will ever attack the communication between the middle-tier server and the final server, as to do that the attacker needs to be inside the organization. And, even if the attacker were indeed inside, he could still attack the external communication to the middle-tier server, so protecting the communication after the data has already arrived is useless (it could work if the entire communication was protected, as a way to defend against possible attackers on the inside, but it doesn't work by only protecting things on the inside).
Also, the middle-tier server was not doing any kind of caching or anything else that could help improve performance. The only thing it did was receive a request, encrypt it and forward it to the other server, then receive the answer, decrypt it and send it to the client. In the best scenario the communication speed would be the same, with added lag. In the worst case everything would be slower, as the final server now uses https instead of http, which is a slower protocol (it has the added cryptography consuming CPU cycles).
A similar design could work very well: if the clients communicated with the middle tier through https, the middle tier would use its CPU power to encrypt and decrypt messages, forwarding them to the final server without encryption (so that the final server wouldn't have the encryption overhead). Yet, they didn't want to do that, as the clients weren't supposed to change. Clients should keep using plain http requests... and things would be made "more secure" "thanks" to that middle tier.
The end... for now
Well, I have finished talking about these absurd cases that should never have gone into production. Yet I saw all of them. I am pretty sure that if I keep trying to remember I will come up with many more cases, but I really don't want to have nightmares. Feel free to post your own cases as comments to this post, and I hope you learned something from it... or better, I hope you didn't, because if you did, humanity might be in danger.