Wednesday, October 31, 2007

Continuous Integration exposes the Atomic Change-Set illusion

Many source control systems offer the feature of "atomic check-in". That is, when the developer checks in their changes, they are grouped together in such a way that they all succeed or all fail together. Everyone agrees that this is a core requirement of a good source control system.

The primary purpose of atomic check-in is to force the developer to resolve conflicts with other code before committing their changes. From that perspective, atomic check-in works just fine.

But there is another requirement that atomic check-in is used for. That is the ability to group arbitrary sets of changes together (an atomic change-set). From this perspective, the purpose of atomic check-in is to support the ability to easily merge a change from one code-branch into another.

If you have multiple check-ins for one "real" functionality change, then atomic check-ins start losing their helpfulness. This was not that apparent when developers only checked in code once a week. But those days are gone. Today, developers doing continuous integration check in several times every day.

And merging those changes has become hard (again).

Source control systems are missing an additional level of change management - the ability to group sets of changes together into a single meaningful change-set.

Do any source control systems do this? I think AccuRev might, although I have not used it so I don't know for sure. Certainly Perforce, CVS and Subversion do not. (Perforce jobs do not count.)

Tuesday, October 30, 2007

Shu Ha Ri, Summarized (or, the story of my life)

You may occasionally see the term Shu Ha Ri bandied about by people who think a lot about Agile and Lean. I think Alistair Cockburn was the first to use it in relation to software development.

The idea is that we should teach the beginner (Shu) some techniques, or best practices. Ha is the stage where the beginner has learned that there are some underlying principles, and begins to explore those. Ri is the master stage, where the master adapts and invents new techniques.

"Shu Ha Ri" is implicitly a craft-based way of thinking of software development. The individual words imply the different levels of craftsman, from apprentice through journeyman, to master.

The more interesting aspect of the term relates to communication. It can be hard for Shu-level and Ri-level to communicate. This is because Ri believes there is no single best solution to a problem (everything depends - he will implement a best solution by adapting as he creates it). But Shu needs firm guidance - he needs to be told a good-enough way to solve the problem.

One way for Ri to communicate is using the principle of strong opinions, weakly held. That is, he should communicate as if he is certain that what he says is correct. But it is all an act. He is not certain, and definitely not attached to the opinion.

In short, if I am Ri then I should feel free to present my opinions or theories as dogma.

If the other individual can show a better way, then I will accept that (it's easy, since I was never attached to the original opinion anyway).

Monday, October 29, 2007

Goal Driven Development

One of the foundations of all Agile techniques is that of frequent iterations developing working software. I think maybe this is a developer-centric view of a deeper concept.

When we talk about iterations, what we are really trying to do is to work with a client to break down their "big goals" into smaller ones. We ask them to call these mini-goals "stories" or "backlog items". Perhaps "goal" is a better word. Why...?

For one thing, it is less feature-attached. A backlog item or story usually translates to a feature. Maybe that's not the intention, but that is how it works out.

I'm digressing though. The main thought is that one or more iterations fulfill some client goal. This goal often corresponds with a release. That is a very concrete view of reality. Anything smaller is not meaningful to a client.

We ask them to shuffle backlog items around in order to decide which features make it into a release. What we should be asking them to do is to decide on the minimal set of functionality that meets their goal. Anything after that is meeting some other goal.

On another tangent...

When we ask developers to think in terms of backlog items and stories, we are making it too easy for them. They need to be made aware that the code they write has to work towards the client's goals. If a "feature" can be avoided or changed by meeting the goal in another way, then the developer needs to realize that.

For example? .... The client proposes a feature. Under certain circumstances, when the user clicks the Save button, we should pop up a dialog asking them if they are sure they want to do this. They must click "Yes" to continue.

Seems like a pretty normal feature. It will make a fine backlog item. But what is the goal of the client? In this example (based on real-life), the answer was that we needed users to manually assert their intentions.

The implementation of a dialog box is a poor solution (as it almost always is). A better one was to supply a checkbox that became enabled (and required) when the right conditions occurred. Thus, the user had to proactively assert that they were sure this is what they wanted.

The point is this - when we tell developers to think in terms of backlog items, we are really saying "think in terms of features". Experienced developers will intuitively sidestep this and get to the real goal. Less experienced developers will do exactly what the backlog item suggests.

If we had stated the requirement as a goal instead of a feature, then the less experienced developer would have had to suggest the details of the feature. This implies design, and thought. Exactly what we want in an Agile environment.

Friday, October 26, 2007

Sql Tip - Detecting and removing duplicate rows

I just did this, so I'm jotting down the technique for reference...

(Too) often, databases get duplicate rows inserted. For the SQL newbie (or even those with a few years under their belts) it can be hard to figure out how to fix up the data. This is a technique that works for me...

Assume a simple table with 3 data columns and an integer Id column.

First, detecting the duplicate data:
SELECT Col1, Col2, Col3
FROM Table1
GROUP BY Col1, Col2, Col3
HAVING COUNT(*) > 1

Simple, no? Next, fixing it. The first step in fixing SQL data is the following:
BEGIN TRANSACTION

-- ...my SQL will go here...

ROLLBACK TRANSACTION

That is, protect yourself from your mistakes. Once you are done testing, you can change the ROLLBACK to a COMMIT (or remove the transaction statements) and run the SQL live.

Next, the actual SQL:
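-- Delete every row that has a newer duplicate (same Col1..Col3, higher Id),
-- so that only the newest copy of each duplicated row survives.
DELETE older
FROM Table1 AS older
INNER JOIN Table1 AS newer
ON older.Col1 = newer.Col1
AND older.Col2 = newer.Col2
AND older.Col3 = newer.Col3
AND older.Id < newer.Id


(The less-than criterion at the end is the main trick to this).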
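To try the whole detect, delete and rollback cycle without touching a real database, here is a minimal sketch using Python's built-in sqlite3 module. It is an assumption-laden stand-in, not the statement above: SQLite does not accept the DELETE ... FROM ... JOIN form, so the delete is rewritten as an equivalent correlated EXISTS, and the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.isolation_level = None  # autocommit mode; transactions are controlled explicitly below

conn.execute(
    "CREATE TABLE Table1 (Id INTEGER PRIMARY KEY, Col1 TEXT, Col2 TEXT, Col3 TEXT)"
)
# Hypothetical sample data: rows 1, 2 and 4 are duplicates of each other.
conn.executemany(
    "INSERT INTO Table1 (Col1, Col2, Col3) VALUES (?, ?, ?)",
    [("a", "b", "c"), ("a", "b", "c"), ("x", "y", "z"), ("a", "b", "c")],
)

DETECT = """
    SELECT Col1, Col2, Col3
    FROM Table1
    GROUP BY Col1, Col2, Col3
    HAVING COUNT(*) > 1
"""

# SQLite variant of the self-join delete: remove a row when a newer twin exists.
DELETE_OLDER = """
    DELETE FROM Table1
    WHERE EXISTS (
        SELECT 1
        FROM Table1 AS newer
        WHERE newer.Col1 = Table1.Col1
          AND newer.Col2 = Table1.Col2
          AND newer.Col3 = Table1.Col3
          AND Table1.Id < newer.Id
    )
"""

duplicates_before = conn.execute(DETECT).fetchall()

conn.execute("BEGIN")
conn.execute(DELETE_OLDER)
duplicates_after = conn.execute(DETECT).fetchall()
surviving_ids = [row[0] for row in conn.execute("SELECT Id FROM Table1 ORDER BY Id")]
conn.execute("ROLLBACK")  # undo the delete; swap for COMMIT once the result looks right

rows_after_rollback = conn.execute("SELECT COUNT(*) FROM Table1").fetchone()[0]
```

Inside the transaction only the newest copy of each duplicated row remains, and after the rollback the table is back to its original four rows.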

Password lengths on websites

I register on a lot of websites. Using KeePass, it is no problem to store a different, randomly generated password for each site.

But...almost all websites I register on limit my password length. Some to as low as 6-8 characters, and many to 12-16. Often they do not explicitly state the maximum length, but the password text box is limited in length.

If you know anything at all about how passwords are kept securely, you will know:
The length of the password has no impact on how much space it takes to store that password.
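The reason is that a securely stored password is kept as a salted hash, and a hash has a fixed size regardless of its input. A minimal sketch, assuming PBKDF2 with SHA-256 from Python's standard library (the scheme, salt size and iteration count are illustrative choices of mine, not a recommendation):

```python
import hashlib
import os


def hash_password(password: str, salt: bytes) -> bytes:
    """Derive a fixed-size digest from a password of any length."""
    # The iteration count here is illustrative only.
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)


salt = os.urandom(16)
short_hash = hash_password("abc123", salt)
long_hash = hash_password("This is a much longer passphrase, with punctuation!" * 10, salt)

# Both digests occupy the same number of bytes in storage.
print(len(short_hash), len(long_hash))
```

A six-character password and a five-hundred-character passphrase take up exactly the same space once hashed.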

So why do websites limit us to shorter passwords? I can think of only one reasonable explanation. Our passwords are not being stored securely.

Further aggravating the situation, many passwords are limited to letters and digits. This leaves even longer passwords open to attack. Again, the only reason to limit users' choices is that the passwords are not being stored securely.

So what can we do about it? The single most important thing you can do is to use a different password for each website. Then, if one of your passwords is cracked, the rest of your online world is not compromised.

Since there is no easy way to remember all your passwords, you should use a password manager (such as KeePass) to store all your passwords.

One final note. Use an especially secure password to secure the rest of your passwords. The simplest way to do this is by using a short phrase. For example, your password could be:

This is Steve's password. It is kinda long to type, but it is a strong one.

Thursday, October 25, 2007

The myth of code re-use

It is disturbing for me when I review new code, and I notice that it is almost identical to similar code elsewhere. Usually, copy-and-paste programming is the cause.

Sometimes it is ok, because what is being copied is essentially configuration metadata. But mostly it is a bad thing.

I want to talk about a scenario that I see often, and I think is very common in software development.

We work on some problem domain where we will need multiple implementations. We gain enough understanding to output a version 1.0 implementation, wherein we develop some re-usable parts and some not-so-reusable parts.

This is ok. We have learned, and when we come to doing a second implementation we will apply what we learned to make it even better. Or that is how it should be! But it does not happen.

What happens instead is that programmers love re-use (and why wouldn't we - it makes our jobs seem easier). We love it so much, that we will use copy-paste to achieve it. That is, we will copy implementation 1.0 and then try to shove implementation 2.0 into that box.

Never mind that we do not understand the problem sufficiently to know if implementation 2.0 is sufficiently like implementation 1.0 to use the same box. We will copy the box, and then try and mold it to our needs.

This is a recipe for untidy, silly code that cannot handle the little edge cases that come up, because implementation 2.0 is never the same as 1.0.

So what is the solution?

Young grasshopper...forget re-use. It is a red herring. A diversion, an evil distraction. It is not achievable in the way you think.

Forget re-using implementation 1.0. Implementation 2.0 is a chance to start over with a clean slate. An empty page, a new design. The only re-use is in your head - refining and learning. Implementation 2.0 is your chance to apply what you have learned to the problem.

The secret you have to accept is this:

Increasing your understanding of the problem domain is the only way you will achieve sustainable re-use.

When someone (or a group of someones) increases their knowledge of the problem domain to the point at which they achieve a form of enlightenment, then sustainable re-use is not only possible - it is inevitable.

It may take 3 implementations before you achieve enlightenment, or it may take more. The simple lesson is this:

Don't try to re-use existing code. What you want to re-use is an API - a way of thinking of the problem domain. Keep trying new things to improve the way you solve the problem. Re-use will find you when you are ready.

Wednesday, October 24, 2007

Productivity secrets - Debugging

Debugging. The process by which a programmer discovers how his program works.

I am a hyperproductive programmer. That means I can output 3-20 times the work that an average programmer can. (Arrogant? Maybe. Still true though).

One of the biggest reasons I am productive is that I make a habit of not "debugging". This is my theory:

Debugging code is always wasted time

The only output of debugging is a greater understanding of the code. But there are better ways to understand code. I can read it. I can refactor it so as to make it easier to read. Leading to Corollary one:

Code reading and refactoring are more important skills than debugging.

Time spent debugging is not only wasted, it smells of poor code quality. If I lack understanding, that means that the code was too hard to read.

Programmers that do Test-Driven development understand this productivity boost. Once you write unit tests, you find that you no longer have to debug. Leading to Corollary two:

Improved up-front quality leads to less debugging.

Moving on. Maintainability. An ugly word. A nicer word is Soluble (or grokkable). My productivity is far higher when I am dealing with code that I "grok". Put me on a new project, and it will take me a while to come up to full speed. Much of that time will be debugging, refactoring, or reading code. Leading to Corollary three:

Solubility of code has a direct impact on time spent debugging.

(That is so obvious that it may be a truism).

In summary - debugging is a symptom of underlying problems: poor code quality, poor programmer skills, and code that is hard to read.

If you notice yourself, or other programmers debugging code, then ask yourself - which combination of the above is a problem? Then fix it.

Friday, October 19, 2007

PerfectAPI.com

I purchased the perfectAPI.com domain earlier in the week. You should see this blog already there at blog.perfectAPI.com. (For now, the blog is still hosted at blogspot, but the rest of the site is hosted by GoDaddy).

My idea for perfectAPI.com is to concentrate on small, specialized functionality, and present it in a way that is, for want of a better word, "perfect".

Of course, "perfect" does not exist, but my API design skills (and rigorous testing) will make it so far beyond what people expect, that it may as well be.

The site itself is using Drupal for now. I also took a look at WordPress and Joomla. Both were very, very good, but I was swayed by Drupal's clean-looking default theme and easy-to-setup friendly urls. I may change my mind later, but I'll stick with it for now.

I am still figuring out where I will focus my energy first. I'm not going to be writing the "heavy" pieces myself. To that end, I've been looking at various open-source components on CodePlex and SourceForge...

Tuesday, October 16, 2007

API Design vs. OO Design

Traditional OO lore teaches us that objects are things that have both data and behavior. Blindly following this rule can lead us to make poor design choices, especially around what many refer to as "business objects".

The pattern is that these objects already have data, so we seek to add behavior as well. In this way we can feel happy and content that we have a true "object", and we are successful OO programmers.

The problem is that adding behavior as a sort of "suffix" to an object is ignoring a more important aspect of objects, which is that they should do one thing, and do it well. Add too many "suffix" behaviors, and pretty soon you can have a tightly coupled bowl of spaghetti.

This is not just theoretical - I have seen it happen, more than once. I've even been guilty of it.

So what is the solution? When we have classes that are primarily data, should we resist adding behavior?

My answer is "it depends". To understand why, we need to take a small detour into API design...

Sometimes, programmers expect things to be a certain, simple way. They do not want to ask a FactoryLocator for an IObjectPersistorFactory, use that to get an IObjectPersistor, and finally tell the IObjectPersistor to Save their object to the database. They just want to write:

myObject.Save()
or
myObject.Load(id)

This ActiveRecord implementation is easy to write and easy to read. In short, it is good because it is a nice API for the client of the object. It has drawbacks (no transaction support, high risk of coupling to database). But in many systems, this API will be sufficient.
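To make the shape of that API concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the Customer class, the in-memory dictionary standing in for a database table, and the choice to make load a class method rather than an instance method.

```python
from dataclasses import dataclass
from typing import ClassVar, Dict, Optional


@dataclass
class Customer:
    name: str
    id: Optional[int] = None

    # Hypothetical stand-in for a database table: id -> name.
    _table: ClassVar[Dict[int, str]] = {}
    _next_id: ClassVar[int] = 1

    def save(self) -> None:
        """Persist this object; the caller never sees a factory or persistor."""
        if self.id is None:
            self.id = Customer._next_id
            Customer._next_id += 1
        Customer._table[self.id] = self.name

    @classmethod
    def load(cls, id: int) -> "Customer":
        """Rebuild an object from its stored row."""
        return cls(name=cls._table[id], id=id)


customer = Customer("Ada")
customer.save()
loaded = Customer.load(customer.id)
```

The client code is two lines long, which is the whole appeal. Notice, though, that the class already knows about its own storage, which is where the coupling risk comes from.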

So the ActiveRecord "suffix" is mostly ok. What other behaviors can we add? How about validation? The save method should probably validate before it saves, so as to ensure we have good data in the database. How about some initial field values for new objects? And some event driven behavior - let field A be defaulted when field B changes? And we need properties for other objects. MyCustomer.Address.ZipCode works real nice. We can even lazy-load the Address property. Not too hard.

Hmm. Question. If we save the Customer object, should the Address save too? Probably. So we need to add some more code to the Save method for that.

etc. etc.

You get the picture (I hope). You can create a perfectly functional system in this way, but the coupling of all functionality to a single class will make it difficult to change in any substantial way. It will also have poor quality, because we are ignoring several key principles, such as DRY and Open-Closed.

There is only one way in which you can mitigate the problem. Use code-generation to generate your "business object" implementations. This mitigates quality problems substantially (DRY does not apply to generated code). It also forces you to either state some things declaratively (such as required fields), or else move them into their own dedicated area.

MSBuild and dogfood

MSBuild is Microsoft's answer to NAnt. That is, it is an XML scripting language that supports the automation of compilation of .NET solutions.

The theory is, MSBuild is what Visual Studio uses internally to compile solutions, so it should be exactly equivalent to calling MSBuild from a build-server. We are led to believe that Microsoft has "eaten their dogfood" with regard to MSBuild.

The reality is, that is hogwash! If you have Visual Studio 2005, take a look at the MSBuild command line that appears in the output window when you compile a project. It specifies the name of every file in the project. In other words, Visual Studio independently parses the project (which it should not do because that is MSBuild's job), and then calls MSBuild with a highly customized set of parameters that has no doubt been well tested to work in a variety of scenarios.

The point being, it is not feasible to duplicate that command line on a build server. The result is some non-trivial level of frustration! I am sure there are more, but the things I have noticed are:
  • MSBuild requires references to projects that Visual Studio does not. Thus, what compiles in Visual Studio will not necessarily compile on the build server.
  • MSBuild manages project dependencies in a different way than Visual Studio. The result is that it is possible for a full rebuild on the build machine to silently FAIL to build one of the projects.
Aaargh!

Tuesday, October 09, 2007

Presenter-Model View with Controllers

At my current (soon to be gone) workplace, we have a unique style of doing our UI....

I think I'll call what we have Presenter-Model View with Controllers. (There is a View and there is a very rich Presenter Model. There are Controllers too).

We mostly drop generic container controls onto forms with zero or minimal code. We have extended properties to be able to bind those controls at design-time. (The appearance is determined at run-time). We have bi-directional deep (multiple dots) data binding, which allows the view to be completely driven by the Presenter Model.

The Presenter Model is more than simply a device for binding a form. It is a first class object in the system, used by security. It also supplies Validation.

Underlying that, we have a custom O/R Mapper with integrated support for database structure evolution.

It took a long time to set that all up, and it saddens me that the product will die soon :(

Sunday, October 07, 2007

The end of a software company

Lean principles teach us to recognize a constraint in the system, and "elevate" (fix) it. I was reading what Amit Rathore was writing about this, and it got me to thinking...

My current work environment is a going-out-of-business software company. Our parent company is continuing in the same market, but they will be using a different piece of software than the one we created. (Due to acquisitions, there were 2 parts of the company creating the same kind of product).

The way our business unit worked is that we would get a sales order for a new deployment of our software. The client would specify their requirements in some sort of RFP. We would then spend time developing the missing pieces (required functionality that was not yet present), convert their existing data to our own format, train the users, and deploy.

In doing this, it appeared to me that Testing and Inventory (time between completing work and Deploying it) were large constraints. In other words, we would build a lot of software, but no-one would look closely at it until we deployed it months (up to a year) later.

It was a decision of the business to expect high quality work, but acceptance-testing that same work was so unimportant that not a single person was assigned full-time to it. Even traditional regression, or smoke testing was not prioritized. To the test-infected, this sounds a little crazy!

But it wasn't. Because of the RFP (contract-driven) style of the client relationship, there was simply no justification for spending more on acceptance testing (we did not have sufficient real client input to know for sure that we were building what they wanted). If it met the letter of the RFP, then it was ok.

During big-bang style deployments, we would run around and fix the high-priority issues (mixture of bugs and changed requirements) until the product was working to the satisfaction of the client. Without built in development quality and Agile responsiveness, this is a nightmare. With those things in place, it is just highly stressful.

The extended deployment periods were our acceptance tests. They were also when we found out the most about *actual* client requirements. Ultimately (with exceptions) the client was satisfied (but not necessarily happy). And the Product Manager made sure to build more realistic requirements into the next version.

Back to the failure of the business...

In this model, the satisfaction of the next client is directly proportional to how closely their RFP matched a previous one. This was our downfall, because we came into an expansionist period where each new client was a completely new RFP. We were breaking into new product areas and new geographic areas. And our unit was expensive (because we were developing large amounts of good quality, well designed, extensible software).

In this expansionist period we had great difficulty in creating happy clients, because they would each get the worst possible result - an initial, painful deployment where the best possible outcome was meeting the "letter" of the requirements. (Actually, we did a little better than that, but only through a lot of good, dedicated people doing heroic things).

We tried to slow down the expansion but conditions (new-sales-driven upper management) would not allow this. Although they did not think of it in those terms, management did try to elevate the testing/deployment constraint, by working more closely with new clients. This was done by increasing the size of the client-relationship staff.

In the end, we had some successes (and some failures), and the parent company eventually came to the decision to end our unit. This was not a direct result of our failures, but I am sure that had we been more successful, it would have gone down differently.

Thursday, October 04, 2007

HashTable of HashTables

Today I discovered that the effective limit of a HashTable with random hashes is around 65,000 items. This is because the hash-key is a 4-byte integer (32 bits). The way the stats work out, you should expect collisions from about 2^(32/2) = 65536 items. In many scenarios (mine included), that risk is too high!

It is not hard to come up with a unique enough random hash-key. The problem is that the HashTable will only allow that key to be an integer. So my workaround is to create *two* keys instead. This will increase the limit to 2^(64/2) = 4294967296 items. If I have that many items in memory, I will have other problems!
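The 2^(bits/2) figures are the "birthday bound". As a rough check, here is a small sketch using the standard approximation p ≈ 1 - exp(-n(n-1)/(2N)) for n uniformly random hashes in a space of N values (the function name and the uniform-randomness assumption are mine):

```python
import math


def collision_probability(items: int, hash_bits: int) -> float:
    """Approximate chance that at least two of `items` random hashes collide."""
    space = 2 ** hash_bits
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return -math.expm1(-items * (items - 1) / (2 * space))


p_32 = collision_probability(2 ** 16, 32)  # one 32-bit hash key
p_64 = collision_probability(2 ** 16, 64)  # two 32-bit keys combined

print(f"32-bit: {p_32:.4f}")
print(f"64-bit: {p_64:.3e}")
```

With 65,536 items, a single 32-bit key gives roughly a 39% chance of at least one collision, while 64 bits of key make a collision at that size vanishingly unlikely.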

Once you have two keys, use the first to key into HashTable-1. Then make each element of HashTable-1 a new HashTable, and use the 2nd key in that one. (In a totally random scenario, each 2nd-level HashTable will only have a single element. So it would be best to initialize it with that in mind, so as not to use too much memory).

My particular scenario (an identity map) uses .NET types as the first key, and database IDs as the second. This is not the best-case for randomness, but it is sufficient for my purposes (actually, guaranteed unique).
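The two-level structure can be sketched as follows, in Python rather than .NET. The IdentityMap, Customer and Order names are hypothetical, and Python dictionaries resolve hash collisions internally, so this only illustrates the layout (type first, database ID second), not the collision workaround itself.

```python
from typing import Any, Dict, Optional


class IdentityMap:
    """Two-level lookup: object type first, then database ID."""

    def __init__(self) -> None:
        self._by_type: Dict[type, Dict[int, Any]] = {}

    def add(self, obj: Any, db_id: int) -> None:
        # Create the second-level table for this type on first use.
        self._by_type.setdefault(type(obj), {})[db_id] = obj

    def get(self, obj_type: type, db_id: int) -> Optional[Any]:
        # Returns None when either the type or the ID is unknown.
        return self._by_type.get(obj_type, {}).get(db_id)


class Customer:
    pass


class Order:
    pass


identity_map = IdentityMap()
customer = Customer()
order = Order()
identity_map.add(customer, 1)
identity_map.add(order, 1)  # same ID, different type: no clash
```

Because the type selects the inner table, a Customer and an Order can share database ID 1 without overwriting each other.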

Wednesday, October 03, 2007

"Done", or "Done-Done"

Maybe we're different from other Agile shops, but "what does 'done' look like?" is one of the primary questions in our process. It is almost a running joke - "is it done, or is it done-done?"

To prove you are "done", you have to agree (with the client) on a reasonable repeatable demonstration (i.e. test). Thus, "done" is also synonymous with "accepted", and to some extent, "tested".

It is obvious (to me) that you can never really move a piece of functionality out of "development" until you are able to agree on a definition of "done".

For large tasks (taking weeks or months), it is very helpful to break the task down into meaningful chunks. Being able to define "done" for each of those chunks is an invaluable technique. The "doneness" criteria allow clear borders to be drawn between each chunk, and help assure that each is meaningful functionality.