Top 10 Architectural, Organizational and Process Related Failures
When we created AKF Partners in 2007, our first handful of clients had their share of engineering, infrastructure and technology problems, but they also shared something else we were not expecting: organizational and process issues that sometimes appeared to create the problem and at other times stood in the way of permanent fixes.
This insight, and the resulting collection of common cures (or recommendations), became the basis of our first book, The Art of Scalability.
This article is a view into some of the many points we make in The Art of Scalability. While nowhere near a complete list, it represents a “Top 10” list of the most common architectural, organizational, and process-related failures in small, medium, and large product groups.
Architectural Failures
1. Designing Implementations Rather Than Architecting Solutions — Which of the following approaches would you use to describe your product architecture?
- A: “We are a Java shop running GlassFish on Apache Felix with MySQL and MongoDB persistent storage.”
- B: “We have a 3-tier architecture, divided by services into fault isolation zones with no synchronous communication between them. Our persistent storage tier combines relational and NoSQL stores, horizontally sharded, with native replication for redundancy.”
The correct answer is “B”. Answer “A” better describes how a potential architecture is implemented or deployed on any given day, but it does not describe how the solution scales, fails, or works. Any deployment attribute (programming language, operating system, database choice, hardware choice, etc.) will scale or fail depending on the way those systems interact. Properly architected solutions allow components (implementations) to be replaced over time as the need arises. They do not rely on any given implementation for scalability or availability.
2. Failing to Design for Failures — Everything fails. Hardware, software, datacenters, ISPs, processes, and people — especially people. When a system is architected correctly, and key services or customer segments are isolated into fault isolation zones, the effect (or blast radius) of any failure is contained and minimized. The failure of checkout functionality on ecommerce platforms should not bring down the ability to search, browse, and add products to a cart (which can then be purchased later). Extremely high usage by a “noisy” customer should not cause all customers to have a terrible experience.
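As a rough sketch of the “degrade, don’t fail” idea, the fragment below gives a checkout call a hard timeout and a fallback so that a checkout outage cannot drag down the rest of the shopping experience. The CheckoutClient and CartQueue interfaces and the 500 ms timeout are illustrative assumptions, not a specific product’s API.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// Sketch: degrade gracefully when one fault isolation zone (checkout) fails.
// CheckoutClient and CartQueue are hypothetical placeholders.
public class CartService {

    interface CheckoutClient { String placeOrder(String cartId); }
    interface CartQueue { void saveForLater(String cartId); }

    private final CheckoutClient checkout;
    private final CartQueue queue;

    CartService(CheckoutClient checkout, CartQueue queue) {
        this.checkout = checkout;
        this.queue = queue;
    }

    // Browse, search and add-to-cart never call checkout, so a checkout outage
    // cannot take them down. Checkout itself gets a hard timeout and a fallback
    // instead of an unbounded, blocking call.
    String submitOrder(String cartId) {
        return CompletableFuture
                .supplyAsync(() -> checkout.placeOrder(cartId))
                .completeOnTimeout(null, 500, TimeUnit.MILLISECONDS)
                .exceptionally(ex -> null)
                .thenApply(confirmation -> {
                    if (confirmation != null) {
                        return confirmation;
                    }
                    queue.saveForLater(cartId);   // degrade, don't fail
                    return "Checkout is busy; your cart is saved for later.";
                })
                .join();
    }
}
```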
3. Scaling Up Instead of Out — Many products still rely on a relational database (MySQL, Oracle, SQL Server, etc.) as the primary persistent storage. Instead of taking the time to segment customers across many small databases (each hosting multiple customers for cost efficiencies), many companies still rely on larger, faster hardware to scale a monolithic system. This is “scaling up” and will eventually lead to higher costs per transaction and larger impacts upon failure as the company grows. Further, capital efficiency suffers, as the next larger system stays mostly idle for some time until it is fully utilized. In the extreme case, the largest system simply isn’t large enough, and your product fails (witness eBay circa 1999 or PayPal circa 2004).
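A minimal sketch of what “scaling out” looks like in code: route each customer to one of many small databases rather than one ever-larger server. The JDBC URLs are made up, and the simple modulo hash is a simplification; real systems typically use a lookup directory or consistent hashing so shards can be added without remapping every customer.

```java
import java.util.List;

// Sketch: pick a small database shard per customer instead of scaling up
// a single monolithic database. Shard URLs are illustrative.
public class ShardRouter {

    private final List<String> shardJdbcUrls;

    public ShardRouter(List<String> shardJdbcUrls) {
        this.shardJdbcUrls = shardJdbcUrls;
    }

    // A stable hash of the customer id picks the shard, so every request for
    // the same customer lands on the same small database.
    public String shardFor(String customerId) {
        int bucket = Math.floorMod(customerId.hashCode(), shardJdbcUrls.size());
        return shardJdbcUrls.get(bucket);
    }

    public static void main(String[] args) {
        ShardRouter router = new ShardRouter(List.of(
                "jdbc:mysql://db-shard-0/app",
                "jdbc:mysql://db-shard-1/app",
                "jdbc:mysql://db-shard-2/app"));
        System.out.println(router.shardFor("customer-42"));
    }
}
```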
4. Using the Wrong Tool (or Maslow’s Hammer) — Ask a carpenter to fix your toilet and he will bring a hammer. You probably won’t like the results. Similarly, ask someone whose primary area of expertise is a database to help with product architecture, and they are likely to incorporate a database. The relational database is still the king of persistent storage and is often the best fit for solutions requiring strict ACID compliance or data with inherent relationships (e.g. products that map organizational hierarchies, product catalog hierarchies, etc.). However, we now have many alternative options with which to create a polyglot persistence tier. If you have data that naturally exists in a document format, such as JSON, then a document store may be a better solution. Storing data in its natural state must always be counterbalanced against the overhead of learning and supporting an additional technology.
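To make the polyglot idea concrete, here is a small sketch of matching the store to the shape of the data: relational rows where relationships and ACID updates matter, a document where the data is already a self-contained document. The store interfaces are hypothetical placeholders, not a real driver API.

```java
import java.util.Map;

// Sketch: choose the persistence engine per data shape.
// RelationalStore and DocumentStore are hypothetical abstractions.
public class PersistenceChoice {

    // Catalog hierarchies have inherent relationships and benefit from
    // joins, foreign keys and ACID updates -> relational.
    interface RelationalStore {
        void execute(String sql, Object... params);
    }

    // A user-submitted review is naturally a self-contained document ->
    // document store, saved in its natural shape.
    interface DocumentStore {
        void save(String collection, Map<String, Object> document);
    }

    void saveCategory(RelationalStore db, long id, long parentId, String name) {
        db.execute("INSERT INTO category (id, parent_id, name) VALUES (?, ?, ?)",
                id, parentId, name);
    }

    void saveReview(DocumentStore docs, String productId, String author, String body) {
        docs.save("reviews", Map.of(
                "productId", productId,
                "author", author,
                "body", body));
    }
}
```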
Organizational Failures
5. Organizing by Function — In the past, when we still built and sold software, the role of functional managers was to isolate their function so as to avoid distractions and maximize functional work. This worked well when each team had a very specific role to perform, and the work was passed down a product assembly line. Today’s SaaS companies produce a service that changes rapidly, often every two weeks but sometimes multiple times a day. This requires that the product manager speak with the engineers frequently and that the infrastructure engineers provide input before coding begins. Organizing by function blocks this communication, resulting in poor quality, slow time to market, low levels of innovation and conflict between the teams. Today’s best performing teams are multi-disciplinary, autonomous, and own a service from idea through development into support (what we call “cradle-to-cradle” for an evolving service). If you doubt this principle, ask yourself, “What do we do when there is a crisis?” Almost invariably, the answer is “grab people from different teams, put them in a room, and ask them to solve the problem.” If we do this when it’s important, why don’t we do it every day?
6. Creating Large Teams — Another major failure with scaling organizations is having too many people working on the same code base. When the team grows to more than 15 or 20 engineers, communication and coordination overhead starts to take a toll. Conflicts arise in resource scheduling (environments), merges and decision making. These conflicts take time away from producing more features for customers, which reduces value creation. Fault isolation for services (see item #2 under Architecture) can create natural separations in the product that eliminate these conflicts. It is fine for a team to own multiple services (login, registration, search) but no two teams should own the same service.
7. Failing to Tend Your Garden — We believe great managers are always seeding, feeding, and weeding their “garden” of employees. This means bringing on new talent (seeding), developing team members (feeding), and when necessary letting them go (weeding). To grow the best garden, you must tend it, which means constantly evaluating your team. We like to think about an employee’s performance across three axes: skill, growth, and behaviors. The skill category is how well they know and perform the role for which they were hired. If they are a Java engineer, how well do they code in Java? The growth category is whether they have the ability to grow beyond their current role. Are they ready to be a senior engineer? Are they capable of being an architect? Are they interested in being a manager? The final category, behaviors, is how well their actions are aligned to the company culture. This oft-overlooked category has the greatest potential for adverse effects on the team as a whole. The employee might be a great Java developer and capable of architecting a scalable system, but if their behavior impacts the remainder of the team, they are a weed that must be pulled.
Process Failures
8. Failing to Learn — Santayana’s Repetitive Consequences, “Those who cannot learn from history are doomed to repeat it,” is true for product organizations as well. We like to say, “An incident is a terrible thing to waste.” If your service is unavailable for a period of time and all you do is restore service and move on, you’ve wasted a terrific learning opportunity. Every failure should be seen as an opportunity to learn so as not to repeat past mistakes. This requires discipline to take the time to conduct a proper postmortem. If you think that the root cause of your outage was a hardware failure, you’ve missed the mark. Keep asking “why” until you find root causes in the architecture, people, and process.
9. Implementing Agile as a Panacea — The Agile methodology is great, but it doesn’t solve problems with team structure (see items #5 and #6 in the Organization section) or business problems such as communication between the product and sales organizations. Training and coaching are a must to be successful in implementing Agile. Realizing that Agile is a business process and not just a software practice is even more important to achieving success. Finally, it’s important to understand that Agile can only fix the problems for which Agile was intended. Expecting it to fix predictability in meeting business expectations on specific dates for specific deliverables is like expecting a carpenter to fix your toilet (see #4 in Architecture).
10. Expecting Load and Performance Testing to Identify All Scale Issues — If your company depends on the system to perform on a single day such as Cyber Monday or April 15th (tax day in the U.S.), then by all means you should perform load and performance testing (L&P). However, the problem with L&P is that it tests what happens when users behave exactly the way you’ve predicted they would behave. If your software doesn’t change, that’s a pretty effective way of testing. However, most companies’ software changes frequently, and therefore customer behavior changes. Customers run new reports with queries that take longer. They perform twice as many searches when you change the color of the search button. In a nutshell, their behavior is unpredictable. While L&P can be used to compare against a baseline — how does this version compare to the last one — it should not be expected to catch everything. Instead, if you have customers divided into fault isolation zones (see item #2 in Architecture), then you can roll out new software to a small subset of users and use actual customer behavior on the new software to test how it performs.
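A minimal sketch of that incremental rollout: a small, stable slice of customers gets the new version, and real traffic validates it before everyone else does. The version names and the 5% threshold are illustrative assumptions.

```java
// Sketch: percentage-based rollout keyed on customer id, so each customer
// is pinned to one version while the new release is evaluated.
public class CanaryRouter {

    private final int canaryPercent;   // e.g. 5 means 5% of customers

    public CanaryRouter(int canaryPercent) {
        this.canaryPercent = canaryPercent;
    }

    // Hashing the customer id keeps each customer on the same version,
    // so their experience (and your metrics) stay consistent during the test.
    public String versionFor(String customerId) {
        int bucket = Math.floorMod(customerId.hashCode(), 100);
        return bucket < canaryPercent ? "v2-canary" : "v1-stable";
    }

    public static void main(String[] args) {
        CanaryRouter router = new CanaryRouter(5);
        System.out.println(router.versionFor("customer-42"));
    }
}
```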
Marty Abbott and Mike Fisher are Managing Partners at AKF Partners, a scalability consulting company dedicated to helping companies grow. Their most recent book is the second edition of The Art of Scalability: Scalable Web Architecture, Processes, and Organizations for the Modern Enterprise.