Confusion Matrix in Object Detection with TensorFlow

At the time of this writing, the TensorFlow Object Detection API is still under research and constantly evolving, so it's not unusual to find missing pieces that, once filled, would make the library much more robust for production applications.

If you’ve worked in the field before, you are probably familiar with mAP (mean average precision), a metric that measures the accuracy of object detectors. You can find a great introduction to mAP here, but in short, mAP represents the average of the maximum precisions at different recall values.

The TensorFlow Object Detection API provides several methods to evaluate a model, and all of them are centered around mAP. Unfortunately for those looking for a more conventional confusion matrix, TensorFlow doesn’t offer a solution at this time.

To fill that void, I put together a small script that generates a confusion matrix after running a dataset of images through a model capable of detecting multiple classes of objects in an image. The output matrix has the following format:

  • The horizontal rows represent the target values (what the model should have predicted, i.e., the ground-truth).

  • The vertical columns represent the predicted values (what the model actually predicted).

  • Each row and column corresponds to one of the classes supported by the model.

  • The final row and column correspond to the class “nothing,” which is used to indicate that an object of a specific class was not detected, or that a detected object wasn’t part of the ground-truth.

With this information, the script can easily compute the precision and recall for each one of the classes. It would be equally simple (I leave this to the reader) to compute accuracy or any other metric that comes out of the confusion matrix.
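
To make the matrix format and these metrics concrete, here is a small, hypothetical example in Python. The class names and counts are made up for illustration; the actual script derives everything from the label map and the detection record.

    import numpy as np

    # Hypothetical confusion matrix for a model that detects two classes.
    # Rows are ground-truth classes, columns are predicted classes, and the
    # last row/column plays the role of the extra "nothing" class.
    #
    #                       pred: cat  dog  nothing
    confusion = np.array([[9, 1, 2],    # ground truth: cat
                          [0, 7, 1],    # ground truth: dog
                          [3, 2, 0]])   # ground truth: nothing

    classes = ["cat", "dog"]
    for i, name in enumerate(classes):
        true_positives = confusion[i, i]
        # Precision: of everything predicted as this class, how much was correct.
        precision = true_positives / confusion[:, i].sum()
        # Recall: of every ground-truth object of this class, how much was found.
        recall = true_positives / confusion[i, :].sum()
        print(f"{name}: precision={precision:.2f}, recall={recall:.2f}")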

You need a couple of things to run the script:

  • The label map used by your model — This is the protobuf text file (usually a .pbtxt) that you created in order to train your model.

  • A detection record file — This is the file generated by the /object_detection/inference/infer_detections.py script, which runs a TFRecord file through your model and saves the results in a detection record file (see the example right after this list).
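
In case you still need to generate the detection record, the invocation looks roughly like the line below. The flag names and file names here are an illustration based on the version of the API I was using, so double-check them against your copy of infer_detections.py:

python object_detection/inference/infer_detections.py --input_tfrecord_paths=testing.record --inference_graph=frozen_inference_graph.pb --output_tfrecord_path=testing_detections.record --discard_image_pixels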

Here is an example of running the confusion matrix script:

python confusion_matrix.py --detections_record=testing_detections.record --label_map=label_map.pbtxt

The script will print the confusion matrix along with precision and recall information to the standard output.

In case you missed the link to the code before, here it is again.

How is the confusion matrix computed?

Here is a quick outline of the algorithm to compute the confusion matrix (a rough Python sketch follows the list):

  1. For each detection record, the algorithm extracts from the input file the ground-truth boxes and classes, along with the detected boxes, classes, and scores.

  2. Only detections with a score greater than or equal to 0.5 are considered. Anything under this value is discarded.

  3. For each ground-truth box, the algorithm computes the IoU (Intersection over Union) with every detected box. A match is found if both boxes have an IoU greater than or equal to 0.5.

  4. The list of matches is pruned to remove duplicates (ground-truth boxes that match more than one detection box, or vice versa). If there are duplicates, the best match (the one with the highest IoU) is always selected.

  5. The confusion matrix is updated to reflect the resulting matches between ground-truth and detections.

  6. Objects that are part of the ground-truth but weren’t detected are counted in the last column of the matrix (in the row corresponding to the ground-truth class). Objects that were detected but aren’t part of the ground-truth are counted in the last row of the matrix (in the column corresponding to the detected class).
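
The sketch below shows roughly how those steps translate to Python. It is not the exact code from the script; the box format (ymin, xmin, ymax, xmax), the zero-based class ids, and the function names are assumptions made for illustration.

    IOU_THRESHOLD = 0.5
    SCORE_THRESHOLD = 0.5


    def compute_iou(box_a, box_b):
        """Intersection over Union of two (ymin, xmin, ymax, xmax) boxes."""
        ymin = max(box_a[0], box_b[0])
        xmin = max(box_a[1], box_b[1])
        ymax = min(box_a[2], box_b[2])
        xmax = min(box_a[3], box_b[3])

        intersection = max(0.0, ymax - ymin) * max(0.0, xmax - xmin)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        union = area_a + area_b - intersection
        return intersection / union if union > 0 else 0.0


    def update_confusion_matrix(matrix, gt_boxes, gt_classes,
                                det_boxes, det_classes, det_scores):
        """Updates matrix (a NumPy array of shape [num_classes + 1, num_classes + 1]) in place.

        All arguments are NumPy arrays. Rows are ground-truth classes, columns
        are detected classes, and the extra row/column acts as the "nothing" class.
        """
        # Step 2: keep only confident detections.
        keep = det_scores >= SCORE_THRESHOLD
        det_boxes, det_classes = det_boxes[keep], det_classes[keep]

        # Step 3: compute the IoU of every ground-truth box against every
        # detection and keep the pairs above the threshold.
        matches = []
        for i, gt_box in enumerate(gt_boxes):
            for j, det_box in enumerate(det_boxes):
                iou = compute_iou(gt_box, det_box)
                if iou >= IOU_THRESHOLD:
                    matches.append((iou, i, j))

        # Step 4: prune duplicates, keeping the highest-IoU match for every
        # ground-truth box and every detection.
        matches.sort(reverse=True)  # best IoU first
        matched_gt, matched_det, final_matches = set(), set(), []
        for iou, i, j in matches:
            if i not in matched_gt and j not in matched_det:
                matched_gt.add(i)
                matched_det.add(j)
                final_matches.append((i, j))

        # Steps 5 and 6: count matched pairs, missed ground-truth objects,
        # and spurious detections, with the last row/column acting as "nothing".
        nothing = matrix.shape[0] - 1
        for i, j in final_matches:
            matrix[gt_classes[i], det_classes[j]] += 1
        for i in range(len(gt_boxes)):
            if i not in matched_gt:
                matrix[gt_classes[i], nothing] += 1
        for j in range(len(det_boxes)):
            if j not in matched_det:
                matrix[nothing, det_classes[j]] += 1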

A good next step could be to integrate this script into the evaluation framework that ships with the Object Detection API. I'll try to get around to that at some point.

Premature promotion

(I've got to say that this was meant as an internal memo, so there might be references that will sound weird to people outside my company.)

There are always a thousand ways we can improve our sprints — and when I talk about sprints, I'm referring to agile sprints. Unfortunately, there are also a thousand ways to divert our focus and never make meaningful progress. I've witnessed this over the years: a lot of good intentions that never translated into good results. It is beyond my purpose right now to determine why this happens, but I'm going to assume that zeroing in on a particularly problematic area and doing everything we can to fix it before moving on will help us get ahead.

And then the problem becomes finding the area that will give us the most significant boost. I'm sure you have different ideas, but I want you to think about this one: the current most significant source of inefficiency in our sprints is the premature promotion of tasks on the board.

(And of course, your mileage will vary and you could be in a different stage, with different problems. Please, don't take it personally when I generalize.)

What is "premature promotion"?

Task promotion refers to the movement of a particular task — or story — from one column to another closer to the finish line. In an agile board, columns are visual artifacts that represent the individual processes in the lifecycle of a task. Premature promotion happens when a unit of work is moved out of a process — or column — without being completely ready.

The most pervasive example of premature promotion in our sprints happens in the transition of work from development to quality assurance and creative review (in my company, these are two different processes guided by QA folks and UI designers). The problem is not exclusive to this area, but the gap here is big enough to dwarf everything else.

Why does this happen?

When you think about a sprint in terms of different disciplines that collaborate on a task by moving it through the corresponding processes, it is tempting to create virtual walls between team members that reduce the sense of ownership. The goal shifts from completing a task (no longer important) to moving the task to the next person in line (now the only thing everyone cares about).

It is also common to see quality assurance and creative review as two safety-net processes that will catch anything and everything that's missed during development. There's nothing wrong with this until developers start accelerating to meet their goal of moving a task forward, trusting that whatever they miss will be caught down the line by the other processes.

If all we really care about is moving things through the board as fast as possible, we can't blame people for giving us just that.

Is this really that inefficient?

Moving an incomplete task forward is akin to selling a car that needs to be recalled two years down the line. By the time it comes back, it is incredibly costly to fix, not even counting the damage to the maker's reputation.

Here are three of the variables that play a significant role in all the time that's wasted when promoting a task prematurely:

  • Mental context switch every time a team member has to close a task, do something else, and then come back to the same task. This affects everyone involved in the sprint.

  • Downtime while the task moves between processes and waits for somebody to pick it up, acquire the necessary context, and start working on it. The more the task moves back and forth, the more downtime we incur.

  • Environment setup to accommodate the specifics of particular tasks. This varies from project to project, and includes things like branching, deployments, versioning, and database state, among others.

How can we measure the actual impact of prematurely promoting tasks? We can certainly generalize to come up with a reasonable estimate that will give us an idea of how wasteful this is.

Let's think about "setup time" as everything involved in getting a team member ready to start working on a task efficiently. For our purposes, setup time will only capture the time that we consider wasteful. A task that bounces back once from quality assurance and once from creative review requires us to factor in the setup time 4 times:

  1. Quality Assurance finishes reviewing the task and moves it back to Development.
  2. Development finishes the changes requested and moves the task back to Quality Assurance.
  3. Creative Review reviews the task and moves it back to Development.
  4. Development finishes the changes requested and moves the task back to Creative Review.

How long do we think the setup time is? This varies from task to task, but to be conservative, I'm going to assume that 15 minutes is a good start. Doing the numbers, our task above incurred 1 hour of wasted time (4 setups × 15 minutes).

Now imagine a sprint of work where a 5-person team tackles 50 different tasks, and imagine that we rack up 1 hour of wasted time for each task; the total waste adds up to 50 hours, which is equivalent to adding more than one extra member to the team!

Of course, all of these numbers are relative to the project, team, cadence, culture, complexity, and whatnot. The point, however, stands: premature promotion of tasks is one of the most significant contributors to the inefficiency of our teams.

What can we do from here?

There's no silver bullet to fix this problem, and it will likely require a lot of focus from the team along with different techniques to minimize the issue.

Fortunately, a good start is to increase the attention to detail given to tasks during development, which is not necessarily hard to accomplish. As long as people are aware of the consequences, they will likely get on board and actively try to fix the problem. Here are some ideas that should see some success:

  • Remove any narrative that fosters silos within the different disciplines that work together in a sprint.
  • Improve the definition and acceptance criteria of tasks.
  • Improve the collaboration between developers and quality assurance personnel to make sure they are both working in tandem.
  • When possible, remove artificial obstacles that prevent tasks from being worked on simultaneously by different disciplines.
  • Promote the importance of an increased attention to detail within every individual process.

Conclusion

Every time we look, we see multiple ways to make things better. Our capacity to find problems is great, but it can also hinder our ability to make things happen. Focus is a precious commodity that we should use to help us drive progress. Improving our production sprints is no exception.

Among all the little things that could be better, prematurely promoting tasks looks like one of the main factors that reduce the efficiency of our teams. Identifying ways to reduce the problem will yield substantial benefits to the process, the team, and your company overall.

Redirecting your energy

I've seen people spend a lot of time crafting notes for their teammates to document the right order of steps to accomplish a repetitive and cumbersome task.

This is great, but you know what's even better? Spending that time automating the task, so nobody has to learn the proper order of steps and there are no more mistakes.

Sometimes you need to redirect your energy towards the right solution. Instructions on how not to mess up are great, but it's always better if there's no way to mess up in the first place.

Want to do something cool for your team? Look around, find a process that depends on a set of instructions, and automate the whole thing.

Don't be the jerk

Putting in extra effort is a good thing. Going above and beyond is also a good thing. I think you should try to do both regularly.

I also think that you should select the timing carefully.

If you are the person who waited for an empty office during the winter holiday to start committing code and spamming people's inboxes with tales of your heroic efforts, you won't earn any recognition.

Part of being a great team player is to respect everyone's time.

Let's assume that you want to do the work anyway; here is how I'd go about it: I'd finish the thing, write (but not send) the corresponding emails, and wait until we are all back at the office. Then, and only then, I'd show off my work.

People appreciate those that work hard and go the extra mile. Just make sure you are helping and not just showing off.

Bad code shouldn't be an option

Some people like to present unit testing as an optional addition to the code they produce. They usually talk about two options: they can either create good code, or good code that comes with tests.

I like to think about it a little bit differently: it's either bad code or good code.

You can't write good code if you can't prove it with a test suite. Code without tests is as good as bad code, and settling for bad code shouldn't be an option.

The more you think about unit testing as indispensable, the closer you get to being the developer everyone aspires to be.

TDD is hard, but it's tough to beat well-tested code

I just spent a couple of weeks doing TDD with somebody at work. He knew about TDD and unit testing in general, but he wasn't necessarily convinced about the trade-offs of TDD, especially for somebody with no previous experience doing it.

It takes time to learn TDD well. The initial feeling is that you could go ten times faster if you didn't have to write those darn tests. I get why people stop doing TDD after trying it for just a few days; it's hard to envision the long-term gains while you are in so much pain just to make any meaningful progress.

If you are going through this, don't stop. TDD is hard, but not harder than the other million things you've already learned. TDD will set your work apart; you'll be the one doing what the other hundred aren't, just because they didn't take the time to learn it correctly. You can do better by sticking with it and overcoming the challenging phase. Eventually, TDD becomes second nature, and you'll be reaping the benefits of your hard work.

The fact is that it's tough to beat well-tested code. There'll always be a thousand justifications to avoid unit testing, but at the end of the road, the better code always wins.

When the estimates are too high

It will happen more often than not: after you finish estimating the work, you realize the estimates are higher than what everyone was expecting.

An estimate is an unbiased prediction of how long a project will take or how much it will cost, regardless of the specific target you want to accomplish. You can't just reduce your estimates, so here are some suggestions to keep in mind whenever your estimates seem too high:

  • Don't reduce estimates that came directly from the developers. They tend to provide estimates that are too optimistic already, so further reducing them will not help your chances of success.

  • Don't cut the estimates down without discussing the consequences and a plan to mitigate them. You can negotiate (and reduce) commitments, but you can't negotiate estimates.

  • Try to use different estimation techniques to validate your previous results. If these estimates agree, trust them.

  • An excellent way to improve the accuracy of the estimates is using group reviews. Wideband Delphi is a structured group-estimation technique that produces very good results.

  • Look at the least important features, and negotiate them out of the scope. Find the ones with the most uncertainty and start a conversation about them. Keep notes of every assumption you make to reduce the estimate.

  • Always remember that if you torture the data long enough, it will confess to anything. Don't push it.

Right at the intersection

Great, capable individuals can definitely make a difference. I know a lot of them: sharp, hardworking people who could easily impress you with their knowledge after a 5-minute conversation.

Sadly, you need something else to build a team. It's not enough to be great at what you do; you also need to be good at working with others. I've seen, time and time again, talented developers who have no clue about being part of something bigger than themselves.

Just like a baseball team, it's not about packing as many stars as possible under the same roof, but about finding the right balance and chemistry that will lead to winning games.

With every new interview, I've found myself asking more questions to uncover that side of the candidates. I don't stop anymore as soon as I realize how great a developer you are; instead, I dig deeper to make sure you'll be a good fit for the rest of the orchestra.

The secret is right at the intersection of a great developer and a great team player.

Things to consider before sending your next estimate

Before sending an estimate for your next project, take some time and consider the following questions:

  • How long will it take you to clarify all the requirements, document them, and prepare the backlog of the project?
  • How long will it take to prioritize and size that backlog?
  • If you need to put together an initial release plan, are you accounting for the time to do it?
  • Are there any third-party integrations that will require further review that you aren't planning for yet?
  • Do you need to schedule any time to think about the performance, scalability, and security of the project?
  • Are you planning enough time to deploy the product?
  • How much time do you need to transition the project to the client?
  • Do you need to do any user training before handing over the project?
  • Is there any data migration that you should include in your estimate?
  • Is there any sort of warranty period that should be planned and estimated?

I keep this list handy, and it's very helpful every time I have to think about a new project. You can expand it to include additional activities (I removed some from the list above because they are unique to my company) so you never again forget to account for that time.

Padding

If you do a quick Google search for how to get better at estimating software tasks, you'll find multiple people recommending the way they do it. Many of these recommendations are a variation of the following ideas:

  • Come up with your estimate and then multiply this number by 2 (or 3, or even 4!)
  • Come up with your estimate and increase the unit of time you used. For example, if you came up with 2 days, your final estimate should be 2 weeks. If you came up with 4 weeks, your final estimate should be 4 months, and so on.

The principle behind these techniques (and any of their variations) is to add padding to the estimate to cover for unknowns. Despite being well intentioned, doing this doesn't really make you better at estimation. It doesn't make your estimates more accurate either, and there's probably an argument about whether this is even ethical.

Would your client understand and agree with the way you are deciding how long things will take? If you feel the need to hide from your customer the way you come up with estimates, then you probably have more work to do on this front.