Why improving spatial data quality is good for business
One thing I’ve been working on recently has a business requirement for massively improved data quality.
This isn’t uncommon; however, it’s something that’s still unfortunately seen as a costly thing to get right. Why though? Shouldn’t high data quality be non-negotiable?
To get people to buy into data quality, it’s important to look at the impact of not having good quality data. The most important impact is that users downstream start to make bad decisions. If you’ve captured data 100 metres away from where it should be, that 100 metres might make the difference between a ‘yes’ and a ‘no’, depending upon business rules on what’s in proximity. That decision might have an impact on customers, sales, or the natural world. Worse, it might be costly in litigation. Worse still, it might even be fatal. It’s increasingly important to get spatial data correct as it’s used in more and more contexts.
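To make the 100 metre point concrete, here is a minimal sketch of a proximity rule flipping its answer because of a capture error. The 250 m exclusion distance, the coordinates and the function name are all made up for illustration, and the coordinates are assumed to be planar eastings and northings in metres.

```python
import math

def within_distance(point, feature, threshold_m):
    """Return True if point lies within threshold_m metres of feature.

    Both arguments are (easting, northing) tuples in metres (assumed planar).
    """
    return math.dist(point, feature) <= threshold_m

# Hypothetical business rule: nothing may be built within 250 m of a protected site.
protected_site = (1000.0, 1000.0)
EXCLUSION_M = 250.0

true_location = (1200.0, 1000.0)      # really 200 m from the site
captured_location = (1300.0, 1000.0)  # same asset, captured 100 m further east

print(within_distance(true_location, protected_site, EXCLUSION_M))      # True
print(within_distance(captured_location, protected_site, EXCLUSION_M))  # False
```

The asset is really inside the exclusion zone, but the captured position says it is not, so the downstream decision comes out the wrong way round.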
In an analogy to coding, there’s a simple principle to apply here: the sooner data errors are fixed, the less cost is incurred. The closer to capture time you correct data, the less chance there is that inaccurate (or erroneous) data is propagated, the less time is required to find the error, and the less need there is to change decisions previously made using that data.

As you walk down the data lifecycle, any changes you make to data are increasingly expensive.
Return on investment comes in when you consider the true cost of erroneous or inaccurate data: initially, it’s a nuisance, but later, it’s almost prohibitively expensive to fix. Many organisations have a rough idea of cost-to-fix; in larger organisations, it could run into millions.
So how do you go about improving data earlier? The most important thing to note here is that you need to validate data as close to capture time as possible. This means having validation rules in place on the capture unit itself. On the desktop, that’s pretty simple – and traditionally, it has been done in the database. As we move into a cloud-based world, it becomes even simpler, because you’ve got the choice of validating within both the database and the application server tiers (or even services between them).
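As a rough illustration of validation at capture time, the sketch below checks a record the moment it is entered, before it is saved anywhere. The record fields, the survey extent and the list of allowed asset types are assumptions invented for this example, not taken from any particular product.

```python
from dataclasses import dataclass

@dataclass
class CapturedFeature:
    asset_id: str
    easting: float
    northing: float
    asset_type: str

# Hypothetical survey extent as (min_e, min_n, max_e, max_n), in metres.
SURVEY_EXTENT = (0.0, 0.0, 10_000.0, 10_000.0)
# Hypothetical domain of allowed asset types.
ALLOWED_TYPES = {"valve", "hydrant", "meter"}

def validate_at_capture(feature):
    """Return a list of human-readable problems; an empty list means the record passes."""
    errors = []
    min_e, min_n, max_e, max_n = SURVEY_EXTENT
    if not feature.asset_id.strip():
        errors.append("asset_id is required")
    if not (min_e <= feature.easting <= max_e and min_n <= feature.northing <= max_n):
        errors.append("location is outside the survey extent")
    if feature.asset_type not in ALLOWED_TYPES:
        errors.append(f"unknown asset_type: {feature.asset_type!r}")
    return errors

good = CapturedFeature("V-001", 2500.0, 4000.0, "valve")
bad = CapturedFeature("", 25000.0, 4000.0, "vlave")

print(validate_at_capture(good))  # []
for problem in validate_at_capture(bad):
    print(problem)
```

The same function could sit in a database trigger, an application server or the capture client; the point is only that the person capturing the data sees the problems while they can still fix them cheaply.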
The choice of validation tool is an interesting one. There are a number on the market that attack the problem in different ways. One thing is clear: in larger organisations, it usually costs far, far more NOT to bring data validation in.
However, as lots of data capture is done on mobile and specifically disconnected clients, how do you get validation in place? Well, there are a number of options. Putting geodatabases onto the client is a good idea – at least there, you’ve got a concept of data validation close to capture time. But can you perform the same richness of verification on a disconnected client as you can on a connected one? I’m not so sure – particularly if you have those business rules at the application level, or your validation tool is installed server-side only.
For me, the best solution is a mix: where available, verify at capture time, even if it’s only for a subset of your business rules. That way you knock out the ‘top soil’ of validation problems. You can still verify data at check-in time, and there should be fewer issues to correct.
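One way to picture that mix is to tag each rule with whether it can run without a connection, then run the offline subset on the device and the full set at check-in. This is only a sketch under that assumption; the rule names, record fields and the `existing_ids` context are hypothetical.

```python
from typing import Callable, NamedTuple

class Rule(NamedTuple):
    name: str
    offline: bool                        # True if the rule needs only the record itself
    check: Callable[[dict, dict], bool]  # (record, context) -> True when the record passes

RULES = [
    Rule("has asset_id", True,
         lambda rec, ctx: bool(rec.get("asset_id"))),
    Rule("has coordinates", True,
         lambda rec, ctx: rec.get("x") is not None and rec.get("y") is not None),
    # Needs the server's list of existing ids, so it cannot run on a disconnected client.
    Rule("asset_id is unique", False,
         lambda rec, ctx: rec.get("asset_id") not in ctx.get("existing_ids", set())),
]

def failed_rules(record, context=None, offline_only=False):
    """Return the names of the rules the record fails."""
    context = context or {}
    return [
        rule.name
        for rule in RULES
        if (rule.offline or not offline_only) and not rule.check(record, context)
    ]

record = {"asset_id": "V-001", "x": 2500.0, "y": 4000.0}

# On the disconnected device: only the offline subset runs.
print(failed_rules(record, offline_only=True))            # []
# At check-in: every rule runs, with server-side context available.
print(failed_rules(record, {"existing_ids": {"V-001"}}))  # ['asset_id is unique']
```

The record passes everything the device can check, and the one remaining problem surfaces at check-in, which is the 'fewer issues to correct' outcome described above.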
Key to a good validation tool is the concept of ‘who can use it’. It needs to be something that not only the technical community can use, but also the business. After all, it’s the business that understands the meaning of the data – so they are more likely to have a better grasp of the rules that need to be put in place. It’s no good having a solution that only the technical community can use, modify and configure, because rules would be put in place initially, but then probably wouldn’t change. As requirements on the data change, rules become obsolete – and the process of degraded data starts again.
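One possible way to keep rules in the hands of the business is to hold them as plain data rather than code, so that changing a limit means editing a line of configuration instead of asking a developer. The sketch below assumes a made-up JSON rule format with three check types; real validation tools will each have their own format.

```python
import json

# Hypothetical rule file that a business user could edit directly.
RULES_JSON = """
[
  {"field": "pipe_diameter_mm", "check": "range", "min": 50, "max": 1200},
  {"field": "material", "check": "one_of", "values": ["PVC", "steel", "copper"]},
  {"field": "install_year", "check": "required"}
]
"""

def evaluate(rule, record):
    """Return True when the record satisfies a single rule."""
    value = record.get(rule["field"])
    if value is None:
        return False  # a rule on a missing field always fails
    if rule["check"] == "required":
        return True
    if rule["check"] == "range":
        return rule["min"] <= value <= rule["max"]
    if rule["check"] == "one_of":
        return value in rule["values"]
    raise ValueError(f"unknown check type: {rule['check']}")

def violations(record, rules):
    """Return one message per rule the record fails."""
    return [f"{r['field']}: failed '{r['check']}'" for r in rules if not evaluate(r, record)]

rules = json.loads(RULES_JSON)
record = {"pipe_diameter_mm": 2000, "material": "PVC"}
print(violations(record, rules))
# ["pipe_diameter_mm: failed 'range'", "install_year: failed 'required'"]
```

When requirements change, only the rule data changes; the small engine that interprets it stays the same, which is what stops rules going stale.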
As a guiding principle, the more issues you can knock out earlier on, the more likely you are to get data right first time – and ultimately, raised data quality and lower cost are what everyone wants. Getting the rules right, and in a form that the business can create, means that the cost of poor data quality can be vastly reduced – hence maximum RoI.
