
Many organisations have recognised the value of harnessing their data to make better business decisions. To enable this “data-driven” decision making, organisations are, broadly speaking, making two key investments: first, workforce training that builds data literacy, and second, data projects that allow the organisation to extract value from its data.

A recent Forrester study found that 82% of business leaders now expect their employees to have basic data literacy skills [1]. However, without frictionless access to clean, accurate data, that training delivers far less impact.

Why is there value in data cleansing?

Data cleansing is a crucial step in every data project. Whether you are building a data warehouse, migrating to a new business application, or building machine learning models, clean data is the foundation for success. Providing access to clean data benefits the organisation at all levels, including the following:

  • Increased productivity due to reduced errors (and manual handling to fix those errors)
  • Better internal decision making based on accurate data insights
  • Strategic market alignment – accurate data enables better strategic “data-driven” decisions
  • Reduced data handling and storage costs

The challenge for many organisations lies in the complexity, volume, and variety of their data. In addition, when looking at the data cleansing problem holistically, it can be difficult to know where to begin and how to make the most impact whilst minimising costs. In this article, we lay out the foundations of a structured approach, the main considerations, and some helpful insights to benefit your next data project.

Knowing your Data Types

Before embarking on any data cleansing exercise, it is important to understand the source and type of data you will be working with. While data comes in a near-infinite range of types and formats, we can classify it into three distinct categories:

  • Structured Data

Typically stored in a table format, with labels for each attribute (columns) and records/observations in rows. Common examples of structured data include tables in a relational database like SQL Server, or a structured DataFrame in Python / R (often used for machine learning).
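As a rough illustration (using pandas, with made-up column names and values), a small piece of structured data might look like this:

```python
import pandas as pd

# A minimal structured dataset: labelled columns, one record per row.
customers = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "name": ["Alice", "Bob", "Carol"],
    "signup_date": pd.to_datetime(["2021-03-01", "2021-04-15", "2021-06-30"]),
    "annual_spend": [1250.50, 980.00, 2100.75],
})

print(customers.dtypes)  # each column has a single, well-defined type
```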

  • Semi-Structured Data

Does not adhere to the strict table structure of a relational database but does provide some organisation and hierarchy to group data into meaningful categories. Common examples are emails (with tags representing the sender and receiver), JSON, HTML and NoSQL databases that use graphs and key-value pairs to represent information.
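For illustration, here is a sketch of a semi-structured record represented as JSON from Python (the field names are purely illustrative):

```python
import json

# An illustrative semi-structured record: nested fields and optional keys,
# but no fixed table schema.
email_record = {
    "from": "alice@example.com",
    "to": ["bob@example.com"],
    "subject": "Quarterly report",
    "tags": {"priority": "high", "department": "finance"},
}

print(json.dumps(email_record, indent=2))
```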

  • Unstructured Data

Data with no formal framework or pre-defined model. Estimates suggest that, for most organisations, between 80% and 90% of the data generated and stored is unstructured [2]. Common examples are Word documents, spreadsheets, media files and communication data like text messages and instant messaging.

For the purposes of this article, we will focus on data cleansing for structured data as the distinct approaches required for semi-structured and unstructured data warrant separate discussions.

Key steps in data cleansing

At each of the following steps the key question to keep top of mind is – what are my final objectives and how does that shape the approach to the current step? Not all objectives are the same, so you need to adapt your data cleansing process to suit the project or outcomes you want.

Data validation

Data can be collected in many ways, some more reliable than others. Data entered manually is generally riddled with errors. The temperature gauge on high-tech machinery in a factory may be more accurate, but no system is flawless. Therefore, it is important to understand how our data is collected and what level of error is acceptable according to our key objectives. The data validation process can include:

  • Identifying missing values (imputation)

How we treat missing values depends on our use case, the collection method and the criticality of the missing data. For example, if we are running an online survey, we may discard submissions that are incomplete. But to stay with our high-tech machinery example, if the recorded temperature has missing readings, we may want to keep the record to understand whether there is a pattern or some external influence affecting the way we collect the data. Typically, some data is better than no data, and we may also come across scenarios where keeping parts of a record is important. In these cases, we may choose to ‘fill in’ the missing pieces using a mean or median value. Other (more advanced) imputation methods include hot/cold deck imputation, regression imputation and stochastic regression imputation.
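As a minimal sketch of simple imputation (pandas shown here; the column names and readings are made up), missing temperatures could be filled with a per-sensor median:

```python
import pandas as pd

# Illustrative sensor readings with gaps.
readings = pd.DataFrame({
    "sensor_id": [1, 1, 1, 2, 2],
    "temperature_c": [71.2, None, 70.8, None, 65.4],
})

# Simple imputation: fill missing temperatures with the median for each sensor.
readings["temperature_c"] = readings.groupby("sensor_id")["temperature_c"].transform(
    lambda s: s.fillna(s.median())
)

print(readings)
```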

  • Incorrect Data Types

Not all data collection processes validate data as it comes in. We may end up with datasets that contain correct information recorded in the wrong fields, or string values where numbers are expected. For example, in a person’s age field, someone may write ‘thirty’ instead of the value 30. In these cases, if we are expecting a numeric value, we may inadvertently cause errors in downstream processes.
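A quick, hedged example of catching this with pandas (the values are illustrative):

```python
import pandas as pd

ages = pd.Series(["34", "thirty", "28", "45"])

# Coerce to numeric; values that cannot be parsed (e.g. "thirty") become NaN
# and can then be handled explicitly rather than breaking downstream code.
ages_numeric = pd.to_numeric(ages, errors="coerce")

print(ages_numeric)          # 34.0, NaN, 28.0, 45.0
print(ages_numeric.isna())   # flags the records that need review
```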

  • Duplicates

Like missing values, duplicates may be the result of invalid data collection, human error, or a poorly designed data system. Using our online survey example, we may identify that the same user has submitted information multiple times and determine that the best course of action is to consolidate the duplicates into a single record. There may also be examples where duplicate records should be unique but are missing a key attribute like a timestamp or ID that separates them from the others.
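A small sketch of both approaches in pandas (assuming a made-up survey table with a user_id column):

```python
import pandas as pd

survey = pd.DataFrame({
    "user_id": [7, 7, 8],
    "age": [34, 34, 29],
    "satisfaction": [4, 4, 5],
})

# Exact duplicates: drop identical rows.
deduped = survey.drop_duplicates()

# Or consolidate multiple submissions per user into a single record,
# e.g. keeping the last one (a timestamp column would make "last" meaningful).
consolidated = survey.drop_duplicates(subset="user_id", keep="last")

print(deduped)
print(consolidated)
```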

Data Accuracy

Once the data collection process has been validated at the source, we move to identifying any inaccuracies within the data itself. This process is highly dependent on the specific use case, however, there are some general guidelines as follows:

  • Identifying impossible values

For example, for a date field an input of 31/31/31 would be an unacceptable value. For the temperature gauge on our high-tech machine, we know a value of -300°C cannot occur. When impossible values are identified in our dataset, we need to decide how to treat them – is it acceptable to delete these values? Do we re-adjust these values with a mean/median value so we can retain the rest of the record?
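As a rough sketch (pandas again, with an assumed plausible operating range that you would replace with your own domain limits):

```python
import pandas as pd

readings = pd.DataFrame({"temperature_c": [71.2, -300.0, 68.9, 5000.0]})

# Flag physically impossible readings (below absolute zero) or values outside
# the machine's plausible operating range, then decide: drop or impute.
valid_range = readings["temperature_c"].between(-273.15, 1000)
readings.loc[~valid_range, "temperature_c"] = None   # mark for imputation
# alternatively: readings = readings[valid_range]    # drop the rows entirely

print(readings)
```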

  • Outliers

Outliers are values that may be ‘possible’ but fall far away from the expected or average range. For example, if our survey participant enters an age of 120, this is technically possible but would fall far outside the average. We need to question and analyse these outliers to ensure they do not skew our analysis results.
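One common, simple approach is the interquartile range (IQR) rule; here is a minimal sketch with made-up ages:

```python
import pandas as pd

ages = pd.Series([23, 31, 27, 45, 38, 120, 29])

# Flag outliers using the interquartile range (IQR) rule.
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
outliers = ages[(ages < q1 - 1.5 * iqr) | (ages > q3 + 1.5 * iqr)]

print(outliers)  # 120 is flagged for manual review, not automatically deleted
```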

  • Near-Duplicates

What is the difference between duplicates and near-duplicates? Well, duplicates are easy to find as they are identical. Near-Duplicates may be caused by somebody submitting their survey twice, with a correction to some part of the record. If this is allowed to happen, it may be exceedingly difficult to validate which record is accurate. Near-duplicates can be flagged in the data and reviewed independently or by using another dataset to cross reference. For example, when somebody submits an inaccurate home address, we may be able to identify that the address does not exist and scrub the inaccurate record from our data.
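A hedged sketch of flagging near-duplicates with simple string similarity (Python's standard-library difflib; the records are invented):

```python
from difflib import SequenceMatcher
import pandas as pd

survey = pd.DataFrame({
    "submission_id": [1, 2],
    "email": ["alice@example.com", "alice@example.com"],
    "address": ["12 High Street, Leeds", "12 High St, Leeds"],
})

# Flag pairs that share a key (email) but differ slightly in other fields.
a, b = survey.loc[0, "address"], survey.loc[1, "address"]
similarity = SequenceMatcher(None, a, b).ratio()

if similarity > 0.8:
    print(f"Possible near-duplicate (similarity {similarity:.2f}): review manually")
```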

  • Standardising Data

Check that the data in each column uses the same unit of measurement. For example, if we are measuring a person’s height, we do not want to mix the metric system with the imperial system.
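A minimal sketch of unit standardisation (the table layout and unit column are assumptions for illustration):

```python
import pandas as pd

heights = pd.DataFrame({
    "height": [178.0, 5.9, 165.0, 6.1],
    "unit": ["cm", "ft", "cm", "ft"],
})

# Convert everything to a single unit (centimetres) before analysis.
FT_TO_CM = 30.48
heights["height_cm"] = heights.apply(
    lambda row: row["height"] * FT_TO_CM if row["unit"] == "ft" else row["height"],
    axis=1,
)

print(heights[["height_cm"]])
```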

  • Cross-set data errors

When data comes from multiple places and is merged, we need to ensure consistency and that there are no contradictory observations that will skew our analysis.
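A short sketch of surfacing contradictions after a merge (the two sources and their columns are made up):

```python
import pandas as pd

crm = pd.DataFrame({"customer_id": [1, 2], "country": ["AU", "NZ"]})
billing = pd.DataFrame({"customer_id": [1, 2], "country": ["AU", "AU"]})

# Merge the two sources and surface contradictory values for the same customer.
merged = crm.merge(billing, on="customer_id", suffixes=("_crm", "_billing"))
conflicts = merged[merged["country_crm"] != merged["country_billing"]]

print(conflicts)  # customer 2 has inconsistent country values to resolve
```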

Data Relevance

Based on the type of workloads you intend for your data downstream, an additional step may be to identify which parts of the dataset are required for your objectives. Reducing the overall dataset size by focusing on what is relevant to the task will save on storage and computation costs. In scenarios where removing data is not an option (e.g., for compliance), we can partition the dataset by what is relevant for our workload and archive the rest to cheaper storage. Reducing our data size this way will also reduce computation costs and processing time for analytical workloads.

Data Structure

Data cleansing is part of a broader process typically referred to as ‘Data Wrangling.’ The key goal of data wrangling is to transform raw data into clean and ‘Tidy Data’ that is suitable for analytics and machine learning workloads. Whilst the entire data wrangling process is outside the scope of this article, you can find a summary of the Tidy Data principles as follows:

  1. Each variable has its own column
  2. All column headers are variable names
  3. Only one variable is stored in each column
  4. There are no variables stored as rows
  5. Each row is a single observation
  6. One type of observational unit forms each table
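As a brief illustration of principles 1 and 5, a ‘wide’ table with one variable spread across several columns can be reshaped into tidy form (the column names are invented):

```python
import pandas as pd

# A "wide" table: one variable (temperature) spread across month columns.
wide = pd.DataFrame({
    "machine": ["A", "B"],
    "jan": [70.1, 65.3],
    "feb": [71.4, 66.0],
})

# Reshape so each variable has its own column and each row is one observation.
tidy = wide.melt(id_vars="machine", var_name="month", value_name="temperature_c")

print(tidy)
```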

Data Quality

Sometimes referred to as Validation or Data QA, this is the last step in the cleansing process. The goal here is to ensure that all previous steps have been performed and the data is ready to be processed by the downstream workloads. Many organisations have a set of pre-defined scripts or processes that scan the data and surface any remaining issues that may need attention. Some key points to consider are:

  • Checking the dataset dimensions (dataset size)
    • Do we have sufficient data to perform the analysis?
  • Check all column data types
  • Use visual tools to graph and visually inspect the data for anomalies
  • Use statistical analysis to get insights into the data and check these against known facts
    • Check for biased raw data (e.g., if our survey submissions consisted of 90% male and 10% female, we may want to consider how this impacts the analysis)
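A minimal sketch of these checks in pandas (the file name and the gender column are hypothetical):

```python
import pandas as pd

df = pd.read_csv("survey_clean.csv")  # hypothetical cleansed dataset

print(df.shape)                 # dataset dimensions: enough rows for analysis?
print(df.dtypes)                # are all columns the expected data types?
print(df.isna().sum())          # any remaining missing values?
print(df.describe())            # summary statistics to check against known facts
print(df["gender"].value_counts(normalize=True))  # e.g. check for sample bias
```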

Data cleansing tools

In recent years, one of the most popular approaches to data ingestion, cleansing and transformation has been the data lake. Data lakes offer the versatility of ingesting raw data from just about any source and the scalability to handle large data volumes. Databricks, for example, provides a cloud-based, collaborative, notebook-style environment built on Spark for high performance. It supports Python, R, SQL, and Scala from the same notebook, so you can use the language you are most comfortable with. You can also run statistical analysis and create visualisations at the click of a button, right from the notebook you are working in.
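As a very rough sketch of what such a notebook cell might look like in PySpark (the file paths and column names are hypothetical, not a prescribed Databricks workflow):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical raw file landed in the data lake.
raw = spark.read.option("header", True).csv("/mnt/datalake/raw/survey.csv")

cleaned = (
    raw.dropDuplicates()                              # remove exact duplicates
       .withColumn("age", F.col("age").cast("int"))   # enforce expected types
       .filter(F.col("age").between(0, 120))          # drop impossible values
)

cleaned.write.mode("overwrite").parquet("/mnt/datalake/clean/survey")
```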

Conclusion

Whether you are storing customer information, building simple business visualisations, or building complex machine learning models, data cleansing is a vital step to ensure successful outcomes. Holistically, data cleansing can be a daunting task, so having a structured approach that divides the problem into smaller, manageable pieces will lay the groundwork for a project that can be accurately costed and well executed.

References

[1] W. Turner-Williams, “Data literacy is the key to better decisions, innovation, and data-driven organizations,” Forrester, March 2022. [Online]. Available: https://www.tableau.com/about/blog/2022/3/data-literacy-key-better-decisions-innovation-and-data-driven-organizations
[2] MongoDB, “Unstructured Data,” MongoDB, 2022. [Online]. Available: https://www.mongodb.com/unstructured-data