Data cleansing, or data scrubbing, is the process of detecting and correcting (or removing) corrupt or inaccurate records in a record set, table or database. It involves identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying or deleting the dirty or coarse data. Data cleansing can be performed interactively with data wrangling tools or as batch processing with scripts.

After cleansing, a data set should be consistent with other similar data sets in the system. The inconsistencies that are detected or removed may have been originally caused by user entry errors, by corruption in transmission or storage, or by different data dictionary definitions of similar entities in different stores. Data cleansing differs from data validation in that validation almost always means data is rejected from the system at entry and is performed at the time of entry, rather than on batches of data.

The process of data cleansing may involve removing typographical errors or validating and correcting values against a list of known entities. Validation may be strict, such as rejecting any address that does not have a valid postcode, or fuzzy, such as correcting records that partially match existing, known records. Some data cleansing solutions clean data by cross-checking it against a validated data set. A common data cleansing practice is data enhancement, where the data is made more complete by adding related information, for example by appending to an address any phone numbers related to it. Data cleansing may also include the harmonization (or normalization) of data, which is the process of bringing together data of "varying file formats, naming conventions, and columns" and turning it into a single cohesive data set; a simple example is the expansion of abbreviations ("st", "rd" to "street", "road").
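
Fuzzy correction against a list of known entities and the expansion of abbreviations can be illustrated with a minimal Python sketch. It uses only the standard library (`difflib`); the reference list, the abbreviation table, the similarity cutoff and the function names are assumptions made for illustration, not part of any particular cleansing product.

```python
import difflib

# Hypothetical reference list of known, validated city names.
KNOWN_CITIES = ["Springfield", "Shelbyville", "Capital City"]

# Hypothetical abbreviation table for street suffixes.
ABBREVIATIONS = {"st": "street", "rd": "road", "ave": "avenue"}


def correct_city(value, cutoff=0.8):
    """Return the closest known city, or None if nothing is similar enough."""
    candidate = value.strip().title()
    matches = difflib.get_close_matches(candidate, KNOWN_CITIES, n=1, cutoff=cutoff)
    return matches[0] if matches else None


def expand_abbreviations(address):
    """Expand known abbreviations word by word, ignoring case and trailing dots."""
    words = []
    for word in address.split():
        key = word.rstrip(".").lower()
        words.append(ABBREVIATIONS.get(key, word))
    return " ".join(words)


print(correct_city("Springfeild"))          # a misspelling close to a known city
print(correct_city("Gotham"))               # nothing similar in the reference list
print(expand_abbreviations("12 Main St."))
```

A strict variant would reject any value that is not already in the reference list. The fuzzy variant trades some risk of a wrong correction for fewer rejected records, so the cutoff has to be chosen with the data in mind. A word-by-word table is also naive: "St" may stand for "Saint" as well as "street".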


1. Motivation

Administratively incorrect, inconsistent data can lead to false conclusions and misdirect investments on both public and private scales. For example, a government may want to analyze census figures to decide which regions require further spending and investment in infrastructure and services. In this case, it is important to have access to reliable data to avoid erroneous fiscal decisions. In the business world, incorrect data can be costly. Many companies use customer information databases that record data such as contact information, addresses and preferences. If the addresses are inconsistent, for instance, the company will bear the cost of resending mail or may even lose customers. Forensic accounting and fraud investigation use data cleansing to prepare their data, typically before the data is sent to a data warehouse for further investigation. Packages are available that cleanse address data as it is entered into a system, usually through an application programming interface (API).


2. Data quality

Quality data needs to pass a set of quality criteria. These include:

  • Validity: the degree to which the measures conform to defined business rules or constraints (see also validity in statistics). When modern database technology is used to design data-capture systems, validity is fairly easy to ensure: invalid data arises mainly in legacy contexts, where constraints were not implemented in software, or where inappropriate data-capture technology was used. Data constraints fall into the following categories:
      • Data-type constraints: values in a particular column must be of a particular data type, e.g. Boolean, integer, real or date.
      • Range constraints: typically, numbers or dates should fall within a certain range. That is, they have minimum and/or maximum permissible values.
      • Mandatory constraints: certain columns cannot be empty.
      • Unique constraints: a field, or a combination of fields, must be unique across the data set. For example, no two people can have the same social security number.
      • Set-membership constraints: the values for a column come from a set of discrete values or codes. For example, a person's gender may be Female, Male or Unknown (not recorded).
      • Foreign-key constraints: this is the more general case of set membership. The set of values in a column is defined in a column of another table that contains unique values. For example, in a US taxpayer database, the "State" column is required to belong to one of the US-defined states or territories; the set of permissible states and territories is recorded in a separate table. The term "foreign key" is borrowed from relational database terminology.
      • Regular expression patterns: occasionally, text fields have to be validated this way. For example, phone numbers may be required to follow the pattern (999) 999-9999.
      • Cross-field validation: certain conditions that involve multiple fields must hold. For example, in laboratory medicine, the components of the differential white blood cell count must sum to 100, since they are all percentages. In a hospital database, a patient's date of discharge cannot be earlier than the date of admission.
  • Accuracy: the degree of conformity of a measure to a standard or a true value (see also accuracy and precision). Accuracy is very hard to achieve through data cleansing in the general case, because it requires access to an external source of data that contains the true value, and such "gold standard" data is often unavailable. Accuracy has been achieved in some cleansing contexts, notably customer contact data, by using external databases that match zip codes to geographical locations (city and state) and that also help verify that street addresses within these zip codes actually exist.
  • Completeness: the degree to which all required measures are known. Incompleteness is almost impossible to fix with data cleansing methods: one cannot infer facts that were not captured when the data in question was originally recorded.
  • Consistency: the degree to which a set of measures is equivalent across systems. Inconsistency occurs when two data items in the data set contradict each other: for example, a customer is recorded in two different systems as having two different current addresses, and only one of them can be correct. Fixing inconsistency is not always possible. It requires a variety of strategies, for example deciding which data was recorded more recently, deciding which data source is likely to be the most reliable (knowledge that may be specific to a given organization), or simply trying to find the truth by testing both data items, for example by calling the customer.
  • Uniformity: the degree to which a set of data measures is specified using the same units of measure in all systems (see also units of measure). In data sets pooled from different locales, weight may be recorded in either pounds or kilograms and must be converted to a single measure using an arithmetic transformation.
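
Several of the constraint categories above (data type, range, mandatory, set membership, pattern, multi-field conditions and uniqueness) can be checked with a few lines of code. The following Python sketch is one possible arrangement, assuming records are plain dictionaries; the field names, the permitted age range and the phone pattern (999) 999-9999 are illustrative assumptions.

```python
import re
from datetime import date

# Assumed phone format: (999) 999-9999.
PHONE_PATTERN = re.compile(r"\(\d{3}\) \d{3}-\d{4}")
# Assumed code set for the set-membership check.
ALLOWED_GENDERS = {"Female", "Male", "Unknown"}


def check_record(record):
    """Return a list of constraint violations for one record (a dict)."""
    errors = []

    # Mandatory constraint: the name must not be empty.
    if not record.get("name"):
        errors.append("name: mandatory field is empty")

    # Data-type and range constraints on age (bool is excluded on purpose).
    age = record.get("age")
    if not isinstance(age, int) or isinstance(age, bool):
        errors.append("age: not an integer")
    elif not 0 <= age <= 130:
        errors.append("age: out of range")

    # Set-membership constraint.
    if record.get("gender") not in ALLOWED_GENDERS:
        errors.append("gender: not an allowed code")

    # Regular expression pattern.
    if not PHONE_PATTERN.fullmatch(record.get("phone") or ""):
        errors.append("phone: does not match (999) 999-9999")

    # Multi-field condition: discharge cannot precede admission.
    admitted, discharged = record.get("admitted"), record.get("discharged")
    if admitted and discharged and discharged < admitted:
        errors.append("discharged: earlier than admitted")

    return errors


def find_duplicate_keys(records, key):
    """Unique constraint: return the values of `key` that occur more than once."""
    seen, duplicates = set(), set()
    for record in records:
        value = record[key]
        if value in seen:
            duplicates.add(value)
        seen.add(value)
    return duplicates
```

Returning a list of violations, rather than raising on the first one, lets a later step decide whether to reject, correct or merely flag the record.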

The term "integrity" encompasses accuracy, consistency and some aspects of validation (see also data integrity), but it is rarely used by itself in data cleansing contexts because it is insufficiently specific. For example, "referential integrity" is the term used for the enforcement of the foreign-key constraints described above.


3. Process

  • Data auditing: the data is audited using statistical and database methods to detect anomalies and contradictions; this eventually indicates the characteristics of the anomalies and their locations. Several commercial software packages let you specify constraints of various kinds and then generate code that checks the data for violations of these constraints. This process is described in the "Workflow specification" and "Workflow execution" bullets below. For users who lack access to high-end cleansing software, microcomputer database packages such as Microsoft Access or FileMaker Pro also let you perform such checks, constraint by constraint, interactively and in many cases with little or no programming.
  • Workflow specification: the detection and removal of anomalies is performed by a sequence of operations on the data known as the workflow. It is specified after the data has been audited and is crucial to achieving an end product of high-quality data. To arrive at a proper workflow, the causes of the anomalies and errors in the data have to be considered closely.
  • Workflow execution: in this stage, the workflow is executed after its specification is complete and its correctness has been verified. The implementation of the workflow should be efficient even on large data sets, which inevitably involves a trade-off, because executing a data cleansing operation can be computationally expensive.
  • Post-processing and controlling: after the cleansing workflow has been executed, the results are inspected to verify correctness. Data that could not be corrected during execution of the workflow is corrected manually, if possible. The result is a new cycle in the data cleansing process, in which the data is audited again to allow the specification of an additional workflow that further cleanses the data by automatic processing.
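
The cycle of auditing, specifying a workflow, executing it and auditing again can be sketched in a few lines of Python. This is only an illustration under simple assumptions: rows are dictionaries, constraints are functions that return True for valid rows, and the names `audit`, `execute`, `strip_name` and `CHECKS` are hypothetical.

```python
def audit(rows, checks):
    """Data auditing: list (row_index, check_name) for every failed check."""
    return [
        (index, name)
        for index, row in enumerate(rows)
        for name, check in checks.items()
        if not check(row)
    ]


def execute(rows, steps):
    """Workflow execution: apply each cleansing step to a copy of every row."""
    for step in steps:
        rows = [step(dict(row)) for row in rows]
    return rows


def strip_name(row):
    """Cleansing step (hypothetical): remove surrounding whitespace from 'name'."""
    row["name"] = row["name"].strip()
    return row


# Hypothetical constraints used for auditing.
CHECKS = {
    "name_trimmed": lambda row: row["name"] == row["name"].strip(),
    "age_present": lambda row: row["age"] is not None,
}

rows = [{"name": " Ann ", "age": 30}, {"name": "Bob", "age": None}]
anomalies = audit(rows, CHECKS)      # 1. audit the data
workflow = [strip_name]              # 2. specify the workflow
cleaned = execute(rows, workflow)    # 3. execute it
remaining = audit(cleaned, CHECKS)   # 4. audit again; leftovers need manual review
print(anomalies, remaining)
```

Here the missing age cannot be repaired automatically, so it remains after the second audit and would be handed over for manual correction or a further workflow.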

Good-quality source data has to do with a "data quality culture" and must be initiated at the top of the organization. It is not just a matter of implementing strong validation checks on input screens, because no matter how strong these checks are, users can often still circumvent them. There is a nine-step guide for organizations that wish to improve data quality:

  • Promote interdepartmental cooperation.
  • Drive process reengineering at the executive level.
  • Spend money to change how processes work.
  • Continuously measure and improve data quality.
  • Promote end-to-end team awareness.
  • Spend money to improve the data entry environment.
  • Publicly celebrate data quality excellence.
  • Spend money to improve application integration.
  • Declare a high-level commitment to a data quality culture.

Other data cleansing techniques include:

  • Duplicate elimination: duplicate detection requires an algorithm for determining whether the data contains duplicate representations of the same entity. Usually, the data is sorted by a key that brings duplicate entries closer together for faster identification.
  • Statistical methods: by analyzing the data using the values of mean, standard deviation and range, or using clustering algorithms, an expert can find values that are unexpected and thus likely erroneous. Although correcting such data is difficult because the true value is not known, the problem can be resolved by setting the values to an average or another statistical value. Statistical methods can also be used to handle missing values, which can be replaced by one or more plausible values, usually obtained by extensive data augmentation algorithms.
  • Data transformation: data transformation maps the data from its given format into the format expected by the appropriate application. This includes value conversions or translation functions, as well as normalizing numeric values to conform to minimum and maximum values.
  • Parsing: parsing is used to detect syntax errors. A parser decides whether a string of data is acceptable within the allowed data specification. This is similar to the way a parser works with grammars and languages.
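
Three of the techniques above can be sketched with the Python standard library: sorting by a key so that duplicates become neighbours, flagging values that lie far from the mean, and rescaling numbers to a fixed range. The key normalization rule, the two-standard-deviation threshold and the function names are assumptions chosen for illustration.

```python
import statistics


def normalize_key(name):
    """Build a sort key that ignores case, punctuation and spacing."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def drop_adjacent_duplicates(names):
    """Sort by key so duplicates become neighbours, then keep the first of each run."""
    result, previous = [], None
    for name in sorted(names, key=normalize_key):
        key = normalize_key(name)
        if key != previous:
            result.append(name)
        previous = key
    return result


def flag_outliers(values, threshold=2.0):
    """Return values more than `threshold` sample standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    if stdev == 0:
        return []
    return [v for v in values if abs(v - mean) > threshold * stdev]


def min_max_scale(values):
    """Rescale numeric values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


print(drop_adjacent_duplicates(["ACME Inc.", "Beta LLC", "acme inc"]))
print(flag_outliers([10, 11, 9, 10, 12, 10, 11, 95]))
print(min_max_scale([2, 4, 6]))
```

A flagged value is only a candidate error. Whether to replace it with an average, leave it or investigate it remains a decision for someone who knows the data.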


4. System

The essential job of this system is to find a suitable balance between fixing dirty data and keeping the data as close as possible to the original data from the source production system. This is a challenge for the extract, transform, load (ETL) architect. The system should offer an architecture that can cleanse data, record quality events, and measure and control the quality of data in the data warehouse. A good start is to perform a thorough data profiling analysis, which helps to define the required complexity of the data cleansing system and also gives an idea of the current data quality in the source systems.
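
As a rough idea of what a data profiling pass reports, the following Python sketch summarizes a single column. It assumes rows are dictionaries and that the values present in a column are mutually comparable; the function name and the choice of statistics are illustrative, and real profiling tools report far more.

```python
def profile_column(rows, column):
    """Summarize one column: row count, missing values, distinct values, min and max."""
    values = [row.get(column) for row in rows]
    # Treat None and the empty string as missing.
    present = [v for v in values if v is not None and v != ""]
    return {
        "rows": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
    }


sample = [{"age": 30}, {"age": None}, {"age": 45}, {"age": 30}]
print(profile_column(sample, "age"))
```

High missing counts or implausible minimum and maximum values in such a summary indicate where the cleansing system will need the most work.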


5. Tools

There are many data cleansing tools, such as Trifacta, OpenRefine, Paxata, Alteryx, Data Ladder, WinPure and others. It is also common to use libraries such as pandas for the Python programming language or dplyr for the R programming language.

One example of a data cleansing tool for distributed systems under Apache Spark is Optimus, an open-source framework that runs on a laptop or a cluster and supports pre-processing, cleansing and exploratory data analysis. It includes several data wrangling tools.


6. Quality screens

Part of the data cleansing system is a set of diagnostic filters known as quality screens. Each of them implements a test in the data flow that, if it fails, records an error in the error event schema. Quality screens are divided into three categories:

  • Structure screens. These are used to test the integrity of different relationships between columns (typically foreign/primary keys) in the same or different tables. They are also used to test that a group of columns is valid according to some structural definition to which it should adhere.
  • Business rule screens. These are the most complex of the three tests. They test whether data, possibly across multiple tables, follows specific business rules. For example, if a customer is marked as a certain type of customer, the business rules that define this type of customer should be adhered to.
  • Column screens. These test an individual column, for example for unexpected values such as null values, non-numeric values that should be numeric, out-of-range values, and so on.

When a quality screen records an error, it can stop the data flow process, send the faulty data somewhere other than the target system, or tag the data. The last option is considered the best solution, because the first requires someone to deal with the issue manually each time it occurs, and the second means that data is missing from the target system (an integrity problem) and it is often unclear what should happen to that data.
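
The three kinds of screen and the tagging option can be sketched in Python as follows. The column names (`amount`, `customer_id`, `type`, `discount`), the "premium" business rule and the `quality_flags` tag are hypothetical and only show the shape of such a filter.

```python
def column_screen(row):
    """Column screen: the hypothetical 'amount' column must be a non-negative number."""
    amount = row.get("amount")
    return isinstance(amount, (int, float)) and amount >= 0


def make_structure_screen(known_customer_ids):
    """Structure screen: 'customer_id' must reference an existing customer."""
    return lambda row: row.get("customer_id") in known_customer_ids


def business_rule_screen(row):
    """Business rule screen: assumed rule that 'premium' customers carry a discount."""
    return row.get("type") != "premium" or row.get("discount", 0) > 0


def apply_screens(rows, screens):
    """Tag each row with the names of the screens it failed, without dropping it."""
    tagged = []
    for row in rows:
        failed = [name for name, screen in screens.items() if not screen(row)]
        tagged.append({**row, "quality_flags": failed})
    return tagged


screens = {
    "column": column_screen,
    "structure": make_structure_screen({1, 2}),
    "business": business_rule_screen,
}
rows = [
    {"customer_id": 1, "amount": 10.0, "type": "basic"},
    {"customer_id": 9, "amount": -5, "type": "premium"},
]
for row in apply_screens(rows, screens):
    print(row)
```

Because every row reaches the output together with its flags, the data flow is not interrupted and nothing is missing from the target, which is the reasoning behind preferring the tagging option.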


7. Criticism of existing tools and processes

Most data cleansing tools have limitations in usability:

  • Time: mastering large-scale data cleansing software is time-consuming.
  • Security: cross-validation requires sharing information and giving an application access across systems, including sensitive legacy systems.
  • Project costs: costs are typically in the hundreds of thousands of dollars.

8. Error schema

The error event schema holds records of all error events thrown by the quality screens. It consists of an error event fact table with foreign keys to three dimension tables, which represent the date (when), the batch job (where) and the screen (which one produced the error). It also holds information about exactly when the error occurred and the severity of the error. In addition, there is an error event detail fact table with a foreign key to the main table; it contains detailed information about the table, record and field in which the error occurred, and the error condition.
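
One possible layout of such a schema is sketched below using SQLite through Python's standard `sqlite3` module. The table and column names, the data types and the sample rows are assumptions made for illustration; the text above specifies only the fact table, its three dimensions and the detail table.

```python
import sqlite3

# Hypothetical table and column names; only the overall shape follows the text.
SCHEMA = """
CREATE TABLE date_dim   (date_key   INTEGER PRIMARY KEY, calendar_date TEXT NOT NULL);
CREATE TABLE batch_dim  (batch_key  INTEGER PRIMARY KEY, job_name      TEXT NOT NULL);
CREATE TABLE screen_dim (screen_key INTEGER PRIMARY KEY, screen_name   TEXT NOT NULL);

CREATE TABLE error_event (
    error_event_key INTEGER PRIMARY KEY,
    date_key    INTEGER NOT NULL REFERENCES date_dim(date_key),
    batch_key   INTEGER NOT NULL REFERENCES batch_dim(batch_key),
    screen_key  INTEGER NOT NULL REFERENCES screen_dim(screen_key),
    occurred_at TEXT    NOT NULL,  -- exact time of the error
    severity    INTEGER NOT NULL
);

CREATE TABLE error_event_detail (
    error_event_key INTEGER NOT NULL REFERENCES error_event(error_event_key),
    table_name      TEXT NOT NULL,
    record_id       TEXT NOT NULL,
    field_name      TEXT NOT NULL,
    error_condition TEXT NOT NULL
);
"""


def create_error_schema(connection):
    """Create the fact, dimension and detail tables on an open connection."""
    connection.execute("PRAGMA foreign_keys = ON")  # SQLite enforces keys only when asked
    connection.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")
create_error_schema(conn)
conn.execute("INSERT INTO date_dim VALUES (1, '2024-01-31')")
conn.execute("INSERT INTO batch_dim VALUES (1, 'nightly_load')")
conn.execute("INSERT INTO screen_dim VALUES (1, 'amount_not_negative')")
conn.execute("INSERT INTO error_event VALUES (1, 1, 1, 1, '2024-01-31T02:15:00', 3)")
conn.execute(
    "INSERT INTO error_event_detail VALUES (1, 'orders', '42', 'amount', 'negative value')"
)
```

Keeping the detail rows in a separate table lets a single error event refer to several fields or records without repeating the date, batch and screen keys.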

