Data should be checked for accuracy each time a piece of data is placed into the database. Duplicate copies should be secured at multiple sites, with safeguards against data tampering.
If the data is a collection of waveforms / signals: before the analysis there should be a data cleaning step, where we can apply noise removal algorithms and outlier removal techniques, and fill in the missing values (if any), in order to make sure the data is appropriate for analysis.
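A minimal sketch of such a cleaning step in Python, assuming a 1-D signal held in a pandas Series; the synthetic signal, the 3-sigma outlier threshold, and the median-filter kernel size are illustrative assumptions, not fixed recommendations:

```python
import numpy as np
import pandas as pd
from scipy.signal import medfilt

# Hypothetical example signal with simulated defects.
rng = np.random.default_rng(0)
raw = pd.Series(np.sin(np.linspace(0, 10, 500)) + rng.normal(0, 0.1, 500))
raw.iloc[[50, 200]] = np.nan   # simulate missing samples
raw.iloc[300] = 8.0            # simulate a gross outlier

# 1. Outlier removal: mask samples more than 3 standard deviations from the mean.
z = (raw - raw.mean()) / raw.std()
cleaned = raw.mask(z.abs() > 3)

# 2. Missing-value filling: linear interpolation over the gaps.
cleaned = cleaned.interpolate(method="linear")

# 3. Noise removal: a median filter suppresses impulsive noise.
cleaned = pd.Series(medfilt(cleaned.to_numpy(), kernel_size=5))
```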
Data quality can be improved by having good data governance (DG), which is the overall management of the quality, availability, usability, integrity, and security of data used in an enterprise. A sound data governance program includes a governing body or council, a defined set of procedures, and a plan to execute those procedures.
Look for patterns in the data that you would not normally expect; run an autocorrelation analysis if you suspect this is the case. Double-check for typos or misplaced decimals. Do sampling checks to ensure the plausibility of each piece of data sampled. Run checks to ensure that the data type is consistent across the entire data set.
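As a rough illustration, here is a Python sketch of two such checks, type consistency and lag-1 autocorrelation; the DataFrame and column name are made up for the example:

```python
import pandas as pd

def basic_quality_checks(df: pd.DataFrame, column: str) -> dict:
    """Illustrative checks: type consistency and lag-1 autocorrelation."""
    col = df[column]
    # Type consistency: every non-null value should share one Python type.
    n_types = col.dropna().map(type).nunique()
    # Lag-1 autocorrelation: unexpectedly strong correlation can reveal
    # patterns you would not normally expect in independent measurements.
    autocorr = pd.to_numeric(col, errors="coerce").autocorr(lag=1)
    return {"consistent_type": n_types == 1, "lag1_autocorr": autocorr}

# Hypothetical usage on a small, made-up sample:
df = pd.DataFrame({"reading": [1.0, 1.1, 0.9, "1,2", 1.05]})
print(basic_quality_checks(df, "reading"))
```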
Data quality begins with the data collection procedure. It depends on the sensitivity of the measurement equipment and tools, which should give a small error margin for measurements, and ideally it should not depend on the diligence and experience of the workers, which may otherwise lead to bias in measurements.
The method of data collection is an important factor. Mixed-method data collection and data triangulation are very effective for getting quality data for analysis.
There are many techniques for improving data quality. One of them is filling missing values. I recently published a paper, "Generic Data Imputation and Feature Extraction for Signals from Multifunctional Printers". It's open access; here is the link: http://ceur-ws.org/Vol-2322/dsi4-1.pdf I hope it will help you with your data.
Sampling is a basic part of data science. In statistical quality assurance a few empirical formulae are available, of which variance is the most important. The frequency of sampling and the random collection of data at different points also play an important role.
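For illustration only, a small Python sketch of repeated random sampling with a sample-variance comparison; the population, sample size, and number of draws are arbitrary placeholders:

```python
import numpy as np

# Draw several random samples from different points in the data set and
# compare their variances as a rough plausibility check.
rng = np.random.default_rng(42)
population = rng.normal(loc=10.0, scale=2.0, size=10_000)

sample_variances = [
    np.var(rng.choice(population, size=100, replace=False), ddof=1)
    for _ in range(5)
]
# A sample whose variance diverges sharply from the rest may indicate
# a collection problem in that part of the data.
print(sample_variances)
```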
It is not a top-down process but an iterative one, so you can start by performing an initial data manipulation phase, then analyse and assess whether the result was sufficiently satisfying. Even the very concept of an outlier is not straightforward; it depends on the type of data and analysis involved.
The first step is to determine which data quality dimensions are applicable to your project or subject area. There are six or seven dimensions, such as Completeness, Accuracy, Timeliness, Consistency, Uniqueness, and Validity.
Not all of them will necessarily be applicable to your data, hence at the beginning of the project we should target the low-hanging fruit.
For example, to achieve 'Validity', set the rule that the PhoneNumber column should only contain numeric values, as in the sketch below.
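A minimal sketch of that rule in Python with pandas, assuming a hypothetical PhoneNumber column stored as strings; a stricter pattern would be needed for real phone-number formats:

```python
import pandas as pd

# Made-up example data with both valid and invalid entries.
df = pd.DataFrame({"PhoneNumber": ["5551234", "555-1234", "abc", "98765"]})

# Validity rule: the column may contain digits only.
is_valid = df["PhoneNumber"].str.fullmatch(r"\d+")
violations = df[~is_valid]
print(f"{len(violations)} rows violate the validity rule:")
print(violations)
```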
For the first cycle of data quality work, try to achieve one or two DQ dimensions; this will give the team some experience and learning.
In the second cycle we should apply that learning and improve further.
A data quality project requires detailed planning and research; it is a bit difficult to write it all in a message, but soon I will publish my paper so that it helps the community.