The daily or hourly meteorological data downloaded from the National Oceanic and Atmospheric Administration (NOAA), the Consortium of Universities for the Advancement of Hydrologic Science, Inc. (CUAHSI), and the web version of HDFT from the Environmental Protection Agency (EPA) contain a lot of missing data. The data gaps are unique. I use interpolation to fill gaps of less than 10 days and the monthly average to fill gaps of more than 10 days. I do not know whether these methods are acceptable for research. Is there an effective way to interpolate the missing values?
If you have other stations available nearby, you may interpolate the data using Voronoi tessellation or similar interpolation techniques. See:
Andersson J.C.M., Zehnder A.J.B., Wehrli B., Yang H. 2012. Improved SWAT model performance with time-dynamic Voronoi tessellation of climatic input data in Southern Africa. J. Amer. Water Resources Association 48, 480-493, DOI: 10.1111/j.1752-1688.2011.00627.x
Depending on the size of your dataset, you can pull those days out of the analysis, as I have recently done. I should back up and say that my experience has been with using those datasets to develop predictive models. Since I was looking to determine the relationship between hydrologic conditions and climate conditions, I did not want to add undue noise to the analysis. As I progressed through the analysis, I was able to see which climate variables were the key drivers of the processes I was modeling. For key variables that had missing data, I tried to track down a surrogate location that collected the missing data along with other similar data. I used the similar data to understand how the climate conditions compared across the two sites. I also called/emailed the specific NOAA office in charge of the data for the site of interest to see if there were any additional data not directly available (e.g., hourly or 15-min data that could be used to get at the missing data).
Hope this helps.
It depends on which meteorological data set you are referring to. Strategies for filling gaps in precipitation data differ from those for temperature data because of the spatial nature of rainfall, especially if the rainfall event is convective (discrete cells) rather than stratiform. What data set (e.g., rainfall, temperature, wind speed) are you working with?
A fairly large amount of research has been carried out on this topic. You can use ANNs and other soft computing techniques to fill the missing values if you have a sufficiently long data record.
If the missing data are hydrological, like runoff, and you need them to calibrate a model, then I think it is a little worse than missing some rain days or some temperature data, because you need the runoff data to calibrate the model and to see how the other meteorological and geological conditions are connected with the runoff. But you can still use the techniques you mentioned. You might just get a slightly lower coefficient of efficiency in your calibration and modeling, but you will not know whether the differences between observed and simulated runoff are due to other mistakes or to the data you interpolated. Many papers I have read use monthly data for calibration or for the final presentation and testing of model efficiency, so you could use monthly averages in the end: run the model on daily data that include the interpolated values and you will still have good monthly predictions and calibration. Another technique is to calibrate your model over a shorter period for which you have all the runoff data and then rerun it with the interpolated data as well. Whatever you do, you will certainly fail to predict some major rain events. DATA DATA DATA..... :(
I am working with the Hydrological Simulation Program Fortran (HSPF) model, which requires hourly data without a single gap. HSPF needs hourly precipitation, air temperature, wind speed, cloud cover, dew point temperature, pan evaporation, and solar radiation. Five climate stations (SD 392231, SD 397227, SD 396427, SD 393868, SD 396937) are located near my study area. I pulled daily precipitation data for station SD 393868 from NOAA and CUAHSI (GHCN daily). The NOAA data have a continuous one-month gap. The CUAHSI data cover about one third less than the actual period. So the same data from NOAA and CUAHSI have different issues. I also downloaded hourly precipitation data for station SD 396937 (1948/8/10 to 2012/7/31) using http://gis.ncdc.noaa.gov/map/viewer/#app=cdo&cfg=cdo&theme=hourly&layers=00000001&extent=-139.2:12.7:-50.4:57.8&node=gis. I got only 31864 records, but the total number of hours between these dates is 560790. I am still playing with these data with interest. I will contact NOAA or the other data providers once I am completely stuck. I had thought we use the same technique to fill data gaps for different datasets. Thanks Cole and Cotton. Your comments are valuable for me to understand the data world!
There are a number of methods that can fill in missing data, from simple regression to more advanced methods like autoregressive models and artificial neural networks. You can check the published literature for them.
Unfortunately, I can't say that I am convinced that one of them is the most effective regardless of the type of data.
For example, if a temperature value is missing for a day, I could accept a simple regression, but what happens if a rainfall value is missing? We must be very careful, because rainfall is not as well autocorrelated as temperature is.
There is also the question of how sensitive your model is. For example a groundwater flow model might not be as sensitive as a stream flow model to a slightly modified rainfall value.
My personal favorite way of dealing with this problem, when I use daily values to train an artificial neural network, is to omit the whole day, when a gap in the time-series exists.
If your model cannot handle non-adjacent days of data, or if the application of this method leaves you with very few days of data, you should try searching the literature for the most appropriate method for each particular hydrological parameter.
And always keep in mind that if you modify the observed data enough, you are no longer calibrating a model to match the physical system, but rather to an imaginary system that would have given the modified observations.
Be careful, because that could lead very quickly to garbage in - garbage out.
Dol Raj,
I don't think that the methods you used to fill in the gaps are illogical. Like any algorithm used in software, you have followed logical steps and limited larger errors by using aggregated actual data, so that the estimated values together with the actual observations agree with the recorded/reported mean. So far so good! However, a further step you have to take is validation of the estimated values. Mostly, researchers use an observed data series that is considered theoretically related to the series in which the missing data were filled by estimation/interpolation. Best practice involves calculating the correlation between the related series and the filled series block-wise. That is, a sufficient number of actual values are correlated with the recorded values of the related hydrological variable. There should be at least two such blocks. The correlation is calculated for each block, giving a range of correlation values against which to judge the correlation of the block that contains the estimated values. If the correlation of the block containing the estimated values lies between the previous two correlations, or differs negligibly from them, your estimated values are validated and can be used with confidence.
Another method of validation is to estimate/interpolate the missing values with two or three methods and correlate the results. If the correlations are high and significant (and the number of missing values, continuous or intermittent, is large), it is better to average the values estimated by the different methods.
I have recently combined two data splicing techniques to fill in missing data in a two-step approach. The first is the surrogate method and the second is the overlap method. The surrogate method uses data from a nearby gauge (donor site) or a surrogate variable at the site of interest (target site). These data are then used to develop predictive relationships (most methods have been discussed in previous posts, e.g., regression or neural networks). The relationships are developed using data from time periods when both the donor and target stations or gauges have data (period of data overlap).
Instead of using all the data from the overlap period to develop a predictive model, you may split the data into two sub-datasets. Develop your predictive model(s) using the first sub-dataset. Use the model to predict values for the second sub-dataset (xi). Finally, predict the missing data using one of the relationships below:
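Y = X * (1/n) * SUM(yi / xi) or Y = X * SUM(yi) / SUM(xi)
where Y is the estimated missing value, X is the corresponding value predicted by the model you developed, yi is the ith observed value of the second sub-dataset, xi is the corresponding ith model-estimated value of the second sub-dataset, and n is the number of values in the second sub-dataset (the first form scales X by the mean of the ratios, the second by the ratio of the sums).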
I choose data immediately before and after the period of missing data for the second sub-dataset to account for local biases of the predictive model.
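The ratio-of-sums form of this correction can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `ratio_corrected_estimate` and the sample numbers are hypothetical, and the predictive model that produces X and xi is taken as given.

```python
def ratio_corrected_estimate(model_value, observed, predicted):
    """Scale a model prediction X by SUM(yi) / SUM(xi).

    observed  -- yi, observed values in the second sub-dataset
    predicted -- xi, model-estimated values for the same time steps
    """
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    total_predicted = sum(predicted)
    if total_predicted == 0:
        raise ValueError("sum of predicted values is zero; ratio undefined")
    return model_value * sum(observed) / total_predicted


# Hypothetical values taken just before and after the gap.
observed = [12.0, 15.0, 9.0]    # yi
predicted = [10.0, 12.0, 8.0]   # xi
estimate = ratio_corrected_estimate(5.0, observed, predicted)
print(estimate)
```

In this made-up case the model under-predicts the observations by a factor of 36/30, so the raw prediction of 5.0 is scaled up by the same factor.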
Your ideas will help me to find the way out. Thanks a lot to everyone!
If you have complete time series for precip and temp and enough data to calibrate some simple model, then a possible way to go is to first calibrate the model on existing data and then to use the simulated runoff to fill the gaps. To make things a bit more sophisticated you can use several good parameter sets derived using a Monte Carlo approach or several calibration trials and compute some 'ensemble mean'.
I think interpolation with the nearest stations is the best approach. If you cannot find any reliable surrounding information, because of a lack of stations or a lack of data for the same period (this shouldn't be the case in the USA, but here in Argentina it is typical), you can use a polynomial fit for streamflow data for gaps of up to 10 days. That also depends on the season considered, and it can be extended to 15 days or so in the dry season. If you have gaps greater than 10 days, I don't recommend replacing the entire month with its climatological values, because you would be discarding 20 or fewer days of valuable measured information. For precipitation a polynomial approach is not valid, but you can use satellite estimates (depending on the region considered).
Interpolate using nearby data sets. Don't rely upon computer modelling until you've verified the computer model with manual techniques. A couple of hours making sure the numbers make sense with the Mk1 pencil and a calculator is often undervalued, but a critical step.
I would suggest a "Gaussian Process", a Bayesian non-parametric method that is widely used in machine learning. It is a regression method that accounts for data uncertainty.
I don't think there is an effective way of doing this. You simply take the missing data out, or you redo the process to get new data. You can also interpolate using the ArcGIS tools.
We can use a regression between the series with missing data (x: dependent station) and a complete series (y: independent station). First we have to find the best station, one that is a short distance away and has a full data set.
On the other hand, we can use ArcGIS and isolines.
You may fit a stochastic model (AR, MA, ARMA, etc.) to the part of the data that is available and use the model to predict the gaps. However, the length of the available data should be statistically sufficient.
Plotting the hydrological variable of concern on a time series linear regression plot will give the line equation that allows the missing values to be predicted.
Multiple regression is also recommended depending on the performance of the variable in the model.
I think a linear interpolation across data gaps (drawing a straight line from your site's last valid point to the first point after the gap) is the least preferable and not likely to be accepted if it's noticed in peer review (you didn't specify what criteria you are using "for research". "Research" can mean a wide range of things :-) ). Linear interpolation effectively introduces "wrong" data that is likely to distort your analysis, especially if others less familiar with the data take your data for further analysis later. As Jeffrey outlined, in the absence of any other data that would support a more "educated guess", leave it blank. If you are using a model that does not tolerate missing data in a time series, that's unfortunate, but don't hand your model bad data. As others have outlined, you can try to find other "nearby" stations and develop (in a demonstrable and defensible way) some kind of "informed interpolation". Examples include nearby rainfall gages (check out CoCoRaHS.org for the US and southern Canada), temperatures corrected using lapse rate for elevation, etc. As part of demonstrating and defending this interpolation, you will (or should) quantify the confidence and uncertainty associated with each variable. These will be more work, but will be magnitudes better than blind interpolation. Related to this, do a "reality check" on the data that *does* exist. Follow Edward Tufte's advice and "plot the data!" and scrutinize it for unrealistic values caused by instrument errors, transcription errors (transposed decimal points!), etc. If you see 10 inches of rainfall at one location, and all nearby stations are dry as a bone, it's probably an instrument error (I've seen this).
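As one example of such an "informed interpolation", the lapse-rate correction of a neighbouring station's temperature could look like the sketch below. The function name `lapse_rate_adjust` and the station values are hypothetical. The default of 6.5 degrees C per km is the commonly quoted standard-atmosphere average and is only an assumption here; a local rate derived from paired station records would be more defensible.

```python
def lapse_rate_adjust(temp_donor_c, elev_donor_m, elev_target_m,
                      lapse_c_per_km=6.5):
    """Estimate temperature at the target site from a donor station.

    Temperature is assumed to decrease linearly with elevation at
    lapse_c_per_km degrees Celsius per kilometre (an assumed default).
    """
    elevation_difference_km = (elev_target_m - elev_donor_m) / 1000.0
    return temp_donor_c - lapse_c_per_km * elevation_difference_km


# Hypothetical donor station at 500 m reading 20 C; target site at 1500 m.
estimated_temp = lapse_rate_adjust(20.0, 500.0, 1500.0)
print(estimated_temp)
```

Comparing such estimates with the target station's own record during periods when both stations report is one way to quantify the uncertainty of the fill.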
Your interpolation method should make use of the information you have about the process (e.g. neighbouring stations, seasonal behaviour, statistical properties, etc.). After interpolation, attention should be given to its impact on the statistical properties of the time series. There is also a set of time series analysis techniques that tolerate series with gaps.
As you mentioned interpolation methods, statistical methods (like regression models) would be useful. The most important thing is to find an efficient empirical relationship between your data. In some cases hydrological or meteorological models would be useful (specifically for precipitation or runoff). For example, if you want to find missing runoff data, a well-calibrated rainfall-runoff model can predict them. Interpolation methods like kriging would be very useful for predicting missing precipitation data. For temperature data, if you have multiple stations you can easily find the missing data by using the other stations' data (by interpolation methods), and if you have a single station, time series analysis can help. In a mountainous region, a linear regression between the temperatures of the stations can predict the missing data quite well. For missing relative humidity data, some statistical methods have been proposed.
Depends on what data has missing values. If streamflow data, then a forecasting model, or maybe an ANN model would probably be best. For missing data in the inputs (rainfall, PET etc), then I would recommend that you try several different methods, and look at the impact on the predicted flows to see how important the infilling procedure is. For missing rainfall data, you will need to take a lot of care. For missing PET data, then the method used may not be as critical as the loss of water through evaporation is a slow process (rate of loss is small compared to the amount of water stored). For rainfall, you could look at neighbouring locations and see what the rainfall is there. An ANN might work through comparing neighbouring raingauges, and perhaps other climate data. Certainly would not consider a linear interpolation for rainfall. Might be able to get away with this for PET, but need to exercise caution.
Main thing is to explore different methods, and look at the impact the different values have on the predicted flows. That way you can discuss the impact the infilled data has on your results.
Finally, I applied the following techniques to fill the missing data. Precipitation data from the nearest station were used to fill gaps longer than 6 hours. Simple linear interpolation was applied to fill gaps of 1-2 hours, even though precipitation does not occur linearly. The temperature lapse rate was used to fill missing temperature data from the nearest station's temperature. Missing solar radiation data were calculated from cloud cover or simply filled with nearest-station data. Missing relative humidity data were calculated from air temperature and dew point temperature or filled with nearest-station data.
So far, I have not found any definitive way to fill missing meteorological data. After reading several researchers' comments, I realized that statistical tools will certainly improve the quality of the data and the overall result of engineering research. We should explore these tools too!
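A rule that depends on gap length, like the precipitation rule above, can be sketched as follows. This is a hedged illustration, not the code used in the thesis: the function names, the `None` marker for missing hours, and the sample series are assumptions, and gaps of 3 to 6 hours are left unfilled because the post does not say how they were treated.

```python
def find_gaps(series):
    """Return (start, end) index pairs, end exclusive, for each run of None."""
    gaps, start = [], None
    for i, value in enumerate(series):
        if value is None and start is None:
            start = i
        elif value is not None and start is not None:
            gaps.append((start, i))
            start = None
    if start is not None:
        gaps.append((start, len(series)))
    return gaps


def fill_by_gap_length(target, donor, short_max=2, long_min=7):
    """Fill short gaps linearly and long gaps from a donor station.

    short_max -- longest gap (hours) filled by linear interpolation
    long_min  -- shortest gap (hours) filled from the donor, i.e. > 6 hours
    """
    filled = list(target)
    for start, end in find_gaps(target):
        length = end - start
        has_both_neighbours = start > 0 and end < len(target)
        if length <= short_max and has_both_neighbours:
            left, right = target[start - 1], target[end]
            for k in range(length):
                fraction = (k + 1) / (length + 1)
                filled[start + k] = left + fraction * (right - left)
        elif length >= long_min:
            for i in range(start, end):
                filled[i] = donor[i]  # stays None if the donor is also missing
    return filled


# Hypothetical hourly series: one 1-hour gap and one 7-hour gap.
target = [1.0, None, 3.0] + [None] * 7 + [0.0]
donor = [0.5] * 11
result = fill_by_gap_length(target, donor)
print(result)
```

Keeping the two thresholds as parameters makes it easy to rerun the model with different choices and see how sensitive the simulated flows are to the filling rule.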
Thanks to everyone for sharing their valuable experience and ideas!
I wonder whether switching to data analysis techniques that can handle unevenly spaced time series would provide a method to patch the gaps or reconstruct an evenly spaced time series. Many years ago I used the "Lomb-Scargle" algorithm, a least-squares fitting method that was and is used by astronomers and, as I see now, also in biomathematics (Detecting periodic patterns in unevenly spaced gene expression time series using Lomb–Scargle periodograms, Earl F. Glynn, Jie Chen, and Arcady R. Mushegian, Bioinformatics, 22, 2006). I had to struggle with the computational effort, but MATLAB now has algorithms for "Lomb-Scargle periodograms". It seems worth studying whether it is an option to use a method with known features to construct an evenly spaced time series that is then analysed further.
I recommend geostatistical simulation techniques in the space-time domain. Geostatistical simulation has one important advantage: it preserves the statistical characteristics of the original data, whereas with any interpolation method the extreme values tend to disappear.
As Barry and Jorge said, check different techniques and make sure that you do not alter the statistical properties of your data. There are quite a lot of different gap-filling techniques in the remote sensing community, for obvious reasons (discontinuous overpass times, clouds, technical issues, etc.):
http://www.biogeosciences-discuss.net/9/17053/2012/bgd-9-17053-2012.html
http://www.biomecardio.com/pageshtm/publi/envirmodellsoftw12.pdf
Alavi N.*, J. Warland and A.A. Berg. 2006. ‘Gap filling of evapotranspiration measurements for water budget studies: evaluation of a Kalman filtering approach’. Agricultural and Forest Meteorology. 141, 57-66.
Dumedah, G., Coulibaly, P., 2011. Evaluation of statistical methods for infilling missing values in high-resolution soil moisture data. J. Hydrol. 400, 95-102
Just have a look at these papers:
Ketaki Ustoorikar and M C Deo (2008): Filling up Gaps in Wave Data with Genetic Programming, Marine Structures, Elsevier, 21, 177-195.
Ruchi Kalra and M C Deo (2007): Genetic Programming to retrieve missing information in wave records along the west coast of India, Applied Ocean Research, Elsevier, 29(3), 99-111.
For some data, as suggested before, you can use physical models, instead of statistical approaches.
For instance, for discharge data you can try to use a rainfall-runoff model to predict discharge when data are missing. We applied this in some projects and, with good input data and a good model, the results were really satisfactory (we also checked this with some analyses on simulated missing data).
You could always interpolate data from neighboring stations, as many others have mentioned. Check out the Centre for Ecology & Hydrology in the UK. They collect data from all national agencies and process it to create one database, and they have developed many different interpolation methodologies, so their experience may be useful. They have many publications, most of which are free, and they also have some pre-written code for this job that can be downloaded.
http://www.ceh.ac.uk/index.html
Hi there, I recommend taking a look at the attachment. You may find some answers to your question there. It is a good introduction to a useful tool for interpolating data.
Filling in missing data is a practical problem, and finding effective methods for it can be a research problem in itself.
Filling in average values is fine for practical purposes but probably not for research, because the statistical characteristics of the data series change. The mean is preserved but the other statistical properties are not.
So you need to use suitable methods that preserve the statistical properties of the data, as required by your research objectives. The simplest method for this is to use the nearest neighbouring stations at the corresponding observation times.
Good luck.
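The point above about mean filling can be checked with a few lines of Python. This is a toy illustration with made-up numbers, using only the standard `statistics` module: the mean of the series is unchanged after filling, but the variance shrinks.

```python
from statistics import mean, pvariance

# Hypothetical series with two missing values marked as None.
observed = [2.0, 4.0, None, 6.0, None, 8.0]

valid = [v for v in observed if v is not None]
fill_value = mean(valid)
filled = [fill_value if v is None else v for v in observed]

mean_before, mean_after = mean(valid), mean(filled)
var_before, var_after = pvariance(valid), pvariance(filled)

print("mean:", mean_before, "->", mean_after)
print("variance:", var_before, "->", var_after)
```

The same comparison can be run on any candidate filling method to see which statistical properties it preserves.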
The method used to fill hydrological data depends on several aspects:
1) The amount of data missing in terms of the period
2) The time step: whether daily, monthly or annual
3) The availability of similar data sets for the same area over a long period
If the amount of data to be filled is not enormous, establish homogeneous zones with similar characteristics and extrapolate using a station in the same zone. It may be necessary to establish the long-term climatic characteristics of the region in question. For longer time steps, like monthly or annual totals, the long-term mean could be helpful. There are also a number of formulae in the literature that can be applied to achieve the same. It is, however, important to note that the method used depends very much on the specific case under investigation.
I think you may treat the missing data in the same way as you do predictions for the future.
I mean you just use the tools of time series analysis and apply them to the gaps. You may also use conditioning, in the sense that you already know the data at the beginning and at the end of the gap, and you generate the values between them with a suitable time series model fitted to the data outside the gap.
In the carbon exchange community (FLUXNET), a pretty accurate technique for gap filling of NEE multitemporal series is to apply deterministic ecosystem models.
These models are first calibrated per site and then validated with a jackknifing approach for the same ecosystem types.
The next step is to trust their prognostic capacity, as well as their gap-filling capacity.
For FLUXNET sites, the deterministic ecosystem models have been demonstrated to perform gap filling of NEE multitemporal data quite accurately, and they also include descriptions of local hydrology and its impact on ecosystem functioning.
As a consequence, with respect to missing hydrological data, I would follow the same approach, with a deterministic (and hence hydraulic) model of the catchment, or part of it, under study. A well-known model in this respect is MIKE-SHE.
MIKE SHE is an advanced integrated hydrological modeling system. It simulates water flow in the entire land based phase of the hydrological cycle from rainfall to river flow, via various flow processes such as, overland flow, infiltration into soils, evapotranspiration from vegetation, and groundwater flow. MIKE SHE has been applied in a large number of studies world-wide focusing on e.g. conjunctive use of surface water and ground water for domestic and industrial consumption and irrigation, dynamics in wetlands, and water quality studies in connection with point and non-point pollution.
It is used in regional studies covering entire river basins as well as in local studies focusing on specific problems on small scale.
MIKE SHE is characterized as being:
Integrated. Fully dynamic exchange of water between all major hydrological components is included, e.g. surface water, soil water and groundwater.
Physically based. It solves the basic equations governing the major flow processes within the study area.
Fully distributed. The spatial and temporal variation of meteorological, hydrological, geological and hydrogeological data across the model area is described in gridded form for the input as well as the output of the model.
Modular. MIKE SHE has been given a modular structure, which allows water quantity simulations to be expanded to cover e.g. solute transport, particle tracking, geochemical reactions etc. The modular architecture allows the user to focus only on the processes that are important for the study.
Hence, the exercise is to parameterize MIKE-SHE for the region of interest, validate the simulation results, and finally perform the gap filling of the hydrological variable of interest (river flow rate?).
It is as simple as that, but mind that MIKE-SHE is a complex model. It will take time to get the gaps filled.
Good luck and cheers,
Frank
Investigate the correlation with nearby weather stations and take the regression equation from the station with the best correlation coefficient. With this equation you can fill the gaps.
Finally, I completed my master's thesis, incorporating most of your comments. They helped me a lot in making the final decisions while filling the gaps in the time series. My thesis study indicates that the quality of the input data has a significant impact on the accuracy of any hydrologic model output. Thank you everyone for your valuable suggestions. I hope this discussion will help others too.
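A minimal sketch of this idea is shown below: pick the nearby station with the strongest correlation over the overlap period, fit a straight line, and use it to fill the gaps. The function names, the `None` marker for missing values, and the station data are all hypothetical. For precipitation, a plain linear fit can return negative values, so the estimates would need checking.

```python
import math


def paired(target, donor):
    """Return donor (x) and target (y) values where both are present."""
    xs, ys = [], []
    for t, d in zip(target, donor):
        if t is not None and d is not None:
            xs.append(d)
            ys.append(t)
    return xs, ys


def fit_and_correlate(xs, ys):
    """Least-squares line y = intercept + slope * x and Pearson r."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return 0.0, mean_y, 0.0  # no usable linear relationship
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x, sxy / math.sqrt(sxx * syy)


def fill_from_best_donor(target, donors, min_overlap=3):
    """Fill None values in target using the best-correlated donor station."""
    best = None
    for name, donor in donors.items():
        xs, ys = paired(target, donor)
        if len(xs) < min_overlap:
            continue
        slope, intercept, r = fit_and_correlate(xs, ys)
        if best is None or abs(r) > abs(best[3]):
            best = (name, slope, intercept, r)
    if best is None:
        raise ValueError("no donor station has enough overlapping data")
    name, slope, intercept, r = best
    donor = donors[name]
    filled = []
    for i, value in enumerate(target):
        if value is None and donor[i] is not None:
            value = intercept + slope * donor[i]
        filled.append(value)
    return name, r, filled


# Hypothetical target series with one gap and two candidate donors.
target = [2.0, 4.0, None, 8.0, 10.0]
donors = {"A": [1.0, 2.0, 3.0, 4.0, 5.0], "B": [5.0, 1.0, 4.0, 2.0, 3.0]}
best_name, best_r, filled = fill_from_best_donor(target, donors)
print(best_name, best_r, filled)
```

Holding back part of the overlap period and predicting it with the fitted equation is a simple way to validate the relationship before using it on the real gaps.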
Sounds like your research has been a valuable journey. Well done on getting it finished.
If you have a nearby weather station, you may correlate the precipitation data of one station with the other. However, as precipitation can be quite variable over a short distance, the correlation may not be very good. I have usually worked with rainfall summed over several days (generally 3-day cumulative totals). This improves the quality of the correlation between stations. It is a simple and rather effective approach!
This question is a big issue for many hydrologists and water resources engineers. The choice of interpolation method depends on the type of data whose gaps you would like to fill. If it is weather data (e.g. precipitation), you can simply use a basic interpolation method (e.g. the Thiessen polygon method or a related one). But the success of spatial interpolation varies according to the type of model chosen. Please review deterministic and geostatistical interpolation methods.
But if the missing data are streamflow or discharge, you should do it carefully. If a rating curve already exists, it will be easy to estimate the missing flow data.
A key question. Find the best-correlated data (a set of good predictors). Depending on the target variable, develop a regression model (e.g. multiple linear regression, neural networks, Gaussian processes). Finally, validate your equation or model and evaluate whether the results are acceptable. Then you can fill the data for your specific case. Hope this helps & good luck.
Hi, you can use linear regression. I have used it and was really happy to have found something that worked well for my needs. You can do this interpolation very easily in Excel. Good luck!
You can also find some answers from different researchers under this question:
https://www.researchgate.net/post/How_can_I_fill_the_missing_climatological_data?_tpcectx=profile_questions
If you have an adequate number of complete records, you can use a machine learning approach to predict the missing fields!
Article Machine Learning Based Missing Value Imputation Method for C...