First I would check the data for errors. Even automated systems generate errors, so you cannot rely on data quality even from what should be a trustworthy source. The data will mostly be correct, but that is not the same as 100% correct.
1) Is the rainfall on neighboring days the same? This occurrence is possible, but unlikely.
2) Is the maximum rainfall reasonable? Say the data show that on December 14, 1972, 8735 cm of rain fell. One would expect news reports of flooding. Are there any?
3) Are there any missing values? Are all dates present and accounted for? An easy check: there are 365 or 366 days per year, so do you have the right number of entries?
4) Are there any negative values?
Now that you have a clean data set you can take the sum or average. You need to select a program, and figure out how to get the program to read your file. In R it might be something like this:
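A minimal sketch in base R, assuming a file called rainfall_daily.csv with columns date (YYYY-MM-DD) and rain (daily totals); the file and column names are hypothetical:

```r
# Read the daily data and sum it by calendar month (names here are assumptions)
daily <- read.csv("rainfall_daily.csv", stringsAsFactors = FALSE)
daily$date <- as.Date(daily$date)

# Build a year-month key and sum the daily values within each month
daily$month <- format(daily$date, "%Y-%m")
monthly <- aggregate(rain ~ month, data = daily, FUN = sum)
head(monthly)
```

Note that aggregate() with a formula drops NA rows by default, so missing days are silently ignored; that ties back to the quality checks above.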
What is the format of your data? If you are using raster data, there are nice tools implemented in the package raster which make this kind of operation straightforward.
Let me know if you are interested: I will post a real-life example.
I have raster data (containing daily data for one year) from which I have extracted the values into a CSV file, and now I want to calculate monthly data from it. If you can give an example using the raster package, I will go with that.
Here is an example with TAMSAT daily rainfall data which cover Africa. For the example, I downloaded two years of data, in the region of Tahoua (Niger). The R code is embedded in the attached Rmarkdown file which is used to produce the HTML file available at https://dl.dropboxusercontent.com/u/44699235/Daily2MonthlyRainfall.html
As you said, you already have the dataset in CSV format. I assume you have a CSV file with two columns, data[date, prep]. In R you have many options for the aggregation. One convenient way is to use the xts package and the function apply.monthly. The R code is generally as follows:
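(A sketch only; the file name is an assumption, and the columns are taken to be date and prep as above.)

```r
library(xts)

# Read the CSV and build an xts time series indexed by date
dat <- read.csv("rainfall.csv", stringsAsFactors = FALSE)  # assumed file name
x <- xts(dat$prep, order.by = as.Date(dat$date))

# apply.monthly() applies a function over each calendar month
monthly <- apply.monthly(x, sum, na.rm = TRUE)
head(monthly)
```

apply.monthly handles the varying number of days per month for you, so there is no need to code that logic by hand.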
I fully agree that R is much more powerful and versatile than Excel. However, for some tasks and some skill levels Excel might be faster, especially if the task needs to be done only once. Furthermore, Excel forces you to look at some of the numbers, which can be useful for catching problems. Are there error codes in the data? What do they look like? "NA", "-999", or some other value may be present, cells might be blank, or entire days of data may be missing. I also suspect that if a data column contains an unrecognized text code, R will read the whole column as text, so check the data type of every column.
One issue: in Excel I can correct data errors as I find them. In R I have to find the problem, open the data file in another program, locate the error, and correct it. I then have to close the file and rerun R to find the next problem. Depending on how you work with R and how many errors are present, Excel might be faster.
It would be easy to have R find the error code -999. I could write a simple command to find all instances of -999 and delete them (or treat them as missing data). But what happens if there is sometimes a failure pattern? Some number of values before the -999 are artificially high or low because the sensor is failing, and that failure has not been properly detected in 5% of the cases. How do I tell R to find a pattern of failure associated with the -999 failure code? That is the sort of problem I found in my weather data from ground-based weather stations. I mostly ended up going through five years of weather data, gathered at 15-minute intervals at five sites, by hand. I could then convert the data to daily measures and use it in other analyses. It took about two weeks, but I could not think of a more efficient way to catch all the problems (and I probably didn't get all of them).
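For the first part, a hedged sketch of what that "simple command" might look like; the data frame and column names are hypothetical, and the window of two values either side of each failure is arbitrary and only meant to select rows for manual review, not to decide anything automatically:

```r
# Treat the -999 error code as missing data (names here are hypothetical)
daily$rain[daily$rain == -999] <- NA

# Optionally flag a small window around each failure for manual inspection,
# since the sensor may have drifted before the failure was recorded.
fail    <- which(is.na(daily$rain))
window  <- rep(fail, each = 5) + (-2:2)              # 2 values either side of each gap
suspect <- sort(unique(pmax(1, pmin(nrow(daily), window))))
daily[suspect, ]                                     # review these rows by eye
```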
You are right that in many cases Excel is a faster way to get the job done, especially for a not-so-complicated dataset. R is more powerful for complicated tasks, but it takes time to learn and to code.
Some of the cases you mention are quite well-known problems and can be handled very well in R; a short sketch follows the two points below.
- NA values: when reading tabular data, you can tell R which values should be treated as NA; it can be any string, e.g. na.strings = c("", "-999") with base read.csv, or na = "-999" with readr::read_csv.
- Outliers: there are many ways to detect them, from manual to automatic. Simple steps include (but are not limited to) drawing histograms and box plots, or plotting a full time series, e.g. a year of 15-minute values, and zooming in and out to spot odd behavior. Some R packages, such as tsoutliers and outliers, perform more sophisticated detection.
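A short sketch of both points (the file and column names are assumptions):

```r
# Treat blanks, "NA" and the -999 error code as missing at read time.
# Base R uses na.strings; readr's read_csv() uses the na argument instead.
daily <- read.csv("rainfall_daily.csv",
                  na.strings = c("", "NA", "-999"),
                  stringsAsFactors = FALSE)
str(daily)                 # confirm the rain column was read as numeric, not text

# Quick visual screening for outliers
hist(daily$rain, breaks = 50, main = "Daily rainfall", xlab = "Rainfall")
boxplot(daily$rain, main = "Daily rainfall")
```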
I get temperatures of 25, 24, 23, 22, 21, 20, 19, 20, 21, 22, 23, 24, and 25 as nighttime cooling takes effect and the temperatures start to rise again in the morning.
I get 27, 27, 26, 24, 22, -999, -999, -999, 26, 26, 27, and 29 as the sensor fails.
How does R decide that the values of 24 and 22 should be deleted in the second case but not in the first?
Ah, but you will rightly say that I am a bit off topic because this example is temperature and not rainfall. Most likely with rainfall you will get some error code (which is easy to correct in R), or you will get weird values that make no sense and can be found by looking for outliers. Furthermore, the data are already summarized by day (not by some shorter duration), so you may not be able to detect this sort of error at all.
The graphical approach, plotting the raw data, would certainly work. Include some flag for missing values and focus on those areas, and this might be a fairly easy task.
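Something along these lines, for instance; a sketch assuming a data frame daily with a date column and a rain column (hypothetical names, as in the earlier sketches):

```r
# Plot the raw series and mark where values are missing, so those stretches
# can be inspected closely
daily$date <- as.Date(daily$date)
plot(daily$date, daily$rain, type = "h", xlab = "Date", ylab = "Rainfall")
miss <- is.na(daily$rain)
points(daily$date[miss], rep(0, sum(miss)), col = "red", pch = 4)
```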
Dealing with problems like yours requires understanding both the phenomenon behind the time series and how it was measured, collected, and pre-processed. It is not a simple task, especially when the outliers are likely to lie within the confidence interval band, as in your case.
I recently found a useful tool at plot.ly (it is also available for R). You can go there and copy-paste data from Excel to make plots. I really like the zooming function.
One more thing to mention: writing code in R is quite time consuming, and it is only worth the effort when you have a really big dataset.
Is the daily rainfall data you are referring to station data or gridded data?
For station data, first identify any missing daily values and fill them using the Inverse Distance Weighting method (a rough sketch is given below) or any other method. After that you can compile the monthly rainfall series by summing the daily values. The process is simple using a spreadsheet (MS Excel) only; programming may take longer because different months have different numbers of days, so formulating and validating that logic can be time consuming.
For gridded data, you need to extract the data and import it into Excel according to the latitude and longitude of your location, and then build the daily time series. The rest of the process is similar.
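For the station-data case, a rough sketch of the inverse distance weighting fill might look like the following; everything here, from the object names to the distance power, is a hypothetical illustration rather than a validated implementation:

```r
# Fill missing daily values at a target station from neighbouring stations
# using inverse distance weighting (IDW). 'target' is a numeric vector of
# daily values with NAs; 'neighbours' is a matrix (days x stations) of the
# same dates at nearby stations; 'dists' are their distances to the target.
idw_fill <- function(target, neighbours, dists, power = 2) {
  w <- 1 / dists^power
  for (i in which(is.na(target))) {
    vals <- neighbours[i, ]
    ok   <- !is.na(vals)
    if (any(ok)) {
      target[i] <- sum(w[ok] * vals[ok]) / sum(w[ok])
    }
  }
  target
}
```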
I suggest you introduce a quality control step in the summation process, to be sure that each month has the expected number of days with valid data.
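One way to sketch such a check in R, assuming a data frame daily with columns date and rain (hypothetical names); the threshold of 25 days is arbitrary and only illustrative:

```r
# Count the days with valid (non-NA) data in each month before trusting the sums
daily$month <- format(as.Date(daily$date), "%Y-%m")
valid <- aggregate(!is.na(daily$rain), by = list(month = daily$month), FUN = sum)
names(valid)[2] <- "days_with_data"

# Flag months with too few valid days (threshold chosen for illustration only)
valid[valid$days_with_data < 25, ]
```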
If you are able to read the grid points in Excel, put each grid point in its own column, with the daily dates in the rows. Use a pivot table in Excel to extract the monthly values easily.
The arithmetic sum of the daily rainfall values gives the total for the particular month. Plots of the monthly data on graph paper show the trend of variation in monthly rainfall.
Can you check whether the months with 0 precipitation are genuinely dry months? Otherwise, can you assess the probability that these months received 0 precipitation? Are there neighbouring stations to compare against?
You can use the "Pivot Table" option in Excel (https://www.lumeer.io/pivot-table-complete-guide/) to compute the monthly values automatically. Please do pay attention to the missing values, as discussed above; they may bias your results.