I have 19 large files with an average size of about 5 GB each, and I want to split the data from all of them into roughly 35,000 smaller files based on some filtering criteria.

Processing a single file serially takes 8 to 10 hours, and if the session stops or any other failure occurs I have to start over from the beginning, which threatens the reliability of the data. Is there any way to run this in parallel, or a faster approach in general?
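For example, is something along these lines feasible: farming the 19 input files out to worker processes with R's base parallel package (a minimal sketch only; the input path and process_one_file() are placeholders for my real logic)?

```r
library(parallel)

# placeholder path; my real .txt exports live elsewhere
input_files <- list.files("D:/data", pattern = "\\.txt$", full.names = TRUE)

process_one_file <- function(path) {
  # placeholder: read 'path' in chunks, apply the filter criteria,
  # and append rows to the appropriate output files; each worker
  # should write to its own output directory to avoid concurrent appends
  invisible(path)
}

cl <- makeCluster(7)   # PSOCK cluster works on Windows; leave one of the 8 cores free
results <- parLapply(cl, input_files, process_one_file)
stopCluster(cl)
```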

This task is very important to me, as my main modelling goal depends on this data. It is a one-time process for my application.

I am working on Windows (switching to Linux is difficult right now). The files are database files; I can produce either MySQL DB files or tables exported as .txt files. I want to split them based on selection and filter criteria applied to the rows (not simply after every n bytes). I am using R for this problem. The server I am working on has 8 cores and 32 GB of RAM.
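To make the splitting concrete, this is roughly the kind of chunked, criterion-based routing I mean (a sketch only, assuming tab-delimited .txt exports and a hypothetical grouping column named "region" standing in for my real criteria):

```r
library(readr)

split_one_file <- function(path, out_dir, chunk_rows = 1e6) {
  # read the big .txt table in chunks so it never has to fit in RAM at once
  read_delim_chunked(
    path, delim = "\t", chunk_size = chunk_rows,
    callback = SideEffectChunkCallback$new(function(chunk, pos) {
      # route the rows of this chunk to one output file per group value,
      # appending because the same group can appear in later chunks
      for (g in split(chunk, chunk$region)) {
        out <- file.path(out_dir, paste0(g$region[1], ".txt"))
        exists <- file.exists(out)
        write.table(g, out, sep = "\t", append = exists,
                    col.names = !exists, row.names = FALSE, quote = FALSE)
      }
    })
  )
}
```

Is this sort of chunked routing, combined with running several files at once, a reasonable direction, or is there a better tool for producing the 35,000 output files?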

If anyone has come across a similar problem, please suggest an approach.
