I am attempting a bioinformatics project in which I am processing, line by line, a text file that is over a billion lines long (about 67 GB).

Each line contains a chromosome number, a nucleotide position on that chromosome, and other information. For each line, I want to access the corresponding chromosome file (which might have upwards of 3 million lines), pull out a specific sequence, and then write that information to another file.
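For illustration, here is a minimal Python sketch of the kind of per-line lookup involved. It assumes (hypothetically) plain FASTA-style chromosome files and a tab-delimited input with the chromosome and a 1-based position in the first two columns; the exact file formats and column layout are placeholders, not necessarily my data. The key idea is to read each chromosome into memory once, so that pulling a sequence becomes a string slice rather than another pass over the chromosome file:

```python
def load_chromosome(path):
    """Read an entire chromosome sequence into memory once."""
    with open(path) as handle:
        # Skip a FASTA-style header line if present, join the rest.
        lines = [line.strip() for line in handle if not line.startswith(">")]
    return "".join(lines)

def extract(sequence, position, flank=10):
    """Return the bases around a 1-based position as a simple string slice."""
    start = max(position - 1 - flank, 0)
    return sequence[start:position + flank]
```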

Does anybody know of any particular tricks to read files more efficiently, jump to particular lines in a file, or write information to a file? I want to be able to process at least 8,000 lines per minute if possible (that way, I can analyze the entire database in roughly 100 days). Currently, I am running at about a tenth of that speed.
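On the "jump to particular lines" part: because lines are variable length, there is no way to seek directly to line N, but a one-time pass can record the byte offset where each line starts, after which `seek()` jumps there immediately on later reads. A rough Python sketch (file paths and names here are hypothetical):

```python
def build_line_index(path):
    """One pass over the file, recording the byte offset of each line start."""
    offsets = []
    with open(path, "rb") as handle:
        offset = 0
        for line in handle:
            offsets.append(offset)
            offset += len(line)  # byte length, since the file is opened in binary mode
    return offsets

def read_line(path, offsets, line_number):
    """Jump straight to a 0-based line number using the precomputed offsets."""
    with open(path, "rb") as handle:
        handle.seek(offsets[line_number])
        return handle.readline().decode()
```

On the writing side, keeping a single output handle open for the whole run and letting the built-in buffering batch the writes (rather than opening and closing the file per line) tends to matter as much as the read strategy.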

I'm happy to explain the ideas behind this project in more detail and share my code (once I upload it to GitHub) if anybody would like to collaborate!
