I have a 47 GB FASTA file to parse. The sequences are in the following format:

>tscs_00041 gene0ea_12345_rframe2_orf

mlaathyykfairrlfpllkdticasysisikhhenfmalsnmpkiwedvevdgnnmqwtrfqttpvmpvyfiaagvfnlsfitnwntkllyrkdilpymtfaynvakniawflshirktkitnhi

>tscs_00044 gene0ea_12341_rframe2_orf

mticasysisikhhenfmaikhhenfmalsnmpkiwedv

I simply want to reformat this file so that each header line keeps only the sequence ID:

>tscs_00041

mlaathyykfairrlfpllkdticasysisikhhenfmalsnmpkiwedvevdgnnmqwtrfqttpvmpvyfiaagvfnlsfitnwntkllyrkdilpymtfaynvakniawflshirktkitnhi

>tscs_00044

mticasysisikhhenfmaikhhenfmalsnmpkiwedv

Can anyone share a script that does this?
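
For what it's worth, here is a minimal Python sketch of one way this could be done. It assumes every header line starts with ">" and that the ID is everything before the first whitespace, as in the examples above. It reads the file one line at a time, so the 47 GB file should never have to fit in memory. The function names and the command-line usage are my own choices for illustration, not part of any existing tool.

```python
import sys


def strip_descriptions(lines):
    """Yield each line, cutting FASTA header lines at the first whitespace."""
    for line in lines:
        if line.startswith(">"):
            # ">tscs_00041 gene0ea_12345_rframe2_orf" becomes ">tscs_00041"
            yield line.split(None, 1)[0] + "\n"
        else:
            # Sequence lines and blank lines pass through unchanged.
            yield line


def main(in_path, out_path):
    # Iterating over a file object reads it lazily, one line at a time.
    with open(in_path) as src, open(out_path, "w") as dst:
        dst.writelines(strip_descriptions(src))


if __name__ == "__main__":
    if len(sys.argv) == 3:
        main(sys.argv[1], sys.argv[2])
    else:
        print("usage: script.py INPUT.fasta OUTPUT.fasta", file=sys.stderr)
```

The output goes to a new file rather than overwriting the input, so about another 47 GB of free disk space would be needed. It is probably worth trying it on the first few thousand lines before running it on the whole file.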
