How to extract information from headers

4.8 years ago

jones.theo194 • 0

I have multiple files that originally contained many sequences and their headers. I have removed the sequences, so each file now contains only header lines.

Each file looks like this:

Ncov - date_date_ - xxx| info I want |xxx - blah

Ncov - date_date_- xxx| info I want |xxx - blah

I would like to save the extracted field from each line as a single column in a new file, so that I can compare the files for % similarity with a much smaller memory footprint. However, I can't get my program to do this correctly or efficiently.

How would I code this extraction?

genomics • 617 views

updated 2.2 years ago by Ram 45k • written 4.8 years ago by jones.theo194 • 0

Hi, can you show the steps that you have already tried?

If you want to extract only the text that matches a pattern (the -P option requires GNU grep): grep -oP "pattern" "$file"
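For the header layout shown in the question, splitting on the pipe character may be simpler than writing a regular expression. The following is only a sketch, assuming every header has the wanted field between its first and second "|"; the function name extract_field and the file names in the usage comment are hypothetical.

```shell
# extract_field [FILE...]: print the text between the first and second "|"
# of each line, with leading and trailing spaces or tabs removed.
# Lines with fewer than two "|" characters are skipped.
# Reads standard input when no file is given.
extract_field() {
    awk -F'|' 'NF >= 3 { gsub(/^[ \t]+|[ \t]+$/, "", $2); print $2 }' "$@"
}

# Hypothetical usage: write one single-column file per header file.
# for f in headers_*.txt; do
#     extract_field "$f" > "${f%.txt}.field.txt"
# done
```

Each output file then holds one field per line, which should be much lighter to compare than the full headers.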

If you want to extract sequences with a pattern in the header (each matching header line plus the line after it): grep -A 1 ">.*pattern" "$file"

The second command only works if each sequence occupies a single line. You can use this command to convert multi-line FASTA to one-line FASTA:

awk '/^>/ { if (NR > 1) printf("\n"); print; next } { printf("%s", $0) } END { printf("\n") }' oldfile > newfile
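The two steps (joining wrapped sequence lines, then keeping the records whose header matches) can also be combined. This is a minimal sketch that uses awk for the selection step rather than grep -A 1, mainly because GNU grep prints "--" separator lines between non-adjacent matches. The function names linearize and records_matching are hypothetical, and the pattern is treated as an awk regular expression.

```shell
# linearize [FILE...]: join wrapped sequence lines so that every record is
# exactly two lines (header, then sequence). Assumes FASTA-style input in
# which each record starts with a ">" header line.
linearize() {
    awk '/^>/ { if (seen) printf("\n"); seen = 1; print; next }
         { printf("%s", $0) }
         END { if (NR) printf("\n") }' "$@"
}

# records_matching PATTERN [FILE...]: print the header and sequence of every
# record whose header line matches PATTERN.
records_matching() {
    pattern=$1
    shift
    linearize "$@" | awk -v pat="$pattern" '/^>/ { keep = ($0 ~ pat) } keep'
}

# Hypothetical usage:
# records_matching 'info I want' oldfile > matching.fasta
```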

4.8 years ago by Fatima ▴ 1000
