How to extract information from headers
3.6 years ago

I have multiple files that originally contained many sequences and their headers. I have removed the sequences, so each file now contains only header lines.

Each file looks like this:

Ncov - date_date_ - xxx| info I want |xxx - blah

Ncov - date_date_- xxx| info I want |xxx - blah

I would like to save the extracted information as a single column in a new file, so that I can easily compare all these files for % similarity with a much lower memory footprint. However, I can't get my program to do this correctly or efficiently.

How would I code this extraction?


Hi, can you show the steps you have already tried?

If you want to extract only the text that matches a pattern: grep -oP "pattern" "$file"

If you want to extract sequences whose header matches a pattern: grep -A 1 ">.*pattern" "$file"

This only works if each sequence is on a single line. You can use this command to convert a multi-line FASTA file to a single-line one:

awk '/^>/ { if (NR > 1) printf("\n"); print; next } { printf("%s", $0) } END { printf("\n") }' oldfile > newfile
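
For the header layout shown in the question, one possible approach is to split each line on the | character and keep the middle field. The sketch below assumes that the wanted text is always the second |-delimited field and that every header contains at least two | characters; the function name extract_field and the file names in the usage comment are placeholders, not anything from the original post.

```shell
# Hypothetical helper: print the second '|'-delimited field of each line,
# with leading and trailing whitespace trimmed.
# Lines with fewer than two '|' characters (including blank lines) are skipped.
extract_field() {
    awk -F'|' 'NF >= 3 { f = $2; gsub(/^[ \t]+|[ \t]+$/, "", f); print f }' "$@"
}

# Example usage (file names are placeholders): one single-column file per input.
# for f in headers_*.txt; do
#     extract_field "$f" > "${f%.txt}.field.txt"
# done
```

Because awk processes the input line by line, memory use should stay small regardless of file size. If the wanted text sits in a different position, change $2 to the matching field number.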