How to extract fasta subsequences from a multiline fasta file which has very long headers ?
3
2
Entering edit mode
22 months ago
sunnykevin97 ▴ 980

HI,

The multiline fasta file, comprises of 38667 sub-sequences,

Each header starts with the ">" and has very long headers.

How do I extract the some sub-sequences using a specific keyword

examples - ID's  (one ID per line)

XP_034407502.1
XP_034416580.1
XP_034403031.1

I had 5000 ID's in a text file.

I tried with faSomeRecords, generates an empty file

Before computing faSomeRecords, I removed ">" in IDs.txt file

faSomeRecords lumpsC.fa IDs.txt  IDs.txt.fa 

ls -lh IDs.txt.fa
-rw-rw-r-- 1 sun sun 0 Jun 22 05:11 IDs.txt.fa

I know we can do it by grep, its quite hard to do it manually for 5000 ID's. Some suggestions please.

grep "XP_034407502.1" lumpsC.fa > retreive_IDs_IDs.txt 

cat retreive_IDs_IDs.txt 

>lcl|NC_046980.1_cds_XP_034407502.1_23875 [gene=LOC117743771] [db_xref=GeneID:117743771] [protein=sprouty-related, EVH1 domain-containing protein 2-like] [protein_id=XP_034407502.1] [location=join(2056311..2056336,2070742..2070922,2071008..2071176,2071587..2071651,2075371..2075529,2075652..2076434)] [gbkey=CDS]
gene genome protein • 1.1k views
ADD COMMENT
0
Entering edit mode
22 months ago
d-cameron ★ 2.9k

For extracting the sequences, I use samtools faidx. For extracting the headers, I also use samtools faidx but with some linux filtering on the .fai file before passing the sequences of interest to samtools faidx.

ADD COMMENT
0
Entering edit mode
22 months ago
Mensur Dlakic ★ 27k

The list of IDs should be one ID per line. If this is what you have:

XP_034407502.1,XP_034416580.1,XP_034403031.1

it will not work. I suggest you convert your file to one entry per line, something like this:

tr "\," "\n" < IDs.txt > IDs_new.txt

After that:

faSomeRecords lumpsC.fa IDs_new.txt IDs.txt.fa
ADD COMMENT
0
Entering edit mode

I provided IDS in a text file, one ID per line. I updated the post.

ADD REPLY
0
Entering edit mode
22 months ago
GenoMax 141k

Using filterbyname.sh from BBMap suite. names can point to a file with ID. One per line.

$ more test.fa
>cds-123 gene=gene-ABC1 name=ABC1 seq_id=123
GATCGGA
>rna-123 gene=gene-ABC1 name=ABC1 seq_id=123
GATCGGAGGAG
>exon-123-1 transcript=rna-123 gene=gene-ABC1 name=ABC1 seq_id=123
GATCGG
>cds-456 gene=gene-DEF1 name=DEF1 seq_id=456
GACCGACAG
>rna-456 gene=gene-DEF1 name=DEF1 seq_id=456
GACCGACAGGACC
>exon-456-1 transcript=rna-456 gene=gene-DEF1 name=DEF1 seq_id=456
GACCGA
>lcl|NC_046980.1_cds_XP_034407502.1_23875 [gene=LOC117743771] [db_xref=GeneID:117743771] [protein=sprouty-related, EVH1 domain-containing protein 2-like] [protein_id=
XP_034407502.1] [location=join(2056311..2056336,2070742..2070922,2071008..2071176,2071587..2071651,2075371..2075529,2075652..2076434)] [gbkey=CDS]
AGCTAGCTAGCTCGAGC

$ filterbyname.sh -Xmx2g in=test.fa out=stdout.fa names=XP_034407502 substring=t include=t


Input is being processed as unpaired
>lcl|NC_046980.1_cds_XP_034407502.1_23875 [gene=LOC117743771] [db_xref=GeneID:117743771] [protein=sprouty-related, EVH1 domain-containing protein 2-like] [protein_id=XP_034407502.1] [location=join(2056311..2056336,2070742..2070922,2071008..2071176,2071587..2071651,2075371..2075529,2075652..2076434)] [gbkey=CDS]
AGCTAGCTAGCTCGAGC
Time:                           0.100 seconds.
Reads Processed:           7    0.07k reads/sec
Bases Processed:          69    0.00m bases/sec
Reads Out:          1
Bases Out:          17
ADD COMMENT
2
Entering edit mode

Or Else,

Is their any sed or awk command to edit the sub-sequences headers in a fasta file.

Original headers in a file
>XP_034380974.1 alpha-protein kinase 3-like [acia acia]
>XP_034380975.1 glutamyl aminopeptidase [acaia acia]

Modified headers in a file
>XP_034380974.1 
>XP_034380975.1

Is it possible to delete the sub-sequences

ADD REPLY
4
Entering edit mode

You can simply cut -f1 -d " " file.fa if your headers are like what you show.

ADD REPLY
0
Entering edit mode

if it was a bigger string, what I'd be the command ?

**Original headers in a file**

>lcl|NC_046980.1_cds_XP_034407502.1_23875 [gene=LOC117743771] [db_xref=GeneID:117743771] [protein=sprouty-related, EVH1 domain-containing protein 2-like] [protein_id=
XP_034407502.1] [location=join(2056311..2056336,2070742..2070922,2071008..2071176,2071587..2071651,2075371..2075529,2075652..2076434)] [gbkey=CDS]
AGCTAGCTAGCTCGAGC

**Modified headers in a file**
>[protein_id=XP_034407502.1] 
ADD REPLY

Login before adding your answer.

Traffic: 2672 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6