Question

How to extract fasta subsequences from a multiline fasta file which has very long headers ?

2

Entering edit mode

22 months ago

sunnykevin97 ▴ 980

HI,

The multiline fasta file, comprises of 38667 sub-sequences,

Each header starts with the ">" and has very long headers.

How do I extract the some sub-sequences using a specific keyword

examples - ID's  (one ID per line)

XP_034407502.1
XP_034416580.1
XP_034403031.1

I had 5000 ID's in a text file.

I tried with faSomeRecords, generates an empty file

Before computing faSomeRecords, I removed ">" in IDs.txt file

faSomeRecords lumpsC.fa IDs.txt  IDs.txt.fa 

ls -lh IDs.txt.fa
-rw-rw-r-- 1 sun sun 0 Jun 22 05:11 IDs.txt.fa

I know we can do it by grep, its quite hard to do it manually for 5000 ID's. Some suggestions please.

grep "XP_034407502.1" lumpsC.fa > retreive_IDs_IDs.txt 

cat retreive_IDs_IDs.txt 

>lcl|NC_046980.1_cds_XP_034407502.1_23875 [gene=LOC117743771] [db_xref=GeneID:117743771] [protein=sprouty-related, EVH1 domain-containing protein 2-like] [protein_id=XP_034407502.1] [location=join(2056311..2056336,2070742..2070922,2071008..2071176,2071587..2071651,2075371..2075529,2075652..2076434)] [gbkey=CDS]

gene genome protein • 1.1k views

ADD COMMENT • link 22 months ago by sunnykevin97 ▴ 980

score 0 · Answer 1 · 2022-06-22

0

Entering edit mode

22 months ago

d-cameron ★ 2.9k

For extracting the sequences, I use samtools faidx. For extracting the headers, I also use samtools faidx but with some linux filtering on the .fai file before passing the sequences of interest to samtools faidx.

ADD COMMENT • link 22 months ago by d-cameron ★ 2.9k

score 0 · Answer 2 · 2022-06-22

0

Entering edit mode

22 months ago

Mensur Dlakic ★ 27k

The list of IDs should be one ID per line. If this is what you have:

XP_034407502.1,XP_034416580.1,XP_034403031.1

it will not work. I suggest you convert your file to one entry per line, something like this:

tr "\," "\n" < IDs.txt > IDs_new.txt

After that:

faSomeRecords lumpsC.fa IDs_new.txt IDs.txt.fa

ADD COMMENT • link 22 months ago by Mensur Dlakic ★ 27k

0

Entering edit mode

I provided IDS in a text file, one ID per line. I updated the post.

ADD REPLY • link 22 months ago by sunnykevin97 ▴ 980

score 0 · Answer 3 · 2022-06-22

0

Entering edit mode

22 months ago

GenoMax 141k

Using filterbyname.sh from BBMap suite. names can point to a file with ID. One per line.

$ more test.fa
>cds-123 gene=gene-ABC1 name=ABC1 seq_id=123
GATCGGA
>rna-123 gene=gene-ABC1 name=ABC1 seq_id=123
GATCGGAGGAG
>exon-123-1 transcript=rna-123 gene=gene-ABC1 name=ABC1 seq_id=123
GATCGG
>cds-456 gene=gene-DEF1 name=DEF1 seq_id=456
GACCGACAG
>rna-456 gene=gene-DEF1 name=DEF1 seq_id=456
GACCGACAGGACC
>exon-456-1 transcript=rna-456 gene=gene-DEF1 name=DEF1 seq_id=456
GACCGA
>lcl|NC_046980.1_cds_XP_034407502.1_23875 [gene=LOC117743771] [db_xref=GeneID:117743771] [protein=sprouty-related, EVH1 domain-containing protein 2-like] [protein_id=
XP_034407502.1] [location=join(2056311..2056336,2070742..2070922,2071008..2071176,2071587..2071651,2075371..2075529,2075652..2076434)] [gbkey=CDS]
AGCTAGCTAGCTCGAGC

$ filterbyname.sh -Xmx2g in=test.fa out=stdout.fa names=XP_034407502 substring=t include=t


Input is being processed as unpaired
>lcl|NC_046980.1_cds_XP_034407502.1_23875 [gene=LOC117743771] [db_xref=GeneID:117743771] [protein=sprouty-related, EVH1 domain-containing protein 2-like] [protein_id=XP_034407502.1] [location=join(2056311..2056336,2070742..2070922,2071008..2071176,2071587..2071651,2075371..2075529,2075652..2076434)] [gbkey=CDS]
AGCTAGCTAGCTCGAGC
Time:                           0.100 seconds.
Reads Processed:           7    0.07k reads/sec
Bases Processed:          69    0.00m bases/sec
Reads Out:          1
Bases Out:          17

ADD COMMENT • link 22 months ago by GenoMax 141k

2

Entering edit mode

Or Else,

Is their any sed or awk command to edit the sub-sequences headers in a fasta file.

Original headers in a file
>XP_034380974.1 alpha-protein kinase 3-like [acia acia]
>XP_034380975.1 glutamyl aminopeptidase [acaia acia]

Modified headers in a file
>XP_034380974.1 
>XP_034380975.1

Is it possible to delete the sub-sequences

ADD REPLY • link 22 months ago by sunnykevin97 ▴ 980

4

Entering edit mode

You can simply cut -f1 -d " " file.fa if your headers are like what you show.

ADD REPLY • link 22 months ago by GenoMax 141k

0

Entering edit mode

if it was a bigger string, what I'd be the command ?

**Original headers in a file**

>lcl|NC_046980.1_cds_XP_034407502.1_23875 [gene=LOC117743771] [db_xref=GeneID:117743771] [protein=sprouty-related, EVH1 domain-containing protein 2-like] [protein_id=
XP_034407502.1] [location=join(2056311..2056336,2070742..2070922,2071008..2071176,2071587..2071651,2075371..2075529,2075652..2076434)] [gbkey=CDS]
AGCTAGCTAGCTCGAGC

**Modified headers in a file**
>[protein_id=XP_034407502.1]

ADD REPLY • link 22 months ago by sunnykevin97 ▴ 980