Excluding sequences from fasta based on a id list - faSomeRecords not working
2
0
Entering edit mode
4 months ago

Hello! So, first of all, I have already read and tried several solutions exposed on Biostars and other foruns for my issue, but none of them seem to work. The ones I have most invested my time are faSomeRecords and SeqFilter. The problems I had with them are purely technical (python version, several modules needed, pipeline dont work or works the wrong way because the inpired version author forgot to put some extra line codes) and because of my lack of knowledge to solve them, i gave up

I have several limitations because i'm a newbie with a short deadline and I use my university server, so a not very complicated and quick solution would be nice.

So......... Here's what I have to do:

I have a fasta file with several transcripts sequences

>TCONS_00002379 gene=ENSBTAG00000006648
ATGAATTGCAGCACGCCAGGCCTCCCTGTCCATCACCAACTCCTGGAGTTCACCCAGACTCACATCCATC.......

>TCONS_00000007 gene=RCAN1
GACAGCTTCTGGTAAAGGAACTCCATCCACTTGGGGCTCGACTGCGGGAGTCGCTGTAGCTCTCACTGCC.....

And I have a list with the IDs of transcripts that i dont want my output to have, so i want the ID's sequences out of my output file :

TCONS_00002379 gene=ENSBTAG00000006648
TCONS_00000007 gene=RCAN1
TCONS_00002389 gene=RCAN1
TCONS_00002405 gene=ITSN1
TCONS_00002406 gene=ITSN1
TCONS_00002407 gene=ITSN1
TCONS_00002408 gene=ITSN1
TCONS_00002409 gene=ITSN1
TCONS_00002410 gene=ITSN1
TCONS_00002411 gene=ITSN1
TCONS_00002412 gene=ITSN1
TCONS_00002413 gene=ITSN1
TCONS_00000015 gene=CRYZL1

Currently im trying to run a pipeline with seqtk which I got from a post here, but its taking a long time, and i dont even know if its going to work:

seqtk subseq results_15m_200.fa $(grep ">" results_15m_200.fa | tr -d ">" | grep -v -w -f transcripts.with.orf_15m_full.name.txt) > finalfile_without_orf_15m.fa The solution you'll give me can also be on R!! Thanks in advance filter faSomeRecords fasta • 801 views ADD COMMENT 0 Entering edit mode what is the output of grep ">" results_15m_200.fa | tr -d ">" | grep -v -w -f transcripts.with.orf_15m_full.name.txt. input:$ cat file.txt
TCONS_00002379 gene=ENSBTAG00000006648
TCONS_00002389 gene=RCAN1
TCONS_00002413 gene=ITSN1
TCONS_00000015 gene=CRYZL1

$cat file.fa >TCONS_00002379 gene=ENSBTAG00000006648 ATGAATTGCAGCACGCCAGGCCTCCCTGTCCATCACCAACTCCTGGAGTTCACCCAGACTCACATCCATC....... >TCONS_00000007 gene=RCAN1 GACAGCTTCTGGTAAAGGAACTCCATCCACTTGGGGCTCGACTGCGGGAGTCGCTGTAGCTCTCACTGCC..... output:$ seqtk subseq file.fa file.txt
>TCONS_00002379 gene=ENSBTAG00000006648
ATGAATTGCAGCACGCCAGGCCTCCCTGTCCATCACCAACTCCTGGAGTTCACCCAGACTCACATCCATC.......

$faSomeRecords -exclude file.fa file.txt out.fa$  cat out.fa
>TCONS_00000007 gene=RCAN1
GACAGCTTCTGGTAAAGGAACTCCATCCACTTGGGGCTCGACTGCGGGAGTCGCTGTAGCTCTCACTGCC.....
0
Entering edit mode

Thank you for your input!! how could i use it without faSomeRecords? im not being able to make mine run, I have already tried several solutions....

0
Entering edit mode

Actually, if any of you could help me with solutions for faSomeRecords, I would be appreciate.

I downloaded the linux version for my machine but it is not working. I have already tried different Linux versions and even thought there was something wrong with my files, but they seem right. Can you give me some light on how should I proceed? My command line is:

/media/disk4/gopec/cracco/lncRNA/15m/faSomeRecords –exclude results_15m_200.fa transcripts.with.orf_15m_full.name.txt finalfile_without_orf_15m.fa

And when I run it I just get:

faSomeRecords - Extract multiple fa records
usage:
faSomeRecords in.fa listFile out.fa
options:
-exclude - output sequences not in the list file.

The output file is not even created and I cannot find a solution for this online.

0
Entering edit mode

but it is not working

Just saying this does not help you or us. We have shown you that the program works as expected. You are able to get the in-line help to print to screen so it means that the program is executing fine on your machine. So that leaves your input data as the problem.

As asked by @shenwei365 did you manipulate any of the files on a machine other than a linux server at any time (e.g. on windows or macOS). If so that file(s) could have a problem with line endings. Those can be fixed with dos2unix command.

What does cat -vet transcripts.with.orf_15m_full.name.txt | head -4 show?

0
Entering edit mode

Sorry, i missed those questions!! None of the files were manipulated on windows, I did all through linux on my uni's server.

The output of this command grep ">" results_15m_200.fa | tr -d ">" | grep -v -w -f transcripts.with.orf_15m_full.name.txt never came out. After I press enter, nothing happen, and yesterday, with some hope, i runned it with nohup and &.

Now the output for cat -vet transcripts.with.orf_15m_full.name.txt | head -4 is:

TCONS_00002379 gene=ENSBTAG00000006648$TCONS_00000007 gene=RCAN1$
TCONS_00002389 gene=RCAN1$TCONS_00002405 gene=ITSN1$
1
Entering edit mode

If you get nothing from the grep command then that tells us that there is some discrepancy between your ID list and the original fasta headers. If I run that command on my toy example I get

$grep ">" test.fa | tr -d ">" | grep -v -w -f list NC_053005.1 UNVERIFIED: Enterobacter phage KNP7 genomic sequence test2 test3 Which shows the fasta headers that are not common between input fasta and file with list of headers that need to be removed. In other words sequences that will be kept in final output. As for the cat output your line endings do look to be in proper unix format without any additional characters. ADD REPLY 1 Entering edit mode 4 months ago GenoMax 115k Not sure why you can't get faSomeRecords to work. Here is an example.$ more test.fa
>NC_053005.1 UNVERIFIED: Enterobacter phage KNP7 genomic sequence
CCCGTTAATGACGATGGTGTACAGTTCTCTCGGCCAGACGCATATTCGTCAAACGCCGCCGTCGCGGTAACGCCGATCTC
>test2
ACATTGACATTTCCAAAGTCAGCACACGTCTGGCGTCGATTGAAGCGGTTAACGTCCGGACGATGGCCCGAAATCCGTTG
>test3
AGGCGGCCTGTGGCGTTTCAACACGTTGCAGATGGTGTCTGACCCGGCGGACGACGCACGTTGGAACCTTACAAAACTGG
>TCONS_00002379 gene=ENSBTAG00000006648
ATGAATTGCAGCACGCCAGGCCTCCCTGTCCATCACCAACTCCTGGAGTTCACCCAGACTCACATCCATC
>TCONS_00000007 gene=RCAN1
GACAGCTTCTGGTAAAGGAACTCCATCCACTTGGGGCTCGACTGCGGGAGTCGCTGTAGCTCTCACTGCC

# Need to make sure that ">" is removed from the fasta headers when you put them in list file

$more list TCONS_00002379 gene=ENSBTAG00000006648 TCONS_00000007 gene=RCAN1$ faSomeRecords -exclude test.fa list outtest

\$ more outtest
>NC_053005.1 UNVERIFIED: Enterobacter phage KNP7 genomic sequence
CCCGTTAATGACGATGGTGTACAGTTCTCTCGGCCAGACGCATATTCGTCAAACGCCGCCGTCGCGGTAACGCCGATCTC
>test2
ACATTGACATTTCCAAAGTCAGCACACGTCTGGCGTCGATTGAAGCGGTTAACGTCCGGACGATGGCCCGAAATCCGTTG
>test3
AGGCGGCCTGTGGCGTTTCAACACGTTGCAGATGGTGTCTGACCCGGCGGACGACGCACGTTGGAACCTTACAAAACTGG
0
Entering edit mode

Thank you for your comment!!! Thats the way I was using it, and I have no idea either why it was not working. I have even sent an email for dr. kent to see if he could give me some light.

Thank you once again

0
Entering edit mode

Does the ID file created in Windows? Try "dos2unix id.txt".

0
Entering edit mode

No, it wasnt!! I did everything through linux. I just tried this on my Id file, but it didnt work as well. I keep getting the same output:

faSomeRecords - Extract multiple fa records
usage:
faSomeRecords in.fa listFile out.fa
options:
-exclude - output sequences not in the list file.

But thank you

1
Entering edit mode
4 months ago
GenoMax 115k

If you want to try a different solution then filterbyname.sh from BBMap suite will do this as well.

filterbyname.sh -Xmx4g in=results_15m_200.fa out=filtered.fa names=transcripts.with.orf_15m_full.name.txt