Question: Remove the last few bases from fasta sequences
1
gravatar for jjrin
17 days ago by
jjrin10
jjrin10 wrote:

Hello

I was looking through my created fasta files and realized that the number of bases are consistently 6 above the number that it should be according to the coordinates. I am not a strong scripter so how would I remove the last 6 bases from every sequence in my fasta file? Note: each line in the sequence is on a new line so I can't just remove 6 from the end of every other, I would need to iterate through it as a fasta file format.

>gene1
CACTGATTTGAGTTTTTTTCATAAATCAGAAACCGTTTGAATTATAAAAA
AAAAAACCTCACACAGCTCAAACTCAACCTCTGAAAAATAACCCCCAAAG
TGTAGCACTTCTAGCCCAATTCTTTAAACTTTGGTTGAAGGCTTCTGCAT
AGAGACGCGGGAAAGACAGTTTTACTGTTTAGCACTCTATGGAGCAACAT
CTGTAGCAACACTACTGGGGGGCCAGCGCGAGTATATGGACACAAAACAT
CATGTAGTGTAGGATTTCTGAATAGCAATACACCCTTTGTGGTGATGTAA
CAATAAAGGAAAGGGCATATTTTTGATGATCATGAGGTGTAGCCCCT

I would like to remove "GCCCCT" from the end. This is a multifasta file so there are hundreds of similar cases within the same fasta file. Thank you!

rna-seq fasta • 200 views
ADD COMMENTlink modified 14 days ago by Charles Plessy2.1k • written 17 days ago by jjrin10
1

each line in the sequence is on a new line so I can't just remove 6 from the end of every other

You can linearize your fasta to make it a single line fasta instead of multi-line fasta.

ADD REPLYlink written 17 days ago by st.ph.n1.1k
1

This would be most trivial with a Biopython script. You should have a look at the tutorial and cookbook. Feel free to ask for help if you get stuck (but show us the code and what goes wrong). For sure there are also multiple other solutions.

ADD REPLYlink modified 17 days ago • written 17 days ago by WouterDeCoster20k

I went over my data again and realized that it's not consistent with 6 bases being out of the coordinates but rather a certain number for each gene (that isn't consistent with all of the bases). How would I go about writing a script that determines how many should be in the sequence by subtracting these two coordinates and then splicing out any bases are past the threshold.

For example:

>MSTRG.5.1_Scaffold100_65415_65755_+
TATATTTATTAATTTATAAAAGTGTAATGATAACTACTTTTAAGACAGCT
GACACATTGTTTCATCCCTAGCATGGCAGGCCTAAGAGATATGTTCATCA
TGGGAAAGCTGTCCTCATTAGAGTCCCTCATCATACAGGAGAAGTCCCAC
ATCACACACACTATTCTTCCATTACTCACACGTCACATGAAGAAAACAGA
CCTGAAGGGTTCAGGACTGGACGTCCCTTGCTACCCATAAAAAATGAGGT
GAGCAAAGCAATTAGCATTTTGACCATCTATGGGAGTGACTAACTATAAG
AGTGACCATCTATAGGAGTGACCAACTATAAGAGTGATCAT

If you subtract 65755 and 65415 you get 340 but there are actually 346 bases here. How would I splice out bases based on the number from the coordinates.

ADD REPLYlink modified 17 days ago • written 17 days ago by jjrin10
1

I think you should try to understand why this is happening, instead of blindly removing some bases. What if the leading bases are to be removed? What if no bases are to be removed? What if the numbers are wrong, not the sequences?

Which brings us the question, how have these files been created?

ADD REPLYlink modified 17 days ago • written 17 days ago by h.mon7.5k

These bases were created using galaxy and a tool called "Extract Genomic DNA". I contacted the writers of it and they said that the tool is likely buggy and that the additional bases are probably due to overlapping reads/exons. So once I spliced out exons and looked at just transcripts I still have a couple of additional bases left, that are just the next couple few from the reference genome. For now, it would be best if I just had the bases that are contained within the coordinates.

ADD REPLYlink written 17 days ago by jjrin10

not consistent with 6 bases being out of the coordinates but rather a random number for each gene.

I'm not sure if I understand what's going on. But if you can somehow deduce how many nucleotides need to be removed you can put that too in the script for cutting. It obviously makes matters a bit more complicated.

ADD REPLYlink written 17 days ago by WouterDeCoster20k

The number of bases that need to be removed are the additional ones that aren't within the coordinates "reading frame". So if I have coordinates of 0-700 and have 750 bases, the last 50 need to be spliced out. I'm sorry for the confusion. So I have coordinates within the name of the gene how would I splice out genes based on it?

ADD REPLYlink written 17 days ago by jjrin10
1

So the script would read the header of the fasta, determine the desired number of nucleotides and remove everything that is longer than that, and then move on to the next record.

ADD REPLYlink written 17 days ago by WouterDeCoster20k
1

my fasta file

cat test.fa
>seq
actgactgactgn

fasta file without last 3 bases

seqkit subseq -r 1:-4 test.fa
>seq
actgactgact

fasta file without last 7 bases

seqkit subseq -r 1:-8 test.fa
>seq
actgact

works on fasta file with multiple sequences/records.

ADD REPLYlink modified 14 days ago • written 14 days ago by cpad01121.1k
1
gravatar for st.ph.n
17 days ago by
st.ph.n1.1k
Philadelphia, PA
st.ph.n1.1k wrote:

If you have the gene name, and they match the header in the fasta file, it is still fairly trivial to use python or biopython to trim your sequences.

  1. Linearize your fasta file from multi-line to single line. Pierre's awk statement is the easiest option in the link posted in my comment above. This will make it easier to work with.
  2. Make a file, tab-delimited with the gene name, start coordinate, and end coordinate
  3. Save the code below as trim_fasta.py, and run as trim_fasta.py coords.txt input.fasta output.fasta

See code below:

#!/usr/bin/env python
import sys

    with open(sys.argv[1], 'r') as f:
        coords = {}
        for line in f:
            coords[line.strip().split('\t')[0]] = (int(line.strip().split('\t')[1]), int(line.strip().split('\t')[2]))

    with open(sys.argv[2], 'r') as fasta: 
        with open(sys.argv[3], 'w') as out:
            for line in fasta:
                if line.startswith('>'):
                    header = line.strip().split('>')[1]
                    out.write('>' + header + '\n' + next(fasta).strip()[coords[header][0]:coords[header[1]])
ADD COMMENTlink modified 16 days ago • written 17 days ago by st.ph.n1.1k

I added the See code below: line to make sure the markdown was rendered correctly between your numbered list and code block.

ADD REPLYlink written 17 days ago by WouterDeCoster20k

Shouldn't the second for line in f: be for line in fasta?

ADD REPLYlink written 17 days ago by WouterDeCoster20k

It should. I changed it. Thanks for editing the code block, the 101010 button wasn't working correctly. I'll leave it to the OP to figure out how it works, since I didn't put comments in.

ADD REPLYlink modified 17 days ago • written 17 days ago by st.ph.n1.1k

Thank you so much! I had to fix the last line and include some brackets. I also created the linearized original fasta file.

But now I run into this error:

Traceback (most recent call last):
  File "/Users/margaretsaha/Desktop/trim_fasta.py", line 13, in <module>
    out.write('>' + header + '\n' + next(f).strip()[coords[header][0]:coords[header][1]])
ValueError: I/O operation on closed file

Do you have an idea what could be wrong?

This is the format of my original fasta file (but it is linearized, just ignore the bases' formatting right now, this is more for the header):

>MSTRG.5.1_Scaffold100_65415_65755_+
TATATTTATTAATTTATAAAAGTGTAATGATAACTACTTTTAAGACAGCT
GACACATTGTTTCATCCCTAGCATGGCAGGCCTAAGAGATATGTTCATCA
TGGGAAAGCTGTCCTCATTAGAGTCCCTCATCATACAGGAGAAGTCCCAC
ATCACACACACTATTCTTCCATTACTCACACGTCACATGAAGAAAACAGA
CCTGAAGGGTTCAGGACTGGACGTCCCTTGCTACCCATAAAAAATGAGGT
GAGCAAAGCAATTAGCATTTTGACCATCTATGGGAGTGACTAACTATAAG
AGTGACCATCTATAGGAGTGACCAACTATAAGAGTGATCAT
ADD REPLYlink modified 16 days ago • written 16 days ago by jjrin10
1

You're right, I missed a closing bracket. The issue is next(f) should be next(fasta), see edit.

If this answer solves your problem, please accept.

ADD REPLYlink written 16 days ago by st.ph.n1.1k
1
gravatar for Charles Plessy
14 days ago by
Charles Plessy2.1k
Japan
Charles Plessy2.1k wrote:

If you do not mind that the number of bases per lines gets changed, you can use the seqret command from the EMBOSS package.

$ cat gene1.fa 
>gene1
CACTGATTTGAGTTTTTTTCATAAATCAGAAACCGTTTGAATTATAAAAA
AAAAAACCTCACACAGCTCAAACTCAACCTCTGAAAAATAACCCCCAAAG
TGTAGCACTTCTAGCCCAATTCTTTAAACTTTGGTTGAAGGCTTCTGCAT
AGAGACGCGGGAAAGACAGTTTTACTGTTTAGCACTCTATGGAGCAACAT
CTGTAGCAACACTACTGGGGGGCCAGCGCGAGTATATGGACACAAAACAT
CATGTAGTGTAGGATTTCTGAATAGCAATACACCCTTTGTGGTGATGTAA
CAATAAAGGAAAGGGCATATTTTTGATGATCATGAGGTGTAGCCCCT

$ seqret -filter gene1.fa[:-7]
>gene1
CACTGATTTGAGTTTTTTTCATAAATCAGAAACCGTTTGAATTATAAAAAAAAAAACCTC
ACACAGCTCAAACTCAACCTCTGAAAAATAACCCCCAAAGTGTAGCACTTCTAGCCCAAT
TCTTTAAACTTTGGTTGAAGGCTTCTGCATAGAGACGCGGGAAAGACAGTTTTACTGTTT
AGCACTCTATGGAGCAACATCTGTAGCAACACTACTGGGGGGCCAGCGCGAGTATATGGA
CACAAAACATCATGTAGTGTAGGATTTCTGAATAGCAATACACCCTTTGTGGTGATGTAA
CAATAAAGGAAAGGGCATATTTTTGATGATCATGAGGTGTA
ADD COMMENTlink written 14 days ago by Charles Plessy2.1k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1175 users visited in the last hour