Question

How do I use a regex to remove a line from a FASTA file if it has more than 6 'A' or 6 'T' sequentially present in a line?

1

Entering edit mode

7.0 years ago

inkprs ▴ 70

I am looking for a solution in both Bash/Linux and Python.

Also,

If more than 40%[Aa]’s or [Tt]’s present, remove the sequence and the description.

e.g. file.fasta

>seq1
GGCAGAGGCCCCCTAGCCCCGCCCGCGCCATGGTCAGGCACGCCCCTCCTCATCGCGGGGCACAGCCCGGCGGGTAGCCCCAGCGCTGGAGGCGGGCGGGGCCGGCCGGCGGAGGCCTGAGCAGCAGCCCAGCGCGGGCCGCCGAGACACCATGAGAGCCCCCACACTCCTCGCCCCACCGGCCCTGGCCGCACTGGGCACCGCTGGCCGGGCGGGTGGGTGCCCC

>seq2
CCACTGCACTCACCGCACCCGGCCAATTTTTTTTGTGTTTTTAGTAGAGACTAAATACCATATAGTGAACACCTAAGACGGGGGGCCTTGGATCCAGGGCGATTCAGAGGGCCCCGGTCGGAGCTGTCGGAGATTGAGCGCGCGCGGTCCCGGGATCTCCGACGAGGCCCTGGACCCCCGGGCGGCGAAGCTGCGGCGCGGCGCCCCCTGGAGGCCGCGGGACCCCTG

After processing, output should be:

>seq1
GGCAGAGGCCCCCTAGCCCCGCCCGCGCCATGGTCAGGCACGCCCCTCCTCATCGCGGGGCACAGCCCGGCGGGTAGCCCCAGCGCTGGAGGCGGGCGGGGCCGGCCGGCGGAGGCCTGAGCAGCAGCCCAGCGCGGGCCGCCGAGACACCATGAGAGCCCCCACACTCCTCGCCCCACCGGCCCTGGCCGCACTGGGCACCGCTGGCCGGGCGGGTGGGTGCCCC

seq1 is fine but seq2 and its description have to be removed from a file.

e.g.

AGAGCTAGAAGGGG - ok

AGAAAAAAAAGAGG - remove -- more than 6 A sequentially

ATTTTTTTTATGATG - remove -- more than 6 T sequentially

AGTAGTTAGGGGGG - ok

ACAGAAACAGAATG - remove -- more than 40% of A

TCTGATTTATTATTG - remove -- more than 40% of T

Are there any tools available on the market for these purposes? I would like to get a solution via tool as well use python/Linux.

FASTA FASTQ sequencing • 2.3k views

ADD COMMENT • link 7.0 years ago by inkprs ▴ 70

1

Entering edit mode

grep  -iE '(A{6,}|T{6,})'

ADD REPLY • link 7.0 years ago by Pierre Lindenbaum 161k

1

Entering edit mode

7.0 years ago

Brian Bushnell 20k

BBDuk will handle hexamers in single-line or multi-line fasta:

bbduk.sh in=file.fa out=filtered.fa k=6 mm=f literal=AAAAAA

You can also add "mingc=0.6" to remove anything with more than 40% A or T combined. There is no flag for A or T individually. However, you could accomplish filtering based on 40% of the bases being A or 40% of the bases being T individually, like this:

reformat.sh in=file.fa out=AC.fa remap=AC
reformat.sh in=file.fa out=TC.fa remap=TC
reformat.sh in=AC.fa out=filteredAC.fa maxgc=0.6
reformat.sh in=TC.fa out=filteredTC.fa maxgc=0.6
cat filteredAC.fa filteredTC.fa > filteredACTC.fa
filterbyname.sh in=file.fa out=filtered.fa names=filteredACTC.fa exclude

That first changes all of the A's to C's, for example, so that only T will be counted as AT content, then does the gc-filtering at 60%, and does the same again for T to C. Then the names of the resulting files are used by filterbyname to exclude all the sequences with at least 40% of A or 40% of T.

ADD COMMENT • link 7.0 years ago by Brian Bushnell 20k

1

Entering edit mode

7.0 years ago

Alex Reynolds 35k

You could linearize the FASTA, then pipe to a second awk statement to test the regex and the overall A/T content:

$ awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fa | awk '{ n=split($2,s,""); a=0; t=0; for(l in s){ if(s[l]=="A" || s[l]=="a") {a+=1;} if (s[l]=="T" || s[l]=="t") {t+=1;}} if (!match($2, /(A{6,}|T{6,})/) && (a/n < 0.4) && (t/n < 0.4)) { print $0; }}' | tr "\t" "\n"
>seq1
GGCAGATTCCCCCTAGACCCGCCCGCACCATGGTCAGGCATGCCCCTCCTCATCGCTGGGCACAGCCCAGAGGGTATAAACAGTGCTGGAGGCTGGCGGGGCAGGCCAGCTGAGTCCTGAGCAGCAGCCCAGCGCAGCCACCGAGACACCATGAGAGCCCTCACACTCCTCGCCCTATTGGCCCTGGCCGCACTTTGCATCGCTGGCCAGGCAGGTGAGTGCCCC

ADD COMMENT • link 7.0 years ago by Alex Reynolds 35k

1

Entering edit mode

7.0 years ago

shenwei356 8.5k

An ugly but working one-line solution supporting both Windows and Linux using seqkit and csvtk:

$ cat seqs.fa 
>ok
AGAGCTAGAAGG
>remove
AGAAAAAAAAGG 
>remove
ATTTTTTTTAGGA
>ok
AGTAGTTAGGGG

$ seqkit grep -s -i -r -p "\"A{6,}\"" -p "\"T{6,}\"" -v  seqs.fa | seqkit fx2tab -B A,T | csvtk filter2 -H -t -f "! (\$4 > 40 || \$5 > 40)" | seqkit tab2fx
>ok
AGTAGTTAGGGG

The result is right as Vijay Lakhujani's.

The ugly "\"A{6,}\"" is for compatibility in Windows, the more readable version in Linux/OS X:

seqkit grep -s -i -r -p '"A{6,}"' -p '"T{6,}"' -v  seqs.fa \
    | seqkit fx2tab -B A,T \
    | csvtk filter2 -H -t -f '!($4 > 40 || $5 > 40)' \
    | seqkit tab2fx

Explain:

excluding sequences with 6'A' or 6'T' by searching sequences with regular expression.
converting FASTA format to tabular format, custom base content of A and T are outputted in 4th and 5th columns.
excluding sequences with A/T of >40%. awk can do this too.
converting tabular format back to FASTA format

ADD COMMENT • link 7.0 years ago by shenwei356 8.5k

score 2 · Accepted Answer · 2017-05-17

2

Entering edit mode

7.0 years ago

lakhujanivijay 5.8k

A pythonic way to achieve the same:

from Bio import SeqIO
import getopt, sys, re

def usage():
    print "Usage: remove_seqs.py -i <input_fasta> -o <output_fasta>"

try:
    options, remainder=getopt.getopt(sys.argv[1:], 'i:o:h')

except getopt.GetoptError as err:
    print str(err)
    usage()
    sys.exit()

for opt, arg in options:
    if opt in ('-i'):
        input_file=arg
    if opt in ('-h'):
        usage()
    sys.exit()
    elif opt in ('-o'):
        output_file=arg

out=open(output_file, 'w')

for record in SeqIO.parse(input_file, "fasta"):
    # check for 6 or more A's or T's
    search_hexamers=re.search('A{6,}|T{6,}',str(record.seq))
    # remove those sequences
    if not search_hexamers:
        # Calculate percentage of A's or T's 
        a = (sum(record.seq.count(x) for x in ['A'])/float(len(str(record.seq)))) * 100
        t = (sum(record.seq.count(x) for x in ['T'])/float(len(str(record.seq)))) * 100
        # Check of percentage of A's or T's is more than 40 % and remove
        if not a > 40 or t > 40:
            out.write(">"+record.id+"\n"+str(record.seq))
out.close()

USAGE

remove_seqs.py -i <input_fasta> -o <output_fasta>

Dependency: Biopython

Input:

>ok
AGAGCTAGAAGG
>remove
AGAAAAAAAAGG 
>remove
ATTTTTTTTAGGA
>ok
AGTAGTTAGGGG

Output

>ok
AGTAGTTAGGGG

Note:

>ok
AGAGCTAGAAGG

above sequence was removed as the percent of A's crossed 40%

ADD COMMENT • link 7.0 years ago by lakhujanivijay 5.8k

0

Entering edit mode

Hi,

It is deleting some portion of id part in result FASTA file. All result fasta file has id portion being truncated.

e.g. 1:N:0:TACGCATT has been deleted from below id: (see last portion been deleted)

>ABCDEFG01:240:B43AWACQX:1:1301:1421:1128 1:N:0:TACGCATT

ADD REPLY • link 7.0 years ago by inkprs ▴ 70

0

Entering edit mode

Kindly replace this

out.write(">"+record.id+"\n"+str(record.seq))

with this

out.write(">"+record.description+"\n"+str(record.seq))

ADD REPLY • link 7.0 years ago by lakhujanivijay 5.8k

0

Entering edit mode

That works perfectly, thanks.

What regex should I use if I want to search for a sequence of more than 6A's and 6T's only in beginning and end of the string?

 from: search_hexamers=re.search('A{6,}|T{6,}',str(record.seq))
 to: ?

ADD REPLY • link 7.0 years ago by inkprs ▴ 70