Question: Fastq Trimmer by pattern
0
gravatar for dzisis1986
2.7 years ago by
dzisis198620
dzisis198620 wrote:

I have a set of fastq files and I want to trim from the begining until the second restriction site (GTAC) found in the read. Is there any simple way to do it in terminal ?

Example:

Original read :

@HISEQ:267:CAAV9ANXX:4:1101:17780:2335 1:N:0:CGATGT TTCTTAAGATAGTGACAATAAGATCTGCTCTTGCTCACCTAGGTACCCGGACCACCAATCAGCAATCTCAGTTATCCTTTAATTTCTTTCGTTAGATCGGAAGAGCACACGTCTGAACTCCAGTC +

BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFBFFFFFFFFFFFFFFFFFFF

What I want as output :

@HISEQ:267:CAAV9ANXX:4:1101:17780:2335 1:N:0:CGATGT

TTCTTAAGATAGTGACAATAAGATCTGCTCTTGCTCACCTAGGTAC

+

BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

terminal trimming fastq reads • 1.8k views
ADD COMMENTlink modified 2.7 years ago • written 2.7 years ago by dzisis198620
4
gravatar for Pierre Lindenbaum
2.7 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum123k wrote:

Using awk:

cat input.fq | paste - - - - | awk -F '\t' '{i=index($2,"GTAC"); OFS="\t";$2=(i==0?$2:substr($2,1,i+3));$4=(i==0?$4:substr($4,1,i+3)); print;}' | tr "\t" "\n"
ADD COMMENTlink written 2.7 years ago by Pierre Lindenbaum123k

if want to cut also a part from the start from AGATCT until the GTAC in order to have finally the read like that :

@HISEQ:267:CAAV9ANXX:4:1101:17780:2335 1:N:0:CGATGT

AGATCTGCTCTTGCTCACCTAGGTAC

+

BBBBBFFFFFFFFFFFFFFFFFFFFFFF

I should run one more time the awk with AGATCT ?? Thank you in advance for your valuable help !!!

ADD REPLYlink modified 2.7 years ago • written 2.7 years ago by dzisis198620
1

My python script can easily get modified to get that done ;-)

from Bio import SeqIO
import sys

for record in SeqIO.parse(sys.argv[1], "fastq"):
    try:    
        positionSTART = str(record.seq).index("AGATCT")
        positionEND = str(record.seq).index("GTAC") + 4
        print(record[positionSTART:positionEND].format("fastq"))
    except ValueError:
        pass #Errors shouldn't go silent but up to OP what to do with those
ADD REPLYlink modified 2.7 years ago • written 2.7 years ago by WouterDeCoster41k

This script seems to work but it is very slow in huge fastq files. I have a file which is 15 Gb and it takes 1 day to finish. Any suggestion to make it faster ??

ADD REPLYlink written 2.7 years ago by dzisis198620

Do you have multiple cores/processors? You could split the fastq and run it in parallel.

ADD REPLYlink written 2.7 years ago by WouterDeCoster41k

BBDuk will process a 15GB fastq file in a couple minutes...

ADD REPLYlink written 2.7 years ago by Brian Bushnell16k
1

Yes, but does BBDuk support to use both a start recognition sequence and an end recognition sequence? If not, this probably can be done by running BBDuk twice/sequential.

ADD REPLYlink written 2.7 years ago by WouterDeCoster41k
3

Kind of single step. bbduk.sh in=file.fq out=stdout.fq ktrim=r k=4 mm=f literal=GTAC rcomp=f ktrimexclusive | bbduk.sh in=stdin.fq out=trimmed.fq ktrim=l k=6 mm=f literal=AGATCT rcomp=f ktrimexclusive

Should add a minute or two :)

ADD REPLYlink modified 2.7 years ago • written 2.7 years ago by genomax72k
1

BBDuk2 actually supports left and right trim simultaneously, but it only allows a single fixed kmer length which would make things rather difficult in this case :) I don't really ever use it because BBDuk is more flexible, as in Genomax's command.

ADD REPLYlink written 2.7 years ago by Brian Bushnell16k

Can we modify the python script to search for : Everything that starts from AGATCT to GTAC, GTAC to AGATCT , AGATCT to the end of the sequence (if there is no GTAC) and GTAC to the end of the sequence (if there is not AGATCT) ?? Thank you in advance.

ADD REPLYlink modified 2.7 years ago • written 2.7 years ago by dzisis198620

Yup that's perfectly possible, but it's still going to be slow for huge files.

ADD REPLYlink written 2.7 years ago by WouterDeCoster41k

still better than nothing ! At least with this is going to work correct !! How is going to be the new version ?

ADD REPLYlink modified 2.7 years ago • written 2.7 years ago by dzisis198620

I think the following should do the trick, but I didn't test it.

ADD REPLYlink written 2.7 years ago by WouterDeCoster41k

I thing it doesnt work ! i tried it but there is no match. It should check a read and take eveything that starts from AGATCT to GTAC or from GTAC to AGATCT and if it doesnt exist one of those two in each case take until the end of the read (AGATCT to the end or GTAC to the end )

ADD REPLYlink written 2.7 years ago by dzisis198620

i tried it but there is no match.

So it tells you None of the sites were found in read...?

ADD REPLYlink written 2.7 years ago by WouterDeCoster41k

yes i always have the exit message

ADD REPLYlink modified 2.7 years ago • written 2.7 years ago by dzisis198620

Can you show a read which gives the exit message?

ADD REPLYlink written 2.7 years ago by WouterDeCoster41k

None of the sites were found in read HISEQ:267:CAAV9ANXX:4:1101:10050:2218. this is the message i have and i checked the produced file which is only with the header of the read and emply !

ADD REPLYlink written 2.7 years ago by dzisis198620

Can you show me the actual sequence of that read?

grep -A 3 "HISEQ:267:CAAV9ANXX:4:1101:10050:2218" yourfile.fastq
ADD REPLYlink modified 2.7 years ago • written 2.7 years ago by WouterDeCoster41k

@HISEQ:267:CAAV9ANXX:4:1101:10050:2218 1:N:0:AGTCAA GTGCGGTCGATATTTTGTATCTTTAACGTTTAATGATTGTGTAGATCGGAAGAGCACACGTCTGAACTCCAGTCACAGTCAACAATCTCGTATGCCGTCTTCTGCTTGAAAAAAACAA
+

BBBBB/B<f< <="" <bf<fffffffffffffffffffffffffffff7ffffbfffffffffffffbfbf<ffffffffffffff<bfff="" bf="" fbf<b&lt;<f="" 7bfbbffff="" b="" bf<="" p="">

ADD REPLYlink modified 2.7 years ago by genomax72k • written 2.7 years ago by dzisis198620
1

As I expected, that read DOESN'T contain AGATCT or GTAC so the error message you got is completely accurate.

ADD REPLYlink written 2.7 years ago by WouterDeCoster41k

No because this is just a read from a fastq file. I run the script for a normal fastq file with a lot of reads and the output was the error message and one empty file or a file with 5 headers of the read but nothing else !

ADD REPLYlink written 2.7 years ago by dzisis198620

Can we modify the python script to search for : Everything that starts from AGATCT to GTAC, GTAC to AGATCT , AGATCT to the end of the sequence (if there is no GTAC) and GTAC to the end of the sequence (if there is not AGATCT) ?? Thank you in advance.

That's what you asked for. That's what you got from me too. I'm really bad at reading your mind.

This thread is getting confusing and you keep changing what you have and what you need. I'm hesitating to just close this down and let you think a while about your question and the solution that would help you. I'm also in doubt if all this makes biological sense and wonder if you know what you are doing. In addition, I can read your reactions perfectly fine without adding exclamation marks, thank you very much.

So, let's try one more time: what should then happen to reads which contain none of both restriction sites?

ADD REPLYlink written 2.7 years ago by WouterDeCoster41k

I am sorry if i didnt explain well what i want. The read which contain none of both restriction sites will be discarded. we want to have a new fastq file from the original which contains only reads that starts with AGATCT until GTAC, if there is no GTAC(secondary rest site) it will take the rest of the read. Also reads that start from GTAC until AGATCT, and if there is no AGATCT(secondary rest site in this case) it will take until the end of the read. I hope this is clear. Thank you and i am so sorry for the confusion

ADD REPLYlink written 2.7 years ago by dzisis198620
1

So reads without the restriction sites get ignored, alright. So we can do without the final lines, I updated the code:

ADD REPLYlink written 2.7 years ago by WouterDeCoster41k

This seems ok but i have to test it for all files !I will come back after testing. Thank you for the reply and i am sorry for the misunderstanding.

ADD REPLYlink written 2.7 years ago by dzisis198620

Hey i finished with testing the script and it is working fine!I have one last question: how can we modify the script remove reads with second apperance of restriction site at each case ? Apparently i have some reads like that

GTACCCGGACCACCAATCAGCAATCTCAGTTATCGTACAATTTCTTTCGTTAGATCGGAAGAGCACACGTCTGAACTCCAGTC

and in this case i want to remove the second apperance and keep only this

GTAC CCGGACCACCAATCAGCAATCTCAGTTATC

or with the other cutter

AGATCT CCGGACCACCAATCAGCAATCTCAGTTATCAATTTCTTTCGTTAGATCGGAAGAGCACACGTCT AGATCT

and keep this

AGATCT CCGGACCACCAATCAGCAATCTCAGTTATCAATTTCTTTCGTTAGATCGGAAGAGCACACGTCT

Thank you very much for your valuable help !!

ADD REPLYlink modified 2.7 years ago • written 2.7 years ago by dzisis198620

Does this only happen when the "other" restriction site is not found? Or always, even with the second cutter?

ADD REPLYlink written 2.7 years ago by WouterDeCoster41k

i found it when the second cutter was missing but maybe can be a case in which in one read we have

GTACCCGGACCACCAATCAGCAATCTCAGTTATCAGATCTAATTTCTTTCGTTAGATCGGAAGGTACCACGTCTGAACTCCAGTC

In that case we want only

GTACCCGGACCACCAATCAGCAATCTCAGTTATC

so actually at any case reads will start with any restriction site and end with a restriction site as we did before OR start with a restriction site and go until the end of the line in which there is no other restriction site or second appearance of restriction site

ADD REPLYlink written 2.7 years ago by dzisis198620
1

I've updated the code to do:

  1. Look for all matches of both restriction sites
  2. If one match is found, take from that position to the end
  3. If more than one match is found, take from the first to the second match
  4. If no match is found, do nothing

Obviously untested.

ADD REPLYlink written 2.7 years ago by WouterDeCoster41k

I have an error : python/3.5.1 load complete. File "trim_data.py", line 17 print(record[hits[0]:hits[1]].format("fastq")) ^ IndentationError: expected an indented block

ADD REPLYlink written 2.7 years ago by dzisis198620

That means something got messed up in the tabs of the script, check if the indentation is as I made it on the github gist.

ADD REPLYlink written 2.7 years ago by WouterDeCoster41k

This is what i did but still there is error. Now its something like : Traceback (most recent call last): File "trim_data.py", line 13, in <module> hits = findSite(record.seq, pattern) File "trim_data.py", line 9, in findSite return [m.start() for m in re.finditer(pattern, read)] File "/opt/exp_soft/local/generic/python/3.5.1/lib/python3.5/re.py", line 220, in finditer return _compile(pattern, flags).finditer(string) TypeError: expected string or bytes-like object

ADD REPLYlink written 2.7 years ago by dzisis198620
1

I've made a minor modification, can you try again?

ADD REPLYlink written 2.7 years ago by WouterDeCoster41k

It looks OK now. Thanks i will try it for the rest of my data !

ADD REPLYlink written 2.7 years ago by dzisis198620
2
gravatar for WouterDeCoster
2.7 years ago by
Belgium
WouterDeCoster41k wrote:

Could you try this:

from Bio import SeqIO
import sys

for record in SeqIO.parse(sys.argv[1], "fastq"):
    try:    
        position = str(record.seq).index("GTAC") + 4
        print(record[:position].format("fastq"))
    except ValueError:
        pass #Errors shouldn't go silent but up to OP what to do with those

Needs biopython. Save as e.g. trimAfterGTAC.py
Execute as python trimAfterGTAC.py yourinput.fastq > youroutput.fastq

Now, when no occurrence of GTAC is found the error is silently ignored. Not good, but up to you to decide what to do with that.

ADD COMMENTlink modified 2.7 years ago • written 2.7 years ago by WouterDeCoster41k
1
gravatar for Brian Bushnell
2.7 years ago by
Walnut Creek, USA
Brian Bushnell16k wrote:

You can do this with BBDuk from the BBMap package:

bbduk.sh in=reads.fq out=trimmed.fq ktrim=r k=4 mm=f literal=GTAC rcomp=f ktrimexclusive
ADD COMMENTlink modified 2.7 years ago • written 2.7 years ago by Brian Bushnell16k

i am using a cluster and i dont have the authority to install packages thats why i am looking for a simpler way in terminal. Thanks a lot anyway !

ADD REPLYlink written 2.7 years ago by dzisis198620
1

No installation is needed for BBMap suite. All you need is Java 1.7 or above. Things can't be any simpler than this.

ADD REPLYlink modified 2.7 years ago • written 2.7 years ago by genomax72k

Do you have some scripting experience? This should be quite easy (for example using Biopython).
Did I understand correctly that everything after the occurrence of GTAC is to be removed?

ADD REPLYlink written 2.7 years ago by WouterDeCoster41k

i am trying with grep awk etc to do it but i cant exactly! yes this is correnct i want to remove everything after the occurence of GTAC

ADD REPLYlink written 2.7 years ago by dzisis198620
0
gravatar for dzisis1986
2.7 years ago by
dzisis198620
dzisis198620 wrote:

I tried BBDuk in cluster and finally it works but the result is not what i want. i use as input pair end fastq files .

The Original read :

@HISEQ:267:CAAV9ANXX:4:1101:17780:2335 1:N:0:CGATGT 

TTCTTAAGATAGTGACAATA**AGATCTGCTCTTGCTCACCTAGGTAC**CCGGACCACCAATCAGCAATCTCAGTTATCCTTTAATTTCTTTCGTTAGATCGGAAGAGCACACGTCTGAACTCCAGTC 
+

BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFBFFFFFFFFFFFFFFFFFFF

What I want as output is everything from the primary restriction site trimmed at the second restriction site :

@HISEQ:267:CAAV9ANXX:4:1101:17780:2335 1:N:0:CGATGT

AGATCTGCTCTTGCTCACCTAG

+

BBBBBFFFFFFFFFFFFFFFFFF

or in the reverse reads

@HISEQ:267:CAAV9ANXX:3:1103:18012:14165 1:N:0:TGACCA

GTGCGGTCGATATTGTGTACCTATACCCCTCAGTAGAATAGATCTTTAACGTTTAATGATTGTGTAGATCGGAAGAGCACACGTCTGAACTCCAGTCACTGACCAATCTCGTATGCCGTCTTCTG

+

BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFBFFFFFFFFFFFFFFFFFFFFF

and after trimming should give in trim2 file the entry

@HISEQ:267:CAAV9ANXX:3:1103:18012:14165 1:N:0:TGACCA

GTACCTATACCCCTCAGTAGAAT

+

BBBBBFFFFFFFFFFFFFFFFFFFFFFF

So at the end i want to have fastq files only with all reads like that

@HISEQ:267:CAAV9ANXX:4:1101:17780:2335 1:N:0:CGATGT

AGATCTGCTCTTGCTCACCTAG

+

BBBBBFFFFFFFFFFFFFFFFFF 

@HISEQ:267:CAAV9ANXX:3:1103:18012:14165 1:N:0:TGACCA

GTACCTATACCCCTCAGTAGAAT

+

BBBBBFFFFFFFFFFFFFFFFFFFFFFF

Is it possible to do this with BBDuk ???? I tried with command line and other ways and finally i didnt have the correnct results .

ADD COMMENTlink modified 2.7 years ago by genomax72k • written 2.7 years ago by dzisis198620

Something does not jive here. You keep referring to "reverse" reads (which I assume refers to R2 read) but the examples you have shown above all have 1:N:0:tag. So are these reads in two separate files that are labeled R1?

Please use ADD COMMENT/ADD REPLY when responding to existing posts. This helps keep threads logically organized. This particular information could have gone in the original post since you are explaining more clearly as to what you want.

ADD REPLYlink modified 2.7 years ago • written 2.7 years ago by genomax72k

I am sorry for the wrong output of my post . No i have one fastq file with forward and reverse primer siquences. The forward primer seq is for one data set for example ACACAATCATTAAACGTTAAAGATCT and the reverse is GTGCGGTCGATATTGTGTAC. In this case i want to look in the fastq file first keep only reads which start with the primer sequences (i did this) and the i want to do trimming on the reads as it is described above.I tried using diifferent ways in terminal with grep cat and python but everything is slow and apparently the result is not what i want. Thats why i am asking about an alternative faster way which is with BBDuk. Can you help with that ?

ADD REPLYlink written 2.7 years ago by dzisis198620

This is hard to follow because it seems like you keep changing what you say you want. But on your example forward read, I used this command:

bbduk.sh in=sample.fq out=stdout.fq ktrim=l k=6 mm=f literal=AGATCT rcomp=f ktrimexclusive | bbduk.sh in=stdin.fq out=out.fq ktrim=r k=4 mm=f literal=GTAC rcomp=f ktrimexclusive=f

...which produced this:

@HISEQ:267:CAAV9ANXX:4:1101:17780:2335 1:N:0:CGATGT
AGATCTGCTCTTGCTCACCTAG
+
FFFFFFFFFFFFFFFFFFFFFF

However, that command won't work if half your reads are reversed.

ADD REPLYlink written 2.7 years ago by Brian Bushnell16k

This is hard to follow because it seems like you keep changing what you say you want.

^^ This ^^

If I am getting this right there is one pattern (which we have demo'ed) for one half of the reads and another for the second (reverse - not sure what OP means there) half.

ADD REPLYlink modified 2.7 years ago • written 2.7 years ago by genomax72k

This is the case haof my reads are reversed.

ADD REPLYlink written 2.7 years ago by dzisis198620
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1892 users visited in the last hour