Fastq Trimmer by pattern
4
0
Entering edit mode
5.8 years ago
dzisis1986 ▴ 60

I have a set of fastq files and I want to trim from the begining until the second restriction site (GTAC) found in the read. Is there any simple way to do it in terminal ?

Example:

@HISEQ:267:CAAV9ANXX:4:1101:17780:2335 1:N:0:CGATGT TTCTTAAGATAGTGACAATAAGATCTGCTCTTGCTCACCTAGGTACCCGGACCACCAATCAGCAATCTCAGTTATCCTTTAATTTCTTTCGTTAGATCGGAAGAGCACACGTCTGAACTCCAGTC +

BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFBFFFFFFFFFFFFFFFFFFF

What I want as output :

@HISEQ:267:CAAV9ANXX:4:1101:17780:2335 1:N:0:CGATGT

TTCTTAAGATAGTGACAATAAGATCTGCTCTTGCTCACCTAGGTAC

+

BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

fastq trimming reads terminal • 3.2k views
4
Entering edit mode
5.8 years ago

Using awk:

cat input.fq | paste - - - - | awk -F '\t' '{i=index($2,"GTAC"); OFS="\t";$2=(i==0?$2:substr($2,1,i+3));$4=(i==0?$4:substr(\$4,1,i+3)); print;}' | tr "\t" "\n"

0
Entering edit mode

if want to cut also a part from the start from AGATCT until the GTAC in order to have finally the read like that :

@HISEQ:267:CAAV9ANXX:4:1101:17780:2335 1:N:0:CGATGT

AGATCTGCTCTTGCTCACCTAGGTAC

+

BBBBBFFFFFFFFFFFFFFFFFFFFFFF

I should run one more time the awk with AGATCT ?? Thank you in advance for your valuable help !!!

1
Entering edit mode

My python script can easily get modified to get that done ;-)

from Bio import SeqIO
import sys

for record in SeqIO.parse(sys.argv[1], "fastq"):
try:
positionSTART = str(record.seq).index("AGATCT")
positionEND = str(record.seq).index("GTAC") + 4
print(record[positionSTART:positionEND].format("fastq"))
except ValueError:
pass #Errors shouldn't go silent but up to OP what to do with those

0
Entering edit mode

This script seems to work but it is very slow in huge fastq files. I have a file which is 15 Gb and it takes 1 day to finish. Any suggestion to make it faster ??

0
Entering edit mode

Do you have multiple cores/processors? You could split the fastq and run it in parallel.

0
Entering edit mode

BBDuk will process a 15GB fastq file in a couple minutes...

1
Entering edit mode

Yes, but does BBDuk support to use both a start recognition sequence and an end recognition sequence? If not, this probably can be done by running BBDuk twice/sequential.

3
Entering edit mode

Kind of single step. bbduk.sh in=file.fq out=stdout.fq ktrim=r k=4 mm=f literal=GTAC rcomp=f ktrimexclusive | bbduk.sh in=stdin.fq out=trimmed.fq ktrim=l k=6 mm=f literal=AGATCT rcomp=f ktrimexclusive

Should add a minute or two :)

1
Entering edit mode

BBDuk2 actually supports left and right trim simultaneously, but it only allows a single fixed kmer length which would make things rather difficult in this case :) I don't really ever use it because BBDuk is more flexible, as in Genomax's command.

0
Entering edit mode

Can we modify the python script to search for : Everything that starts from AGATCT to GTAC, GTAC to AGATCT , AGATCT to the end of the sequence (if there is no GTAC) and GTAC to the end of the sequence (if there is not AGATCT) ?? Thank you in advance.

0
Entering edit mode

Yup that's perfectly possible, but it's still going to be slow for huge files.

0
Entering edit mode

still better than nothing ! At least with this is going to work correct !! How is going to be the new version ?

0
Entering edit mode

I think the following should do the trick, but I didn't test it.

0
Entering edit mode

I thing it doesnt work ! i tried it but there is no match. It should check a read and take eveything that starts from AGATCT to GTAC or from GTAC to AGATCT and if it doesnt exist one of those two in each case take until the end of the read (AGATCT to the end or GTAC to the end )

0
Entering edit mode

i tried it but there is no match.

So it tells you None of the sites were found in read...?

0
Entering edit mode

yes i always have the exit message

0
Entering edit mode

Can you show a read which gives the exit message?

0
Entering edit mode

None of the sites were found in read HISEQ:267:CAAV9ANXX:4:1101:10050:2218. this is the message i have and i checked the produced file which is only with the header of the read and emply !

0
Entering edit mode

Can you show me the actual sequence of that read?

grep -A 3 "HISEQ:267:CAAV9ANXX:4:1101:10050:2218" yourfile.fastq

0
Entering edit mode

@HISEQ:267:CAAV9ANXX:4:1101:10050:2218 1:N:0:AGTCAA GTGCGGTCGATATTTTGTATCTTTAACGTTTAATGATTGTGTAGATCGGAAGAGCACACGTCTGAACTCCAGTCACAGTCAACAATCTCGTATGCCGTCTTCTGCTTGAAAAAAACAA
+

BBBBB/B<f< <="" <bf<fffffffffffffffffffffffffffff7ffffbfffffffffffffbfbf<ffffffffffffff<bfff="" bf="" fbf<b&lt;<f="" 7bfbbffff="" b="" bf<="" p="">

1
Entering edit mode

As I expected, that read DOESN'T contain AGATCT or GTAC so the error message you got is completely accurate.

0
Entering edit mode

No because this is just a read from a fastq file. I run the script for a normal fastq file with a lot of reads and the output was the error message and one empty file or a file with 5 headers of the read but nothing else !

0
Entering edit mode

Can we modify the python script to search for : Everything that starts from AGATCT to GTAC, GTAC to AGATCT , AGATCT to the end of the sequence (if there is no GTAC) and GTAC to the end of the sequence (if there is not AGATCT) ?? Thank you in advance.

This thread is getting confusing and you keep changing what you have and what you need. I'm hesitating to just close this down and let you think a while about your question and the solution that would help you. I'm also in doubt if all this makes biological sense and wonder if you know what you are doing. In addition, I can read your reactions perfectly fine without adding exclamation marks, thank you very much.

So, let's try one more time: what should then happen to reads which contain none of both restriction sites?

0
Entering edit mode

I am sorry if i didnt explain well what i want. The read which contain none of both restriction sites will be discarded. we want to have a new fastq file from the original which contains only reads that starts with AGATCT until GTAC, if there is no GTAC(secondary rest site) it will take the rest of the read. Also reads that start from GTAC until AGATCT, and if there is no AGATCT(secondary rest site in this case) it will take until the end of the read. I hope this is clear. Thank you and i am so sorry for the confusion

1
Entering edit mode

So reads without the restriction sites get ignored, alright. So we can do without the final lines, I updated the code:

0
Entering edit mode

This seems ok but i have to test it for all files !I will come back after testing. Thank you for the reply and i am sorry for the misunderstanding.

0
Entering edit mode

Hey i finished with testing the script and it is working fine!I have one last question: how can we modify the script remove reads with second apperance of restriction site at each case ? Apparently i have some reads like that

GTACCCGGACCACCAATCAGCAATCTCAGTTATCGTACAATTTCTTTCGTTAGATCGGAAGAGCACACGTCTGAACTCCAGTC

and in this case i want to remove the second apperance and keep only this

GTAC CCGGACCACCAATCAGCAATCTCAGTTATC

or with the other cutter

AGATCT CCGGACCACCAATCAGCAATCTCAGTTATCAATTTCTTTCGTTAGATCGGAAGAGCACACGTCT AGATCT

and keep this

AGATCT CCGGACCACCAATCAGCAATCTCAGTTATCAATTTCTTTCGTTAGATCGGAAGAGCACACGTCT

Thank you very much for your valuable help !!

0
Entering edit mode

Does this only happen when the "other" restriction site is not found? Or always, even with the second cutter?

0
Entering edit mode

i found it when the second cutter was missing but maybe can be a case in which in one read we have

GTACCCGGACCACCAATCAGCAATCTCAGTTATCAGATCTAATTTCTTTCGTTAGATCGGAAGGTACCACGTCTGAACTCCAGTC

In that case we want only

GTACCCGGACCACCAATCAGCAATCTCAGTTATC

so actually at any case reads will start with any restriction site and end with a restriction site as we did before OR start with a restriction site and go until the end of the line in which there is no other restriction site or second appearance of restriction site

1
Entering edit mode

I've updated the code to do:

1. Look for all matches of both restriction sites
2. If one match is found, take from that position to the end
3. If more than one match is found, take from the first to the second match
4. If no match is found, do nothing

Obviously untested.

0
Entering edit mode

I have an error : python/3.5.1 load complete. File "trim_data.py", line 17 print(record[hits[0]:hits[1]].format("fastq")) ^ IndentationError: expected an indented block

0
Entering edit mode

That means something got messed up in the tabs of the script, check if the indentation is as I made it on the github gist.

0
Entering edit mode

This is what i did but still there is error. Now its something like : Traceback (most recent call last): File "trim_data.py", line 13, in <module> hits = findSite(record.seq, pattern) File "trim_data.py", line 9, in findSite return [m.start() for m in re.finditer(pattern, read)] File "/opt/exp_soft/local/generic/python/3.5.1/lib/python3.5/re.py", line 220, in finditer return _compile(pattern, flags).finditer(string) TypeError: expected string or bytes-like object

1
Entering edit mode

I've made a minor modification, can you try again?

0
Entering edit mode

It looks OK now. Thanks i will try it for the rest of my data !

2
Entering edit mode
5.8 years ago

Could you try this:

from Bio import SeqIO
import sys

for record in SeqIO.parse(sys.argv[1], "fastq"):
try:
position = str(record.seq).index("GTAC") + 4
print(record[:position].format("fastq"))
except ValueError:
pass #Errors shouldn't go silent but up to OP what to do with those


Needs biopython. Save as e.g. trimAfterGTAC.py
Execute as python trimAfterGTAC.py yourinput.fastq > youroutput.fastq

Now, when no occurrence of GTAC is found the error is silently ignored. Not good, but up to you to decide what to do with that.

1
Entering edit mode
5.8 years ago

You can do this with BBDuk from the BBMap package:

bbduk.sh in=reads.fq out=trimmed.fq ktrim=r k=4 mm=f literal=GTAC rcomp=f ktrimexclusive

0
Entering edit mode

i am using a cluster and i dont have the authority to install packages thats why i am looking for a simpler way in terminal. Thanks a lot anyway !

1
Entering edit mode

No installation is needed for BBMap suite. All you need is Java 1.7 or above. Things can't be any simpler than this.

0
Entering edit mode

Do you have some scripting experience? This should be quite easy (for example using Biopython).
Did I understand correctly that everything after the occurrence of GTAC is to be removed?

0
Entering edit mode

i am trying with grep awk etc to do it but i cant exactly! yes this is correnct i want to remove everything after the occurence of GTAC

0
Entering edit mode
5.8 years ago
dzisis1986 ▴ 60

I tried BBDuk in cluster and finally it works but the result is not what i want. i use as input pair end fastq files .

@HISEQ:267:CAAV9ANXX:4:1101:17780:2335 1:N:0:CGATGT

TTCTTAAGATAGTGACAATA**AGATCTGCTCTTGCTCACCTAGGTAC**CCGGACCACCAATCAGCAATCTCAGTTATCCTTTAATTTCTTTCGTTAGATCGGAAGAGCACACGTCTGAACTCCAGTC
+

BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFBFFFFFFFFFFFFFFFFFFF


What I want as output is everything from the primary restriction site trimmed at the second restriction site :

@HISEQ:267:CAAV9ANXX:4:1101:17780:2335 1:N:0:CGATGT

AGATCTGCTCTTGCTCACCTAG

+

BBBBBFFFFFFFFFFFFFFFFFF


@HISEQ:267:CAAV9ANXX:3:1103:18012:14165 1:N:0:TGACCA

GTGCGGTCGATATTGTGTACCTATACCCCTCAGTAGAATAGATCTTTAACGTTTAATGATTGTGTAGATCGGAAGAGCACACGTCTGAACTCCAGTCACTGACCAATCTCGTATGCCGTCTTCTG

+

BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFBFFFFFFFFFFFFFFFFFFFFF


and after trimming should give in trim2 file the entry

@HISEQ:267:CAAV9ANXX:3:1103:18012:14165 1:N:0:TGACCA

GTACCTATACCCCTCAGTAGAAT

+

BBBBBFFFFFFFFFFFFFFFFFFFFFFF


So at the end i want to have fastq files only with all reads like that

@HISEQ:267:CAAV9ANXX:4:1101:17780:2335 1:N:0:CGATGT

AGATCTGCTCTTGCTCACCTAG

+

BBBBBFFFFFFFFFFFFFFFFFF

@HISEQ:267:CAAV9ANXX:3:1103:18012:14165 1:N:0:TGACCA

GTACCTATACCCCTCAGTAGAAT

+

BBBBBFFFFFFFFFFFFFFFFFFFFFFF


Is it possible to do this with BBDuk ???? I tried with command line and other ways and finally i didnt have the correnct results .

0
Entering edit mode

Something does not jive here. You keep referring to "reverse" reads (which I assume refers to R2 read) but the examples you have shown above all have 1:N:0:tag. So are these reads in two separate files that are labeled R1?

Please use ADD COMMENT/ADD REPLY when responding to existing posts. This helps keep threads logically organized. This particular information could have gone in the original post since you are explaining more clearly as to what you want.

0
Entering edit mode

I am sorry for the wrong output of my post . No i have one fastq file with forward and reverse primer siquences. The forward primer seq is for one data set for example ACACAATCATTAAACGTTAAAGATCT and the reverse is GTGCGGTCGATATTGTGTAC. In this case i want to look in the fastq file first keep only reads which start with the primer sequences (i did this) and the i want to do trimming on the reads as it is described above.I tried using diifferent ways in terminal with grep cat and python but everything is slow and apparently the result is not what i want. Thats why i am asking about an alternative faster way which is with BBDuk. Can you help with that ?

0
Entering edit mode

This is hard to follow because it seems like you keep changing what you say you want. But on your example forward read, I used this command:

bbduk.sh in=sample.fq out=stdout.fq ktrim=l k=6 mm=f literal=AGATCT rcomp=f ktrimexclusive | bbduk.sh in=stdin.fq out=out.fq ktrim=r k=4 mm=f literal=GTAC rcomp=f ktrimexclusive=f


...which produced this:

@HISEQ:267:CAAV9ANXX:4:1101:17780:2335 1:N:0:CGATGT
AGATCTGCTCTTGCTCACCTAG
+
FFFFFFFFFFFFFFFFFFFFFF


0
Entering edit mode

This is hard to follow because it seems like you keep changing what you say you want.

^^ This ^^

If I am getting this right there is one pattern (which we have demo'ed) for one half of the reads and another for the second (reverse - not sure what OP means there) half.

0
Entering edit mode

This is the case haof my reads are reversed.