Question: bash command for trimming and filtering .fastq files
1
gravatar for Himanshu Bhusan Samal
11 days ago by
Germany/Berlin/Max Delbruck Center for Molecular Medicine
Himanshu Bhusan Samal20 wrote:

Hello everyone,

I need little help to work with the genomic DNA seq. I would like to trim the reads (.fastq files) until first 'TA' position. And then, want to remove all the reads having length <40bp. It would be helpful if someone could share some useful commands to do the same. Thanks very much!

Best,

Himanshu

For ex -

@K00302:80:HLTWCBBXX:3:1101:3478:1402 1:N:0:NTAGGC
GGCGATGCGGCGGCGTTATTCCCATGACCCGCCGGGCAGCTTCCGGGAAACCAAAGTCTTTGGGTTCCGGGGGGAGTATGGTTGCAAAGCTGAAACAAAAA
+
FAAA<JAA7AAFJ<AJJ-AJAFAFJFJ<-A<7<AAA-7AA-A7<F7-AJF7AA-77FFAJFFFFFJA<JAFJJ-A77AJFJFJF7F<FA7FJ<JJ7<J-A-
@K00302:80:HLTWCBBXX:3:1101:2199:1402 1:N:0:NTAGGC
GATAAATGCATTGTCCACTAAGAAGTTCTGAGCTGGAAAAAAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACTTAGGCTTCACGTATTCCGT
+
JAFFFJF-<FFF-<-A--FFF-7JJFJJ--<A<<J-7FFFJJFFJF<JF7A7<--77J-AA-7AA-AAJ-7FFFA7<-7-7--7J--<---<)---7-7-----)7<

edit: Removed ** that were not part of the sequence. @genomax

filtering fastq files • 319 views
ADD COMMENTlink modified 8 days ago by RamRS17k • written 11 days ago by Himanshu Bhusan Samal20

Hi Himanshu Bhusan Samal,

  • Please select a more descriptive title for your thread so it's clear what this is about. You are looking at trimming and filtering.
  • Please use appropriate tags so experts can easily find your question and help you. Just 'next-gen' is meaningless.

  • I added markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post that button is in your toolbar, see image below:

101010 Button

  • Please put some effort in solving this yourself. We are more eager to help you if you show us that you tried to solve it.
ADD REPLYlink modified 10 days ago • written 10 days ago by WouterDeCoster32k

Thanks for your suggestions.

ADD REPLYlink written 10 days ago by Himanshu Bhusan Samal20

do you want to trim the reads from the left side up until (or including) the 'TA' of trim them back from the right up to the 'TA' ?

If the former then the answer given by b.nota will indeed be what you're looking for

ADD REPLYlink modified 10 days ago • written 10 days ago by lieven.sterck2.4k

Hello, thanks very much for both of your response. Yes, I would like to trim from 5' end (from left side up-to/including TA). Ex- 'GGCGATGCGGCGGCGTTA' from the first read. Then want to remove if any reads having length <40.

Previously I have tried cutadapt, but it's not filling the purpose as the allowed error rate is 10% and here is only two base, TA. That's why, I am looking for bash command to the same. If you think, with cutadapt it is possible then it will be helpful if you could give the solution. Thanks!

ADD REPLYlink written 10 days ago by Himanshu Bhusan Samal20

Yes, I was trying to bold the rest sequence after TA, (sequence which I want to keep after trimming, just to visualize for you) that's why '**'. And, randomly I have copied pasted the .fastq sequences (from a huge .fastq file), that's why the length is not matching to the quality of sequence. In the original .fastq file, everything is fine. The problem is arising when I am trimming only two base, TA.

ADD REPLYlink modified 9 days ago • written 9 days ago by Himanshu Bhusan Samal20

see my post: add the -O 2 as a parameter will fix this !

you get this error because the adapter length you are looking for is smaller then the allowed overlap- between adapters causing cutadapt to fail , set the min overlap to 'adapter' length (== -O 2) to lower or equal to adapter length is the solution!

ADD REPLYlink modified 9 days ago • written 9 days ago by lieven.sterck2.4k

Did you try what lieven.sterck is suggesting? With the -O 2 argument?

ADD REPLYlink written 9 days ago by b.nota4.9k

Thank you so very much Sterck! I was trying that only. It's working the way you suggested (with -O 2 parameter). Couldn't post my message because I have limitation of sending 5 message per 6 hours, so reached the limitation. Sorry for the delay. I am really thankful for both of your effort, thank you!

ADD REPLYlink written 9 days ago by Himanshu Bhusan Samal20
5
gravatar for b.nota
10 days ago by
b.nota4.9k
Netherlands
b.nota4.9k wrote:

I suggest not to reinvent the wheel but use something like cutadapt. If I understand your aim correctly you want to remove everything preceding the first 'TA', correct?

Simple cutadapt command would do this.

cutadapt -g TA -o output.fq input.fq

See cutadapt -h for more information, also how to use it for the <40 cut off.

ADD COMMENTlink modified 10 days ago • written 10 days ago by b.nota4.9k

might work indeed.

but probably worth looking into if cutadapt uses the left most one or the right most one. In the latter case a few rounds of trimming might be required (if cutadapt does not handle multiple copies of the "adapter" well)

ADD REPLYlink written 10 days ago by lieven.sterck2.4k

-g argument means 5' of the read, please read help page.

ADD REPLYlink written 10 days ago by b.nota4.9k

yes, but that says nothing about the multiple "adapter" presence

and I was under the impression he wants to trim from right to left. But better ask OP for clarification

ADD REPLYlink written 10 days ago by lieven.sterck2.4k

Did you read the manual?

ADD REPLYlink written 10 days ago by b.nota4.9k

I certainly did, which raised another remark: should you not add the -O 2 as parameters (to set the min overlap equal to the adapter length) ?

and I stand corrected on the multiple adapters issue (empirical determined)

ADD REPLYlink written 10 days ago by lieven.sterck2.4k

Hi I am not sure if you really understood the manual then correctly, but let's go thru it together. The -g argument means to trim everything from 5' end until (including) the adapter, TA in this case. By default it only trims 1 adapter per read, so no need to change that. Furthermore, the default allowed error rate is 10%, which is 0.2 for this two base adapter, hence no error allowed. Can you please add your better solution as a new answer? Thank you.

ADD REPLYlink written 10 days ago by b.nota4.9k

Hello, thanks very much for both of your response. Yes, I would like to trim from 5' end (from left side up-to/including TA). Ex- 'GGCGATGCGGCGGCGTTA' from the first read. Then want to remove if any reads having length <40.

Previously I have tried cutadapt, but it's not filling the purpose as the allowed error rate is 10% and here is only two base, TA. That's why, I am looking for bash command to the same. If you think, with cutadapt it is possible then it will be helpful if you could give the solution. Thanks!

ADD REPLYlink written 10 days ago by Himanshu Bhusan Samal20

Well it seems you did not read the manual well enough, there is an argument -m which filters for length. Like I said before the error rate of 10% is 0.2 so no errors are allowed by default (and if you are still afraid of this error rate set it to 0, but it is not necessary).

cat input.fq
@K00302:80:HLTWCBBXX:3:1101:3478:1402 1:N:0:NTAGGC
GGCGATGCGGCGGCGTTATTCCCATGACCCGCCGGGCAGCTTCCGGGAAACCAAAGTCTTTGGGTTCCGGGGGGAGTATGGTTGCAAAGCTGAAACAAAAA
+
FAAA<JAA7AAFJ<AJJ-AJAFAFJFJ<-A<7<AAA-7AA-A7<F7-AJF7AA-77FFAJFFFFFJA<JAFJJ-A77AJFJFJF7F<FA7FJ<JJ7<J-A-
@K00302:80:HLTWCBBXX:3:1101:2199:1402 1:N:0:NTAGGC
GATAAATGCATTGTCCACTAAGAAGTTCTGAGCTGGAAAAAAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACTTAGGCTTCACGTATTCCGT
+
JAFFFJF-<FFF-<-A--FFF-7JJFJJ--<A<<J-7FFFJJFFJF<JF7A7<--77J-AA-7AA-AAJ-7FFFA7<-7-7--7J--<---<)---7-7-----

cutadapt -g TA -m 40 -o output.fq input.fq

cat output.fq
@K00302:80:HLTWCBBXX:3:1101:3478:1402 1:N:0:NTAGGC
TTCCCATGACCCGCCGGGCAGCTTCCGGGAAACCAAAGTCTTTGGGTTCCGGGGGGAGTATGGTTGCAAAGCTGAAACAAAAA
+
AJAFAFJFJ<-A<7<AAA-7AA-A7<F7-AJF7AA-77FFAJFFFFFJA<JAFJJ-A77AJFJFJF7F<FA7FJ<JJ7<J-A-
@K00302:80:HLTWCBBXX:3:1101:2199:1402 1:N:0:NTAGGC
AATGCATTGTCCACTAAGAAGTTCTGAGCTGGAAAAAAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACTTAGGCTTCACGTATTCCGT
+
FJF-<FFF-<-A--FFF-7JJFJJ--<A<<J-7FFFJJFFJF<JF7A7<--77J-AA-7AA-AAJ-7FFFA7<-7-7--7J--<---<)---7-7-----
ADD REPLYlink modified 10 days ago • written 10 days ago by b.nota4.9k

Thanks for your reply. Of course, I read the manual and had tried the same that you mentioned. Since I was getting error, thought of trying in different way, if it is possible. Since it's working for you, maybe some problem from my side. I am trying to sort that, thanks!

Trimming 1 adapter(s) with at most 10.0% errors in single-end mode ...
Traceback (most recent call last):
  File "/gnu/store/9rwgx4pbq4sd4d18x6x24hqab8mv3iys-cutadapt-1.8/bin/.cutadapt-real", line 10, in <module>
    cutadapt.main()
  File "/gnu/store/9rwgx4pbq4sd4d18x6x24hqab8mv3iys-cutadapt-1.8/lib/python3.4/site-packages/cutadapt/scripts/cutadapt.py", line 833, in main
    stats = process_single_reads(reader, modifiers, writers)
  File "/gnu/store/9rwgx4pbq4sd4d18x6x24hqab8mv3iys-cutadapt-1.8/lib/python3.4/site-packages/cutadapt/scripts/cutadapt.py", line 254, in process_single_reads
    read = modifier(read)
  File "/gnu/store/9rwgx4pbq4sd4d18x6x24hqab8mv3iys-cutadapt-1.8/lib/python3.4/site-packages/cutadapt/scripts/cutadapt.py", line 190, in __call__
    match = self._best_match(read)
  File "/gnu/store/9rwgx4pbq4sd4d18x6x24hqab8mv3iys-cutadapt-1.8/lib/python3.4/site-packages/cutadapt/scripts/cutadapt.py", line 132, in _best_match
    match = adapter.match_to(read)
  File "/gnu/store/9rwgx4pbq4sd4d18x6x24hqab8mv3iys-cutadapt-1.8/lib/python3.4/site-packages/cutadapt/adapters.py", line 293, in match_to
    assert match.length >= self.min_overlap
AssertionError
ADD REPLYlink written 10 days ago by Himanshu Bhusan Samal20

Okay, I see. The error was not in the original question, so I suggest to post a new question where you describe your error (and your code). Good luck!

ADD REPLYlink written 10 days ago by b.nota4.9k

I am not getting this error when I am trying with any different/long adapter. Code is same only. I have used cutadapt for adapter trimming several times, but this error is for the first time when I am trying to trim only two base, TA. Thats why I thought here maybe it's not possible with cutadapt, though I have read the manual. Hope you understand! It will be a great help if you can give me the solution, if possible. Thanks!

ADD REPLYlink written 10 days ago by Himanshu Bhusan Samal20

Maybe this can help: Cutadapt Assertionerror In your example fastq file, if you remove the ** in the sequence, the length of the DNA sequence is not equal to the quality sequence.

ADD REPLYlink modified 9 days ago • written 10 days ago by b.nota4.9k

see my post: add the -O 2 as a parameter will fix this !

you get this error because the adapter length you are looking for is smaller then the allowed overlap- between adapters causing cutadapt to fail , set the min overlap to 'adapter' length (== -O 2) to lower or equal to adapter length is the solution!

and I assume he added the ** to indicate where he want is to be trimmed (or indicate the part he wants to keep)?

ADD REPLYlink modified 9 days ago • written 9 days ago by lieven.sterck2.4k

Okay, in my version of cutadapt (1.13) this seems not a problem. I assume some things have changed in the newer versions. If you remove the ** from the second read, the quality of that reads is longer in length. I don't know the history of that fastq file, but something went wrong.

ADD REPLYlink written 9 days ago by b.nota4.9k

strange, I did not find anything in the change log that this has been updated, moreover it should still be default 3 , anyway ...

+1 for the length diff, I noticed that as well (and for the top one removing the ** gives the corresponding quality length)

ADD REPLYlink written 9 days ago by lieven.sterck2.4k

Yes, it's strange. In version 1.13 it is also default 3 but still seems to run with no errors.

ADD REPLYlink written 9 days ago by b.nota4.9k
1

Thank you so very much Sterck! I was trying that only. It's working the way you suggested (with -O 2 parameter). Couldn't post my message because I have limitation of sending 5 message per 6 hours, so reached the limitation. Sorry for the delay. I am really thankful for both of your effort, thank you!

ADD REPLYlink written 9 days ago by Himanshu Bhusan Samal20
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1376 users visited in the last hour