1
0
4.8 years ago

Hi all, I have two fastaq files and I want to subtract the reads of one file from the other. What command line or software can I use to do that?

alignment sequence next-gen • 2.6k views
0

You probably mean a fastq file, since as far as I know a 'fastaq' file does not exist.

3
4.8 years ago
gunzip -c f1.fq.gz f2.fq.gz | paste  - - - - | sort |uniq |tr "\t" "\n" > f3.fq

0

I think that will produce the set of unique reads across the two files. But suppose I have two fastq files and I want to remove from one file only the reads that are also present in the other, rather than build a combined file of unique reads. What should I do then?

0

use comm
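
To make the comm suggestion concrete, here is a minimal sketch. It assumes uncompressed fastq files with strict 4-line records, and it only removes reads whose whole record (header, sequence and quality) is identical in both files. The file names and the tiny reads are hypothetical. Note that comm compares two sorted files, so each input is linearized and sorted separately instead of being piped through together.

```shell
# Hypothetical working directory and toy inputs (not from the thread).
workdir=fq_comm_demo
mkdir -p "$workdir"
printf '@r1\nACGT\n+\nFFFF\n@r2\nGGCC\n+\nFFFF\n' > "$workdir/f1.fq"
printf '@r2\nGGCC\n+\nFFFF\n@r3\nTTAA\n+\nFFFF\n' > "$workdir/f2.fq"

# Put each 4-line record on one tab-separated line, then sort.
# LC_ALL=C keeps the sort order consistent with what comm expects.
paste - - - - < "$workdir/f1.fq" | LC_ALL=C sort -u > "$workdir/f1.tsv"
paste - - - - < "$workdir/f2.fq" | LC_ALL=C sort -u > "$workdir/f2.tsv"

# comm -23 hides lines unique to file 2 and lines common to both,
# leaving only the records found in f1 but not in f2.
LC_ALL=C comm -23 "$workdir/f1.tsv" "$workdir/f2.tsv" \
  | tr '\t' '\n' > "$workdir/f3.fq"

cat "$workdir/f3.fq"
```

With the toy inputs above, only the @r1 record is left in f3.fq. For gzipped input, each paste could read from gunzip -c instead of a plain file.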

0

@Pierre Lindenbaum So should the command look like this? gunzip -c f1.fq.gz f2.fq.gz | paste - - - - | sort |comm|tr "\t" "\n" > f3.fq

0

But suppose I have two fastq files and I want to remove from one file only the reads that are also present in the other, rather than build a combined file of unique reads. What should I do then?

If you are referring to actual sequence identity (and not to the full fastq record being duplicated), then one way to do that is to use clumpify.sh from the BBMap suite. See this thread: A: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files

0

I think it will remove duplicate reads, so it will not work for the question above. Let's say I have one fastq file like:

@HISEQ:230:C6G45ANXX:3:1101:1395:2141 1:N:0:ACAGTGGTTGAACCTT
TGACGGCACTTTCTCTTCCCAACCACGTGGCTGCAGACTTCTTGCTCTCAAGTTGTCCTGACATGCTCTGAGAGCACACA
+
BB//<<BFBFFF<FFFFBBB<<<F/FBBB<FF/B<FFFFFFFFFFFFFFBFFFBFB/FBFFB//F//B<FFF</</BF<BBBFFFFF//B<FBFF/7
@HISEQ:230:C6G45ANXX:3:1101:1498:2162 1:N:0:ACAGTGGTTGAACCTT
TGACGGCACTTTCTCTTCCCAACCACGTGGCTGCAGACTTCTTGCTCTCAAGTTGTCCTGACATGCTCTGAGAGCACACA
+
BBB<B<F<FFFFFFFBFFFFFFBFFFFBFF/F<FFFFBBFFFFFFFFFFBFB/BFFFFFFFFFFFBFFB/<<<FFFFFFFFFFFFFFBFFFF

@HISEQ:230:C6G45ANXX:3:1101:1395:2141 1:N:0:ACAGTGGTTGAACCTT
TGACGGCACTTTCTCTTCCCAACCACGTGGCTGCAGCTTGCTCTCAAGTTGTCCTGACATGCTCTGAGAGCACACA
+
BB//<<BFBFFF<FFFFBBB<<<F/FBBB<FF/B<FFFFFFFFFFBFFFBFB/FBFFB//F//B<FFF</</BF<BBBFFFFF//B<FBFF/7

##################################
and another fastq file like:

@HISEQ:230:C6G45ANXX:3:1101:1395:2141 1:N:0:ACAGTGGTTGAACCTT
TGACGGCACTTTCTCTTCCCAACCACGTGGCTGCAGACTTCTTGCTCTCAAGTTGTCCTGACATGCTCTGAGAGCACACA
+
BB//<<BFBFFF<FFFFBBB<<<F/FBBB<FF/B<FFFFFFFFFFFFFFBFFFBFB/FBFFB//F//B<FFF</</BF<BBBFFFFF//B<FBFF/7

@HISEQ:230:C6G45ANXX:3:1101:1395:2141 1:N:0:ACAGTGGTTGAACCTT
TGACGGCACTTTCTCTTCCCAACCACGTGTCTTGCTCTCAAGTTGTCCTGACATGCTCTGAGAGCACACA
+
BB//<<BFBFFF<FFFFBBB<<<F/FBBB<FF/B<FFFFBFFFBB/FBFFB//F//B<FFF</</BF<BBBFFFFF//B<FBFF/7


So now let's say I want to compare these two files: if a read from the 2nd fastq file is found in the 1st fastq file, remove that read from the 1st file; otherwise keep the 1st file as it is.

Do you think clumpify.sh will do that?
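
For comparison, the subtraction described here can also be sketched with standard Unix tools when "the same read" means an exact, full-length match of the sequence line, regardless of header or quality. This is only an illustration under that assumption, not a replacement for the BBMap tools mentioned in the thread. The file names (pos.fq, neg.fq, clean.fq) and the toy reads are hypothetical, and the input is assumed to be uncompressed with strict 4-line records.

```shell
# Hypothetical working directory and toy inputs (not from the thread).
workdir=fq_seq_demo
mkdir -p "$workdir"
printf '@p1\nACGTACGT\n+\nFFFFFFFF\n@p2\nTTTTCCCC\n+\nFFFFFFFF\n@p3\nGGGGAAAA\n+\nFFFFFFFF\n' > "$workdir/pos.fq"
printf '@n1\nTTTTCCCC\n+\nBBBBBBBB\n' > "$workdir/neg.fq"

# Collect the distinct sequences (2nd line of each record) of the file to subtract.
paste - - - - < "$workdir/neg.fq" | cut -f2 | sort -u > "$workdir/neg_seqs.txt"

# Linearize the file to clean: one record per tab-separated line.
paste - - - - < "$workdir/pos.fq" > "$workdir/pos.tsv"

# First file: load sequences into a lookup table.
# Second file: print only records whose sequence is not in the table.
awk -F'\t' '
  FILENAME == ARGV[1] { contaminant[$1] = 1; next }
  !($2 in contaminant)
' "$workdir/neg_seqs.txt" "$workdir/pos.tsv" \
  | tr '\t' '\n' > "$workdir/clean.fq"

cat "$workdir/clean.fq"
```

With the toy inputs, the @p2 record is dropped because its sequence matches @n1, even though the header and quality string differ. Reads that differ by a single base or by length (as in the example reads above) are not treated as matches by this approach; catching those would need an alignment-based or k-mer-based tool.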

0

I think someone has already asked about this topic; here it is, slightly modified: "I have a fastq file that seems to be contaminated by some sequences contaminating my reagents during library preparation. If I know the reads that came from reagents and I have them in a fastaq format, do you think I can eliminate those reads from my fastq file? I want to remove any reads contaminating my fastq file. How can I work this out?"

0

If you need to remove reads that are contaminants (do you have a reference for both, or at least for the species of interest?), then you can use bbsplit.sh from the BBMap suite like this: C: BBSplit syntax for generating builds for the reference genome and how to call di

0

For this particular application, if you know for sure that reads from file 1 are NOT present in file 2, then merge the two files together and use clumpify.sh to remove ALL duplicates. That should get rid of the reads that are duplicated.

0

Sorry, it's not very clear to me yet. I have merged the forward and reverse files, which were generated from the positive sample. I also have another fastq file, which is the negative control for reagent contamination. Now I would like to remove from the positive sample those reads that are present in the negative sample, so that I finally get a clean fastq file. Can I use clumpify.sh for such a job?

0

I have merged both forward and reverse file

1. Have you merged R1/R2 reads to get a longer single read in place of two reads? OR
2. Just copied the two files together for both positive/negative samples?

I think the best solution is to align file 1 against file 2 and then keep only those reads that do not map.

If you don't know how to do this, then let me know.

0