Question: Subtracting the reads of one FASTAq file from another FASTAq file
16 months ago by
aftabahmad0
aftabahmad0 wrote:

Hi all, I have two fastaq files and I want to subtract the reads of one fastaq file from the other. What command line or software can I use to do that?

sequence next-gen alignment • 555 views
ADD COMMENTlink modified 16 months ago by Pierre Lindenbaum119k • written 16 months ago by aftabahmad0

You probably mean a fastq file; as far as I know, there is no such thing as a 'fastaq' file.

ADD REPLYlink written 16 months ago by WouterDeCoster38k
16 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum119k wrote:
    gunzip -c f1.fq.gz f2.fq.gz | paste - - - - | sort | uniq | tr "\t" "\n" > f3.fq
ADD COMMENTlink written 16 months ago by Pierre Lindenbaum119k

I think this will produce the unique reads across the two files. But say I have two fastq files and I want to remove only the reads that are present in one file, rather than make a combined file of unique reads. What should I do then?

ADD REPLYlink written 4 weeks ago by jeccy.J50

use comm

ADD REPLYlink written 4 weeks ago by Pierre Lindenbaum119k

@Pierre Lindenbaum So should the command look like this? gunzip -c f1.fq.gz f2.fq.gz | paste - - - - | sort | comm | tr "\t" "\n" > f3.fq

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by jeccy.J50
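
One note on the comm suggestion: comm compares two separately sorted files, so it cannot simply replace uniq in a pipeline that concatenates both inputs. A minimal sketch of one way to apply it is below. The function name subtract_fastq is hypothetical, and the sketch assumes uncompressed FASTQ files with exactly four lines per record, no blank lines, and no tab characters inside a record (gzipped files could be decompressed first with gunzip -c).

```shell
# Hypothetical helper: print the records of FASTQ file $1 that do not occur,
# as complete identical records, in FASTQ file $2.
# Assumes 4-line records, no blank lines, no tabs inside a record.
subtract_fastq() {
    tmpdir=$(mktemp -d) || return 1
    # Linearize each 4-line record into one tab-separated line, then sort.
    # LC_ALL=C makes sort and comm agree on the collation order.
    paste - - - - < "$1" | LC_ALL=C sort > "$tmpdir/a.tsv"
    paste - - - - < "$2" | LC_ALL=C sort -u > "$tmpdir/b.tsv"
    # comm -23 suppresses columns 2 and 3, leaving lines found only in a.tsv.
    LC_ALL=C comm -23 "$tmpdir/a.tsv" "$tmpdir/b.tsv" | tr '\t' '\n'
    rm -rf "$tmpdir"
}
```

Usage would be something like: subtract_fastq f1.fq f2.fq > f3.fq. The output comes out in sorted order rather than the original order, and a record only counts as a match when the header, sequence and quality lines are all identical. Because comm pairs lines one to one, a record that occurs twice in the first file and once in the second would still be printed once.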

But say I have two fastq files and I want to remove only the reads that are present in one file, rather than make a combined file of unique reads. What should I do then?

If you are referring to actual sequence identity (and not the full fastq record being duplicated), then one way to do that is with clumpify.sh from the BBMap suite. See this thread: A: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files

ADD REPLYlink written 4 weeks ago by genomax65k

I think that will remove duplicate reads, so it will not work for the question above. Let's say I have one fastq file like this:

    @HISEQ:230:C6G45ANXX:3:1101:1395:2141 1:N:0:ACAGTGGTTGAACCTT
    TGACGGCACTTTCTCTTCCCAACCACGTGGCTGCAGACTTCTTGCTCTCAAGTTGTCCTGACATGCTCTGAGAGCACACA
    +
    BB//<<BFBFFF<FFFFBBB<<<F/FBBB<FF/B<FFFFFFFFFFFFFFBFFFBFB/FBFFB//F//B<FFF</</BF<BBBFFFFF//B<FBFF/7
    @HISEQ:230:C6G45ANXX:3:1101:1498:2162 1:N:0:ACAGTGGTTGAACCTT
    TGACGGCACTTTCTCTTCCCAACCACGTGGCTGCAGACTTCTTGCTCTCAAGTTGTCCTGACATGCTCTGAGAGCACACA
    +
    BBB<B<F<FFFFFFFBFFFFFFBFFFFBFF/F<FFFFBBFFFFFFFFFFBFB/BFFFFFFFFFFFBFFB/<<<FFFFFFFFFFFFFFBFFFF

    @HISEQ:230:C6G45ANXX:3:1101:1395:2141 1:N:0:ACAGTGGTTGAACCTT
    TGACGGCACTTTCTCTTCCCAACCACGTGGCTGCAGCTTGCTCTCAAGTTGTCCTGACATGCTCTGAGAGCACACA
    +
    BB//<<BFBFFF<FFFFBBB<<<F/FBBB<FF/B<FFFFFFFFFFBFFFBFB/FBFFB//F//B<FFF</</BF<BBBFFFFF//B<FBFF/7


and another fastq file like this:

@HISEQ:230:C6G45ANXX:3:1101:1395:2141 1:N:0:ACAGTGGTTGAACCTT
TGACGGCACTTTCTCTTCCCAACCACGTGGCTGCAGACTTCTTGCTCTCAAGTTGTCCTGACATGCTCTGAGAGCACACA
+
BB//<<BFBFFF<FFFFBBB<<<F/FBBB<FF/B<FFFFFFFFFFFFFFBFFFBFB/FBFFB//F//B<FFF</</BF<BBBFFFFF//B<FBFF/7

@HISEQ:230:C6G45ANXX:3:1101:1395:2141 1:N:0:ACAGTGGTTGAACCTT
TGACGGCACTTTCTCTTCCCAACCACGTGTCTTGCTCTCAAGTTGTCCTGACATGCTCTGAGAGCACACA
+
BB//<<BFBFFF<FFFFBBB<<<F/FBBB<FF/B<FFFFBFFFBB/FBFFB//F//B<FFF</</BF<BBBFFFFF//B<FBFF/7

So now say I want to compare these two files: if a read from the 2nd fastq file is found in the 1st fastq file, remove that read; otherwise keep the 1st fastq file as it is.

Do you think clumpify.sh will do that ?

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by jeccy.J50
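
For exact sequence identity (ignoring read names and quality strings), a plain awk filter may be enough for small files. This is only a sketch under assumptions, not a replacement for the tools mentioned in this thread: the function name subtract_by_sequence is hypothetical, both files are assumed to be uncompressed FASTQ with exactly four lines per record and no blank lines (unlike the pasted examples above), and only identical sequences count as a match (no mismatches, no reverse complements).

```shell
# Hypothetical helper: print the records of FASTQ file $1 whose sequence line
# does not occur as a sequence line in FASTQ file $2.
# Assumes 4-line records and no blank lines in either file.
subtract_by_sequence() {
    awk '
        # First argument to awk (the file to subtract): remember every sequence.
        FILENAME == ARGV[1] { if (FNR % 4 == 2) seen[$0] = 1; next }
        # Second argument (the file to filter): hold the header until the
        # sequence line decides whether the record is kept.
        FNR % 4 == 1 { header = $0; next }
        FNR % 4 == 2 {
            keep = !($0 in seen)
            if (keep) { print header; print $0 }
            next
        }
        # Separator and quality lines follow the decision made above.
        keep { print }
    ' "$2" "$1"
}
```

Usage would be something like: subtract_by_sequence positive.fq negative.fq > clean.fq, where the file names are placeholders. All sequences of the second file are held in memory, so this suits a small contaminant file better than a large one.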

I think someone has already asked about this topic, slightly modified here: "I have a fastq file that seems to be contaminated by some sequences contaminating my reagents during library preparation. If I know the reads that came from reagents and I have them in a fastaq format, do you think I can eliminate those reads from my fastq file? I want to remove any reads contaminating my fastq file. How can I work this out?"

ADD REPLYlink written 4 weeks ago by jeccy.J50

If you need to remove reads that are contaminants (do you have a reference for both, or at least for the species of interest?), then you can use bbsplit.sh from the BBMap suite like this: C: BBSplit syntax for generating builds for the reference genome and how to call di

ADD REPLYlink written 4 weeks ago by genomax65k

For this particular application, if you know for sure that reads from file 1 are NOT present in file 2, then merge the two files together and use clumpify.sh to remove ALL duplicates. That should get rid of the reads that are duplicated.

ADD REPLYlink written 4 weeks ago by genomax65k

Sorry, it's not very clear to me yet. I have merged the forward and reverse files, which were generated from the positive sample. I also have another fastq file, a negative control for reagent contamination. Now I would like to remove the reads that are present in the negative sample, so that I finally get a clean fastq file. Can I use clumpify.sh for such a job?

ADD REPLYlink written 4 weeks ago by jeccy.J50

I have merged both forward and reverse file

  1. Have you merged R1/R2 reads to get a longer single read in place of two reads? OR
  2. Just copied the two files together for both positive/negative samples?

I think the best solution is to align file 1 against file 2 and then keep only those reads that do not map.

If you don't know how to do this then let me know.

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by genomax65k

i changed the thread now....

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by jeccy.J50