Question: Union of unaligned fastq reads
1
gravatar for Jeffin Rockey
21 months ago by
Jeffin Rockey1.1k
Karimannoor
Jeffin Rockey1.1k wrote:

Hi,

Raw data is a paired end fastq file.

Aligned it with Genome-1 (using STAR) and got unaligned R1 and Unaligned R2.

Also aligned it with Genome-2 (using STAR) and got unaligned R1 and Unaligned R2.

Please advise what would be the best method/tool to obtain the 'union' fastq of the unaligned reads from the unaligned of both the genome alignments.

Jeffin

rna-seq • 511 views
ADD COMMENTlink modified 21 months ago • written 21 months ago by Jeffin Rockey1.1k
1

Do you really want the union or rather the intersection?

ADD REPLYlink modified 21 months ago • written 21 months ago by cschu1811.5k
1

Union itself is the requirement

ADD REPLYlink written 21 months ago by Jeffin Rockey1.1k
1

Creating the union would be simply combining the unaligned files together? Just need to avoid duplicates.

ADD REPLYlink written 21 months ago by WouterDeCoster37k
1

Assuming unaligned R1/R2 are .bamfiles:

samtools view unaligned.R1.genome1.bam | cut -f 1 > unaligned.R1.txt
samtools view unaligned.R1.genome2.bam | cut -f 1 >> unaligned.R1.txt
grep -A 3 -F -f <(sort -u unaligned.R1.txt) original.R1.fq | grep -v "\-\-" > union.unaligned.R1.fq

Analogously for .R2

ADD REPLYlink modified 21 months ago • written 21 months ago by cschu1811.5k
1

Thanks.

Could you please provide some detail on whats happening in the grep line?

ADD REPLYlink written 21 months ago by Jeffin Rockey1.1k
1
  1. We fgrep (grep -F, i.e. take patterns from file) the original fastq file for all the read ids of the union of unaligned reads.
  2. -f takes the pattern file, which here (because I'm lazy) are given as a process substitution: <(sort -u unaligned.R1.txt) provides the output of the sort operation as a 'file' (the shell pretends this is a file). Alternatively, one could have done the sort -u into another file and used that filename as argument to the -f.
  3. grep -A 3 returns the line with the match + the following 3 lines. However, if there are multiple hits, then they will be separated by --, which is why the output of the first grep is piped into a grep -v, that ignores the -- lines.
ADD REPLYlink written 21 months ago by cschu1811.5k
1

Thanks a lot @cschu181 . Good to learn about the -A functionality in grep :)

ADD REPLYlink written 21 months ago by Jeffin Rockey1.1k
1

Hi, The best way to avoid duplicate fastq entries is the aspect what I am doubtful about.

If I write a small script it would be easily doable.But I wanted to know whether there is better method to combine while keeping duplicates away

ADD REPLYlink modified 21 months ago • written 21 months ago by Jeffin Rockey1.1k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2243 users visited in the last hour