Question

Union of unaligned fastq reads

6.9 years ago

Jeffin Rockey ★ 1.3k

Hi,

The raw data is paired-end fastq (R1 and R2).

I aligned it to Genome-1 (using STAR) and got unaligned R1 and unaligned R2 files.

I also aligned it to Genome-2 (using STAR) and again got unaligned R1 and unaligned R2 files.

Please advise on the best method/tool to obtain the 'union' fastq of the unaligned reads from both genome alignments.

Jeffin

RNA-Seq • 1.7k views

ADD COMMENT • link 6.9 years ago by Jeffin Rockey ★ 1.3k

Do you really want the union or rather the intersection?

ADD REPLY • link 6.9 years ago by cschu181 ★ 2.8k

Union itself is the requirement

ADD REPLY • link 6.9 years ago by Jeffin Rockey ★ 1.3k

Wouldn't creating the union simply mean combining the unaligned files? You would just need to avoid duplicates.

ADD REPLY • link 6.9 years ago by WouterDeCoster 47k

Assuming the unaligned R1/R2 are .bam files:

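# Collect the read IDs (column 1) from both unaligned BAM files
samtools view unaligned.R1.genome1.bam | cut -f 1 > unaligned.R1.txt
samtools view unaligned.R1.genome2.bam | cut -f 1 >> unaligned.R1.txt
# Pull the matching 4-line records; '^--$' drops only grep's group separators, not quality lines containing "--"
grep -A 3 -F -f <(sort -u unaligned.R1.txt) original.R1.fq | grep -v '^--$' > union.unaligned.R1.fq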

Analogously for .R2

ADD REPLY • link 6.9 years ago by cschu181 ★ 2.8k

Thanks.

Could you please provide some detail on what's happening in the grep line?

ADD REPLY • link 6.9 years ago by Jeffin Rockey ★ 1.3k

We fgrep (grep -F, i.e. treat the patterns as fixed strings rather than regular expressions) the original fastq file for all the read IDs in the union of unaligned reads.
-f takes the pattern file, which here (because I'm lazy) is given as a process substitution: <(sort -u unaligned.R1.txt) provides the output of the sort operation as a 'file' (the shell pretends this is a file). Alternatively, one could have written the sort -u output to another file and used that filename as the argument to -f.
grep -A 3 returns the line with the match plus the following 3 lines. However, when the matched records are not adjacent in the file, grep separates the groups with a -- line, which is why the output of the first grep is piped into a grep -v that drops the -- lines.

ADD REPLY • link 6.9 years ago by cschu181 ★ 2.8k
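
For readers who want to see the grep behaviour described above on a small input, here is a minimal sketch. It uses the alternative mentioned in the explanation (writing the sort -u output to a file instead of using process substitution). The toy reads, the file names under the temporary directory, and the variable names workdir and picked are made up for illustration; -A is a widely available grep extension rather than strict POSIX.

```shell
# Hypothetical toy data: three reads; r3's quality string contains "--"
workdir=$(mktemp -d)
printf '%s\n' '@r1' 'ACGT' '+' 'IIII' \
              '@r2' 'GGCC' '+' 'IIII' \
              '@r3' 'TTAA' '+' 'I--I' > "$workdir/original.R1.fq"

# Read IDs with a duplicate, as after concatenating two ID lists
printf '%s\n' 'r3' 'r1' 'r3' > "$workdir/ids.txt"

# Deduplicate into a regular file instead of using process substitution
sort -u "$workdir/ids.txt" > "$workdir/ids.uniq.txt"

# -F: fixed strings, -f: read patterns from a file, -A 3: match plus next 3 lines.
# r1 and r3 are not adjacent, so grep prints a "--" line between the two groups;
# the anchored pattern '^--$' removes that separator but keeps the quality line "I--I".
picked=$(grep -A 3 -F -f "$workdir/ids.uniq.txt" "$workdir/original.R1.fq" | grep -v '^--$')
printf '%s\n' "$picked"

rm -rf "$workdir"
```

Note that -F still matches substrings, so an ID such as r1 would also hit a header like @r10; this is worth checking against the real read names before relying on the output.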

Thanks a lot @cschu181 . Good to learn about the -A functionality in grep :)

ADD REPLY • link 6.9 years ago by Jeffin Rockey ★ 1.3k

Hi, what I am unsure about is the best way to avoid duplicate fastq entries.

It would be easy to do with a small script, but I wanted to know whether there is a better method to combine the files while keeping duplicates out.

ADD REPLY • link 6.9 years ago by Jeffin Rockey ★ 1.3k
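
One possible way to combine two unaligned fastq files while dropping duplicate entries, without writing a separate script, is sketched below with standard tools. It is only a sketch under stated assumptions: every record is exactly 4 lines, no line contains a tab, and the read ID is the header text up to the first space. The file names, toy reads, and the variable names workdir and union are hypothetical.

```shell
# Hypothetical toy inputs standing in for the two unaligned R1 files
workdir=$(mktemp -d)
printf '%s\n' '@r1 1:N' 'ACGT' '+' 'IIII' \
              '@r2 1:N' 'GGCC' '+' 'IIII' > "$workdir/unaligned.genome1.R1.fq"
printf '%s\n' '@r2 1:N' 'GGCC' '+' 'IIII' \
              '@r3 1:N' 'TTAA' '+' 'I--I' > "$workdir/unaligned.genome2.R1.fq"

# 1. paste joins each 4-line record into one tab-separated line
# 2. awk keeps only the first record seen for each read ID
#    (the header up to the first space)
# 3. tr splits the surviving records back into 4 lines each
union=$(cat "$workdir/unaligned.genome1.R1.fq" "$workdir/unaligned.genome2.R1.fq" \
  | paste - - - - \
  | awk -F '\t' '{ split($1, h, " "); if (!seen[h[1]]++) print }' \
  | tr '\t' '\n')
printf '%s\n' "$union"

rm -rf "$workdir"
```

The output keeps all records of the first file followed by the records that appear only in the second file. Running the same pipeline on the R2 files should give the same record order, provided the R1 and R2 inputs list the same reads in the same order, which is worth verifying before using the result as paired input.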