Question

Extracting reads that do not map to a draft genome

0

8.9 years ago

Bioinfonext ▴ 480

I mapped de novo assembled transcripts to the genome using GMAP, but a large number of transcripts still show similarity to the draft genome sequences. My aim is to extract the unique sequences that do not map to the draft genome.

I have 32 paired-end RNA-seq libraries. Can I pool all the raw R1 and R2 reads into two separate files, map them to the draft genome, extract the unmapped R1 and R2 reads into two separate files, and then run a de novo assembly on them?

Please suggest which software I should use for this analysis.

RNA-Seq • 3.0k views

ADD COMMENT • link updated 8.9 years ago by Brian Bushnell 20k • written 8.9 years ago by Bioinfonext ▴ 480

Brian Bushnell · Answer 1 · 2016-12-04

1

8.9 years ago

Brian Bushnell 20k

There are many ways to do this. I suspect you would want to assemble reads in which neither read in a pair maps to the reference, which you can do like this with BBMap:

bbmap.sh in=r1.fq in2=r2.fq ref=reference.fasta outm=mapped.sam outu=unmapped1.fq outu2=unmapped2.fq

Now unmapped reads are in unmapped1.fq and unmapped2.fq. Alternately, you could use reformat.sh and repair.sh on an existing sam file:

reformat.sh in=all.sam out=unmapped.fq unmappedonly
repair.sh in=unmapped.fq out=r1.fq out2=r2.fq outs=singleton.fq

This will give you both unmapped pairs and unmapped singletons.
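For the 32 libraries mentioned in the question, the reads can be pooled before mapping. A minimal sketch, assuming the files sit in one directory and are named `<lib>_R1.fq` / `<lib>_R2.fq` (a hypothetical naming scheme; adjust it to your own files). The point is that R1 and R2 are concatenated in the same library order so that pairs stay in sync:

```shell
#!/usr/bin/env bash
# pool_libraries DIR OUT_R1 OUT_R2
# Concatenate every *_R1.fq in DIR, and its *_R2.fq mate file, in the same
# order so read pairs stay in sync between the two pooled files.
# The *_R1.fq / *_R2.fq naming is an assumption, not a BBMap requirement.
# Write the pooled files outside DIR so they are not picked up by the glob.
pool_libraries() {
    local dir=$1 out1=$2 out2=$3 r1 r2
    : > "$out1"
    : > "$out2"
    for r1 in "$dir"/*_R1.fq; do
        [ -e "$r1" ] || continue          # glob matched nothing
        r2=${r1%_R1.fq}_R2.fq
        if [ ! -e "$r2" ]; then
            echo "missing mate file for $r1" >&2
            return 1
        fi
        cat "$r1" >> "$out1"
        cat "$r2" >> "$out2"
    done
}

# Example (not run here):
#   pool_libraries raw_reads r1.fq r2.fq
```

The pooled r1.fq and r2.fq can then be passed to the bbmap.sh command above.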

ADD COMMENT • link 8.9 years ago by Brian Bushnell 20k

1

Hi Brian,

Thanks for helping me.

I have downloaded BBMap. Before running the command, do I need to index the reference genome?

ADD REPLY • link 8.9 years ago by Bioinfonext ▴ 480

1

The command I gave will do indexing first, then map. Alternately you could do it in two steps, like this:

bbmap.sh ref=reference.fasta
bbmap.sh in=r1.fq in2=r2.fq outm=mapped.sam outu=unmapped1.fq outu2=unmapped2.fq

Either way gives the same result.
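Because the index only needs to be built once, the two-step form is convenient when several libraries are mapped in turn. A minimal sketch, assuming libraries are named `<lib>_R1.fq` / `<lib>_R2.fq` (a hypothetical naming scheme) and that `bbmap.sh ref=reference.fasta` has already been run in the working directory. Setting `BBMAP=echo` prints the arguments instead of running the mapper:

```shell
#!/usr/bin/env bash
# Mapper executable; override with BBMAP=echo for a dry run that only
# prints the arguments that would be passed.
BBMAP=${BBMAP:-bbmap.sh}

# map_library LIB
# Map one paired-end library against the index already built in the
# working directory. File names derived from LIB are an assumed
# convention, not something BBMap requires.
map_library() {
    local lib=$1
    "$BBMAP" in="${lib}_R1.fq" in2="${lib}_R2.fq" \
        outm="${lib}_mapped.sam" \
        outu="${lib}_unmapped1.fq" outu2="${lib}_unmapped2.fq"
}

# Example (not run here):
#   bbmap.sh ref=reference.fasta          # index once
#   for lib in lib01 lib02; do map_library "$lib"; done
```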

ADD REPLY • link 8.9 years ago by Brian Bushnell 20k

1

Hi Brian

Thanks

I got the result below, and I have some queries:

1) How many mismatches are allowed during mapping, and is there an option to adjust this?
2) How does it map splice variants?

Pairing data:           pct reads       num reads       pct bases          num bases

mated pairs:             79.8502%         8232674        79.8502%         2058168500
bad pairs:                1.2823%          132203         1.2823%           33050750
insert size avg:          264.21


Read 1 data:            pct reads       num reads       pct bases          num bases

mapped:                  83.9105%         8651295        83.9105%         1081411875
unambiguous:             71.2570%         7346701        71.2570%          918337625
ambiguous:               12.6535%         1304594        12.6535%          163074250
low-Q discards:           0.0165%            1702         0.0165%             212750

perfect best site:       27.3360%         2818387        27.3360%          352298375
semiperfect site:        27.4375%         2828843        27.4375%          353605375
rescued:                  8.0776%          832816

Match Rate:                   NA               NA        66.6516%         1057771997
Error Rate:              67.3220%         5824231        33.3197%          528789715
Sub Rate:                54.7312%         4734961         1.2152%           19285478
Del Rate:                30.2153%         2614018        31.8588%          505604634
Ins Rate:                 8.2683%          715313         0.2457%            3899603
N Rate:                   0.3170%           27421         0.0287%             454797

ADD REPLY • link updated 8.9 years ago by GenoMax 154k • written 8.9 years ago by Bioinfonext ▴ 480

1

BBMap does not have a specific mismatch number. To quote @Brian from a recent answer:

The default is roughly 76% identity. You can adjust this with the "minid" flag (e.g. "minid=0.80" for 80% identity.) If you want to restrict alignments to a maximum number of substitutions, you can use "subfilter"; e.g., "subfilter=5" will discard alignments with more than 5 substitutions.

Splice variants would be mappable based on the setting used for (maxindel and intronlen).
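As an illustration of the two flags quoted above, here is a small sketch that keeps the thresholds in one place and expands them on the command line. The helper name `strict_flags` and the two variable names are made up for this example, and the values are the ones from the quote, not recommendations. Raising minid will generally send more reads to the unmapped output files:

```shell
#!/usr/bin/env bash
# Thresholds taken from the quoted answer; tune them for your own data.
MIN_ID=0.80      # approximate minimum identity (BBMap default is roughly 0.76)
MAX_SUBS=5       # discard alignments with more than this many substitutions

# Print the two flags in the key=value form that bbmap.sh expects.
strict_flags() {
    printf 'minid=%s subfilter=%s\n' "$MIN_ID" "$MAX_SUBS"
}

# Example (not run here; file names as in the answer above):
#   bbmap.sh in=r1.fq in2=r2.fq ref=reference.fasta $(strict_flags) \
#       outm=mapped.sam outu=unmapped1.fq outu2=unmapped2.fq
```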

ADD REPLY • link 8.9 years ago by GenoMax 154k

0

Splice variants would be mappable based on the setting used for (maxindel and intronlen).

Hmmm, that was my mistake, for some reason I neglected to mention maxindel. When mapping RNA-seq data to a genome, maxindel is a useful flag to adjust; the default (16000) is fine for fungi and many plants, which have short introns, but for things like mammals, which have long introns, I suggest adding the flags "maxindel=400000 intronlen=10". That allows mapping across introns of up to around 400 kbp or so. "intronlen" is normally unnecessary but may affect some downstream programs.
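A short sketch of how that choice could be wrapped in a script. The helper name `intron_flags` and its `long` / `short` labels are invented for this example; the flag values are the ones given above:

```shell
#!/usr/bin/env bash
# intron_flags long|short
# Print the intron-related BBMap flags for a genome class.
intron_flags() {
    case $1 in
        long)  echo "maxindel=400000 intronlen=10" ;;  # e.g. mammals
        short) echo "maxindel=16000" ;;                # the default; fungi, many plants
        *)     echo "usage: intron_flags long|short" >&2; return 1 ;;
    esac
}

# Example (not run here; file names as in the answer above):
#   bbmap.sh in=r1.fq in2=r2.fq ref=reference.fasta $(intron_flags long) \
#       outm=mapped.sam outu=unmapped1.fq outu2=unmapped2.fq
```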

ADD REPLY • link 8.9 years ago by Brian Bushnell 20k

0

Thanks,

I want to map the raw reads to CDS instead of the genome. In that case, should I also run this with the default settings?

ADD REPLY • link 8.9 years ago by Bioinfonext ▴ 480

0

Yes, for mapping RNA-seq reads to transcripts default settings are fine. You may want to add "ambig=all" because some transcripts have multiple isoforms.

ADD REPLY • link 8.9 years ago by Brian Bushnell 20k