Question: denovo assembly of unmapped reads
gravatar for ashish
2.7 years ago by
ashish430 wrote:

I have RNA-seq data of a wheat translocation line which has its chromosome short arm replaced by another wild variety chromosome short arm. There is no genomic or transcriptomic information available for that wild variety. I want to see what kind of genes are expressing from the new short arm. For this I was thinking if I can do denovo assembly of the unmapped reads and then blast the assembled transcripts with nr database. I want to know if this strategy is correct or not or if there is a better way to do it.

update: number of unmapped read pairs is around 8 million, total reads were around 33 million and I used the following parameters in hisat2

hisat2 -x wheat_hisat_index -1 13R_R1.fastq -2 13R_R2.fastq -S 13.sam -p 12 --mp 8,4 --rdg 7,5 --rfg 7,5 --un-conc 13_unmapped_pairs
rna-seq • 1.6k views
ADD COMMENTlink modified 2.7 years ago • written 2.7 years ago by ashish430

Provided you have an adequate amount of reads left over to do a reasonable assembly. What fraction of data has been left over? You are only going to capture genes that are unique for that short arm.

ADD REPLYlink modified 2.7 years ago • written 2.7 years ago by genomax85k

I also have the same doubt. Currently the job is running, Soon I will update the number of unmapped reads left and my aim is to capture only the unique genes.

ADD REPLYlink written 2.7 years ago by ashish430

I've updated the question.

ADD REPLYlink written 2.7 years ago by ashish430

Good point, I gave you my answer below but it holds only as long as @genomax's point is fulfilled.

ADD REPLYlink written 2.7 years ago by Macspider3.0k

To the updated question:

your mismatch penalty seems to be quite high compared to what the threshold for alignment scores is if you don't touch the scoring function.

Increasing penalties for gaps and mismatches obviously results in more unmapped reads.

ADD REPLYlink written 2.7 years ago by Macspider3.0k

I did not understand the --score-min parameter very well from the manual. Can you suggest me some better values?

ADD REPLYlink written 2.7 years ago by ashish430

There is no such thing as "better values". Parameters have to be chosen according to your data. From the manual:

--score-min <func> : Sets a function governing the minimum alignment score needed for an alignment to be onsidered "valid" (i.e. good enough to report). This is a function of read length. For instance, specifying L,0,-0.6 sets the minimum core function f to f(x) = 0 + -0.6 * x, where x is the read length. See also: [setting function options]. The default is L,0,-0.2.

What is your maximum read length after trimming? Let's say it's 100 nt for example. In this function, this would be multiplied by the third parameter, and after the multiplication the second parameter will be added. The final value is the threshold for the alignment score that is applied to keep or discard a mapping result of a read. That score is negative and every time you have a mismatch or a gap the algorithm will subtract the penalties that you specified for gaps and mismatches.

I think you should calculate the alignment score threshold in a way that you get what you want. For example, if you want to call variants between strains of the same species, you might allow very few mismtaches because you know that the genomes are very similar. If you're mapping reads from different species to a reference, you might allow more gaps/mismtaches because biologically speaking they're expected.

ADD REPLYlink written 2.7 years ago by Macspider3.0k

Thanks Macspider, I got it.

ADD REPLYlink written 2.7 years ago by ashish430
gravatar for Macspider
2.7 years ago by
Vienna - BOKU
Macspider3.0k wrote:

I think the best way to proceed would be to:

  • map your RNASeq reads against the wheat genome with stringent criteria with hisat2 or bowtie2 using:
  --mp <int>,<int>   max and min penalties for mismatch; lower qual = lower penalty <6,2>
  --rdg <int>,<int>  read gap open, extend penalties (5,3)
  --rfg <int>,<int>  reference gap open, extend penalties (5,3)
  --score-min <func> min acceptable alignment score w/r/t read length
  • save the unmapped reads to a file using:
  --un <path>           write unpaired reads that didn't align to <path>
  --al <path>           write unpaired reads that aligned at least once to <path>
  --un-conc <path>      write pairs that didn't align concordantly to <path>
  --al-conc <path>      write pairs that aligned concordantly at least once to <path>
  • assemble those unmapped reads with trinity or any other RNASeq read assembler
  • run a quantification method
  • run a differential expression method :)
ADD COMMENTlink written 2.7 years ago by Macspider3.0k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1754 users visited in the last hour