Question

Kmer content failed in FastQC analysis

0

6.8 years ago

josmantorres ▴ 10

Hello,

I sent my RNA-seq samples to Macrogen, who sequenced them on a NovaSeq (Illumina platform). I asked for 20M paired-end reads of 150 bp. I ran FastQC on the raw reads, and some samples showed strange Kmer Content graphs, with specific k-mers increasing from position 90 onwards. Trimming (with Trimmomatic) removed 15-25% of the reads in those libraries. Finally, I mapped the reads against the genome and only 50% of them mapped. The genome sequence is fine, because a colleague used it to map another RNA-seq experiment and got 80% mapped reads. So I think there was a problem during library preparation or sequencing.

Here is the link to the image: https://www.dropbox.com/s/4npf8zr929mhqep/kmer_profiles.png?dl=0

Thanks for your help

Best wishes,

Jose

rna-seq illumina • 3.2k views

ADD COMMENT • link 6.8 years ago by josmantorres ▴ 10

2

Please don't worry about failing k-mer content results in FastQC. This has caused so much confusion in the past that, as far as I know, the test is now turned off by default in the latest version of FastQC.

Having only 50% of your reads map may be due to other issues. You could have contamination in your data that does not belong to the genome you are working with. These possibilities need to be investigated separately. As a start, take a sample of the reads that don't map and use BLAST at NCBI to see what genome they match.
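
For anyone wanting to do the sampling step programmatically, here is a minimal sketch, assuming the unmapped reads have already been exported by the aligner as a plain 4-line-per-record FASTQ. The function names and the file name in the comment are hypothetical. It draws a random subset with reservoir sampling and formats it as FASTA, which can then be pasted into the NCBI BLAST web form.

```python
import random
from itertools import islice


def sample_fastq_records(lines, n=100, seed=0):
    """Reservoir-sample up to n (name, sequence) pairs from FASTQ lines.

    Assumes strict 4-line records: header, sequence, '+', qualities.
    """
    rng = random.Random(seed)
    it = iter(lines)
    reservoir = []
    index = 0
    while True:
        chunk = list(islice(it, 4))
        if len(chunk) < 4:
            break  # end of input (or a truncated final record)
        record = (chunk[0][1:].strip(), chunk[1].strip())
        if len(reservoir) < n:
            reservoir.append(record)
        else:
            # Keep each later record with probability n / (index + 1).
            j = rng.randint(0, index)
            if j < n:
                reservoir[j] = record
        index += 1
    return reservoir


def to_fasta(records):
    """Format (name, sequence) pairs as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)


# Usage (hypothetical file name):
# with open("unmapped_R1.fastq") as handle:
#     print(to_fasta(sample_fastq_records(handle, n=50)), end="")
```

A few dozen reads are usually enough to see whether one organism dominates the hits.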

ADD REPLY • link 6.8 years ago by GenoMax 152k

0

Thanks a lot for your answer.

As you suggested, I ran BLAST at NCBI with the unmapped reads and they hit Triatoma virus, which is present in the insect I sequenced (https://en.wikipedia.org/wiki/Triatoma_virus). I thought that this type of material would be eliminated during library preparation.

All the best

ADD REPLY • link 6.8 years ago by josmantorres ▴ 10

0

Now you know!

If you have no use for those reads, you could separate them from the reads for your genome of interest, or they will simply not map when you analyze the data. Be on the lookout for varying viral levels across your samples, since that affects how much of the data actually belongs to your genome of interest.

ADD REPLY • link 6.8 years ago by GenoMax 152k

0

Thanks for your answer,

Since Macrogen did the sequencing, and I paid for 20M reads but only 7.5M mapped against the genome, I would like to know whether the presence of viral genetic material is normal in RNA-seq experiments, or whether something could have gone wrong during library preparation or sequencing.

All the best.

ADD REPLY • link 6.8 years ago by josmantorres ▴ 10

0

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized.

What kind of virus is that? RNA or DNA? In the first case it may be valid to have reads coming from viral RNA (it would depend on the method used to prep the libraries). In the second case you may have ended up with DNA contamination, and then you should be wary that you may also have contamination from insect DNA.

Have you looked at the alignments (is the insect genome available and annotated?) and made sure they are reasonable, that is, that reads pile up under exons and not all over the genome? If you see them all over, then you could definitely have DNA contamination.
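
As a rough numerical companion to looking at the alignments by eye, the sketch below estimates the fraction of alignments that overlap any annotated exon. It assumes the coordinates have already been pulled out of the alignment file and the annotation as half-open intervals; the function names and the example coordinates are hypothetical.

```python
from bisect import bisect_right


def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def exonic_fraction(alignments, exons_by_chrom):
    """Fraction of (chrom, start, end) alignments overlapping any exon."""
    merged = {c: merge_intervals(iv) for c, iv in exons_by_chrom.items()}
    starts = {c: [s for s, _ in iv] for c, iv in merged.items()}
    hits = 0
    total = 0
    for chrom, start, end in alignments:
        total += 1
        ivs = merged.get(chrom)
        if not ivs:
            continue  # no annotated exons on this sequence
        # Last merged exon that starts before the alignment ends.
        i = bisect_right(starts[chrom], end - 1) - 1
        if i >= 0 and ivs[i][1] > start:
            hits += 1
    return hits / total if total else 0.0


# Hypothetical example coordinates.
exons = {"chr1": [(100, 200), (150, 250), (400, 500)]}
reads = [("chr1", 90, 110), ("chr1", 250, 300), ("chr1", 499, 520), ("chr2", 10, 60)]
print(exonic_fraction(reads, exons))
```

A low exonic fraction in a poly-A library would point towards reads spread all over the genome, but it also depends on how complete the annotation is, so treat it as a hint rather than a test.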

Edit: @Michael confirms RNA virus below. Depending on the experiment you were trying to do, you may have to sequence more to get more insect data, or devise a strategy to get rid of viral RNA during sample prep for new libraries.

ADD REPLY • link 6.8 years ago by GenoMax 152k

2

Triatoma virus has a positive-sense, single-stranded RNA genome that functions like an mRNA molecule so it can be directly translated by host cell machinery. Excluding the poly-A tail, the genome of TrV is 9010 nucleotides long. With the poly-A tail, the genome is approximately 10 kb long.

So if the OP ordered poly-A enriched RNA-seq, he pretty much got what was paid for. The viral RNA would be sequenced in the same way as host RNA. At the point of library prep the viral RNA is indistinguishable from any other transcript. It is normal to find RNA viruses in arthropod RNA-seq. However, your viral load seems to be quite high.

ADD REPLY • link 6.8 years ago by Michael 56k

0

Thanks a lot for your help and information!

It was very useful.

As Michael said, the viral load seems very high to me. Is it possible that the presence of virus sequences explains why only 50% of the reads mapped against the genome? It is difficult to understand.

Thanks again for your time and help

All the best, Jose

ADD REPLY • link 6.8 years ago by josmantorres ▴ 10

1

You could align your data to the virus genome. If the % of reads aligned there, added to the % that aligned to the insect genome, roughly adds up to 100% (it could go over), then yes.

Don't think of viral reads as eating into your total yield. You got N reads out of which 50% appear to be viral.
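
The bookkeeping behind this check can be sketched as follows, using the numbers quoted in this thread. Assumptions: trimming loss is taken at the upper end of the reported 15-25%, "20M" is treated as the count the mapping rate refers to, and the 5% tolerance and the function names are arbitrary choices for illustration. No viral alignment rate has been reported here, so none is filled in.

```python
def read_budget(total_reads, trimmed_fraction, host_mapped):
    """Relate host-mapped reads to the reads that survived trimming."""
    surviving = total_reads * (1.0 - trimmed_fraction)
    return {
        "surviving": surviving,
        "host_fraction": host_mapped / surviving,
        "unexplained_fraction": (surviving - host_mapped) / surviving,
    }


def accounts_for_gap(host_fraction, viral_fraction, tolerance=0.05):
    """True if host + viral alignment rates roughly reach (or exceed) 100%."""
    return host_fraction + viral_fraction >= 1.0 - tolerance


# Numbers from the thread: 20M reads ordered, 7.5M mapped to the insect genome.
# trimmed_fraction=0.25 is an assumption (upper end of the reported range).
budget = read_budget(total_reads=20_000_000, trimmed_fraction=0.25, host_mapped=7_500_000)
print(budget)

# Once the reads are aligned to the virus genome, plug the measured rate in:
# accounts_for_gap(budget["host_fraction"], measured_viral_fraction)
```

With those inputs, 15M reads survive trimming and 7.5M of them map to the host, which matches the reported 50% mapping rate. The sum can exceed 100% because some reads may align to both references.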

ADD REPLY • link 6.8 years ago by GenoMax 152k

0

We have simply included the genomes of the two most common rhabdoviruses in our STAR genome index and added them to the GFF for counting as well.
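
One way such a combined reference could be prepared is sketched below. This is not necessarily how it was done in the setup described above. It concatenates host and viral FASTA records while refusing clashing sequence names, and writes a single gene/mRNA/exon spanning the whole viral genome as GFF3 so that viral reads get counted. The sequence name "TrV", the gene ID and the "manual" source tag are hypothetical; the 9010 nt length is the figure quoted earlier in this thread.

```python
def fasta_ids(lines):
    """Sequence identifiers: the first word of each non-empty '>' header."""
    return [
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and line[1:].strip()
    ]


def merge_fasta(host_lines, viral_lines):
    """Concatenate two FASTA line lists, refusing clashing sequence names."""
    clash = set(fasta_ids(host_lines)) & set(fasta_ids(viral_lines))
    if clash:
        raise ValueError(f"duplicate sequence names: {sorted(clash)}")
    merged = [line.rstrip("\n") + "\n" for line in host_lines]
    merged += [line.rstrip("\n") + "\n" for line in viral_lines]
    return merged


def whole_genome_gff3(seqid, length, gene_id):
    """GFF3 rows for one gene/mRNA/exon covering positions 1..length."""
    def row(feature_type, attributes):
        return "\t".join(
            [seqid, "manual", feature_type, "1", str(length), ".", "+", ".", attributes]
        )
    return [
        row("gene", f"ID={gene_id}"),
        row("mRNA", f"ID={gene_id}.t1;Parent={gene_id}"),
        row("exon", f"ID={gene_id}.t1.e1;Parent={gene_id}.t1"),
    ]


# Hypothetical names; 9010 nt is the TrV genome length excluding the poly-A tail.
for gff_row in whole_genome_gff3("TrV", 9010, "TrV_genome"):
    print(gff_row)
```

The merged FASTA and the extended annotation would then be used to rebuild the aligner index, so that viral reads are assigned to the virus instead of ending up unmapped.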

ADD REPLY • link 6.8 years ago by Michael 56k

0

Viral infection could alter the host gene expression pattern. In L. salmonis we normally have less than 5% of reads that can be assigned to the two known RNA viruses, L. salmonis rhabdovirus No. 9 & No. 127 (no idea why they are given these numbers).

Given that your virus is a pathogen for its host, I would simply say your insects are probably very 'sick'. If you need a virus-free strain, RNA viruses can potentially be removed by RNAi knock-down: https://www.nature.com/articles/s41598-017-14282-3

ADD REPLY • link 6.8 years ago by Michael 56k