Question

Odd Fastq sequences; 50% Aligned more than once

3.7 years ago

pjferrandi • 0

Hello all. I'm REALLY trying to get DEG analysis done for my PI with essentially no experience and no guidance. I'm using Galaxy and attempting to assemble the fastq files received via sequencing and am getting very strange results. FastQC shows very odd GC%:

The raw read file itself looks odd to me; please see:

The major thing I notice is the same small sequence followed by a + and then the unique sequence directly after. I do not see this in other sets of files that seem to be working very well. I've searched Google with every combination of search terms I can think of to understand what this is and how to solve it. I would GREATLY appreciate help from anyone, as I am under a lot of pressure to get this done on my own. Thank you very much.

RNA-Seq assembly alignment sequence • 1.1k views

ADD COMMENT • link 3.7 years ago by pjferrandi • 0

If you're struggling with how to add images, see this Biostars post: How to add images to a Biostars post

ADD REPLY • link 3.7 years ago by lieven.sterck 15k

Indeed, these were small RNA-seq data. My PI is in the process of getting the correct sequence files, which will hopefully pan out nicely for us. Thanks to everyone for your help!

ADD REPLY • link 3.7 years ago by pjferrandi • 0

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.

ADD REPLY • link 3.7 years ago by GenoMax 141k

score 2 · Answer 1 · 2020-08-04

same small sequence following by a + and then the unique sequence directly after.

There is nothing wrong with the sequences. The line starting with "@" is the header; it ends with the Illumina barcodes (or indexes). The line after it is the actual read, the "+" line is a separator, and the line after the "+" holds the base qualities, one character per base.
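
To make the four-line layout concrete, here is a minimal Python sketch that reads records from a FASTQ file. The function name `read_fastq`, the example header, and the example read are made up for illustration; pulling the index out with `rsplit(":")` assumes the common Illumina header layout in which the index is the last colon-separated field, which may not hold for every file.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    header: str    # text after the leading "@"
    sequence: str  # the read itself
    quality: str   # one quality character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a plain 4-line-per-read FASTQ file (hypothetical helper)."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a blank line): stop reading
            return
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"Malformed FASTQ record near: {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"Sequence/quality length mismatch in: {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Made-up record: header, 20-base read, "+" separator, 20 quality characters.
example = "\n".join([
    "@READ_1 1:N:0:ATCACG",
    "ACGTACGTACGTACGTACGT",
    "+",
    "I" * 20,
    "",
])

records = list(read_fastq(io.StringIO(example)))
for rec in records:
    # Assumption: the index is the last colon-separated field of the header.
    index = rec.header.rsplit(":", 1)[-1]
    print(rec.header, len(rec.sequence), index)
```

Seen this way, the "+" and the line after it are part of every read's record, not an extra sequence.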

This is probably miRNA, not RNA-seq. For example, googling one of the sequences from the picture returned the paper "Effect of microRNA-1 on hepatocellular carcinoma tumor endothelial cells". This would also explain the strange GC content plots: you probably have two different miRNAs that are very highly expressed and that differ in %GC.
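
One way to check this guess is to count how often each distinct read occurs and compute the GC fraction of the most common ones. The sketch below is only an illustration: the function names and the toy reads are invented, and on real data the `reads` list would be filled with the sequence lines from the FASTQ file.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in a sequence (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def summarize_reads(
    sequences: Iterable[str], top_n: int = 5
) -> List[Tuple[str, int, float, float]]:
    """Return (sequence, count, share of all reads, GC fraction) for the top reads."""
    counts = Counter(sequences)
    total = sum(counts.values())
    return [
        (seq, n, n / total, gc_fraction(seq))
        for seq, n in counts.most_common(top_n)
    ]


# Toy data (made up): two dominant sequences with very different GC content.
reads = ["AAAATTTTAA"] * 6 + ["GGGCCCGGGC"] * 3 + ["ACGTACGTAC"] * 1

summary = summarize_reads(reads, top_n=3)
for seq, n, share, gc in summary:
    print(f"{seq}\tcount={n}\tshare={share:.0%}\tGC={gc:.0%}")
```

If a handful of sequences account for most of the reads and their GC fractions differ, a per-read GC plot with separate sharp peaks is the expected result and does not by itself indicate a sequencing problem.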

edit:

I'm using Galaxy and attempting to assemble the fastq files received via sequencing

These reads are too short to assemble; you want to map them to a reference genome instead.

score 1 · Answer 2 · 2020-08-04

3.7 years ago

swbarnes2 14k

That fastq file looks fine if you only wanted to read 20 bases. Those QC images look normal if what you have is a whole lot of a few sequences over and over again.

I'm pretty sure it's not RNA sequence. I'd say it's index sequence, except that it doesn't match the index indicated in the name line of the reads.

ADD COMMENT • link 3.7 years ago by swbarnes2 14k

Lol, I really wish I understood what you mean. Should I contact the sequencing company for clarification on what these files actually are? The other set of files I'm comparing these to, I downloaded from SRA and they work perfectly when I run them through my Galaxy workflow...I'm not exactly sure why our files are so odd.

ADD REPLY • link 3.7 years ago by pjferrandi • 0

You should contact your PI and ask about the experiment details; the sequencing seems fine.

ADD REPLY • link 3.7 years ago by h.mon 35k