Question: Odd Fastq sequences; 50% Aligned more than once
pjferrandi wrote, 6 weeks ago:

Hello all. I'm REALLY trying to get DEG analysis done for my PI with essentially no experience and no guidance. I'm using Galaxy and attempting to assemble the fastq files received via sequencing and am getting very strange results. FastQC shows a very odd GC%: [images: gcpercentage, sequencepercnt]

The raw read file itself looks odd to me; please see: [image: sequences]

The major thing I notice is the same small sequence following by a + and then the unique sequence directly after. I do not see this in other sets of files that seem to be working very well. I've searched google with every combination of search terms that I can think of to understand what this is and how to solve it. I will be GREATLY appreciative to anyone who can help me, as I have a lot of pressure to get this done without any help. Thank you very much.


If you're struggling with how to add images, see this Biostars post: How to add images to a Biostars post

written 6 weeks ago by lieven.sterck

Indeed, these were small RNA-seq data. My PI is in the process of getting the correct sequence files, which will hopefully pan out nicely for us. Thanks to everyone for your help!

written 6 weeks ago by pjferrandi

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.

written 6 weeks ago by genomax

h.mon (Brazil) wrote, 6 weeks ago:

same small sequence following by a + and then the unique sequence directly after.

There is nothing wrong with the sequences. The line starting with "@" is the header; it ends with the Illumina barcodes (or indexes). The next line is the actual read, the "+" line is a separator indicating that the base qualities follow, and the line after it holds the base qualities, one character per base.
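To make the four-line layout concrete, here is a minimal Python sketch of a FASTQ reader. It assumes an uncompressed file with exactly four lines per record (no wrapped sequences); the names `FastqRecord` and `read_fastq` and the example record are made up for illustration and are not taken from your file.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    header: str    # line 1, without the leading "@"
    sequence: str  # line 2, the read itself
    quality: str   # line 4, one quality character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a four-lines-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (a blank line is also treated as the end)
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near: {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in: {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Invented example record: header, read, "+" separator, qualities.
example = io.StringIO(
    "@INSTR:1:FLOWCELL:1:1101:1000:2000 1:N:0:ACGTAC\n"
    "ACGTACGTACGTACGTACGT\n"
    "+\n"
    + "I" * 20 + "\n"
)
records = list(read_fastq(example))
for rec in records:
    print(rec.header, len(rec.sequence), "bases")
```

If every record in your file passes these checks, the "+" lines you noticed are just the normal separator and not a sign of a corrupted file.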

This is probably miRNA, not RNAseq. For example, googling one of the sequences from the picture returned the paper "Effect of microRNA-1 on hepatocellular carcinoma tumor endothelial cells". This also explains the strange GC content plots: you probably have two different miRNAs that are very highly expressed, and they have different %GC.
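One way to check this idea yourself is to count how often each distinct read occurs and compute its GC content. The sketch below is only an illustration on invented toy reads; `gc_percent` and `summarize` are hypothetical helper names, and the two sequences merely stand in for two dominant miRNAs.

```python
from collections import Counter


def gc_percent(sequence: str) -> float:
    """Percentage of G and C bases in a read (0.0 for an empty read)."""
    if not sequence:
        return 0.0
    seq = sequence.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def summarize(reads):
    """Return (sequence, count, GC%) per distinct read, most abundant first."""
    counts = Counter(reads)
    return [(seq, n, gc_percent(seq)) for seq, n in counts.most_common()]


# Toy data: two invented sequences with very different GC content.
toy_reads = ["AATTAATTGC"] * 6 + ["GGCCGGCCAT"] * 4
summary = summarize(toy_reads)
for seq, n, gc in summary:
    print(f"{seq}\t{n}\t{gc:.0f}% GC")
```

If a handful of sequences account for most of the reads and their GC values differ, the per-sequence GC plot will show separate sharp peaks instead of one smooth curve.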

edit:

I'm using Galaxy and attempting to assemble the fastq files received via sequencing

These reads are too short; you want to map them to a reference genome, not assemble them.


My mistake! I am attempting to map to a reference via HISAT2 and/or STAR...for now I've just attempted HISAT2 to the reference mouse genome. Following this, I used htseq-count to get the read counts for each identified gene/transcript. This worked perfectly fine on the dataset I pulled from SRA (not our dataset, which was given to me by my PI).

I think you're correct that these are miRNA reads and not RNAseq. It makes a lot of sense, because I followed through with the downstream steps (mapping, counts, annotating) and when I searched most of the Ensembl IDs, they were miRNA. There were some protein-coding genes found, though. I'll ask him to be sure he sent me the correct files. Thanks for your help.
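Rather than searching each ID by hand, the same check could be scripted. This is a rough sketch, assuming an Ensembl-style GTF whose "gene" lines carry `gene_id` and `gene_biotype` attributes, and a two-column, tab-separated htseq-count table whose summary rows start with `__`. The function names and the toy IDs (GENE_A, GENE_B) are made up for illustration.

```python
import re
from collections import Counter

# Matches GTF attributes of the form: key "value";
ATTR = re.compile(r'(\S+) "([^"]*)"')


def biotypes_from_gtf(gtf_lines):
    """Map gene_id -> gene_biotype using the 'gene' lines of a GTF."""
    out = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header/comment line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue
        attrs = dict(ATTR.findall(fields[8]))
        if "gene_id" in attrs:
            out[attrs["gene_id"]] = attrs.get("gene_biotype", "unknown")
    return out


def reads_per_biotype(count_lines, biotypes):
    """Sum htseq-count read counts per biotype, skipping '__' summary rows."""
    totals = Counter()
    for line in count_lines:
        if not line.strip():
            continue
        gene_id, count = line.rstrip("\n").split("\t")
        if gene_id.startswith("__"):
            continue
        totals[biotypes.get(gene_id, "unknown")] += int(count)
    return totals


# Toy inputs with invented gene IDs.
toy_gtf = [
    '#!genome-build toy',
    '1\tensembl\tgene\t100\t200\t.\t+\t.\tgene_id "GENE_A"; gene_biotype "miRNA";',
    '1\tensembl\tgene\t500\t900\t.\t-\t.\tgene_id "GENE_B"; gene_biotype "protein_coding";',
]
toy_counts = ["GENE_A\t90", "GENE_B\t10", "__no_feature\t5"]

totals = reads_per_biotype(toy_counts, biotypes_from_gtf(toy_gtf))
for biotype, n in totals.most_common():
    print(f"{biotype}\t{n}")
```

If most of the counted reads land on miRNA genes, that supports the idea that the library is small RNA and not standard RNAseq.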

written 6 weeks ago by pjferrandi

swbarnes2 (United States) wrote, 6 weeks ago:

That fastq file looks fine if you only wanted to read 20 bases. Those QC images look normal if what you have is a whole lot of a few sequences over and over again.

I'm pretty sure it's not RNA sequence. I'd say it's index sequence, except that it doesn't match the index indicated in the name line of the reads.
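That comparison can be sketched in a few lines of Python. This assumes the common Illumina header layout in which the part after the space ends with the index, e.g. `1:N:0:ACGTAC`, with dual indexes joined by "+". The helper names and example headers are invented, and reverse complements are not checked.

```python
def header_indexes(header: str) -> list:
    """Return the index sequence(s) from the last field of the header comment."""
    parts = header.split(" ", 1)
    if len(parts) < 2:
        return []  # no comment part, so no index recorded in the header
    last_field = parts[1].rsplit(":", 1)[-1]
    return [idx for idx in last_field.split("+") if idx]


def read_contains_index(header: str, sequence: str) -> bool:
    """True if any index named in the header occurs inside the read."""
    return any(idx in sequence for idx in header_indexes(header))


# Invented examples: the first read contains its index, the second does not.
hdr = "@INSTR:1:FLOWCELL:1:1101:1000:2000 1:N:0:ACGTAC"
hit = read_contains_index(hdr, "TTACGTACTT")
miss = read_contains_index(hdr, "GGGGGGGGGG")
print(header_indexes(hdr), hit, miss)
```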


Lol, I really wish I understood what you mean. Should I contact the sequencing company for clarification on what these files actually are? The other set of files I'm comparing these to, I downloaded from SRA and they work perfectly when I run them through my Galaxy workflow...I'm not exactly sure why our files are so odd.

written 6 weeks ago by pjferrandi

You should contact your PI and ask him for the experiment details; the sequencing seems fine.

written 6 weeks ago by h.mon