What does the result show in HISAT?
6.2 years ago
XBria ▴ 90

Hi,

I am aligning a full paired-end sample to chromosome X only (human). The results show that fewer than 5 percent of the reads align uniquely and more than 95 percent do not align at all.

Does this seem correct, or am I downsampling my data the wrong way?

To downsample, I first run wc -l on the forward file and on the reverse file of the sample.

Then I divide each line count by two. Next, I run head -n <the result of dividing> full_sample > sub_sample for each of the forward and reverse files. Then I align them to chromosome X.
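
A FASTQ record spans four lines, so a line count passed to head -n only keeps whole records when it is a multiple of four. The arithmetic can be sketched as below. This is a minimal illustration that assumes strict 4-line records (no wrapped sequences); the function name `head_lines_for_fraction` is hypothetical and the line count in the example is made up.

```python
def head_lines_for_fraction(total_lines: int, fraction: float) -> int:
    """Return a line count for `head -n` that keeps only whole FASTQ records.

    Assumes every record is exactly 4 lines (header, sequence, '+', quality).
    """
    if total_lines % 4 != 0:
        raise ValueError("line count is not a multiple of 4; file may be truncated")
    total_records = total_lines // 4
    kept_records = int(total_records * fraction)  # round down to whole records
    return kept_records * 4


# Made-up example: a file with 44 lines (11 records), keeping half of it.
print(head_lines_for_fraction(44, 0.5))  # 20 lines, i.e. 5 whole records
```

For paired-end data the same line count would have to be used for the forward and the reverse file, otherwise the mates no longer line up.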

Your immediate reply is appreciated.

Thanks

RNA-Seq mapping • 1.6k views

Haven't we discussed downsampling before in another thread? I think this is a duplicate and can therefore be closed.

Dear Wouter, these are the results I got by following that thread. My question is about the results, which look odd.

You are aligning a full RNA-seq dataset to chrX only and are surprised that you get a low alignment rate?

You are right. We had already discussed this in another thread: C: Sub_sampling Paired_end reads in Fastq.gz format.

I also entirely miss the point of all this downsampling, but that's perhaps not so important.

Dear Wouter,

I am aligning the samples one by one. I got an error saying that read ERR188257.7222000 HWI-962:71:D0PEYACXX:2:1205:3186:5646/1 has more read characters than quality values.

I divided the whole sample by four. The result ended in .5, so I added an extra 0.5. The downsample now has a volume of 14444000. Is this causing the problem, or do I have to re-download the full sample?

Thanks

Downsampling done right should never break individual fastq records. You must have a corrupt fastq record/file.
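
One way to check for a broken record is to walk through the file four lines at a time and compare the sequence and quality lengths. The sketch below is only an illustration: it assumes strict 4-line FASTQ records, and the names `open_fastq` and `first_bad_record` are hypothetical helpers, not part of any aligner or toolkit.

```python
import gzip


def open_fastq(path):
    """Open a plain or gzip-compressed FASTQ file as text."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def first_bad_record(lines):
    """Return (record_number, reason) for the first malformed record, or None.

    `lines` is any iterable of text lines; 4-line records are assumed.
    """
    it = iter(lines)
    record = 0
    while True:
        header = next(it, None)
        if header is None:
            return None  # reached the end without finding a problem
        if not header.strip():
            continue  # tolerate blank lines between or after records
        record += 1
        seq = next(it, None)
        plus = next(it, None)
        qual = next(it, None)
        if qual is None:
            return record, "truncated record (fewer than 4 lines)"
        if not header.startswith("@"):
            return record, "header does not start with '@'"
        if not plus.startswith("+"):
            return record, "third line does not start with '+'"
        if len(seq.rstrip("\r\n")) != len(qual.rstrip("\r\n")):
            return record, "sequence and quality lengths differ"


# Tiny in-memory demonstration with made-up reads.
good = ["@r1\n", "ACGT\n", "+\n", "IIII\n"]
cut = good + ["@r2\n", "ACGT\n", "+\n"]  # last record lost its quality line
print(first_bad_record(good))  # None
print(first_bad_record(cut))   # (2, 'truncated record (fewer than 4 lines)')
```

On a real file this could be called as `first_bad_record(open_fastq("sub_sample_1.fastq"))`, where the file name is a placeholder. Running it on both the downloaded file and the subset would show whether the damage was already present before the split.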

Dear Genomax,

I re-downloaded one sample, but I get the same issue again!

I downloaded the files via wget from ebi.ac.uk. Because of this issue, I am unable to work with 4 of the 12 samples: ERR204916, ERR188428, ERR188401 and ERR188257.

What do you suggest I do?

Thanks

"the way I down sample is first wc -l sample forward and wc -l sample reverse"

"then the number is divided by two for each. next, head -n <the result of dividing> full sample > sub_sample for each of the reverse and forward."

"I divided the whole sample by four. It gave 0.5. so I added an extra 0.5. the down-sample is now 14444000 volume."

This is hard to read and I don't understand it, but you probably created a corrupt fastq file. Remember that a fastq record has 4 lines. Why you chose this method of 'sampling' (it's not real sampling, you are just taking a subset of the reads) after the answer you got in "How to down-sample a full data" is beyond me.

Dear Wouter, I have already downloaded the full data set of 12 paired-end samples. I then divided each sample's length by 2, by 4 and by 8, so I now have three subsamples: one half, one fourth and one eighth of the full sample (for both the forward and the reverse files). I did this by first getting the line count with wc -l and then using the head command in Ubuntu. Then I mapped them.

What does grep -A 3 "ERR188257.7222000 HWI-962:71:D0PEYACXX:2:1205:3186:5646/1" on the file in question show before splitting it?
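
If grep is too slow or awkward on a compressed file, the same lookup can be sketched in Python. This is only an illustration under the assumption of strict 4-line records; `find_record` and `find_record_in_file` are hypothetical helper names, and unlike grep they only test the header line of each record.

```python
import gzip
from itertools import islice


def find_record(lines, read_id):
    """Return the lines of the first record whose header contains `read_id`.

    `lines` is any iterable of text lines; returns None if nothing matches.
    """
    it = iter(lines)
    while True:
        record = list(islice(it, 4))  # one FASTQ record is 4 lines
        if not record:
            return None
        if read_id in record[0]:
            return [line.rstrip("\n") for line in record]


def find_record_in_file(path, read_id):
    """Same lookup on a plain or gzip-compressed FASTQ file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return find_record(handle, read_id)


# Made-up reads; the second one has fewer quality values than bases.
demo = ["@r1 x/1\n", "ACGT\n", "+\n", "IIII\n",
        "@r2 y/1\n", "GGCC\n", "+\n", "II\n"]
print(find_record(demo, "r2 y/1"))  # ['@r2 y/1', 'GGCC', '+', 'II']
```

Comparing the length of the second and fourth returned lines shows directly whether the record already has more read characters than quality values in the downloaded file.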

My system has been busy for several minutes now. I will let you know as soon as it is finished.

I just stopped the process; it gave no result. Instead, I installed seqtk to downsample. This is the command I ran, and it produced output. I hope this is fine now and I can continue with mapping 0.1 of the sample to chromosome X only.

seqtk sample ERR188257_1.fastq.gz 0.1 > sub1.fastq
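
For paired-end data the forward and reverse files have to be sampled so that the same read pairs are kept in both. With seqtk this is done by passing the same seed (-s) to both runs. The idea can be sketched in Python as below; this is an illustration of random pair sampling, not a description of how seqtk works internally, and `records` and `sample_pairs` are hypothetical names.

```python
import random
from itertools import islice


def records(lines):
    """Yield FASTQ records as lists of 4 lines; fail on a truncated record."""
    it = iter(lines)
    while True:
        rec = list(islice(it, 4))
        if not rec:
            return
        if len(rec) < 4:
            raise ValueError("truncated FASTQ record")
        yield rec


def sample_pairs(forward, reverse, fraction, seed=11):
    """Yield (forward_record, reverse_record), keeping each pair with
    probability `fraction`. One random draw decides both mates, so the
    two outputs always stay in sync. Assumes both inputs list the same
    reads in the same order."""
    rng = random.Random(seed)
    for rec1, rec2 in zip(records(forward), records(reverse)):
        if rng.random() < fraction:
            yield rec1, rec2


# Made-up paired reads to show that mates stay together.
fwd = [line for i in range(20) for line in (f"@r{i}/1\n", "ACGT\n", "+\n", "IIII\n")]
rev = [line for i in range(20) for line in (f"@r{i}/2\n", "TTGA\n", "+\n", "IIII\n")]
for rec1, rec2 in sample_pairs(fwd, rev, 0.1):
    print(rec1[0].strip(), rec2[0].strip())
```

Because the selection is random rather than "the first N lines", the subset is drawn from the whole run instead of only from the reads at the start of the file.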
6.2 years ago
h.mon 35k

If you look at table 1 from the Human Genome Wikipedia entry, you can calculate the percentage of each feature (base pairs, protein coding genes, and so on) pertaining to the X chromosome. I will leave here just the base pairs:

Base pairs, chromosome X: 156040895
Base pairs, total genome: 3088286401

X %: 5.1%

It seems you are getting the correct amount of mapped reads.
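
The percentage above can be reproduced from the two base-pair counts. The snippet only repeats that arithmetic; genome share is a rough proxy, since in RNA-seq the fraction of reads from a chromosome also depends on how strongly its genes are expressed.

```python
chr_x_bp = 156_040_895     # chromosome X length quoted above
genome_bp = 3_088_286_401  # total genome length quoted above

x_fraction = chr_x_bp / genome_bp
print(f"chrX share of the genome: {x_fraction:.1%}")  # chrX share of the genome: 5.1%
```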

Dear H.mon,

I am done mapping my data (paired-end, 75 bp reads) with HISAT and STAR. I am curious why the uniquely mapped rate is around 4% with HISAT and about 12% with STAR. I am mapping to chromosome X only, so I think HISAT gives the correct value. Why is there such a large difference between the HISAT and STAR uniquely mapped rates, and how can I tell which one is more correct?

Thanks in advance
