Question

What does the result show in Hisat?

0

6.2 years ago

XBria ▴ 90

Hi,

I am aligning a full paired-end sample to chromosome X only (human). The results show that less than 5 percent of the reads are uniquely aligned and more than 95 percent are not aligned.

Does this seem correct, or am I downsampling my data the wrong way?

The way I downsample is to first run wc -l on the forward sample and wc -l on the reverse sample.

Then each number is divided by two. Next, I run head -n <the result of dividing> full sample > sub_sample for each of the reverse and forward files. Then I align them to chromosome X.

Your immediate reply is appreciated.

Thanks

RNA-Seq mapping • 1.6k views

ADD COMMENT • link updated 6.2 years ago by h.mon 35k • written 6.2 years ago by XBria ▴ 90

0

Haven't we discussed downsampling before in another thread? I think this may be a duplicate and could therefore be closed.

ADD REPLY • link 6.2 years ago by WouterDeCoster 47k

0

Dear Wouter, these are the results I got based on that thread. The question is about the results, which look weird.

ADD REPLY • link 6.2 years ago by XBria ▴ 90

1

You are aligning a full RNA-seq dataset to chrX only and are surprised that you get a low rate of alignment?

ADD REPLY • link 6.2 years ago by WouterDeCoster 47k

0

You are right. We had already discussed this in another thread: C: Sub_sampling Paired_end reads in Fastq.gz format.

ADD REPLY • link 6.2 years ago by GenoMax 141k

0

I also entirely miss the point of all this downsampling, but that's perhaps not so important.

ADD REPLY • link 6.2 years ago by WouterDeCoster 47k

0

Dear Wouter,

I am aligning samples one by one. I got an error saying read ERR188257.7222000 HWI-962:71:D0PEYACXX:2:1205:3186:5646/1 has more read characters than quality values.

I divided the whole sample by four. The result had a fractional part of 0.5, so I added an extra 0.5. The downsample is now 14444000 lines long. Is this causing the problem, or do I have to re-download the full sample?

Thanks

ADD REPLY • link 6.2 years ago by XBria ▴ 90

0

Downsampling done right should never break individual fastq records. You must have a corrupt fastq record/file.
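One way to look for a corrupt record is to walk the file four lines at a time and compare the sequence and quality lengths. The following is only a minimal sketch, assuming plain 4-line FASTQ records (no wrapped sequences); the helper names `open_fastq` and `find_bad_records` are made up for illustration and are not part of any tool mentioned in this thread.

```python
import gzip
from itertools import zip_longest


def open_fastq(path):
    """Open a plain or gzip-compressed FASTQ file in text mode (hypothetical helper)."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


def find_bad_records(lines):
    """Yield (record_number, header, problem) for each malformed 4-line record."""
    stripped = (line.rstrip("\n") for line in lines)
    # Group the stream into chunks of four lines; a short final chunk is padded with None.
    for number, record in enumerate(zip_longest(*[stripped] * 4), start=1):
        header, seq, plus, qual = record
        if None in record:
            yield number, header, "truncated record"
        elif not header.startswith("@") or not plus.startswith("+"):
            yield number, header, "bad header or separator line"
        elif len(seq) != len(qual):
            yield number, header, "sequence and quality lengths differ"
```

Something like `for problem in find_bad_records(open_fastq("ERR188257_1.fastq.gz")): print(problem)` would then list the offending records, if there are any. Running it on the full download and on the subset shows whether the damage was already present before subsetting.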

ADD REPLY • link 6.2 years ago by GenoMax 141k

0

Dear Genomax,

I re-downloaded a sample, but the same issue occurred again!

I downloaded it via wget from ebi.ac.uk. I am unable to work with 4 of the 12 samples because of this issue. The samples are: ERR204916, ERR188428, ERR188401, ERR188257.

What do you suggest I do?

Thanks

ADD REPLY • link 6.2 years ago by XBria ▴ 90

1

"the way I down sample is first wc -l sample forward and wc -l sample reverse

then the number is divided by two for each. next, head -n <the result of dividing> full sample > sub_sample for each of the reverse and forward."

"I divided the whole sample by four. It gave 0.5. so I added an extra 0.5. the down-sample is now 14444000 volume."

This is hard to read. I don't understand. But you probably created a corrupt fastq file. Remember that a fastq record has 4 lines. Why you selected this method of 'sampling' (it's not real sampling, you are just taking a subset of the reads) after the answer you got in "How to down-sample a full data" is beyond me.
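To make the 4-lines-per-record point concrete: if you insist on taking the first part of a file with head, the line count has to be a whole number of records, so round down in records rather than adding 0.5 to a line count. A rough sketch follows, assuming unwrapped 4-line FASTQ records; `lines_to_keep` is a hypothetical helper, and the line count in the usage note is an invented example, not the real size of any file here.

```python
def lines_to_keep(total_lines: int, fraction: float) -> int:
    """Return a line count for `head -n` that keeps only whole 4-line records."""
    if total_lines % 4 != 0:
        # A FASTQ file with unwrapped records always has a multiple of 4 lines.
        raise ValueError("line count is not a multiple of 4; the file may be truncated")
    total_records = total_lines // 4
    kept_records = int(total_records * fraction)  # round down to whole records
    return kept_records * 4
```

For an invented file of 57776000 lines, `lines_to_keep(57776000, 0.25)` gives 14444000. If dividing the output of wc -l by four leaves a fraction, the file itself is already incomplete, and no rounding will fix that.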

ADD REPLY • link 6.2 years ago by WouterDeCoster 47k

0

Dear Wouter, I have already downloaded the full data set, consisting of 12 paired-end samples. I then divided each sample's length by 2, by 4, and by 8, so I now have 3 sub-samples: one half, one fourth, and one eighth of the full sample (I did this for both forward and reverse). I did all this by first getting the length with wc -l and then using the head command in Ubuntu, and then I mapped them.

ADD REPLY • link 6.2 years ago by XBria ▴ 90

1

What does grep -A 3 "ERR188257.7222000 HWI-962:71:D0PEYACXX:2:1205:3186:5646/1" on the file in question show before splitting it?

ADD REPLY • link 6.2 years ago by GenoMax 141k

0

My system has been busy for minutes. I will let you know as soon as it has finished.

ADD REPLY • link 6.2 years ago by XBria ▴ 90

0

I just stopped the process; it gave no result. Instead, I installed seqtk to downsample. This is the command I ran, and it produced a result. I hope it is fine now and that I can continue with mapping 0.1 of the sample to chromosome X only.

seqtk sample ERR188257_1.fastq.gz 0.1 > sub1.fastq
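Since the data are paired-end, the forward and reverse files have to be sampled so that mates stay together; as far as I know, the seqtk README suggests using the same random seed (its -s option) for both files for that reason. The idea can be sketched in Python as below. This is only an illustration of keeping pairs in sync, not how seqtk itself works, and `read_records` and `sample_pairs` are hypothetical names. It assumes both files list the reads in the same order as unwrapped 4-line records.

```python
import random
from itertools import islice


def read_records(lines):
    """Yield FASTQ records as tuples of four lines (assumes unwrapped records)."""
    it = iter(lines)
    while True:
        record = tuple(islice(it, 4))
        if not record:
            return
        if len(record) < 4:
            raise ValueError("truncated FASTQ record")
        yield record


def sample_pairs(fwd_lines, rev_lines, fraction, seed=11):
    """Keep each read pair with probability `fraction`, deciding once per pair."""
    rng = random.Random(seed)
    # zip stops at the shorter input, so files of unequal length are not detected here.
    for fwd, rev in zip(read_records(fwd_lines), read_records(rev_lines)):
        if rng.random() < fraction:
            yield fwd, rev
```

Because a single random draw decides the fate of both mates, the two output files always contain the same reads in the same order, which is what the aligner expects for paired input.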

ADD REPLY • link 6.2 years ago by XBria ▴ 90

score 2 · Accepted Answer · 2018-01-23

2

6.2 years ago

h.mon 35k

If you look at table 1 from the Human Genome Wikipedia entry, you can calculate the percentage of each feature (base pairs, protein coding genes, and so on) pertaining to the X chromosome. I will leave here just the base pairs:

Base pairs X: 156040895

Total genome: 3088286401

X %: 5.1%

It seems you are getting the correct amount of mapped reads.
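The arithmetic behind that percentage, as a quick check using the two base-pair counts quoted above (the variable names are just for illustration):

```python
chr_x_bp = 156_040_895    # base pairs on chromosome X, as quoted above
genome_bp = 3_088_286_401  # total base pairs in the genome, as quoted above

x_percent = 100 * chr_x_bp / genome_bp
print(f"X %: {x_percent:.1f}%")  # prints "X %: 5.1%"
```

This is only a rough baseline: it is the share of sequence on X, and the share of RNA-seq reads coming from X can differ from it depending on which genes are expressed.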

ADD COMMENT • link 6.2 years ago by h.mon 35k

0

Dear H.mon,

I am done with mapping my data (paired-end, 75 bp reads) using Hisat and Star. I am curious why the uniquely mapped rate is around 4% with Hisat and about 12% with Star. I am only mapping to chromosome X, so I think Hisat gives the correct value. Why is there such a large difference between the uniquely mapped rates of Hisat and Star, and how can I tell which one is more correct?

Thanks in advance

ADD REPLY • link 6.2 years ago by XBria ▴ 90