Question: To fetch the reads which are being partially mapped to the genome
1
gravatar for swadha
4 months ago by
swadha20
University of California, United States
swadha20 wrote:

Hi,

I have been trying to fetch the reads which are being partially mapped to the genome. Like if the total length of a sequence is 126 bases and only 66 bases out of 126 are mapping to the genome. Is it possible to fetch those sequences whose half of the bases align to the genome? I created a fake fastq file having just 4 sequences to see how a mapper is aligning the reads. I have pasted those 4 sequences below. So far I have used Bowtie2, Hisat2 and STAR. None of them can give me hits if half of the sequence is mapping to the genome.

The fake fastq file I created has following sequences (below) and I am trying to map them to Giardia lamblia genome.

@seq-1_Whole_sequence_is_Copied_from_Genome
CTTGCGGCAAATTGTGAAACAGAAGGCAATAAAAGCAATTGTGACAGCGGTCGCTGTGAAATGGTTGGCATTGTTGAAGTCTGTACTCAGTGCAAGACAAATGGACACGTCCCTATTGACGGAGTA
+
HHHJJJIJIJHFFDCEDDBDBDDD9?BDDDDADCDDDDBCCCFFFFFHHHHHJJJIJIJHFFDCEDDBDBDDD9?BDDDDADCDDDDBCCCFFFFFHHHHHJDD9?BDDDDADCDDDDBCCCFFFF
@seq-2_Left_77_bases_are_copied_from_the_genome
TCTTGCGGCAAATTGTGAAACAGAAGGCAATAAAAGCAATTGTGACAGCGGTCGCTGTGAAATGGTTGGCATTGTTATGACGATGACGATGACGATGACGTAGACGATGACGTAGCAGTAGACGAT
+
CCCFFFFFHHHHHJJJIJIJHFFDCEDDBDBDDD9?BDDDDADCDDDDBCCCFFFFFHHHHHJJJIJIJHFFDCEDDBDBDDD9?BDDDDADCDDDDBCCCFFFFFHHHHHJJJIJIJHFFDCEDD
@seq-3_Right_77_bases_are_from_genome
GAGTGACGTAGCGATGCAGTGACGTAGCGGAGTACCAGATGATCACAAAGTCGCTGTGAAATGGTTGGCATTGTTGAAGTCTGTACTCAGTGCAAGACAAATGGACACGTCCCTATTGACGGAGTA
+
CCCFFFFFHHHHHJJJIJIJHFFDCEDDBDBDDD9?BDDDDADCDDDDBCCCFFFFFHHHHHJJJIJIJHFFDCEDDBDBDDD9?BDDDDADCDDDDBCCCFFFFFHHHHHJJJIJIJHFFDCEDD
@seq-4_Two_different_parts_of_genome_spliced_together
TCTTGCGGCAAATTGTGAAACAGAAGGCAATAAAAGCAATTGTGACAGCGGTCGCTGTGAAATGTCCGGGCCAGCTAGGCCCATCGACATACCTCGTTGACCGGGGAAACAGCATGCCAGGACAGG
+
CCCFFFFFHHHHHJJJIJIJHFFDCEDDBDBDDD9?BDDDDADCDDDDBCCCFFFFFHHHHHJJJIJIJHFFDCEDDBDBDDD9?BDDDDADCDDDDBCCCFFFFFHHHHHJJJIJIJHFFDCEDD

Do you know any parameter in Bowtie2, HSAT2, STAR etc which can give me hits for these type of sequences?

Thanks in Advance, Swadha

ADD COMMENTlink modified 4 months ago by genomax62k • written 4 months ago by swadha20

Can you post the command line options you used for those alignment programs?

I also suggest that you give bbmap.sh from BBMap suite a try. Lots of options and flexibility. Guide available here.

ADD REPLYlink modified 4 months ago • written 4 months ago by genomax62k

I used default parameters. I was not sure which parameters can give hits for such kind of sequences. Like for Hisat I used:

hisat2 -x hisat2_ref/ref --known-splicesite-infile giardia_cis_splicesites.txt -U test_1.fq --no-sq   >  mapped.bam

The output I got is:

seq-1_Whole_sequence_is_Copied_from_Genome  16  ctg02_1 7141    1   126M    *   0   0   TACTCCGTCAATAGGGACGTGTCCATTTGTCTTGCACTGAGTACAGACTTCAACAATGCCAACCATTTCACAGCGACCGCTGTCACAATTGCTTTTATTGCCTTCTGTTTCACAATTTGCCGCAAG  FFFFCCCBDDDDCDADDDDB?9DDJHHHHHFFFFFCCCBDDDDCDADDDDB?9DDDBDBDDECDFFHJIJIJJJHHHHHFFFFFCCCBDDDDCDADDDDB?9DDDBDBDDECDFFHJIJIJJJHHH  AS:i:0  ZS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:126    YT:Z:UU NH:i:2
seq-1_Whole_sequence_is_Copied_from_Genome  256 ctg02_1 722 1   126M    *   0   0   CTTGCGGCAAATTGTGAAACAGAAGGCAATAAAAGCAATTGTGACAGCGGTCGCTGTGAAATGGTTGGCATTGTTGAAGTCTGTACTCAGTGCAAGACAAATGGACACGTCCCTATTGACGGAGTA  HHHJJJIJIJHFFDCEDDBDBDDD9?BDDDDADCDDDDBCCCFFFFFHHHHHJJJIJIJHFFDCEDDBDBDDD9?BDDDDADCDDDDBCCCFFFFFHHHHHJDD9?BDDDDADCDDDDBCCCFFFF  AS:i:0  ZS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:126    YT:Z:UU NH:i:2
seq-2_Left_77_bases_are_copied_from_the_genome  4   *   0   0   *   *   0   0   TCTTGCGGCAAATTGTGAAACAGAAGGCAATAAAAGCAATTGTGACAGCGGTCGCTGTGAAATGGTTGGCATTGTTATGACGATGACGATGACGATGACGTAGACGATGACGTAGCAGTAGACGAT  CCCFFFFFHHHHHJJJIJIJHFFDCEDDBDBDDD9?BDDDDADCDDDDBCCCFFFFFHHHHHJJJIJIJHFFDCEDDBDBDDD9?BDDDDADCDDDDBCCCFFFFFHHHHHJJJIJIJHFFDCEDD  YT:Z:UU                                 
seq-3_Right_77_bases_are_from_genome    4   *   0   0   *   *   0   0   GAGTGACGTAGCGATGCAGTGACGTAGCGGAGTACCAGATGATCACAAAGTCGCTGTGAAATGGTTGGCATTGTTGAAGTCTGTACTCAGTGCAAGACAAATGGACACGTCCCTATTGACGGAGTA  CCCFFFFFHHHHHJJJIJIJHFFDCEDDBDBDDD9?BDDDDADCDDDDBCCCFFFFFHHHHHJJJIJIJHFFDCEDDBDBDDD9?BDDDDADCDDDDBCCCFFFFFHHHHHJJJIJIJHFFDCEDD  YT:Z:UU                                 
seq-4_Two_different_parts_of_genome_spliced_together    4   *   0   0   *   *   0   0   TCTTGCGGCAAATTGTGAAACAGAAGGCAATAAAAGCAATTGTGACAGCGGTCGCTGTGAAATGTCCGGGCCAGCTAGGCCCATCGACATACCTCGTTGACCGGGGAAACAGCATGCCAGGACAGG  CCCFFFFFHHHHHJJJIJIJHFFDCEDDBDBDDD9?BDDDDADCDDDDBCCCFFFFFHHHHHJJJIJIJHFFDCEDDBDBDDD9?BDDDDADCDDDDBCCCFFFFFHHHHHJJJIJIJHFFDCEDD  YT:Z:UU
ADD REPLYlink modified 4 months ago by genomax62k • written 4 months ago by swadha20

Which problem are you working on? What is the origin of these reads which align only half?

ADD REPLYlink written 4 months ago by WouterDeCoster36k

I am trying to find mRNA reads which are trans-spliced (this results when protein-coding regions from multiple RNAs transcribed from different loci can be joined). I have created a fake dataset so see if Bowtie2, HISAT2 or STAR can map these type of trans-spliced reads. I can move forward with the partial mapping as well.

ADD REPLYlink written 4 months ago by swadha20

Information like that is quite important, so please be as complete as possible when asking questions. So maybe something like https://genomebiology.biomedcentral.com/articles/10.1186/gb-2014-15-2-r34? Perhaps it is not necessary to reinvent the wheel.

ADD REPLYlink written 4 months ago by WouterDeCoster36k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 631 users visited in the last hour