Question: Is there a limit for SAMTOOLS tview option?
0
gravatar for bioplanet
2.1 years ago by
bioplanet60
bioplanet60 wrote:

Hi all,

I am new to the genomics field and I have recently started working with SAMTOOLS. My question is the following:

  1. I have my SAM file with 2,5 M aligned reads.
  2. I convert it to BAM file with SAMTOOLS
  3. I make the index of the BAM and the reference file

What I want to do is use the tview command to study a region of interest (say position 150-160). So I write this:

samtools tview -d T -p ref_cons:145 BAM_FILE --reference REFERENCE_FILE

It works fine, BUT the problem is that it returns only ~8,000 lines (reads). My question then is if there is some limit of the number of alignments that are outputted and, if so, how can I make it so that it prints all of them? I attach a screen-shot of the output I am getting.

enter image description here

Thank you!

alignment • 1.5k views
ADD COMMENTlink modified 2.1 years ago • written 2.1 years ago by bioplanet60
1

ok, let's see this in a different way: WHY would you need to visualize more than 8K rows ?

ADD REPLYlink written 2.1 years ago by Pierre Lindenbaum124k

Maybe it is because of hard-coded depth limit in samtools. Please see here: https://biostar.usegalaxy.org/p/6650/

ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by Sej Modha4.5k

The thing is that I want to grep all this "sub-regions" in all my aligned reads. These are barcodes from an amplicon-seq experiment that have been mapped to my reference construct (which I use as reference "genome" in this case), so I need to know all the different barcodes that were generated...

ADD REPLYlink written 2.1 years ago by bioplanet60

so what about reading this information from the bam directly. How do you know which section of the reads is assigned to a barcode ?

BTW: it's a typical xy problem http://xyproblem.info/ :-)

ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by Pierre Lindenbaum124k
1

Thank you for the kind help. Unfortunately I just changed fields from studying proteins to genomics and all these are new to me - sorry if my question was not very clear. I don't know how to do as you suggest from the BAM file actually. Regarding the second question, I actually know it because the reads were mapped to the NNNNNNN on my reference construct (I used the STAR aligner that allows for mapping to NNNNN). Therefore I am confident that if, in theory, I could print all the 2,5M reads, I would be able to grep the respective region (say pos 150-160 that I have my NNNNs) in each of them and using that to get all the different possible barcodes. Does that make more sense know (at least what I was planning to do)?

ADD REPLYlink written 2.1 years ago by bioplanet60

If there is an inline barcode in your reads you could separate them first before aligning them to anything.

ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by genomax74k

Hello bioplanet!

It appears that your post has been cross-posted to another site: http://seqanswers.com/forums/showthread.php?p=212153

This is typically not recommended as it runs the risk of annoying people in both communities.

ADD REPLYlink written 2.1 years ago by WouterDeCoster42k

Hi all and sorry for the cross-posting. The way I see it, this is not doable then. What genomax kindly suggested I am not sure I understand. The barcodes are random, i.e. I have to prior information about them, that is why I used STAR to align my sequencing reads against my reference construct, and then, based on the position of the NNNNNs in it, grep the respective regions of the aligned reads. The approach would work perfectly, if not for the (apparent) limitation of 8,000 reads to be outputted...

ADD REPLYlink written 2.1 years ago by bioplanet60

Hi Pierre!

Many thanks for your help! Yes, this is what I want to do..

I installed the bioalcidaejdk and I stored the code you wrote as script.js.

Then I try:

samtools view -h input.bam "reference:150-300" | java -jar dist/bioalcidaejdk.jar -f script.js -F bam

But I get lots of errors. Is this at least the correct way to do this or I misread your post?

ADD REPLYlink written 2.1 years ago by bioplanet60

But I get lots of errors

How can I know ?

ADD REPLYlink written 2.1 years ago by Pierre Lindenbaum124k

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized.

This response should have gone under @Pierre's answer.

ADD REPLYlink written 2.1 years ago by genomax74k

Hi Pierre and, again, many many thanks for the help!

Ok, here comes my script.js:

    final String chrom = "ref_cons";
final int start=145;
final int end=165;

stream().filter(R->!R.getReadUnmappedFlag()).
    filter(R->R.getContig().equals(chrom) && !(R.getEnd()<start || R.getStart()>end)).
    forEach(R->{

    for(int pos=start; pos< end ;++pos)
        {
        char base='.';
        int ref=R.getStart();
        int read=0;
        for(CigarElement ce:R.getCigar())
            {
            CigarOperator op=ce.getOperator();
            switch(op)
                {
                case P: break;
                case H: break;
                case S : read+=ce.getLength(); break;
                case D: case N: ref+=ce.getLength();break;
                case I: read+=ce.getLength();break;
                case EQ:case X: case M:
                    {
                    for(int i=0;i< ce.getLength() ;i++)
                        {
                        if(ref==pos)
                            {
                            base = R.getReadString().charAt(read);
                            break;
                            }
                        if(ref>pos) break;
                        read++;
                        ref++;
                        }
                    break;
                    }
                default:break;
                }
            if(base!='.')break;
            }
        print(base);
        }
    println();
});

ref_cons is the name of my construct (as I said, my "reference genome" is only 1 sequence), 145 start 165 end.

ADD REPLYlink written 2.1 years ago by bioplanet60

Errors:

         [DEBUG][BioAlcidaeJdk] Compiling :
         1  import java.util.*;
         2  import java.util.stream.*;
         3  import java.util.function.*;
         4  import htsjdk.samtools.*;
         5  import htsjdk.samtools.util.*;
         6  import htsjdk.variant.variantcontext.*;
         7  import htsjdk.variant.vcf.*;
         8  import com.github.lindenb.jvarkit.util.bio.fasta.FastaSequence;
         9  import javax.annotation.Generated;
        10  import htsjdk.variant.vcf.*;
        11  @Generated(value="BioAlcidaeJdk",date="2017-10-26T16:35:29+0200")
        12  public class BioAlcidaeJdkCustom1850364251 extends com.github.lindenb.jvarkit.tools.bioalcidae.BioAlcidaeJdk.SAMHandler {
        13    public BioAlcidaeJdkCustom1850364251() {
        14    }
        15    @Override
        16    public void execute() throws Exception {
        17     // user's code starts here 
        18     final String chrom = "ref_cons";
        19  final int start=145;
        20  final int end=165;
        21  
        22  stream().filter(R->!R.getReadUnmappedFlag()).
        23      filter(R->R.getContig().equals(chrom) && !(R.getEnd()<start || R.getStart()>end)).
        24      forEach(R->{
        25  
        26      for(int pos=start; pos< end ;++pos)
        27          {
        28          char base='.';
        29          int ref=R.getStart();
        30          int read=0;
        31          for(CigarElement ce:R.getCigar())
        32              {
        33              CigarOperator op=ce.getOperator();
        34              switch(op)
        35                  {
        36                  case P: break;
        37                  case H: break;
        38                  case S : read+=ce.getLength(); break;
        39                  case D: case N: ref+=ce.getLength();break;
        40                  case I: read+=ce.getLength();break;
        41                  case EQ:case X: case M:
        42                      {
        43                      for(int i=0;i< ce.getLength() ;i++)
        44                          {
        45                          if(ref==pos)
        46                              {
        47                              base = R.getReadString().charAt(read);
        48                              break;
        49                              }
        50                          if(ref>pos) break;
        51                          read++;
        52                          ref++;
        53                          }
        54                      break;
        55                      }
        56                  default:break;
        57                  }
        58              if(base!='.')break;
        59              }
        60          print(base);
        61          }
        62      println();
        63  });
        64  
        65      //user's code ends here 
        66     }
        67  }
[SEVERE][InMemoryCompiler]ToolProvider.getSystemJavaCompiler() failed.
java.lang.RuntimeException: ToolProvider.getSystemJavaCompiler() failed.
    at com.github.lindenb.jvarkit.lang.InMemoryCompiler.compileClass(InMemoryCompiler.java:165)
    at com.github.lindenb.jvarkit.tools.bioalcidae.BioAlcidaeJdk$AbstractHandlerFactory.getConstructor(BioAlcidaeJdk.java:473)
    at com.github.lindenb.jvarkit.tools.bioalcidae.BioAlcidaeJdk$SAMHandlerFactory.execute(BioAlcidaeJdk.java:564)
    at com.github.lindenb.jvarkit.tools.bioalcidae.BioAlcidaeJdk.doWork(BioAlcidaeJdk.java:796)
    at com.github.lindenb.jvarkit.util.jcommander.Launcher.instanceMain(Launcher.java:1162)
    at com.github.lindenb.jvarkit.util.jcommander.Launcher.instanceMainWithExit(Launcher.java:1320)
    at com.github.lindenb.jvarkit.tools.bioalcidae.BioAlcidaeJdk.main(BioAlcidaeJdk.java:821)
[SEVERE][BioAlcidaeJdk]java.lang.RuntimeException: Cannot compile custom class BioAlcidaeJdkCustom1850364251
ADD REPLYlink written 2.1 years ago by bioplanet60

what is your version of java/javac ?

java -version
ADD REPLYlink written 2.1 years ago by Pierre Lindenbaum124k

openjdk version "1.8.0_131" OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-2ubuntu1.16.04.3-b11) OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)

Is this a problem?

ADD REPLYlink written 2.1 years ago by bioplanet60

Hi Pierre and everyone!

I just wanted to let you know that it works! I tested it on another machine because I noticed you advise against using OpenJDK and it seems to be fetching the regions properly!

Many thanks for your help, time and code, much appreciated :)

ADD REPLYlink written 2.1 years ago by bioplanet60

cool. please mark my answer as "accepted" (green mark on the left)

ADD REPLYlink written 2.1 years ago by Pierre Lindenbaum124k
2
gravatar for Pierre Lindenbaum
2.1 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum124k wrote:

Ok, I'm not sure If I've understood but the followings script for bioalcidaejdk ( http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html ) should print the read sequences in a given region of the genome (indels are skipped!)

final String chrom = "rotavirus";
final int start=150;
final int end=200;

stream().filter(R->!R.getReadUnmappedFlag()).
    filter(R->R.getContig().equals(chrom) && !(R.getEnd()<start || R.getStart()>end)).
    forEach(R->{

    for(int pos=start; pos< end ;++pos)
        {
        char base='.';
        int ref=R.getStart();
        int read=0;
        for(CigarElement ce:R.getCigar())
            {
            CigarOperator op=ce.getOperator();
            switch(op)
                {
                case P: break;
                case H: break;
                case S : read+=ce.getLength(); break;
                case D: case N: ref+=ce.getLength();break;
                case I: read+=ce.getLength();break;
                case EQ:case X: case M:
                    {
                    for(int i=0;i< ce.getLength() ;i++)
                        {
                        if(ref==pos)
                            {
                            base = R.getReadString().charAt(read);
                            break;
                            }
                        if(ref>pos) break;
                        read++;
                        ref++;
                        }
                    break;
                    }
                default:break;
                }
            if(base!='.')break;
            }
        print(base);
        }
    println();
});

usage:

samtools view -h input.bam "rotavirus:150-300" | java -jar dist/bioalcidaejdk.jar -f script.js -F bam 

TAATTCCCAGAGTTCAAAGTAAAAATGATAATGTGATGGATGACTCTGTT
TATTTACCAGATTTAAAAGTAACTTTGATTATGTGATGGATGACTCTGGT
TATTAACCAGAGTTACAAGTAAATTTGATTATGTGCTGGATGACTCTGGT
TATTTCCCAGAGATAAACGTCAATTTGATTCTGTGATGGATGACTCTGGT
TATTTACCAGAGTTAAAAGTAAATTTGATTCTGTGATGGATTCCTCTGGT
TATTTACCAGAGTTAAAAGTAAATTTGATTCTGTGATGGATGACACTGGT
TATTTACCAGAGATAAAAGTAAATTTGATTATGTGATGGATGACTCTGGT
TATTTACCAGAGTTAAAAGTAAATTAGATTATTTGATGGATGACTCTGGT
TATTAACCAGAGTTAAAAGTCAATTTGATTCTGTGATGGATGACTCAGGT
TATTTACCAGAGTTAAAAGTAAATTTGATAATTTGATGGATGACTCAGGT
TATTTACCAGATTTAAAAGTAAATTTGAATATGTGAAGGCTGACTCTGGT
TATATACCAGAGATAAAAGTAAATTTGATAATGTGATGGATGACTCTTGT
TATTTAGCAGATTACAAAGTAAATTTGATTATGTTATGGAAGACTGTGTT
TATTTACCAGATTTAAAAGTAAATTTGAATATGAGATTGATGACTCTGGA
TATTTACCAGAGTTAAAAGTAAATTTGATTATGTTATGGATGACTCTGGT
TATTTACCATAGTTAAAAGTAAATTTGATTATGAGATGGATGACTCTGGT
TATTTACGCGAGTTAAAAGTAAATTTGATTATGAGATGGATGACTCTGGT
TATTTACCAGAGTTAAAAGTAAATTTGATAATGAGATGGATGACTCTTGT
TATTTACCATAGTAAAAAGTACATTTGATTATGTGATGGATGACTCTGGT
TATTAACCAGAGTTAAAAGTAAATTTGATTAAGAGATGGATGACTCTGGT
TATTTACCAGAGTTAAACTTAAATTTGATTATGTTATGGATGACTCTGGT
TATTTACCAGCGATACAAGTAACTTTTATTATGTGAAGGATGACTCTGGT
TATTAACCATAGTTAAAAGTCAATTTGATTATGTGAAGGCTGACTCTGGT
TATTTACCAGAGTTAAACGTAAATTTGATTATGTGATGGATGACTCTGGT
TATTTACCAGAGTTAAAAGTAAATTTGATTATGAGATGGATTACTCTGGT
TATTAACCAGAGTTAAAAGTAAATTTGATTCTGTGATGGATGACTGTGGT
TATTTACCAGAGTTAAAAGAAAATTAGATTATGTGCTGGCTGCCTCTGGT
TATTTACCAGAGTTAAAAGTAAATTTGATTATGTGATGGATGACTCAGGT
TATTTACCAGAGTTAAAAGTAAATTTGATTATGTGATGGCTGACTCTGGT
TATTTACCAGATTTACAAGTAAATTTGATAATGTGATGGATGACTCTGGA
TATTTCCCATAGTTAAAAGTAAATTTGATTAAGTGATGGCTGACTCTGGT
(...)
ADD COMMENTlink written 2.1 years ago by Pierre Lindenbaum124k

No I mean if I am following your steps correctly... Was I supposed to save the script as script.js and give it as argument?

My command is:

samtools view -h MY_BAM_FILE "ref_cons:145-160" | java -jar jvarkit/dist/bioalcidaejdk.jar -f script.js -F bam
ADD REPLYlink written 2.1 years ago by bioplanet60

yes, modify the chrom, start, end variables in script.js. Show me the errors.

ADD REPLYlink written 2.1 years ago by Pierre Lindenbaum124k

Hi again,

I was wondering, if with some modification I can print not only the region but also the whole read (e.g. with a tab separating them). Is it maybe somehow possible with a modification in the JAVA script? Sorry to ask, but I do not program in JAVA so I cannot understand the script well.

ADD REPLYlink written 2.1 years ago by bioplanet60
1

replace the last

println();

with

println("\t"+R.getReadString());
ADD REPLYlink written 2.1 years ago by Pierre Lindenbaum124k

Thank you Pierre, it works!

I think that I get only the sequence of the read though, is it perhaps possible to caption the whole read, like ID etc?

ADD REPLYlink written 2.1 years ago by bioplanet60
1

see https://samtools.github.io/htsjdk/javadoc/htsjdk/htsjdk/samtools/SAMRecord.html#getReadName--

ADD REPLYlink written 2.1 years ago by Pierre Lindenbaum124k

Thank you so much for the link and all the help!

ADD REPLYlink written 2.1 years ago by bioplanet60
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2299 users visited in the last hour