Bowtie2 problem when use aligned or unaligned output file for re-alignment
1
0
Entering edit mode
2.8 years ago
daewowo ▴ 80

Using bowtie2 I want to get only reads that align to animal1 genome and not animal 2. I alligned to animal1, with high sensitivity. Then I took the aligned .sam file and converted it to fastq using gatk SamToFastq (then would align to animal2 and take unaligned reads -> which would give me only animal1 reads).

Problem is that bowtie2 throws an error when try to align to animal2 when using this fastq file:

Error reading RefRecord offset from FILE

Looking at https://github.com/BenLangmead/bowtie2/blob/master/ref_read.h

RefRecord(FILE *in, bool swap) {
    assert(in != NULL);
    if(!fread(&off, OFF_SIZE, 1, in)) {
        cerr << "Error reading RefRecord offset from FILE" << endl;
        throw 1;
    }
    if(swap) off = endianSwapU(off);
    if(!fread(&len, OFF_SIZE, 1, in)) {
        cerr << "Error reading RefRecord offset from FILE" << endl;
        throw 1;
    }
    if(swap) len = endianSwapU(len);
    first = fgetc(in) ? true : false;
}

The issue appears because there is just an id in the output .sam (shown converted to fastq) below:

@SRR10111187.1.1
GTTTATTAGTACGTTGAGGTTGTGATCCGGAGTTTTCGGGGTATGGGCAACCTAGCTTGCTTAGCTGACCTTATTATAGATTGTGGTGTAAGTT

But bowtie want additional information, not just id.

Can I just use the original raw fastq file, get the matching read id, then copy the rest of the read header information and replace the read header in the fastq file output from bowtie2?

Eg

@SRR10111187.1.1 1 length=150 GTTTATTAGTACGTTGAGGTTGTGATCCGGAGTTTTCGGGGTATGGGCAACCTAGCTTGCTTAGCTGACCTTATTATAGATTGTGGTGTAAGTT

I presume this is valid as reads as far as I know are not altered by bowtie2?

bowtie2 • 2.3k views
ADD COMMENT
1
Entering edit mode

Use a proper tool like bbsplit.sh to align data to both genomes at the same time. See: BBSplit syntax for generating builds for the reference genome and how to call different builds.

ADD REPLY
0
Entering edit mode

Brilliant, thanks

ADD REPLY
0
Entering edit mode

I have run BBSplit using ambiguous2=split and ambiguous2=toss > but am getting really anomalous results wherby one animal genome is dominating, which contradicts Krona analysis and mitochondrial analysis. Will run alignment methid as recommended below to QC

ADD REPLY
0
Entering edit mode

With human and mouse genomes (which are very similar) this is not unexpected. I am not sure about mitochondrial analysis you are talking about but changing aligners/parameters is not going to change this result. What kind of an experiment is this? Are you working with xenografts or is this simple contamination.

ADD REPLY
0
Entering edit mode

The dataset is contaminated with human reads, I want to have high certainty in identifying only human reads- want to be certain on contamination quatity and typing. If there is any similarity to mouse I do not want to include the read. (Mitochondrial analyses as a separate anaysis where I took human and mouse mitochondrial sequence and aligned entire dataset to each seperately using bwa mem). Krona/STAT taxonomic anaysis I did on NCBI.

I think easiest way may be is to run bowtie2 2x once for mouse, once for human. Then use a shell script to find all non matching ids in the sam files and write to new file

ADD REPLY
0
Entering edit mode

If a read aligns perfectly to mouse, and imperfectly to human, you really want to throw it away? I think most people would say a read that aligns better to mouse should be classified as mouse.

ADD REPLY
0
Entering edit mode

I want to count human and mouse seperately. The easiest way for me to work this out is to align twice

ADD REPLY
1
Entering edit mode

Since human and mouse genomes have areas of homology if you align the reads separately you would be double counting some reads.

ADD REPLY
0
Entering edit mode

I actually found that using bowtie2 with very hgh sensitivity gave very few matches. I ended up using bbsplit for final analysis

ADD REPLY
0
Entering edit mode

It might be easiest. Sometimes, the easiest way is rather inaccurate.

ADD REPLY
2
Entering edit mode
2.8 years ago

Using bowtie2 I want to get only reads that align to animal1 genome and not animal 2. I alligned to animal1, with high sensitivity.

This is not ideal. Better would be to align with high sensitivity to both genomes at once, then remove any reads that align to both genomes.

ADD COMMENT
0
Entering edit mode

Just to confirm:

I concatened the 2 genomes, GRCm39 and GRCh38 to one reference file. Now I am running

bowtie2 -x ref_file -U fastq_1 fastq_2 -S out_file -a --local --score-min G,298,0

As I want to ensure the seach covers both human and mouse genomes and as they are both very large, need to keep the search unlimted by using the -a option. However I note that the bowtie2 manual mentions -a is suboptimal:

"Like -k but with no upper limit on number of alignments to search for. -a is mutually exclusive with -M and -k, and -M is the default.

Note: Bowtie 2 is not particularly designed with -a mode in mind, and when aligning reads to long, repetitive genomes this mode can be very, very slow."

As such I am wondering if my initial methed may be faster.

ADD REPLY
1
Entering edit mode

Are you sure you want the -a function? If a read aligns 20 times, do you want to know all the positions? Or are you going to omit it?

ADD REPLY

Login before adding your answer.

Traffic: 1603 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6