Question

BAM file doesn't show the altered sample identifier

5.5 years ago

Badh2 • 0

Hi all, I used the following Python script to add a sample identifier (S1R1, S1R2, S2R1, S2R2, etc.) to each read, based on the adaptor combination. It added the sample name as I needed, but when I used the output files in Bowtie to get BAM files, the sample identifier at the end of the read name doesn't show up. (I used Geneious and IGV to visualize the BAM files.) Can anybody tell me what has gone wrong? My intention is to easily identify which sample each read belongs to. Thanks!!

replace_ids.tsv file:
1:N:0:ATTACTCG+TATAGCCT S1R1
2:N:0:ATTACTCG+TATAGCCT S1R2
1:N:0:TCCGGAGA+TATAGCCT S2R1
2:N:0:TCCGGAGA+TATAGCCT S2R2
1:N:0:CGCTCATT+TATAGCCT S3R1
2:N:0:CGCTCATT+TATAGCCT S3R2
1:N:0:GAGATTCC+TATAGCCT S4R1
2:N:0:GAGATTCC+TATAGCCT S4R2
1:N:0:ATTACTCG+ATAGAGGC S5R1
2:N:0:ATTACTCG+ATAGAGGC S5R2
1:N:0:TCCGGAGA+ATAGAGGC S6R1
2:N:0:TCCGGAGA+ATAGAGGC S6R2
1:N:0:CGCTCATT+ATAGAGGC S7R1
2:N:0:CGCTCATT+ATAGAGGC S7R2
1:N:0:GAGATTCC+ATAGAGGC S8R1
2:N:0:GAGATTCC+ATAGAGGC S8R2


# Dictionary mapping each original ID to the ID that replaces it
replace_strings = {}
with open("replace_ids.tsv", "r") as id_file:
    # Read the file line by line
    for line in id_file:
        # Skip blank lines so they do not raise an IndexError below
        if not line.strip():
            continue
        # Split the line on TAB
        ids = line.strip().split("\t")
        # First entry is the original ID
        original_id = ids[0]
        # Second entry is your ID
        my_id = ids[1]
        # Add both to our dictionary of strings to replace
        replace_strings[original_id] = my_id

# Read the FASTQ file "S8_R2_p.fastq" and write the modified copy to "outfileS8R2.fastq"
with open("S8_R2_p.fastq", "r") as infile, open("outfileS8R2.fastq", "w") as outfile:
    # Loop over the lines in the input file
    for line in infile:
        new_line = line
        # Replace any original_id found in the line with your own ID
        for original_id, my_id in replace_strings.items():
            new_line = new_line.replace(original_id, my_id)
        outfile.write(new_line)

BAM fastq • 1.1k views

updated 5.5 years ago by finswimmer 16k • written 5.5 years ago by Badh2 • 0

finswimmer · Answer 1 · 2018-10-30

5.5 years ago

finswimmer 16k

Hello Badh2,

I'm not aware of a way to have different samples in one fastq file and still be able to distinguish these samples after alignment. The usual way is to keep the fastq files separate per sample and align them independently. That way you can set a sample name in the ReadGroup during alignment.
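As a rough illustration of the per-sample approach, here is a minimal Python sketch that builds one bowtie2 command per sample and sets a read group through bowtie2's `--rg-id` and `--rg` options. The index name `ref_index`, the sample names, and the R1 file names are assumptions made for the example; only `S8_R2_p.fastq` appears in the question.

```python
def bowtie2_command(sample, index, fastq_r1, fastq_r2):
    """Build a bowtie2 call that labels every read with a read group for `sample`."""
    return [
        "bowtie2",
        "-x", index,              # hypothetical index prefix
        "-1", fastq_r1,
        "-2", fastq_r2,
        "--rg-id", sample,        # read group ID written to the @RG header line
        "--rg", f"SM:{sample}",   # sample name stored in the read group
        "-S", f"{sample}.sam",
    ]

# Hypothetical sample-to-file mapping; adjust to the real file names.
samples = {
    "S1": ("S1_R1_p.fastq", "S1_R2_p.fastq"),
    "S8": ("S8_R1_p.fastq", "S8_R2_p.fastq"),
}

commands = [
    bowtie2_command(name, "ref_index", r1, r2)
    for name, (r1, r2) in samples.items()
]

for cmd in commands:
    print(" ".join(cmd))
    # To actually run the alignment (requires bowtie2 on the PATH):
    # import subprocess; subprocess.run(cmd, check=True)
```

With `--rg-id` set, bowtie2 attaches an `RG:Z:` field to every alignment record, so the sample stays attached to each read even if the per-sample files are merged later.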

A way to separate them is to use demuxbyname.sh from BBTools:

demuxbyname.sh in=<file> out=<outfile> delimiter=whitespace prefixmode=f
This will demultiplex by the substring after the last whitespace.

fin swimmer

5.5 years ago by finswimmer 16k

Hi finswimmer,

Thanks for trying to help. Maybe my question was not clear. I don't have different samples in one fastq file. I received demultiplexed fastq files for each sample, so I didn't have to demultiplex them myself. But when I assembled all the samples together using Bowtie2 and visualized them using either IGV or Geneious, I couldn't identify which sample each read belongs to. The read name has only X and Y coordinates, from which I can't directly identify the sample (@M04503:27:000000000-G2K2K:1:1101:14373:1561). Therefore, I thought I would add the sample number to the end of each read's identifier using the script above. It worked, and now the identifier looks like this, with S1R2 at the end (@M04503:27:000000000-G2K2K:1:1101:14373:1561 S1R2).

My question is: even though I did this, I still can't see the S1R2 part when I visualize the alignment/assembly.

Thanks!!

updated 5.5 years ago by finswimmer 16k • written 5.5 years ago by Badh2 • 0

Hello again,

The cleanest way is still to run the alignment separately for each sample and add a ReadGroup containing the sample name. That way the read group can be used by various tools in further analyses.

I see that there is also a --sam-no-qname-trunc option in bowtie2, which suppresses the standard behavior of truncating the read name at the first whitespace, at the expense of generating non-standard SAM. But again: I strongly recommend using ReadGroups to distinguish the samples.
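To make the truncation concrete, the following Python sketch mimics what happens to the read name by default. It also shows one possible workaround, which is not suggested in this thread: joining the sample ID to the read name with an underscore so that no whitespace separates them. The helper names `qname_after_truncation` and `tag_read_name` are made up for this illustration.

```python
def qname_after_truncation(header):
    """Mimic the default behaviour: keep only the text before the first whitespace."""
    return header.lstrip("@").split()[0]


def tag_read_name(header, sample_id):
    """Join sample_id to the read name with '_' so no whitespace separates them."""
    name = header.rstrip("\n").split()[0]
    return f"{name}_{sample_id}"


# Header as produced by the script in the question
original = "@M04503:27:000000000-G2K2K:1:1101:14373:1561 S1R2"
print(qname_after_truncation(original))   # the ' S1R2' suffix is lost

# Header with the sample ID attached by an underscore instead
tagged = tag_read_name(original, "S1R2")
print(tagged)
print(qname_after_truncation(tagged))     # the suffix survives truncation
```

For paired-end data, the two mates are expected to share one read name in SAM, so a per-sample tag (for example S1) may be a safer choice than a per-file tag such as S1R1 or S1R2. Read groups remain the more standard solution.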

fin swimmer

5.5 years ago by finswimmer 16k

Alrighty, I'll try that. Thanks again!!

5.5 years ago by Badh2 • 0