Question: How to calculate the fraction of bases that are different from the reference base using samtools mpileup ?
0
2.1 years ago by
M.O.L.S30
M.O.L.S30 wrote:

Questions:

• How are sites identified to find the fraction of bases that are different from the reference base at each position? It seems from the documentation that . and , is equal to a match on the reference base.

• How is the the fraction of sites with all bases equal to the reference genome calculated ?

• How is the fraction of positions where all the bases are different from the base in the reference calculated? It seems that a letter A,C,G,or T indicates a difference in the reference base.

My output from samtools mpileup is:

gi|110645304|ref|NC_002516|pseudocap|136    1   T   19  ^].^],^],^],^],^],^],^],^],^],^],^],^],^],^],^],^],^],^],   @AAAA>AAAAAAAAAAAA< ]]]]]]]]]]]]]]]]]]]
gi|110645304|ref|NC_002516|pseudocap|136    2   T   25  .......,,,,,,,,,,,,,,,,,,   /////@/BBBABBBBBBBBBBABBB   ]]]]]]]]]]]]]]]]]]]]]]]]]
gi|110645304|ref|NC_002516|pseudocap|136    3   T   26  .......,,,,,,,,,,,,,,,,,,^].    55555@5EEECAEEEEE@DCEEEEE@  ]]]]]]]]]]]]]]]]]]]]]]]]]]
gi|110645304|ref|NC_002516|pseudocap|136    4   A   26  .......,,,,,,,,,,,,,,,,,,.  FHFFAFAJGIJGIJIJJIHIIJIJI?  ]]]]]]]]]]]]]]]]]]]]]]]]]]
gi|110645304|ref|NC_002516|pseudocap|136    5   A   27  .......,,,,,,,,,,,,,,,,,,.^].   FGFFGDHJHIGHHJHJJHGGIJJIHB> ]]]]]]]]]]]]]]]]]]]]]]]]]]]
gi|110645304|ref|NC_002516|pseudocap|136    6   A   28  .......,,,,,,,,,,,,,,,,,,..^].  FHHFHEHJGIGFHIGJJGGCGJJIBB>@    ]]]]]]]]]]]]]]]]]]]]]]]]]]]]
gi|110645304|ref|NC_002516|pseudocap|136    7   G   28  .......,,,,,,,,,,,,,,,,,,...    FJHFHFGJIIGBFICJJGGHHJJJDF@@    ]]]]]]]]]]]]]]]]]]]]]]]]]]]]
gi|110645304|ref|NC_002516|pseudocap|136    8   A   29  .......,,,,,,,,,,,,,,,,,,...^]. FJHHIFJJGIH?FIFJIHEGGJIJ?DDCC   ]]]]]]]]]]]]]]]]]]]]]]]]]]]]]
gi|110645304|ref|NC_002516|pseudocap|136    9   G   31  .......,,,,,,,,,,,,,,,,,,....^].^]. HJGHIHJJDHFF@IGIIGFAHIHI:DDFCCB ]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]
gi|110645304|ref|NC_002516|pseudocap|136    10  A   31  .......,,,,,,,,,,,,,,,,,,...... HJFHIDJI0DC?@HCHH@??EIAH0B?FCC@ ]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]

In the pileup format (without -u or -g), each line represents a genomic position, consisting of chromosome name, 1-based coordinate, reference base, the number of reads covering the site, read bases, base qualities and alignment mapping qualities. Information on match, mismatch, indel, strand, mapping quality and start and end of a read are all encoded at the read base column. At this column, a dot stands for a match to the reference base on the forward strand, a comma for a match on the reverse strand, a '>' or '<' for a reference skip, ACGTN' for a mismatch on the forward strand andacgtn' for a mismatch on the reverse strand. A pattern `\+[0-9]+[ACGTNacgtn]+' indicates there is an insertion between this reference position and the next reference position.

modified 2.0 years ago • written 2.1 years ago by M.O.L.S30
1

Hello M.O.L.S ,

could you please reformulate your question a bit. I don't understand them completely.

But it seems, the way to your answer is to count the numbers of occurrence of . and , in each read column (column 5), as these indicates a match. The number of reads covering the position, is given in column 4. By subtracting the number of matches from this column, you get the number of reads that doesn't have a match. Now you can calculate any fraction you like.

fin swimmer

Hello M.O.L.S ,

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.

Thank you!

This is something I have been struggling with for a while by myself, and need some help with.

I have:

1. indexed the reference genome using the burrows wheeler aligner.
2. mapped the reads against the reference genome to output a SAM file.
3. converted the SAM file into a BAM file, then sorted the BAM file.
4. calculated the coverage using samtools depth and the sorted bam file.
5. generated statistics using both samtools flagstat and samtools stats and the sorted bam file.
6. Generated several different graphs using samtools plot-bamstat
7. produced the mpileup while which summarises read bases at each position in the chromosome.
8. created a bcf file from the mpileup file and had a look at it.
9. tried two different ways to read the mpileup file into Python.

with open("mpileup.txt", "r") as file_handle:
for line in file_handle:
lines = line.split()
chromosomes_name = lines[0]
genomic_coordinate_position_site = lines[1]
reference_base = lines[2]
base_quality = lines[5]
if lines[3] > "10":

filename =  (“mpileup.txt")

2 . mapped the reads against the reference genome to output a SAM file.
3 . converted the SAM file into a BAM file, then sorted the BAM file.

You can avoid these steps and intermediate files by piping the output of your aligner directly to samtools sort:

bwa mem genome.fa reads.fastq.gz | samtools sort -o alignment.Bam

Thanks, that command makes sense.

Hello M.O.L.S ,

is this an assignment? What have you done already to answer these questions? samtools mpileup is deprecated since v1.9 and should be replaced by bcftools mpileup. There the output format has changed.

fin swimmer

Thanks, but I need to still use samtools mpileup version 1.7

0
2.0 years ago by
M.O.L.S30
M.O.L.S30 wrote:

import pandas as pd

print("These are the headers for the data:")
print(TheTitle)

This program cannot read the entire mpileup output file from mpileup because it is too large.. There is not enough memory to read through the whole file, so a small section of it is taken here. How would I read the entire mpileup file from the online samtools server output?"

### How to import data from a file into a pandas dataframe?

allpiledup_initial = pd.read_csv("/Users/m.o.l.s/allpiledup", sep = "\t", names = TheTitle)

print(allpiledup_initial.iloc[:,[1,2,3]])
print("Only a few coloumns can be displayed.")`
print("The rows where the read count is 10 or less are filtered out.")

### Identify sites in the pileup file that have a sequencing depth of at least 10 (>= 10 or more than 11?)

print(allpiledup.iloc[:,[1,3]])

### LAMBDA FUNCTIONS: Writing functions on the fly by saying the names of the arguments after the keyword lambda

print("Functions were created in order to count the dots and commas which count as matches.")

### Function Creation

count_the_dots = lambda word : word.count(".")
print("There are ", count_the_dots(".../...'.'.;[][\'."), "dots")

count_the_commas = lambda word : word.count(",")
print("There are ", count_the_commas("..,,,./...'.'.;[][\'."), "commas")

print("New coloumns were added to the dataframe based on calculations from the old coloumns.")
print("The functions are applied to the dataframe coloumns in order to create new coloumns.","\n")

### dataframe["newcoloumnname'] = dataframe["oldcoloumnname'].apply(function)

print("The total dots in the read for each position are found.", "\n")
# How many dots are in each read result?
print(allpiledup.iloc[:,[1,7]])

print("The total commas in the read for each position are found.", "\n")

# How many commas are in each read result?

print(allpiledup.iloc[:,[1,8]])

print("The total dots and commas are added together to create a summary of the matches.","\n")

# How many dots and commas are in each read result?

allpiledup["Summation"] = allpiledup["Total_Dots"] + allpiledup["Total_Commas"]
print(allpiledup.iloc[:,[1,9]])
print(" How many read results are not dots or commas? This is the difference.", "\n")
print(allpiledup.iloc[:,[3,10]])

print("What is the fraction of Base matches?", "\n")
print(allpiledup.iloc[:,[1,3,11]])

print("What is the fraction of Base mismatches?","\n")
print(allpiledup.iloc[:,[1,3,12]])
print("There are rows that have a base fraction equal to one")
onPiledUp = allpiledup[allpiledup["MatchFraction"]== 1.000000 ]
#print(onPiledUp.iloc[:,[1,3,11]])

print("The number that have a base fraction equal to one is the same as the number of rows:")
print(" There are this many rows where the base fraction is 1:", onPiledUp.shape)

### Make a histogram of the Base Mismatch Fraction.

import matplotlib.pyplot as plt
#The histogram is a type of visualisation that allows you to explore the distribution of your variables.
plt.hist(allpiledup["BaseMisMatchFraction"], bins = 100, color = 'blue' )
plt.title("The frequency of the Base Mismatch Fractions")
plt.xlabel("Base Mismatch Fraction")
# The x axis has been limited in order not to include the values at 0.00
plt.xlim([0.02,0.12])
plt.ylabel("Frequency")
plt.show()
plt.clf

plt.hist(allpiledup["MatchFraction"], bins = 100, color = 'orange')
plt.title("The frequency of the Base Match Fractions")
plt.xlabel("Base Match Fraction")
# The x axis has been limited in order not to include the values at 1.00
plt.xlim([0.88,0.98])
plt.ylabel("Frequency")
plt.show()
plt.clf()