Counting data with ht seq
1
0
Entering edit mode
11 months ago
aranyak111 • 0

I am trying to use a pipeline for RNA seq which has been used by one of my colleagues. I have got in total of 8 files. These files are 8 count files of RNA seq data. I am using less command

less wth1count.txt
less wth2count.txt
less wth3count.txt
less wth4count.txt
less 99h1count.txt
less 99h2count.txt
less 99h3count.txt
less 99h4count.txt


and to count the number of lines for each of the 8 files the command that I use is

wc -l wth1count.txt
wc -l wth2count.txt
wc -l wth3count.txt
wc -l wth4count.txt
wc -l 99h1count.txt
wc -l 99h2count.txt
wc -l 99h3count.txt
wc -l 99h4count.txt


My subsequent goal is to filter out reads that are lower than 5 and for that I am using python scripts

cat filtercounts.py

#!/usr/bin/python

import sys

#Read in the HTseq gene count output files for each of the 8 samples

inputlist = [sys.argv[1], sys.argv[2], sys.argv[3], sys.argv[4], sys.argv[5], sys.argv[6], sys.argv[7], sys.argv[8]]

##Determine the genes with sufficient coverage
#Set up a variable to hold genes with sufficient coverage in any of the 8 samples

genelist = []

#For each HTseq file

for i in range(len(inputlist)):
#Open it
inputfile = open(inputlist[i], 'r')

#For each line
for line in inputfile:
#Determine the number of reads/transcripts for this gene
splitline = line.strip().split('\t')
#Does it pass the threshold?
if int(splitline[1]) >= 5:
#If so, add it to list
if not splitline[0] in genelist:
genelist.append(splitline[0])
#Close the file
inputfile.close()

#Write out the list of sufficiently expressed genes
outputlist = open(sys.argv[9], 'w')

for gene in genelist:
outputlist.write(gene + '\n')

outputlist.close()

##Filter each of the HTseq files for only sufficiently expressed genes across our samples
#For each HTseq file
for i in range(len(inputlist)):
#Open it
inputfile = open(inputlist[i], 'r')
#Open an output 'filtered' file
outputfile = open(sys.argv[i+10], 'w')
#For each line
for line in inputfile:
#Determine the gene
splitline = line.strip().split('\t')
#Is it in our list?
if splitline[0] in genelist:
#If so, write out the original line
outputfile.write(line)
#Close the files
inputfile.close()
outputfile.close()


Problem - When I am trying to execute the script I am getting the error "index list out of range"

I am also not sure where in the script I can feed the string of files as input?

Any help will be useful.

alignment rna seq • 414 views
0
Entering edit mode

Please invest some time into formatting your post properly. Please use the formatting bar (especially the code option) to present your post better. You can use backticks for inline code (text becomes text), or select a chunk of text and use the highlighted button to format it as a code block. I've done it for you this time.

Also, avoid blank lines between lines of code - that just makes code harder to read and adds to the character count unnecessaily. See:

for i in range(len(inputlist)):
#Open it
inputfile = open(inputlist[i], 'r')
#Open an output 'filtered' file
outputfile = open(sys.argv[i+10], 'w')


for i in range(len(inputlist)):

#Open it

inputfile = open(inputlist[i], 'r')

#Open an output 'filtered' file

outputfile = open(sys.argv[i+10], 'w')

0
Entering edit mode

I am getting the error "index list out of range"

0
Entering edit mode
#!/usr/bin/python

import sys

#Read in the HTseq gene count output files for each of the 8 samples
inputlist = [sys.argv[1], sys.argv[2], sys.argv[3], sys.argv[4], sys.argv[5], sys.argv[6], sys.argv[7], sys.argv[8]]

#Determine the genes with sufficient coverage
#Set up a variable to hold genes with sufficient coverage in any of the 7 samples
genelist = []

#For each HTseq file
for i in range(len(inputlist)):
#Open it
inputfile = open(inputlist[i], 'r')
#For each line
for line in inputfile:
#Determine the number of reads/transcripts for this gene
splitline = line.strip().split('\t')
#Does it pass the threshold?
if int(splitline[1]) >= 5:
#If so, add it to list
if not splitline[0] in genelist:
genelist.append(splitline[0])
#Close the file
inputfile.close()

#Write out the list of sufficiently expressed genes
outputlist = open(sys.argv[8], 'w')
for gene in genelist:
outputlist.write(gene + '\n')
outputlist.close()

##Filter each of the HTseq files for only sufficiently expressed genes across our samples
#For each HTseq file
for i in range(len(inputlist)):
#Open it
inputfile = open(inputlist[i], 'r')
#Open an output 'filtered' file
outputfile = open(sys.argv[i+9], 'w')
#For each line
for line in inputfile:
#Determine the gene
splitline = line.strip().split('\t')
#Is it in our list?
if splitline[0] in genelist:
#If so, write out the original line
outputfile.write(line)
#Close the files
inputfile.close()
outputfile.close()


The full traceback is

Traceback (most recent call last):
File "/gpfs/ycga/project/nicoli/ag2646/99H50/filtercounts.py", line 9, in <module>
inputlist = [sys.argv[1], sys.argv[2], sys.argv[3], sys.argv[4], sys.argv[5], sys.argv[6], sys.argv[7], sys.argv[8]]
IndexError: list index out of range

0
Entering edit mode

I would highly recommend you learn to separate your code into functions. For example:

def htseq_remove_low_genes( file_in, file_out, threshold ):
# output a new file with low genes filtered out
with open(file_in,'r') as f, open(file_out,'w') as fout:
for r in f:
if int(r.strip.split('\t')[1]) > threshold):
fout.write(r)
return file_out


You can then just call this function whenever you need to.

0
Entering edit mode

Does this answer OP's question? If not, it should be a comment, not an answer. Please read the posts under the how-to tag to contribute to and use the forum better.

0
Entering edit mode

Oh I see, understood - will use appropriately next time. Thanks for helping to moderate this forum!

0
Entering edit mode

No problem, I've moved you answer to a comment this time.

1
Entering edit mode
11 months ago
1. If you want to use python, your life will be much easier if you use pandas to make data frames out of this.
2. You're going to use R in the next step anyway, so why not read it in (using DESeq2) with DESeqDatasetFromHTSeqCount and then:

.

KEEP = which(apply(counts(dds), 1, min) >= 5)
dds = dds[KEEP]


That is vastly simpler than what you're doing.