Question

Closed:InsideDNA: 5 Unix commands every bioinformatician should know

3

Entering edit mode

7.8 years ago

shilparaopradeep ▴ 190

Bioinformaticians often have to manage large text files containing reads, sequences of genomes, alignments, genetic variants, and so on. Sometimes the files you receive do not have quite (or indeed at all) the same format required by tools used in the downstream analysis. In such cases you have to rearrange the data in your files – change the delimiters, order of columns or, quite likely, discard some unnecessary values. Manual processing of large files will take plenty of time, so bioinformaticians need some programs for formatting the files. Some of the most common formatting tasks (for example, editing the header of a sam file) can be solved easily by using special tools. However, formatting tasks are so diverse, that the entire spectrum of such tasks will hardly ever be addressed by ready-to-use tools. So, bioinformaticians often have to write small programs in order to rearrange data in a specific file in a certain way. Luckily, Terminal (or Console) provides opportunities to perform rather complex manipulations of text files through the use of very simple commands. In this tutorial we will discuss some useful recipes for bioinformatics file processing.

1. Merging multiple files into one

To begin with, log in to the InsideDNA, navigate to the Files tab, and create a folder called Formatting. Upload the example files that we will use in this tutorial (click on the button Download data at the top of the page), or use your own files.

We will work with the virtual Terminal of the InsideDNA platform. To activate it, click on the Terminal button on the left.

To start work, navigate to the Formatting folder by typing the command:

cd Formatting

To see the folder content, use the ‘ls’ command:

ls

We need to unzip the file with raw data. Use unzip command for this purpose:

unzip source_formatting.zip

Run ls again. Now you can see the list of files that you uploaded to the Formatting folder.

Firstly, we will study a very simple but common task – merging multiple files into a single one. Genomes are often stored in the form of separate contigs or chromosomes, but sometimes it is necessary to have them in the form of a single file (for example, to use a genome as a reference). This can be done using a single command:

cat *.fa > genome.fa

This particular variant of command will merge all files in the current folder with the extension .fa in the into a single file named genome.fa (asterisk is equivalent to “anything”). You find the new file if you type the command ‘ls’ once again. Note that all the original .fa files are still in place.

2. Extracting a subset of data into a new file

Sometimes files contains information that is unnecessary for the next stage in processing. You may want to discard the unnecessary information in order to make files smaller (which could be essential when working with large amounts of data) or to prevent misinterpretation of extra information by downstream software. For such types of processing it is useful to study a language called “awk”.

For example, to select certain columns from an input file and write them into new one, you can use a command such as this one:

awk '{ print $1,$2}' /data/userXXX/Formatting/filtered_snps.vcf > /data/userXXX/Formatting/snp_positions.txt

In this case awk will take the first two columns of the input file filtered_snps.vcf (don’t forget to replace XXX with your own userID, which can be found at the top of the Terminal folder) and save them into the new file snp_positions.txt. Note that you can change the order in which columns will be written into the output file: for example, if you modify from '{ print $1,$2}' to '{ print $2,$1}', the order of columns in the output file will be reversed.

You can check that the new file now appears in the Formatting folder, by using the ls command:

Awk takes a small program, marked by {}, and runs it in all lines of an input file. By default, it takes an arbitrary number of spaces or tabs as a delimiter. To show awk that your input file has another type of delimiter, for example, a comma, you can use the option -F:

awk -F, '{ print $3 }' /data/userXXX/Formatting/filename.csv > /data/userXXX/Formatting/new_filename.csv

What if you need to extract not columns, but rows from the file? For example, a very common task is to select variants with a special filter value from a VCF file. In this case you need to take all the lines where the required value occurs in a specified field:

awk '$7 == "PASS" { print }' /data/userXXX/Formatting/filtered_snps.vcf > /data/userXXX/Formatting/pass_snps.vcf

This command will take lines that have the word PASS in the 7th column (which is the one with filter flags) and write them to the file pass_snps.vcf.

With awk, you can combine requirements into more complex commands using the && (double ampersand) sign and other logical signs. For instance, this command:

awk '$1 == chr3 && $7 == "PASS" { print }' /data/userXXX/Formatting/filtered_snps.vcf > /data/userXXX/Formatting/pass_snps_chr3.vcf

will take just the lines featuring SNPs on the 3rd chromosome, which have also passed all the filters (so they have corresponding values in the 1st and 7th columns of the VCF file), and write these to the file named pass_snps_chr3.

3. Changing the specified patterns

Sometimes you need to replace a specific word or value in an input file with another one. For example, a very common requirement of tools, when working with alignments, is to have all the chromosome names in the format chr1, chr2, and so on. For such purposes, the “sed” language is very useful. For instance, this short command:

sed 's/chromosome/chr/g' < input.txt > output.txt

will replace each instance of the word “chromosome” in the input file with the letters “chr” and will write the result in the file output.txt (option g when using the sed command specifies that each instance of “chromosome” in the file should be replaced, rather than just the first occurrence in each line, which is the default for sed).

Sed is commonly used to fix the headers of sam and bam files. There is a special tool called samtools reheader for this purpose, which takes bam files and a sam file with a correct header as two inputs and amend the header in a bam file. The reason for this is that bam files are human unreadable and therefore you cannot manually modify the header in a bam file.

Below we are listing a sequence of commands which allows you to modify bam header.

The first command we are going to use is samtools view -H. It will create a file called header.sam using the header of the input candida.bam file:

idna_submit.py -t reheader -c 1 -r 0.6 -e "idna_samtools_view -H /data/userXXX/Formatting/candida.bam > /data/userXXX/Formatting/header.sam"

Note that in this case we used container idna_submit.py to send our task to the virtual machine. We gave this machine one core and 0.6 Gb of RAM, which are very small amounts of resources. Other tasks may require many more cores and much more RAM to be completed in reasonable time.

You can monitor the progress of tasks and send them to virtual machines, using the Tasks tab:

When the task is finished, you will find the new file header.sam in the Formatting folder. To see what was written in this file, use the less command:

cd Formatting
less header.sam

As you can see, the names of chromosomes in this header have a very complex structure, for example Ca21chr1_C_albicans_SC5314. To exit less mode, press the Q key. To change chromosome names to the standard chr1 format, we will use a sed command like the following:

sed -e 's/SN:Ca21chr\([0-9R]\)\_C\_albicans\_SC5314/SN:chr\1/' -e 's/SN:Ca19-mtDNA/SN:chrM/' < /data/userXXX/Formatting/header.sam > /data/userXXX/Formatting/header2.sam

This command replaces all the complex chromosome names with the letters chr, which are followed with the number of chromosomes, or the letter R. Sed remembers this number or letter because of the specific pattern $[0-9R]$, we used in the command. The program will search for this character after the sequence SN:Ca21chr and before the sequence _C_albicans_SC5314, as we specified in the first part of command; it will then replace all instances of this pattern with the symbols SN:chr and the “remembered” character. Notice that we need to put \ before each _ sign in the described pattern, in order to prevent sed from interpreting them as delimiters.

We need to tackle the case of mitochondrial chromosomes separately, since these have a different structure in our source header file. The second part of the command that comes after the –e option is related to this task. After the sed run, you can use the “less” command to check that new file ‘header2.sam’ has the desired structure for chromosomes names:

less header2.sam

Finally, you need to replace the header of the original bam file with the newly generated correct header. We will use samtools reheader command for this purpose:

idna_submit.py -t rehead2 -c 1 -r 0.6 -e "samtools_reheader /data/userXXX/Formatting/header2.sam /data/userXXX/Formatting/candida.bam"

Note that you don’t need to apply the steps that use samtools view and samtools reheader if you need to fix a file in sam or in other human-readable formats – you can edit them directly by applying the sed command.

Now you have mastered the basic techniques of formatting that are essential for processing large files containing bioinformatics data. Since fixing the format of files comprises a substantial part of a bioinformatician’s work, hopefully such techniques will help you to save plenty of time. Stay tuned!

insideDNA genomics • 2.3k views

ADD COMMENT • link updated 12 months ago by Ram 43k • written 7.8 years ago by shilparaopradeep ▴ 190