Question: How to extract specific columns based on a set of different header patterns in a big tab-delimited file (PERL)
gravatar for cmcouto.silva
3.6 years ago by
São Paulo, Brazil.
cmcouto.silva30 wrote:

Hello folk, I'm beginner in perl and I'm trying to extract the whole column with a set of specific patterns in tab-delimited file (very large). I saw some scripts but only when you know which column are you interested in (small file).

My file structure looks like this:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  ID01    ID02    ID03    etc...
3   66894   rs9681213   0   1   .   PASS    .   GT  0|1 0|1 0|1 etc...
3   95973   rs1400176   0   1   .   PASS    .   GT  1|1 1|1 1|1 etc...
3   104972  rs990284    0   1   .   PASS    .   GT  0|1 0|1 0|0 etc...
3   114133  rs954824    0   1   .   PASS    .   GT  1|1 1|1 1|1 etc...
and so on...

Let's suppose that I want to extract four columns (ID01, ID03, ID45, ID80). These IDs could be (optionally) in another text file for parsing. I suppose I have to storage these patterns in an array, right!?

I tried to split the @_ data (for getting the lines) but I couldn't get the correct column $_[?] to print off. The file is too big and I can't see in which column number they are. I really get stuck...

That's the code I've been working on:

#! usr/bin/perl
use strict;
use warnings;

my $file = $ARGV[0] || die;
my @fname = split(/_/, $file);      #my file name is chr3_etc.vcf

open(FILE, '<', "$file") || die;
open(OUT,'>', "$fname[0]_NAD.vcf");

my @NAD_ids = qw(NAD15 NAD54 NAD55 NAD56 NAD57 NAD58 NAD59 NAD64 NAD93 NAD93 NAD98);

while (<FILE>) {
    next if m/##/;      #exclude the additional information above the table
    chomp; @_=split;    #getting the columns
        if (@NAD_ids) {
            print OUT "$_[0]\t$_[1]\t$_[2]\t$_[3]\t$_[4]\t$_[5]\t$_[6]\t$_[7]\t$_[8]\t$_[($NAD_ids[0])]" . "\n";    # see the last one (problem);


close FILE; close OUT;

Please consider the "NADs" like the IDs.

snp perl • 2.8k views
ADD COMMENTlink modified 3.6 years ago by SES8.4k • written 3.6 years ago by cmcouto.silva30

So, what error messages are you getting when you run your code, why is it not working the way you want it to?

I don't understand what

       if (@NAD_ids)

Is supposed to do, are you trying to compare each line of the file to the list of IDs, are you trying to print entire lines where the IDs in certain columns match your list?

ADD REPLYlink written 3.6 years ago by mastal5112.0k

I think OP is trying to ensure they actually do need to pick a column.

ADD REPLYlink written 3.6 years ago by RamRS28k

No, I don't! That's exactly what Ram said as a reply here. Just to make sure the IDs exist. Also, I'm trying to do what he said in his comment. That's actually the way I'm thinking to solve this, but if anyone knows another way, I entirely open to see how you 'd do that!

ADD REPLYlink written 3.6 years ago by cmcouto.silva30

Ah, OK. This looks like genotype data, so first you want to check that the file contains the IDs/samples you want. But then, for the computer to print out the correct columns, you will have to tell it which columns to print (give it the array index of the matching columns), which is essentially what Ram said.

ADD REPLYlink modified 3.6 years ago • written 3.6 years ago by mastal5112.0k
gravatar for SES
3.6 years ago by
Vancouver, BC
SES8.4k wrote:

You are very close and I appreciate the effort. I'll share the thing you are missing and some other helpful tips because you basically have all the logic right. (I personally think it would be better to use vcf-tools or one of the associated VCF modules to work on the indexed files more efficiently. That being said, let's work with what you posted.) The most important thing to remember is to use lexical variables to limit the scope of your actions. When you are using loops and modifying the topic $_ or trying to directly modify things like @_ (this special notation only applies to subroutine arguments, so it was misplaced) it can get confusing. That is one reason it's best to work with lexical variables and avoid the magic (also, readability). Exceptions are in map/grep/sort blocks. Also important is that bare filehandles have global scope, so they are to be avoided in all cases. These can cause problems from a large distance in the code.

For the question, it sounds like you want to pick genotypes from one file and select those from a VCF file, right? It's not completely obvious at first because your examples don't match up. Also, there were duplicate IDs in your example array, which may or may not matter but it's important to check your inputs (you can solve this problem using the same module I used below using the uniq method). I modified the files to demonstrate, otherwise there was nothing to be done. The ID file:

And the VCF:

Here is some code:

The first thing to do is read in the IDs, which is what the map block is doing. What that is doing specifically is removing newlines and spaces so the IDs will match those in the file, and then assigning the keys a value of 1 (to say it's 'defined' in Perl-speak). The space/newline issue is important in any programming language. The rest is basically what you had with some minor changes. To pick the correct genotypes you find the column (array index), then test if any matches were found with if (@indexes) {, and if not you stop reading to be efficient (literally, you say that's the last line we need). The data is printed with an "array slice" and join, and these two idioms will save you a lot of typing. If you are unsure about what is happening here, use Data::Printer (the usage of which will be very familiar if you know Ruby) or Data::Dump::Color (personal preference) to peek and the data structures in the script.

Running this prints:

3   66894   rs9681213   0   1   .   PASS    .   GT  0|1 0|1 0|1
3   95973   rs1400176   0   1   .   PASS    .   GT  1|1 1|1 1|1
3   104972  rs990284    0   1   .   PASS    .   GT  0|1 0|0 0|1
3   114133  rs954824    0   1   .   PASS    .   GT  1|1 1|1 1|1

That's what I think you wanted. If that's not exactly what you were asking then I'm confident you can modify it to your needs. That will likely be good enough, but as I mentioned above, I'd personally use a toolkit or some kind of VCF library for this since it has to be a solved problem (I recall doing this recently and I’ll update if I remember the command).

ADD COMMENTlink modified 3.6 years ago • written 3.6 years ago by SES8.4k

It worked! Thank you very very much, SES! It's the first time I post here I am very surprise with the answers. It's really good to see a community helping each other. I hope to be able to help as well, soon.

Indeed I used vcf-tools for extracting individuals from a vcf file, which was basically what I've intended to do here. I have big vcf files with info about several populations (SNP array) and I want to analyse them individually, compare the genetic diversity between them (for it I'm using seslcan program). I have 22 chromosomes and the test was properly performed in the first two of them, but from the 3rd chromosome vcf file along It didn't work after I filter individuals for another vcf file (but it worked with the original ones). So I wanted to extract few individuals manually ( in Perl) for a new vcf file and see what it's happening. As I said, I'm new on bioinformatics (Perl is the first language I've practicing), once I used to work with cell culture. Your tips were very well received! Thank u all!

ADD REPLYlink written 3.6 years ago by cmcouto.silva30
gravatar for RamRS
3.6 years ago by
Houston, TX
RamRS28k wrote:


As far as I can see, you're trying to index an array with the contents of the line ($_) using the column name ($NAD_ids[0]) as the subscript. You know it's a column name, but the program doesn't.

You may want to store the header as an array, find the index of the target columns in that header array and then use those indices to index each line.

ADD COMMENTlink modified 3.6 years ago • written 3.6 years ago by RamRS28k

Yeah, that's it! Makes sense...

I store the header as an array so that I find the index (e.g. using grep or modules like List::MoreUtils), right!? I didn't try it yet, but I think I can do it. But I'll have to find the targets one by one? Anyway, thank you for your feedback!

ADD REPLYlink written 3.6 years ago by cmcouto.silva30

I've given you the lead to follow - the rest is on you :)

It's a fairly trivial solution, I'm sure you'll figure it out.

ADD REPLYlink written 3.6 years ago by RamRS28k

Also, you realize awk might be super useful here, right? grep non info (^[^#][^#]) lines to pipe to awk, at NR==1, pick the header indices you'll use, for rest of the NR, print those indices.

ADD REPLYlink modified 3.6 years ago • written 3.6 years ago by RamRS28k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 769 users visited in the last hour