A crash course in Perl and DNA.

In my Senior year of undergrad I took a class called “DNA.”  It was a cross-listed course; Computer Science and Biology.  I thought it might be interesting, and it involved coding up some stuff in Perl.  The course was basically a crash-course in both Perl and DNA. As such, some of the starting code (such as reading from a file) was given to us. Lucky!

For our final project we had to come up with an application summing up the knowledge we learned in class.  I decided to do the following:

Summary

In DNA there are four possible nucleotides that can be used: T, C, A, and G.  Codons are formed by joining three of these nucleotides together.  This means that there are sixty-four possible codons (4x4x4).  The purpose of these codons is to provide a mechanism for the organism to produce amino acids.  There are 20 different amino acids.  This means that there will be some codons that, while having different nucleotide structures, allow for the production of the same amino acids.

Different organisms may show different preference, or a bias, towards different codons to encode the same amino acid.  The reasons for this happening are still vague, but much research is being done to get a clue as to why certain codons are preferred over others.

In bacteria, for example, different variables have been found to contribute to certain codon usage.  Some of these include GC-composition, the likelihood of a specific strand to mutate, and RNA stability. (Curr Issues Mol Biol., Oct 3 2001)  It has also been found that codon bias may be used to improve translation efficiency. (Yizhar Lavner, Daniel Kotlar, 2004)

Proposed Solution

The software being proposed will be able to compare different biases toward codons in any set of organisms.  It will be able to find GC-rich regions, as well as output information to an Excel format file for analysis.  By reading in a DNA sequence, breaking it up into codons, and using a hash table to keep track of the counts of each different codon, a statistical analysis can be performed on the bias towards one codon or another to encode the same amino acid.  A visual analysis can be performed using Microsoft Excel’s chart capabilities.

The program will start by getting the input.  This is done by using glob to grab a bunch of files stored in a local folder.  The file names will be stored in an array.  The program will then enter a while loop to go through each file in the array and perform work on it.

#!/usr/bin/perl

use strict;
use warnings;

# needed for MacOS X if using BBedit or TexWrangler (see comments below)
use File::Spec;

#-------------------------------------------------------------------------------
# CONSTANTS
#-------------------------------------------------------------------------------

my $MAX_CODONS_OUTPUT = 64;               # top-N most frequent codons
my $dataDirectory = "inputGeneDirectory"; # relative path of directory holding input files
my $outputFileName = "outputResults.xls"; # set-up OUTPUT file (tab-delimited for Excel)
my $TRUE = 1;                             # bool value for true
my $FALSE = 0;                            # bool value for false
my $CODON_LENGTH = 3;                     # size of codon to use in current run

#-------------------------------------------------------------------------------
# VARIABLES/DATA STRUCTURES
#-------------------------------------------------------------------------------

# ARRAYS
my @fileList;      # all files in the directory ($dataDirectory) of data files
my @allCodons;     # an array to hold each codon
my @aminoAcids;    # a list of frequency counts for each aminoAcid and codons
my @sorted;        # a list of the counts sorted highest to lowest

# HASH MAPS
my %codonCounts;        # a hash table of frequency counts for each codon
my %codonMap;           # declare a hash table to hold the Amino Acid abbreviations
                        # corresponding to each possible triple of DNA (codon).
                        # Amino Acids are abgreviated with the ONE LETTER codes.
                        # (see the table at the bottom of this program for AA codes)

# VARIABLES
my $pathname;           # the path to the files
my $filename;           # the file name being worked on
my $nextFile;           # the next file in the list of files
my $sequence;           # the DNA sequence being read in
my $sequenceLength;     # the length of the DNA sequence
my $numCodons = 0;      # the number of codons found
my $nextCodon;          # scalar to hold each codon, one at a time
my $numSortedCodons = 0;# the number of sorted codons, useful if less than N
my $numToPrint;         # the final number of top-N codons to print
my $num = 0;            # counter variable, i would not work for some reason
my $rank = 0;           # the rank of the current codon
my $currentCount = 0;   # the number of times the current codon was found
my $proportion;         # the proportion of this codon compared to all codons
my $GCRich;             # the GC-rich region that was found
my $lengthGC;           # the length of this GC-rich region
my $firstGCRich = $TRUE;# is this the first GC-rich region found for this file?
my $sumGCLengths = 0;   # the sum of all of the GC-rich region lengths
my $percentGCs = 0;     # the percentage of the file that is rich in GC
my $nextAcid;           # the next amino acid in the array

#===============================================================================
#                      START PROGRAM
#===============================================================================

%codonMap = getAminoAcidLookUpTable(); # populate the codon hash map->amino acids

printTitle();             # print the title of the report (to the console)

#============================
# SET OUTPUT INFORMATION

#set the path name for the directory
setPathname ();

# set the output filename
$outputFileName = $pathname.$outputFileName;

# create the output file (tab-delimited for Excel)
open (OUTPUT, ">", $outputFileName)
    or die "Cannot create the file: $outputFileName: $!";
#============================
# SET INPUT INFORMATION

# use filename with .fna to get all FASTA files in the directory
$filename = $pathname.$dataDirectory."/*.fna";

# grab a "glob" of files, each filename is put into elements of the array @fileList
@fileList = glob($filename);

The first step when entering the while loop is to get the DNA from the file.  The input expected will be a FASTA format file.  The sequence must also begin with a start codon and end with a stop codon.

#========================================================
# FOREACH file in the directory of input data files

print "========================================================\n\n";

foreach $nextFile (@fileList)
{
    %codonCounts = ();       # clear the hash table

    # print the current filename to the screen (console, stdout)
    print "Next file is: $nextFile \n";

    # print filename to Excel (output) file
    print OUTPUT "$nextFile\n";

    # skip Linux files with names that start with ".";
    if ($nextFile =~ m/^\./)
    {
        next;
    }

    # get the DNA sequence and its length.
    $sequence = readInDNA($nextFile);
    $sequenceLength = length ( $sequence );

    print "Length of the DNA sequence: $sequenceLength \n\n";

    # get the last codon from the sequence
    my $lastCodon = substr ( $sequence, $sequenceLength - 3, 3 );

    # check if the sequence starts with an ATG.
    if ( substr ( $sequence, 0, $CODON_LENGTH ) ne "ATG" )
    {
        print "Gene must start with ATG.\n";
        print OUTPUT "Gene must start with ATG. Results may not be valid!\n\n";
    }

    # check if the sequence ends with a STOP.
    if ( $lastCodon ne "TAG" && $lastCodon ne "TAA" && $lastCodon ne "TGA" )
    {
        print "Gene must end with stop codon.\n";
        print OUTPUT "Gene must end with stop codon. Results may not be valid!\n\n";
    } 

The next step is to break the DNA sequence given into codons.  Since we know that the sequence is in the correct reading frame, no changes in the reading frame will need to occur.  For each codon a count will be increased.  This will be implemented using a hash table.  So if the sequence is CCACCA, the hash table will have one entry, CCA, with a count of 2.

# break the sequence into codons, storing each codon in a slot in an array
    @allCodons = breakIntoCodons($sequence );
    $numCodons = scalar ( @allCodons );    # set the number of codons found

    print "Number of codons: $numCodons \n";
    print OUTPUT "Number of codons: $numCodons \n";

    # go through the array of codons...
    foreach $nextCodon (@allCodons)
    {
        $codonCounts { $nextCodon }++;    #increase the count for this codon
    }

After reading all of the codons in the sequence, all of the unique codons found, along with their counts, will be output to an Excel format .xls file.  This is the first of two Excel files being used for this project.  Next, the hash table will be analyzed to see which types of amino acids are present.  These will be matched up to the codons to get a count of how many amino acids were formed using a specific codon.  The results of this will be output to the same Excel format file as above.  This will include all of the amino acids, sorted by the most biased codons that formed that amino acid.

# sort the array of codons by their count
    @sorted =  sort { $codonCounts{$b} <=> $codonCounts{$a} or $a cmp $b } keys(%codonCounts);

    # get the length of how many unique codons were found
    $numSortedCodons = scalar ( @sorted );

    # get the amino acid that corresponds to each unique codon
    for ( $num = 0; $num < $numSortedCodons; $num++ )
    {
        $aminoAcids [$num] = translate ( $sorted[$num] );
    }

The next part is to find GC-rich regions in the sequence.  The point of doing this is to find if there is a certain bias only during GC-rich regions.  The same procedure will be applied as above, but a new file will be created just for the GC-rich regions.

#===============================================================================
#		start search for GC-rich regions
#===============================================================================

    print "Searching for GC-rich regions greater than length 4...\n";

    # Search for GC-rich sequences of 4G/Cs or more
    while ( $sequence =~ m/([GC]{4,})/g )
    {
        $GCRich = $1;                       # the region
        $lengthGC = length ( $GCRich );     # length of region
        $sumGCLengths += $lengthGC;	        # total length of GC regions
        $firstGCRich = $FALSE;              # no long first region found
    }

    # if this is the first GC-rich region in the file...
    if ( $firstGCRich == $TRUE )
    {
        print "No GC-rich regions were located. \n";
        print "========================================================\n\n";
    }
    else
    {
        print "Found some! \n";
        print "========================================================\n\n";
    }

    # find the GC percentage
    $percentGCs = ( $sumGCLengths / $sequenceLength ) * 100;

    $sumGCLengths = 0;      # start over the GC-length
    $firstGCRich = $TRUE;   # set the next GC-rich region found as the first

The next part is to loop back to the beginning and move-on to the next file.  Each file will be separated by a few spaces.

The last step of the program deals with the output to the screen.   The program will output the number of codons found, each file name, and the relevant number of amino acids found.

#===============================================================================
#			start outputting results
#===============================================================================

    printf OUTPUT "GC%%: \t%5.1f \n\n", $percentGCs;
    printf OUTPUT "Rank\tCodon\tAmino Acid\tFrequency\tProportion\n";

    # if there are less than N "top-N" codons found
    if ( $MAX_CODONS_OUTPUT > $numSortedCodons )
    {
        $numToPrint = $numSortedCodons;	# set the number to output
    }

    # otherwise...
    else
    {
        $numToPrint = $MAX_CODONS_OUTPUT; # set to constant N, defined at top
    }

    # print the top N codons: rank, codon, occurences, and proportion
    for ( $num = 0; $num < $numToPrint; $num++ )
    {
        $rank++;     					                # set this codon's rank
        $nextCodon = $sorted[$num];			            # get the codon
        $currentCount = $codonCounts { $nextCodon };    # get the count
        $proportion = $currentCount / $numCodons;       # calculate proportion
        $nextAcid = $aminoAcids[$num];                  # get the next A.A.

        # don't output start or end codons, they are unnecessary
        if ( $nextAcid ne "..." && $nextAcid ne "Met" )
        {
            # now print...
            printf OUTPUT "$rank\t$nextCodon\t$nextAcid\t$currentCount\t%5.3f\n", $proportion;
        }
        else
        {
            # if it was a start or end codon, decrease rank by 1
            $rank--;
        }
    }

    $rank = 0;		        # start the ranks over
    $percentGCs = 0;        # start over the %GCs

    # make some space in the Excel otuput between tables
    print OUTPUT "\n\n\n";
}

# w00t, done!
close ( OUTPUT );
print "\n\nDone.\n";

#----- end of Main Program -----------------------------------------------------

The rest of the file contains various subroutines to read in the DNA and break it apart into codons, as well as set up the codon map.

#-----------\
# readInDNA \
#-------------------------------------------------------------------------------
# Subroutine to open a FASTA formatted file of DNA sequence, remove
# newline characters at the end of each line, discard any blank lines
# discard lines containing a comment(#) symbol, removes any digits,
# removes any extra whitespace characters, converts nucleotides
# to uppercase, and return as one big string.
#
# WRITTEN BY MARK LEBLANC/BETSY DYER
#
# 1 argument:  name of file holding the DNA
#
# RETURNS:  file of DNA as one string
#
# A sample of how you might "call" this subroutine from your program
# to 'get' the DNA from a Fasta formatted file for a gene sequence
#
#                  $sequence = readInDNA("humanXM_012345.fna");

sub readInDNA
{
    my $filename = shift;
    my $line;
    my $firstline;
    my $seq;

        # try to open the file; if no file by that name, print a warning and exit
        open (FNAFILE, "<", $filename)
            or die "Cannot open the input file: $filename: $!";

        if ( !($firstline = <FNAFILE>) )
        {
               print "Can NOT read the header line from the file called: $filename \n";
               exit();
         }

    $seq = "";

      # WHILE not EOF, grab next line
    while ($line = <FNAFILE>)
    {
        chomp $line;

            # discard blank line
            if ($line =~ m/^\s*$/)
            {
                    next;
                }

            # discard comment line (any line with a comment(#) symbol)
            if ($line =~ m/^\s*#/)
            {
                    next;
                }

                $line =~ tr/0123456789//d;    # remove all digits
        $line =~ tr/     \n\r//d;        # remove any extra whitespace
        $line = uc($line);        # convert everything to upper-case

            $seq = $seq . $line;        # concatenate new line onto the entire text so far

    } # end WHILE not EOF

    close(FNAFILE);

    return $seq;
}
#----- end of readInDNA --------------------------------------------------------


#-----------------\
# breakIntoCodons \
#-------------------------------------------------------------------------------
# Subroutine to break the text passed to it into codons of length $CODON_LENGTH.
#
# Partially WRITTEN BY MARK LEBLANC/BETSY DYER
#
# RETURNS: an array of codons.

sub breakIntoCodons
{
    my $text = shift;        # the sequence being searched
       my $DNAlen = length($text);     # the length of the sequence
    my @codons;              # array of codons to fill and return when done
    my $i=0;                # points to start of the next codon each time
    my $j=0;                # $codon[$j] is the next codon
        my $end = $DNAlen-($CODON_LENGTH);    # don't fall off the end of the sequence
        my $tempMotif;            # the current codon being worked on

        while ( $i < $end )
       {
               # get the next codon
               $tempMotif = substr($text, $i, $CODON_LENGTH);

                # if there is some other letter than ACGT...
              if ( $tempMotif =~ m/[^ACGT]/i )
               {
                    # skip to next valid codon
                 $i += $CODON_LENGTH;
               }

                # otherwise, add this codon to valid codons
               else
               {
                 $codons[$j] = $tempMotif; # add to array of codons
                $i += $CODON_LENGTH;     # move to next codon
                 $j++;                    # move to next box in array
               }
    }

        return @codons;
}
#----- end of breakIntoCodons --------------------------------------------------


#-------------\
# setPathname \
#-------------------------------------------------------------------------------
# Subroutine to set the pathname.  Useful to keep the main program a little
# more readable.  Expects a global variable of $pathname.
#
# WRITTEN BY MARK LEBLANC/BETSY DYER
#
# RETURNS: nothing.

sub setPathname
{
    #-----------------------------------------------------------------------
    # establish the absolute pathname to the directory of input files
    #-----------------------------------------------------------------------
    # MacOS USERS TAKE NOTE
    # determine the absolute pathname to the current directory
    # next two lines need to be uncommented for MacOS X (e.g., BBedit or TexWrangler)
    # my @path       = File::Spec->splitdir( File::Spec->rel2abs($0) );
    # $pathname      = File::Spec->catfile( @path[0 .. $#path - 1], "");
    #-----------------------------------------------------------------------
    # uncomment if using
    # WINDOWS or Linux (or MacOS X on command-line)
    $pathname   = "";
    #-----------------------------------------------------------------------
}
#----- end of setPathname ------------------------------------------------------


#------------------------\
# getAminoAcidLookUpTable \
#-----------------------------------------------------------------------
# Construct hash that maps codons to amino acids by reading the table
# from DATA lines at the end of the program, which have the form:
# Residue Codon1 Codon2 ...
#
# no arguments
#
# WRITTEN BY MARK LEBLANC/BETSY DYER
#
# RETURNS: hash table of codon and their one-letter amino acid abbreviations

sub getAminoAcidLookUpTable
{
  my %codonMap;

  while (my $in = <DATA> )    # assigns next line of DATA to $in; fails if none
  {
    chomp($in);                   # remove line feed at end of $in
    my @codons = split " ",$in;
    my $residue = shift @codons;  # remove first item from @codons and assign
    foreach my $nnn (@codons) {
        $codonMap{$nnn} = $residue;
    }
  }

  return %codonMap;

}
#----- end of getAminoAcidLookUpTable ------------------------------------------


#-----------\
# translate  \
#-----------------------------------------------------------------------
# translates (presumably pre-transcribed) DNA strings to proteins
#
# 1 argument: the DNA string to translate
#
# WRITTEN BY MARK LEBLANC/BETSY DYER
#
# RETURNS: the protein string. (1-letter Amino Acid abbreviations)

sub translate
{
    my $dna = shift(@_);  # the DNA string to be translated.

    my $pro = "";
    while ( $dna =~ s/(...)// ) {
        $pro = $pro . $codonMap{$1};
    }
    return $pro;
}
#----- end of translate --------------------------------------------------------

#define the mapping from codons to amino acids...
__END__
Ala GCT GCC GCA GCG
Arg CGT CGC CGA CGG AGA AGG
Asn AAT AAC
Asp GAT GAC
Cys TGT TGC
Gln CAA CAG
Glu GAA GAG
Gly GGT GGC GGA GGG
His CAT CAC
Ile ATT ATC ATA
Lei TTA TTG CTT CTC CTA CTG
Lys AAA AAG
Met ATG
Phe TTT TTC
Pro CCT CCC CCA CCG
Ser TCT TCC TCA TCG AGT AGC
Thr ACT ACC ACA ACG
Trp TGG
Tye TAT TAC
Val GTT GTC GTA GTG
... TAA TAG TGA

Results

The results proved that there is a definite bias towards some codons to create certain amino acids, namely asparagine, aspartic acid, and lysine. GC-richness appeared to have no effect on codon bias. Any answers as to why there is a codon bias at all are still being researched.

For the full project, including raw notes and code, download this zip file: codonBiasFinal
(Note: this project is kind of thrown together, meaning standard coding practices may not have been followed, such as breaking up parts of the program into subroutines, etc. If I remember correctly, it was a requirement of the professors that all work be done in the provided subroutines, with no new subs created, but I could be wrong. Either way, looking at the code now it’s atrocious.)

One thought on “A crash course in Perl and DNA.

Comments are closed.