new(-seq => "aaaatgggggggggggccccgtt". EDIT: This is a new answer which assumes that your sequence header line begins with ">". If I understand your question correctly, you are trying to extract the part of the header line which reads numreads=NN and determine whether NN > 500. The answer is yes, this can be done using Bio::SeqIO. III.4.3 Parsing BLAST and FASTA reports with Search and SearchIO . examples/ directories (see bioperl.pod for a description of all these .. A common - and tedious - bioinformatics task is that of converting. sequence data among the many widely used data formats. Bioperl's SeqIO. object, however, makes this chore a.">
1 Metaxe

Bioperl Seqio Formats For Essays

Beginners HOWTO




Authors

Brian Osborne briano at bioteam.net

Copyright

This document is copyright Brian Osborne. It can be copied and distributed under the terms of the Perl Artistic License.

Abstract

This is a HOWTO that talks about using Bioperl, for biologists who would like to learn more about writing their own bioinformatics scripts using Bioperl. Bioperl is an open source bioinformatics toolkit used by researchers all over the world. If you’re looking for a script built to fit your exact needs you may or may not find it in Bioperl (check the scripts and examples directories). What you will find is an extensive set of Perl modules that will enable you to write your own script, and a community of people who are willing to help you.

Introduction

If you’re a molecular biologist it’s likely that you’re interested in gene and protein sequences, and you study them in some way on a regular basis. Perhaps you’d like to try your hand at automating some of these tasks, or you’re just curious about learning more about the programming side of bioinformatics. In this HOWTO you’ll see discussions of some of the common uses of Bioperl, like sequence analysis with BLAST and retrieving sequences from public databases. You’ll also see how to write Bioperl scripts that chain these tasks together, that’s how you’ll be able to do really powerful things with Bioperl.

You will also see some discussions of software concepts, this can’t be avoided. The more you understand about programming the better but all efforts will be made to not introduce too much unfamiliar material. However, there will be an introduction to modularity, or objects. This is one of the aspects of the Bioperl package that you’ll have to come to grips with as you attempt more complex tasks with your scripts.

One of the challenging aspects of learning a new skill is learning the jargon, and programming certainly has its share of interesting terms and buzz phrases. Be patient - remember that the programmers learning biology have had just as tough a task (if not worse - just ask them!).

Note This HOWTO does not discuss a very nice module that’s designed for beginners, Bio::Perl. The reason is that though this is an excellent introductory tool, it is not object-oriented and can’t be extended. What we’re attempting here is to introduce the core of Bioperl and show you ways to expand your new-found skills.

Installing Bioperl

Start at Installing Bioperl. Many of the letters to the bioperl-l mailing list concern problems with installation, and there is a set of concerns that come up repeatedly:

  • On Windows, messages like: , or . The explanation is that Bioperl does not supply every accessory module that’s necessary to run all of Bioperl. You’ll need to search other repositories to install all of these accessory modules. See the INSTALL.WIN file for more information.

  • On Unix, messages like . This means that Perl could not find a particular module and the explanation usually is that this module is not installed.

  • Seeing messages like . If you see an error during installation consider whether this problem is going to affect your use of Bioperl. There are roughly 1000 modules in Bioperl, and ten times that many tests are run during the installation. If there’s a complaint about GD it’s only relevant if you want to use the modules, if you see an error about some XML parser it’s only going to affect you if you’re reading XML files.

Getting Assistance

You may run into problems installing Bioperl or writing scripts using Bioperl. If you need assistance the way to get it is to mail bioperl-l@bioperl.org. There are many helpful people who regularly read this list but if you want their advice it’s best to give sufficient detail.

Please include:

  • The version of Bioperl you’re working with.
  • The platform or operating system you’re using.
  • What you are trying to do.
  • The code that gives the error, if you’re writing a script.
  • Any error messages you saw.

Every once in a while a message will appear in coming from someone in distress that goes unanswered. The explanation is usually that the person neglected to include 1 or more of the details above, usually the script or the error messages.

Perl Itself

Here are a few things you might want to look at if you want to learn more about Perl:

  • Learning Perl is the most frequently cited beginner’s book.

  • Perl in a Nutshell is also good. Not much in the way of examples, but covers many topics succinctly.

  • Perl’s own documentation. Do from the command-line for an introduction. Perldoc can give you documentation of any module that is installed on your system: do to view documentation of some module. For instance (assuming Bioperl has been installed) try:

Writing a script

Sometimes the trickiest part is this step, writing something and getting it to run, so this section attempts to address some of the more common tribulations.

In Unix when you’re ready to work you’re usually in the command-line or “shell” environment. First find out Perl’s version by typing this command:

You will see something like:

Hopefully you’re using Perl version 5.8 or higher, earlier versions may be troublesome. Now let’s find out where the Perl program is located:

or:

(Windows)

This will give you something like:

Now that we know where Perl is located we’re ready to write a script, and line 1 of the script will specify this location. You might be using some Unix word processor, emacs or vi, for example (or nano, very easy to use, but not found on all Unix machines unfortunately). If you’re on Windows then Wordpad will work. Start to write your script by entering something like:

And make this the first line of the script:

Creating a sequence, and an Object

Our first script will create a sequence. Well, not just a sequence, you will be creating a sequence object, since Bioperl is written in an object-oriented way. Why be object-oriented? Why introduce these odd or intrusive notions into software that should be ‘biological’ or ‘intuitive? The reason is that thinking in terms of modules or objects turns out to be the most flexible, and ultimately the simplest, way to deal with data as complex as biological data. Once you get over your initial skepticism, and have written a few scripts, you will find this idea of an object becoming a bit more natural.

One way to think about an object in software is that it is a container for data. The typical sequence entry contains different sorts of data (a sequence, one or more identifiers, and so on) so it will serve as a nice example of what an object can be.

All objects in Bioperl are created by specific Bioperl modules, so if you want to create an object you’re also going to have to tell Perl which module to use. Let’s add another line:

This line tells Perl to use a module on your machine called Bio/Seq.pm. We will use this module to create a object. The module is one of the central modules in Bioperl. The analogous object, or Sequence object, or Seq object, is ubiquitous in Bioperl, it contains a single sequence and associated names, identifiers, and properties. Let’s create a very simple sequence object at first, like so:

The variable is the Sequence object, a simple one, containing just a sequence. Note that the code tells Bioperl that the sequence is DNA (the choices here are dna, rna, and protein), this is the wise thing to do. If you don’t tell Bioperl it will attempt to guess the alphabet. Normally it guesses correctly but if your sequence has lots of odd or ambiguous characters, such as N or X, Bioperl’s guess may be incorrect and this may lead to some problems.

Sequence objects can be created manually, as above, but they’re also created automatically in many operations in Bioperl, for example when alignment files or database entries or BLAST reports are parsed.

Any time you explicitly create an object, you will use this method. The syntax of this line is one you’ll see again and again in Bioperl: the name of the object or variable, the module name, the symbol, the method name , some argument name like , the symbol, and then the argument or value itself, like aaaatgggggggggggccccgtt.

Note You may have come across the term “function” or “sub-routine”. In object-oriented programming the term is used instead.

The object was described as a data container, but it is more than that. It can also do work, meaning it can use or call specific methods taken from the module or modules that were used to create it. For example, the Bio::Seq module can access a method named that will print out the sequence of objects. You could use it like this:

As you’d expect, this script will print out aaaatgggggggggggccccgtt. That symbol is used when an object calls or accesses its methods.

Let’s make our example a bit more true-to-life, since a typical sequence object needs an identifier, perhaps a description, in addition to its sequence.

aaaatgggggggggggccccgtt, #12345, and example 1 are called “arguments” in programming jargon. You could say that this example shows how to pass arguments to the method.

Writing a sequence to a file

This next example will show how two objects can work together to create a sequence file. We already have a Sequence object, , and we will create an additional object whose responsibility it is to read from and write to files. This object is the Bio::SeqIO object, where IO stands for Input/Output. By using in this manner you will be able to get input and make output for all of the sequence file formats supported by Bioperl (the SeqIO HOWTO has a complete list of supported formats). The way you create objects is very similar to the way we used to create a Bio::Seq, or sequence, object:

Note that in the argument. This character indicates that we’re going to write to the file named sequence.fasta, the same character we’d use if we were using Perl’s function to write to a file. The argument, fasta, tells the object that it should create the file in fasta format.

Let’s put our 2 examples together:

Let’s consider that last line where you see two objects since this is where some neophytes start to get a bit nervous. What’s going on there? In that line we handed or passed the Sequence object to the object as an argument to its method. Another way to think about this is that we hand the Sequence object to the object since understands how to take information from the Sequence object and write to a file using that information, in this case in fasta format. If you run this script like this:

You should create a file called sequence.fasta that looks like this:

Let’s demonstrate the intelligence of Bio::SeqIO - the example below shows what is created when the argument to is set to genbank instead of fasta:

Retrieving a sequence from a file

One beginner mistake is to not use Bio::SeqIO when working with sequence files. This is understandable in some respects. You may have read about Perl’s function, and Bioperl’s way of retrieving sequences may look overly complicated, at first. But don’t use ! Using immediately forces you to do the parsing of the sequence file and this can get complicated very quickly. Trust the Bio::SeqIO object, it’s built to open and parse all the common sequence formats, it can read and write to files, and it’s built to operate with all the other Bioperl modules that you will want to use.

Let’s read the file we created previously, sequence.fasta, using Bio::SeqIO. The syntax will look familiar:

One difference is immediately apparent: there is no character. Just as with with the function this means we’ll be reading from the sequence.fasta file. Let’s add the key line, where we actually retrieve the Sequence object from the file using the method:

Here we’ve used the method of the object. When you use, or call, the object will get the next available sequence, in this case the first sequence in the file that was just opened. The Sequence object that’s created, , is functionally identical to the Sequence object we created manually in our first example. This is another idiom that’s used frequently in Bioperl, the next_something method. You’ll come across the same idea in the method of reading and writing alignment files and the method of reading the output of sequence comparison programs such as BLAST and HMMER.

If there were multiple sequences in the input file you could just continue to call in a loop, and SeqIO would retrieve the Seq objects, one by one, until none were left, starting with the 1st sequence in the file:

Do you have to supply a argument when you are reading from a file, as we did? Not necessarily, but it’s the safe thing to do. If you don’t give a format then the SeqIO object will try to determine the format from the file suffix or extension (and a list of the file extensions is in the SeqIO HOWTO. In fact, the suffix fasta is one that SeqIO understands, so is unnecessary above. Without a known suffix SeqIO will attempt to guess the format based on the file’s contents but there’s no guarantee that it can guess correctly for every single format.

It may be useful to tell SeqIO the alphabet of the input, using the argument. What this does is to tell SeqIO not to try to determine the alphabet (dna, rna, protein). This helps because Bioperl may guess incorrectly (for example, Bioperl is going to guess that the protein sequence is DNA). There may also be odd characters present in the sequence that SeqIO objects to (e.g. ). Set to a value when reading sequences and SeqIO will not attempt to guess the alphabet of those sequences or validate the sequences.

Retrieving a sequence from a database

One of the strengths of Bioperl is that it allows you to retrieve sequences from all sorts of sources, files, remote databases, local databases, regardless of their format. Let’s use this capability to get a entry from Genbank. What will we retrieve? Again, a Sequence object. Let’s choose our module:

We could also query SwissProt, GenPept, EMBL, SeqHound, Entrez Gene, or RefSeq in an analogous fashion (e.g ). Now we’ll create the object:

In this case we’ve created a database object using the method, but without any arguments. Let’s ask the object to do something useful:

The argument passed to the method is an identifier, 2, a Genbank GI number. You could also use the method with an accession number (e.g. A12345) or using a versioned accession number (e.g. A12345.2). Make sure to use the proper identifier for the method you use, the methods are not interchangeable.

Retrieving multiple sequences from a database

There are more sophisticated ways to query Genbank than this. This next example attempts to do something biological, using the module Bio::DB::Query::GenBank. Want all Arabidopsis topoisomerases from Genbank Nucleotide? This would be a reasonable first attempt:

The length is limited by , don’t want to download genomes!

Note This capability to query by string and field is only available for [GenBank as of Bioperl version 1.5, queries to other databases, like Swissprot or EMBL, are limited to identifiers and accessions.

Here’s another query example, this one will retrieve all Trypanosoma brucei ESTs:

You can find detailed information on Genbank’s query fields here.

That is how we would construct a query object, but we haven’t retrieved sequences yet. To do so we will have to create a database object, some object that can get Sequence objects for us, just as we did in the first Genbank example:

That and its method may not look familiar. The idea is that you will use a stream whenever you expect to retrieve a stream or series of sequence objects. Much like , but built to retrieve one or more objects, not just one object.

Notice how carefully separated the responsibilities of each object are in the code above: there’s an object just to hold the query, an object to execute the query using this query object, an object to do the I/O, and finally the sequence object.

Note Be careful what you ask for, many of today’s nucleotide database entries are genome-size and you will probably run out of memory if your query happens to match one of these monstrosities. You can use the field to limit the size of the sequences you retrieve.

The Sequence Object

There’s been a lot of discussion around the Sequence object, and this object has been created in a few different ways, but we haven’t shown what it’s capable of doing. The table below lists the methods available to you if you have a Sequence object in hand. “Returns” means what the object will give you when you ask it for data. Some methods, such as , can be used to get or set values. You’re setting when you assign a value, you’re getting when you ask the object what values it has. For example, to get or retrieve a value

To set or assign a value:

The table below shows the methods you’re likely to use with the Sequence object directly. Bear in mind that not all values, such as or , are found in all sequence formats, you have to know something about your input sequences in order to get some of these values.

NameReturnsExampleNote
accession_numberidentifier$acc = $so->accession_numberget or set an identifier
alphabetalphabet$so->alphabet(‘dna’)get or set the alphabet (‘dna’,’rna’,’protein’)
authorityauthority, if available$so->authority(“DB”)get or set the organization
descdescription$so->desc(“Example 1”)get or set a description
display_ididentifier$so->display_id(“M123456”)get or set an identifier
divisiondivision, if available (e.g. PRI)$div = $so->divisionget division (e.g. “PRI”)
get_datesarray of dates, if available@dates = $so->get_datesget dates
get_secondary_accessionsarray of secondary accessions, if available@accs = $so->get_secondary_accessionsget other identifiers
is_circularBooleanif $so->is_circular{}get or set
keywordskeywords, if available@array = $so->keywordsget or set keywords
lengthlength, a number$len = $so->lengthget the length
moleculemolecule type, if available$type = $so->moleculeget molecule (e.g. “RNA”, “DNA”)
namespacenamespace, if available$so->namespace(“Private”)get or set the name space
newSequence object$so = Bio::Seq->new(-seq => “MPQRAS”)create a new one, see for more
pidpid, if available$pid = $so->pidget pid
primary_ididentifier$so->primary_id(12345)get or set an identifier
revcomSequence object$so2 = $so1->revcomReverse complement
seqsequence string$seq = $so->seqget or set the sequence
seq_versionversion, if available$so->seq_version(“1”)get or set a version
speciesSpecies object$species_obj = $so->speciesSee for more
subseqsequence string$string = $seq_obj->subseq(10,40)Arguments are start and end
translateprotein Sequence object$prot_obj = $dna_obj->translate
truncSequence object$so2 = $so1->trunc(10,40)Arguments are start and end

Table 1. Sequence objects method.

There are also a number of methods that are concerned with the Features and Annotations associated with the Sequence object. This is something of a tangent but if you’d like to learn more see the Feature-Annotation HOWTO. The methods related to this topic are shown below.

NameReturnsNote
get_SeqFeaturesarray of SeqFeature objects
get_all_SeqFeaturesarray of SeqFeature objects arrayincludes sub-features
remove_SeqFeaturesarray of SeqFeatures removed
feature_countnumber of SeqFeature objects
add_SeqFeatureannotation array of Annotation objectsget or set

Table 2. Feature and Annotation methods.

Example Sequence Objects

Let’s use some of the methods above and see what they return when the sequence object is obtained from different sources. In the Genbank example we’re assuming we’ve used Genbank to retrieve or create a Sequence object. So this object could have have been retrieved like this:

Or it could have been created from a file like this:

What the Genbank file looks like:

Either way, the values returned by various methods are shown below.

MethodReturns
display_idECORHO
descE.coli rho gene coding for transcription termination factor.
display_nameECORHO
accessionJ01673
primary_id147605
seq_version1
keywordsattenuator; leader peptide; rho gene; transcription terminator
is_circular
namespace
authority
length1880
seqAACCCT…ACAGGAC
divisionBCT
moleculeDNA
get_dates26-APR-1993
get_secondary_accessionsJ01674

Table 3. Values from Genbank.

There’s a few comments that need to be made. First, you noticed that there’s an awful lot of information missing. All of this missing information is stored in what Bioperl calls Features and Annotations, see the Feature and Annotation HOWTO if you’d like to learn more about this. Second, a few of the methods don’t return anything, like and . The reason is that though these are good values in principle there are no commonly agreed upon standard names - perhaps someday the authors will be able to rewrite the code when all our public databases agree what these values should be. Finally, you may be wondering why the method names are what they are and why particular fields or identifiers end up associated with particular methods. Again, without having standard names for things that are agreed upon by the creators of our public databases all the authors could do is use common sense, and these choices seem to be reasonable ones.

Next let’s take a look at the values returned by the methods used by the Sequence object when a Fasta file is used as input. The Fasta file entry looks like this, clearly much simpler than the corresponding Genbank entry:

And here are the values:

MethodReturns
display_idgi|147605|gb|J01673.1|ECORHO
descE.coli rho gene coding for transcription termination factor
display_namegi|147605|gb|J01673.1|ECORHO
accessionunknown
primary_idgi|147605|gb|J01673.1|ECORHO
is_circular
namespace
authority
length1880
seqAACCCT…ACAGGAC

Table 4. Values from Fasta.

If you compare these values to the values taken from the Genbank entry you’ll see that certain values are missing, like . That’s because values like these aren’t usually present in a Fasta file.

Another natural question is why the values returned by methods like are different even though the only thing distinguishing these entries are their respective formats. The reason is that there are no rules governing how one interconverts formats, meaning how Genbank creates Fasta files from Genbank files may be different from how SwissProt performs the same interconversion. Until the organizations creating these databases agree on standard sets of names and formats all that Bioperl can do is do make reasonable choices.

Yes, Bioperl could follow the conventions of a single organization like Genbank such that returns the same value when using Genbank format or Genbank’s fasta format but the authors have elected not to base Bioperl around the conventions of any one organization.

Let’s use a Swissprot file as our last example. The input entry looks like this:

The corresponding set of values is shown below.

MethodReturns
display_idA2S3_RAT
descAmyotrophic lateral … protein of 98 kDa).
display_nameA2S3_RAT
accessionQ8R2H7
is_circular
namespace
authority
seq_version
keywordsCoiled coil; Alternative splicing; Polymorphism
length913
seqMSLSQ…ILKED
divisionRAT
get_dates28-FEB-2003 (Rel. 41, Created)
get_secondary_accessionsQ8R2H6 Q8R4G3

Table 5. Values from SwissProt.

As in the Genbank example there’s information that the Sequence object doesn’t supply, and it’s all stored in Annotation objects. See the Feature and Annotation HOWTO for more.

Translating

Translation in bioinformatics can mean slightly different things, either translating a nucleotide sequence from start to end or translate the actual coding regions in mRNAs or cDNAs. The Bioperl implementation of sequence translation does both of these.

Any sequence object with an alphabet of dna or rna can be translated by simply using which returns a protein sequence object:

All codons will be translated, including those before and after any initiation and termination codons. For example, ttttttatgccctaggggg will be translated to .

However, the method can also be passed several optional parameters to modify its behavior. For example, you can tell to modify the characters used to represent terminator (default is *) and unknown amino acids (default is X).

You can also determine the frame of the translation. The default frame starts at the first nucleotide (frame 0). To get translation in the next frame we would write:

If we want to translate full coding regions (CDS) the way major nucleotide databanks EMBL, GenBank and DDBJ do it, the method has to perform more checks. Specifically, needs to confirm that the open reading frame has appropriate start and terminator codons at the very beginning and the very end of the sequence and that there are no terminator codons present within the sequence in frame 0. In addition, if the genetic code being used has an atypical (non-ATG) start codon, the method needs to convert the initial amino acid to methionine. These checks and conversions are triggered by setting “complete” to 1:

If is set to true and the criteria for a proper CDS are not met, the method, by default, issues a warning. By setting to 1, one can instead instruct the program to die if an improper CDS is found, e.g.

The codontable_id argument to makes it possible to use alternative genetic codes. There are currently 16 codon tables defined, including Standard, Vertebrate Mitochondrial, Bacterial, Alternative Yeast Nuclear and Ciliate, Dasycladacean and Hexamita Nuclear. All these tables can be seen in Bio::Tools::CodonTable. For example, for mitochondrial translation:

You can also create a custom codon table and pass this to , the code will look something like this:

See Bio::Tools::CodonTable for information on the format of a codon table.

can also find the open reading frame (ORF) starting at the 1st initiation codon in the nucleotide sequence, regardless of its frame, and translate that:

Most of the codon tables, including the default codon table NCBI Standard, have initiation codons in addition to ATG. To tell to use only ATG or atg as the initiation codon set -start to :

The argument only applies when is set to 1.

Last trick. By default will translate the termination codon to some special character (the default is , but this can be reset using the argument).

When is set to 1 this character is removed. So, with this:

the sequence tttttatgccctaggggg will be translated to , not .

See Bio::Tools::CodonTable and Bio::PrimarySeqI for more information on translation.

Obtaining basic sequence statistics

In addition to the methods directly available in the Seq object, Bioperl provides various helper objects to determine additional information about a sequence. For example, the Bio::Tools::SeqStats object provides methods for obtaining the molecular weight of the sequence as well the number of occurrences of each of the component residues (bases for a nucleic acid or amino acids for a protein.) For nucleic acids, also returns counts of the number of codons used. For example:

Note: sometimes sequences will contain ambiguous codes. For this reason, returns a reference to a two element array containing a greatest lower bound and a least upper bound of the molecular weight.

The SeqWords object is similar to SeqStats and provides methods for calculating frequencies of “words” (e.g. tetramers or hexamers) within the sequence. See Bio::Tools::SeqStats and Bio::Tools::SeqWords for more information.

Running applications: BLAST

This section is outdated, please see HOWTO:BlastPlus. BLAST is no longer supported by NCBI, it has been superceded by BLAST+.

You have access to a large number of sequence analysis programs within Bioperl. Typically this means you have a means to run the program and frequently a means of parsing the resulting output, or report, as well. Certainly the most popular analytical program is BLAST so let’s use it as an example. First you’ll need to get BLAST, installed on your machine and running, versions of the program that can run on all the popular operating systems can be downloaded from NCBI. The example code assumes that you used the formatdb program to index the database sequence file db.fa.

As usual, we start by choosing a module to use, in this case . You stipulate some blastall parameters used by the blastall program by using . As you’d expect, we want to create a Blast object, and we will pass a Sequence object to the Blast object, this Sequence object will be used as the query:

By calling the method you’re actually running BLAST, creating the report file, and parsing the report file’s contents. All the data in the report ends up in the report object, and you can access or print out the data in all sorts of ways. The report object, , and the result object, , come from the SearchIO modules. The [SearchIO HOWTO] will tell you all about using these objects to extract useful data from your BLAST analyses.

Here’s an example of how one would use SearchIO to extract data from a [BLAST] report:

This code prints out details about the match when the HSP or aligned pair are greater than 75% identical.

Sometimes you’ll see errors when you try to use that have nothing to do with Bioperl. Make sure that BLAST is set up properly and running before you attempt to script it.

Bioperl enables you to run a wide variety of bioinformatics programs but in order to do so, in most cases, you will need to install the accessory bioperl-run package. In addition there is no guarantee that there is a corresponding parser for the program that you wish to run, but parsers have been built for the most popular programs. You can find the bioperl-run package on the download page.

Indexing for Fast Retrieval

One of the under-appreciated features of Bioperl is its ability to index sequence files. The idea is that you would create some sequence file locally and create an index file for it that enables you to retrieve sequences from the sequence file. Why would you want to do this? Speed, for one. Retrieving sequences from local, indexed sequence files is much faster than using the module used above that retrieves from a remote database. It’s also much faster than using SeqIO, in part because SeqIO is stepping through a file, one sequence at a time, starting at the beginning of the file. Flexibility is another reason. What if you’d created your own collection of sequences, not found in a public database? By indexing this collection you’ll get fast access to your sequences.

There’s only one requirement here, the term or id that you use to retrieve the sequence object must be unique in the index, these indices are not built to retrieve multiple sequence objects from one query.

There are a few different modules in Bioperl that can index sequence files, the modules and Bio::DB::Fasta. All these modules are scripted in a similar way: you first index one or more files, then retrieve sequences from the indices. Let’s begin our script with the use statement and also set up our environment with some variables (the sequence file will be called sequence.fa):

The lines above show that you can set environmental variables from within Perl and they are stored in Perl’s own hash. This is essentially the same thing as the following in tcsh or csh:

Or the following in the bash shell:

The variable refers to the indexing scheme, and is the scheme that comes with Perl. stipulates the location of the index file, and this way you could have more than one index file per sequence file if you wanted, by designating multiple locations (and the utility of more than 1 index will become apparent).

Now let’s construct the index:

You would execute this script in the directory containing the sequence.fa file, and it would create an index file called sequence.fa.idx. Then you would retrieve a sequence object like this:

By default the fasta indexing code will use the string following the character as a key, meaning that fasta header line should look something like this if you want to fetch using the value 48882:

However, what if you wanted to retrieve using some other key, like 1CRA in the example above? You can customize the index by using Bio::Index::Fasta’s method, which accepts the name of a function as an argument where that function tells the indexing object what key to use. For example:

To be precise, one would say that the method accepts a reference to a function as an argument.

Bio::DB::Fasta has some features that Bio::Index::Fasta lacks, one of the more useful ones is that it was built to handle very large sequences and can retrieve sub-sequences from genome-size sequences efficiently. Here is an example:

This script indexes the genome.fa file, then retrieves a sub-sequence of CHROMOSOME_I, starting at 11250 and ending at 11333. One can also specify what ids can be used as keys, just as in Bio::Index::Fasta.

There’s a bit more information on indexing in HOWTO:Local_Databases.

Running applications: Searching for genes in genomic DNA

Parsers for widely used gene prediction programs - Genscan, Sim4, Genemark, Grail, ESTScan and MZEF - are available in Bioperl. The interfaces for these parsers are all similar. The syntax is relatively self-explanatory, see Bio::Tools::Genscan, Bio::Tools::Genemark, Bio::Tools::Grail, Bio::Tools::ESTScan, Bio::Tools::MZEF, and Bio::Tools::Sim4::Results for further details. Here are some examples on how to use these modules.

See Bio::Tools::Prediction::Gene and Bio::Tools::Prediction::Exon for more details.

See Bio::SeqFeature::Generic and Bio::Tools::Sim4::Exons for more information.

A parser for the ePCR program is also available. The ePCR program identifies potential PCR-based sequence tagged sites (STSs) For more details see the documentation in Bio::Tools::EPCR. A sample skeleton script for parsing an ePCR report and using the data to annotate a genomic sequence might look like this:

Running applications: Using EMBOSS applications with Bioperl

EMBOSS is an extensive collection of sequence analysis programs written in the C programming language (http://emboss.sourceforge.net/). There are a number of algorithms in EMBOSS that are not found in Bioperl (e.g. calculating DNA melting temperature, finding repeats, identifying prospective antigenic sites) so if you cannot find the function you want in Bioperl you might be able to find it in EMBOSS. The Bioperl code that runs EMBOSS programs is Bio::Factory::EMBOSS.

EMBOSS programs are usually called from the command line but the bioperl-run auxiliary library provides a Perl wrapper for EMBOSS function calls so that they can be executed from within a Perl script. Of course, the EMBOSS package as well as the bioperl-run must be installed in order for the Bioperl wrapper to function.

An example of the Bioperl wrapper where a file is returned would be:

Note that a Seq object was used as input. The EMBOSS object can also accept a file name as input, e.g.

Some EMBOSS programs will return strings, others will create files that can be read directly using Bio::SeqIO. It’s worth mentioning that another way to align sequences in Bioperl is to run a program from the EMBOSS suite, such as matcher. This can produce an output file that Bioperl can read in using Bio::AlignIO:

The Bio::Seq module allows us to create the main sequence object in Bioperl. A Seq object is a sequence with sequence features placed on it. These objects have associated methods that can manipulate sequence data in an enormous number of ways.

Here's the CPAN documentation for this module.


Creating a sequence object

There are a number of ways of creating a Bio::Seq sequence object. You can read in a sequence from a file:

my $seqio = Bio::SeqIO->new( '-format' => 'embl' , -file => 'myfile.dat'); my $seqobj = $seqio->next_seq();

You can download the sequence from a database:

use Bio::DB::Genbank; my $db = Bio::DB::GenBank->new(); my $seqobj = $db->get_Seq_by_acc('X78121');

Or you can make the sequence from strings in your script:

my $seqobj = Bio::Seq->new( -display_id => 'my_id', -seq => $sequence_as_string);

The attributes of a sequence object

The 'keys' of the sequence object (like -display_id) are also called attributes or named parameters. You can only easily use pre-defined attributes. Think about how you print, for example, the sequence out of a sequence object. You type:

You get the sequence out of the object by calling the method seq(), which has been written to access the value of the attribute -seq. If you create a novel attribute in your sequence object, you would also need to write a new method in the class Bio::Seq to access the value of that attribute. You can do that, since Bioperl is Open Access! But it probably more effort than you want to go to on a regular basis.

One upshot of this is that you can get a good idea of the possible attributes of sequence objects by looking at the methods available to them. In reality, you will also rarely need to know the exact names of sequence object attributes because that is largely under the hood - you will normally be using Bio::SeqIO and Bio::Seq to read in existing data files and manipulate them via methods - you won't have to be writing out sequence attribute names and values yourself.


Object methods for Bio::Seq

There are many methods available to Bio::Seq objects. I've listed some of them below, along with information about what arguments they take, and what they return.

These methods return strings, and accept strings in some cases:

$seqobj->seq(); # string of sequence $seqobj->subseq(5,10); # part of the sequence as a string $seqobj->accession_number(); # when there, the accession number $seqobj->alphabet(); # one of 'dna','rna',or 'protein' $seqobj->seq_version() # when there, the version $seqobj->keywords(); # when there, the Keywords line $seqobj->length() # length $seqobj->desc(); # description $seqobj->primary_id(); # a unique id for this sequence regardless # of its display_id or accession number $seqobj->display_id(); # the human readable id of the sequence

Some of these values map to fields in common formats. For example, the display_id() method returns the LOCUS name of a Genbank entry, the (\S+) following the > character in a Fasta file, the ID from a SwissProt file, and so on. The desc() method will return the DEFINITION line of a Genbank file, the description following the display_id in a Fasta file, and the DE field in a SwissProt file.


The following methods return new Seq objects, but do not transfer features across to the new object:

$seqobj->trunc(5,10) # truncation from 5 to 10 as new object $seqobj->revcom # reverse complements sequence $seqobj->translate # translation of the sequence

It is also important to be able to describe defined portions of a sequence. The combination of some description and the corresponding sub-sequence is called a feature - an exon and its coordinates within a gene is an example of a feature, or a domain within a protein. The following methods return an array of SeqFeatureI objects:

$seqobj->get_SeqFeatures # The 'top level' sequence features $seqobj->get_all_SeqFeatures # All sequence features, including sub-seq # features, such as features in an exon # to find out the number of features use: $seqobj->feature_count

Here are just some of the methods available to SeqFeatureI objects:

# these methods return numbers: $feat->start # start position (1 is the first base) $feat->end # end position (2 is the second base) $feat->strand # 1 means forward, -1 reverse, 0 not relevant # these methods return or accept strings: $feat->primary_tag # the name of the sequence feature, eg # 'exon', 'glycoslyation site', 'TM domain' $feat->source_tag # where the feature comes from, eg, 'EMBL_GenBank', # or 'BLAST' # this method returns the more austere PrimarySeq object, not a # Seq object - the main difference is that PrimarySeq objects do not # themselves contain sequence features $feat->seq # the sequence between start,end on the # correct strand of the sequence

Details of all of these methods are given towards the end of this documentation page.


Some examples

This code obtains the whole sequence from a sequence object, and then a slice of it between bases 10 and 50. Note that in Bioperl, sequences start with base 1, not base 0 (as they do in normal Perl strings)!

my $seqstr = $seqobj->seq(); # actual sequence as a string my $seqstr2 = $seqobj->subseq(10,50); # slice in biological coordinates

This code prints out information about the sequence features

my @features = $seqobj->get_SeqFeatures(); # just top level foreach my $feat ( @features ) { print "Feature ", $feat->primary_tag, " starts ", $feat->start, " ends ", $feat->end," strand ",$feat->strand,"\n"; # features retain link to underlying sequence object print "Feature sequence is " ,$feat->seq->seq(), "\n" }

This code checks whether a sequence has a species name associated with it

if( defined $seq->species ) { print "Sequence is from ",$species->binomial_name," [",$species->common_name,"]\n"; }

And some quick examples of sequence translation etc

# you can get truncations, translations and reverse complements. These # all give back Bio::Seq objects themselves, though currently with no # features transfered my $trunc = $seqobj->trunc(100,200); my $rev = $seqobj->revcom(); # there are many options to translate - check out the docs my $trans = $seqobj->translate(); # these functions can be chained together my $trans_trunc_rev = $seqobj->trunc(100,200)->revcom->translate();

What is the difference between

my $trunc = $seqobj->trunc(10, 50)
and
my $subseq = $seqobj->subseq(10, 50)
The difference is in what they return: subseq() returns a string, trunc() returns a new sequence object. It's very very important to always know what exactly it is that a method returns!
Exercises

Write a script that reads in this multi-fasta file, 63.1.data.txt, and prints out a list of the sequence names and lengths.

Read in the same file as above, but now print out the names and the first 20 bases of each sequence.

Write a script that reads in this list of genbank accession numbers, 63.3.list, downloads the sequences, and prints them out to a fasta file.

Modify the script above so that it only prints out sequences if they are from Drosophila melanogaster. If they are from any other species, or have unknown species, print a message.

You have this fasta file, 63.5.fasta. You want to get more info on the features of these genes (fasta files do not contain information on features). Write a script to read in the fasta file, get the sequence IDs, use those to download the full data from Genbank, and print out the start and stop coords of each feature for each gene.

Write a script to read in the same fasta file as above, and print out the reverse-complemented sequences to another fasta file.




Leave a Comment

(0 Comments)

Your email address will not be published. Required fields are marked *