1. The FASTA File Format
The FASTA format is a text-based format for representing either nucleotides sequences or amino acid sequences. Files in FASTA format usually end up with .fasta or .fa extensions. The simplicity of FASTA format make it the most basic bioinformatic file format, and it can be easily manipulated by all programing languages, such as Python, C/C++, Perl, and R.
1.1 The convention of FASTA format
Nowadays, modern bioinformatic programs that rely on the FASTA format expect the sequence headers to be preceded by ">", and the actual sequence, while generally represented as "interleaved" between the headers. For example,
>Seq1
ACACGTGGGTACATGGAGGGGAACAACACACACCAGGGC
>Seq2
ACACGTGGGTACATGGAGGGGAACAACACACACCAGGGCCT
>Seq3
GAGAACACCTGGACACAGGAAGGGGAACAGCACACACTGGGRepresents three sequences, Seq1 and Seq2. The fasta format is easy to manipulate, as we can check whether the starting character of a line is '>' to determine whether the line is a header.
1.2 Reading FASTA Files
FASTA format is easy to read. We can only check whether a line from fasta files starts with '>' to know whether it is a sequence name or it is a sequence. We can write a simple script to read fasta files into memory in Python:
from dataclasses import dataclass
import sys
@dataclass
class FastaRecord:
name: str
sequence: str
def read_fasta(file_path:str):
seqname = ""
sequence = ""
line = ""
fasta_records = []
with open(file_path, "r") as f:
while 1:
line = f.readline().strip()
if not line:
break
if line.startswith('>'):
if seqname:
fasta_records.append(FastaRecord(name = seqname, sequence = sequence))
sequence = ""
seqname = line[1:]
else:
sequence = sequence + line
return fasta_records
if __name__ == "__main__":
fasta_records = read_fasta(sys.argv[1])
print("Number of sequence {}".format(len(fasta_records)))We can run the script by python3 read_fasta.py example.fa.
There are also packages for parsing FASTA format files in Python like the biopython project. You may read this tutorial of biopython to have an idea how to use biopython to parse FASTA files.
In C/C++, you may also write a program like above to parse the FASTA files. However, a single header file kseq.h written by Heng Li provide easy-to-use API to parse FASTA format in C/C++.
You can download the header file from github and run
1.3 Indexing FASTA files
FASTA files can be large. The Ensembl human genome assembly hg38.fa (version 38.97) takes ~3Gb and mm10 (version 38.97) takes ~2.6Gb. Reading these FASTA entirely into memory could be not affordable. In my machine, reading the human genome assembly hg38.fa approximately takes 1.8Gb of memory. However, sometimes we may not want to load the entire FASTA file into memory, as we will study only one chromosome of the genome. For example, we are now studying only chomosome 19, which only takes 1% of the whole size of the genome. A simple way to do this is by skipping all other chomosomes when reading the FASTA file before reaching chomosome 19, but reading such as large files takes time. To perform efficient reading, we can use indexing to store the location of the chromosomes and sequences using an offset value.
A standard implementation of FASTA index can be found in BEDtools. The source code of the implementation can be found here in C++. The FASTA index format is a tab-separated file with 5 columns: chomosome name, chromosome_length, offset, byte length, and character length, which can be represented by a C struct called FastaIndexEntry.
We will demonstrate the indexed reading using the following C code. We define FastaIndex holding a HashMap (from khash.h) of sequence_name → FastaIndexEntry . Note that kvec.h are used to provide a dynamic array data structure.
And reading the fasta index file will be easy to implement:
With the offset information, we can use the mmap library to perform indexed reading.
In the main function, simply call the above functions do get the sequence of the specific chromosome
Here I will also show how to create a Fasta Index file. The code is a C version of the bedtools' implementation of creating Fasta Index file, which is equivalent, simplier, and faster than hstlib's faidx function.
We can simply output the entries to file by
The method behind the codes is simple. It records the length of characters and bytes of each chromosomes, and generates a recorded offsets value in the FastaIndexEntry for Mmap to locate the position of the chromosome. We can use this function in the main function
As a result, the index file for FASTA file looks like this (tab-separated):
The first column is the name of the chromosome. The second column is the total length of nucleotides of the chromosome. The third columns is the offset of the chromosome in the FASTA file. The fourth and fifth column is the length of each line in the chromosome, in actual character length of byte length (including '\n').
You may implement a Python implementation of the above code using the built-in mmap module.
1.4 The chromosome size file
It is convenient to use a file to record the second column of the index file, which is useful when we are manipulating intervals (e.g. BED files).
Last updated
Was this helpful?
