The FASTA format is a text-based format for representing either nucleotides sequences or amino acid sequences. Files in FASTA format usually end up with .fasta or .fa extensions. The simplicity of FASTA format make it the most basic bioinformatic file format, and it can be easily manipulated by all programing languages, such as Python, C/C++, Perl, and R.
1.1 The convention of FASTA format
Nowadays, modern bioinformatic programs that rely on the FASTA format expect the sequence headers to be preceded by ">", and the actual sequence, while generally represented as "interleaved" between the headers. For example,
Represents three sequences, Seq1 and Seq2. The fasta format is easy to manipulate, as we can check whether the starting character of a line is '>' to determine whether the line is a header.
1.2 Reading FASTA Files
FASTA format is easy to read. We can only check whether a line from fasta files starts with '>' to know whether it is a sequence name or it is a sequence. We can write a simple script to read fasta files into memory in Python:
from dataclasses import dataclass
import sys
@dataclass
class FastaRecord:
name: str
sequence: str
def read_fasta(file_path:str):
seqname = ""
sequence = ""
line = ""
fasta_records = []
with open(file_path, "r") as f:
while 1:
line = f.readline().strip()
if not line:
break
if line.startswith('>'):
if seqname:
fasta_records.append(FastaRecord(name = seqname, sequence = sequence))
sequence = ""
seqname = line[1:]
else:
sequence = sequence + line
return fasta_records
if __name__ == "__main__":
fasta_records = read_fasta(sys.argv[1])
print("Number of sequence {}".format(len(fasta_records)))
We can run the script by python3 read_fasta.py example.fa.
There are also packages for parsing FASTA format files in Python like the biopython project. You may read this tutorial of biopython to have an idea how to use biopython to parse FASTA files.
In C/C++, you may also write a program like above to parse the FASTA files. However, a single header file kseq.h written by Heng Li provide easy-to-use API to parse FASTA format in C/C++.
KSEQ_INIT(FILE*, read)
int main(int argv, char** argv)
{
FILE* fp;
kseq_t *seq;
if (!(fp = fopen(fasta_file_path, "r"))) {
perror("File open failed\n");
exit(1);
}
while (kseq_read(seq) >= 0) {
/* do something to seq, for example */
printf("sequence %s has length %d\n", seq->name.s, seq->seq.l);
}
fclose(fp);
exit(0);
}
You can download the header file from github and run
gcc main.c -o main; ./main example.fa
1.3 Indexing FASTA files
FASTA files can be large. The Ensembl human genome assembly hg38.fa (version 38.97) takes ~3Gb and mm10 (version 38.97) takes ~2.6Gb. Reading these FASTA entirely into memory could be not affordable. In my machine, reading the human genome assembly hg38.fa approximately takes 1.8Gb of memory. However, sometimes we may not want to load the entire FASTA file into memory, as we will study only one chromosome of the genome. For example, we are now studying only chomosome 19, which only takes 1% of the whole size of the genome. A simple way to do this is by skipping all other chomosomes when reading the FASTA file before reaching chomosome 19, but reading such as large files takes time. To perform efficient reading, we can use indexing to store the location of the chromosomes and sequences using an offset value.
A standard implementation of FASTA index can be found in BEDtools. The source code of the implementation can be found here in C++. The FASTA index format is a tab-separated file with 5 columns: chomosome name, chromosome_length, offset, byte length, and character length, which can be represented by a C struct called FastaIndexEntry.
typedef struct FastaIndexEntry {
char* name; /* Name of the fasta sequence */
int64_t length; /* length of the sequences bytes stored */
int64_t offset; /* Offset to the begining of the file */
int64_t line_blen; /* line length in bytes, sequence characters */
int64_t line_len; /* line length including newline */
bool full_header;
} FastaIndexEntry;
We will demonstrate the indexed reading using the following C code. We define FastaIndex holding a HashMap (from khash.h) of sequence_name → FastaIndexEntry . Note that kvec.h are used to provide a dynamic array data structure.
KHASH_MAP_INIT_STR(str_hash_t, const void *);
typedef kvec_t(char *) stringVec;
typedef khash_t(str_hash_t) stringMap;
typedef struct FastaIndex {
stringMap *name_field; /* HashMap of <name, FastaIndexEntry> */
bool full_header; /* Whether to remove the '>' of the name */
} FastaIndex;
And reading the fasta index file will be easy to implement:
FastaIndex *readFastaIndex(char* index_file_path, bool full_header)
{
FastaIndex *fi = FastaIndexInit();
char* line = NULL;
uint64_t line_num;
uint64_t line_len;
FILE* fp;
int ret;
if (!(fp = fopen(index_file_path, "rb"))) fatalf("Error: could not open fasta index file %s\n", index_file_path);
while (ret = getline(&line, &line_len, fp) != -1) {
++line_num;
stringVec fields = split(line, "\t");
if (kv_size(fields) == 5) {
char* name = kv_A(fields, 0);
uint64_t end;
uint64_t length;
uint64_t line_blen;
uint64_t line_len;
FastaIndexEntry *entry = FastaIndexEntryInitData(
name,
strtoll(kv_A(fields, 1), &length, 10),
strtoll(kv_A(fields, 2), &end, 10),
strtoll(kv_A(fields, 3), &line_blen, 10),
strtoll(kv_A(fields, 4), &line_len, 10), full_header);
kv_push(char*, fi->sequence_names, name);
EntryToIndex(entry, fi);
} else {
fatalf("Warning: malformed fasta index file %s\n", index_file_path);
}
}
return fi;
}
stringVec split(const char *_s, char *delim)
{
uint8_t i;
stringVec fields;
kv_init(fields);
char* tok;
char* s = malloc(strlen(_s));
memcpy(s, _s, strlen(_s));
tok = strtok(s, delim);
while (tok != NULL) {
if (tok[strlen(tok)-1] == '\n') tok[strlen(tok)-1] = '\0';
kv_push(char *, fields, tok);
tok = strtok(NULL, delim);
}
return fields;
}
With the offset information, we can use the mmap library to perform indexed reading.
struct fmm {
void *mm;
size_t fs;
};
void *readFastaByMmap(char* fasta_file_path)
{
FILE* fp;
if (!(fp = fopen(fasta_file_path, "r"))) fatalf("Error: could not open fasta file %s\n", fasta_file_path);
int fd = fileno(fp); /* Get file discriptor for mmap */
struct stat sb;
if (fstat(fd, &sb) == -1) {
fatal("Failed to stat the file\n");
}
size_t filesize= sb.st_size;
void *filemm = mmap(NULL, filesize, PROT_READ, MAP_SHARED, fd, 0);
struct fmm *ret = malloc(sizeof(struct fmm));
ret->mm = filemm;
ret->fs = filesize;
return ret;
}
char *getFastaSequenceMmap(void *filemm, FastaIndex *fi, char *seq_name)
{
khiter_t k = kh_get(str_hash_t, fi->name_field, seq_name);
if (k == kh_end(fi->name_field)) fatalf("%s not found in index", seq_name);
FastaIndexEntry *entry = kh_val(fi->name_field, k);
int64_t newlines_in_sequence = entry->length / entry->line_blen;
int64_t seqlen = newlines_in_sequence + entry->length;
char* seq = (char *)calloc(seqlen + 1, 1);
memcpy(seq, (char *) filemm + entry->offset, seqlen);
seq[seqlen] = '\0'; /* end of string char */
char* pbegin = seq;
char* pend = seq + (seqlen/sizeof(char));
if (*pbegin == '\n' || *pbegin == '\0') {
seq = seq + 1;
}
if (*pend == '\n') {
*pend = '\0';
}
return seq;
}
In the main function, simply call the above functions do get the sequence of the specific chromosome
int main(int argc, char const *argv[])
{
FastaIndex* fi = readFastaIndex("hg38.fai", 1);
struct fmm *m = readFastaByMmap("hg38.fa");
printf("%s\n", getFastaSequenceMmap(m->mm, fi, "9"));
munmap(m->mm, m->fs);
free(m);
FastaIndexDestory(fi); // the destroy function impl. not shown here
}
Here I will also show how to create a Fasta Index file. The code is a C version of the bedtools' implementation of creating Fasta Index file, which is equivalent, simplier, and faster than hstlib's faidx function.
void *writeFastaIndex(char* fasta_file_path, bool full_header)
{
int fasta_file_path_len = strlen(fasta_file_path);
char *index_file_path = malloc(strlen(fasta_file_path)+4);
strcpy(index_file_path, fasta_file_path);
strcpy(index_file_path + fasta_file_path_len, ".fai");
FastaIndexEntry *entry = fastaIndexEntryInit();
entry->full_header = full_header;
FastaIndex *fi = fastaIndexInit();
char *line = NULL;
int64_t line_length;
int64_t offset;
int64_t line_number;
bool mismatched_line_len = false;
bool empty_line = false;
FILE* fp;
FILE* op;
size_t bufsize = 0;
if (!(fp = fopen(fasta_file_path, "r"))) fatalf("Error: could not open fasta file %s\n", fasta_file_path);
if (!(op = fopen(index_file_path, "w+"))) fatalf("Error: could not open fasta index file for writing %zu\n", index_file_path);
while ((line_length = getline(&line, &bufsize, fp)) != -1) {
line_length -= 1;
++line_number;
if (line[0] == ';') {
// fasta comment and skip
} else if (line[0] == '+') {
// fasta quality header
line_length = getline(&line, &bufsize, fp);
line_length -= 1;
offset += line_length + 1;
line_length = getline(&line, &bufsize, fp);
line_length -= 1;
} else if (line[0] == '>' || line[0] == '@') {
// if we aren't on the first entry, push the last sequence into the index
if (entry->name != "" && entry->name != NULL) {
mismatched_line_len = false;
empty_line = false;
entryToIndex(entry, fi);
fastaIndexEntryEmpty(entry);
}
char *new_name = malloc(line_length+1);
strcpy(new_name, line + 1);
entry->name = new_name;
} else {
// assume we have a sequence file
if (entry->offset == -1) {
entry->offset = offset;
}
entry->length += line_length;
if (entry->line_len) {
if (mismatched_line_len || empty_line) {
if (line_length == 0) {
empty_line = true;
}
else {
if (empty_line) {
fatal("Error: found an empty line\n");
} else {
fatal("Error: mismatched line length\n");
}
fatal("Error: in generating index file\n");
}
}
if (entry->line_len != line_length + 1) {
mismatched_line_len = true;
if (line_length == 0) {
empty_line = true;
}
}
} else {
entry->line_len = line_length + 1;
}
entry->line_blen = entry->line_len - 1;
}
offset += line_length + 1;
}
entryToIndex(entry, fi);
fastaIndexEntryEmpty(entry);
indexToFile(fi, op);
fastaIndexDestory(fi);
}
The method behind the codes is simple. It records the length of characters and bytes of each chromosomes, and generates a recorded offsets value in the FastaIndexEntry for Mmap to locate the position of the chromosome. We can use this function in the main function
The first column is the name of the chromosome. The second column is the total length of nucleotides of the chromosome. The third columns is the offset of the chromosome in the FASTA file. The fourth and fifth column is the length of each line in the chromosome, in actual character length of byte length (including '\n').
You may implement a Python implementation of the above code using the built-in mmap module.
1.4 The chromosome size file
It is convenient to use a file to record the second column of the index file, which is useful when we are manipulating intervals (e.g. BED files).