1. The FASTA File Format

The FASTA format is a text-based format for representing either nucleotides sequences or amino acid sequences. Files in FASTA format usually end up with .fasta or .fa extensions. The simplicity of FASTA format make it the most basic bioinformatic file format, and it can be easily manipulated by all programing languages, such as Python, C/C++, Perl, and R.

1.1 The convention of FASTA format

Nowadays, modern bioinformatic programs that rely on the FASTA format expect the sequence headers to be preceded by ">", and the actual sequence, while generally represented as "interleaved" between the headers. For example,

>Seq1
ACACGTGGGTACATGGAGGGGAACAACACACACCAGGGC
>Seq2
ACACGTGGGTACATGGAGGGGAACAACACACACCAGGGCCT
>Seq3
GAGAACACCTGGACACAGGAAGGGGAACAGCACACACTGGG

Represents three sequences, Seq1 and Seq2. The fasta format is easy to manipulate, as we can check whether the starting character of a line is '>' to determine whether the line is a header.

1.2 Reading FASTA Files

FASTA format is easy to read. We can only check whether a line from fasta files starts with '>' to know whether it is a sequence name or it is a sequence. We can write a simple script to read fasta files into memory in Python:

from dataclasses import dataclass
import sys
@dataclass 
class FastaRecord:
    name: str
    sequence: str

def read_fasta(file_path:str):
    seqname = ""
    sequence = ""
    line = ""
    fasta_records = []
    with open(file_path, "r") as f:
        while 1:
            line = f.readline().strip()
            if not line:
                break
            if line.startswith('>'):
                if seqname:
                    fasta_records.append(FastaRecord(name = seqname, sequence = sequence))
                    sequence = ""
                seqname = line[1:]
            else:
                sequence = sequence + line
    return fasta_records


if __name__ == "__main__":
    fasta_records = read_fasta(sys.argv[1])
    print("Number of sequence {}".format(len(fasta_records)))

We can run the script by python3 read_fasta.py example.fa.

There are also packages for parsing FASTA format files in Python like the biopython project. You may read this tutorial of biopython to have an idea how to use biopython to parse FASTA files.

In C/C++, you may also write a program like above to parse the FASTA files. However, a single header file kseq.h written by Heng Li provide easy-to-use API to parse FASTA format in C/C++.

KSEQ_INIT(FILE*, read)

int main(int argv, char** argv)
{
    FILE* fp;
    kseq_t *seq;
    if (!(fp = fopen(fasta_file_path, "r"))) {
        perror("File open failed\n");
        exit(1);
    }
    while (kseq_read(seq) >= 0) {
        /* do something to seq, for example */
        printf("sequence %s has length %d\n", seq->name.s, seq->seq.l);
    }
    fclose(fp);
    exit(0);
}

You can download the header file from github and run

gcc main.c -o main; ./main example.fa

1.3 Indexing FASTA files

FASTA files can be large. The Ensembl human genome assembly hg38.fa (version 38.97) takes ~3Gb and mm10 (version 38.97) takes ~2.6Gb. Reading these FASTA entirely into memory could be not affordable. In my machine, reading the human genome assembly hg38.fa approximately takes 1.8Gb of memory. However, sometimes we may not want to load the entire FASTA file into memory, as we will study only one chromosome of the genome. For example, we are now studying only chomosome 19, which only takes 1% of the whole size of the genome. A simple way to do this is by skipping all other chomosomes when reading the FASTA file before reaching chomosome 19, but reading such as large files takes time. To perform efficient reading, we can use indexing to store the location of the chromosomes and sequences using an offset value.

A standard implementation of FASTA index can be found in BEDtools. The source code of the implementation can be found here in C++. The FASTA index format is a tab-separated file with 5 columns: chomosome name, chromosome_length, offset, byte length, and character length, which can be represented by a C struct called FastaIndexEntry.

typedef struct FastaIndexEntry {
    char* name;         /* Name of the fasta sequence */
    int64_t length;     /* length of the sequences bytes stored */
    int64_t offset;     /* Offset to the begining of the file */
    int64_t line_blen;  /* line length in bytes, sequence characters */
    int64_t line_len;   /* line length including newline */
    bool full_header;
} FastaIndexEntry;

We will demonstrate the indexed reading using the following C code. We define FastaIndex holding a HashMap (from khash.h) of sequence_nameFastaIndexEntry . Note that kvec.h are used to provide a dynamic array data structure.

KHASH_MAP_INIT_STR(str_hash_t, const void *);
typedef kvec_t(char *) stringVec;
typedef khash_t(str_hash_t) stringMap;
typedef struct FastaIndex {
    stringMap *name_field;    /* HashMap of <name, FastaIndexEntry> */
    bool full_header;         /* Whether to remove the '>' of the name */
} FastaIndex;

And reading the fasta index file will be easy to implement:

FastaIndex *readFastaIndex(char* index_file_path, bool full_header) 
{
    FastaIndex *fi = FastaIndexInit();
    char* line = NULL;
    uint64_t line_num;
    uint64_t line_len;
    FILE* fp;
    int ret;
    if (!(fp = fopen(index_file_path, "rb"))) fatalf("Error: could not open fasta index file %s\n", index_file_path);
    while (ret = getline(&line, &line_len, fp) != -1) {
        ++line_num;
        stringVec fields = split(line, "\t");
        if (kv_size(fields) == 5) {
            char* name = kv_A(fields, 0);
            uint64_t end;
            uint64_t length;
            uint64_t line_blen;
            uint64_t line_len;
            FastaIndexEntry *entry = FastaIndexEntryInitData(
                name, 
                strtoll(kv_A(fields, 1), &length, 10), 
                strtoll(kv_A(fields, 2), &end, 10), 
                strtoll(kv_A(fields, 3), &line_blen, 10), 
                strtoll(kv_A(fields, 4), &line_len, 10), full_header);
            kv_push(char*, fi->sequence_names, name);
            EntryToIndex(entry, fi);
        } else {
            fatalf("Warning: malformed fasta index file %s\n", index_file_path);
        }
    }
    return fi;
}
stringVec split(const char *_s, char *delim)
{
    uint8_t i;
    stringVec fields;
    kv_init(fields);
    char* tok;
    char* s = malloc(strlen(_s));
    memcpy(s, _s, strlen(_s));
    tok = strtok(s, delim);
    while (tok != NULL) {
        if (tok[strlen(tok)-1] == '\n') tok[strlen(tok)-1] = '\0';
        kv_push(char *, fields, tok);
        tok = strtok(NULL, delim);
    }
    return fields;
}

With the offset information, we can use the mmap library to perform indexed reading.

struct fmm {
    void *mm;
    size_t fs;
};
void *readFastaByMmap(char* fasta_file_path)
{
    FILE* fp;
    if (!(fp = fopen(fasta_file_path, "r"))) fatalf("Error: could not open fasta file %s\n", fasta_file_path);
    int fd = fileno(fp); /* Get file discriptor for mmap */
    struct stat sb;
    if (fstat(fd, &sb) == -1) {
        fatal("Failed to stat the file\n");
    }
    size_t filesize= sb.st_size;
    void *filemm = mmap(NULL, filesize, PROT_READ, MAP_SHARED, fd, 0);
    struct fmm *ret = malloc(sizeof(struct fmm));
    ret->mm = filemm;
    ret->fs = filesize;
    return ret;
}

char *getFastaSequenceMmap(void *filemm, FastaIndex *fi, char *seq_name)
{
    khiter_t k = kh_get(str_hash_t, fi->name_field, seq_name);
    if (k == kh_end(fi->name_field)) fatalf("%s not found in index", seq_name);
    FastaIndexEntry *entry = kh_val(fi->name_field, k);
    int64_t newlines_in_sequence = entry->length / entry->line_blen;
    int64_t seqlen = newlines_in_sequence + entry->length;
    char* seq = (char *)calloc(seqlen + 1, 1);
    memcpy(seq, (char *) filemm + entry->offset, seqlen);
    seq[seqlen] = '\0'; /* end of string char */
    char* pbegin = seq;
    char* pend = seq + (seqlen/sizeof(char));
    if (*pbegin == '\n' || *pbegin == '\0') {
        seq = seq + 1;
    }
    if (*pend == '\n') {
        *pend = '\0';
    }
    return seq;
}

In the main function, simply call the above functions do get the sequence of the specific chromosome

int main(int argc, char const *argv[])
{
    FastaIndex* fi = readFastaIndex("hg38.fai", 1);
    struct fmm *m = readFastaByMmap("hg38.fa");
    printf("%s\n", getFastaSequenceMmap(m->mm, fi, "9"));
    munmap(m->mm, m->fs);
    free(m);
    FastaIndexDestory(fi); // the destroy function impl. not shown here
}

Here I will also show how to create a Fasta Index file. The code is a C version of the bedtools' implementation of creating Fasta Index file, which is equivalent, simplier, and faster than hstlib's faidx function.

void *writeFastaIndex(char* fasta_file_path, bool full_header)
{
    int fasta_file_path_len = strlen(fasta_file_path);
    char *index_file_path = malloc(strlen(fasta_file_path)+4);
    strcpy(index_file_path, fasta_file_path);
    strcpy(index_file_path + fasta_file_path_len, ".fai");
    FastaIndexEntry *entry = fastaIndexEntryInit();
    entry->full_header = full_header;
    FastaIndex *fi = fastaIndexInit();
    char *line = NULL;
    int64_t line_length;
    int64_t offset;
    int64_t line_number;
    bool mismatched_line_len = false;
    bool empty_line = false;
    FILE* fp;
    FILE* op;
    size_t bufsize = 0;
    if (!(fp = fopen(fasta_file_path, "r"))) fatalf("Error: could not open fasta file %s\n", fasta_file_path);
    if (!(op = fopen(index_file_path, "w+"))) fatalf("Error: could not open fasta index file for writing %zu\n", index_file_path);
    while ((line_length = getline(&line, &bufsize, fp)) != -1) {
        line_length -= 1;
        ++line_number;
        if (line[0] == ';') {
            // fasta comment and skip
        } else if (line[0] == '+') {
            // fasta quality header
            line_length = getline(&line, &bufsize, fp);
            line_length -= 1;
            offset += line_length + 1;
            line_length = getline(&line, &bufsize, fp);
            line_length -= 1;
        } else if (line[0] == '>' || line[0] == '@') {
            // if we aren't on the first entry, push the last sequence into the index
            if (entry->name != "" && entry->name != NULL) {
                mismatched_line_len = false;
                empty_line = false;
                entryToIndex(entry, fi);
                fastaIndexEntryEmpty(entry);
            }

            char *new_name = malloc(line_length+1);
            strcpy(new_name, line + 1);
            entry->name = new_name;

        } else {
            // assume we have a sequence file
            if (entry->offset == -1) {
                entry->offset = offset;
            }
            entry->length += line_length;
            if (entry->line_len) {
                if (mismatched_line_len || empty_line) {
                    if (line_length == 0) {
                        empty_line = true;
                    }
                    else {
                        if (empty_line) {
                            fatal("Error: found an empty line\n");
                        } else {
                            fatal("Error: mismatched line length\n");
                        }
                        fatal("Error: in generating index file\n");
                    }
                }
                if (entry->line_len != line_length + 1) {
                    mismatched_line_len = true;
                    if (line_length == 0) {
                        empty_line = true;
                    }
                } 
            } else {
                entry->line_len = line_length + 1;
            }
            entry->line_blen = entry->line_len - 1;
        }
        offset += line_length + 1;
    }
    entryToIndex(entry, fi);
    fastaIndexEntryEmpty(entry);
    indexToFile(fi, op);
    fastaIndexDestory(fi);
}

We can simply output the entries to file by

void entryToFile(FastaIndexEntry *entry, FILE* op)
{
    if (entry->full_header) {
        fprintf(op, "%s\t%lld\t%lld\t%lld\t%lld\n",entry->name, entry->length, entry->offset, entry->line_blen, entry->line_len);
    } else {
        char buf[100];
        memset(buf, 0, 100);
        for (int i = 0; i < strlen(entry->name); ++i) {
            if (entry->name[i] != ' ') {
                buf[i] = entry->name[i];
            }
        }
        fprintf(op, "%s\t%lld\t%lld\t%lld\t%lld\n", buf, entry->length, entry->offset, entry->line_blen, entry->line_len);
    }
}

The method behind the codes is simple. It records the length of characters and bytes of each chromosomes, and generates a recorded offsets value in the FastaIndexEntry for Mmap to locate the position of the chromosome. We can use this function in the main function

writeFastaIndex("/Users/snowxue/Desktop/Homo_sapiens.GRCh38.dna.primary_assembly.fa", 0);

As a result, the index file for FASTA file looks like this (tab-separated):

1	248956422	56	60	61
10	133797422	253105810	60	61
11	135086622	389133248	60	61
12	133275309	526471372	60	61
13	114364328	661967995	60	61
14	107043718	778238454	60	61
15	101991189	887066292	60	61
16	90338345	990757392	60	61
17	83257441	1082601434	60	61
18	80373285	1167246557	60	61
19	58617616	1248959454	60	61
2	242193529	1308554087	60	61
20	64444167	1554784232	60	61
21	46709983	1620302526	60	61
22	50818468	1667791066	60	61
3	198295559	1719456565	60	61
4	190214555	1921057106	60	61
5	181538259	2114441960	60	61
6	170805979	2299005913	60	61
7	159345973	2472658715	60	61
8	145138636	2634660511	60	61
9	138394717	2782218181	60	61
MT	16569	2922919531	60	61
X	156040895	2922936433	60	61
Y	57227415	3081578071	60	61

The first column is the name of the chromosome. The second column is the total length of nucleotides of the chromosome. The third columns is the offset of the chromosome in the FASTA file. The fourth and fifth column is the length of each line in the chromosome, in actual character length of byte length (including '\n').

You may implement a Python implementation of the above code using the built-in mmap module.

1.4 The chromosome size file

It is convenient to use a file to record the second column of the index file, which is useful when we are manipulating intervals (e.g. BED files).

1	248956422
2	242193529
3	198295559
4	190214555
5	181538259
6	170805979
7	159345973
X	156040895
8	145138636
9	138394717
11	135086622
10	133797422
12	133275309
13	114364328
14	107043718
15	101991189
16	90338345
17	83257441
18	80373285
20	64444167
19	58617616
Y	57227415
22	50818468
21	46709983
MT	16569

Last updated