4. The Tracks File Format

This section covers various track formats, such as WIG/bigWig and bedGraph.

The bigWig file format

The bigWig format is for display of dense, continuous data that will be displayed as a graph. BigWig files are created initially from WIG type files, using the UCSC program wigToBigWig. Alternatively, bigWig files can be created from bedGraph files, using the UCSC program bedGraphToBigWig. In either case, the resulting bigWig files are in an indexed binary format. The main advantage of the bigWig files is that only the portions of the files needed to display a particular region are transferred, so for large data sets bigWig is considerably faster than regular WIG files.

-- Broad Institute

The bigWig track format can be visualized by genome browsers such as the Integrative Genomics Viewer (IGV) from the Broad Institute, and by web-based genome browsers such as VALIS.

A description of the bigWig file format can be found in this header file: https://raw.githubusercontent.com/dpryan79/libBigWig/master/bigWig.h. A bigWig file is self-indexed, which means users can randomly access the data blocks that contain the track data by following the data block offsets stored in the index header blocks at the beginning of the file.

In brief, there are four parts in a bigWig file.

  1. Header section

  2. Chromosome List section

  3. The data sections (not read until the data are actually accessed)

  4. Index section

These sections hold everything needed to randomly access a bigWig file.

bigWig Header

As described in bigWig.h, the header of a bigWig file contains the following fields:

typedef struct {
    uint16_t version; /* The version information of the file.*/
    uint16_t nLevels; /* The number of "zoom" levels.*/
    uint64_t ctOffset; /* The offset to the on-disk chromosome tree list.*/
    uint64_t dataOffset; /* The on-disk offset to the first block of data.*/
    uint64_t indexOffset; /* The on-disk offset to the data index.*/
    uint16_t fieldCount; /* Total number of fields.*/
    uint16_t definedFieldCount; /* Number of fixed-format BED fields.*/
    uint64_t sqlOffset; /* The on-disk offset to an SQL string. This is unused.*/
    uint64_t summaryOffset; /* If there's a summary, this is the offset to it on the disk.*/
    uint32_t bufSize; /* The compression buffer size (if the data is compressed).*/
    uint64_t extensionOffset; /* Unused*/
    bwZoomHdr_t *zoomHdrs; /* Pointers to the header for each zoom level.*/
    //total Summary
    uint64_t nBasesCovered; /* The total bases covered in the file.*/
    double minVal; /* The minimum value in the file.*/
    double maxVal; /* The maximum value in the file.*/
    double sumData; /* The sum of all values in the file.*/
    double sumSquared; /* The sum of the squared values in the file.*/
} bigWigHdr_t;

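The first 64 bytes of the file (offsets 0x00 to 0x3f) are reserved for this fixed-size header. The first 4 bytes hold the file's magic number, so the fields listed below start at offset 0x4.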

Field                Offset
version              0x4
nLevels              0x6
ctOffset             0x8
dataOffset           0x10
indexOffset          0x18
fieldCount           0x20
definedFieldCount    0x22
sqlOffset            0x24
summaryOffset        0x2c
bufSize              0x34
extensionOffset      0x38
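
The offsets in the table can be turned into a small decoder. The following is only a sketch: fixedHdr_t, read_le and parse_fixed_header are hypothetical names rather than part of libBigWig, the file is assumed to be little-endian, and the magic number 0x888FFC26 is the commonly documented bigWig value rather than something stated above.

#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed bigWig magic number (not given in the text above). */
#define BIGWIG_MAGIC 0x888FFC26u

/* Hypothetical container for the fields of the fixed 64-byte header. */
typedef struct {
    uint16_t version;
    uint16_t nLevels;
    uint64_t ctOffset;
    uint64_t dataOffset;
    uint64_t indexOffset;
    uint16_t fieldCount;
    uint16_t definedFieldCount;
    uint64_t sqlOffset;
    uint64_t summaryOffset;
    uint32_t bufSize;
    uint64_t extensionOffset;
} fixedHdr_t;

/* Read an n-byte (n <= 8) little-endian unsigned integer starting at p. */
static uint64_t read_le(const uint8_t *p, size_t n) {
    uint64_t v = 0;
    for (size_t i = 0; i < n; i++) v |= (uint64_t)p[i] << (8 * i);
    return v;
}

/* Decode the first 64 bytes of a file into *h.
 * Returns 0 on success, -1 if the magic number does not match. */
static int parse_fixed_header(const uint8_t buf[64], fixedHdr_t *h) {
    assert(buf != NULL && h != NULL);
    if (read_le(buf + 0x00, 4) != BIGWIG_MAGIC) return -1;
    h->version           = (uint16_t)read_le(buf + 0x04, 2);
    h->nLevels           = (uint16_t)read_le(buf + 0x06, 2);
    h->ctOffset          = read_le(buf + 0x08, 8);
    h->dataOffset        = read_le(buf + 0x10, 8);
    h->indexOffset       = read_le(buf + 0x18, 8);
    h->fieldCount        = (uint16_t)read_le(buf + 0x20, 2);
    h->definedFieldCount = (uint16_t)read_le(buf + 0x22, 2);
    h->sqlOffset         = read_le(buf + 0x24, 8);
    h->summaryOffset     = read_le(buf + 0x2c, 8);
    h->bufSize           = (uint32_t)read_le(buf + 0x34, 4);
    h->extensionOffset   = read_le(buf + 0x38, 8);
    return 0;
}

A caller would read the first 64 bytes of the file into a buffer, pass it to parse_fixed_header, and then seek to ctOffset, indexOffset or summaryOffset as needed. Reading byte by byte avoids any dependence on the host's endianness or struct padding.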

After these header fields come the zoom headers. bigWig files have multiple "zoom" levels, each of which has its own header.

typedef struct {
    uint32_t *level; /* The zoom level, which is an integer starting with 0.*/
    //There's 4 bytes of padding between these
    uint64_t *dataOffset; /* The offset to the on-disk start of the data. This isn't used currently.*/
    uint64_t *indexOffset; /* The offset to the on-disk start of the index. This *is* used.*/
    bwRTree_t **idx; /* Index for each zoom level. Represented as a tree.*/
} bwZoomHdr_t;

The zoom-level header contains arrays of dataOffset and indexOffset values, with one entry per zoom level.

Each zoom level has a header and an index that points to an R-tree, which in turn points to data blocks. A node within the R-tree holds the index for the data. For more information, see the bigWig index section.
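
To illustrate how the per-level dataOffset and indexOffset values might be read, here is a minimal sketch. It assumes that the zoom headers start immediately after the 64-byte fixed header, that each one occupies 24 bytes on disk (a 4-byte level, 4 bytes of padding, then two 8-byte offsets), and that the file is little-endian. zoomEntry_t and parse_zoom_entry are hypothetical names, not libBigWig functions.

#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FIXED_HDR_SIZE 64u /* assumed: zoom headers follow the fixed header */
#define ZOOM_HDR_SIZE  24u /* assumed: 4 + 4 (padding) + 8 + 8 bytes */

/* Hypothetical container for one decoded zoom header. */
typedef struct {
    uint32_t level;
    uint64_t dataOffset;
    uint64_t indexOffset;
} zoomEntry_t;

/* Read an n-byte (n <= 8) little-endian unsigned integer starting at p. */
static uint64_t read_le(const uint8_t *p, size_t n) {
    uint64_t v = 0;
    for (size_t i = 0; i < n; i++) v |= (uint64_t)p[i] << (8 * i);
    return v;
}

/* Decode zoom header number i from a buffer holding the first len bytes
 * of the file. Returns 0 on success, -1 if the buffer is too short. */
static int parse_zoom_entry(const uint8_t *buf, size_t len, uint16_t i,
                            zoomEntry_t *z) {
    assert(buf != NULL && z != NULL);
    size_t off = FIXED_HDR_SIZE + (size_t)i * ZOOM_HDR_SIZE;
    if (len < off + ZOOM_HDR_SIZE) return -1;
    z->level       = (uint32_t)read_le(buf + off, 4);
    /* bytes off+4 .. off+7 are padding and are skipped */
    z->dataOffset  = read_le(buf + off + 8, 8);
    z->indexOffset = read_le(buf + off + 16, 8);
    return 0;
}

Calling parse_zoom_entry for i = 0 up to nLevels - 1 would yield the indexOffset of the R-tree for every zoom level.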

After the zoom header section, the bigWig file continues with the file summary information: nBasesCovered, minVal, maxVal, sumData, and sumSquared.
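
These summary fields are enough to derive whole-file statistics without touching any data block: mean = sumData / nBasesCovered, and the population variance is sumSquared / nBasesCovered - mean^2. The sketch below shows the arithmetic; summary_t, summary_mean and summary_variance are hypothetical names used only for illustration.

#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container mirroring the summary fields of bigWigHdr_t. */
typedef struct {
    uint64_t nBasesCovered;
    double minVal;
    double maxVal;
    double sumData;
    double sumSquared;
} summary_t;

/* Mean value per covered base; 0.0 if no base is covered. */
static double summary_mean(const summary_t *s) {
    assert(s != NULL);
    if (s->nBasesCovered == 0) return 0.0;
    return s->sumData / (double)s->nBasesCovered;
}

/* Population variance per covered base; 0.0 if no base is covered. */
static double summary_variance(const summary_t *s) {
    assert(s != NULL);
    if (s->nBasesCovered == 0) return 0.0;
    double mean = summary_mean(s);
    return s->sumSquared / (double)s->nBasesCovered - mean * mean;
}

For example, a file covering four bases with values 1, 2, 3 and 4 has sumData = 10 and sumSquared = 30, giving a mean of 2.5 and a variance of 1.25.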

bigWig chromosome list

The offset of the chromosome list section from the start of the file is stored in the bigWig header (ctOffset). The chromosome list section begins with some basic summary fields: itemsPerBlock, keySize, valueSize, and itemCount.

The chromosome list structure is defined as

typedef struct {
    int64_t nKeys; /* The number of chromosomes */
    char **chrom; /* A list of null terminated chromosomes */
    uint32_t *len; /* The lengths of each chromosome */
} chromList_t;

This structure holds an array of null-terminated chromosome names together with their lengths. In the bigWig file itself, the chromosome names are stored sequentially, padded with 2-byte flags. (These flags are actually a 1-byte isLeaf flag and 1 byte of padding. The isLeaf flag is now deprecated in the bigWig file format.)
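
Once the chromosome list is in memory, a chromosome name has to be translated into its index, because the index section refers to chromosomes by index (chrIdxStart, chrIdxEnd) rather than by name. The following sketch does this with a linear scan. find_chrom is a hypothetical helper, and the sketch assumes that the position of a name in the chrom array is its chromosome index.

#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same layout as the chromList_t shown above. */
typedef struct {
    int64_t nKeys;   /* The number of chromosomes */
    char **chrom;    /* A list of null terminated chromosomes */
    uint32_t *len;   /* The lengths of each chromosome */
} chromList_t;

/* Hypothetical helper: return the index of `name` in the list,
 * or -1 if the chromosome is not present. */
static int64_t find_chrom(const chromList_t *cl, const char *name) {
    assert(cl != NULL && name != NULL);
    for (int64_t i = 0; i < cl->nKeys; i++) {
        if (strcmp(cl->chrom[i], name) == 0) return i;
    }
    return -1;
}

A linear scan is adequate for typical genomes with tens of chromosomes. An assembly with many thousands of scaffolds might justify a hash table instead.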

bigWig index

The bwRTreeNode_t and bwRTree_t data structures are defined as follows:

typedef struct bwRTreeNode_t {
    uint8_t isLeaf; /* Is this node a leaf?*/
    //1 byte of padding
    uint16_t nChildren; /* The number of children of this node, all lists have this length.*/
    uint32_t *chrIdxStart; /* A list of the starting chromosome indices of each child.*/
    uint32_t *baseStart; /* A list of the start position of each child.*/
    uint32_t *chrIdxEnd; /* A list of the end chromosome indices of each child.*/
    uint32_t *baseEnd; /* A list of the end position of each child.*/
    uint64_t *dataOffset; /* For leaves, the offset to the on-disk data. For twigs, the offset to the child node.*/
    union {
        uint64_t *size; /* Leaves only: The size of the data block.*/
        struct bwRTreeNode_t **child; /* Twigs only: The child node(s).*/
    } x; /* A union holding either size or child*/
} bwRTreeNode_t;


typedef struct {
    uint32_t blockSize; /* The maximum number of children a node can have*/
    uint64_t nItems; /* The total number of data blocks pointed to by the tree. This is completely redundant.*/
    uint32_t chrIdxStart; /* The index to the first chromosome described.*/
    uint32_t baseStart; /* The first position on chrIdxStart with a value.*/
    uint32_t chrIdxEnd; /* The index of the last chromosome with an entry.*/
    uint32_t baseEnd; /* The last position on chrIdxEnd with an entry.*/
    uint64_t idxSize; /* This is actually the offset of the index rather than the size?!? Yes, it's completely redundant.*/
    uint32_t nItemsPerSlot; /* This is always 1!*/
    //There's 4 bytes of padding in the file here
    uint64_t rootOffset; /* The offset to the root node of the R-Tree (on disk). Yes, this is redundant.*/
    bwRTreeNode_t *root; /* A pointer to the root node.*/
} bwRTree_t;

The bwRTree_t data structure is an R-tree, a data structure commonly used for indexing multi-dimensional information such as geographical coordinates. The R-tree arranges the genome coordinates hierarchically, which provides the information needed to randomly access any part of the bigWig file.
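
Descending the tree for a query region comes down to one test per child: does the span from (chrIdxStart, baseStart) to (chrIdxEnd, baseEnd) overlap the query? The sketch below shows one way to write that test. It assumes half-open intervals [start, end) and compares (chromosome index, position) pairs lexicographically. nodeView_t, child_overlaps and overlapping_children are hypothetical names, and nodeView_t is a cut-down view of bwRTreeNode_t that keeps only the coordinate lists.

#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of one R-tree node. */
typedef struct {
    uint16_t nChildren;
    const uint32_t *chrIdxStart;
    const uint32_t *baseStart;
    const uint32_t *chrIdxEnd;
    const uint32_t *baseEnd;
} nodeView_t;

/* Return 1 if child i overlaps [start, end) on chromosome tid, else 0. */
static int child_overlaps(const nodeView_t *n, uint16_t i,
                          uint32_t tid, uint32_t start, uint32_t end) {
    /* The child ends at or before the start of the query. */
    if (n->chrIdxEnd[i] < tid ||
        (n->chrIdxEnd[i] == tid && n->baseEnd[i] <= start)) return 0;
    /* The child starts at or after the end of the query. */
    if (n->chrIdxStart[i] > tid ||
        (n->chrIdxStart[i] == tid && n->baseStart[i] >= end)) return 0;
    return 1;
}

/* Store the indices of all overlapping children in out (which must have
 * room for nChildren entries) and return how many were found. */
static size_t overlapping_children(const nodeView_t *n, uint32_t tid,
                                   uint32_t start, uint32_t end,
                                   uint16_t *out) {
    assert(n != NULL && out != NULL);
    size_t k = 0;
    for (uint16_t i = 0; i < n->nChildren; i++) {
        if (child_overlaps(n, i, tid, start, end)) out[k++] = i;
    }
    return k;
}

For a twig node, each reported child would be followed through its dataOffset to the next node. For a leaf, dataOffset and size identify the data block to read, so only blocks that can contain the query region are ever fetched.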
