The HiC class

A class for handling HiC analysis.

class hifive.hic.HiC(filename, mode='r', silent=False)

This is the class for handling HiC analysis.

This class relies on Fend and HiCData for genomic position and interaction count data. Use this class to perform filtering of fends based on coverage, model fend bias and distance dependence, and downstream analysis and manipulation. This includes binning of data, plotting of data, modeling of data, and statistical analysis.

Note

This class is also available as hifive.HiC

When initialized, this class creates an h5dict in which to store all data associated with this object.

Parameters:
  • filename (str.) – The file name of the h5dict. This should end with the suffix ‘.hdf5’
  • mode (str.) – The mode to open the h5dict with. This should be ‘w’ for creating or overwriting an h5dict with name given in filename.
  • silent (bool.) – Indicates whether to print information about function execution for this object.
Returns:

HiC class object.

cis_heatmap(chrom, start=None, stop=None, startfend=None, stopfend=None, binsize=0, binbounds=None, datatype='enrichment', arraytype='compact', maxdistance=0, skipfiltered=False, returnmapping=False, dynamically_binned=False, minobservations=0, searchdistance=0, expansion_binsize=0, removefailed=False, image_file=None, **kwargs)

Return a heatmap of cis data of the type and shape specified by the passed arguments.

This function returns a heatmap for a single chromosome region, bounded by either ‘start’ and ‘stop’ or ‘startfend’ and ‘stopfend’ (‘start’ and ‘stop’ take precedence), or if given, the outer coordinates of the array passed by ‘binbounds’. If none of these are specified, data for the complete chromosome is used. The data in the array is determined by the ‘datatype’, being raw, fend-corrected, distance-corrected, enrichment, or expected data. The array shape is given by ‘arraytype’ and can be compact, upper, or full. See hic_binning for further explanation of ‘datatype’ and ‘arraytype’. The returned data will include interactions ranging from zero to ‘maxdistance’ apart. If maxdistance is zero, all interactions within the requested bounds are returned. If using dynamic binning (‘dynamically_binned’ is set to True), ‘minobservations’, ‘searchdistance’, ‘expansion_binsize’, and ‘removefailed’ are used to control the dynamic binning process. Otherwise these arguments are ignored.

Parameters:
  • chrom (str.) – The name of a chromosome to obtain data from.
  • start (int.) – The smallest coordinate to include in the array, measured from fend midpoints. If both ‘start’ and ‘startfend’ are given, ‘start’ will override ‘startfend’. If unspecified, this will be set to the midpoint of the first fend for ‘chrom’. Optional.
  • stop (int.) – The largest coordinate to include in the array, measured from fend midpoints. If both ‘stop’ and ‘stopfend’ are given, ‘stop’ will override ‘stopfend’. If unspecified, this will be set to the midpoint of the last fend plus one for ‘chrom’. Optional.
  • startfend (int.) – The first fend to include in the array. If unspecified and ‘start’ is not given, this is set to the first fend in ‘chrom’. In cases where ‘start’ is specified and conflicts with ‘startfend’, ‘start’ is given preference. Optional.
  • stopfend (int.) – The first fend not to include in the array. If unspecified and ‘stop’ is not given, this is set to the last fend in ‘chrom’ plus one. In cases where ‘stop’ is specified and conflicts with ‘stopfend’, ‘stop’ is given preference. Optional.
  • binsize (int.) – This is the coordinate width of each bin. If ‘binsize’ is zero, unbinned data is returned. If ‘binbounds’ is not None, this value is ignored.
  • binbounds (numpy array) – An array containing start and stop coordinates for a set of user-defined bins. Any fend not falling in a bin is ignored. Optional.
  • datatype (str.) – This specifies the type of data that is processed and returned. Options are ‘raw’, ‘distance’, ‘fend’, ‘enrichment’, and ‘expected’. Observed values are always in the first index along the last axis, except when ‘datatype’ is ‘expected’. In this case, filter values replace counts. Conversely, if ‘raw’ is specified, unfiltered fends return a value of one. Expected values are returned for ‘distance’, ‘fend’, ‘enrichment’, and ‘expected’ values of ‘datatype’. ‘distance’ uses only the expected signal given distance for calculating the expected values, ‘fend’ uses only fend correction values, and both ‘enrichment’ and ‘expected’ use both correction and distance mean values.
  • arraytype (str.) – This determines the shape of the array in which data are returned. Acceptable values are ‘compact’, ‘full’, and ‘upper’. ‘compact’ means data are arranged in an N x M x 2 array, where N is the number of fends or bins and M is the maximum number of steps between included fend pairs or bin pairs; data are stored such that bin n,m contains the interaction values between n and n + m + 1. ‘full’ returns a square, symmetric array of size N x N x 2. ‘upper’ returns only the flattened upper triangle of a full array, excluding the diagonal, with size (N * (N - 1) / 2) x 2.
  • maxdistance (int.) – This specifies the maximum coordinate distance between bins that will be included in the array. If set to zero, all distances are included.
  • skipfiltered (bool.) – If ‘True’, all interaction bins for filtered out fends are removed and a reduced-size array is returned.
  • returnmapping (bool.) – If ‘True’, a list is returned containing the data array and a mapping array. For unbinned data, the mapping is a 1d array of the fend numbers included in the data array. For binned data, it is a 2d array of N x 4 containing, for each bin, the first fend, the last fend plus one, and the first and last coordinates. Otherwise only the data array is returned.
  • dynamically_binned (bool.) – If ‘True’, return dynamically binned data.
  • minobservations (int.) – The fewest number of observed reads needed for a bin to be counted as valid and stop expanding.
  • searchdistance (int.) – The furthest distance from the bin midpoint to expand bounds. If this is set to zero, there is no limit on expansion distance.
  • expansion_binsize (int.) – The size of bins to use for data to pull from when expanding dynamic bins. If set to zero, unbinned data is used.
  • removefailed (bool.) – If a non-zero ‘searchdistance’ is given, it is possible for a bin not to meet the ‘minobservations’ criteria before the search stops. If this occurs and ‘removefailed’ is True, the observed and expected values for that bin are zero.
  • image_file (str.) – If a filename is specified, a PNG image file is written containing the heatmap data. Arguments for the appearance of the image can be passed as additional keyword arguments.
Returns:

Array in format requested with ‘arraytype’ containing data requested with ‘datatype’. If ‘returnmapping’ is True, a list is returned containing the requested data array and an array of associated positions (dependent on the binning options selected).
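The relationship between the ‘compact’, ‘full’, and ‘upper’ layouts described for ‘arraytype’ can be illustrated with a minimal sketch. It uses plain nested lists rather than numpy arrays, and the helper names compact_to_full and full_to_upper are hypothetical and not part of hifive. The row-by-row ordering of the flattened upper triangle is an assumption.

```python
def compact_to_full(compact):
    """Expand an N x M x 2 'compact' nested list into an N x N x 2 'full' one."""
    n = len(compact)
    full = [[[0.0, 0.0] for _ in range(n)] for _ in range(n)]
    for i, row in enumerate(compact):
        for m, pair in enumerate(row):
            j = i + m + 1  # compact[i][m] holds the pair (i, i + m + 1)
            if j < n:
                full[i][j] = list(pair)
                full[j][i] = list(pair)  # mirror to keep the array symmetric
    return full


def full_to_upper(full):
    """Flatten the upper triangle (diagonal excluded), row by row (assumed order)."""
    n = len(full)
    return [full[i][j] for i in range(n) for j in range(i + 1, n)]


# Four bins with interactions kept up to two steps apart (N = 4, M = 2).
# Each innermost pair is [observed, expected]; the values are made up.
compact = [
    [[5, 4.0], [2, 1.5]],
    [[6, 4.2], [1, 1.4]],
    [[7, 3.9], [0, 0.0]],  # second entry would be pair (2, 4): padding
    [[0, 0.0], [0, 0.0]],  # both entries fall outside the array: padding
]
full = compact_to_full(compact)
upper = full_to_upper(full)
```

Here the ‘upper’ form has N * (N - 1) / 2 = 6 rows, matching the size given in the ‘arraytype’ description.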

filter_fends(mininteractions=10, mindistance=0, maxdistance=0)

Iterate over the dataset and remove fends that do not have ‘mininteractions’ interactions between ‘mindistance’ and ‘maxdistance’ of themselves, counting only interactions with unfiltered fends.

In order to create a set of fends that all have the necessary number of interactions, after each round of filtering, fend interactions are retallied using only interactions that have unfiltered fends at both ends.

Parameters:
  • mininteractions (int.) – The required number of interactions for keeping a fend in analysis.
  • mindistance (int.) – The minimum inter-fend distance used to count fend interactions.
  • maxdistance (int.) – The maximum inter-fend distance used to count fend interactions. A value of 0 indicates no maximum should be used.
Returns:

None
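The iterative retallying described above can be sketched in a few lines. This is an illustration of the idea only, not the hifive implementation: the function name filter_fends_sketch and the dictionary input format are hypothetical, and the ‘mindistance’ and ‘maxdistance’ limits are left out for brevity.

```python
def filter_fends_sketch(pair_counts, num_fends, mininteractions=10):
    """Iteratively drop fends with too few interactions among surviving fends.

    pair_counts maps (fend1, fend2) tuples to observed read counts.
    Returns a list of booleans, True for fends that remain valid.
    """
    valid = [True] * num_fends
    changed = True
    while changed:
        # Retally using only pairs whose two ends are both still valid.
        totals = [0] * num_fends
        for (f1, f2), count in pair_counts.items():
            if valid[f1] and valid[f2]:
                totals[f1] += count
                totals[f2] += count
        changed = False
        for fend in range(num_fends):
            if valid[fend] and totals[fend] < mininteractions:
                valid[fend] = False
                changed = True
    return valid


# Made-up counts: fend 4 fails in the first round, which then pulls fend 3
# below the threshold in the second round.
counts = {(0, 1): 8, (0, 2): 6, (1, 2): 7, (2, 3): 4, (3, 4): 6}
valid = filter_fends_sketch(counts, num_fends=5, mininteractions=10)
```

The example shows why a single pass is not enough: fend 3 initially has exactly ten interactions, but six of them involve fend 4, which is itself removed.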

find_binning_fend_corrections(mindistance=0, maxdistance=0, chroms=[], num_bins=[20, 20, 20], parameters=['even', 'even', 'even-const'], model=['gc', 'len', 'distance'], learning_threshold=1.0, max_iterations=10, usereads='cis')

Using a multivariate binning model, learn correction values for combinations of model parameter bins. This function is MPI compatible.

Parameters:
  • mindistance (int.) – The minimum inter-fend distance to be included in modeling.
  • maxdistance (int.) – The maximum inter-fend distance to be included in modeling.
  • chroms (list) – A list of chromosomes to calculate corrections for. If set as None, all chromosome corrections are found.
  • remove_distance (bool.) – Use distance dependence curve in prior probability calculation for each observation.
  • model (list) – A list of fend features to be used in model. Valid values are ‘len’, ‘distance’, and any features included in the creation of the associated Fend object. The ‘distance’ parameter is only good with ‘cis’ or ‘all’ reads. If used with ‘all’, distances will be partitioned into n - 1 bins and the final distance bin will contain all trans data.
  • num_bins (list) – A list of the number of approximately equal-sized bins to divide model components into.
  • learning_threshold (float) – The minimum change in log-likelihood needed to continue iterative learning process.
  • max_iterations (int.) – The maximum number of iterations to use for learning model parameters.
  • usereads (str.) – Specifies which set of interactions to use, ‘cis’, ‘trans’, or ‘all’.
  • parameters (list) – A list of types, one for each model parameter. Types can be either ‘even’ or ‘fixed’, indicating whether each parameter bin should contain approximately even numbers of interactions or be of fixed width spanning 1 / Nth of the range of the parameter’s values, respectively. Parameter types can also have the suffix ‘-const’ to indicate that the parameter should not be optimized.
Returns:

None
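The difference between the ‘even’ and ‘fixed’ parameter types can be illustrated with a small sketch. The helpers fixed_edges and even_edges are hypothetical names, and for simplicity the ‘even’ version balances the number of values per bin rather than the number of interactions, so it is only an approximation of the behaviour described above.

```python
def fixed_edges(values, num_bins):
    """Bin edges of equal width, each spanning 1/Nth of the value range."""
    low, high = min(values), max(values)
    width = (high - low) / num_bins
    return [low + width * i for i in range(num_bins + 1)]


def even_edges(values, num_bins):
    """Bin edges chosen so each bin holds roughly the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    edges = [ordered[(n * i) // num_bins] for i in range(num_bins)]
    edges.append(ordered[-1])
    return edges


# Made-up GC fractions with one outlier at the high end.
gc = [0.30, 0.32, 0.33, 0.35, 0.36, 0.38, 0.40, 0.70]
fixed = fixed_edges(gc, 4)
even = even_edges(gc, 4)
```

With skewed values like these, ‘fixed’ bins leave most of the range nearly empty, while ‘even’ bins concentrate their edges where the data are dense.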

find_distance_parameters(numbins=90, minsize=200, maxsize=0, corrected=False)

Count reads and possible interactions from valid fend pairs in each distance bin to find mean bin signals. This function is MPI compatible.

This partitions the range of interaction distances (measured from midpoints of the involved fends) from ‘minsize’ to ‘maxsize’ into a number of partitions equal to ‘numbins’. The first bin contains all distances less than or equal to ‘minsize’. The remaining bins are defined such that their log ranges are equal to one another. The curve defined by the mean interaction value of each bin can be smoothed using a triangular smoothing operation.

Parameters:
  • numbins (int.) – The number of bins to divide the distance range into. The first bin extends from zero to ‘minsize’, while the remaining bins are divided into evenly-spaced log-sized bins from ‘minsize’ to ‘maxsize’ or the maximum inter-fend distance, whichever is greater.
  • minsize (int.) – The upper size limit of the smallest distance bin.
  • maxsize (int.) – If this value is larger than the largest included chromosome, it will extend bins out to maxsize. If this value is smaller, it is ignored.
  • corrected (bool.) – If True, correction values are applied to counts prior to summing.
Returns:

None
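The bin layout described above, one bin up to ‘minsize’ followed by bins of equal log range, can be sketched as follows. The helper names distance_bin_edges and distance_bin are hypothetical, and the sketch assumes ‘maxsize’ is already known rather than derived from the largest chromosome.

```python
import math


def distance_bin_edges(numbins, minsize, maxsize):
    """Upper bounds of the distance bins.

    The first bin ends at minsize; the remaining numbins - 1 bins cover
    minsize..maxsize in steps of equal size on a log scale.
    """
    log_step = (math.log(maxsize) - math.log(minsize)) / (numbins - 1)
    return [minsize * math.exp(log_step * i) for i in range(numbins)]


def distance_bin(distance, edges):
    """Index of the first bin whose upper bound is at least the distance."""
    for index, upper in enumerate(edges):
        if distance <= upper:
            return index
    return len(edges) - 1  # anything beyond the last edge goes in the last bin


# Five bins from 200 bp to 2 Mb: each log-spaced bin spans a factor of ten.
edges = distance_bin_edges(numbins=5, minsize=200, maxsize=2000000)
```

Equal log ranges mean each successive bin covers the same multiplicative span of distances, which keeps resolution high at short distances where signal changes fastest.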

find_express_fend_corrections(iterations=100, mindistance=0, maxdistance=0, remove_distance=True, usereads='cis', mininteractions=0, chroms=[], precorrect=False)

Using iterative matrix-balancing approximation, learn correction values for each valid fend. This function is MPI compatible.

Parameters:
  • iterations (int.) – The number of iterations to use for learning fend corrections.
  • mindistance (int.) – This is the minimum distance between fend midpoints needed to be included in the analysis. All possible and observed interactions with a distance shorter than this are ignored. If ‘usereads’ is set to ‘trans’, this value is ignored.
  • maxdistance (int.) – The maximum inter-fend distance to be included in modeling. If ‘usereads’ is set to ‘trans’, this value is ignored.
  • remove_distance (bool.) – Specifies whether the estimated distance-dependent portion of the signal is removed prior to learning fend corrections.
  • usereads (str.) – Specifies which set of interactions to use, ‘cis’, ‘trans’, or ‘all’.
  • mininteractions (int.) – If a non-zero ‘mindistance’ is specified or only ‘trans’ interactions are used, fend filtering will be performed again to ensure that the data being used is sufficient for analyzed fends. This parameter may specify how many interactions are needed for valid fends. If not given, the value used for the last call to filter_fends() is used or, barring that, one.
  • chroms (list) – A list of chromosomes to calculate corrections for. If set as None, all chromosome corrections are found.
  • precorrect (bool.) – Use binning-based corrections in expected value calculations, resulting in a chained normalization approach.
Returns:

None
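As a rough illustration of what an iterative matrix-balancing approximation does, the sketch below learns one correction value per fend so that the product of two corrections approximates the observed count for that pair. It is a generic textbook-style scheme under simplifying assumptions (a small dense symmetric matrix, every fend with at least one count, no distance dependence), not the hifive implementation, and balance_corrections is a hypothetical name.

```python
import math


def balance_corrections(counts, iterations=100):
    """Learn per-fend corrections c so that c[i] * c[j] approximates counts[i][j].

    counts is a symmetric square nested list; the diagonal is ignored.
    """
    n = len(counts)
    corrections = [1.0] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            observed = sum(counts[i][j] for j in range(n) if j != i)
            expected = sum(corrections[i] * corrections[j]
                           for j in range(n) if j != i)
            # The square root damps each update so values settle smoothly.
            updated.append(corrections[i] * math.sqrt(observed / expected))
        corrections = updated
    return corrections


# Made-up counts generated from biases 1, 2 and 4 (count = bias_i * bias_j).
counts = [
    [0, 2, 4],
    [2, 0, 8],
    [4, 8, 0],
]
corrections = balance_corrections(counts, iterations=200)
```

On this toy input the learned corrections approach the biases used to generate the counts, which is the sense in which the row totals become "balanced".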

find_probability_fend_corrections(mindistance=0, maxdistance=0, minchange=0.0001, burnin_iterations=10000, annealing_iterations=10000, learningrate=0.1, display=0, chroms=[], precalculate=True, precorrect=False)

Using gradient descent, learn correction values for each valid fend based on a Poisson distribution of observations. This function is MPI compatible.

Parameters:
  • mindistance (int.) – The minimum inter-fend distance to be included in modeling.
  • maxdistance (int.) – The maximum inter-fend distance to be included in modeling.
  • minchange (float) – The minimum mean change in fend correction parameter values needed to keep running past ‘burnin_iterations’ number of iterations during burn-in phase.
  • burnin_iterations (int.) – The number of iterations to use with constant learning rate in gradient descent for learning fend corrections.
  • annealing_iterations (int.) – The number of iterations to use with a linearly-decreasing learning rate in gradient descent for learning fend corrections.
  • learningrate (float) – The gradient scaling factor for parameter updates.
  • display (int.) – Specifies how many iterations pass between each calculation and display of the cost as the model is learned. If ‘display’ is zero, the cost is not calculated or displayed.
  • chroms (list) – A list of chromosomes to calculate corrections for. If set as None, all chromosome corrections are found.
  • precalculate (bool.) – Specifies whether the correction values should be initialized at the fend means.
  • precorrect (bool.) – Use binning-based corrections in expected value calculations, resulting in a chained normalization approach.
Returns:

None
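The two-phase schedule implied by ‘burnin_iterations’ and ‘annealing_iterations’ can be sketched as a function of the iteration number. This is an assumption-laden illustration: learning_rate_at is a hypothetical helper, the rate is assumed to fall linearly to zero by the end of annealing, and the early stopping controlled by ‘minchange’ is not modelled.

```python
def learning_rate_at(step, learningrate=0.1, burnin_iterations=10000,
                     annealing_iterations=10000):
    """Learning rate for a given iteration under a burn-in/annealing schedule."""
    if step < burnin_iterations:
        # Burn-in phase: constant learning rate.
        return learningrate
    # Annealing phase: rate shrinks linearly with the iterations remaining.
    annealing_step = step - burnin_iterations
    remaining = max(annealing_iterations - annealing_step, 0)
    return learningrate * remaining / annealing_iterations


# Sample the schedule at the start, the phase boundary, midway, and the end.
rates = [learning_rate_at(step) for step in (0, 10000, 15000, 20000)]
```

A constant rate lets the parameters move quickly toward a good region, and the shrinking rate afterwards reduces oscillation around the optimum.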

find_trans_means()

Calculate the mean signals across all valid fend-pair trans interactions for each chromosome pair.

Returns:

None

learn_fend_3D_model(chrom, minobservations=10)

Learn coordinates for a 3D model of data using an approximate PCA dimensional reduction.

This function makes use of the mlpy function PCAFast() to reduce the data to a set of three coordinates per fend. Cis data for all unfiltered fends for the specified chromosome are dynamically binned to yield a complete distance matrix. The diagonal is set equal to the highest valid enrichment value after dynamic binning. This N x N matrix is passed to PCAFast() and reduced to an N x 3 matrix.

Parameters:
  • chrom (str.) – The chromosome to learn the model for.
  • minobservations (int.) – The minimum number of observed reads needed to cease bin expansion in the dynamic binning phase.
Returns:

Array containing a row for each valid fend and columns containing X coordinate, Y coordinate, Z coordinate, and sequence coordinate (fend midpoint).

load()

Load analysis parameters from h5dict specified at object creation and open h5dicts for associated HiCData and Fend objects.

Any call of this function will overwrite current object data with values from the last save() call.

Returns:

None

load_data(filename)

Load fend-pair counts and fend object from HiCData object.

Parameters:
  • filename (str.) – Specifies the file name of the HiCData object to associate with this analysis.
Returns:

None

reset_filter()

Return all fends to a valid filter state.

Returns:

None

save(out_fname=None)

Save analysis parameters to h5dict.

Parameters:
  • out_fname (str.) – Specifies the file name of the HiC object to save this analysis to.
Returns:

None

trans_heatmap(chrom1, chrom2, start1=None, stop1=None, startfend1=None, stopfend1=None, binbounds1=None, start2=None, stop2=None, startfend2=None, stopfend2=None, binbounds2=None, binsize=1000000, skipfiltered=False, datatype='enrichment', returnmapping=False, dynamically_binned=False, minobservations=0, searchdistance=0, expansion_binsize=0, removefailed=False, image_file=None, **kwargs)

Return a heatmap of trans data of the type and shape specified by the passed arguments.

This function returns a heatmap for trans interactions between two chromosomes within a region, bounded by either ‘start1’, ‘stop1’, ‘start2’ and ‘stop2’ or ‘startfend1’, ‘stopfend1’, ‘startfend2’, and ‘stopfend2’ (‘start’ and ‘stop’ take precedence), or if given, the outer coordinates of the arrays passed by ‘binbounds1’ and ‘binbounds2’. The data in the array is determined by the ‘datatype’, being raw, fend-corrected, distance-corrected, enrichment, or expected data. The array shape is always rectangular. See hic_binning for further explanation of ‘datatype’. If using dynamic binning (‘dynamically_binned’ is set to True), ‘minobservations’, ‘searchdistance’, ‘expansion_binsize’, and ‘removefailed’ are used to control the dynamic binning process. Otherwise these arguments are ignored.

Parameters:
  • chrom1 (str.) – The name of the first chromosome to obtain data from.
  • chrom2 (str.) – The name of the second chromosome to obtain data from.
  • start1 (int.) – The coordinate at the beginning of the smallest bin from ‘chrom1’. If unspecified, ‘start1’ will be the first multiple of ‘binsize’ below the ‘startfend1’ mid. If there is a conflict between ‘start1’ and ‘startfend1’, ‘start1’ is given preference. Optional.
  • stop1 (int.) – The largest coordinate to include in the array from ‘chrom1’, measured from fend midpoints. If both ‘stop1’ and ‘stopfend1’ are given, ‘stop1’ will override ‘stopfend1’. ‘stop1’ will be shifted higher as needed to make the last bin of size ‘binsize’. Optional.
  • startfend1 (int.) – The first fend from ‘chrom1’ to include in the array. If unspecified and ‘start1’ is not given, this is set to the first valid fend in ‘chrom1’. In cases where ‘start1’ is specified and conflicts with ‘startfend1’, ‘start1’ is given preference. Optional.
  • stopfend1 (int.) – The first fend not to include in the array from ‘chrom1’. If unspecified and ‘stop1’ is not given, this is set to the last valid fend in ‘chrom1’ + 1. In cases where ‘stop1’ is specified and conflicts with ‘stopfend1’, ‘stop1’ is given preference. Optional.
  • binbounds1 (numpy array) – An array containing start and stop coordinates for a set of user-defined bins to use for partitioning ‘chrom1’. Any fend not falling in a bin is ignored.
  • start2 (int.) – The coordinate at the beginning of the smallest bin from ‘chrom2’. If unspecified, ‘start2’ will be the first multiple of ‘binsize’ below the ‘startfend2’ mid. If there is a conflict between ‘start2’ and ‘startfend2’, ‘start2’ is given preference. Optional.
  • stop2 (int.) – The largest coordinate to include in the array from ‘chrom2’, measured from fend midpoints. If both ‘stop2’ and ‘stopfend2’ are given, ‘stop2’ will override ‘stopfend2’. ‘stop2’ will be shifted higher as needed to make the last bin of size ‘binsize’. Optional.
  • startfend2 (int.) – The first fend from ‘chrom2’ to include in the array. If unspecified and ‘start2’ is not given, this is set to the first valid fend in ‘chrom2’. In cases where ‘start2’ is specified and conflicts with ‘startfend2’, ‘start2’ is given preference. Optional.
  • stopfend2 (int.) – The first fend not to include in the array from ‘chrom2’. If unspecified and ‘stop2’ is not given, this is set to the last valid fend in ‘chrom2’ + 1. In cases where ‘stop2’ is specified and conflicts with ‘stopfend2’, ‘stop2’ is given preference. Optional.
  • binbounds2 (numpy array) – An array containing start and stop coordinates for a set of user-defined bins to use for partitioning ‘chrom2’. Any fend not falling in a bin is ignored.
  • binsize (int.) – This is the coordinate width of each bin. If binbounds is not None, this value is ignored.
  • skipfiltered (bool.) – If ‘True’, all interaction bins for filtered out fends are removed and a reduced-size array is returned.
  • datatype (str.) – This specifies the type of data that is processed and returned. Options are ‘raw’, ‘distance’, ‘fend’, ‘enrichment’, and ‘expected’. Observed values are always in the first index along the last axis, except when ‘datatype’ is ‘expected’. In this case, filter values replace counts. Conversely, if ‘raw’ is specified, unfiltered fends return a value of one. Expected values are returned for ‘distance’, ‘fend’, ‘enrichment’, and ‘expected’ values of ‘datatype’. ‘distance’ uses only the expected signal given distance for calculating the expected values, ‘fend’ uses only fend correction values, and both ‘enrichment’ and ‘expected’ use both correction and distance mean values.
  • returnmapping (bool.) – If ‘True’, a list containing the data array and two 2d arrays of N x 4 containing the first fend and last fend plus one included in each bin and first and last coordinates for the first and second chromosomes is returned. Otherwise only the data array is returned.
  • dynamically_binned (bool.) – If ‘True’, return dynamically binned data.
  • minobservations (int.) – The fewest number of observed reads needed for a bin to be counted as valid and stop expanding.
  • searchdistance (int.) – The furthest distance from the bin midpoint to expand bounds. If this is set to zero, there is no limit on expansion distance.
  • expansion_binsize (int.) – The size of bins to use for data to pull from when expanding dynamic bins. If set to zero, unbinned data is used.
  • removefailed (bool.) – If a non-zero ‘searchdistance’ is given, it is possible for a bin not to meet the ‘minobservations’ criteria before the search stops. If this occurs and ‘removefailed’ is True, the observed and expected values for that bin are zero.
  • image_file (str.) – If a filename is specified, a PNG image file is written containing the heatmap data. Arguments for the appearance of the image can be passed as additional keyword arguments.
Returns:

Rectangular array containing data requested with ‘datatype’. If ‘returnmapping’ is True, a list is returned containing the requested data array and the arrays of associated positions for the first and second chromosomes (dependent on the binning options selected).
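The way ‘start1’ and ‘stop1’ (and their ‘chrom2’ counterparts) snap to multiples of ‘binsize’ can be sketched as follows. The helper trans_bin_bounds is hypothetical, and treating a midpoint that lies exactly on a multiple of ‘binsize’ as its own bin start is an assumption.

```python
def trans_bin_bounds(first_mid, last_mid, binsize):
    """Return (start, stop) coordinate pairs for equal-sized bins.

    The first bin starts at the multiple of binsize at or below first_mid,
    and the last bin is extended upward so that it is a full binsize wide.
    """
    start = (first_mid // binsize) * binsize
    num_bins = (last_mid - start) // binsize + 1
    return [(start + i * binsize, start + (i + 1) * binsize)
            for i in range(num_bins)]


# Made-up fend midpoints at 3.25 Mb and 7.1 Mb with the default 1 Mb bins.
bins = trans_bin_bounds(3250000, 7100000, 1000000)
```

With these inputs the bins run from 3 Mb to 8 Mb, so both midpoints fall inside bins of the full requested width.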

write_heatmap(filename, binsize, includetrans=True, datatype='enrichment', chroms=[])

Create an h5dict file containing binned interaction arrays, bin positions, and an index of included chromosomes. This function is MPI compatible.

Parameters:
  • filename (str.) – Location to write h5dict object to.
  • binsize (int.) – Size of bins for interaction arrays.
  • includetrans (bool.) – Indicates whether trans interaction arrays should be calculated and saved.
  • datatype (str.) – This specifies the type of data that is processed and returned. Options are ‘raw’, ‘distance’, ‘fend’, ‘enrichment’, and ‘expected’. Observed values are always in the first index along the last axis, except when ‘datatype’ is ‘expected’. In this case, filter values replace counts. Conversely, if ‘raw’ is specified, unfiltered fends return a value of one. Expected values are returned for ‘distance’, ‘fend’, ‘enrichment’, and ‘expected’ values of ‘datatype’. ‘distance’ uses only the expected signal given distance for calculating the expected values, ‘fend’ uses only fend correction values, and both ‘enrichment’ and ‘expected’ use both correction and distance mean values.
  • chroms (list) – A list of chromosome names indicating which chromosomes should be included. If left empty, all chromosomes are included. Optional.
Returns:

None