The FiveC class

A class for handling 5C analysis.

class hifive.fivec.FiveC(filename, mode='r', silent=False)

This is the class for handling 5C analysis.

This class relies on Fragment and FiveCData for genomic position and interaction count data. Use this class to perform filtering of fragments based on coverage, model fragment bias and distance dependence, and downstream analysis and manipulation. This includes binning of data, plotting of data, and statistical analysis.

Note

This class is also available as hifive.FiveC

When initialized, this class creates an h5dict in which to store all data associated with this object.

Parameters:
  • filename (str.) – The file name of the h5dict. This should end with the suffix ‘.hdf5’
  • mode (str.) – The mode to open the h5dict with. This should be ‘w’ for creating or overwriting an h5dict with name given in filename.
  • silent (bool.) – Indicates whether to print information about function execution for this object.
Returns:

FiveC class object.
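The methods documented below are usually chained into a short session. The following is a minimal sketch of such a session, not a prescribed pipeline: the order of the steps and the file names are assumptions for illustration, and the class is passed in as an argument so the function can be exercised without HiFive installed. Only method names and keyword arguments listed in this reference are used.

```python
def run_basic_workflow(fivec_cls, data_fname, out_fname):
    """Create a 5C analysis object, normalize it, and save it.

    fivec_cls is expected to behave like hifive.FiveC. The order of the
    steps below is an illustrative assumption.
    """
    # mode='w' creates (or overwrites) the h5dict named by out_fname.
    fivec = fivec_cls(out_fname, mode='w', silent=True)
    # Associate fragment-pair counts from an existing FiveCData file.
    fivec.load_data(data_fname)
    # Drop fragments that have too few interactions.
    fivec.filter_fragments(mininteractions=20)
    # Fit the distance-dependence curve.
    fivec.find_distance_parameters()
    # Learn per-fragment corrections by iterative approximation.
    fivec.find_express_fragment_corrections(iterations=1000,
                                            remove_distance=True)
    # Store the analysis parameters in the h5dict.
    fivec.save()
    return fivec
```

With HiFive installed, this would be called as run_basic_workflow(hifive.FiveC, 'example_data.hdf5', 'example_analysis.hdf5'), where both file names are hypothetical.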

cis_heatmap(region, binsize=0, binbounds=None, start=None, stop=None, startfrag=None, stopfrag=None, datatype='enrichment', arraytype='full', skipfiltered=False, returnmapping=False, dynamically_binned=False, minobservations=0, searchdistance=0, expansion_binsize=0, removefailed=False, image_file=None, **kwargs)

Return a heatmap of cis data of the type and shape specified by the passed arguments.

This function returns a heatmap for a single region, bounded by either ‘start’ and ‘stop’ or ‘startfrag’ and ‘stopfrag’ (‘start’ and ‘stop’ take precedence). If neither is given, the complete region is included. The data in the array is determined by the ‘datatype’, being raw, fragment-corrected, distance-corrected, enrichment, or expected data. The array shape is given by ‘arraytype’ and can be compact (if unbinned), upper, or full. See fivec_binning for further explanation of ‘datatype’ and ‘arraytype’. If using dynamic binning (‘dynamically_binned’ is set to True), ‘minobservations’, ‘searchdistance’, ‘expansion_binsize’, and ‘removefailed’ are used to control the dynamic binning process. Otherwise these arguments are ignored.

Parameters:
  • region (int.) – The index of the region to obtain data from.
  • binsize (int.) – This is the coordinate width of each bin. If ‘binsize’ is zero, unbinned data is returned.
  • binbounds (numpy array) – An array containing start and stop coordinates for a set of user-defined bins. Any fragment not falling in a bin is ignored.
  • start (int.) – The smallest coordinate to include in the array, measured from fragment midpoints. If both ‘start’ and ‘startfrag’ are given, ‘start’ will override ‘startfrag’. If unspecified, this will be set to the midpoint of the first fragment for ‘region’. Optional.
  • stop (int.) – The largest coordinate to include in the array, measured from fragment midpoints. If both ‘stop’ and ‘stopfrag’ are given, ‘stop’ will override ‘stopfrag’. If unspecified, this will be set to the midpoint of the last fragment plus one for ‘region’. Optional.
  • startfrag (int.) – The first fragment to include in the array. If unspecified and ‘start’ is not given, this is set to the first fragment in ‘region’. In cases where ‘start’ is specified and conflicts with ‘startfrag’, ‘start’ is given preference. Optional.
  • stopfrag (int.) – The first fragment not to include in the array. If unspecified and ‘stop’ is not given, this is set to the last fragment in ‘region’ plus one. In cases where ‘stop’ is specified and conflicts with ‘stopfrag’, ‘stop’ is given preference. Optional.
  • datatype (str.) – This specifies the type of data that is processed and returned. Options are ‘raw’, ‘distance’, ‘fragment’, ‘enrichment’, and ‘expected’. Observed values are always in the first index along the last axis, except when ‘datatype’ is ‘expected’. In this case, filter values replace counts. Conversely, if ‘raw’ is specified, unfiltered fragments return a value of one. Expected values are returned for ‘distance’, ‘fragment’, ‘enrichment’, and ‘expected’ values of ‘datatype’. ‘distance’ uses only the expected signal given distance for calculating the expected values, ‘fragment’ uses only fragment correction values, and both ‘enrichment’ and ‘expected’ use both correction and distance mean values.
  • arraytype (str.) – This determines what shape of array data are returned in. Acceptable values are ‘compact’ (if unbinned), ‘full’, and ‘upper’. ‘compact’ means data are arranged in a N x M x 2 array where N and M are the number of forward and reverse probe fragments, respectively. ‘full’ returns a square, symmetric array of size N x N x 2 where N is the total number of fragments or bins. ‘upper’ returns only the flattened upper triangle of a full array, excluding the diagonal, with size (N * (N - 1) / 2) x 2, where N is the total number of fragments or bins.
  • skipfiltered (bool.) – If True, all interaction bins for filtered out fragments are removed and a reduced-size array is returned.
  • returnmapping (bool.) – If True, a list is returned containing the data array and mapping information: a single 1d array of the fragment numbers included in the data array if the array is not compact, or two 1d arrays containing the forward and reverse fragment numbers if the array is compact. Otherwise only the data array is returned.
  • dynamically_binned (bool.) – If True, return dynamically binned data.
  • minobservations (int.) – The fewest number of observed reads needed for a bin to be counted as valid and stop expanding.
  • searchdistance (int.) – The furthest distance from the bin midpoint to expand bounds. If this is set to zero, there is no limit on expansion distance.
  • expansion_binsize (int.) – The size of bins to use for data to pull from when expanding dynamic bins. If set to zero, unbinned data is used.
  • removefailed (bool.) – If a non-zero ‘searchdistance’ is given, it is possible for a bin not to meet the ‘minobservations’ criteria before the search stops. If this occurs and ‘removefailed’ is True, the observed and expected values for that bin are set to zero.
  • image_file (str.) – If a filename is specified, a PNG image file is written containing the heatmap data. Arguments for the appearance of the image can be passed as additional keyword arguments.
Returns:

Array in format requested with ‘arraytype’ containing data requested with ‘datatype’. If ‘returnmapping’ is True, a list is returned containing the requested data array and an array of associated positions (dependent on the binning options selected).
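The relationship between the ‘full’ and ‘upper’ layouts can be illustrated with plain nested lists. This is a minimal sketch, assuming the upper triangle is flattened row by row (the ordering is an assumption, not something stated above); the function names are hypothetical.

```python
def full_to_upper(full):
    """Flatten the strict upper triangle of a square N x N x 2 nested list.

    The result has N * (N - 1) / 2 rows, each holding the two values
    stored for one pair (i, j) with i < j. Row-by-row order is assumed.
    """
    n = len(full)
    return [full[i][j] for i in range(n) for j in range(i + 1, n)]


def upper_index(i, j, n):
    """Return the row of pair (i, j), i < j, in the flattened triangle."""
    # Rows 0..i-1 contribute (n - 1) + (n - 2) + ... entries before row i.
    return i * n - i * (i + 1) // 2 + (j - i - 1)
```

For a 4 x 4 x 2 array the flattened result has six rows, and upper_index(1, 3, 4) points at the row holding the pair (1, 3).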

filter_fragments(mininteractions=20, mindistance=0, maxdistance=0)

Iterate over the dataset and remove fragments that do not have ‘mininteractions’ using only unfiltered fragments and interactions falling within the distance limits specified.

In order to create a set of fragments that all have the necessary number of interactions, after each round of filtering, fragment interactions are retallied using only interactions that have unfiltered fragments at both ends.

Parameters:
  • mininteractions (int.) – The required number of interactions for keeping a fragment in analysis.
  • mindistance (int.) – The minimum inter-fragment distance to be included in filtering.
  • maxdistance (int.) – The maximum inter-fragment distance to be included in filtering. A value of zero indicates no maximum cutoff.
Returns:

None
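The retallying loop described above can be sketched in a few lines. This is an illustration under stated assumptions rather than HiFive's implementation: each observed fragment pair is counted as one interaction, the distance limits are left out, and the function and argument names are hypothetical.

```python
def iterative_filter(pairs, mininteractions=20):
    """Return the set of fragments that survive iterative filtering.

    pairs is an iterable of (fragment_a, fragment_b) tuples, one per
    observed interaction. Distance limits are omitted in this sketch.
    """
    pairs = list(pairs)
    valid = {frag for pair in pairs for frag in pair}
    while True:
        # Retally using only interactions with valid fragments at both ends.
        totals = dict.fromkeys(valid, 0)
        for a, b in pairs:
            if a in valid and b in valid:
                totals[a] += 1
                totals[b] += 1
        failed = {frag for frag, n in totals.items() if n < mininteractions}
        if not failed:
            return valid
        # Removing fragments can push their partners below the cutoff,
        # so the tally is repeated until nothing else fails.
        valid -= failed
```

The second round matters: in a chain such as A-B, B-C with a cutoff of two, A and C fail first, which then leaves B with no interactions.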

find_binning_fragment_corrections(mindistance=0, maxdistance=0, model=['gc', 'len'], num_bins=[10, 10], parameters=['even', 'even'], learning_threshold=1.0, max_iterations=100, usereads='cis', regions=[], precorrect=False)

Using multivariate binning model, learn correction values for combinations of model parameter bins.

Parameters:
  • mindistance (int.) – The minimum inter-fragment distance to be included in modeling.
  • maxdistance (int.) – The maximum inter-fragment distance to be included in modeling.
  • model (list) – A list of fragment features to be used in model. Valid values are ‘len’ and any features included in the creation of the associated Fragment object.
  • num_bins (list) – The number of approximately equal-sized bins to divide each model component into.
  • remove_distance (bool.) – Use distance dependence curve in prior probability calculation for each observation.
  • learning_threshold (float) – The minimum change in log-likelihood needed to continue iterative learning process.
  • max_iterations (int.) – The maximum number of iterations to use for learning model parameters.
  • usereads (str.) – Specifies which set of interactions to use, ‘cis’, ‘trans’, or ‘all’.
  • regions (list) – A list of regions to calculate corrections for. If set as None, all region corrections are found.
  • precorrect (bool.) – Use fragment-based corrections in expected value calculations, resulting in a chained normalization approach.
  • parameters (list) – A list of types, one for each model parameter. Types can be either ‘even’ or ‘fixed’, indicating whether each parameter bin should contain approximately even numbers of interactions or be of fixed width spanning 1 / Nth of the range of the parameter’s values, respectively. Parameter types can also have the suffix ‘-const’ to indicate that the parameter should not be optimized.

Returns:

None
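The difference between the ‘even’ and ‘fixed’ parameter types comes down to how bin edges are chosen. The sketch below is an approximation with hypothetical function names: it treats each listed value as one observation, whereas the description above balances numbers of interactions.

```python
def fixed_edges(values, num_bins):
    """Edges of num_bins bins, each spanning 1/num_bins of the value range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    return [lo + width * i for i in range(num_bins + 1)]


def even_edges(values, num_bins):
    """Edges chosen so each bin holds roughly the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    # Take the value found at each 1/num_bins step through the sorted list.
    edges = [ordered[(i * n) // num_bins] for i in range(num_bins)]
    edges.append(ordered[-1])
    return edges
```

For skewed features the two choices differ markedly: with values 1, 2, 3, 4, 10, 20, 30, 100 and two bins, the fixed split falls at 50.5 while the even split falls at 10.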

find_distance_parameters()

Regress log counts versus inter-fragment distances to find slope and intercept values and then find the standard deviation of corrected counts.

Returns:None
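The regression described above can be sketched with an ordinary least-squares fit. This is a minimal illustration, assuming that distances are also log-transformed, that only non-zero counts are passed in, and that the standard deviation is the population form; none of these details is stated above, and the function name is hypothetical.

```python
import math


def fit_distance_parameters(distances, counts):
    """Fit log(count) = intercept + slope * log(distance).

    Returns (slope, intercept, sigma), where sigma is the standard
    deviation of the log counts after removing the fitted trend.
    """
    xs = [math.log(d) for d in distances]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residuals are the distance-corrected log counts.
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sigma = math.sqrt(sum(r * r for r in residuals) / n)
    return slope, intercept, sigma
```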
find_express_fragment_corrections(mindistance=0, maxdistance=0, iterations=1000, remove_distance=False, usereads='cis', regions=[], precorrect=False)

Using iterative approximation, learn correction values for each valid fragment.

Parameters:
  • mindistance (int.) – The minimum inter-fragment distance to be included in modeling.
  • maxdistance (int.) – The maximum inter-fragment distance to be included in modeling.
  • iterations (int.) – The number of iterations to use for learning fragment corrections.
  • remove_distance (bool.) – Specifies whether the estimated distance-dependent portion of the signal is removed prior to learning fragment corrections.
  • usereads (str.) – Specifies which set of interactions to use, ‘cis’, ‘trans’, or ‘all’.
  • regions (list) – A list of regions to calculate corrections for. If set as None, all region corrections are found.
  • precorrect (bool.) – Use binning-based corrections in expected value calculations, resulting in a chained normalization approach.
Returns:

None
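One generic form of iterative approximation is sketched below. It is not claimed to be the update rule HiFive uses: the model (log count of a pair equals the sum of the two fragment corrections), the half-step update, and the function name are all assumptions made for illustration.

```python
import math


def approximate_corrections(counts, iterations=1000):
    """Estimate log-scale fragment corrections by repeated averaging.

    counts maps (fragment_a, fragment_b) to an observed count. Each
    iteration moves every correction by half of the mean residual of
    that fragment's interactions under the current estimates.
    """
    frags = sorted({f for pair in counts for f in pair})
    corr = dict.fromkeys(frags, 0.0)
    log_counts = {pair: math.log(n) for pair, n in counts.items() if n > 0}
    for _ in range(iterations):
        total = dict.fromkeys(frags, 0.0)
        seen = dict.fromkeys(frags, 0)
        for (a, b), y in log_counts.items():
            resid = y - corr[a] - corr[b]
            for f in (a, b):
                total[f] += resid
                seen[f] += 1
        # Apply all updates together after the residuals are tallied.
        for f in frags:
            if seen[f]:
                corr[f] += 0.5 * total[f] / seen[f]
    return corr
```

The half step splits each residual between the two fragments of a pair, which keeps the updates from overshooting.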

find_probability_fragment_corrections(mindistance=0, maxdistance=0, burnin_iterations=5000, annealing_iterations=10000, learningrate=0.1, precalculate=True, regions=[], precorrect=False)

Using gradient descent, learn correction values for each valid fragment based on a Log-Normal distribution of observations.

Parameters:
  • mindistance (int.) – The minimum inter-fragment distance to be included in modeling.
  • maxdistance (int.) – The maximum inter-fragment distance to be included in modeling.
  • burnin_iterations (int.) – The number of iterations to use with constant learning rate in gradient descent for learning fragment corrections.
  • annealing_iterations (int.) – The number of iterations to use with a linearly-decreasing learning rate in gradient descent for learning fragment corrections.
  • learningrate (float) – The gradient scaling factor for parameter updates.
  • precalculate (bool.) – Specifies whether the correction values should be initialized at the fragment means.
  • regions (list) – A list of regions to calculate corrections for. If set as None, all region corrections are found.
  • precorrect (bool.) – Use binning-based corrections in expected value calculations, resulting in a chained normalization approach.
Returns:

None
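The two iteration counts describe a learning-rate schedule: a constant rate during burn-in followed by a linear decrease. A minimal sketch of such a schedule follows; that the rate decreases toward zero over the annealing phase is an assumption, and the function name is hypothetical.

```python
def learning_rate_schedule(learningrate=0.1, burnin_iterations=5000,
                           annealing_iterations=10000):
    """Yield one learning rate per gradient descent iteration."""
    # Burn-in phase: the rate is held constant.
    for _ in range(burnin_iterations):
        yield learningrate
    # Annealing phase: the rate shrinks linearly with each step.
    for step in range(annealing_iterations):
        yield learningrate * (1.0 - step / annealing_iterations)
```

Each yielded value would scale the gradient in the corresponding parameter update.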

find_trans_mean()

Calculate the mean signal across all valid fragment-pair trans (inter-region) interactions.

Returns:None

load()

Load analysis parameters from h5dict specified at object creation and open h5dicts for associated FiveCData and Fragment objects.

Any call of this function will overwrite current object data with values from the last save() call.

Returns:None

load_data(filename)

Load fragment-pair counts and fragment object from FiveCData object.

Parameters:filename (str.) – Specifies the file name of the FiveCData object to associate with this analysis.
Returns:None

save(out_fname=None)

Save analysis parameters to h5dict.

Parameters:out_fname (str.) – Specifies the file name of the FiveC object to save this analysis to.
Returns:None

trans_heatmap(region1, region2, binsize=1000000, binbounds1=None, start1=None, stop1=None, startfrag1=None, stopfrag1=None, binbounds2=None, start2=None, stop2=None, startfrag2=None, stopfrag2=None, datatype='enrichment', arraytype='full', returnmapping=False, dynamically_binned=False, minobservations=0, searchdistance=0, expansion_binsize=0, removefailed=False, skipfiltered=False, image_file=None, **kwargs)

Return a heatmap of trans data of the type and shape specified by the passed arguments.

This function returns a heatmap for trans interactions between two regions, bounded by either ‘start1’, ‘stop1’, ‘start2’ and ‘stop2’ or ‘startfrag1’, ‘stopfrag1’, ‘startfrag2’, and ‘stopfrag2’ (‘start’ and ‘stop’ take precedence). The data in the array is determined by the ‘datatype’, being raw, fragment-corrected, distance-corrected, enrichment, or expected data. The array shape is always rectangular but can be either compact (which returns two arrays) or full. See fivec_binning for further explanation of ‘datatype’ and ‘arraytype’. If using dynamic binning (‘dynamically_binned’ is set to True), ‘minobservations’, ‘searchdistance’, ‘expansion_binsize’, and ‘removefailed’ are used to control the dynamic binning process. Otherwise these arguments are ignored.

Parameters:
  • region1 (int.) – The index of the first region to obtain data from.
  • region2 (int.) – The index of the second region to obtain data from.
  • binsize (int.) – This is the coordinate width of each bin.
  • binbounds1 (numpy array) – An array containing start and stop coordinates for a set of user-defined bins for ‘region1’. Any fragment not falling in a bin is ignored.
  • start1 (int.) – The coordinate at the beginning of the smallest bin from ‘region1’. If unspecified, ‘start1’ will be the first multiple of ‘binsize’ below the ‘startfrag1’ midpoint. If there is a conflict between ‘start1’ and ‘startfrag1’, ‘start1’ is given preference. Optional.
  • stop1 (int.) – The largest coordinate to include in the array from ‘region1’, measured from fragment midpoints. If both ‘stop1’ and ‘stopfrag1’ are given, ‘stop1’ will override ‘stopfrag1’. ‘stop1’ will be shifted higher as needed to make the last bin of size ‘binsize’. Optional.
  • startfrag1 (int.) – The first fragment from ‘region1’ to include in the array. If unspecified and ‘start1’ is not given, this is set to the first valid fragment in ‘region1’. In cases where ‘start1’ is specified and conflicts with ‘startfrag1’, ‘start1’ is given preference. Optional.
  • stopfrag1 (int.) – The first fragment not to include in the array from ‘region1’. If unspecified and ‘stop1’ is not given, this is set to the last valid fragment in ‘region1’ + 1. In cases where ‘stop1’ is specified and conflicts with ‘stopfrag1’, ‘stop1’ is given preference. Optional.
  • binbounds2 (numpy array) – An array containing start and stop coordinates for a set of user-defined bins for ‘region2’. Any fragment not falling in a bin is ignored.
  • start2 (int.) – The coordinate at the beginning of the smallest bin from ‘region2’. If unspecified, ‘start2’ will be the first multiple of ‘binsize’ below the ‘startfrag2’ midpoint. If there is a conflict between ‘start2’ and ‘startfrag2’, ‘start2’ is given preference. Optional.
  • stop2 (int.) – The largest coordinate to include in the array from ‘region2’, measured from fragment midpoints. If both ‘stop2’ and ‘stopfrag2’ are given, ‘stop2’ will override ‘stopfrag2’. ‘stop2’ will be shifted higher as needed to make the last bin of size ‘binsize’. Optional.
  • startfrag2 (int.) – The first fragment from ‘region2’ to include in the array. If unspecified and ‘start2’ is not given, this is set to the first valid fragment in ‘region2’. In cases where ‘start2’ is specified and conflicts with ‘startfrag2’, ‘start2’ is given preference. Optional.
  • stopfrag2 (int.) – The first fragment not to include in the array from ‘region2’. If unspecified and ‘stop2’ is not given, this is set to the last valid fragment in ‘region2’ + 1. In cases where ‘stop2’ is specified and conflicts with ‘stopfrag2’, ‘stop2’ is given preference. Optional.
  • datatype (str.) – This specifies the type of data that is processed and returned. Options are ‘raw’, ‘distance’, ‘fragment’, ‘enrichment’, and ‘expected’. Observed values are always in the first index along the last axis, except when ‘datatype’ is ‘expected’. In this case, filter values replace counts. Conversely, if ‘raw’ is specified, non-filtered bins return a value of 1. Expected values are returned for ‘distance’, ‘fragment’, ‘enrichment’, and ‘expected’ values of ‘datatype’. ‘distance’ uses only the expected signal given distance for calculating the expected values, ‘fragment’ uses only fragment correction values, and both ‘enrichment’ and ‘expected’ use both correction and distance mean values.
  • arraytype (str.) – This determines what shape of array data are returned in. Acceptable values are ‘compact’ (if unbinned) and ‘full’. ‘compact’ means data are arranged in a N x M x 2 array where N and M are the number of forward and reverse probe fragments, respectively. If compact is selected, only data for the forward primers of ‘region1’ and reverse primers of ‘region2’ are returned. ‘full’ returns a rectangular array of size N x M x 2 where N and M are the total number of fragments or bins in ‘region1’ and ‘region2’, respectively.
  • returnmapping (bool.) – If ‘True’, a list containing the data array(s) and mapping information is returned. Otherwise only the data array(s) is returned.
  • dynamically_binned (bool.) – If ‘True’, return dynamically binned data.
  • minobservations (int.) – The fewest number of observed reads needed for a bin to be counted as valid and stop expanding.
  • searchdistance (int.) – The furthest distance from the bin midpoint to expand bounds. If this is set to zero, there is no limit on expansion distance.
  • expansion_binsize (int.) – The size of bins to use for data to pull from when expanding dynamic bins. If set to zero, unbinned data is used.
  • removefailed (bool.) – If a non-zero ‘searchdistance’ is given, it is possible for a bin not to meet the ‘minobservations’ criteria before the search stops. If this occurs and ‘removefailed’ is True, the observed and expected values for that bin are set to zero.
  • skipfiltered (bool.) – If ‘True’, all interaction bins for filtered out fragments are removed and a reduced-size array is returned.
  • image_file (str.) – If a filename is specified, a PNG image file is written containing the heatmap data. Arguments for the appearance of the image can be passed as additional keyword arguments.
Returns:

Array in format requested with ‘arraytype’ containing inter-region data requested with ‘datatype’. If ‘returnmapping’ is True, a list is returned with mapping information. If ‘arraytype’ is ‘full’, a single data array and two 1d arrays of fragments corresponding to rows and columns, respectively, are returned. If ‘arraytype’ is ‘compact’, two data arrays are returned (forward1 by reverse2 and forward2 by reverse1) along with forward and reverse fragment positions for each array for a total of 5 arrays.
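How fragment midpoints map onto fixed-width bins for one region can be sketched as follows. This is an illustration only: the function name is hypothetical, the default start is taken as the multiple of ‘binsize’ at or below the smallest midpoint, and fragments below the start are marked as unassigned.

```python
def assign_bins(midpoints, binsize, start=None):
    """Return (start, bin_indices) for a list of fragment midpoints.

    Fragments whose midpoint falls below start get None instead of a
    bin index.
    """
    if start is None:
        # Snap the start down to a multiple of binsize.
        start = (min(midpoints) // binsize) * binsize
    bins = [(mid - start) // binsize if mid >= start else None
            for mid in midpoints]
    return start, bins
```

Applying this to each region separately gives the row and column bin of every fragment pair in a binned trans array.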

write_heatmap(filename, binsize, includetrans=True, datatype='enrichment', arraytype='full', regions=[])

Create an h5dict file containing binned interaction arrays, bin positions, and an index of included regions.

Parameters:
  • filename (str.) – Location to write h5dict object to.
  • binsize (int.) – Size of bins for interaction arrays. If “binsize” is zero, fragment interactions are returned without binning.
  • includetrans (bool.) – Indicates whether trans interaction arrays should be calculated and saved.
  • datatype (str.) – This specifies the type of data that is processed and returned. Options are ‘raw’, ‘distance’, ‘fragment’, ‘enrichment’, and ‘expected’. Observed values are always in the first index along the last axis, except when ‘datatype’ is ‘expected’. In this case, filter values replace counts. Conversely, if ‘raw’ is specified, non-filtered bins return a value of 1. Expected values are returned for ‘distance’, ‘fragment’, ‘enrichment’, and ‘expected’ values of ‘datatype’. ‘distance’ uses only the expected signal given distance for calculating the expected values, ‘fragment’ uses only fragment correction values, and both ‘enrichment’ and ‘expected’ use both correction and distance mean values.
  • arraytype (str.) – This determines what shape of array data are returned in. Acceptable values are ‘compact’ and ‘full’. ‘compact’ means data are arranged in a N x M x 2 array where N is the number of bins, M is the maximum number of steps between included bin pairs, and data are stored such that bin n,m contains the interaction values between n and n + m + 1. ‘full’ returns a square, symmetric array of size N x N x 2.
  • regions (list) – If given, indicates which regions should be included. If left empty, all regions are included.
Returns:

None
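The ‘compact’ layout stores the interaction between bins n and n + m + 1 at position n,m. A minimal sketch of expanding it into the ‘full’ layout with plain nested lists follows; the function name is hypothetical, and leaving the diagonal and unreachable pairs at zero is an assumption.

```python
def compact_to_full(compact):
    """Expand an N x M x 2 compact nested list into a symmetric N x N x 2 one.

    compact[n][m] holds the two values for the pair (n, n + m + 1).
    """
    n_bins = len(compact)
    full = [[[0.0, 0.0] for _ in range(n_bins)] for _ in range(n_bins)]
    for n, row in enumerate(compact):
        for m, cell in enumerate(row):
            partner = n + m + 1
            # Entries that would fall past the last bin are padding.
            if partner < n_bins:
                full[n][partner] = list(cell)
                full[partner][n] = list(cell)
    return full
```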