API

gendata

class gendata.IntGenoData(genotypes: DataFrame, snps: DataFrame, samples: DataFrame)

Bases: AbstractGenoData

A class to hold and perform basic operations on integer genotype data.

property af: Series

Calculate the allele frequency of each SNP.

Returns:

A series of allele frequencies indexed by rsID.

Return type:

pd.Series
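
As a rough illustration of what an allele frequency indexed by rsID looks like, the sketch below computes one from a plain pandas data frame. The toy data, the 0/1/2 dosage coding and the helper name `allele_frequency` are assumptions made for the example; they are not taken from the gendata implementation.

```python
import pandas as pd

# Hypothetical toy data: rows are samples, columns are rsIDs. Each value is
# ASSUMED to be the number of copies (0, 1 or 2) of the counted allele;
# NaN marks a missing call.
genotypes = pd.DataFrame(
    {"rs1": [0.0, 1.0, 2.0, 1.0], "rs2": [0.0, 0.0, 1.0, float("nan")]},
    index=["s1", "s2", "s3", "s4"],
)


def allele_frequency(genotypes: pd.DataFrame) -> pd.Series:
    """Mean dosage divided by two for each SNP, skipping missing calls."""
    return genotypes.mean(axis=0, skipna=True) / 2


af = allele_frequency(genotypes)  # Series indexed by rsID
```

Dividing the mean dosage by two works because each diploid sample carries two alleles per SNP.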

flip_snps(*rsids: str)

Flip A1 and A2 for the selected SNPs.

Parameters:

rsids (str) – rsIDs of the SNPs to flip, passed as separate positional arguments.

Returns:

IntGenoData object.

Return type:

Type[IntGenoData]
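
A minimal sketch of what flipping means for integer genotypes, assuming a 0/1/2 dosage coding and an SNP table with columns named `A1` and `A2` (both are assumptions for illustration, as is the helper `flip`): the dosage g becomes 2 - g and the two allele labels are swapped.

```python
import pandas as pd

# Hypothetical toy data; column names "A1"/"A2" are assumed for the example.
genotypes = pd.DataFrame({"rs1": [0, 1, 2], "rs2": [2, 2, 0]})
snps = pd.DataFrame({"A1": ["A", "C"], "A2": ["G", "T"]}, index=["rs1", "rs2"])


def flip(genotypes: pd.DataFrame, snps: pd.DataFrame, *rsids: str):
    """Return copies with the dosage and allele labels flipped for rsids."""
    selected = list(rsids)
    new_geno = genotypes.copy()
    new_snps = snps.copy()
    # Counting the other allele turns a dosage g into 2 - g.
    new_geno[selected] = 2 - new_geno[selected]
    # Swap the allele labels; to_numpy() avoids alignment by column name.
    new_snps.loc[selected, ["A1", "A2"]] = new_snps.loc[selected, ["A2", "A1"]].to_numpy()
    return new_geno, new_snps


new_geno, new_snps = flip(genotypes, snps, "rs1")
```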

hwe(midp: bool) → Series

Calculate the exact HWE p-values for each SNP.

Parameters:

midp (bool) – Apply midp adjustment.

Returns:

A series of HWE p-values indexed by rsID.

Return type:

pd.Series
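
For readers unfamiliar with the exact test, the following sketch computes an exact HWE p-value for a single SNP from its three genotype counts. It is not the gendata implementation: it uses exact rational arithmetic, which is only practical for small samples, and the function name `hwe_exact` is hypothetical. The mid-p variant here is assumed to subtract half the probability of the observed heterozygote count from the ordinary p-value.

```python
from fractions import Fraction
from math import factorial


def hwe_exact(n_hom1: int, n_het: int, n_hom2: int, midp: bool = False) -> float:
    """Exact HWE p-value for one biallelic SNP (illustrative, small samples)."""
    n = n_hom1 + n_het + n_hom2   # genotyped samples
    n_a = 2 * n_hom1 + n_het      # copies of A1
    n_b = 2 * n - n_a             # copies of A2

    def prob(k: int) -> Fraction:
        # Probability of k heterozygotes given n, n_a and n_b under HWE.
        hom_a, hom_b = (n_a - k) // 2, (n_b - k) // 2
        num = 2**k * factorial(n) * factorial(n_a) * factorial(n_b)
        den = factorial(hom_a) * factorial(k) * factorial(hom_b) * factorial(2 * n)
        return Fraction(num, den)

    # Possible heterozygote counts share the parity of n_a.
    probs = [prob(k) for k in range(n_a % 2, min(n_a, n_b) + 1, 2)]
    p_obs = prob(n_het)
    # Sum every outcome that is no more likely than the observed one.
    p_value = sum(p for p in probs if p <= p_obs)
    if midp:
        p_value -= p_obs / 2  # assumed mid-p definition
    return float(p_value)


p_plain = hwe_exact(1, 0, 1)
p_mid = hwe_exact(1, 0, 1, midp=True)
```

With two samples that are homozygous for different alleles, the only alternative outcome is two heterozygotes, so the ordinary p-value is 1/3 and the mid-p value is 1/6.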

property maf: Series

Calculate the minor allele frequency for each SNP.

Returns:

A series of minor allele frequencies indexed by rsID.

Return type:

pd.Series

property min_maf: float

Find the minimum minor allele frequency across all SNPs.

Returns:

The minimum minor allele frequency.

Return type:

float

property n_het: Series

Number of heterozygous samples for each SNP.

Returns:

Count of heterozygous samples by rsID.

Return type:

pd.Series

property n_hom1: Series

Number of samples homozygous for the reference allele.

Returns:

Count of homozygous (A1) samples by rsID.

Return type:

pd.Series

property n_hom2: Series

Number of samples homozygous for the alternate allele.

Returns:

Count of homozygous (A2) samples by rsID.

Return type:

pd.Series
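
The three genotype counts can be pictured as column-wise tallies of an integer genotype table. The sketch below assumes a coding of 0 = A1/A1, 1 = A1/A2 and 2 = A2/A2; the source does not state which coding gendata uses, so treat the mapping as illustrative only.

```python
import pandas as pd

# Hypothetical toy data: rows are samples, columns are rsIDs.
genotypes = pd.DataFrame({"rs1": [0, 1, 2, 1], "rs2": [0, 0, 1, 2]})

# ASSUMED coding: 0 = A1/A1, 1 = A1/A2, 2 = A2/A2.
n_hom1 = (genotypes == 0).sum(axis=0)
n_het = (genotypes == 1).sum(axis=0)
n_hom2 = (genotypes == 2).sum(axis=0)
```

Each result is a Series indexed by rsID, and the three counts sum to the number of samples with a non-missing call.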

save(out: str)

Save the integer genotype data to a .bed/.bim/.fam fileset.

Parameters:

out (str) – The path to save the data to. The path should not include the file extension, which will be added automatically.
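
To make the extension handling concrete, this sketch shows which three files an extension-free prefix would correspond to. The helper `fileset_paths` is hypothetical and only builds the names; it does not write anything and is not part of gendata.

```python
from pathlib import Path


def fileset_paths(out: str) -> dict[str, Path]:
    """Map each fileset extension to the path derived from the prefix."""
    # The extension is appended rather than substituted, so a prefix such as
    # "chr1.filtered" keeps its own dot-separated parts.
    return {ext: Path(f"{out}.{ext}") for ext in ("bed", "bim", "fam")}


paths = fileset_paths("results/chr1")
```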

standardised()

Standardise genotypes.

Returns:

A standardised genotype data object.

Return type:

Type[StdGenoData]
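
One common way to standardise genotypes is a per-SNP z-score, sketched below with NumPy. Whether gendata centres and scales with the sample mean and standard deviation or with allele-frequency-based values is not stated in the source, so this is an assumption.

```python
import numpy as np

# Hypothetical toy data: rows are samples, columns are SNPs (0/1/2 dosages).
genotypes = np.array([[0, 2], [1, 1], [2, 0], [1, 1]], dtype=float)


def standardise(g: np.ndarray) -> np.ndarray:
    """Centre each SNP column to mean 0 and scale it to unit variance."""
    # A monomorphic SNP has zero standard deviation and would need to be
    # filtered out before this step.
    return (g - g.mean(axis=0)) / g.std(axis=0)


z = standardise(genotypes)
```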

standardized()

Standardise genotypes.

Returns:

A standardised genotype data object.

Return type:

Type[StdGenoData]

static to_maf(af: float) → float

Convert a biallelic allele frequency to a minor allele frequency.

Parameters:

af (float) – Allele frequency.

Raises:

ValueError – If af is not in [0, 1].

Returns:

Minor allele frequency.

Return type:

float
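
For a biallelic SNP the two allele frequencies sum to one, so the minor allele frequency is the smaller of af and 1 - af. A standalone sketch of that conversion, written independently of the gendata source:

```python
def to_maf(af: float) -> float:
    """Convert a biallelic allele frequency to a minor allele frequency."""
    if not 0.0 <= af <= 1.0:
        raise ValueError(f"af must be in [0, 1], got {af!r}")
    return min(af, 1.0 - af)


maf_common = to_maf(0.8)
maf_rare = to_maf(0.3)
```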

class gendata.StdGenoData(genotypes: DataFrame, snps: DataFrame, samples: DataFrame)

Bases: AbstractGenoData

A class to hold and perform operations on standardised genotype data.

calculate_grm(individuals: list[str] | None = None, weights: dict[str, float] | None = None) → GRM

Calculate a full GRM.

Parameters:
  • individuals (Optional[list[str]]) – List of individual IDs to include in the GRM.

  • weights (Optional[dict[str, float]]) – Weights to apply to the GRM.

Returns:

The GRM.

Return type:

GRM
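
A GRM is commonly computed from standardised genotypes Z (samples by SNPs) as Z Zᵀ divided by the number of SNPs. The sketch below follows that textbook form and adds per-SNP weights as an array. The function name `grm`, the array-based weights and the normalisation by the weight sum are assumptions; the method above takes a dict of weights and may normalise differently.

```python
import numpy as np


def grm(z: np.ndarray, weights=None) -> np.ndarray:
    """Weighted genetic relationship matrix from standardised genotypes."""
    n_snps = z.shape[1]
    w = np.ones(n_snps) if weights is None else np.asarray(weights, dtype=float)
    # Scale each SNP column by its weight, then average over SNPs.
    return (z * w) @ z.T / w.sum()


# Hypothetical toy data: 4 samples by 3 polymorphic SNPs.
g = np.array([[0, 2, 1], [1, 1, 0], [2, 0, 1], [1, 1, 2]], dtype=float)
z = (g - g.mean(axis=0)) / g.std(axis=0)
k = grm(z)
```

With z-scored columns the diagonal of the result averages to one, which is a quick sanity check on real data.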

calculate_ldm_blocks(block_map: Series | dict[str, int], n_cores: int | None = None) → dict[int, dict[str, ndarray | DataFrame]]

Calculate a block-wise LD matrix, where each block ends at one of the listed terminal rsids.

Parameters:
  • block_map (Union[pd.Series, dict[str, int]]) – Mapping of rsids to block numbers.

  • n_cores (Optional[int]) – Number of cores to use for parallelisation. If None, defaults to the number of cores available.

Returns:

A dictionary containing an LD matrix dictionary for each block.

Return type:

dict[int, dict[str, Union[np.ndarray, pd.DataFrame]]]
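
The idea of a block-wise LD matrix can be sketched with pandas: group the SNPs by block number and compute one correlation matrix per group. The helper `ld_blocks` is hypothetical and returns a single data frame per block, whereas the method above returns a dictionary per block whose keys are not described in the source.

```python
import pandas as pd

# Hypothetical toy data: rows are samples, columns are rsIDs.
genotypes = pd.DataFrame(
    [[0, 2, 1, 0], [1, 1, 0, 2], [2, 0, 1, 1], [1, 1, 2, 1]],
    columns=["rs1", "rs2", "rs3", "rs4"],
    dtype=float,
)
block_map = {"rs1": 0, "rs2": 0, "rs3": 1, "rs4": 1}


def ld_blocks(genotypes: pd.DataFrame, block_map) -> dict[int, pd.DataFrame]:
    """Pearson correlation matrix of the SNPs within each block."""
    blocks = pd.Series(block_map)
    result = {}
    for block, members in blocks.groupby(blocks):
        rsids = list(members.index)
        result[int(block)] = genotypes[rsids].corr()
    return result


ld = ld_blocks(genotypes, block_map)
```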

calculate_ldm_window(window: int | None = None, n_cores: int | None = None, sparse: bool = True, tol: float = 0.001) → dict[int, csr_matrix]

Calculate a sparse windowed LD matrix in CSR format.

Note: These calculations are quite fast, so parallelisation may not always be necessary.

Parameters:
  • window (Optional[int]) – Set LD correlations to zero for all pairs of SNPs separated by a distance greater than window. If None, the window will be set to the maximum distance between SNPs.

  • n_cores (Optional[int]) – Number of cores to use for parallelisation. If None, defaults to the number of cores available.

  • sparse (bool) – Whether to return sparse matrices.

  • tol (float) – Tolerance for sparse matrix construction.

Returns:

Dictionary containing a sparse LD matrix for each chromosome.

Return type:

dict[int, sp.csr_matrix]
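
A sketch of the windowing idea for a single chromosome, using NumPy and SciPy. It assumes that distance is measured in base-pair positions and that tol drops correlations whose absolute value falls below it; neither detail is confirmed by the source, and the helper `windowed_ld` is hypothetical.

```python
import numpy as np
from scipy import sparse


def windowed_ld(z: np.ndarray, positions, window=None, tol: float = 1e-3):
    """Sparse LD matrix keeping only SNP pairs within `window` of each other."""
    n_samples = z.shape[0]
    # For column-standardised z (population sd) this is the correlation matrix.
    r = z.T @ z / n_samples
    pos = np.asarray(positions)
    dist = np.abs(pos[:, None] - pos[None, :])
    if window is None:
        window = dist.max()  # keep every pair
    keep = (dist <= window) & (np.abs(r) >= tol)
    return sparse.csr_matrix(np.where(keep, r, 0.0))


# Hypothetical toy data: 4 samples by 3 SNPs with assumed positions.
g = np.array([[0, 2, 0], [1, 1, 1], [2, 0, 2], [1, 1, 2]], dtype=float)
z = (g - g.mean(axis=0)) / g.std(axis=0)
ld_all = windowed_ld(z, positions=[100, 200, 1000])
ld_near = windowed_ld(z, positions=[100, 200, 1000], window=150)
```

Building the dense matrix first is only reasonable for small inputs; it is used here to keep the example short.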

flip_snps(*rsids: str)

Flip A1 and A2 for the selected SNPs.

Parameters:

rsids (str) – rsIDs of the SNPs to flip, passed as separate positional arguments.

Returns:

StdGenoData object.

Return type:

Type[StdGenoData]
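
For z-scored genotypes, flipping the counted allele is equivalent to negating the SNP column, because standardising 2 - g gives the negative of standardising g. The check below demonstrates this with NumPy; the z-score form of standardisation is an assumption about how the data were prepared.

```python
import numpy as np


def standardise(g: np.ndarray) -> np.ndarray:
    """Column-wise z-score (assumed form of standardisation)."""
    return (g - g.mean(axis=0)) / g.std(axis=0)


g = np.array([[0.0], [1.0], [2.0], [2.0]])  # one SNP, four samples
z = standardise(g)
z_flipped = standardise(2 - g)              # standardise the flipped dosages
same_as_negation = bool(np.allclose(z_flipped, -z))
```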

gendata.merge(*genotype_data: Type[AbstractGenoData]) → Type[AbstractGenoData]

Merge sets of genotype data for different chromosomes/loci from the same set of samples.

Parameters:

genotype_data (AbstractGenoData) – Genetic data objects to merge.

Raises:
  • TypeError – If genetic data inputs are not all of the same object type.

  • ValueError – If the set of SNPs overlaps across genetic data inputs.

  • ValueError – If the set of samples is not the same across all genetic data inputs.

Returns:

Merged genetic data.
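
The three checks listed above can be sketched on plain data frames (samples as rows, SNPs as columns). The helper `merge_genotypes` is hypothetical, and requiring the sample index to match in order as well as content is an assumption.

```python
import pandas as pd


def merge_genotypes(*frames: pd.DataFrame) -> pd.DataFrame:
    """Column-wise merge of genotype tables that share the same samples."""
    if not frames:
        raise ValueError("No genotype data supplied.")
    if len({type(frame) for frame in frames}) > 1:
        raise TypeError("All inputs must be of the same type.")
    seen: set = set()
    for frame in frames:
        if not frame.index.equals(frames[0].index):
            raise ValueError("Samples differ across inputs.")
        overlap = seen & set(frame.columns)
        if overlap:
            raise ValueError(f"SNPs present in more than one input: {sorted(overlap)}")
        seen |= set(frame.columns)
    return pd.concat(frames, axis=1)


# Hypothetical per-chromosome tables for the same two samples.
chr1 = pd.DataFrame({"rs1": [0, 1]}, index=["s1", "s2"])
chr2 = pd.DataFrame({"rs2": [2, 1]}, index=["s1", "s2"])
merged = merge_genotypes(chr1, chr2)
```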

gendata.read_bed(paths: str | list[str], rsids: list | None = None, individuals: list | None = None, num_threads: int | None = 1) → IntGenoData

Read raw genotypes into an annotated data frame.

Can take a direct path to a single .bed/.bim/.fam fileset or a list of paths to several filesets.

Parameters:
  • paths (Union[str, list[str]]) – Path or list of paths to the .bed/.bim/.fam filesets to load together. If multiple paths are given, the data will be merged after loading.

  • rsids (Optional[list], optional) – Filter SNPs to this set of rsIDs. If not provided, no filtering will occur.

  • individuals (Optional[list], optional) – Filters samples to this list of individuals. If not provided, no filtering will occur.

  • num_threads (Optional[int], optional) – Specifies the number of threads to use when reading bed files. Defaults to 1.

Returns:

Annotated genotype data object.

Return type:

IntGenoData
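
A possible end-to-end use of the calls documented on this page is sketched below. It relies only on the signatures listed here (`read_bed`, the `maf` property and `hwe`); the wrapper name `summarise_fileset`, its `min_maf` threshold and the path passed to it are hypothetical. Whether the path should include a file extension is not stated in the source.

```python
def summarise_fileset(path: str, min_maf: float = 0.01):
    """Load a fileset and return MAF and HWE p-values for common SNPs."""
    # Imported lazily so the sketch can be defined without gendata installed.
    import gendata

    data = gendata.read_bed(path, num_threads=1)
    maf = data.maf                # Series indexed by rsID
    hwe_p = data.hwe(midp=True)   # Series indexed by rsID
    keep = maf[maf >= min_maf].index
    return maf.loc[keep], hwe_p.loc[keep]
```

Calling `summarise_fileset` with the location of a real fileset would return two Series restricted to SNPs at or above the chosen threshold.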

core

gendata.core

A module containing functions and classes to facilitate the importing of genetic data.

cov

gendata.cov

Module containing functions for calculating covariances and covariance matrices.

grm

gendata.grm

Module to hold classes and functions related to genetic relationship matrices.

utils

gendata.utils