API

gendata

class gendata.IntGenoData(genotypes: DataFrame, snps: DataFrame, samples: DataFrame)

Bases: AbstractGenoData

A class to hold and perform basic operations on integer genotype data.

property af: Series

Calculate the allele frequency of each SNP.

Returns:: A series of allele frequencies indexed by rsID.
Return type:: pd.Series
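
As a rough illustration of what an allele frequency means for integer genotypes, the sketch below computes it for a single SNP in plain Python. It is not the library's implementation: the helper `allele_frequency`, the 0/1/2 allele-count coding, the use of `None` for missing calls, and the example data are all assumptions made for this example.

```python
def allele_frequency(genotypes):
    """Frequency of the counted allele for one SNP.

    `genotypes` holds per-sample allele counts (0, 1 or 2); None marks a
    missing call. Which allele is counted is an assumption of this sketch.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        return float("nan")
    # Each called sample contributes two alleles to the denominator.
    return sum(called) / (2 * len(called))


# Hypothetical genotype calls for two SNPs across five samples.
calls = {
    "rs1": [0, 1, 2, 1, None],
    "rs2": [0, 0, 0, 1, 0],
}
af = {rsid: allele_frequency(g) for rsid, g in calls.items()}
```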

flip_snps(*rsids: str)

Flip A1 and A2 for the selected SNPs.

Parameters:: rsids (str) – List of rsids to flip.
Returns:: IntGenoData object.
Return type:: Type[IntGenoData]
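
To show what flipping implies for the stored values, here is a minimal sketch that assumes integer genotypes are coded as the count of one allele (0, 1 or 2). Under that assumption, swapping A1 and A2 maps each count g to 2 - g, and the allele frequency changes from p to 1 - p. The helpers `flip_counts` and `flip_alleles` and the dictionary layout are hypothetical and are not part of gendata.

```python
def flip_counts(genotypes):
    """Recode allele counts after swapping the two alleles (None = missing)."""
    return [None if g is None else 2 - g for g in genotypes]


def flip_alleles(snp):
    """Return a copy of a SNP annotation with its A1 and A2 labels swapped."""
    flipped = dict(snp)
    flipped["a1"], flipped["a2"] = snp["a2"], snp["a1"]
    return flipped


# Hypothetical SNP annotation and genotype calls.
snp = {"rsid": "rs1", "a1": "A", "a2": "G"}
counts = [0, 1, 2, None]

flipped_snp = flip_alleles(snp)
flipped_counts = flip_counts(counts)
```

For standardised genotypes, the analogous operation under the same assumption would be a sign change of each value.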

hwe(midp: bool) → Series

Calculate the exact HWE p-values for each SNP.

Parameters:: midp (bool) – Apply midp adjustment.
Returns:: A series of HWE p-values indexed by rsID.
Return type:: pd.Series
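
For readers unfamiliar with the exact test, the following sketch computes an exact HWE p-value for one SNP from its three genotype counts. It enumerates every heterozygote count compatible with the observed allele counts and sums the probabilities that do not exceed the observed one. The function name `hwe_exact` is hypothetical, and the mid-p rule used here (subtracting half the probability of the observed configuration) is one common convention. Whether gendata uses the same algorithm or convention is not stated in this reference.

```python
import math


def hwe_exact(n_hom1, n_het, n_hom2, midp=False):
    """Exact HWE p-value for one biallelic SNP (illustrative sketch)."""
    n = n_hom1 + n_het + n_hom2
    if n == 0:
        return 1.0
    n_a = 2 * n_hom1 + n_het  # copies of allele A1
    n_b = 2 * n_hom2 + n_het  # copies of allele A2

    def log_prob(het):
        # Log-probability of `het` heterozygotes given the allele counts.
        hom_a = (n_a - het) // 2
        hom_b = (n_b - het) // 2
        return (
            het * math.log(2)
            + math.lgamma(n + 1)
            - math.lgamma(hom_a + 1)
            - math.lgamma(het + 1)
            - math.lgamma(hom_b + 1)
            + math.lgamma(n_a + 1)
            + math.lgamma(n_b + 1)
            - math.lgamma(2 * n + 1)
        )

    # Heterozygote counts must share the parity of the allele counts.
    probs = {h: math.exp(log_prob(h)) for h in range(n_a % 2, min(n_a, n_b) + 1, 2)}
    p_obs = probs[n_het]
    # Sum outcomes at most as likely as the observed one (small tie tolerance).
    p_value = sum(q for q in probs.values() if q <= p_obs * (1 + 1e-9))
    if midp:
        p_value -= 0.5 * p_obs
    return min(1.0, p_value)
```

With two samples carrying one homozygote of each type, the only possible heterozygote counts are 0 and 2, with probabilities 1/3 and 2/3, which makes the function easy to check by hand.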

property maf: Series

Calculate the minor allele frequency for each SNP.

Returns:: A series of minor allele frequencies indexed by rsID.
Return type:: pd.Series

property min_maf: float

Find the minimum minor allele frequency.

Returns:: The minimum minor allele frequency.
Return type:: float

property n_het: Series

Number of heterozygous samples.

Returns:: Count of heterozygous samples by rsID.
Return type:: pd.Series

property n_hom1: Series

Number of samples homozygous for the reference allele.

Returns:: Count of homozygous (A1) samples by rsID.
Return type:: pd.Series

property n_hom2: Series

Number of samples homozygous for the alternate allele.

Returns:: Count of homozygous (A2) samples by rsID.
Return type:: pd.Series

save(out: str)

Save the integer genotype data to a .bed/.bim/.fam fileset.

Parameters:: out (str) – The path to save the data to. The path should not include the file extension, which will be added automatically.

standardised()

Standardise genotypes.

Returns:: A standardised genotype data object.
Return type:: Type[StdGenoData]

standardized()

Standardise genotypes.

Returns:: A standardised genotype data object.
Return type:: Type[StdGenoData]
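
This reference does not state the formula used for standardisation. One common convention, shown below purely as an assumption, centres each SNP's allele counts at 2p and scales by sqrt(2p(1 - p)), where p is the allele frequency. The helper `standardise` is hypothetical and handles a single SNP with no missing calls.

```python
import math


def standardise(genotypes):
    """Standardise one SNP's allele counts (0, 1 or 2) under the
    binomial convention: (g - 2p) / sqrt(2p(1 - p))."""
    n = len(genotypes)
    p = sum(genotypes) / (2 * n)
    sd = math.sqrt(2 * p * (1 - p))
    if sd == 0:
        raise ValueError("a monomorphic SNP cannot be standardised")
    return [(g - 2 * p) / sd for g in genotypes]


# Hypothetical allele counts for four samples; here p = 0.5.
z = standardise([0, 1, 2, 1])
```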

static to_maf(af: float) → float

Convert a biallelic allele frequency to a minor allele frequency.

Parameters:: af (float) – Allele frequency.
Raises:: ValueError – If af is not in [0, 1].
Returns:: Minor allele frequency.
Return type:: float
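
The documented behaviour can be sketched as a standalone function: for a biallelic SNP the minor allele frequency is the smaller of af and 1 - af, and values outside [0, 1] are rejected. This is an illustrative reimplementation written from the description above, not the library's source.

```python
def to_maf(af):
    """Convert a biallelic allele frequency to a minor allele frequency."""
    if not 0.0 <= af <= 1.0:
        raise ValueError(f"af must be in [0, 1], got {af}")
    return min(af, 1.0 - af)
```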

class gendata.StdGenoData(genotypes: DataFrame, snps: DataFrame, samples: DataFrame)

Bases: AbstractGenoData

A class to hold and perform operations on standardised genotype data.

calculate_grm(individuals: list[str] | None = None, weights: dict[str, float] | None = None) → GRM

Calculate a full GRM.

Parameters:

individuals (Optional[list[str]]) – List of individual IDs to include in the GRM.
weights (Optional[dict[str, float]]) – Weights to apply to the GRM.

Returns:

The GRM.

Return type:

GRM
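
As background, a GRM is commonly computed from standardised genotypes as the average over SNPs of the product of two individuals' values. The sketch below follows that definition in plain Python. The function `grm`, the row-per-individual list layout, and the interpretation of weights as per-SNP weights normalised by their sum are assumptions for illustration. The library itself takes weights as a dictionary and returns a GRM object, neither of which is modelled here.

```python
def grm(z, weights=None):
    """Genetic relationship matrix from standardised genotypes.

    `z` is a list of rows, one per individual, each holding one
    standardised value per SNP. `weights` is an optional list of
    per-SNP weights (an assumption of this sketch).
    """
    n_snps = len(z[0])
    w = weights if weights is not None else [1.0] * n_snps
    total = sum(w)
    n = len(z)
    return [
        [sum(wk * z[i][k] * z[j][k] for k, wk in enumerate(w)) / total
         for j in range(n)]
        for i in range(n)
    ]


# Hypothetical standardised values: two individuals, two SNPs.
z = [[1.0, 0.0], [0.0, 2.0]]
g_unweighted = grm(z)
g_weighted = grm(z, weights=[1.0, 3.0])
```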

calculate_ldm_blocks(block_map: Series | dict[str, int], n_cores: int | None = None) → dict[int, dict[str, ndarray | DataFrame]]

Calculate a block-wise LD matrix, where each block ends at one of the listed terminal rsids.

Parameters:

block_map (Union[pd.Series, dict[str, int]]) – Mapping of rsids to block numbers.
n_cores (Optional[int]) – Number of cores to use for parallelisation. If None, defaults to the number of cores available.

Returns:

A dictionary containing an LD matrix dictionary for each block.

Return type:

dict[int, dict[str, Union[np.ndarray, pd.DataFrame]]]
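
The idea of a block-wise LD matrix can be sketched as follows: SNPs are grouped by the block number in `block_map`, and a correlation matrix is computed within each group only. The helpers `pearson` and `ld_blocks`, the column-per-rsID input layout, and the inner dictionary keys `"rsids"` and `"ldm"` are hypothetical. The keys the library actually uses are not given in this reference.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length value lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def ld_blocks(columns, block_map):
    """Compute one LD (correlation) matrix per block.

    `columns` maps rsID -> per-sample genotype values;
    `block_map` maps rsID -> block number.
    """
    members = {}
    for rsid, block in block_map.items():
        members.setdefault(block, []).append(rsid)
    result = {}
    for block, rsids in members.items():
        ldm = [[pearson(columns[a], columns[b]) for b in rsids] for a in rsids]
        result[block] = {"rsids": rsids, "ldm": ldm}
    return result


# Hypothetical genotypes for three SNPs split across two blocks.
columns = {"rs1": [0, 1, 2, 1], "rs2": [2, 1, 0, 1], "rs3": [0, 0, 1, 2]}
blocks = ld_blocks(columns, {"rs1": 1, "rs2": 1, "rs3": 2})
```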

calculate_ldm_window(window: int | None = None, n_cores: int | None = None, sparse: bool = True, tol: float = 0.001) → dict[int, csr_matrix]

Calculate a sparse windowed LD matrix in CSR format.

Note: These calculations are quite fast, so parallelisation may not always be necessary.

Parameters:

window (Optional[int]) – Set LD correlations to zero for all pairs of SNPs separated by a distance greater than window. If None, the window will be set to the maximum distance between SNPs.
n_cores (Optional[int]) – Number of cores to use for parallelisation. If None, defaults to the number of cores available.
sparse (bool) – Whether to make sparse matrices.
tol (float) – Tolerance for sparse matrix construction.

Returns:

Dictionary containing a sparse LD matrix for each chromosome.

Return type:

dict[int, sp.csr_matrix]
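
The windowing and tolerance parameters can be illustrated with a small pure-Python sketch for a single chromosome. It assumes that distance is measured in base-pair positions and that entries with an absolute correlation below `tol` are dropped. Neither detail is confirmed by this reference. The function `windowed_ld` is hypothetical and stores entries in a dictionary keyed by (row, column) rather than in a SciPy CSR matrix.

```python
def windowed_ld(z, positions, window=None, tol=0.001):
    """Sparse windowed LD for one chromosome.

    `z` maps rsID -> standardised column (mean 0, variance 1);
    `positions` maps rsID -> base-pair position (an assumption).
    Returns the position-sorted rsIDs and a {(i, j): r} dictionary.
    """
    rsids = sorted(z, key=positions.get)
    entries = {}
    for i, a in enumerate(rsids):
        for j in range(i, len(rsids)):
            b = rsids[j]
            # SNPs are sorted, so every later SNP is also out of range.
            if window is not None and positions[b] - positions[a] > window:
                break
            r = sum(x * y for x, y in zip(z[a], z[b])) / len(z[a])
            if abs(r) >= tol:
                entries[(i, j)] = r
                entries[(j, i)] = r
    return rsids, entries


# Hypothetical standardised columns and positions.
z = {
    "rs1": [1.0, -1.0, 1.0, -1.0],
    "rs2": [1.0, -1.0, 1.0, -1.0],
    "rs3": [-1.0, 1.0, -1.0, 1.0],
}
positions = {"rs1": 100, "rs2": 150, "rs3": 5000}

order, near = windowed_ld(z, positions, window=1000)
_, full = windowed_ld(z, positions)
```

With a window of 1000, the pair (rs1, rs3) is omitted even though the two SNPs are perfectly anti-correlated. Without a window, the pair is retained.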

flip_snps(*rsids: str)

Flip A1 and A2 for the selected SNPs.

Parameters:: rsids (str) – List of rsids to flip.
Returns:: StdGenoData object.
Return type:: Type[StdGenoData]

gendata.merge(*genotype_data: Type[AbstractGenoData]) → Type[AbstractGenoData]

Merge sets of genotype data for different chromosomes/loci from the same set of samples.

Parameters:

genotype_data (AbstractGenoData) – Genetic data objects to merge.

Raises:

TypeError – If genetic data inputs are not all of the same object type.
ValueError – If the set of SNPs overlaps across genetic data inputs.
ValueError – If the set of samples is not the same across all genetic data inputs.

Returns:

Merged genetic data.
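
The three documented error conditions can be pictured as pre-merge checks. The sketch below uses two hypothetical stand-in classes, `IntData` and `StdData`, holding only rsIDs and sample IDs. The function `check_mergeable` and the order of the checks are assumptions, not the library's code.

```python
from dataclasses import dataclass


@dataclass
class IntData:
    """Hypothetical stand-in for an integer genotype data object."""
    snps: list
    samples: list


@dataclass
class StdData:
    """Hypothetical stand-in for a standardised genotype data object."""
    snps: list
    samples: list


def check_mergeable(*datasets):
    """Raise if the inputs cannot be merged; return None otherwise."""
    first = datasets[0]
    seen = set()
    for data in datasets:
        if type(data) is not type(first):
            raise TypeError("all inputs must be the same genotype data type")
        if set(data.samples) != set(first.samples):
            raise ValueError("all inputs must cover the same set of samples")
        overlap = seen & set(data.snps)
        if overlap:
            raise ValueError(f"SNPs appear in more than one input: {sorted(overlap)}")
        seen |= set(data.snps)


# Hypothetical per-chromosome inputs sharing the same samples.
chr1 = IntData(snps=["rs1", "rs2"], samples=["s1", "s2"])
chr2 = IntData(snps=["rs3"], samples=["s1", "s2"])
check_mergeable(chr1, chr2)
```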

gendata.read_bed(paths: str | list[str], rsids: list | None = None, individuals: list | None = None, num_threads: int | None = 1) → IntGenoData

Read raw genotypes into an annotated data frame.

Can take a direct path to a .bed/.bim/.fam file or a list of paths to a set of .bed/.bim/.fam files.

Parameters:

paths (Union[str, list[str]]) – Paths to the .bed/.bim/.fam filesets to load together. Can be a single path or a list of paths. If multiple paths are given, the data will be merged after loading.
rsids (Optional[list], optional) – Filter SNPs to this set of rsIDs. If not provided, no filtering will occur.
individuals (Optional[list], optional) – Filters samples to this list of individuals. If not provided, no filtering will occur.
num_threads (Optional[int], optional) – Specifies the number of threads to use when reading bed files. Defaults to 1.

Returns: