Usage

Installation

Install the package directly from GitHub with pip:

(venv) $ pip install https://github.com/michaelbennett99/gendata_package/archive/master.zip

Usage

The package provides functions for loading and merging data, and classes for storing and acting on genetic data.

Functions

The package provides two functions, one for loading data and one for merging data.

gendata.read_bed(paths: str | list[str], rsids: list | None = None, individuals: list | None = None, num_threads: int | None = 1) → IntGenoData

Read raw genotypes into an annotated data frame.

Accepts either a single path to a .bed/.bim/.fam fileset or a list of paths to several such filesets.

Parameters:

paths (Union[str, list[str]]) – Path or list of paths to the .bed/.bim/.fam filesets to load together. If multiple paths are given, the data will be merged after loading.
rsids (Optional[list], optional) – Filter SNPs to this set of rsIDs. If not provided, no filtering will occur.
individuals (Optional[list], optional) – Filters samples to this list of individuals. If not provided, no filtering will occur.
num_threads (Optional[int], optional) – Specifies the number of threads to use when reading bed files. Defaults to 1.

Returns:

Annotated genotype data object.

Return type:

IntGenoData
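
As a rough usage sketch, the call below loads two per-chromosome filesets in one go and restricts them to a handful of SNPs. The file paths and rsIDs are hypothetical placeholders, and whether a path should carry the .bed extension is an assumption; only the read_bed signature itself comes from the reference above. The import sits inside the function so the snippet can be defined without the package installed.

```python
def load_two_chromosomes():
    """Load two hypothetical filesets; read_bed merges them after loading."""
    import gendata  # imported lazily; requires the package to be installed

    return gendata.read_bed(
        ["data/chr1.bed", "data/chr2.bed"],  # hypothetical paths
        rsids=["rs123", "rs456"],            # hypothetical rsIDs to keep
        individuals=None,                    # None: keep every sample
        num_threads=2,
    )
```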

gendata.merge(*genotype_data: Type[AbstractGenoData]) → Type[AbstractGenoData]

Merge sets of genotype data for different chromosomes/loci from the same set of samples.

Parameters:

genotype_data (AbstractGenoData) – Genetic data objects to merge.

Raises:

TypeError – If genetic data inputs are not all of the same object type.
ValueError – If the set of SNPs overlaps across genetic data inputs.
ValueError – If the set of samples is not the same across all genetic data inputs.

Returns:

Merged genetic data.
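
The three documented error conditions can be sketched as a small pre-merge check. This is an illustration only: ToyGenoData and check_mergeable are hypothetical names, and the package's own validation may differ in order and detail.

```python
from dataclasses import dataclass


@dataclass
class ToyGenoData:
    """Hypothetical stand-in for a genotype data object."""
    rsids: list
    iids: list


def check_mergeable(*genotype_data):
    """Raise the documented errors if the inputs cannot be merged."""
    if not genotype_data:
        return
    if len({type(g) for g in genotype_data}) > 1:
        raise TypeError("All inputs must be of the same object type.")
    seen = set()
    for g in genotype_data:
        if seen & set(g.rsids):
            raise ValueError("The set of SNPs overlaps across inputs.")
        seen |= set(g.rsids)
    first = set(genotype_data[0].iids)
    if any(set(g.iids) != first for g in genotype_data[1:]):
        raise ValueError("The set of samples differs across inputs.")


chr1 = ToyGenoData(["rs1", "rs2"], ["a", "b"])
chr2 = ToyGenoData(["rs3"], ["b", "a"])
check_mergeable(chr1, chr2)  # disjoint SNPs, same samples: no error
```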

Classes

The package provides two main classes, one for integer 0/1/2 genotypes and one for standardised genotypes.

class gendata.IntGenoData(genotypes: DataFrame, snps: DataFrame, samples: DataFrame)

A class to hold and perform basic operations on integer genotype data.

property af: Series

Calculate the allele frequency of each SNP.

Returns:: A series of allele frequencies indexed by rsID.
Return type:: pd.Series
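
For a single SNP coded 0/1/2, the allele frequency is the total allele count divided by twice the number of called genotypes. The sketch below uses plain lists with None for missing calls to stay dependency-free; the function name is hypothetical, and which allele the 0/1/2 coding counts is an assumption not stated above.

```python
def allele_frequency(genotypes):
    """Frequency of the counted allele for one SNP.

    genotypes: list of 0/1/2 allele counts, with None marking a missing call.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        return float("nan")  # no called genotypes, frequency undefined
    return sum(called) / (2 * len(called))


snp = [0, 1, 2, None, 1]
print(allele_frequency(snp))  # 0.5
```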

filter(**kwargs: Any)

Filters genetic data by an arbitrary number of factors.

This function takes an arbitrary number of keyword arguments. For each argument that matches one of a list of defined filtering operations, that operation will be performed according to the value of the argument.

Any keyword arguments that do not match any members of the list will be ignored and a warning will be given.

Parameters:: kwargs – Keyword arguments to filter by.
Returns:: A genetic data object filtered by all applicable filters.
Return type:: Type[self]
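
The keyword-dispatch behaviour described above can be sketched as follows. The filter names (min_value, max_value) and the list-based data are hypothetical and chosen only to show the pattern; the reference above does not list the package's actual filter names.

```python
import warnings

# Hypothetical filter names and operations, for illustration only.
FILTERS = {
    "min_value": lambda data, v: [x for x in data if x >= v],
    "max_value": lambda data, v: [x for x in data if x <= v],
}


def apply_filters(data, **kwargs):
    """Apply each recognised filter in turn; warn about unrecognised names."""
    for name, value in kwargs.items():
        operation = FILTERS.get(name)
        if operation is None:
            warnings.warn(f"Unknown filter {name!r} ignored.")
            continue
        data = operation(data, value)
    return data


print(apply_filters([1, 5, 10], min_value=2, max_value=8))  # [5]
```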

flip_snps(*rsids: str)

Flip A1 and A2 for the selected SNPs.

Parameters:: rsids (str) – rsIDs of the SNPs to flip, passed as separate positional arguments.
Returns:: GenoData object.
Return type:: Type[IntGenoData]
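
For integer genotypes, flipping A1 and A2 amounts to swapping the allele labels and recounting the other allele, so a count g becomes 2 - g. A minimal sketch, assuming 0/1/2 coding with None for missing calls; flip_snp is a hypothetical helper, not the package's method.

```python
def flip_snp(genotypes, a1, a2):
    """Swap the allele labels and recode counts so they refer to the other allele."""
    flipped = [None if g is None else 2 - g for g in genotypes]
    return flipped, a2, a1


print(flip_snp([0, 1, 2, None], "A", "G"))  # ([2, 1, 0, None], 'G', 'A')
```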

property geno: Series

Calculates genotype missingness rate for each SNP.

The property name is taken from PLINK.

Returns:: Genotype missingness rate for each SNP, indexed by rsID.
Return type:: pd.Series

hwe(midp: bool) → Series

Calculate the exact HWE p-values for each SNP.

Parameters:: midp (bool) – Apply midp adjustment.
Returns:: A series of HWE p-values indexed by rsID.
Return type:: pd.Series
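
The sketch below shows one way an exact HWE test can be computed for a single SNP from its three genotype counts, in the spirit of the exact test used by PLINK. It is an illustration under assumptions, not the package's implementation: the function name is hypothetical, and the mid-p adjustment is taken to mean counting the observed outcome at half weight.

```python
import math


def hwe_exact(n_hom1, n_het, n_hom2, midp=False):
    """Exact HWE p-value for one SNP from its three genotype counts."""
    n = n_hom1 + n_het + n_hom2
    n_a = 2 * n_hom1 + n_het  # copies of allele A1
    n_b = 2 * n - n_a         # copies of allele A2

    def log_prob(het):
        # log P(het | n, n_a) under HWE, via log-gamma to avoid overflow
        hom_a = (n_a - het) // 2
        hom_b = (n_b - het) // 2
        return (
            het * math.log(2)
            + math.lgamma(n + 1) - math.lgamma(2 * n + 1)
            + math.lgamma(n_a + 1) + math.lgamma(n_b + 1)
            - math.lgamma(hom_a + 1) - math.lgamma(het + 1)
            - math.lgamma(hom_b + 1)
        )

    # Heterozygote counts must share the parity of the allele counts.
    hets = range(n_a % 2, min(n_a, n_b) + 1, 2)
    probs = {h: math.exp(log_prob(h)) for h in hets}
    p_obs = probs[n_het]
    # Sum every outcome that is no more likely than the observed one.
    p_value = sum(p for p in probs.values() if p <= p_obs * (1 + 1e-9))
    if midp:
        p_value -= 0.5 * p_obs  # count the observed outcome at half weight
    return min(1.0, p_value)


print(round(hwe_exact(1, 0, 1), 4))             # 0.3333
print(round(hwe_exact(1, 0, 1, midp=True), 4))  # 0.1667
```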

property iids: Index

Return an index of IIDs. Order is arbitrary.

Returns:: List of all IIDs.
Return type:: pd.Index

property maf: Series

Calculate the minor allele frequency for each SNP.

Returns:: A series of minor allele frequencies indexed by rsID.
Return type:: pd.Series

property max_geno: float

Find the maximum per-SNP genotype missingness rate (geno) across all SNPs.

Returns:: The maximum value of geno.
Return type:: float

property max_mind: float

Find the maximum per-sample genotype missingness rate (mind) across all samples.

Returns:: The maximum value of mind.
Return type:: float

property min_maf: float

Find the minimum minor allele frequency.

Returns:: The minimum minor allele frequency.
Return type:: float

property mind: Series

Calculates genotype missingness for each sample.

The property name is taken from PLINK.

Returns:: A series indexed by IID showing the proportion of genotypes that are missing for each sample.
Return type:: pd.Series
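
The two missingness measures, geno (per SNP) and mind (per sample), differ only in the axis they average over. A dependency-free sketch, assuming rows are samples, columns are SNPs, and None marks a missing call; the function name and the matrix orientation are assumptions.

```python
def missingness(genotypes):
    """Return (geno, mind): missing-call proportions per SNP and per sample."""
    n_samples = len(genotypes)
    n_snps = len(genotypes[0])
    geno = [
        sum(row[j] is None for row in genotypes) / n_samples
        for j in range(n_snps)
    ]
    mind = [sum(g is None for g in row) / n_snps for row in genotypes]
    return geno, mind


matrix = [
    [0, 1, None],      # sample 1: one of three calls missing
    [2, None, None],   # sample 2: two of three calls missing
]
geno, mind = missingness(matrix)
print(geno)       # [0.0, 0.5, 1.0]
print(max(geno))  # 1.0, the analogue of max_geno
```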

property n_het: Series

Number of heterozygous samples for each SNP.

Returns:: Count of heterozygous samples by rsID.
Return type:: pd.Series

property n_hom1: Series

Number of samples homozygous for the reference allele (A1) for each SNP.

Returns:: Count of homozygous (A1) samples by rsID.
Return type:: pd.Series

property n_hom2: Series

Number of samples homozygous for the alternate allele (A2) for each SNP.

Returns:: Count of homozygous (A2) samples by rsID.
Return type:: pd.Series

property n_samples: int

Return the number of samples.

Returns:: The number of samples.
Return type:: int

property n_snps: int

Return the number of SNPs.

Returns:: The number of SNPs.
Return type:: int

property rsids: Index

Return an index of rsids ordered by chromosome and base position.

Returns:: List of all rsids.
Return type:: pd.Index

Sample randomly from the genetic data.

Parameters:

rsid (Optional[Union[int, float]]) – Number (int) or percentage (float) of SNPs to sample. If None, no sampling will occur.
iid (Optional[Union[int, float]]) – Number (int) or percentage (float) of samples to sample. If None, no sampling will occur.
seed (Optional[int]) – Seed for random number generator. If None, a random seed will be generated.

Returns:

A genetic data object with a random subset of SNPs and samples.

Return type:

Type[self]
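
The int-versus-float convention for the sampling arguments can be sketched as below. The helper name is hypothetical, and treating a float as a fraction between 0 and 1 (rather than a value out of 100) is an assumption; the same logic would apply independently to rsIDs and IIDs.

```python
import random


def sample_ids(ids, n=None, seed=None):
    """Sample from ids: an int n is a count, a float n is a fraction in [0, 1]."""
    if n is None:
        return list(ids)  # no sampling requested
    k = round(n * len(ids)) if isinstance(n, float) else n
    rng = random.Random(seed)  # seed=None draws a fresh random seed
    chosen = set(rng.sample(range(len(ids)), k))
    # Keep the original ordering of the retained IDs.
    return [x for i, x in enumerate(ids) if i in chosen]


ids = ["rs1", "rs2", "rs3", "rs4"]
half = sample_ids(ids, 0.5, seed=1)   # two of the four IDs
three = sample_ids(ids, 3, seed=1)    # exactly three IDs
```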

save(out: str)

Save the integer genotype data to a .bed/.bim/.fam fileset.

Parameters:: out (str) – The path to save the data to. The path should not include the file extension, which will be added automatically.

property shape: tuple[int, int]

Return the shape of the genotype data.

Returns:: The shape of the genotype data.
Return type:: tuple[int, int]

standardised()

Standardise genotypes.

Returns:: A standardised genotype data object.
Return type:: Type[StdGenoData]

standardized()

Standardise genotypes.

Returns:: A standardised genotype data object.
Return type:: Type[StdGenoData]
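
A common way to standardise 0/1/2 genotypes is to centre each SNP by twice its allele frequency and scale by the binomial standard deviation. The sketch below assumes that convention and replaces missing calls with the mean (zero after centring); the package may instead use the sample mean and standard deviation, so treat this as illustrative.

```python
import math


def standardise(genotypes):
    """Standardise one SNP: (g - 2p) / sqrt(2p(1 - p)), missing calls set to 0."""
    called = [g for g in genotypes if g is not None]
    p = sum(called) / (2 * len(called))
    sd = math.sqrt(2 * p * (1 - p))
    if sd == 0:
        raise ValueError("Monomorphic SNP cannot be standardised.")
    return [0.0 if g is None else (g - 2 * p) / sd for g in genotypes]


z = standardise([0, 1, 2, 1, None])  # p = 0.5, sd = sqrt(0.5)
```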

static to_maf(af: float) → float

Convert a biallelic allele frequency to a minor allele frequency.

Parameters:: af (float) – Allele frequency.
Raises:: ValueError – If af is not in [0, 1].
Returns:: Minor allele frequency.
Return type:: float
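
For a biallelic SNP the minor allele frequency is the smaller of af and 1 - af. A minimal standalone sketch consistent with the documented behaviour, including the range check:

```python
def to_maf(af):
    """Convert a biallelic allele frequency to a minor allele frequency."""
    if not 0.0 <= af <= 1.0:
        raise ValueError(f"af must be in [0, 1], got {af}")
    return min(af, 1.0 - af)


print(to_maf(0.25))  # 0.25
print(to_maf(0.75))  # 0.25
```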

class gendata.StdGenoData(genotypes: DataFrame, snps: DataFrame, samples: DataFrame)

A class to hold and perform operations on standardised genotype data.

calculate_grm(individuals: list[str] | None = None, weights: dict[str, float] | None = None) → GRM

Calculate a full GRM.

Parameters:

individuals (Optional[list[str]]) – List of individual IDs to include in the GRM.
weights (Optional[dict[str, float]]) – Weights to apply to the GRM.

Returns:

The GRM.

Return type:

GRM
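
One common definition of the GRM is the weighted average over SNPs of the product of standardised genotypes for each pair of individuals. The sketch below assumes that definition, uses rows for samples and columns for SNPs, and takes weights as a plain list rather than the rsID-keyed dict documented above; it does not reproduce the package's GRM class.

```python
def grm(z, weights=None):
    """GRM[i][j] = sum_k w_k * z[i][k] * z[j][k] / sum_k w_k."""
    n_snps = len(z[0])
    w = weights if weights is not None else [1.0] * n_snps
    total = sum(w)
    return [
        [sum(wk * a * b for wk, a, b in zip(w, zi, zj)) / total for zj in z]
        for zi in z
    ]


z_toy = [
    [1.0, -1.0],   # sample 1, standardised genotypes at two SNPs
    [-1.0, 1.0],   # sample 2
]
print(grm(z_toy))  # [[1.0, -1.0], [-1.0, 1.0]]
```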

calculate_ldm_blocks(block_map: Series | dict[str, int], n_cores: int | None = None) → dict[int, dict[str, ndarray | DataFrame]]

Calculate a block-wise LD matrix, where each block ends at one of the listed terminal rsids.

Parameters:

block_map (Union[pd.Series, dict[str, int]]) – Mapping of rsids to block numbers.
n_cores (Optional[int]) – Number of cores to use for parallelisation. If None, defaults to the number of cores available.

Returns:

A dictionary containing an LD matrix dictionary for each block.

Return type:

dict[int, dict[str, Union[np.ndarray, pd.DataFrame]]]
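
The block-wise structure of the result can be sketched by grouping SNPs by block number and computing correlations only within each group. This assumes the columns of z are already standardised (mean 0, variance 1), so the correlation is a scaled dot product; the inner dictionary keys "rsids" and "ldm" are hypothetical, since the reference above does not name them.

```python
def ld_blocks(z, rsids, block_map):
    """Return {block: {"rsids": [...], "ldm": [[...]]}} for standardised z."""
    n = len(z)
    columns = {}
    for j, rsid in enumerate(rsids):
        columns.setdefault(block_map[rsid], []).append(j)
    result = {}
    for block, cols in columns.items():
        ldm = [
            [sum(row[a] * row[b] for row in z) / n for b in cols]
            for a in cols
        ]
        result[block] = {"rsids": [rsids[c] for c in cols], "ldm": ldm}
    return result


z_std = [[1.0, 1.0, -1.0], [-1.0, -1.0, 1.0]]  # 2 samples x 3 SNPs
blocks = ld_blocks(z_std, ["rs1", "rs2", "rs3"], {"rs1": 0, "rs2": 0, "rs3": 1})
print(blocks[0]["ldm"])  # [[1.0, 1.0], [1.0, 1.0]]
```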

calculate_ldm_window(window: int | None = None, n_cores: int | None = None, sparse: bool = True, tol: float = 0.001) → dict[int, csr_matrix]

Calculate a sparse windowed LD matrix in CSR format.

Note: These calculations are quite fast, so parallelisation may not always be necessary.

Parameters:

window (Optional[int]) – Set LD correlations to zero for all pairs of SNPs separated by a distance greater than window. If None, the window will be set to the maximum distance between SNPs.
n_cores (Optional[int]) – Number of cores to use for parallelisation. If None, defaults to the number of cores available.
sparse (bool) – Whether to make sparse matrices.
tol (float) – Tolerance for sparse matrix construction.

Returns:

Dictionary containing a sparse LD matrix for each chromosome.

Return type:

dict[int, sp.csr_matrix]
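
To make the windowing and the CSR layout concrete, the sketch below builds the three CSR arrays (data, indices, indptr) by hand for one chromosome. It assumes standardised columns, a window measured in base pairs, and that tol drops entries whose absolute correlation does not exceed it; all three are assumptions, and the package itself returns SciPy csr_matrix objects rather than raw arrays.

```python
def windowed_ld_csr(z, positions, window, tol=0.001):
    """Return CSR arrays (data, indices, indptr) of the windowed LD matrix."""
    n, m = len(z), len(positions)
    data, indices, indptr = [], [], [0]
    for a in range(m):
        for b in range(m):
            if abs(positions[a] - positions[b]) > window:
                continue  # outside the window: implicit zero
            r = sum(row[a] * row[b] for row in z) / n
            if abs(r) > tol:  # store only non-negligible correlations
                data.append(r)
                indices.append(b)
        indptr.append(len(data))  # end of row a in data/indices
    return data, indices, indptr


z_chr = [[1.0, 1.0, -1.0], [-1.0, -1.0, 1.0]]  # 2 samples x 3 SNPs
data, indices, indptr = windowed_ld_csr(z_chr, [100, 150, 900], window=100)
print(indices)  # [0, 1, 0, 1, 2]
print(indptr)   # [0, 2, 4, 5]
```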

filter(**kwargs: Any)

Filters genetic data by an arbitrary number of factors.

This function takes an arbitrary number of keyword arguments. For each argument that matches one of a list of defined filtering operations, that operation will be performed according to the value of the argument.

Any keyword arguments that do not match any members of the list will be ignored and a warning will be given.

Parameters:: kwargs – Keyword arguments to filter by.
Returns:: A genetic data object filtered by all applicable filters.
Return type:: Type[self]

flip_snps(*rsids: str)

Flip A1 and A2 for the selected SNPs.

Parameters:: rsids (str) – rsIDs of the SNPs to flip, passed as separate positional arguments.
Returns:: StdGenoData object.
Return type:: Type[StdGenoData]

property geno: Series

Calculates genotype missingness rate for each SNP.

The property name is taken from PLINK.

Returns:: Genotype missingness rate for each SNP, indexed by rsID.
Return type:: pd.Series

property iids: Index

Return an index of IIDs. Order is arbitrary.

Returns:: List of all IIDs.
Return type:: pd.Index

property max_geno: float

Find the maximum per-SNP genotype missingness rate (geno) across all SNPs.

Returns:: The maximum value of geno.
Return type:: float

property max_mind: float

Find the maximum per-sample genotype missingness rate (mind) across all samples.

Returns:: The maximum value of mind.
Return type:: float

property mind: Series

Calculates genotype missingness for each sample.

The property name is taken from PLINK.

Returns:: A series indexed by IID showing the proportion of genotypes that are missing for each sample.
Return type:: pd.Series

property n_samples: int

Return the number of samples.

Returns:: The number of samples.
Return type:: int

property n_snps: int

Return the number of SNPs.

Returns:: The number of SNPs.
Return type:: int

property rsids: Index

Return an index of rsids ordered by chromosome and base position.

Returns:: List of all rsids.
Return type:: pd.Index

Sample randomly from the genetic data.

Parameters:

rsid (Optional[Union[int, float]]) – Number (int) or percentage (float) of SNPs to sample. If None, no sampling will occur.
iid (Optional[Union[int, float]]) – Number (int) or percentage (float) of samples to sample. If None, no sampling will occur.
seed (Optional[int]) – Seed for random number generator. If None, a random seed will be generated.

Returns:

A genetic data object with a random subset of SNPs and samples.

Return type:

Type[self]

property shape: tuple[int, int]

Return the shape of the genotype data.

Returns:: The shape of the genotype data.
Return type:: tuple[int, int]