Usage
Installation
Install the package directly from GitHub with pip:
(venv) $ pip install https://github.com/michaelbennett99/gendata_package/archive/master.zip
Usage
The package provides functions for loading and merging data, and classes for storing and acting on genetic data.
Functions
The package provides two functions, one for loading data and one for merging data.
- gendata.read_bed(paths: str | list[str], rsids: list | None = None, individuals: list | None = None, num_threads: int | None = 1) IntGenoData
Read raw genotypes into an annotated data frame.
Accepts either a direct path to a single .bed/.bim/.fam fileset or a list of paths to several .bed/.bim/.fam filesets.
- Parameters:
paths (Union[str, list[str]]) – Paths to the .bed/.bim/.fam filesets to load together. Can be a single path or a list of paths. If multiple paths are given, the data will be merged after loading.
rsids (Optional[list], optional) – Filter SNPs to this set of rsIDs. If not provided, no filtering will occur.
individuals (Optional[list], optional) – Filters samples to this list of individuals. If not provided, no filtering will occur.
num_threads (Optional[int], optional) – Specifies the number of threads to use when reading bed files. Defaults to 1.
- Returns:
Annotated genotype data object.
- Return type:
IntGenoData
- gendata.merge(*genotype_data: Type[AbstractGenoData]) Type[AbstractGenoData]
Merge sets of genotype data for different chromosomes/loci from the same set of samples.
- Parameters:
genotype_data (AbstractGenoData) – Genetic data objects to merge.
- Raises:
TypeError – If genetic data inputs are not all of the same object type.
ValueError – If the set of SNPs overlaps across genetic data inputs.
ValueError – If the set of samples is not the same across all genetic data inputs.
- Returns:
Merged genetic data.
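The checks listed under Raises can be sketched in plain Python. The GenoBlock class and merge_blocks function below are hypothetical stand-ins, not part of gendata; they only illustrate the validation on toy data. The sketch also assumes that samples must appear in the same order, which is stricter than the documented check.

```python
from dataclasses import dataclass


@dataclass
class GenoBlock:
    """Hypothetical stand-in for a genotype data object (not part of gendata)."""
    genotypes: dict[str, list[int]]  # rsID -> one genotype per sample
    samples: list[str]               # sample IIDs, in genotype order


def merge_blocks(*blocks: GenoBlock) -> GenoBlock:
    """Combine blocks that cover different SNPs for the same samples."""
    if not blocks:
        raise ValueError("at least one block is required")
    first = blocks[0]
    merged: dict[str, list[int]] = {}
    for block in blocks:
        if type(block) is not type(first):
            raise TypeError("all inputs must be of the same type")
        # Assumption: identical order is required so genotype rows stay aligned.
        if block.samples != first.samples:
            raise ValueError("all inputs must share the same samples")
        overlap = merged.keys() & block.genotypes.keys()
        if overlap:
            raise ValueError(f"SNPs present in more than one input: {sorted(overlap)}")
        merged.update(block.genotypes)
    return GenoBlock(genotypes=merged, samples=list(first.samples))


chr1 = GenoBlock({"rs1": [0, 1, 2]}, ["A", "B", "C"])
chr2 = GenoBlock({"rs2": [2, 2, 0]}, ["A", "B", "C"])
combined = merge_blocks(chr1, chr2)
```

Passing the same block twice, as in merge_blocks(chr1, chr1), raises ValueError in this sketch because rs1 would appear in both inputs.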
Classes
The package provides two main classes, one for integer 0/1/2 genotypes and one for standardised genotypes.
- class gendata.IntGenoData(genotypes: DataFrame, snps: DataFrame, samples: DataFrame)
A class to hold and perform basic operations on integer genotype data.
- property af: Series
Calculate the allele frequency of each SNP.
- Returns:
A series of allele frequencies indexed by rsID.
- Return type:
pd.Series
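As a rough illustration of what an allele frequency is for 0/1/2 genotypes, the sketch below works on a single SNP in plain Python. It assumes each genotype counts copies of one allele and that missing calls are stored as None; which allele the package counts, and how it encodes missing values, is not stated here, and allele_frequency is a hypothetical helper.

```python
from typing import Optional


def allele_frequency(genotypes: list[Optional[int]]) -> float:
    """Frequency of the counted allele among non-missing genotypes."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return float("nan")
    # Each sample carries two alleles, so the denominator is 2 * n_called.
    return sum(called) / (2 * len(called))


snp = [0, 1, 2, None, 1]
af = allele_frequency(snp)  # 4 counted alleles out of 8 called alleles
```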
- filter(**kwargs: Any)
Filters genetic data by an arbitrary number of factors.
This function takes an arbitrary number of keyword arguments. For each argument that matches one of a list of defined filtering operations, that operation will be performed according to the value of the argument.
Any keyword arguments that do not match any members of the list will be ignored and a warning will be given.
- Parameters:
kwargs – Keyword arguments to filter by.
- Returns:
A genetic data object filtered by all applicable filters.
- Return type:
Type[self]
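The keyword dispatch described above can be sketched as a lookup table of filtering operations. The keyword names min_maf and max_geno, the FILTERS table and apply_filters are hypothetical and chosen only for illustration; the real list of supported keywords is defined by the package.

```python
import warnings
from typing import Any, Callable

# Per-SNP summary statistics: rsID -> {"maf": ..., "geno": ...}.
SnpStats = dict[str, dict[str, float]]

# Hypothetical filter table: keyword name -> function(stats, threshold).
FILTERS: dict[str, Callable[[SnpStats, Any], SnpStats]] = {
    "min_maf": lambda stats, t: {r: s for r, s in stats.items() if s["maf"] >= t},
    "max_geno": lambda stats, t: {r: s for r, s in stats.items() if s["geno"] <= t},
}


def apply_filters(stats: SnpStats, **kwargs: Any) -> SnpStats:
    """Apply every recognised filter in turn; warn about unknown keywords."""
    for name, value in kwargs.items():
        operation = FILTERS.get(name)
        if operation is None:
            warnings.warn(f"Unknown filter {name!r} ignored")
            continue
        stats = operation(stats, value)
    return stats


stats = {
    "rs1": {"maf": 0.20, "geno": 0.00},
    "rs2": {"maf": 0.01, "geno": 0.00},
    "rs3": {"maf": 0.30, "geno": 0.10},
}
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    kept = apply_filters(stats, min_maf=0.05, max_geno=0.05, colour="red")
```

Here rs2 fails the hypothetical min_maf filter, rs3 fails max_geno, and the unrecognised colour keyword only produces a warning.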
- flip_snps(*rsids: str)
Flip A1 and A2 for the selected SNPs.
- Parameters:
rsids (str) – List of rsids to flip.
- Returns:
GenoData object.
- Return type:
Type[IntGenoData]
- property geno: Series
Calculates genotype missingness rate for each SNP.
Method name is taken from plink.
- Returns:
Genotype missingness rate for each SNP, indexed by rsID.
- Return type:
pd.Series
- hwe(midp: bool) Series
Calculate the exact HWE p-values for each SNP.
- Parameters:
midp (bool) – Apply midp adjustment.
- Returns:
A series of HWE p-values indexed by rsID.
- Return type:
pd.Series
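For readers unfamiliar with the exact test, the sketch below computes an exact HWE p-value for one SNP from its genotype counts by enumerating every possible heterozygote count given the allele counts. Whether the package uses this enumeration, and exactly how it applies the mid-p adjustment, is an assumption; hwe_exact is a hypothetical helper and subtracts half the probability of the observed table, which is one common form of mid-p.

```python
from fractions import Fraction
from math import factorial


def hwe_exact(n_hom1: int, n_het: int, n_hom2: int, midp: bool = False) -> float:
    """Exact HWE p-value from genotype counts at one biallelic SNP."""
    n = n_hom1 + n_het + n_hom2
    n_a = 2 * n_hom1 + n_het  # copies of allele 1
    n_b = 2 * n_hom2 + n_het  # copies of allele 2

    def prob(het: int) -> Fraction:
        # Probability of `het` heterozygotes given n and the allele counts.
        hom_a = (n_a - het) // 2
        hom_b = (n_b - het) // 2
        ways = factorial(n) // (factorial(hom_a) * factorial(het) * factorial(hom_b))
        return Fraction(ways * 2**het * factorial(n_a) * factorial(n_b),
                        factorial(2 * n))

    # Heterozygote counts must share the parity of the allele counts.
    probs = {het: prob(het) for het in range(n_a % 2, min(n_a, n_b) + 1, 2)}
    observed = probs[n_het]
    # Sum the probabilities of all tables no more likely than the observed one.
    p_value = sum(p for p in probs.values() if p <= observed)
    if midp:
        p_value -= observed / 2
    return float(p_value)


p_balanced = hwe_exact(25, 50, 25)  # counts at the most likely configuration
p_no_hets = hwe_exact(50, 0, 50)    # no heterozygotes at all
```

Exact rational arithmetic keeps the sketch simple and avoids rounding issues; a production implementation would typically use a recurrence over heterozygote counts instead of factorials.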
- property iids: Index
Return an index of IIDs. Order is arbitrary.
- Returns:
List of all IIDs.
- Return type:
pd.Index
- property maf: Series
Calculate the minor allele frequency for each SNP.
- Returns:
A series of minor allele frequencies indexed by rsID.
- Return type:
pd.Series
- property max_geno: float
Find the maximum genotype missingness rate (geno) across SNPs.
- Returns:
The largest per-SNP missingness rate.
- Return type:
float
- property max_mind: float
Find the maximum genotype missingness rate (mind) across samples.
- Returns:
The largest per-sample missingness rate.
- Return type:
float
- property min_maf: float
Find the minimum minor allele frequency.
- Returns:
The minimum minor allele frequency.
- Return type:
float
- property mind: Series
Calculates genotype missingness for each sample.
Method name is taken from plink.
- Returns:
A series indexed by IID showing the proportion of genotypes that are missing for each sample.
- Return type:
pd.Series
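The two missingness measures differ only in the axis they summarise: geno is computed per SNP and mind per sample. A minimal plain-Python sketch on a toy matrix, assuming rows are samples, columns are SNPs and missing calls are stored as None (the package's internal encoding may differ; missing_rate is a hypothetical helper):

```python
from typing import Optional, Sequence

# Rows are samples, columns are SNPs; None marks a missing genotype call.
iids = ["A", "B", "C"]
rsids = ["rs1", "rs2"]
genotypes: list[list[Optional[int]]] = [
    [0, None],
    [1, 2],
    [None, None],
]


def missing_rate(calls: Sequence[Optional[int]]) -> float:
    """Proportion of calls that are missing."""
    return sum(g is None for g in calls) / len(calls)


# Per-sample missingness (mind): one rate per row.
mind = {iid: missing_rate(row) for iid, row in zip(iids, genotypes)}

# Per-SNP missingness (geno): one rate per column.
geno = {rsid: missing_rate(col) for rsid, col in zip(rsids, zip(*genotypes))}

# The maxima correspond to what max_mind and max_geno summarise.
max_mind = max(mind.values())
max_geno = max(geno.values())
```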
- property n_het: Series
Number of samples heterozygous.
- Returns:
Count of heterozygous samples by rsID.
- Return type:
pd.Series
- property n_hom1: Series
Number of samples homozygous in the reference allele.
- Returns:
Count of homozygous (A1) samples by rsID.
- Return type:
pd.Series
- property n_hom2: Series
Number of samples homozygous in the alternate allele.
- Returns:
Count of homozygous (A2) samples by rsID.
- Return type:
pd.Series
- property n_samples: int
Return the number of samples.
- Returns:
The number of samples.
- Return type:
int
- property n_snps: int
Return the number of SNPs.
- Returns:
The number of SNPs.
- Return type:
int
- property rsids: Index
Return an index of rsids ordered by chromosome and base position.
- Returns:
List of all rsids.
- Return type:
pd.Index
- sample(rsid: int | float | None = None, iid: int | float | None = None, seed: int | None = None)
Sample randomly from the genetic data.
- Parameters:
rsid (Optional[Union[int, float]]) – Number (int) or percentage (float) of SNPs to sample. If None, no sampling will occur.
iid (Optional[Union[int, float]]) – Number (int) or percentage (float) of samples to sample. If None, no sampling will occur.
seed (Optional[int]) – Seed for random number generator. If None, a random seed will be generated.
- Returns:
A genetic data object with a random subset of SNPs and samples.
- Return type:
Type[self]
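The int-versus-float convention for rsid and iid can be sketched as follows. This is an illustration only: sample_ids is a hypothetical helper, and it assumes a float is a fraction between 0 and 1 and that the original ordering is preserved, neither of which is confirmed by the description above.

```python
import random
from typing import Optional, Union


def sample_ids(ids: list[str], amount: Optional[Union[int, float]],
               rng: random.Random) -> list[str]:
    """Pick a random subset of ids, keeping their original order."""
    if amount is None:
        return list(ids)
    # Assumption: a float is a fraction in [0, 1]; an int is an absolute count.
    k = round(amount * len(ids)) if isinstance(amount, float) else amount
    chosen = set(rng.sample(ids, k))
    return [i for i in ids if i in chosen]


rng = random.Random(42)  # a fixed seed makes the draw reproducible
rsids = [f"rs{i}" for i in range(10)]
iids = [f"id{i}" for i in range(4)]
snp_subset = sample_ids(rsids, 0.5, rng)     # half of the SNPs
sample_subset = sample_ids(iids, None, rng)  # no sampling of individuals
```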
- save(out: str)
Save the integer genotype data to a .bed/.bim/.fam fileset.
- Parameters:
out (str) – The path to save the data to. The path should not include the file extension, which will be added automatically.
- property shape: tuple[int, int]
Return the shape of the genotype data.
- Returns:
The shape of the genotype data.
- Return type:
tuple[int, int]
- standardised()
Standardise genotypes.
- Returns:
A standardised genotype data object.
- Return type:
Type[StdGenoData]
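Standardisation here means rescaling each SNP so that its genotypes have mean 0 and variance 1. The sketch below does this for one SNP using the empirical mean and population standard deviation; the package may instead scale by allele frequency, and it must also handle missing calls, which this hypothetical helper ignores.

```python
from statistics import fmean, pstdev


def standardise(column: list[int]) -> list[float]:
    """Centre and scale one SNP's genotypes to mean 0 and variance 1."""
    mean = fmean(column)
    sd = pstdev(column)
    if sd == 0:
        raise ValueError("cannot standardise a monomorphic SNP")
    return [(g - mean) / sd for g in column]


z = standardise([0, 1, 2, 1])
```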
- standardized()
Standardise genotypes.
- Returns:
A standardised genotype data object.
- Return type:
Type[StdGenoData]
- static to_maf(af: float) float
Convert a biallelic allele frequency to a minor allele frequency.
- Parameters:
af (float) – Allele frequency.
- Raises:
ValueError – If af is not in [0, 1].
- Returns:
Minor allele frequency.
- Return type:
float
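The conversion amounts to folding the frequency onto the interval [0, 0.5]. A minimal re-implementation for illustration, which may differ in detail from the package's own method:

```python
def to_maf(af: float) -> float:
    """Fold an allele frequency onto [0, 0.5]."""
    if not 0.0 <= af <= 1.0:
        raise ValueError(f"allele frequency must be in [0, 1], got {af}")
    # The minor allele is whichever of the two alleles is rarer.
    return min(af, 1.0 - af)


maf_low = to_maf(0.25)
maf_high = to_maf(0.75)
```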
- class gendata.StdGenoData(genotypes: DataFrame, snps: DataFrame, samples: DataFrame)
A class to hold and perform operations on standardised genotype data.
- calculate_grm(individuals: list[str] | None = None, weights: dict[str, float] | None = None) GRM
Calculate a full GRM.
- Parameters:
individuals (Optional[list[str]]) – List of individual IDs to include in the GRM.
weights (Optional[dict[str, float]]) – Weights to apply to the GRM.
- Returns:
The GRM.
- Return type:
GRM
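A common definition of the GRM is the weighted cross-product of standardised genotypes divided by the total weight. The sketch below follows that definition on a toy matrix; it is an assumption that the package uses the same formula, grm is a hypothetical helper, and the weights are passed as a list aligned with the SNP columns rather than as the rsID-keyed dictionary documented above.

```python
from typing import Optional


def grm(z: list[list[float]], weights: Optional[list[float]] = None) -> list[list[float]]:
    """GRM from standardised genotypes (rows = samples, columns = SNPs)."""
    n_snps = len(z[0])
    w = weights if weights is not None else [1.0] * n_snps
    total = sum(w)
    # Entry (i, j) is the weighted average product of the two samples' genotypes.
    return [
        [sum(wk * a * b for wk, a, b in zip(w, row_i, row_j)) / total for row_j in z]
        for row_i in z
    ]


# Three samples, two SNPs; each column already has mean 0 and variance 1.
s = 1.5 ** 0.5
z = [[-s, 0.0], [0.0, s], [s, -s]]
g = grm(z)
```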
- calculate_ldm_blocks(block_map: Series | dict[str, int], n_cores: int | None = None) dict[int, dict[str, ndarray | DataFrame]]
Calculate a block-wise LD matrix, where each block ends at one of the listed terminal rsids.
- Parameters:
block_map (Union[pd.Series, dict[str, int]]) – Mapping of rsids to block numbers.
n_cores (Optional[int]) – Number of cores to use for parallelisation. If None, defaults to the number of cores available.
- Returns:
A dictionary containing an LD matrix dictionary for each block.
- Return type:
dict[int, dict[str, Union[np.ndarray, pd.DataFrame]]]
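To make the block_map idea concrete, the sketch below groups SNPs by block number and builds a dense correlation matrix within each block. The dictionary keys "rsids" and "ld", and the functions correlation and ld_blocks, are hypothetical; the layout of the package's per-block dictionaries is not specified here.

```python
from collections import defaultdict


def correlation(x: list[float], y: list[float]) -> float:
    """Pearson correlation of two standardised columns (mean 0, variance 1)."""
    return sum(a * b for a, b in zip(x, y)) / len(x)


def ld_blocks(columns: dict[str, list[float]],
              block_map: dict[str, int]) -> dict[int, dict[str, list]]:
    """Build one dense LD matrix per block from an rsID -> block number map."""
    members: dict[int, list[str]] = defaultdict(list)
    for rsid, block in block_map.items():
        members[block].append(rsid)
    return {
        block: {
            "rsids": rsids,  # hypothetical key: SNP order of the matrix
            "ld": [[correlation(columns[a], columns[b]) for b in rsids] for a in rsids],
        }
        for block, rsids in members.items()
    }


s = 1.5 ** 0.5
columns = {"rs1": [-s, 0.0, s], "rs2": [0.0, s, -s], "rs3": [s, 0.0, -s]}
blocks = ld_blocks(columns, {"rs1": 0, "rs2": 0, "rs3": 1})
```

Correlations between SNPs in different blocks, such as rs1 and rs3 here, are never computed, which is what keeps a block-wise LD matrix cheap for large SNP sets.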
- calculate_ldm_window(window: int | None = None, n_cores: int | None = None, sparse: bool = True, tol: float = 0.001) dict[int, csr_matrix]
Calculate a sparse windowed LD matrix in CSR format.
Note: These calculations are quite fast, so parallelisation may not always be necessary.
- Parameters:
window (Optional[int]) – Set LD correlations to zero for all values separated by a distance of greater than window. If None, the window will be set to the maximum distance between SNPs.
n_cores (Optional[int]) – Number of cores to use for parallelisation. If None, defaults to the number of cores available.
sparse (bool) – Whether to make sparse matrices.
tol (float) – Tolerance for sparse matrix construction.
- Returns:
Dictionary containing a sparse LD matrix for each chromosome.
- Return type:
dict[int, sp.csr_matrix]
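The windowing and tolerance logic can be sketched for a single chromosome without any sparse-matrix library, storing only the retained entries in a dictionary. This assumes that window is measured in base-pair positions and that tol drops correlations whose absolute value falls below it; both are interpretations of the parameter descriptions above, and ld_window is a hypothetical helper.

```python
def ld_window(columns: list[list[float]], positions: list[int],
              window: int, tol: float = 0.001) -> dict[tuple[int, int], float]:
    """Sparse LD for one chromosome as {(i, j): r}, SNPs sorted by position."""
    n = len(columns[0])
    sparse: dict[tuple[int, int], float] = {}
    for i in range(len(columns)):
        for j in range(i, len(columns)):
            if positions[j] - positions[i] > window:
                break  # positions are sorted, so later SNPs are further away
            # Columns are standardised, so the mean product is the correlation.
            r = sum(a * b for a, b in zip(columns[i], columns[j])) / n
            if abs(r) >= tol:
                sparse[(i, j)] = sparse[(j, i)] = r
    return sparse


s = 1.5 ** 0.5
columns = [[-s, 0.0, s], [0.0, s, -s], [s, 0.0, -s]]
ld = ld_window(columns, positions=[100, 150, 900], window=100)
```

The third SNP lies more than 100 positions from the other two, so only its diagonal entry is stored.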
- filter(**kwargs: Any)
Filters genetic data by an arbitrary number of factors.
This function takes an arbitrary number of keyword arguments. For each argument that matches one of a list of defined filtering operations, that operation will be performed according to the value of the argument.
Any keyword arguments that do not match any members of the list will be ignored and a warning will be given.
- Parameters:
kwargs – Keyword arguments to filter by.
- Returns:
A genetic data object filtered by all applicable filters.
- Return type:
Type[self]
- flip_snps(*rsids: str)
Flip A1 and A2 for the selected SNPs.
- Parameters:
rsids (str) – List of rsids to flip.
- Returns:
StdGenoData object.
- Return type:
Type[StdGenoData]
- property geno: Series
Calculates genotype missingness rate for each SNP.
Method name is taken from plink.
- Returns:
Genotype missingness rate for each SNP, indexed by rsID.
- Return type:
pd.Series
- property iids: Index
Return an index of IIDs. Order is arbitrary.
- Returns:
List of all IIDs.
- Return type:
pd.Index
- property max_geno: float
Find the maximum genotype missingness rate (geno) across SNPs.
- Returns:
The largest per-SNP missingness rate.
- Return type:
float
- property max_mind: float
Find the maximum genotype missingness rate (mind) across samples.
- Returns:
The largest per-sample missingness rate.
- Return type:
float
- property mind: Series
Calculates genotype missingness for each sample.
Method name is taken from plink.
- Returns:
A series indexed by IID showing the proportion of genotypes that are missing for each sample.
- Return type:
pd.Series
- property n_samples: int
Return the number of samples.
- Returns:
The number of samples.
- Return type:
int
- property n_snps: int
Return the number of SNPs.
- Returns:
The number of SNPs.
- Return type:
int
- property rsids: Index
Return an index of rsids ordered by chromosome and base position.
- Returns:
List of all rsids.
- Return type:
pd.Index
- sample(rsid: int | float | None = None, iid: int | float | None = None, seed: int | None = None)
Sample randomly from the genetic data.
- Parameters:
rsid (Optional[Union[int, float]]) – Number (int) or percentage (float) of SNPs to sample. If None, no sampling will occur.
iid (Optional[Union[int, float]]) – Number (int) or percentage (float) of samples to sample. If None, no sampling will occur.
seed (Optional[int]) – Seed for random number generator. If None, a random seed will be generated.
- Returns:
A genetic data object with a random subset of SNPs and samples.
- Return type:
Type[self]
- property shape: tuple[int, int]
Return the shape of the genotype data.
- Returns:
The shape of the genotype data.
- Return type:
tuple[int, int]