API Reference
Here you will find the details of the data curation functions.
- pMTnet_Omni_Document.data_curation.check_column_names(df: DataFrame) → DataFrame
Check if the column names are correct
The main purpose of this function is to make sure that the dataframe provided by the user contains the necessary columns so that it can be used by subsequent functions.
This function will NOT create mhca, mhcb, mhcaseq, or mhcbseq. It will keep the original columns mhc and mhcseq.
- Parameters:
df (pd.DataFrame) – A pandas dataframe containing pairing data
- Returns:
A pandas dataframe with corrected column names
- Return type:
pd.DataFrame
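The kind of check described above can be sketched in plain Python. This is only an illustration and not the library's implementation: the required set `REQUIRED_COLUMNS`, the helper `find_missing_columns`, and the header normalization (trimming whitespace and lowercasing) are all assumptions made for the example.

```python
import csv
import io

# Hypothetical required set; the real function may expect other columns.
REQUIRED_COLUMNS = ("mhc", "mhcseq")


def find_missing_columns(header, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the header."""
    # Assumption: names are compared after trimming whitespace and lowercasing.
    present = {name.strip().lower() for name in header}
    return [name for name in required if name not in present]


# A small in-memory CSV standing in for a user-provided file.
csv_text = "MHC, peptide\nA*02:01,GILGFVFTL\n"
header = next(csv.reader(io.StringIO(csv_text)))
missing = find_missing_columns(header)
print(missing)
```

Reporting the missing names, rather than failing on the first one, lets a user fix the whole input file in one pass.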
- pMTnet_Omni_Document.data_curation.check_species(df: DataFrame) → DataFrame
Check the TCR species and pMHC species
- Parameters:
df (pd.DataFrame) – A pandas dataframe containing pairing data
- Returns:
A pandas dataframe with curated data
- Return type:
pd.DataFrame
- pMTnet_Omni_Document.data_curation.check_v_gene_allele(df: DataFrame, a_reference_df: DataFrame, b_reference_df) → DataFrame
- pMTnet_Omni_Document.data_curation.check_va_vb(df: DataFrame, background_tcrs_dir: str = './validation_data/') → Tuple[DataFrame, DataFrame]
Check VA and VB
- Parameters:
df (pd.DataFrame) – A pandas dataframe containing pairing data
background_tcrs_dir (str, optional) – The path to background tcrs data, by default “./validation_data/”
- Returns:
A pandas dataframe with curated data and a pandas dataframe with invalid data
- Return type:
Tuple[pd.DataFrame, pd.DataFrame]
- pMTnet_Omni_Document.data_curation.infer_mhc_info(df: DataFrame) → DataFrame
Infer the MHC classes and create columns mhca and mhcb
The input df should be the output of the check_column_names function.
- Parameters:
df (pd.DataFrame) – A pandas dataframe containing mhc, mhcseq, and pmhc_species
- Returns:
A column of a pandas dataframe with the inferred MHC classes, MHCs on the alpha chain and the beta chain
- Return type:
pd.DataFrame
- pMTnet_Omni_Document.data_curation.check_mhc(df: DataFrame, mhc_path: str = './validation_data/valid_mhc.txt') → Tuple[DataFrame, DataFrame, DataFrame]
Check mhc
This function checks whether the data format conforms to what the model expects.
- Parameters:
df (pd.DataFrame) – A pandas dataframe containing pairing data
mhc_path (str) – The file path to valid mhcs
- Returns:
Four pandas dataframes containing the curated pairing data, the pairs with peptides longer than 30, the problematic mhca entries, and the problematic mhcb entries
- Return type:
Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame, pd.DataFrame]
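The "curated data plus rejected data" return pattern can be sketched for the peptide-length criterion mentioned above. This is a minimal illustration over plain dictionaries, not the library's code: the threshold of 30 comes from the return description, while the column name `peptide`, the helper `split_by_peptide_length`, and the sample rows are assumptions.

```python
MAX_PEPTIDE_LENGTH = 30  # threshold taken from the return description above


def split_by_peptide_length(rows, max_length=MAX_PEPTIDE_LENGTH):
    """Split rows into (kept, too_long) according to peptide length."""
    kept, too_long = [], []
    for row in rows:
        # Assumption: the peptide sequence lives under the key "peptide".
        if len(row["peptide"]) > max_length:
            too_long.append(row)
        else:
            kept.append(row)
    return kept, too_long


# Hypothetical sample rows for illustration only.
rows = [
    {"peptide": "GILGFVFTL", "mhc": "A*02:01"},
    {"peptide": "A" * 31, "mhc": "A*02:01"},
]
kept, too_long = split_by_peptide_length(rows)
print(len(kept), len(too_long))
```

Returning the rejected rows alongside the curated ones lets the caller inspect or log what was dropped instead of losing it silently.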
- pMTnet_Omni_Document.data_curation.check_peptide(df: DataFrame) → Tuple[DataFrame, DataFrame]
Check peptide columns
- Parameters:
df (pd.DataFrame) – A pandas dataframe with pairing data
- Returns:
A pandas dataframe with curated data and a dataframe with dropped data
- Return type:
Tuple[pd.DataFrame, pd.DataFrame]
- pMTnet_Omni_Document.data_curation.check_amino_acids(df_column: DataFrame) → DataFrame
Check amino acids are valid
This function checks whether the amino acids in one column of a dataframe are valid amino acids.
- Parameters:
df_column (pd.DataFrame) – One column of a dataframe
- Returns:
Curated column with invalid amino acids replaced by “_”
- Return type:
pd.DataFrame
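The replacement behaviour described above can be sketched on plain strings. This is an illustration under stated assumptions, not the library's implementation: the valid alphabet is assumed to be the 20 standard one-letter amino acid codes, and `mask_invalid_amino_acids` is a hypothetical helper name. The documented function operates on a dataframe column rather than a list.

```python
# Assumption: the valid set is the 20 standard one-letter amino acid codes.
VALID_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def mask_invalid_amino_acids(sequence, placeholder="_"):
    """Return the sequence with every character outside the valid set replaced."""
    return "".join(
        aa if aa in VALID_AMINO_ACIDS else placeholder for aa in sequence
    )


# Hypothetical sequences; the second contains three invalid characters.
sequences = ["CASSLGQAYEQYF", "CASSXB*QYF"]
cleaned = [mask_invalid_amino_acids(s) for s in sequences]
print(cleaned)
```

Replacing characters one for one keeps the sequence length unchanged, so positions in the cleaned sequence still line up with the original.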
- pMTnet_Omni_Document.data_curation.check_amino_acids_columns(df: DataFrame) → DataFrame
Check all columns with AA sequences
- Parameters:
df (pd.DataFrame) – A pandas dataframe with pairing data
- Returns:
A pandas dataframe with curated data
- Return type:
pd.DataFrame
- class pMTnet_Omni_Document.data_curation.NumpyArrayEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)
Bases: JSONEncoder
- default(obj)
Implement this method in a subclass such that it returns a serializable object for o, or calls the base implementation (to raise a TypeError).
For example, to support arbitrary iterators, you could implement default like this:
    def default(self, o):
        try:
            iterable = iter(o)
        except TypeError:
            pass
        else:
            return list(iterable)
        # Let the base class default method raise the TypeError
        return JSONEncoder.default(self, o)
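The class name suggests that the encoder makes array values JSON-serializable. A minimal sketch of that pattern follows, assuming the conversion goes through a `tolist()` method as NumPy arrays provide. It is not the library's code: `ToListEncoder` is a hypothetical name, and the standard-library `array.array` stands in for a NumPy array so the example has no third-party dependency.

```python
import json
from array import array


class ToListEncoder(json.JSONEncoder):
    """Hypothetical encoder that serializes array-like objects as lists."""

    def default(self, o):
        # Assumption: array-like objects expose a callable tolist() method.
        to_list = getattr(o, "tolist", None)
        if callable(to_list):
            return to_list()
        # Let the base class raise TypeError for unsupported types.
        return super().default(o)


# array.array stands in for a NumPy array in this sketch.
payload = {"embedding": array("d", [0.5, 1.0, 2.0])}
encoded = json.dumps(payload, cls=ToListEncoder)
print(encoded)
```

Passing the class through the `cls` argument of `json.dumps` keeps the conversion logic in one place instead of converting every array by hand before serialization.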
- pMTnet_Omni_Document.data_curation.encode_mhc_seq(df: DataFrame) → dict
Encode MHC sequences
- Parameters:
df (pd.DataFrame) – A pandas dataframe containing pairing data
- Returns:
A dictionary of the mhc sequences and their ESM embeddings
- Return type:
dict
- pMTnet_Omni_Document.data_curation.read_file(file_path: str, save_results: bool = False, output_folder_path: Optional[str] = None, **kwargs) → Tuple[DataFrame, dict]
Reads in the user dataframe and performs some basic data curation
- Parameters:
file_path (str) – Path to the dataframe
save_results (bool, optional) – Whether or not to save the results, by default False
output_folder_path (str, optional) – The path to the output folder, by default None
**kwargs – Other arguments taken by the read_csv function
- Returns:
A curated pandas dataframe
- Return type:
pd.DataFrame