Data Curation
So I see you want to get into the nitty gritty details of our package. While our API Reference provides all the details of the functions defined in the package, it could be a bit overwhelming. In this relatively detailed tutorial, we will provide an in-depth exploration of how the package works.
Note
We will assume that the user has already imported necessary functions using the command:
from pMTnet_Omni_Document.data_curation import read_file
To perform data curation, the read_file function carries out
the following steps sequentially:
- Check column names
Make sure all necessary columns are present
- Check species
Make sure values in
tcr_speciesandpmhc_speciesarehumanormouse- Check VA and Vb
If
vaandvbare not missing, we will look up their corresponding sequences using our reference data at./validation_dataeven if users supplied the sequences. If we can not find the gene/allele names, the records will be dropped.A separate dataframe
./filename_curated_invalid_v.csvwill contain dropped records.Otherwise, we will use the sequence information with minimal curation
- Infer MHC information
Based on information provided in
pmhc_species, we will infer the information on alpha and beta chains as well as MHC classes- Check MHC
If
mhcis not missing, we will first look up its information in our reference data. If we can not found the MHC or the MHC is missing, we will use the sequence information.Two separate dataframes
./filename_curated_mhc_alpha_dropped.csvand./filename_curated_mhc_beta_dropped.csvwill contain dropped records.- Check peptide
Sequences containing more than 30 AAs will be dropped.
A separate dataframe
./filename_curated_antigen_dropped.csvwill contain dropped records.- Check columns with amino acids
Unknown AAs will be replaced with
_- Encode MHC sequences
When the
dfcontains some MHCs that are NOT in our reference data or are missing, we will inzoke the ESM2 algorithm to encode these sequences, produce a dictionary whose keys are the sequences and values are the embeddings, and save the dictionary as a.jsonfile.
To perform data curation, simply call
df = read_file(file_path='file/path/here.csv',
save_df=True,
output_folder_path='output/folder/path/',
sep=',')
You can inspect the values in df as well as all the
.csv files save to the path you specified with a few
modifications on the file names.
Warning
When uploading your dataset or the curated version produced
by read_file function (recommended)
to DBAI,
make sure to upload the .json file as well. Otherwise,
these records will be dropped.
Warning
The current version of pMTnet_Omni will NOT perform data curation. It will read in the data as is. If the data do not conform with its required format, the program WILL halt.