Amino Acids Sequences Format

The columns that read_file expects to contain amino acid sequences are listed below.

Columns of Amino Acids (AA) Sequences
Name	Meaning
vaseq	The AA sequence of the V segment on the Alpha chain
vbseq	The AA sequence of the V segment on the Beta chain
cdr3a	The sequence of amino acids for the CDR3 region on the Alpha chain
cdr3b	The sequence of amino acids for the CDR3 region on the Beta chain
peptide	The sequence of amino acids presented by the MHC
mhcseq	The sequence(s) of amino acids of the corresponding MHC

Warning

For vaseq (resp. vbseq), if the corresponding gene/allele names in va (resp. vb) are provided, the sequences will be ignored. If the names cannot be found in the reference data provided in ./validation_data, the corresponding records will be dropped. The information in these two columns is used only when the names are not supplied.

Warning

For mhcseq, if the corresponding gene/allele names in mhc are provided, we will first look up the corresponding sequences using our own reference data. The information in this column is used only when the names cannot be found in our reference data.

Overall

The one-letter codes of the 20 standard amino acids are accepted by pMTnet Omni.

Note

If a letter in a provided sequence is not one of the 20 amino acids (this includes unknown amino acids such as X, white space, and special characters such as _, +, /), it will first be replaced by _. When the sequence is converted to a matrix of Atchley Factors, each _ is interpreted as 0,0,0,0,0.
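
The replacement rule in the note above can be sketched as follows. This is a minimal illustration, not the code pMTnet Omni actually runs: the function names and the factor_table argument are hypothetical, the input is assumed to be upper case, and no real Atchley Factor values are included (the caller supplies them).

```python
from typing import Dict, List

# The 20 standard one-letter amino acid codes.
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")
PLACEHOLDER = "_"


def sanitize_sequence(seq: str) -> str:
    """Replace every character outside the 20 standard codes with '_'.

    Assumption: matching is case-sensitive, so lower-case letters are
    also replaced. The real package may normalize case first.
    """
    return "".join(ch if ch in STANDARD_AMINO_ACIDS else PLACEHOLDER for ch in seq)


def encode_sequence(seq: str, factor_table: Dict[str, List[float]]) -> List[List[float]]:
    """Map a sequence to rows of five factors; '_' becomes a row of zeros.

    factor_table is a hypothetical lookup from one-letter code to its
    five Atchley Factors.
    """
    zero_row = [0.0] * 5
    return [list(factor_table.get(ch, zero_row)) for ch in sanitize_sequence(seq)]
```

With this sketch, a sequence containing X, a space, and + keeps its length, and each of those three positions contributes a row of zeros to the encoded matrix.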

vaseq, vbseq

Warning

If va (resp. vb) is provided, we will perform a look-up using our reference database provided in ./validation_data regardless of the presence of vaseq (resp. vbseq). Only when va (resp. vb) is missing will the algorithm utilize the actual sequence.

Tip

For new TCR sequences (for instance when performing TCR optimization), simply supply vaseq and vbseq and leave va and vb blank.
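
The precedence between va (resp. vb) and vaseq (resp. vbseq) described above can be summarized in a short sketch. The function name and the reference mapping are hypothetical stand-ins for the look-up against ./validation_data; returning None is used here to mean that the record would be dropped.

```python
from typing import Mapping, Optional


def resolve_v_sequence(
    name: Optional[str],
    seq: Optional[str],
    reference: Mapping[str, str],
) -> Optional[str]:
    """Pick the V segment sequence for one record.

    reference is a hypothetical mapping from gene/allele name to sequence.
    """
    if name:
        # A name was supplied: the user-provided sequence is ignored,
        # and an unknown name yields None (the record is dropped).
        return reference.get(name)
    # No name supplied: fall back to the user-provided sequence.
    return seq or None
```

For a new TCR, passing an empty name and a full vaseq makes the function return the supplied sequence unchanged, which matches the tip above.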

When the information in these two columns is used, only minimal data curation is performed. It is therefore vital that the format of your input sequences conforms with ours. One common mismatch arises when users truncate the CDR3 part of a sequence.

Warning

Do not truncate the CDR3 part of a sequence.

cdr3a, cdr3b

CDR3s usually start from C and end with F. We are aware of different definitions of CDR3s that result in slightly different start and end boundaries of CDR3s.

Note

We will directly use the sequences provided in these two columns with minimal data curation. Therefore, please use the definition that is consistent with ours.

peptide

Warning

Any AA sequence in the peptide column that contains more than 30 amino acids will be dropped.
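
As a rough illustration of this rule, the filter below keeps peptides of up to 30 amino acids and drops longer ones. The record layout (a list of dictionaries with a peptide key) and the function names are assumptions made for the example only.

```python
from typing import Dict, List

MAX_PEPTIDE_LENGTH = 30  # peptides longer than this are dropped


def keep_peptide(peptide: str) -> bool:
    """Return True when the peptide has at most 30 amino acids."""
    return len(peptide) <= MAX_PEPTIDE_LENGTH


def filter_records(records: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Drop every record whose peptide exceeds the length limit."""
    return [record for record in records if keep_peptide(record["peptide"])]
```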

mhcseq

We have already computed the ESM embeddings of around 20,000 MHCs. A value (one or two sequences) in this column is used only when the corresponding value in the mhc column is missing or cannot be found in our database. When this occurs, the ESM2 algorithm will be invoked to encode the sequences. Here we elaborate on the requirements we impose on the format of MHC amino acid sequences. For the MHC names, please refer to MHC Format.
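
The decision between the precomputed embeddings and the mhcseq column can be sketched as below. The function name, the known_names set, and the returned labels are hypothetical; what happens when both the name and the sequence are unusable is not stated here, so the sketch simply returns None in that case.

```python
from typing import Collection, Optional, Tuple


def choose_mhc_source(
    name: Optional[str],
    seq: Optional[str],
    known_names: Collection[str],
) -> Optional[Tuple[str, str]]:
    """Decide where the MHC embedding for one record comes from.

    known_names is a hypothetical collection of MHC names that already
    have a precomputed embedding.
    """
    if name and name in known_names:
        # The name is in the database: mhcseq is not consulted.
        return ("precomputed", name)
    if seq:
        # Name missing or unknown: the sequence(s) would be encoded.
        return ("encode", seq)
    # Assumption: nothing usable was supplied for this record.
    return None
```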

mhcseq requirements

Human Class I: Only the sequence for the Alpha chain is needed. Our program will impute the Beta chain as human_microglobulin, which is already included in our database. Hence, no additional sequence is needed.
Human Class II HLA that starts with DP or DQ: Here we need the information on both chains. The format we assume is Alpha AA sequence followed by a forward slash /, which is then followed by Beta AA sequence.
Human Class II HLA that starts with DR: There are two possible scenarios that we take into account. If the user provides information on both chains, then the inference method follows that of the HLA DP and DQ. On the other hand, if only the information on the Beta chain is supplied, then only the sequence for the Beta chain is needed. Our program will impute the Alpha chain as DRA*01:01, which is already included in our database. Ergo, no additional sequence is needed.
Mouse Class I: Only the sequence for the Alpha chain is needed. Our program will impute the Beta chain as mouse_microglobulin, which is already included in our database. Therefore, no additional sequence is needed.
Mouse Class II: Although the mhc column requires only a single name, the user needs to supply the sequences of both chains. The format is Alpha AA sequence followed by a forward slash /, which is then followed by Beta AA sequence.
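
The requirements above can be read as a small parsing routine. The sketch below is an interpretation, not the package's implementation: the class labels and the function name are hypothetical, and an imputed chain is returned as its reference name (human_microglobulin, mouse_microglobulin, or DRA*01:01) rather than as a sequence.

```python
from typing import Tuple

# Hypothetical class labels mapped to the Beta chain that is imputed.
IMPUTED_BETA = {
    "human_class_i": "human_microglobulin",
    "mouse_class_i": "mouse_microglobulin",
}
IMPUTED_DR_ALPHA = "DRA*01:01"
TWO_CHAIN_CLASSES = {"human_class_ii_dp_dq", "human_class_ii_dr", "mouse_class_ii"}


def split_mhcseq(mhc_class: str, mhcseq: str) -> Tuple[str, str]:
    """Return (alpha, beta) for one mhcseq value."""
    parts = mhcseq.split("/")
    if not all(parts):
        raise ValueError("empty chain in mhcseq")
    if mhc_class in IMPUTED_BETA:
        # Class I: the whole value is the Alpha chain.
        if len(parts) != 1:
            raise ValueError("Class I expects a single Alpha sequence")
        return parts[0], IMPUTED_BETA[mhc_class]
    if mhc_class == "human_class_ii_dr" and len(parts) == 1:
        # Only DRB given: the whole value is the Beta chain.
        return IMPUTED_DR_ALPHA, parts[0]
    if mhc_class in TWO_CHAIN_CLASSES:
        # Both chains given as Alpha/Beta.
        if len(parts) != 2:
            raise ValueError("expected Alpha/Beta separated by '/'")
        return parts[0], parts[1]
    raise ValueError(f"unknown MHC class: {mhc_class}")
```

Raising an error on a malformed value is a design choice of this sketch; how the package itself reacts to such input is not specified here.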

Sample Input
Class	mhc	mhcseq
Human Class I	A*01:01	MAVMA…TACKV
Human Class II: Only DRB	DRB1*01:01	MVCLK…TGFLS
Human Class II: DRA and DRB	DRA*01:01/DRB1*01:01	MAISG…RRGPL/MVCLK…TGFLS
Human Class II: DP	DPA1*04:02/DPB1*01:01	MRPED…AQGPL/MMVLQ…QRGSA
Human Class II: DQ	DQA1*06:04/DQB1*02:07	DHVAS…HQGPL/MSWKK…KGLLH
Mouse Class I	H-2-Db	MGAMA…RDCKA
Mouse Class II	H-2-IAk	MPRSR…HPGPL/MALQI…AGLLQ

The following table provides a brief summary.

MHC Classes and Inference Methods
Class	Inference method
Human Class I	The entire sequence will be interpreted as the sequence for the Alpha chain.
Human Class II: Only DRB	The entire sequence will be interpreted as the sequence for the Beta chain.
Human Class II: Other	Sequences for the Alpha chain and the Beta chain should be separated by `/`
Mouse Class I	The entire sequence will be interpreted as the sequence for the Alpha chain.
Mouse Class II	The value will be interpreted as the sequences for both the Alpha and Beta chains, which should be separated by `/`