Amino Acid Sequence Format

The columns that read_file expects to contain amino acid sequences are listed below.

Columns of Amino Acid (AA) Sequences

| Name | Meaning |
| --- | --- |
| vaseq | The AA sequence of the V segment of the Alpha chain |
| vbseq | The AA sequence of the V segment of the Beta chain |
| cdr3a | The sequence of amino acids for the CDR3 region on the Alpha chain |
| cdr3b | The sequence of amino acids for the CDR3 region on the Beta chain |
| peptide | The sequence of amino acids presented by the MHC |
| mhcseq | The sequence(s) of amino acids of the corresponding MHC |

Warning

For vaseq (resp. vbseq), if the corresponding gene/allele name in va (resp. vb) is provided, the sequence will be ignored. If the name cannot be found in the reference data provided in ./validation_data, the corresponding record will be dropped. Only when the name is not supplied will we use the information given in these two columns.

Warning

For mhcseq, if the corresponding gene/allele name in mhc is provided, we will first look up the corresponding sequence in our own reference data. Only when the name cannot be found in our reference data will we use the information given in this column.
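The precedence rules in the two warnings above can be sketched as follows. This is a minimal illustration, not pMTnet Omni's implementation: the function names, the dictionary-based reference lookup, and the toy entries are assumptions made for the example, and missing values are assumed to be None or an empty string.

```python
from typing import Dict, Optional

# Toy stand-in for the reference data; names and sequences are placeholders.
TOY_REFERENCE: Dict[str, str] = {"GENE_A": "AAAA"}


def resolve_v_sequence(name: Optional[str], seq: Optional[str],
                       reference: Dict[str, str]) -> Optional[str]:
    """Pick the V-segment sequence; None means the record is dropped."""
    if name:
        # A name was supplied, so the sequence column is ignored.
        # A name missing from the reference yields None (record dropped).
        return reference.get(name)
    # No name supplied: fall back to the user-provided sequence.
    return seq or None


def resolve_mhc_sequence(name: Optional[str], seq: Optional[str],
                         reference: Dict[str, str]) -> Optional[str]:
    """Pick the MHC sequence; the reference wins whenever the name is known."""
    if name and name in reference:
        return reference[name]
    # Name missing or unknown: fall back to the user-provided sequence.
    return seq or None
```

Note the asymmetry: an unknown name in va or vb drops the record, whereas an unknown name in mhc falls back to mhcseq.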

Overall

The one-letter codes of the following 20 amino acids are accepted by pMTnet Omni.

[Figure: CDR_letters.png, the accepted one-letter amino acid codes]

Note

If a letter in a provided sequence is not one of the 20 amino acids (this includes unknown amino acids such as X, white space, and special characters such as _, +, /), it will first be replaced by _. When the sequence is converted to a matrix of Atchley factors, each _ is encoded as 0,0,0,0,0.
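The replacement rule in this note can be sketched as below. This is an illustration under stated assumptions, not the tool's own code: the accepted alphabet is assumed to be the 20 standard one-letter codes, the function names are hypothetical, and the factor table must be supplied by the caller (no real Atchley factor values are included here). Case handling is not specified above, so this sketch treats lowercase letters as invalid.

```python
from typing import Dict, List, Sequence

# Assumption: the accepted letters are the 20 standard amino acid codes.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")
N_FACTORS = 5  # Atchley factors per residue


def sanitize(seq: str) -> str:
    """Replace every character outside the accepted alphabet with '_'."""
    return "".join(ch if ch in STANDARD_AA else "_" for ch in seq)


def encode(seq: str, factor_table: Dict[str, Sequence[float]]) -> List[List[float]]:
    """Map each residue to its factor row; '_' becomes a row of zeros."""
    zero_row = [0.0] * N_FACTORS
    return [list(factor_table.get(ch, zero_row)) for ch in sanitize(seq)]


print(sanitize("CASX F+"))  # X, the space, and + are each replaced by _
```

Passing the factor table in as an argument keeps the sketch independent of any particular source of Atchley factor values.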

vaseq, vbseq

Warning

If va (resp. vb) is provided, we will perform a look-up using our reference database provided in ./validation_data regardless of the presence of vaseq (resp. vbseq). Only when va (resp. vb) is missing will the algorithm utilize the actual sequence.

Tip

For new TCR sequences (for instance when performing TCR optimization), simply supply vaseq and vbseq and leave va and vb blank.

When the information in these two columns is used, only minimal data curation is performed. It is therefore vital that the format of your input sequences conforms to ours. One common mismatch arises when users truncate the CDR3 part of a sequence.

Warning

Do not truncate the CDR3 part of a sequence.

cdr3a, cdr3b

CDR3s usually start with C and end with F. We are aware that different definitions of CDR3 result in slightly different start and end boundaries.

Note

We will directly use the sequences provided in these two columns with minimal data curation. Therefore, please use the definition that is consistent with ours.
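Because these columns receive minimal curation, it may help to flag sequences whose boundaries look unusual before submitting them. The check below is only a sketch based on the "usually starts with C and ends with F" convention stated above; the function name and the example sequences are illustrative assumptions, and a failed check signals a sequence worth inspecting rather than a definite error.

```python
def has_usual_cdr3_boundaries(cdr3: str) -> bool:
    """True if the sequence starts with C and ends with F."""
    return len(cdr3) >= 2 and cdr3.startswith("C") and cdr3.endswith("F")


# Illustrative sequences only: the second has its first and last residues trimmed.
for candidate in ["CASSLGQAYEQYF", "ASSLGQAYEQY"]:
    status = "ok" if has_usual_cdr3_boundaries(candidate) else "check boundaries"
    print(candidate, status)
```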

peptide

Warning

Any AA sequence in the peptide column that contains more than 30 amino acids will be dropped.
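To see in advance which records this rule would remove, a filter along the following lines can be applied to the input. This is a sketch, assuming records are held as dictionaries with a peptide key; the function name and the constant are hypothetical.

```python
from typing import Dict, List

MAX_PEPTIDE_LENGTH = 30  # peptides longer than this are dropped


def drop_long_peptides(records: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only records whose peptide has at most MAX_PEPTIDE_LENGTH residues."""
    return [r for r in records if len(r["peptide"]) <= MAX_PEPTIDE_LENGTH]
```

A peptide of exactly 30 residues is kept; one of 31 residues is dropped.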

mhcseq

We have already computed the ESM embeddings of around 20,000 MHCs. A value (one or two sequences) in this column is used only when the corresponding name in the mhc column is missing or cannot be found in our database. When this occurs, the ESM2 algorithm will be invoked to encode the sequences. Here we elaborate on the requirements we impose on the format of MHC amino acid sequences. For the MHC names, please refer to MHC Format.

mhcseq requirements

Human Class I

Only the sequence for the Alpha chain is needed. Our program will impute the Beta chain as human_microglobulin, which is already included in our database. Hence, no additional sequence is needed.

Human Class II HLA that starts with DP or DQ

Here we need the information on both chains. The format we assume is the Alpha AA sequence, followed by a forward slash /, followed by the Beta AA sequence.

Human Class II HLA that starts with DR

There are two possible scenarios that we take into account. If the user provides information on both chains, then the inference method follows that of HLA DP and DQ. If only the Beta chain is supplied, then only the sequence for the Beta chain is needed. Our program will impute the Alpha chain as DRA*01:01, which is already included in our database, so no additional sequence is needed.

Mouse Class I

Only the sequence for the Alpha chain is needed. Our program will impute the Beta chain as mouse_microglobulin, which is already included in our database. Therefore, no additional sequence is needed.

Mouse Class II

Although the mhc column requires only a single name, the user needs to supply the sequences of both chains. The format is the Alpha AA sequence, followed by a forward slash /, followed by the Beta AA sequence.

Sample Input

| Class | mhc | mhcseq |
| --- | --- | --- |
| Human Class I | A*01:01 | MAVMA…TACKV |
| Human Class II: Only DRB | DRB1*01:01 | MVCLK…TGFLS |
| Human Class II: DRA and DRB | DRA*01:01/DRB1*01:01 | MAISG…RRGPL/MVCLK…TGFLS |
| Human Class II: DP | DPA1*04:02/DPB1*01:01 | MRPED…AQGPL/MMVLQ…QRGSA |
| Human Class II: DQ | DQA1*06:04/DQB1*02:07 | DHVAS…HQGPL/MSWKK…KGLLH |
| Mouse Class I | H-2-Db | MGAMA…RDCKA |
| Mouse Class II | H-2-IAk | MPRSR…HPGPL/MALQI…AGLLQ |

The following table provides a brief summary.

MHC Classes and Inference Methods

| Class | Inference method |
| --- | --- |
| Human Class I | The entire sequence will be interpreted as the sequence for the Alpha chain. |
| Human Class II: Only DRB | The entire sequence will be interpreted as the sequence for the Beta chain. |
| Human Class II: Other | Sequences for the Alpha chain and the Beta chain should be separated by / |
| Mouse Class I | The entire sequence will be interpreted as the sequence for the Alpha chain. |
| Mouse Class II | The value will be interpreted as containing both chains: the Alpha sequence and the Beta sequence, separated by / |
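The rules summarized above can be sketched as a small parser that returns an (alpha, beta) pair. This is a hedged illustration rather than the program's actual logic: the class labels, the function name, and the choice to return the imputed chain as its reference name are all assumptions made for the example, and the sequences in the usage lines are toy placeholders.

```python
from typing import Tuple

# Imputed chains, named as in the text above.
CLASS_I_BETA = {
    "human_class_i": "human_microglobulin",
    "mouse_class_i": "mouse_microglobulin",
}
DEFAULT_DR_ALPHA = "DRA*01:01"
# Hypothetical labels for classes whose value may hold "alpha/beta".
TWO_CHAIN_CLASSES = {
    "human_class_ii_dp", "human_class_ii_dq",
    "human_class_ii_dr", "mouse_class_ii",
}


def split_mhcseq(mhc_class: str, mhcseq: str) -> Tuple[str, str]:
    """Return (alpha, beta); an imputed chain is given by its reference name."""
    parts = mhcseq.split("/")
    if mhc_class in CLASS_I_BETA:
        if len(parts) != 1 or not mhcseq:
            raise ValueError("Class I expects a single Alpha sequence")
        return mhcseq, CLASS_I_BETA[mhc_class]
    if mhc_class == "human_class_ii_dr" and len(parts) == 1 and mhcseq:
        # Only the Beta chain was given: the Alpha chain is imputed.
        return DEFAULT_DR_ALPHA, mhcseq
    if mhc_class in TWO_CHAIN_CLASSES:
        if len(parts) != 2 or not all(parts):
            raise ValueError("expected 'alpha/beta' with both chains present")
        return parts[0], parts[1]
    raise ValueError(f"unknown MHC class label: {mhc_class!r}")


print(split_mhcseq("human_class_ii_dr", "BBBB"))
print(split_mhcseq("mouse_class_ii", "AAAA/BBBB"))
```

Raising an error on a malformed value, instead of guessing which chain was intended, keeps formatting mistakes visible before any encoding step runs.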