Amino Acids Sequences Format
Columns that read_file expect to contain sequences of
amino acids are
Name |
Meaning |
|---|---|
vaseq |
The AA sequence the Alpha chain for the V segment |
vbseq |
The AA sequence the Beta chain for the V segment |
cdr3a |
The sequence of amino acids for the CDR3 region on the Alpha chain |
cdr3b |
The sequence of amino acids for the CDR3 region on the Beta chain |
peptide |
The sequence of amino acids presented by the MHC |
mhcseq |
The sequence(s) of amino acids of the corresponding MHC |
Warning
For vaseq (resp. vbseq), if the corresponding gene/allele names in
va (resp. vb) are provided, the sequences will be ignored. If the names
can not be found in the reference data provided in ./validation_data, the
corresponding records will be dropped. Only when
the names are not supplied will we utilize the information given in these two columns.
Warning
For mhcseq, if the corresponding gene/allele names in
mhc are provided, we will first look up the corresponding
sequences using our own reference data. Only when
the names can not be found in our reference data will we utilize
the information given in this column.
Overall
The one-letter code of the following 20 Amino Acids are accepted by pMTnet Omni.
Note
If a letter in a provided sequence is not of the 20 amino acids
(this includes cases such as unknown amnio acids: X,
white spaces: , special characters: _, +, / etc.),
it will first be replaced by _. When converting to a matrix of
Atchley Factors, those _ will be interpreted as 0,0,0,0,0.
vaseq, vbseq
Warning
If va (resp. vb) is provided, we will perform a
look-up using our reference database provided in
./validation_data regardless
of the presence of vaseq (resp. vbseq). Only when
va (resp. vb) is missing will the algorithm utilize the
actual sequence.
Tip
For new TCR sequences (for instance when performing TCR optimization),
simply supply vaseq and vbseq and leave va and
vb blank.
When using information provided in these two columns, minimal data curation will be performed. Therefore, it’s vital to make sure that the format of your input sequences conforms with ours. One such mismatch could happen when the users truncate the CDR3 part of a sequence.
Warning
Do not truncate the CDR3 part of a sequence.
cdr3a, cdr3b
CDR3s usually start from C and end with F. We are aware of different definitions of CDR3s that result in slightly different start and end boundaries of CDR3s.
Note
We will directly use the sequences provided in these two columns with minimal data curation. Therefore, please use the definition that is consistent with ours.
peptide
Warning
Any AA sequence in the peptide column that contains
more than 30 amino acids will be dropped.
mhcseq
We have already computed the ESM embeddings of around 20,000
MHCs. A value (one or two sequences) in this column is used only when we can not find
the corresponding value in the mhc column in our database or is missing.
When this occurs, the ESM2 algorithm will be invoked to encode
the sequences. Here we elaborate on the requirements we impose on the format
of MHC amino acid sequences. For the MHC names, please refer to
MHC Format.
mhcseq requirements
- Human Class I
Only the sequence for the Alpha chain is needed. Our program will impute the Beta chain as
human_microglobulin, which is already included in our database. Hence, no additional sequence is needed.- Human Class II HLA that starts with DP or DQ
Here we need the information on both chains. The format we assume is
Alpha AA sequencefollowed by a forward slash/, which is then followed byBeta AA sequence.- Human Class II HLA that starts with DR
There are two possible scenarios that we take into account. If both the user provided information on both chains, then the inference method follows that of the HLA DP and DQ. On the other hand, if only the information on Beta chain is supplied, then only the sequence for the Beta chain is needed. Our program will impute the Alpha chain as
DRA*01:01, which is already included in our database. Ergo, no additional sequence is needed.- Mouse Class I
Only the sequence for the Alpha chain is needed. Our program will impute the Beta chain as
mouse_microglobulin, which is already included in our database. Therefore, no additional sequence is needed.- Mouse Class II
Although the
mhccolumn only requires one single name. The user needs to supplement the sequences on both chains. The format isAlpha AA sequencefollowed by a forward slash/, which is then followed byBeta AA sequence.
Class |
mhc |
mhcseq |
|---|---|---|
Human Class I |
A*01:01 |
MAVMA…TACKV |
Human Class II: Only DRB |
DRB1*01:01 |
MVCLK…TGFLS |
Human Class II: DRA and DRB |
DRA*01:01/DRB1*01:01 |
MAISG…RRGPL/MVCLK…TGFLS |
Human Class II: DP |
DPA1*04:02/DPB1*01:01 |
MRPED…AQGPL/MMVLQ…QRGSA |
Human Class II: DQ |
DQA1*06:04/DQB1*02:07 |
DHVAS…HQGPL/MSWKK…KGLLH |
Mouse Class I |
H-2-Db |
MGAMA…RDCKA |
Mouse Class II |
H-2-IAk |
MPRSR…HPGPL/MALQI…AGLLQ |
The following table provides a brief summary.
Class |
Inference method |
|---|---|
Human Class I |
The entire sequence will be interpreted as the sequence for the Alpha chain. |
Human Class II: Only DRB |
The entire sequence will be interpreted as the sequence for the Beta chain. |
Human Class II: Other |
Sequences for the Alpha chain and the Beta chain should be separated by |
Mouse Class I |
The entire sequence will be interpreted as the sequence for the Alpha chain. |
Mouse Class II |
The entire sequence will be interpreted as the sequence for both Alpha and Beta chains. |