https://arxiv.org/abs/2402.01439
1. Molecule encoding
1) Representations
(1) Fingerprint
- fixed-length binary string encoding the presence/absence of substructures (e.g., Morgan/ECFP fingerprints)
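A minimal sketch of producing such a binary string with RDKit's Morgan (ECFP-style) fingerprint; the molecule and the radius/nBits parameters are arbitrary illustrative choices, not from the survey:

```python
# Morgan (ECFP-style) fingerprint as a fixed-length binary string.
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin, arbitrary example
fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)
print(fp.ToBitString()[:64])  # 2048-bit binary string; first 64 bits shown
```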
(2) Sequential
- SMILES (1988): has well-known problems (one molecule can map to many strings, and generated strings can be syntactically or chemically invalid)
- SELFIES (2020): fixes these problems; every SELFIES string decodes to a valid molecule
- InChI (2013): focuses on uniqueness
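A quick sketch showing the same molecule in all three sequential representations; it assumes the `selfies` package (pip install selfies) alongside RDKit:

```python
import selfies as sf
from rdkit import Chem

smiles = "CCO"                 # SMILES: ethanol
mol = Chem.MolFromSmiles(smiles)
print(sf.encoder(smiles))      # SELFIES: '[C][C][O]', always decodes to a valid molecule
print(Chem.MolToInchi(mol))    # InChI: one canonical identifier per molecule
```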
(3) Graph
- atoms as nodes, bonds as edges
2) Tokenization
(1) char level
- performs well despite producing chemically odd tokenizations (e.g., splitting the two-character atom 'Cl' into 'C' + 'l'), as the sketch below shows
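A one-liner sketch of character-level tokenization; the molecule is an arbitrary example chosen to show the 'Cl' split:

```python
# Character-level tokenization is just string splitting; note how the
# two-character atom "Cl" breaks into "C" + "l".
smiles = "CC(=O)Nc1ccc(Cl)cc1"
print(list(smiles))  # [..., '(', 'C', 'l', ')', ...]
```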
(2) atom level
- customized atom-level tokenizers (2020) that keep multi-character atoms intact; see the regex sketch below
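A sketch of a regex-based atom-level tokenizer in the style commonly used for SMILES (e.g., in the Molecular Transformer line of work); the exact regex here is an assumption, not the survey's tokenizer:

```python
import re

# Keeps multi-character atoms ([...] blocks, Br, Cl) as single tokens.
ATOM_PATTERN = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|"
    r"\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def atom_tokenize(smiles: str) -> list[str]:
    return ATOM_PATTERN.findall(smiles)

print(atom_tokenize("CC(=O)Nc1ccc(Cl)cc1"))
# ['C', 'C', '(', '=', 'O', ')', 'N', 'c', '1', 'c', 'c', 'c', '(', 'Cl', ')', 'c', 'c', '1']
```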
(3) motif level
a. chemistry-driven
- break molecules into chemically meaningful substructures
- seems to be used mainly with graph representations -> increases cost; hard to design and requires expert knowledge (see the BRICS example in the sketch after this list)
b. data-driven
- similar to BPE: iteratively merging the most frequent pairs of characters into a single token (a toy merge step appears in the sketch after this list)
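Sketches of both motif-level approaches: (a) chemistry-driven fragmentation with RDKit's BRICS rules (one possible choice of fragmentation rules), and (b) a single toy data-driven BPE merge step; the molecule and corpus are arbitrary examples:

```python
# (a) Chemistry-driven: break a molecule into chemically meaningful
# substructures with RDKit's BRICS decomposition.
from rdkit import Chem
from rdkit.Chem import BRICS

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
print(sorted(BRICS.BRICSDecompose(mol)))

# (b) Data-driven: one BPE-style step over character-tokenized SMILES:
# count adjacent token pairs, then merge the most frequent pair everywhere.
from collections import Counter

corpus = [list(s) for s in ["CCO", "CCN", "CCCl", "CCOC"]]  # toy corpus

def most_frequent_pair(seqs):
    pairs = Counter()
    for seq in seqs:
        pairs.update(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(seqs, pair):
    merged = []
    for seq in seqs:
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(seq[i] + seq[i + 1])  # merge the pair into one token
                i += 2
            else:
                out.append(seq[i])
                i += 1
        merged.append(out)
    return merged

pair = most_frequent_pair(corpus)   # ('C', 'C') for this toy corpus
corpus = merge_pair(corpus, pair)
print(corpus)  # [['CC', 'O'], ['CC', 'N'], ['CC', 'C', 'l'], ['CC', 'O', 'C']]
```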
2. Methods
1) Language Modelling Objectives
(1) Masked Language Modelling (MLM)
- a prevalent pretraining objective for LLMs
- MLM is conducted on molecular sequential representations (e.g. SMILES, SELFIES, etc.); a toy masking step is sketched below
- SMILES-BERT, ChemBERTa, MG-BERT, MolFormer, SELFormer, T5Chem, Chemformer, BARTSmiles
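A toy BERT-style masking step over an atom-tokenized SMILES; the 15% rate and the [MASK] token follow common BERT practice and are not tied to any specific model above:

```python
import random

random.seed(0)
tokens = ['C', 'C', '(', '=', 'O', ')', 'O', 'c', '1', 'c', 'c', 'c', 'c', 'c', '1']

masked, labels = [], []
for tok in tokens:
    if random.random() < 0.15:   # mask ~15% of tokens
        masked.append('[MASK]')
        labels.append(tok)       # the model learns to recover these
    else:
        masked.append(tok)
        labels.append(None)      # ignored in the MLM loss

print(masked)
```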
(2) Molecular Property Prediction (MPP)
- predict molecular properties given molecular sequential representations
- generates properties using cheminformatics tools (e.g. RDKit) -> no manual labeling needed (see the sketch below)
- ChemBERTa-2, SPT
- by design, MPP-based models are primarily used for molecular representation learning -> they cannot generate tokens
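A sketch of generating MPP pretraining labels automatically with RDKit; the three descriptors are arbitrary example choices:

```python
# Property labels from a cheminformatics tool; no manual annotation.
from rdkit import Chem
from rdkit.Chem import Descriptors

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
labels = {
    "mol_wt": Descriptors.MolWt(mol),   # molecular weight
    "logp": Descriptors.MolLogP(mol),   # lipophilicity estimate
    "tpsa": Descriptors.TPSA(mol),      # topological polar surface area
}
print(labels)  # regression targets for MPP pretraining
```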
(3) Autoregressive Token Generation (ATG)