Paper Review

LLMs in Chemistry

Sungyeon Kim 2024. 4. 1. 20:29

https://arxiv.org/abs/2402.01439

1. Molecule encoding

1) Representations

(1) Fingerprint

- binary string; each bit flags the presence or absence of a substructure (sketch below)
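
A minimal sketch of a fingerprint as a binary string, using RDKit's Morgan fingerprint; the library, radius, and bit size are illustrative choices, not taken from the paper:

```python
# Morgan (ECFP-like) fingerprint as a fixed-length bit vector.
# RDKit, radius=2 and nBits=2048 are illustrative assumptions.
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("CCO")                        # ethanol
fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
print(fp.ToBitString()[:64])                           # '0100...' -- each bit flags a substructure
```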

 

(2) Sequential 

- SMILES (1988): has many issues (e.g. strings can be syntactically invalid or non-unique)

- SELFIES (2020): addresses these issues (every SELFIES string decodes to a valid molecule)

- InChI (2013): focuses on uniqueness (see the sketch below)
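
A small sketch showing the three sequential formats for the same molecule; it assumes RDKit and the `selfies` package, neither of which is prescribed by the paper:

```python
# Ethanol in the three sequential formats (RDKit + the `selfies` package).
from rdkit import Chem
import selfies as sf

smiles = "CCO"
mol = Chem.MolFromSmiles(smiles)

print(smiles)                    # SMILES:  'CCO'
print(sf.encoder(smiles))        # SELFIES: '[C][C][O]' -- always decodes to a valid molecule
print(Chem.MolToInchi(mol))      # InChI:   'InChI=1S/C2H6O/c1-2-3/h3H,1-2H3' -- a unique identifier
```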

 

(3) Graph

 

2) Tokenization

(1) char level

- performs well even though the tokenization is chemically odd (e.g. two-letter elements get split), as in the sketch below
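
A character-level tokenization sketch (illustrative only):

```python
# Character-level tokenization: every character becomes a token,
# even when that splits a two-letter element such as 'Cl' or 'Br'.
smiles = "CC(=O)Oc1ccccc1C(=O)O"        # aspirin
print(list(smiles))
# ['C', 'C', '(', '=', 'O', ')', 'O', 'c', '1', 'c', 'c', 'c', 'c', 'c', '1', ...]
print(list("ClCCBr"))                   # ['C', 'l', 'C', 'C', 'B', 'r'] -- atoms get split
```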

 

(2) atom level

- customized atom-level tokenizers (2020): one token per atom (multi-character elements and bracket atoms stay intact)
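
A sketch of a regex-based atom-level SMILES tokenizer; the regex is an illustrative assumption, not the exact tokenizer of any surveyed model:

```python
import re

# Atom-level SMILES tokenizer (sketch). Two-letter elements and bracket atoms
# are kept as single tokens, unlike the char-level split above.
ATOM_PATTERN = re.compile(
    r"\[[^\]]+\]"                  # bracket atoms, e.g. [nH], [NH3+]
    r"|Br|Cl"                      # two-letter organic-subset elements
    r"|[BCNOSPFI]"                 # one-letter elements
    r"|[bcnops]"                   # aromatic atoms
    r"|%\d{2}|\d"                  # ring-closure labels
    r"|[=#\-\+\(\)/\\\.@:~\*\$]"   # bonds, branches, charges, stereo marks
)

def tokenize(smiles):
    return ATOM_PATTERN.findall(smiles)

print(tokenize("CC(=O)Oc1ccccc1C(=O)O"))   # same as char level here (no two-letter atoms)
print(tokenize("ClCCBr"))                  # ['Cl', 'C', 'C', 'Br'] -- atoms stay whole
```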

 

(3) motif level

a. chemistry-driven

- break molecules into chemically meaningful substructures

- seems to be used mainly with graph representations -> higher cost, harder, and requires expert knowledge (see the BRICS sketch below)
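
As one concrete example of chemistry-driven fragmentation, the sketch below uses RDKit's BRICS decomposition; this is an assumption for illustration, not necessarily the scheme used by the surveyed models:

```python
# Break a molecule into chemically meaningful fragments with BRICS (RDKit).
from rdkit import Chem
from rdkit.Chem import BRICS

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")   # aspirin
print(sorted(BRICS.BRICSDecompose(mol)))
# e.g. ['[1*]C(C)=O', '[16*]c1ccccc1[16*]', ...]    # fragments keep attachment points
```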

b. data-driven

- similar to BPE: iteratively merge the most frequent pair of characters into a single token (sketch below)
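
A minimal BPE-style merge loop over character-tokenized SMILES; the toy corpus and number of merges are purely illustrative:

```python
from collections import Counter

# Data-driven (BPE-style) tokenizer sketch: start from characters and
# repeatedly merge the most frequent adjacent pair into a single token.
corpus = [list("CCO"), list("CC(=O)O"), list("CCN"), list("CC(=O)N")]

def most_frequent_pair(seqs):
    pairs = Counter()
    for seq in seqs:
        pairs.update(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge(seqs, pair):
    merged = []
    for seq in seqs:
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(seq[i] + seq[i + 1])   # fuse the pair into one token
                i += 2
            else:
                out.append(seq[i])
                i += 1
        merged.append(out)
    return merged

for _ in range(3):                       # three merge steps for illustration
    corpus = merge(corpus, most_frequent_pair(corpus))
print(corpus)                            # e.g. [['CC', 'O'], ['CC', '(=', ...], ...]
```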

 

2. Methods

1) Language Modelling Objectives

(1) Masked Language Modelling (MLM)

- prevalent pretraining objective for LLMs

- MLM is conducted on molecular sequential representations (e.g. SMILES, SELFIES, etc.)

- SMILES-BERT, ChemBERTa, MG-BERT, MolFormer, SELFormer, T5Chem, Chemformer, BARTSmiles
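
A minimal sketch of the MLM objective on a tokenized SMILES string; the mask rate and [MASK] symbol are illustrative assumptions, not the recipe of any model listed above:

```python
import random

# Masked language modelling sketch: randomly replace ~15% of tokens with [MASK];
# the model is trained to recover the original tokens at the masked positions.
tokens = ['C', 'C', '(', '=', 'O', ')', 'O', 'c', '1', 'c', 'c', 'c', 'c', 'c', '1']

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
    masked, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)        # the model must predict this original token
        else:
            masked.append(tok)
            labels.append(None)       # not part of the MLM loss
    return masked, labels

print(mask_tokens(tokens))
```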

 

(2) Molecular Property Prediction (MPP)

- predict molecular properties given molecular sequential representations

- generates property labels with cheminformatics tools (e.g. RDKit) -> no manual labelling needed (see the sketch after this list)

- ChemBERTa-2, SPT

- by design, MPP-based models are used primarily for molecular representation learning -> they cannot generate tokens
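
A sketch of generating property labels with RDKit so that no manual labelling is needed; the specific descriptors are illustrative choices:

```python
# Compute property labels for MPP-style pretraining with RDKit descriptors.
from rdkit import Chem
from rdkit.Chem import Descriptors

smiles = "CC(=O)Oc1ccccc1C(=O)O"                   # aspirin
mol = Chem.MolFromSmiles(smiles)

labels = {
    "mol_weight": Descriptors.MolWt(mol),          # molecular weight
    "logp": Descriptors.MolLogP(mol),              # Crippen logP estimate
    "tpsa": Descriptors.TPSA(mol),                 # topological polar surface area
}
print(labels)                                      # targets for property prediction
```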

 

(3) Autoregressive Token Generation (ATG)