SMILES strings with aromatic rings in the dataset are non-standard #3

Open
hengzzzhou opened this issue Apr 21, 2024 · 0 comments

hengzzzhou commented Apr 21, 2024

Hello,

Thank you for your work.

I ran into a problem with the dataset provided at https://zenodo.org/records/7928396. After running `python scripts/prepare_data.py --data_path examples/data_ir.pkl --output_path examples/train`, the generated tgt.txt files contain aromatic SMILES strings that appear to be non-standard.

Converting these SMILES strings to SELFIES strings with the selfies library raises errors. The problem seems to originate from the "[c]" token in the SMILES strings, which does not appear to be a standard representation. For example, manually changing C=C(C)C(=O)Oc1cc[c]cc1 to C=C(C)C(=O)Oc1ccccc1 resolves the issue and lets the conversion proceed normally.
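For reference, the manual edit described above can be sketched with the Python standard library alone. This is only an illustration: the function names `find_bare_aromatic`, `strip_bare_aromatic`, and `count_affected` are hypothetical and are not part of this repository or of selfies. One caveat: in SMILES a bracket atom carries no implicit hydrogens, so "[c]" denotes an aromatic carbon without a hydrogen. Replacing it with "c" therefore adds a hydrogen, which is only appropriate if the bracket really is an artifact of how the data was generated.

```python
import re

# Matches a bracketed aromatic carbon with nothing else inside the brackets.
# Assumption: "[c]" is the only token that causes the conversion errors.
BARE_AROMATIC_C = re.compile(r"\[c\]")


def find_bare_aromatic(smiles: str) -> list[str]:
    """Return every "[c]" token found in a SMILES string."""
    return BARE_AROMATIC_C.findall(smiles)


def strip_bare_aromatic(smiles: str) -> str:
    """Replace "[c]" with "c", mirroring the manual edit described above.

    Note: this changes the molecule (the carbon gains an implicit hydrogen).
    """
    return BARE_AROMATIC_C.sub("c", smiles)


def count_affected(lines) -> int:
    """Count how many lines (e.g. read from tgt.txt) contain at least one "[c]"."""
    return sum(1 for line in lines if BARE_AROMATIC_C.search(line))


if __name__ == "__main__":
    example = "C=C(C)C(=O)Oc1cc[c]cc1"
    print(find_bare_aromatic(example))   # ['[c]']
    print(strip_bare_aromatic(example))  # C=C(C)C(=O)Oc1ccccc1
```

Running `count_affected` over the lines of a generated tgt.txt would show how many entries are affected before deciding whether a blanket replacement is acceptable.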

I suspect there might be an issue with the script used for generating IR spectra, particularly in how it handles aromatic rings.

I appreciate your attention to this matter and look forward to your response.

