Hi all, I guess there are probably several ways of doing this. I'm looking to get access to every PubChem CID. All compounds are available for bulk download, but the files are large and there are many of them, and I don't need most of the data they contain.

Any idea how to get every single CID in PubChem as a plain list?

Once I have a list of all valid CIDs, I can use pubchempy (a Python wrapper for the PubChem API) to retrieve the SMILES string for each one and then use RDKit to generate a fingerprint of my choosing.
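To make that step concrete, here is a rough sketch of what I have in mind, not a tested pipeline. It assumes pubchempy's `get_properties` accepts a list of CIDs and returns one dict per compound, and that RDKit's `rdFingerprintGenerator` module is available. The helper names (`chunked`, `extract_smiles`, `fetch_smiles`, `morgan_fingerprint`), the batch size of 100, and the Morgan radius/size are my own placeholders. The name of the SMILES key in the returned records may differ between PubChem/pubchempy versions, so the sketch just takes the first key ending in "SMILES".

```python
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch


def extract_smiles(record):
    """Return the first value whose key ends in 'SMILES', or None.

    Assumption: the exact property key may vary between versions,
    so no particular key name is relied upon here.
    """
    return next((v for k, v in record.items() if k.endswith("SMILES")), None)


def fetch_smiles(cids, batch_size=100):
    """Map CID -> SMILES via pubchempy, querying in batches (needs network)."""
    import pubchempy as pcp  # imported here so the helpers above work without it

    smiles_by_cid = {}
    for batch in chunked(cids, batch_size):
        records = pcp.get_properties("IsomericSMILES", batch, namespace="cid")
        for record in records:
            smiles = extract_smiles(record)
            if smiles:
                smiles_by_cid[record["CID"]] = smiles
    return smiles_by_cid


def morgan_fingerprint(smiles, radius=2, n_bits=2048):
    """Return an RDKit Morgan bit-vector fingerprint, or None if parsing fails."""
    from rdkit import Chem
    from rdkit.Chem import rdFingerprintGenerator

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    generator = rdFingerprintGenerator.GetMorganGenerator(radius=radius, fpSize=n_bits)
    return generator.GetFingerprint(mol)
```

Batching is only there to avoid one HTTP request per CID; for the whole of PubChem even this would presumably be far too slow, which is part of why I'm asking.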

I will then generate the same molecular structure fingerprint for my N case compounds (around 200; I already have their CIDs) and compute similarity scores between each case compound and every fingerprint from the CIDs I manage to retrieve.
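For the comparison itself, this is a minimal sketch assuming Tanimoto similarity (one common choice; I haven't settled on a metric) and assuming each fingerprint is represented as a set of on-bit indices, which I believe can be obtained from an RDKit bit vector with `set(fp.GetOnBits())`. The function names and the `top_k` parameter are placeholders of mine.

```python
import heapq


def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two sets of on-bit indices."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0


def best_matches(case_fps, library_fps, top_k=5):
    """For each case CID, return the top_k most similar library CIDs.

    Both arguments map CID -> set of on-bit indices. A case compound is
    not compared against its own entry in the library.
    """
    results = {}
    for case_cid, case_bits in case_fps.items():
        scored = (
            (lib_cid, tanimoto(case_bits, lib_bits))
            for lib_cid, lib_bits in library_fps.items()
            if lib_cid != case_cid
        )
        # nlargest keeps only top_k candidates in memory at a time
        results[case_cid] = heapq.nlargest(top_k, scored, key=lambda pair: pair[1])
    return results


# Toy data with made-up CIDs and bit sets, just to show the call shape
case = {1: {1, 2, 3}}
library = {10: {2, 3, 4}, 11: {1, 2, 3}, 12: {7, 8}}
matches = best_matches(case, library, top_k=2)
```

With real RDKit bit vectors, `DataStructs.BulkTanimotoSimilarity` would probably be much faster than the pure-Python loop above; the set-based version is just to show the logic.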
