I need a dataset with known organism information: which organisms are in the dataset and, ideally, the coverage of each one. I searched the web but did not find any.
One thing you could do is make your own. MetaSim is one package for this, though I'm sure there are others. It lets you set up a number of genome sequences and create a simulated set of sequence reads from a mixture of those genomes in known proportions.
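To make the idea concrete, here is a minimal Python sketch of drawing reads from a mixture of genomes in known proportions. It is not how MetaSim or any other simulator works internally. The function names (`simulate_reads`, `expected_coverage`) are made up for illustration, and it assumes error-free, fixed-length reads sampled uniformly along each genome, with the proportions meaning shares of reads rather than cell abundances.

```python
import random


def simulate_reads(genomes, proportions, n_reads, read_len, seed=0):
    """Draw error-free reads from a mixture of genomes.

    genomes:     dict mapping organism name -> genome sequence (str)
    proportions: dict mapping organism name -> relative share of reads
                 (weights need not sum to 1)
    Returns a list of (source_name, read_sequence) tuples, so the true
    origin of every read is known.
    """
    rng = random.Random(seed)
    names = list(genomes)
    weights = [proportions[name] for name in names]
    reads = []
    for _ in range(n_reads):
        # Pick the source genome according to the requested proportions.
        name = rng.choices(names, weights=weights, k=1)[0]
        seq = genomes[name]
        # Pick a uniformly random start so the whole read fits in the genome.
        start = rng.randrange(len(seq) - read_len + 1)
        reads.append((name, seq[start:start + read_len]))
    return reads


def expected_coverage(genomes, proportions, n_reads, read_len):
    """Expected per-genome coverage: (reads assigned * read length) / genome length."""
    total = sum(proportions.values())
    return {
        name: n_reads * (proportions[name] / total) * read_len / len(seq)
        for name, seq in genomes.items()
    }
```

Because each read carries its source label, both the membership and the expected coverage of every organism are known up front, which is the property asked for in the original question. A real simulator would also model sequencing errors, read-length distributions, and paired ends.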
Maybe I did not make myself clear. I mean real sequenced data: for example, a bacterial community with known organisms that has been sequenced.
Actually, I used a simulator (Grinder) to generate my own. My concern is whether the simulated data is similar enough to real sequencing data.
Ah, I see. We have done this with synthetic mixed viral populations, but not with bacterial ones. It seems like something someone would have needed to do as a control to demonstrate the effectiveness of their latest metagenomics toolkit...
You may want to check this article; they did exactly what you are asking for: "Assessment of metagenomic assembly using simulated next generation sequencing data". Best.
Shakya, Migun, et al. "Comparative metagenomic and rRNA microbial diversity characterization using archaeal and bacterial synthetic communities." Environmental microbiology (2013).