Interesting question. I think you'll find the following paper relevant; the topic itself isn't so new (see https://www.wired.com/2009/04/robotscientist/ or, more recently, https://www.theatlantic.com/science/archive/2017/04/can-scientific-discovery-be-automated/524136/ ).
Automated Hypothesis Testing with Large Scientific Data Repositories
Well, it depends on how the "scientific hypotheses" are formulated. If it's a Bayesian robot working with conditional probabilities and probability distributions over models, the standard way to compare models is the Kullback–Leibler divergence (or relative entropy). For a short intro you might refer to https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence.
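Just to make the idea concrete, here is a minimal sketch (my own illustration, not from any cited paper) of comparing two hypothetical model distributions against an empirical data distribution with KL divergence; the bin values and hypothesis names are made-up assumptions:

    # A minimal sketch: compare two candidate model distributions against
    # an empirical data distribution via KL divergence. All numbers below
    # are invented for illustration.
    import numpy as np
    from scipy.stats import entropy  # entropy(p, q) computes KL(p || q)

    # Empirical distribution of some observed quantity (normalized histogram).
    data_hist = np.array([0.10, 0.25, 0.40, 0.20, 0.05])

    # Predictive distributions of two competing hypotheses over the same bins.
    hypothesis_a = np.array([0.12, 0.23, 0.38, 0.22, 0.05])
    hypothesis_b = np.array([0.30, 0.30, 0.20, 0.10, 0.10])

    # Lower KL(data || model) means the model's predictions sit closer to the data.
    kl_a = entropy(data_hist, hypothesis_a)
    kl_b = entropy(data_hist, hypothesis_b)
    print(f"KL(data || A) = {kl_a:.4f}")  # smaller value -> A explains the data better
    print(f"KL(data || B) = {kl_b:.4f}")

The "robot" would then prefer the hypothesis with the smaller divergence (or, in a fully Bayesian setting, weight the models by their posterior probabilities instead of picking a single winner).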
The title of the paper Rafal cited points in another direction of how a "scientific hypothesis" can be tested: each hypothesis is evaluated against the data, and depending on its evaluation statistics the "best" hypothesis is chosen. "Best" could mean "highest accuracy", "fewest false positives", "largest set of cases explained by the hypothesis", "simplest hypothesis" (a.k.a. Occam's Razor), "smallest deviation between predictions and data" (a.k.a. minimal error), etc. Again, the right measure depends on the kind of data. A toy sketch of this evaluate-and-pick loop follows below.
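To illustrate that evaluate-and-pick idea (my own toy sketch, not the method of the cited paper), one can score a handful of candidate hypotheses on synthetic data with a single error measure; the hypothesis functions, the data-generating line, and the choice of mean squared error as the score are all assumptions for the example:

    # Toy sketch: evaluate each candidate hypothesis against the data and
    # pick the one with the smallest prediction error.
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, size=100)
    y = 2.0 * x + 1.0 + rng.normal(0.0, 1.0, size=100)  # synthetic "observations"

    # Candidate hypotheses: each maps x to a prediction of y.
    hypotheses = {
        "linear":    lambda x: 2.0 * x + 1.0,
        "quadratic": lambda x: 0.2 * x**2 + 1.0,
        "constant":  lambda x: np.full_like(x, y.mean()),
    }

    # Score = mean squared deviation between predictions and data (smaller is better).
    scores = {name: float(np.mean((h(x) - y) ** 2)) for name, h in hypotheses.items()}
    best = min(scores, key=scores.get)
    print(scores)
    print("best hypothesis under MSE:", best)

Swapping the score for accuracy, a false-positive count, or an error term plus a complexity penalty changes which hypothesis wins, which is exactly why the right measure depends on the data and the goal.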
Anyway, I think the pure testing of hypotheses is not the big problem. The harder problem is probably the "creative step" of coming up with reasonable hypotheses in the first place. Any kind of exhaustive search through the space of hypotheses is almost certainly impractical, and coming up with a more intelligent approach requires background knowledge about the domain or some kind of oracle-like intuition mechanism.