In many machine learning projects, especially those built on messy real-world datasets, shortfalls in a model's performance are usually traced back to either the model's architecture or the quality and structure of the input data. In practice, though, these two sources of error are not easy to disentangle: improvement can come from architectural changes, hyperparameter tuning, and optimization heuristics just as much as from better data preprocessing, relabeling, or reconsidering the features used for representation.

How do you go about this decision? When do you reach the point where you stop refining the model and pivot to concentrating on the dataset? Are there empirically established learning curves, analytical tools, or other indicators that tell you whether you have hit a "data ceiling" rather than a "model ceiling"? I'd like to hear about frameworks, intuitions, or concrete examples across domains such as vision, language, and sensor data that you have found helpful.
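For concreteness, the kind of diagnostic I have in mind is a data-scaling learning curve: train on increasing fractions of the data and watch whether the validation score is still climbing (more or better data would likely help) or has plateaued with a small train-validation gap (the model, features, or labels are the more likely ceiling). A minimal sketch with scikit-learn, using a placeholder model and dataset that would be swapped for your own:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Placeholder dataset and model, purely for illustration.
X, y = load_digits(return_X_y=True)
model = RandomForestClassifier(random_state=0)

# Cross-validated scores at increasing training-set sizes.
sizes, train_scores, val_scores = learning_curve(
    model, X, y,
    train_sizes=np.linspace(0.1, 1.0, 8),
    cv=5, scoring="accuracy", n_jobs=-1,
)

# If the validation curve is still rising at full size, data is likely
# the bottleneck; if it has flattened near the training curve, further
# gains probably require changes to the model or representation.
plt.plot(sizes, train_scores.mean(axis=1), "o-", label="train")
plt.plot(sizes, val_scores.mean(axis=1), "o-", label="validation")
plt.xlabel("training set size")
plt.ylabel("accuracy")
plt.legend()
plt.show()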
