Crowd-sourced machine learning for NMR
The rise of machine learning (ML) has created an explosion in the number of strategies that can be used to learn from data and make scientific predictions. For physical scientists who wish to apply ML to a particular domain, the result is bewildering: it is difficult to assess a priori which strategy to adopt within a vast space of possibilities.
To address this search problem, we recently teamed up with Kaggle to run a crowd-sourced community competition that searched and analysed the space of possible ML strategies for predicting pairwise NMR properties. Over 3 months, we received 47,800 ML model submissions from 2,700 teams in 84 countries, surpassing anything we could have achieved on our own. Analysis of the results shows that ensemble-based ML models constructed as linear combinations of the top 50 submissions achieve a prediction accuracy better than that of any individual model, and nearly 3 orders of magnitude better than that of our previous approaches.
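The idea of an ensemble built as a linear combination of submissions can be illustrated with a minimal sketch. It is not the procedure used in the competition analysis. It assumes that each model's predictions are available on a common held-out set and that the combination weights are fitted by ordinary least squares. The data are synthetic, and the names `fit_blend_weights` and `mean_abs_error`, the five stand-in models and their noise levels are all hypothetical.

```python
import numpy as np


def fit_blend_weights(preds, target):
    """Return least-squares weights w minimising ||preds @ w - target||^2.

    preds:  (n_samples, n_models) array, one column of predictions per model.
    target: (n_samples,) array of reference values.
    """
    weights, *_ = np.linalg.lstsq(preds, target, rcond=None)
    return weights


def mean_abs_error(pred, target):
    """Mean absolute error between two 1-D arrays."""
    return float(np.mean(np.abs(pred - target)))


# Synthetic stand-in for competition data (assumption: independent model errors).
rng = np.random.default_rng(0)
n_samples, n_models = 2000, 5
target = rng.normal(size=n_samples)
noise_scales = np.array([0.10, 0.12, 0.15, 0.20, 0.30])  # hypothetical error levels
preds = target[:, None] + rng.normal(size=(n_samples, n_models)) * noise_scales

# Fit the weights on one half of the data and evaluate on the other half,
# so the comparison is not biased by fitting and scoring on the same samples.
half = n_samples // 2
weights = fit_blend_weights(preds[:half], target[:half])
blend = preds[half:] @ weights

best_single_mae = min(
    mean_abs_error(preds[half:, j], target[half:]) for j in range(n_models)
)
blend_mae = mean_abs_error(blend, target[half:])

print("weights:", np.round(weights, 3))
print("best single-model MAE:", round(best_single_mae, 4))
print("blended MAE:", round(blend_mae, 4))
```

In this toy setting the blend beats the best single model because the models' errors are independent and partially cancel. With real submissions the errors are typically correlated, so the gain depends on how diverse the models are.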