Stop data sharing
The rapid developments in the field of machine learning have also brought along some existential challenges, which are in essence all related to the broad concept of ‘trust’. Aspects of this broad concept include trust in the output of any ML process (and the prevention of black boxes, hallucinations and so forth). The very trust in science is at stake, especially now that paper mills are emerging and aggravating the perverse reward systems in current research environments, which are stuck in 20th-century (in fact 17th-century) scholarly communication. The other side of the same coin is that ML, if not properly controlled, will also break through security and privacy barriers and violate the GDPR and other ethical, legal and societal requirements, including equitability. In addition, the ‘existence’ of data somewhere by no means implies its actual Reusability. This touches all four of the by now well-established elements of the FAIR principles: much data is not even Findable; if found, not Accessible under well-defined conditions; and if accessed, not Interoperable (understandable by third parties and machines). As a result, the vast majority of data and information is not Reusable without violating copyrights, privacy regulations or the basic conceptual models that implicitly or explicitly underpin the query or the deep learning algorithm. This keynote will address how ‘data visiting’, as opposed to classical ‘data sharing’, which carries the connotation of data downloads, transport and losing control, mitigates most, if not all, of the unwanted side effects of classical ‘data sharing’. For federated data visiting, the data should be FAIR in an additional sense: they should be ‘Federated, AI-Ready’, so that visiting algorithms can answer questions related to Access Control, Consent and Format, and can read rich (FAIR) metadata about the data itself to determine whether they are ‘fit for purpose’ and machine actionable (i.e. FAIR Digital Objects, or Machine Actionable Units). The ‘fitness for purpose’ concept goes way beyond (but includes) information about methods, quality, error bars etc. The ‘immutable logging’ of all operations of visiting algorithms is crucial, especially when self-learning algorithms in ‘swarm learning’ are being used. Enough to keep us busy for a while.