Project AI Grand Challenges
In this project we are exploring how the experience and approaches taken in AI grand challenges can provide data and knowledge that may help inform regulation. The project's main aim is to collect information and opinions and to provide an outline of relevant points. We plan to host several meetings, and all content will be provided here.
The announcement page is here (link) with a summary of the topic.
We had a great first meeting with >80 participants.
Francesco Ciompi, PhD, and Roberto Salgado, MD, presented their ideas. If you were not able to make it, please check out the slides and recording below.
If you have additional questions – please feel free to email us.
We have already received questions and comments. At the end of the webinar, we agreed to start a project to enable all stakeholders to comment on this important topic.
Here we are.
There are three concrete next steps:
COMMENTS: Feel free to provide comments directly to us via e-mail. We can add these to the various topics outlined below – we really want to give everyone a chance to have their voice heard.
BRAINSTORMING: We plan to host a brainstorming meeting on Monday, March 28th, 2022 at 12:05 PM ET. This is an opportunity to voice your opinions, suggestions, and comments in a live session with all interested stakeholders.
CHALLENGE REVIEW: The presenters also proposed to hold a formal review meeting once the statistical analysis plan is final.
Session 2: Discussion
April 25, 2022
Meeting content coming soon.
DISCLAIMER ABOUT THE FOLLOWING DISCUSSION
This statement has not been specifically authorized by the FDA.
This statement reflects the shared questions and contributions of non-FDA stakeholders, solicited through a public convening (i.e., a webinar). While many have contributed and will continue to do so, it is challenging to encompass all viewpoints. We will attempt to root our statement in fact wherever possible, and we have captured the current discussion in a working group where we solicited input. We welcome additional comments via email: digipathalliance@gmail.com
The focus of PIcc is regulatory science, not policy (i.e., health policy). Health policy is within the interest of some PIcc members but is not addressed within this project.
This statement (like PIcc in general) focuses on regulatory science, and specifically on how the experience and approaches taken in AI grand challenges can provide data and knowledge that can help inform implementation, regulation, and policy. Keep in mind that medical research can be the basis both for innovation and for defining evidence-based regulatory frameworks.
The topic of AI and grand challenges is very broad. The following statement focuses on the regulatory science implications directly related to (digital) pathology – not on any other technologies or devices.
OVERVIEW OF THE DISCUSSION POINTS
Questions remain anonymous; answers were provided by Francesco Ciompi (FC).
Hosting reader studies on the grand-challenge platform: Very interesting. Thanks for the info.
FC: You're welcome. For the moment this feature is only available to trusted users; let me know if you would be interested in it, and I can discuss it internally.
You described two test sets: an experimental set for testing during the challenge (multiple runs), and a final set for testing at the end of the challenge (one run). Great strategy. You also described leaving the challenge open beyond the current timeline and not sharing the final test set. After the challenge, are you going to limit teams to one run on the final test set? That would limit implicit training on the test set and reduce any associated biases.
FC: That's correct; we are not planning to share the test set, so that TIGER can remain an unbiased benchmark for future evaluations outside the context of the challenge itself. We have not discussed a specific policy for the number of runs after the challenge. Runs are quite expensive in terms of computing power, as algorithms might take >1 hour per slide, and with hundreds of slides this is not trivial to support; so we are planning to see how the challenge goes and how demanding the algorithms are, and then decide. Offering only one run per algorithm on the final test set makes sense, and we can still use the experimental test set for more than one run, also to do some "sanity checks" and identify potential technical issues in submitted algorithms.
I think the challenge rules indicated that participants cannot train on data external to the challenge. This, obviously, reduces the regulatory utility. Can I assume that any algorithm can be tested after the challenge, with the performance results perhaps kept private?
FC: I see your point. We had to set this rule for the duration of TIGER because some subsets of the test set might contain slides derived from the public domain, and therefore there is no way for us to control whether some of them end up in the training set of some participants. This rule of course applies only to the challenge; for a post-challenge evaluation it should be fine to have algorithms developed using other data as well, provided this is clearly indicated in the submission and in the description of the method. As for keeping the results private, we have not discussed any policy yet; it is something we can discuss after the challenge ends.
Are you providing the actual performance results on the multiple runs on the experimental test set to the algorithm developers? We’ve done some work on multiple testing with the same dataset (see attached).
FC: The only results that developers will see from the multiple runs on the experimental test set are the metrics displayed on the leaderboards: the c-index (with confidence interval) for leaderboard 2, and the Dice and FROC-based scores for leaderboard 1.
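As a point of reference for the segmentation metric mentioned above, here is a minimal Python sketch of how a Dice score between a predicted and a reference binary mask could be computed. The function and array names are illustrative assumptions, not the official TIGER evaluation code.

```python
import numpy as np

def dice_score(pred_mask: np.ndarray, ref_mask: np.ndarray, eps: float = 1e-8) -> float:
    """Dice coefficient between a predicted and a reference binary segmentation mask."""
    pred = pred_mask.astype(bool)
    ref = ref_mask.astype(bool)
    intersection = np.logical_and(pred, ref).sum()
    return (2.0 * intersection + eps) / (pred.sum() + ref.sum() + eps)

# Toy example with hypothetical 2x3 masks (1 = tissue class of interest, 0 = background)
pred = np.array([[1, 1, 0], [0, 1, 0]])
ref = np.array([[1, 0, 0], [0, 1, 1]])
print(f"Dice: {dice_score(pred, ref):.3f}")  # 2*2 / (3+3) ≈ 0.667
```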
I’d like to see more details on the performance assessment of prognostic value (prediction of cancer recurrence), i.e., the “concordance index of a multivariate Cox regression model”. Is there a script with an example implementing this assessment? It is very interesting that end users don’t have cancer recurrence data for training.
FC: We are currently working on finalizing the R script for the survival analysis of leaderboard 2. We have had some technical issues because of the time algorithms take to process all slides, and we had to postpone the opening of leaderboard 2; we hope we can open it before the end of this week. For leaderboard 1, we have released a Python script that helps participants compute the FROC score. We have not discussed a similar release for leaderboard 2 yet, but I expect it should be possible to release the R script as well, although participants might find it difficult to test locally, as we do not share any training data with clinical and survival information. The idea is that participants should be creative and "engineer" a TILs score that has prognostic value; releasing a large set of cases with visual TILs scores or clinical endpoints would have led to many methods learning end-to-end to predict those endpoints, possibly without considering the TILs, or reproducing visual scores in a non-explainable way. Hence the current design of TIGER.
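To make the prognostic assessment more concrete, below is a minimal Python sketch of a concordance index obtained from a multivariate Cox regression model, using the lifelines package on a toy dataset. The column names, covariates, and values are illustrative assumptions only; this is not the official TIGER evaluation script (which is being prepared in R).

```python
# Minimal sketch: concordance index of a multivariate Cox model on toy data.
import pandas as pd
from lifelines import CoxPHFitter
from lifelines.utils import concordance_index

# Hypothetical per-case data: follow-up time (months), recurrence event flag,
# an algorithmic TILs score, and one example clinical covariate (age).
df = pd.DataFrame({
    "time_months": [12, 30, 45, 8, 60, 24, 36, 50],
    "recurrence":  [1, 0, 0, 1, 0, 1, 0, 1],
    "tils_score":  [0.10, 0.55, 0.30, 0.05, 0.80, 0.60, 0.40, 0.25],
    "age":         [55, 62, 47, 70, 58, 66, 51, 60],
})

# Fit a multivariate Cox proportional hazards model; a small ridge penalty
# keeps the fit stable on this tiny toy dataset.
cph = CoxPHFitter(penalizer=0.1)
cph.fit(df, duration_col="time_months", event_col="recurrence")

# Harrell's concordance index of the fitted model
# (also available directly as cph.concordance_index_).
cindex = concordance_index(
    df["time_months"],
    -cph.predict_partial_hazard(df),  # higher hazard implies shorter expected survival
    df["recurrence"],
)
print(f"Concordance index: {cindex:.3f}")
```

In the disease-free survival setting described in this discussion, the event of interest would be tumor recurrence, as in the toy "recurrence" column above.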
Do you have details/protocol for how you determined truth?
FC: For the "computer vision" endpoints (tissue segmentation and TILs detection), we generated a reference standard via manual annotations by 5 pathologists from the I/O working group, who held a consensus meeting to resolve doubtful annotations; apart from that, they annotated independently after a couple of sessions of instruction and discussion with me. They annotated using the web-based viewer and reader-study tools provided by grand-challenge. For the survival part, the truth is given by follow-up data; in this case we consider disease-free survival, so we assess the prognostic value of the TILs score with respect to tumor recurrence.
If you have additional questions – please feel free to email us at digipathalliance@gmail.com
Links and References
https://tiger.grand-challenge.org/
https://panda.grand-challenge.org/
https://www.tilsinbreastcancer.org/
https://www.tilsinbreastcancer.org/tils-grand-challenge/
https://www.computationalpathologygroup.eu/
R. Salgado et al., "The evaluation of tumor-infiltrating lymphocytes (TILs) in breast cancer: recommendations by an International TILs Working Group 2014", Ann Oncol. 2015;26(2):259-71.
C. Denkert et al., "Tumour-infiltrating lymphocytes and prognosis in different subtypes of breast cancer: a pooled analysis of 3771 patients treated with neoadjuvant therapy", Lancet Oncol. 2018;19:40-50.
M. Amgad et al., "Joint Region and Nucleus Segmentation for Characterization of Tumor Infiltrating Lymphocytes in Breast Cancer", Proc SPIE Int Soc Opt Eng. 2019;10956:109560M.
K. AbdulJabbar et al., "Geospatial immune variability illuminates differential evolution of lung adenocarcinoma", Nat Med. 2020;26:1054-1062.
J. Saltz et al., "Spatial Organization and Molecular Correlation of Tumor-Infiltrating Lymphocytes Using Deep Learning on Pathology Images", Cell Rep. 2018;23(1):181-193.e7.