Saturday, July 24, 2021

Beautiful, More Beautiful and Most Beautiful. The maxim of Nicolaus Steno

 If you don't know who Nicolaus Steno is, you can watch the video below. 


Now that you know who Nicolaus Steno is, you will appreciate his maxim about the importance of understanding and comprehending. In the previous post, we saw that many of the supplementary figures in the paper by Sharma et al. 2020 are detected as putative duplicates of each other. Although the imagededup package we used has been benchmarked by its authors, its usefulness for detecting fakery still has to be verified on a set of images that are known to be duplicated. Thankfully, the Bik et al., 2016 paper does a very thorough job of grouping the types of fakery into the following classes:

  1. Category I: simple duplications (https://journals.asm.org/doi/10.1128/mBio.00809-16#fig2)
  2. Category II: duplication with repositioning (https://journals.asm.org/doi/10.1128/mBio.00809-16#fig3)
  3. Category III: duplication with alteration (https://journals.asm.org/doi/10.1128/mBio.00809-16#fig4)

It is possible to test the imagededup package on these verified cases of fakery. To do that, each of the image parts identified by Bik et al., 2016 has to be cut out to create a dataset on which the code can be run. This dataset is uploaded on GitHub (https://github.com/Corvus7/Fakery.git) and may serve as a training dataset for future efforts at developing AI-based solutions. The full code used and the results are provided below:
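 # Shell: clone the dataset into the home directory
 cd
 git clone https://github.com/Corvus7/Fakery.git
 # Python: Category I, simple duplications (Fig. 2)
 import os
 from imagededup.methods import CNN
 # Python does not expand '~' on its own, so expand it explicitly
 image_dir = os.path.expanduser('~/Fakery/bik2016/Fig2/')
 cnn_encoder = CNN()
 # scores=True returns the similarity score next to each matched file name
 duplicates_cnn = cnn_encoder.find_duplicates(image_dir=image_dir, scores=True)
 duplicates_cnn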
 {'Blue_1.png': [('Blue_2.png', 0.9211361)], 'Blue_2.png': [('Blue_1.png', 0.9211361)], 'Figure_2_cut_out.jpeg': [], 'Green_1.png': [('Green_2.png', 0.95247614)], 'Green_2.png': [('Green_1.png', 0.95247614)], 'Red_1.png': [('Red_2.png', 0.90385926)], 'Red_2.png': [('Red_1.png', 0.90385926)]}  
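 # Python: Category II, duplication with repositioning (Fig. 3)
 import os
 from imagededup.methods import CNN
 # Python does not expand '~' on its own, so expand it explicitly
 image_dir = os.path.expanduser('~/Fakery/bik2016/Fig3/')
 cnn_encoder = CNN()
 duplicates_cnn = cnn_encoder.find_duplicates(image_dir=image_dir, scores=True)
 duplicates_cnn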
 {'Blue_1.png': [], 'Blue_2.png': [], 'Figure_3_cut_out.jpeg': [], 'Green_1.png': [], 'Green_2.png': [], 'Red_1.png': [], 'Red_2.png': []}  
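 # Python: Category III, duplication with alteration (Fig. 4)
 import os
 from imagededup.methods import CNN
 # Python does not expand '~' on its own, so expand it explicitly
 image_dir = os.path.expanduser('~/Fakery/bik2016/Fig4/')
 cnn_encoder = CNN()
 duplicates_cnn = cnn_encoder.find_duplicates(image_dir=image_dir, scores=True)
 duplicates_cnn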
 {'Blue_first.png': [('Blue_second.png', 0.97069657)], 'Blue_second.png': [('Blue_first.png', 0.97069657)], 'Full_screenshot.png': [], 'Green_first.png': [('Green_second.png', 0.98090243)], 'Green_second.png': [('Green_first.png', 0.98090243)], 'Lane_10.png': [('Lane_9.png', 0.90004313)], 'Lane_9.png': [('Lane_10.png', 0.90004313)], 'Orange_first.png': [('Orange_second.png', 0.9222448)], 'Orange_second.png': [('Orange_first.png', 0.9222448)], 'Pink_first.png': [('Pink_second.png', 0.9387925)], 'Pink_second.png': [('Pink_first.png', 0.9387925)], 'Purple_first.png': [('Purple_second.png', 0.95602536)], 'Purple_second.png': [('Purple_first.png', 0.95602536)], 'Red_first.png': [('Red_second.png', 0.9667898)], 'Red_second.png': [('Red_first.png', 0.9667898)], 'orange_first.png': [('orange_second.png', 0.91980916)], 'orange_second.png': [('orange_first.png', 0.91980916)]}  
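The dictionaries above list every pair twice (once per file), which makes them tedious to read. Here is a minimal sketch of how they could be collapsed into unique pairs and checked against the expected duplicates. The helper names (unique_pairs, expected_pairs, recall) are made up for this post, and the ground truth rests on an assumption about the dataset: files that share the text before the last underscore (for example Blue_1.png and Blue_2.png) are taken to be a known duplicate pair.

```python
from collections import defaultdict
from itertools import combinations


def unique_pairs(duplicates):
    """Collapse the symmetric imagededup output into (file_a, file_b, score) tuples."""
    seen = {}
    for name, matches in duplicates.items():
        for match, score in matches:
            # Sorting the two names makes (a, b) and (b, a) the same key.
            seen[tuple(sorted((name, match)))] = float(score)
    # Highest-scoring pairs first.
    return sorted(((a, b, s) for (a, b), s in seen.items()),
                  key=lambda item: item[2], reverse=True)


def expected_pairs(filenames):
    """Ground-truth pairs, ASSUMING a shared prefix before the last '_' marks duplicates."""
    groups = defaultdict(list)
    for name in filenames:
        stem = name.rsplit(".", 1)[0]      # drop the extension
        prefix = stem.rsplit("_", 1)[0]    # drop the trailing '_1', '_first', ...
        groups[prefix].append(name)
    return {frozenset(pair)
            for members in groups.values()
            for pair in combinations(members, 2)}


def recall(duplicates):
    """Fraction of the expected pairs that were reported as duplicates."""
    truth = expected_pairs(duplicates)
    if not truth:
        return None
    found = {frozenset((a, b)) for a, b, _ in unique_pairs(duplicates)}
    return len(truth & found) / len(truth)
```

Calling unique_pairs(duplicates_cnn) and recall(duplicates_cnn) after each of the three runs gives one short list and one number per category instead of a long dictionary.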
The results are a bit surprising. The CNN method does not pick up any of the Category II duplicates, whereas in Category I and Category III all the correct figures are tagged as duplicates. However, the scores of these true duplicates range from 0.90 all the way up to 0.98, and 0.9 is also the default min_similarity_threshold of the CNN method, below which pairs are not reported at all. This suggests that the scores by themselves are not reliable indicators and that manual inspection is definitely required. AI-based methods like imagededup need a lot more sophistication before they can be widely used in anti-fakery approaches.

As far as the paper by Sharma et al. 2020 is concerned, the raw data in the form of SAM files are provided in the supplementary materials. Running the samtools split command will separate out each of the sub-components by read group. 
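Coming back to the duplicate detection: the runs above use the default min_similarity_threshold of 0.9 in find_duplicates, so an empty result for Category II only tells us that no pair reached 0.9, not how similar the repositioned panels actually look to the CNN. One way to find out, sketched here and not run for this post, is to repeat the scan with a lower threshold and list the pairs that the default run would have hidden. The function names scan and hidden_by_default are made up for illustration, and the value 0.5 in the usage comment is an arbitrary choice.

```python
import os


def scan(image_dir, threshold):
    """Run the imagededup CNN method with an explicit similarity threshold."""
    # Imported here so the helper below can be used without imagededup installed.
    from imagededup.methods import CNN

    cnn_encoder = CNN()
    return cnn_encoder.find_duplicates(
        image_dir=os.path.expanduser(image_dir),
        min_similarity_threshold=threshold,
        scores=True,
    )


def hidden_by_default(duplicates, default_threshold=0.9):
    """Unique pairs whose score is below the default threshold of 0.9."""
    pairs = {}
    for name, matches in duplicates.items():
        for match, score in matches:
            if score < default_threshold:
                # Sorting the names keeps each pair only once.
                pairs[tuple(sorted((name, match)))] = float(score)
    return pairs


# Hypothetical usage (threshold 0.5 is an arbitrary illustrative value):
# low = scan('~/Fakery/bik2016/Fig3/', threshold=0.5)
# print(hidden_by_default(low))
```

A lower threshold will also let through more unrelated pairs, so the output still needs manual inspection.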

