Tuesday, July 27, 2021

Chop, crop and search with a custom cutoff criterion

We have been investigating the presence of putative image duplicates in the paper Sharma et al., 2020 using AI-based methods that we evaluated on the established examples from the Bik et al., 2016 paper. Despite an exhaustive search of all the supplementary figures, we have not been able to find any exact duplicates in this paper. However, it is possible that parts of the figures were chopped, cropped and pasted in different combinations. This would be similar to cutting out the lanes of a gel image and pasting them together into a new image. As we saw in the previous post, the imagededup package does not perform well when faced with duplication with repositioning (category II).

Today, we try to find a simple solution to this problem by chopping each image into several smaller pieces and searching those pieces for putative duplicates. The Linux utility "convert" (part of ImageMagick) is a very powerful tool with many image manipulation abilities. We use the code snippet below to chop each image into five almost equally sized parts along vertical lines.

 mkdir -p crop1  
 for img in *  
 do  
 [ -f "$img" ] || continue  # skip directories such as crop1  
 echo "$img"  
 convert "$img" -crop 5x1@ +repage +adjoin "$img"_%d.png  
 mv "$img"_*.png crop1  
 done  

Each image now has five parts, with filenames that combine the old image id and the part number. For instance, image98.png is cut into five parts named image98.png_0.png, image98.png_1.png, image98.png_2.png, image98.png_3.png and image98.png_4.png. The for loop in the above code does this chopping for each image and moves the chopped files into the crop1 folder. Once the images have been chopped using the crop option of the convert utility, we can use the imagededup package to look for putative duplicates.
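When reading the imagededup output later on, it helps to recover the original image and the part number from a chopped filename. The following is a minimal sketch that assumes the naming scheme described above; the helper name split_piece_name is hypothetical and not part of any package.

```python
import re

# Chopped files are named "<original file name>_<part index>.png".
PIECE_NAME = re.compile(r"^(?P<source>.+)_(?P<part>\d+)\.png$")


def split_piece_name(piece):
    """Return (original image name, part index) for a chopped filename."""
    match = PIECE_NAME.match(piece)
    if match is None:
        raise ValueError(f"not a chopped-piece filename: {piece!r}")
    return match.group("source"), int(match.group("part"))


print(split_piece_name("image98.png_3.png"))    # ('image98.png', 3)
print(split_piece_name("image215.jpeg_1.png"))  # ('image215.jpeg', 1)
```

The pattern anchors on the last "_<number>.png" suffix, so original names that already contain an extension such as .png, .jpg or .jpeg are handled the same way.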

Approximately 300 images are present in the original dataset obtained from Sharma et al., 2020. After chopping each image into five parts, we have about 1500 images to deal with. Manually parsing the output of imagededup for high similarity scores is laborious and best avoided. The code given below detects putative image duplicates among the files located in the image_dir folder and stores the results in the duplicates_cnn dictionary. The first for loop iterates through the keys of this dictionary, which are the image file names. The second for loop iterates through the list of hits stored for each key. Each hit is a tuple whose first element is the file name of the matching image and whose second element is the similarity score. We keep only the hits with a score greater than the cutoff value defined before the for loop and print the key together with the hit.

 # Find duplicates using CNN along with scores  
 from imagededup.methods import CNN  
 image_dir = "crop1"  # assumption: the folder holding the chopped images  
 cnn_encoder = CNN()  
 duplicates_cnn = cnn_encoder.find_duplicates(image_dir=image_dir, scores=True)  
 # arbitrary cutoff score  
 cutoff = 0.97  
 for key in duplicates_cnn:  
     # each value is a list of (file name, score) tuples  
     for keyvals in duplicates_cnn[key]:  
         if keyvals[1] > cutoff:  
             print(key, keyvals)  

The code above gives us a shorter version of the output, listing only the images that are detected to be putative duplicates with very high scores. Even with a high cutoff score of 0.97, we find more than 200 putative duplicates.

 image100.png_4.png ('image93.png_4.png', 0.9714619)  
 image102.png_0.png ('image103.png_0.png', 0.98250186)  
 image103.png_0.png ('image102.png_0.png', 0.98250186)  
 image11.png_0.png ('image12.png_0.png', 0.9850818)  
 image11.png_0.png ('image14.png_0.png', 0.97351915)  
 image11.png_0.png ('image16.png_0.png', 0.9819039)  
 image11.png_0.png ('image20.png_0.png', 0.9708375)  
 image11.png_0.png ('image7.png_0.png', 0.98577213)  
 image114.png_0.png ('image115.png_0.png', 0.9871844)  
 image114.png_0.png ('image116.png_0.png', 0.977131)  
 image115.png_0.png ('image114.png_0.png', 0.9871844)  
 image115.png_0.png ('image116.png_0.png', 0.982214)  
 image115.png_2.png ('image115.png_3.png', 0.97350514)  
 image115.png_3.png ('image115.png_2.png', 0.97350514)  
 image116.png_0.png ('image114.png_0.png', 0.977131)  
 image116.png_0.png ('image115.png_0.png', 0.982214)  
 image118.png_0.png ('image120.png_0.png', 0.988398)  
 image118.png_0.png ('image121.png_0.png', 0.9891088)  
 image12.png_0.png ('image11.png_0.png', 0.9850818)  
 image12.png_0.png ('image14.png_0.png', 0.97201216)  
 image12.png_0.png ('image16.png_0.png', 0.98974717)  
 image12.png_0.png ('image20.png_0.png', 0.9705702)  
 image12.png_0.png ('image7.png_0.png', 0.9938463)  
 image120.png_0.png ('image118.png_0.png', 0.988398)  
 image120.png_0.png ('image121.png_0.png', 0.9912125)  
 image121.png_0.png ('image118.png_0.png', 0.9891088)  
 image121.png_0.png ('image120.png_0.png', 0.9912125)  
 image122.png_0.png ('image123.png_0.png', 0.98453903)  
 image122.png_0.png ('image124.png_0.png', 0.99413013)  
 image123.png_0.png ('image122.png_0.png', 0.98453903)  
 image123.png_0.png ('image124.png_0.png', 0.98252416)  
 image124.png_0.png ('image122.png_0.png', 0.99413013)  
 image124.png_0.png ('image123.png_0.png', 0.98252416)  
 image124.png_2.png ('image124.png_3.png', 0.9753659)  
 image124.png_3.png ('image124.png_2.png', 0.9753659)  
 image13.png_3.png ('image13.png_4.png', 0.99999994)  
 image13.png_4.png ('image13.png_3.png', 0.99999994)  
 image138.png_0.png ('image139.png_0.png', 0.9742407)  
 image139.png_0.png ('image138.png_0.png', 0.9742407)  
 image139.png_0.png ('image141.png_0.png', 0.9892728)  
 image139.png_0.png ('image174.png_0.png', 0.97267145)  
 image14.png_0.png ('image11.png_0.png', 0.97351915)  
 image14.png_0.png ('image12.png_0.png', 0.97201216)  
 image14.png_0.png ('image16.png_0.png', 0.98304456)  
 image14.png_0.png ('image20.png_0.png', 0.9994699)  
 image14.png_0.png ('image7.png_0.png', 0.9742934)  
 image141.png_0.png ('image139.png_0.png', 0.9892728)  
 image141.png_0.png ('image174.png_0.png', 0.9725107)  
 image141.png_1.png ('image141.png_2.png', 0.97022605)  
 image141.png_2.png ('image141.png_1.png', 0.97022605)  
 image142.png_0.png ('image144.png_0.png', 0.986072)  
 image143.png_0.png ('image145.png_0.png', 0.9743204)  
 image144.png_0.png ('image142.png_0.png', 0.986072)  
 image145.png_0.png ('image143.png_0.png', 0.9743204)  
 image15.png_4.png ('image18.png_4.png', 0.9827007)  
 image151.png_0.png ('image154.png_0.png', 0.97797024)  
 image152.png_0.png ('image153.png_0.png', 0.9902647)  
 image153.png_0.png ('image152.png_0.png', 0.9902647)  
 image153.png_0.png ('image154.png_0.png', 0.97013867)  
 image154.png_0.png ('image151.png_0.png', 0.97797024)  
 image154.png_0.png ('image153.png_0.png', 0.97013867)  
 image156.png_0.png ('image158.png_0.png', 0.97740877)  
 image158.png_0.png ('image156.png_0.png', 0.97740877)  
 image159.png_0.png ('image160.png_0.png', 0.9743622)  
 image159.png_0.png ('image161.png_0.png', 0.9805405)  
 image159.png_1.png ('image159.png_3.png', 0.9810934)  
 image159.png_3.png ('image159.png_1.png', 0.9810934)  
 image16.png_0.png ('image11.png_0.png', 0.9819039)  
 image16.png_0.png ('image12.png_0.png', 0.98974717)  
 image16.png_0.png ('image14.png_0.png', 0.98304456)  
 image16.png_0.png ('image20.png_0.png', 0.98187333)  
 image16.png_0.png ('image7.png_0.png', 0.9959326)  
 image160.png_0.png ('image159.png_0.png', 0.9743622)  
 image161.png_0.png ('image159.png_0.png', 0.9805405)  
 image164.jpg_0.png ('image287.jpg_0.png', 0.9828615)  
 image168.png_0.png ('image169.png_0.png', 0.9929953)  
 image169.png_0.png ('image168.png_0.png', 0.9929953)  
 image174.png_0.png ('image139.png_0.png', 0.97267145)  
 image174.png_0.png ('image141.png_0.png', 0.9725107)  
 image178.png_0.png ('image179.png_0.png', 0.9757037)  
 image179.png_0.png ('image178.png_0.png', 0.9757037)  
 image18.png_4.png ('image15.png_4.png', 0.9827007)  
 image182.png_0.png ('image183.png_0.png', 0.98172903)  
 image183.png_0.png ('image182.png_0.png', 0.98172903)  
 image188.png_0.png ('image189.png_0.png', 0.98023623)  
 image189.png_0.png ('image188.png_0.png', 0.98023623)  
 image192.png_0.png ('image194.png_0.png', 0.9978157)  
 image194.png_0.png ('image192.png_0.png', 0.9978157)  
 image20.png_0.png ('image11.png_0.png', 0.9708375)  
 image20.png_0.png ('image12.png_0.png', 0.9705702)  
 image20.png_0.png ('image14.png_0.png', 0.9994699)  
 image20.png_0.png ('image16.png_0.png', 0.98187333)  
 image20.png_0.png ('image7.png_0.png', 0.9730121)  
 image202.png_0.png ('image203.png_0.png', 0.9940014)  
 image203.png_0.png ('image202.png_0.png', 0.9940014)  
 image211.png_0.png ('image212.png_0.png', 0.98358434)  
 image212.png_0.png ('image211.png_0.png', 0.98358434)  
 image212.png_0.png ('image214.png_0.png', 0.9713317)  
 image214.png_0.png ('image212.png_0.png', 0.9713317)  
 image215.jpeg_0.png ('image215.jpeg_1.png', 0.9919382)  
 image215.jpeg_1.png ('image215.jpeg_0.png', 0.9919382)  
 image219.png_2.png ('image219.png_3.png', 0.97034454)  
 image219.png_3.png ('image219.png_2.png', 0.97034454)  
 image220.png_0.png ('image221.png_0.png', 0.9760274)  
 image221.png_0.png ('image220.png_0.png', 0.9760274)  
 image221.png_2.png ('image221.png_3.png', 0.97269195)  
 image221.png_3.png ('image221.png_2.png', 0.97269195)  
 image23.png_0.png ('image25.png_0.png', 0.97705656)  
 image23.png_2.png ('image26.png_2.png', 0.97086954)  
 image234.png_0.png ('image235.png_0.png', 0.99267024)  
 image234.png_1.png ('image234.png_2.png', 0.97507715)  
 image234.png_2.png ('image234.png_1.png', 0.97507715)  
 image235.png_0.png ('image234.png_0.png', 0.99267024)  
 image237.png_0.png ('image238.png_0.png', 0.9760754)  
 image238.png_0.png ('image237.png_0.png', 0.9760754)  
 image238.png_0.png ('image239.png_0.png', 0.9754666)  
 image239.png_0.png ('image238.png_0.png', 0.9754666)  
 image242.png_0.png ('image243.png_0.png', 0.9714698)  
 image243.png_0.png ('image242.png_0.png', 0.9714698)  
 image245.png_0.png ('image246.png_0.png', 0.97148263)  
 image245.png_0.png ('image247.png_0.png', 0.97165793)  
 image246.png_0.png ('image245.png_0.png', 0.97148263)  
 image246.png_0.png ('image247.png_0.png', 0.9809675)  
 image247.png_0.png ('image245.png_0.png', 0.97165793)  
 image247.png_0.png ('image246.png_0.png', 0.9809675)  
 image25.png_0.png ('image23.png_0.png', 0.97705656)  
 image257.png_0.png ('image258.png_0.png', 0.98425305)  
 image257.png_0.png ('image259.png_0.png', 0.9732449)  
 image258.png_0.png ('image257.png_0.png', 0.98425305)  
 image258.png_0.png ('image259.png_0.png', 0.98271394)  
 image259.png_0.png ('image257.png_0.png', 0.9732449)  
 image259.png_0.png ('image258.png_0.png', 0.98271394)  
 image26.png_2.png ('image23.png_2.png', 0.97086954)  
 image262.png_0.png ('image263.png_0.png', 0.9708605)  
 image263.png_0.png ('image262.png_0.png', 0.9708605)  
 image265.png_0.png ('image266.png_0.png', 0.985058)  
 image265.png_0.png ('image267.png_0.png', 0.97712684)  
 image266.png_0.png ('image265.png_0.png', 0.985058)  
 image266.png_0.png ('image267.png_0.png', 0.9758965)  
 image267.png_0.png ('image265.png_0.png', 0.97712684)  
 image267.png_0.png ('image266.png_0.png', 0.9758965)  
 image270.png_1.png ('image270.png_2.png', 0.97145426)  
 image270.png_2.png ('image270.png_1.png', 0.97145426)  
 image271.png_1.png ('image271.png_2.png', 0.97522026)  
 image271.png_2.png ('image271.png_1.png', 0.97522026)  
 image273.png_0.png ('image274.png_0.png', 0.9760238)  
 image274.png_0.png ('image273.png_0.png', 0.9760238)  
 image274.png_0.png ('image275.png_0.png', 0.97969353)  
 image274.png_1.png ('image274.png_3.png', 0.9715067)  
 image274.png_3.png ('image274.png_1.png', 0.9715067)  
 image275.png_0.png ('image274.png_0.png', 0.97969353)  
 image277.png_0.png ('image278.png_0.png', 0.9858266)  
 image277.png_0.png ('image279.png_0.png', 0.98131776)  
 image278.png_0.png ('image277.png_0.png', 0.9858266)  
 image278.png_0.png ('image279.png_0.png', 0.9779868)  
 image279.png_0.png ('image277.png_0.png', 0.98131776)  
 image279.png_0.png ('image278.png_0.png', 0.9779868)  
 image287.jpg_0.png ('image164.jpg_0.png', 0.9828615)  
 image63.png_0.png ('image71.png_0.png', 0.99036366)  
 image7.png_0.png ('image11.png_0.png', 0.98577213)  
 image7.png_0.png ('image12.png_0.png', 0.9938463)  
 image7.png_0.png ('image14.png_0.png', 0.9742934)  
 image7.png_0.png ('image16.png_0.png', 0.9959326)  
 image7.png_0.png ('image20.png_0.png', 0.9730121)  
 image71.png_0.png ('image63.png_0.png', 0.99036366)  
 image83.png_0.png ('image86.png_0.png', 0.9774096)  
 image83.png_0.png ('image92.png_0.png', 0.9701129)  
 image83.png_0.png ('image98.png_0.png', 0.9754302)  
 image83.png_2.png ('image86.png_2.png', 0.97005486)  
 image83.png_3.png ('image86.png_3.png', 0.9776021)  
 image83.png_4.png ('image86.png_4.png', 0.97311985)  
 image85.png_0.png ('image87.png_0.png', 0.9724677)  
 image85.png_0.png ('image93.png_0.png', 0.9737506)  
 image85.png_0.png ('image96.png_0.png', 0.9903543)  
 image85.png_0.png ('image99.png_0.png', 0.98590815)  
 image85.png_2.png ('image96.png_2.png', 0.98394686)  
 image85.png_3.png ('image96.png_3.png', 0.9750851)  
 image86.png_0.png ('image83.png_0.png', 0.9774096)  
 image86.png_0.png ('image92.png_0.png', 0.9725857)  
 image86.png_0.png ('image98.png_0.png', 0.97047156)  
 image86.png_2.png ('image83.png_2.png', 0.97005486)  
 image86.png_3.png ('image83.png_3.png', 0.9776021)  
 image86.png_4.png ('image83.png_4.png', 0.97311985)  
 image87.png_0.png ('image85.png_0.png', 0.9724677)  
 image87.png_0.png ('image89.png_0.png', 0.97378576)  
 image87.png_0.png ('image99.png_0.png', 0.9758906)  
 image89.png_0.png ('image87.png_0.png', 0.97378576)  
 image90.png_3.png ('image98.png_3.png', 0.9723841)  
 image92.png_0.png ('image83.png_0.png', 0.9701129)  
 image92.png_0.png ('image86.png_0.png', 0.9725857)  
 image93.png_0.png ('image85.png_0.png', 0.9737506)  
 image93.png_0.png ('image96.png_0.png', 0.9713357)  
 image93.png_4.png ('image100.png_4.png', 0.9714619)  
 image96.png_0.png ('image85.png_0.png', 0.9903543)  
 image96.png_0.png ('image93.png_0.png', 0.9713357)  
 image96.png_0.png ('image99.png_0.png', 0.9820196)  
 image96.png_2.png ('image85.png_2.png', 0.98394686)  
 image96.png_3.png ('image85.png_3.png', 0.9750851)  
 image98.png_0.png ('image83.png_0.png', 0.9754302)  
 image98.png_0.png ('image86.png_0.png', 0.97047156)  
 image98.png_3.png ('image90.png_3.png', 0.9723841)  
 image99.png_0.png ('image85.png_0.png', 0.98590815)  
 image99.png_0.png ('image87.png_0.png', 0.9758906)  
 image99.png_0.png ('image96.png_0.png', 0.9820196)  

Using a more stringent criterion of 0.99 results in a more manageable list:

 image12.png_0.png ('image7.png_0.png', 0.9938463)  
 image120.png_0.png ('image121.png_0.png', 0.9912125)  
 image121.png_0.png ('image120.png_0.png', 0.9912125)  
 image122.png_0.png ('image124.png_0.png', 0.99413013)  
 image124.png_0.png ('image122.png_0.png', 0.99413013)  
 image13.png_3.png ('image13.png_4.png', 0.99999994)  
 image13.png_4.png ('image13.png_3.png', 0.99999994)  
 image14.png_0.png ('image20.png_0.png', 0.9994699)  
 image152.png_0.png ('image153.png_0.png', 0.9902647)  
 image153.png_0.png ('image152.png_0.png', 0.9902647)  
 image16.png_0.png ('image7.png_0.png', 0.9959326)  
 image168.png_0.png ('image169.png_0.png', 0.9929953)  
 image169.png_0.png ('image168.png_0.png', 0.9929953)  
 image192.png_0.png ('image194.png_0.png', 0.9978157)  
 image194.png_0.png ('image192.png_0.png', 0.9978157)  
 image20.png_0.png ('image14.png_0.png', 0.9994699)  
 image202.png_0.png ('image203.png_0.png', 0.9940014)  
 image203.png_0.png ('image202.png_0.png', 0.9940014)  
 image215.jpeg_0.png ('image215.jpeg_1.png', 0.9919382)  
 image215.jpeg_1.png ('image215.jpeg_0.png', 0.9919382)  
 image234.png_0.png ('image235.png_0.png', 0.99267024)  
 image235.png_0.png ('image234.png_0.png', 0.99267024)  
 image63.png_0.png ('image71.png_0.png', 0.99036366)  
 image7.png_0.png ('image12.png_0.png', 0.9938463)  
 image7.png_0.png ('image16.png_0.png', 0.9959326)  
 image71.png_0.png ('image63.png_0.png', 0.99036366)  
 image85.png_0.png ('image96.png_0.png', 0.9903543)  
 image96.png_0.png ('image85.png_0.png', 0.9903543)  

Even this shorter list has 28 hits. Closer manual inspection of these hits highlights the difficulties involved in this chop and search strategy. The extremely promising hit between parts 3 and 4 of image 13 is actually due to these two pieces being just the canvas. Similarly, the hit seen between part 0 of image 14 and part 0 of image 20 is due to the presence of the same human karyotype in both these images. However, the original images are distinct enough, as they are screenshots showing the synteny relationship of the same human chromosome with very different species.
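Every hit in the listing is reported twice, once in each direction, so the 28 lines correspond to 14 distinct pairs. The sketch below shows one way to collapse the reciprocal hits and to flag pairs whose two pieces come from the same original image, as in the image 13 case. The helper names unique_pairs and same_source are hypothetical, and the sample dictionary is a small excerpt copied from the 0.99 listing rather than the full imagededup output.

```python
def unique_pairs(duplicates, cutoff):
    """Collapse reciprocal hits (A, B) and (B, A) into one entry per pair."""
    pairs = {}
    for key, hits in duplicates.items():
        for name, score in hits:
            if score > cutoff:
                # Sorting the two names gives the same dictionary key
                # regardless of the direction in which the hit was reported.
                pairs[tuple(sorted((key, name)))] = score
    return pairs


def same_source(piece_a, piece_b):
    """True when both pieces were cut from the same original image."""
    return piece_a.rsplit("_", 1)[0] == piece_b.rsplit("_", 1)[0]


# Small excerpt in the same shape as the duplicates_cnn dictionary.
sample = {
    "image13.png_3.png": [("image13.png_4.png", 0.99999994)],
    "image13.png_4.png": [("image13.png_3.png", 0.99999994)],
    "image14.png_0.png": [("image20.png_0.png", 0.9994699)],
    "image20.png_0.png": [("image14.png_0.png", 0.9994699)],
}

for (a, b), score in sorted(unique_pairs(sample, cutoff=0.99).items()):
    label = "same image" if same_source(a, b) else "different images"
    print(a, b, score, label)
```

Passing the full duplicates_cnn dictionary in place of sample would produce one line per distinct pair. Hits within a single image still need manual inspection, since they can be either blank canvas or genuinely repeated content.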

Self-criticism seems to be trending this month. Many thanks to the brave/crazy Nicholas P. Holmes for sharing his thoughts on this. Even seemingly disastrous events such as retracting a paper have achieved glory. The cost of doing science continues to increase. Whether we will see a long-term change, and how the reward system will evaluate honest science vs exceptional science, is unclear.


