Monday, November 8, 2021

Scattering of genes from evolutionary breakpoint region due to chromosomal rearrangements

Lemaitre et al. (2009) define Evolutionary Breakpoint Regions (EBRs) as genomic regions that have undergone at least one structural change resulting in an altered karyotype between lineages. Characteristic features of EBRs have been analyzed using large-scale datasets to identify the prevalence of repeat regions, GC features, and epigenetic attributes associated with them. Most changes in gene order arise from the chromosomal rearrangements that occur at EBR loci. In addition to these changes in gene order, it has been proposed that gene loss can occur at these loci. For instance, approximately 2000 genes are thought to have been lost in the ancestor of all birds. The vast majority of these gene loss events are proposed to coincide with EBRs. However, the difficulty of sequencing and assembling EBRs has made it challenging to verify these claims.

Similar to birds, the evolution of EBRs in rodent genomes has proved difficult to analyze and interpret. A recent study by Shinde et al. explores the EBR corresponding to the human chr7p13 in rodents and marsupials by a careful comparison of gene orders in several closely related species. Interestingly, they find that the same region has undergone rearrangement in both rodents and marsupials. However, the result of the rearrangement is slightly different in the two groups. While in rodents, both the STK17A and COA1/MITRAC15 genes are likely lost, the STK17A gene is retained after the rearrangement in marsupial species. 

The study is not entirely focused on the EBR though. Shinde et al. investigate the evolutionary history of the COA1 gene in various vertebrate species, more than 300 by their count. Recurrent loss of this gene is noted in galliform birds, several rodent species, and the cheetah. Functional studies have implicated a role for COA1/MITRAC15 in promoting mitochondrial translation and complex I and IV biogenesis (Wang et al., 2020). Although the COA1/MITRAC15 gene is widely conserved among vertebrate species, knockout studies show only a mild effect on function, which can easily be compensated by overexpression of other genes (Pierrel et al., 2007; Hess et al., 2009). However, the prevalence of positive selection in primates suggests that COA1/MITRAC15 can contribute to adaptation in the OXPHOS pathway (Van Der Lee et al., 2017). The loss of this gene following relaxed selection in the cheetah, galliform birds, and several rodent species provides an example of gene dispensability in the OXPHOS pathway.

Salient features:

1. Verification of the base-pair level changes leading to gene loss utilizes genome sequencing reads and transcriptomes.

2. Several precautions based on the 5-step procedure proposed recently by Sharma et al. (2020) ensure the validity of each inferred gene loss.

3. The timing of gene loss is estimated using the widely used method proposed by Meredith et al. (2009). Signatures of relaxed selection are characterized using the latest methods implemented in the HyPhy package and the models available in codeml.

4. The role of evolutionary breakpoint regions (EBRs) in gene loss is explored by investigating gene loss across multiple rodent species. Genomic regions that have translocated to different chromosomes after the rearrangement are studied in detail by comparing several species sampled before and after the chromosomal rearrangement (pre-CR and post-CR).

Novelty:

  1. This study is probably the first report documenting the loss of a known oncogene in birds and might help explain the lower prevalence of cancer in birds than in mammals.
  2. Shinde et al. identify the origin of novel isoforms of COA1/MITRAC15 in carnivore species through alternative splicing.

After two rounds of review, this manuscript has finally been published in the journal Scientific Reports with the title "Recurrent erosion of COA1/MITRAC15 exemplifies conditional gene dispensability in oxidative phosphorylation". For any press releases or promotions, please note that the correct name of the journal is “Scientific Reports”, not “Nature Scientific Reports”.

References

 Hess, D. C. et al. (2009) ‘Computationally Driven, Quantitative Experiments Discover Genes Required for Mitochondrial Biogenesis’, PLoS Genetics. Edited by S. K. Kim, 5(3), p. e1000407. doi: 10.1371/journal.pgen.1000407.

Van Der Lee, R. et al. (2017) ‘Genome-scale detection of positive selection in nine primates predicts human-virus evolutionary conflicts’, Nucleic Acids Research, 45(18), pp. 10634–10648. doi: 10.1093/nar/gkx704.

Meredith, R. W. et al. (2009) ‘Molecular decay of the tooth gene enamelin (ENAM) mirrors the loss of enamel in the fossil record of placental mammals’, PLoS Genetics, 5(9). doi: 10.1371/journal.pgen.1000634.

Pierrel, F. et al. (2007) ‘Coa1 links the Mss51 post-translational function to Cox1 cofactor insertion in cytochrome c oxidase assembly’, EMBO Journal, 26(20), pp. 4335–4346. doi: 10.1038/sj.emboj.7601861.

Sharma, S. et al. (2020) ‘Evidence for the loss of plasminogen receptor KT gene in chicken’, Immunogenetics, 72(9–10), pp. 507–515. doi: 10.1007/s00251-020-01186-2.

Wang, C. et al. (2020) ‘MITRAC15/COA1 promotes mitochondrial translation in a ND2 ribosome–nascent chain complex’, EMBO reports, 21(1). doi: 10.15252/embr.201948833.

 

Tuesday, July 27, 2021

Chop, crop and search with a custom cutoff criteria

We have been investigating the presence of putative image duplicates in the paper by Sharma et al., 2020 using AI-based methods that we evaluated on the established examples from the Bik et al., 2016 paper. Despite an exhaustive search of all the supplementary figures, we have not been able to find any exact duplicates in this paper. However, it is possible that parts of the figures were chopped, cropped and pasted in different combinations. This would be similar to cutting out the lanes of a gel image and pasting them together into a new image. As we saw in the previous post, the imagededup package does not perform well when faced with duplication with repositioning (category II).

Today, we try to find a simple solution to this problem by chopping up each image into many small pieces and searching them for the presence of putative duplicates. The "convert" utility from ImageMagick, available on Linux, is a very powerful tool with many image manipulation abilities. We use the code snippet below to chop each image into five almost equally sized parts along vertical lines.

 mkdir -p crop1  
 for img in *  
 do  
 [ -f "$img" ] || continue  # skip the crop1 folder itself  
 echo "$img"  
 convert "$img" -crop 5x1@ +repage +adjoin "${img}_%d.png"  
 mv "${img}"_*.png crop1  
 done  

Each image now has 5 parts with filenames that mention the old image id and the part number. For instance, image98.png is cut into five parts named image98.png_0.png, image98.png_1.png, image98.png_2.png, image98.png_3.png and image98.png_4.png. The for loop in the above code does this chopping for each image and moves the chopped files into the crop1 folder. After the images have been chopped using the crop option of the convert utility, we can use the imagededup package to look for putative duplicates.
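The geometry of this chopping step can also be sketched in Python. The snippet below is only an illustration: the function names strip_boxes and part_name are made up for this post, and the integer rounding is my own choice, which may differ by a pixel from what convert does with 5x1@. It computes the five crop boxes and reproduces the file naming used above.

```python
def strip_boxes(width, height, parts=5):
    """Return (left, upper, right, lower) boxes for `parts` vertical strips.

    Strip widths differ by at most one pixel. This mimics the intent of
    `-crop 5x1@`; it is not ImageMagick's exact rounding rule.
    """
    if parts < 1 or width < parts:
        raise ValueError("need at least one pixel column per part")
    # Integer arithmetic keeps the edges exact and makes the strips tile the image.
    edges = [i * width // parts for i in range(parts + 1)]
    return [(edges[i], 0, edges[i + 1], height) for i in range(parts)]


def part_name(filename, index):
    """Name a strip the way the shell loop does, e.g. image98.png_0.png."""
    return f"{filename}_{index}.png"


# Hypothetical 1003 x 400 pixel image, used only to show the output shape.
for index, box in enumerate(strip_boxes(1003, 400)):
    print(part_name("image98.png", index), box)
```

Each box has the (left, upper, right, lower) layout that the crop method of a Pillow image accepts, so the boxes could be used to do the cutting in Python instead of the shell if Pillow is installed.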

Approximately 300 images are present in the original dataset obtained from Sharma et al., 2020. After chopping each image into five parts, we have about 1500 images to deal with. Manually parsing the output of imagededup for high similarity scores is laborious and best avoided. The code given below detects putative image duplicates among the files located in the image_dir folder and stores the results in the duplicates_cnn dictionary. The first for loop iterates through the keys of this dictionary. The second for loop iterates through the values stored for each of these keys. Each value is a tuple whose first element is the image file name and whose second element is the score. We look for scores greater than the cutoff value defined before the for loop and print the key together with the matching value.

 #Find duplicates using CNN along with scores   
 from imagededup.methods import CNN   
 cnn_encoder = CNN()   
 duplicates_cnn = cnn_encoder.find_duplicates(image_dir=image_dir, scores=True)   
 #arbitrary cutoff score  
 cutoff=0.97  
 for key in duplicates_cnn:  
      for keyvals in duplicates_cnn[key]:  
           if keyvals[1] > cutoff:  
                print(key,keyvals)  

The code above gives us a shorter version of the output, listing only the images that are detected to be putative duplicates with very high scores. Even with a high cutoff score of 0.97, we find more than 200 putative duplicates.

 image100.png_4.png ('image93.png_4.png', 0.9714619)  
 image102.png_0.png ('image103.png_0.png', 0.98250186)  
 image103.png_0.png ('image102.png_0.png', 0.98250186)  
 image11.png_0.png ('image12.png_0.png', 0.9850818)  
 image11.png_0.png ('image14.png_0.png', 0.97351915)  
 image11.png_0.png ('image16.png_0.png', 0.9819039)  
 image11.png_0.png ('image20.png_0.png', 0.9708375)  
 image11.png_0.png ('image7.png_0.png', 0.98577213)  
 image114.png_0.png ('image115.png_0.png', 0.9871844)  
 image114.png_0.png ('image116.png_0.png', 0.977131)  
 image115.png_0.png ('image114.png_0.png', 0.9871844)  
 image115.png_0.png ('image116.png_0.png', 0.982214)  
 image115.png_2.png ('image115.png_3.png', 0.97350514)  
 image115.png_3.png ('image115.png_2.png', 0.97350514)  
 image116.png_0.png ('image114.png_0.png', 0.977131)  
 image116.png_0.png ('image115.png_0.png', 0.982214)  
 image118.png_0.png ('image120.png_0.png', 0.988398)  
 image118.png_0.png ('image121.png_0.png', 0.9891088)  
 image12.png_0.png ('image11.png_0.png', 0.9850818)  
 image12.png_0.png ('image14.png_0.png', 0.97201216)  
 image12.png_0.png ('image16.png_0.png', 0.98974717)  
 image12.png_0.png ('image20.png_0.png', 0.9705702)  
 image12.png_0.png ('image7.png_0.png', 0.9938463)  
 image120.png_0.png ('image118.png_0.png', 0.988398)  
 image120.png_0.png ('image121.png_0.png', 0.9912125)  
 image121.png_0.png ('image118.png_0.png', 0.9891088)  
 image121.png_0.png ('image120.png_0.png', 0.9912125)  
 image122.png_0.png ('image123.png_0.png', 0.98453903)  
 image122.png_0.png ('image124.png_0.png', 0.99413013)  
 image123.png_0.png ('image122.png_0.png', 0.98453903)  
 image123.png_0.png ('image124.png_0.png', 0.98252416)  
 image124.png_0.png ('image122.png_0.png', 0.99413013)  
 image124.png_0.png ('image123.png_0.png', 0.98252416)  
 image124.png_2.png ('image124.png_3.png', 0.9753659)  
 image124.png_3.png ('image124.png_2.png', 0.9753659)  
 image13.png_3.png ('image13.png_4.png', 0.99999994)  
 image13.png_4.png ('image13.png_3.png', 0.99999994)  
 image138.png_0.png ('image139.png_0.png', 0.9742407)  
 image139.png_0.png ('image138.png_0.png', 0.9742407)  
 image139.png_0.png ('image141.png_0.png', 0.9892728)  
 image139.png_0.png ('image174.png_0.png', 0.97267145)  
 image14.png_0.png ('image11.png_0.png', 0.97351915)  
 image14.png_0.png ('image12.png_0.png', 0.97201216)  
 image14.png_0.png ('image16.png_0.png', 0.98304456)  
 image14.png_0.png ('image20.png_0.png', 0.9994699)  
 image14.png_0.png ('image7.png_0.png', 0.9742934)  
 image141.png_0.png ('image139.png_0.png', 0.9892728)  
 image141.png_0.png ('image174.png_0.png', 0.9725107)  
 image141.png_1.png ('image141.png_2.png', 0.97022605)  
 image141.png_2.png ('image141.png_1.png', 0.97022605)  
 image142.png_0.png ('image144.png_0.png', 0.986072)  
 image143.png_0.png ('image145.png_0.png', 0.9743204)  
 image144.png_0.png ('image142.png_0.png', 0.986072)  
 image145.png_0.png ('image143.png_0.png', 0.9743204)  
 image15.png_4.png ('image18.png_4.png', 0.9827007)  
 image151.png_0.png ('image154.png_0.png', 0.97797024)  
 image152.png_0.png ('image153.png_0.png', 0.9902647)  
 image153.png_0.png ('image152.png_0.png', 0.9902647)  
 image153.png_0.png ('image154.png_0.png', 0.97013867)  
 image154.png_0.png ('image151.png_0.png', 0.97797024)  
 image154.png_0.png ('image153.png_0.png', 0.97013867)  
 image156.png_0.png ('image158.png_0.png', 0.97740877)  
 image158.png_0.png ('image156.png_0.png', 0.97740877)  
 image159.png_0.png ('image160.png_0.png', 0.9743622)  
 image159.png_0.png ('image161.png_0.png', 0.9805405)  
 image159.png_1.png ('image159.png_3.png', 0.9810934)  
 image159.png_3.png ('image159.png_1.png', 0.9810934)  
 image16.png_0.png ('image11.png_0.png', 0.9819039)  
 image16.png_0.png ('image12.png_0.png', 0.98974717)  
 image16.png_0.png ('image14.png_0.png', 0.98304456)  
 image16.png_0.png ('image20.png_0.png', 0.98187333)  
 image16.png_0.png ('image7.png_0.png', 0.9959326)  
 image160.png_0.png ('image159.png_0.png', 0.9743622)  
 image161.png_0.png ('image159.png_0.png', 0.9805405)  
 image164.jpg_0.png ('image287.jpg_0.png', 0.9828615)  
 image168.png_0.png ('image169.png_0.png', 0.9929953)  
 image169.png_0.png ('image168.png_0.png', 0.9929953)  
 image174.png_0.png ('image139.png_0.png', 0.97267145)  
 image174.png_0.png ('image141.png_0.png', 0.9725107)  
 image178.png_0.png ('image179.png_0.png', 0.9757037)  
 image179.png_0.png ('image178.png_0.png', 0.9757037)  
 image18.png_4.png ('image15.png_4.png', 0.9827007)  
 image182.png_0.png ('image183.png_0.png', 0.98172903)  
 image183.png_0.png ('image182.png_0.png', 0.98172903)  
 image188.png_0.png ('image189.png_0.png', 0.98023623)  
 image189.png_0.png ('image188.png_0.png', 0.98023623)  
 image192.png_0.png ('image194.png_0.png', 0.9978157)  
 image194.png_0.png ('image192.png_0.png', 0.9978157)  
 image20.png_0.png ('image11.png_0.png', 0.9708375)  
 image20.png_0.png ('image12.png_0.png', 0.9705702)  
 image20.png_0.png ('image14.png_0.png', 0.9994699)  
 image20.png_0.png ('image16.png_0.png', 0.98187333)  
 image20.png_0.png ('image7.png_0.png', 0.9730121)  
 image202.png_0.png ('image203.png_0.png', 0.9940014)  
 image203.png_0.png ('image202.png_0.png', 0.9940014)  
 image211.png_0.png ('image212.png_0.png', 0.98358434)  
 image212.png_0.png ('image211.png_0.png', 0.98358434)  
 image212.png_0.png ('image214.png_0.png', 0.9713317)  
 image214.png_0.png ('image212.png_0.png', 0.9713317)  
 image215.jpeg_0.png ('image215.jpeg_1.png', 0.9919382)  
 image215.jpeg_1.png ('image215.jpeg_0.png', 0.9919382)  
 image219.png_2.png ('image219.png_3.png', 0.97034454)  
 image219.png_3.png ('image219.png_2.png', 0.97034454)  
 image220.png_0.png ('image221.png_0.png', 0.9760274)  
 image221.png_0.png ('image220.png_0.png', 0.9760274)  
 image221.png_2.png ('image221.png_3.png', 0.97269195)  
 image221.png_3.png ('image221.png_2.png', 0.97269195)  
 image23.png_0.png ('image25.png_0.png', 0.97705656)  
 image23.png_2.png ('image26.png_2.png', 0.97086954)  
 image234.png_0.png ('image235.png_0.png', 0.99267024)  
 image234.png_1.png ('image234.png_2.png', 0.97507715)  
 image234.png_2.png ('image234.png_1.png', 0.97507715)  
 image235.png_0.png ('image234.png_0.png', 0.99267024)  
 image237.png_0.png ('image238.png_0.png', 0.9760754)  
 image238.png_0.png ('image237.png_0.png', 0.9760754)  
 image238.png_0.png ('image239.png_0.png', 0.9754666)  
 image239.png_0.png ('image238.png_0.png', 0.9754666)  
 image242.png_0.png ('image243.png_0.png', 0.9714698)  
 image243.png_0.png ('image242.png_0.png', 0.9714698)  
 image245.png_0.png ('image246.png_0.png', 0.97148263)  
 image245.png_0.png ('image247.png_0.png', 0.97165793)  
 image246.png_0.png ('image245.png_0.png', 0.97148263)  
 image246.png_0.png ('image247.png_0.png', 0.9809675)  
 image247.png_0.png ('image245.png_0.png', 0.97165793)  
 image247.png_0.png ('image246.png_0.png', 0.9809675)  
 image25.png_0.png ('image23.png_0.png', 0.97705656)  
 image257.png_0.png ('image258.png_0.png', 0.98425305)  
 image257.png_0.png ('image259.png_0.png', 0.9732449)  
 image258.png_0.png ('image257.png_0.png', 0.98425305)  
 image258.png_0.png ('image259.png_0.png', 0.98271394)  
 image259.png_0.png ('image257.png_0.png', 0.9732449)  
 image259.png_0.png ('image258.png_0.png', 0.98271394)  
 image26.png_2.png ('image23.png_2.png', 0.97086954)  
 image262.png_0.png ('image263.png_0.png', 0.9708605)  
 image263.png_0.png ('image262.png_0.png', 0.9708605)  
 image265.png_0.png ('image266.png_0.png', 0.985058)  
 image265.png_0.png ('image267.png_0.png', 0.97712684)  
 image266.png_0.png ('image265.png_0.png', 0.985058)  
 image266.png_0.png ('image267.png_0.png', 0.9758965)  
 image267.png_0.png ('image265.png_0.png', 0.97712684)  
 image267.png_0.png ('image266.png_0.png', 0.9758965)  
 image270.png_1.png ('image270.png_2.png', 0.97145426)  
 image270.png_2.png ('image270.png_1.png', 0.97145426)  
 image271.png_1.png ('image271.png_2.png', 0.97522026)  
 image271.png_2.png ('image271.png_1.png', 0.97522026)  
 image273.png_0.png ('image274.png_0.png', 0.9760238)  
 image274.png_0.png ('image273.png_0.png', 0.9760238)  
 image274.png_0.png ('image275.png_0.png', 0.97969353)  
 image274.png_1.png ('image274.png_3.png', 0.9715067)  
 image274.png_3.png ('image274.png_1.png', 0.9715067)  
 image275.png_0.png ('image274.png_0.png', 0.97969353)  
 image277.png_0.png ('image278.png_0.png', 0.9858266)  
 image277.png_0.png ('image279.png_0.png', 0.98131776)  
 image278.png_0.png ('image277.png_0.png', 0.9858266)  
 image278.png_0.png ('image279.png_0.png', 0.9779868)  
 image279.png_0.png ('image277.png_0.png', 0.98131776)  
 image279.png_0.png ('image278.png_0.png', 0.9779868)  
 image287.jpg_0.png ('image164.jpg_0.png', 0.9828615)  
 image63.png_0.png ('image71.png_0.png', 0.99036366)  
 image7.png_0.png ('image11.png_0.png', 0.98577213)  
 image7.png_0.png ('image12.png_0.png', 0.9938463)  
 image7.png_0.png ('image14.png_0.png', 0.9742934)  
 image7.png_0.png ('image16.png_0.png', 0.9959326)  
 image7.png_0.png ('image20.png_0.png', 0.9730121)  
 image71.png_0.png ('image63.png_0.png', 0.99036366)  
 image83.png_0.png ('image86.png_0.png', 0.9774096)  
 image83.png_0.png ('image92.png_0.png', 0.9701129)  
 image83.png_0.png ('image98.png_0.png', 0.9754302)  
 image83.png_2.png ('image86.png_2.png', 0.97005486)  
 image83.png_3.png ('image86.png_3.png', 0.9776021)  
 image83.png_4.png ('image86.png_4.png', 0.97311985)  
 image85.png_0.png ('image87.png_0.png', 0.9724677)  
 image85.png_0.png ('image93.png_0.png', 0.9737506)  
 image85.png_0.png ('image96.png_0.png', 0.9903543)  
 image85.png_0.png ('image99.png_0.png', 0.98590815)  
 image85.png_2.png ('image96.png_2.png', 0.98394686)  
 image85.png_3.png ('image96.png_3.png', 0.9750851)  
 image86.png_0.png ('image83.png_0.png', 0.9774096)  
 image86.png_0.png ('image92.png_0.png', 0.9725857)  
 image86.png_0.png ('image98.png_0.png', 0.97047156)  
 image86.png_2.png ('image83.png_2.png', 0.97005486)  
 image86.png_3.png ('image83.png_3.png', 0.9776021)  
 image86.png_4.png ('image83.png_4.png', 0.97311985)  
 image87.png_0.png ('image85.png_0.png', 0.9724677)  
 image87.png_0.png ('image89.png_0.png', 0.97378576)  
 image87.png_0.png ('image99.png_0.png', 0.9758906)  
 image89.png_0.png ('image87.png_0.png', 0.97378576)  
 image90.png_3.png ('image98.png_3.png', 0.9723841)  
 image92.png_0.png ('image83.png_0.png', 0.9701129)  
 image92.png_0.png ('image86.png_0.png', 0.9725857)  
 image93.png_0.png ('image85.png_0.png', 0.9737506)  
 image93.png_0.png ('image96.png_0.png', 0.9713357)  
 image93.png_4.png ('image100.png_4.png', 0.9714619)  
 image96.png_0.png ('image85.png_0.png', 0.9903543)  
 image96.png_0.png ('image93.png_0.png', 0.9713357)  
 image96.png_0.png ('image99.png_0.png', 0.9820196)  
 image96.png_2.png ('image85.png_2.png', 0.98394686)  
 image96.png_3.png ('image85.png_3.png', 0.9750851)  
 image98.png_0.png ('image83.png_0.png', 0.9754302)  
 image98.png_0.png ('image86.png_0.png', 0.97047156)  
 image98.png_3.png ('image90.png_3.png', 0.9723841)  
 image99.png_0.png ('image85.png_0.png', 0.98590815)  
 image99.png_0.png ('image87.png_0.png', 0.9758906)  
 image99.png_0.png ('image96.png_0.png', 0.9820196)  

The use of a more stringent criterion of 0.99 results in a more manageable list:

 image12.png_0.png ('image7.png_0.png', 0.9938463)  
 image120.png_0.png ('image121.png_0.png', 0.9912125)  
 image121.png_0.png ('image120.png_0.png', 0.9912125)  
 image122.png_0.png ('image124.png_0.png', 0.99413013)  
 image124.png_0.png ('image122.png_0.png', 0.99413013)  
 image13.png_3.png ('image13.png_4.png', 0.99999994)  
 image13.png_4.png ('image13.png_3.png', 0.99999994)  
 image14.png_0.png ('image20.png_0.png', 0.9994699)  
 image152.png_0.png ('image153.png_0.png', 0.9902647)  
 image153.png_0.png ('image152.png_0.png', 0.9902647)  
 image16.png_0.png ('image7.png_0.png', 0.9959326)  
 image168.png_0.png ('image169.png_0.png', 0.9929953)  
 image169.png_0.png ('image168.png_0.png', 0.9929953)  
 image192.png_0.png ('image194.png_0.png', 0.9978157)  
 image194.png_0.png ('image192.png_0.png', 0.9978157)  
 image20.png_0.png ('image14.png_0.png', 0.9994699)  
 image202.png_0.png ('image203.png_0.png', 0.9940014)  
 image203.png_0.png ('image202.png_0.png', 0.9940014)  
 image215.jpeg_0.png ('image215.jpeg_1.png', 0.9919382)  
 image215.jpeg_1.png ('image215.jpeg_0.png', 0.9919382)  
 image234.png_0.png ('image235.png_0.png', 0.99267024)  
 image235.png_0.png ('image234.png_0.png', 0.99267024)  
 image63.png_0.png ('image71.png_0.png', 0.99036366)  
 image7.png_0.png ('image12.png_0.png', 0.9938463)  
 image7.png_0.png ('image16.png_0.png', 0.9959326)  
 image71.png_0.png ('image63.png_0.png', 0.99036366)  
 image85.png_0.png ('image96.png_0.png', 0.9903543)  
 image96.png_0.png ('image85.png_0.png', 0.9903543)  

Even this shorter list has 28 hits. Closer manual inspection of these hits highlights the difficulties involved in this chop and search strategy. The extremely promising hit of image 13 parts 3 and 4 is actually due to these two images being just the canvas. Similarly, the hit seen between part 0 of image 14 and part 0 of image 20 is due to the presence of the same human karyotype in both these images. However, the original images are distinct enough, as they are screenshots showing the synteny relationship of the same human chromosome with very different species.
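Part of the bulk in these lists comes from every pair being reported twice, once from each side, so the 28 hits above correspond to 14 distinct pairs. The sketch below shows one way to collapse the reciprocal entries and to flag hits that come from two pieces of the same original image (such as the blank canvas in image 13). The function names and the small hits list are illustrative only; the list copies four lines from the output above.

```python
import re

# File names produced by the chopping step look like image98.png_0.png
PART = re.compile(r"^(?P<source>.+)_(?P<part>\d+)\.png$")


def split_part_name(name):
    """Split 'image98.png_0.png' into ('image98.png', 0)."""
    match = PART.match(name)
    if match is None:
        raise ValueError(f"not a chopped file name: {name}")
    return match.group("source"), int(match.group("part"))


def unique_hits(hits):
    """Collapse reciprocal hits and flag pairs cut from the same source image.

    `hits` is an iterable of (key, (other, score)) as printed by the loop above.
    Returns rows of (score, name_a, name_b, same_source), highest score first.
    """
    seen = {}
    for key, (other, score) in hits:
        seen[tuple(sorted((key, other)))] = score
    rows = []
    for (a, b), score in seen.items():
        same_source = split_part_name(a)[0] == split_part_name(b)[0]
        rows.append((score, a, b, same_source))
    return sorted(rows, reverse=True)


hits = [
    ("image13.png_3.png", ("image13.png_4.png", 0.99999994)),
    ("image13.png_4.png", ("image13.png_3.png", 0.99999994)),
    ("image14.png_0.png", ("image20.png_0.png", 0.9994699)),
    ("image20.png_0.png", ("image14.png_0.png", 0.9994699)),
]
for row in unique_hits(hits):
    print(row)
```

Hits flagged as coming from the same source image are not necessarily uninteresting, but they are the first place to look for blank canvas or repeated background.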

Self-criticism seems to be trending this month. Many thanks to the brave/crazy Nicholas P. Holmes for sharing his thoughts on this. Even seemingly disastrous events such as retracting a paper have achieved glory. The cost of doing science continues to increase. Whether we will see a long-term change, and how the reward system will weigh honest science against exceptional science, is unclear.



Saturday, July 24, 2021

Beautiful, More Beautiful and Most Beautiful. The maxim of Nicolaus Steno

 If you don't know who Nicolaus Steno is, you can look at the video below.


Now that you know who Nicolaus Steno is, you will appreciate his maxim about the importance of understanding and comprehending. In the previous post, we looked at how the paper by Sharma et al., 2020 has many figures in the supplementary material that are detected as putative duplicates of each other. While the imagededup package we used has been benchmarked by its authors, we need to verify its abilities using a set of images that are known to have been duplicated as part of fakery. Thankfully, the Bik et al., 2016 paper does a very thorough job of grouping the types of fakery into the following classes:

  1. Category I: simple duplications (https://journals.asm.org/doi/10.1128/mBio.00809-16#fig2)
  2. Category II: duplication with repositioning (https://journals.asm.org/doi/10.1128/mBio.00809-16#fig3)
  3. Category III: duplication with alteration (https://journals.asm.org/doi/10.1128/mBio.00809-16#fig4)

It is possible to test the imagededup package on these verified fakeries. However, to do that we need to cut out each of the image parts identified by Bik et al., 2016 and create a dataset on which the code can be executed. This dataset is uploaded on github (https://github.com/Corvus7/Fakery.git) and may serve as a training dataset for future efforts at developing AI-based solutions. The full code used and the results are provided below:
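 # In the shell: fetch the dataset into the home directory  
 cd  
 git clone https://github.com/Corvus7/Fakery.git  
 # In the python console: expand ~ explicitly, since python does not do it for us  
 import os  
 image_dir=os.path.expanduser('~/Fakery/bik2016/Fig2/')  
 from imagededup.methods import CNN   
 cnn_encoder = CNN()   
 duplicates_cnn = cnn_encoder.find_duplicates(image_dir=image_dir, scores=True)   
 duplicates_cnn   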
 {'Blue_1.png': [('Blue_2.png', 0.9211361)], 'Blue_2.png': [('Blue_1.png', 0.9211361)], 'Figure_2_cut_out.jpeg': [], 'Green_1.png': [('Green_2.png', 0.95247614)], 'Green_2.png': [('Green_1.png', 0.95247614)], 'Red_1.png': [('Red_2.png', 0.90385926)], 'Red_2.png': [('Red_1.png', 0.90385926)]}  
 import os  
 image_dir=os.path.expanduser('~/Fakery/bik2016/Fig3/')  
 from imagededup.methods import CNN   
 cnn_encoder = CNN()   
 duplicates_cnn = cnn_encoder.find_duplicates(image_dir=image_dir, scores=True)   
 duplicates_cnn   
 {'Blue_1.png': [], 'Blue_2.png': [], 'Figure_3_cut_out.jpeg': [], 'Green_1.png': [], 'Green_2.png': [], 'Red_1.png': [], 'Red_2.png': []}  
 import os  
 image_dir=os.path.expanduser('~/Fakery/bik2016/Fig4/')  
 from imagededup.methods import CNN   
 cnn_encoder = CNN()   
 duplicates_cnn = cnn_encoder.find_duplicates(image_dir=image_dir, scores=True)   
 duplicates_cnn   
 {'Blue_first.png': [('Blue_second.png', 0.97069657)], 'Blue_second.png': [('Blue_first.png', 0.97069657)], 'Full_screenshot.png': [], 'Green_first.png': [('Green_second.png', 0.98090243)], 'Green_second.png': [('Green_first.png', 0.98090243)], 'Lane_10.png': [('Lane_9.png', 0.90004313)], 'Lane_9.png': [('Lane_10.png', 0.90004313)], 'Orange_first.png': [('Orange_second.png', 0.9222448)], 'Orange_second.png': [('Orange_first.png', 0.9222448)], 'Pink_first.png': [('Pink_second.png', 0.9387925)], 'Pink_second.png': [('Pink_first.png', 0.9387925)], 'Purple_first.png': [('Purple_second.png', 0.95602536)], 'Purple_second.png': [('Purple_first.png', 0.95602536)], 'Red_first.png': [('Red_second.png', 0.9667898)], 'Red_second.png': [('Red_first.png', 0.9667898)], 'orange_first.png': [('orange_second.png', 0.91980916)], 'orange_second.png': [('orange_first.png', 0.91980916)]}  
The results are a bit surprising. Category II duplicates are not picked up at all by the CNN method. The correct figures are tagged as duplicates in Category I and Category III. However, the scores range from 0.9 all the way up to 0.98. This suggests that the scores by themselves are not reliable indicators and that manual inspection is definitely required. AI-based methods like imagededup need a lot more sophistication to allow their widespread use in anti-fakery approaches. As far as the paper by Sharma et al., 2020 is concerned, the raw data in the form of SAM files are provided in the supplementary materials. Running the samtools split command will separate out each of the sub-components by read group.
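A note on the score range seen above: as far as I can tell from the imagededup documentation, the CNN scores are cosine similarities between feature vectors, and find_duplicates only reports pairs above a min_similarity_threshold that defaults to 0.9. If that is right, the empty Category II result means "below 0.9" rather than "no similarity", and no score under 0.9 can appear in any of the outputs. The toy sketch below illustrates this behaviour with made-up three-number feature vectors; it is not the imagededup implementation, and the names cosine_similarity, reported_pairs and toy_features are invented for this post.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def reported_pairs(features, threshold=0.9):
    """Return {(name_a, name_b): score} for pairs scoring at or above threshold.

    Pairs below the threshold are silently dropped, which is why a
    result dictionary can contain empty lists.
    """
    names = sorted(features)
    pairs = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            score = cosine_similarity(features[a], features[b])
            if score >= threshold:
                pairs[(a, b)] = score
    return pairs


# Made-up feature vectors: A and B are near-identical, C points elsewhere.
toy_features = {"A": [1.0, 0.0, 0.2], "B": [1.0, 0.1, 0.2], "C": [0.6, 0.8, 0.0]}
print(reported_pairs(toy_features))                  # default floor of 0.9
print(reported_pairs(toy_features, threshold=0.5))   # lower floor, more pairs
```

Lowering the threshold on the real data would show how far below 0.9 the Category II pairs actually score, at the cost of many more hits to inspect.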


Friday, July 23, 2021

Fakery, some more fakery and even more fakery. The fakir is not the most fake

The words "Faker" and "Fakir" are thought to be semantically related in the English language. While a faker is "one who makes false claims", the term fakir/faqir/faqeer denotes a holy man in the Indian subcontinent. Why these two words are semantically related is something one could speculate about, but that is not the subject of this post. The lack of evidence to support the claim that the great British statesman Sir Winston Leonard Spencer Churchill ever referred to Mahatma Gandhi as a "seditious fakir" should give us enough insight.

Now coming to the main topic of this post, fakery in science is one of the most perplexing things I have encountered. Yet, claiming to have invented or discovered something which is not true is surprisingly common. Thanks to the selfless efforts of intelligent people like Jennifer Byrne and Elisabeth Bik, many instances of such fakery have begun to come to light. The paper titled "The Prevalence of Inappropriate Image Duplication in Biomedical Research Publications" is probably the most influential among these efforts. By analyzing a large enough dataset of published papers, the authors try to decipher the major patterns in fakery.

  1. The clearest pattern they demonstrate is the effect of the impact factor (https://journals.asm.org/doi/10.1128/mBio.00809-16#fig6). Journals with lower impact factors seem to have more fakery than those with higher impact factors. The most fakery is seen in journals with impact factors between 2 and 3.
  2. The country of origin also seems to have a strong enough effect (https://journals.asm.org/doi/10.1128/mBio.00809-16#fig7). India and China (including Taiwan) have a greater fraction of papers with problematic images than expected. In the words of Bik et al., "Countries plotted above the blue line had a higher-than-expected proportion of problematic papers; countries plotted below the line had a lower-than-expected ratio." 
  3. Another interesting trend identified by Jennifer Byrne is that such fakery involves "targeting less-well-known human genes to produce low-value and possibly fraudulent papers".

After having read this, I realized that the paper by Sharma et al., 2020 meets all three criteria. It is published in a journal with an impact factor between 2 and 3, all authors have affiliations in India and the paper is about a less well-known human gene. Hence, it is very likely that this paper may have employed large-scale image duplication and other forms of fakery to get past the review process. This would make it an ideal candidate to investigate some more fakery.

Not everybody is as gifted as Bik, who has done most of her sleuthing just using her eyes. Therefore, the use of AI-based methods is expected to help relieve the pressure on human volunteers and allow a large-scale search of the scientific literature. Unfortunately, AI is yet to deliver on this promise. The Python package imagededup (https://github.com/idealo/imagededup) makes major strides in this direction. It provides an interface to search for image duplication in a large set of files using algorithms like perceptual hashing and CNN. A score is also provided to help prioritize the putative duplicates identified. Further progress in tools such as these could help prevent fraudulent use of image duplication.
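To give a feel for what a hashing-based comparison does, here is a toy sketch of an average hash, one of the simplest members of the perceptual hashing family. It is an illustration under heavy simplifying assumptions: the "images" are tiny hand-written grids of grayscale values, there is no resizing step, and none of this is the code used inside imagededup.

```python
def average_hash(pixels):
    """Hash a 2D grid of grayscale values: 1 where a pixel is brighter than the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if value > mean else 0 for value in flat)


def hamming_distance(hash_a, hash_b):
    """Number of positions at which two equal-length hashes differ."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    return sum(a != b for a, b in zip(hash_a, hash_b))


# Made-up 2 x 2 "images" for illustration only.
original = [[10, 200], [220, 30]]
brighter = [[30, 220], [240, 50]]    # same pattern, uniformly brighter
different = [[200, 10], [30, 220]]   # light and dark areas swapped

print(hamming_distance(average_hash(original), average_hash(brighter)))
print(hamming_distance(average_hash(original), average_hash(different)))
```

A small Hamming distance means the two grids share the same pattern of light and dark regions even if the raw pixel values differ, which is why such hashes tolerate brightness changes but struggle when parts of an image are moved around.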

Returning to the paper published by Sharma et al., 2020, an astounding 287 supplementary figures are reported in this paper. The code given below is used to look for fakery in this paper's supplementary figures.

 wget https://static-content.springer.com/esm/art%3A10.1007%2Fs00251-020-01186-2/MediaObjects/251_2020_1186_MOESM9_ESM.pptx  
 mv 251_2020_1186_MOESM9_ESM.pptx 251_2020_1186_MOESM9_ESM.zip  
 unzip 251_2020_1186_MOESM9_ESM.zip  

The above code downloads the PowerPoint file containing all the supplementary figures, renames it as a zip file and extracts its contents. After these three steps, all the images present in the PowerPoint are available in a single folder under ppt/media/. The next steps involve installing the imagededup package with all the required prerequisites.
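The rename-and-unzip trick works because a .pptx file is an ordinary zip archive. For those who prefer to stay in Python, the sketch below pulls out only the ppt/media/ entries using the standard zipfile module. The function name extract_pptx_media and the output folder are my own choices for illustration.

```python
import zipfile
from pathlib import Path


def extract_pptx_media(pptx_path, out_dir):
    """Extract only the ppt/media/ members of a .pptx file.

    Returns the sorted list of extracted file paths under out_dir.
    """
    out_dir = Path(out_dir)
    extracted = []
    with zipfile.ZipFile(pptx_path) as archive:
        for member in archive.namelist():
            # Skip directory entries and everything outside ppt/media/
            if member.startswith("ppt/media/") and not member.endswith("/"):
                archive.extract(member, out_dir)
                extracted.append(out_dir / member)
    return sorted(extracted)


# Example call (assumes the PowerPoint file has already been downloaded):
# images = extract_pptx_media("251_2020_1186_MOESM9_ESM.pptx", "supplementary")
# print(len(images))
```

The folder containing the extracted images, here supplementary/ppt/media/ in the example call, is what should later be passed as image_dir.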

After everything is installed and working properly, the following code should work. Make sure the correct version of NumPy is installed. Fire up the python console and set image_dir to the correct path to the image directory (the folder into which all the images have been extracted by the unzip command above).

 #Find duplicates using CNN along with scores  
 from imagededup.methods import CNN  
 cnn_encoder = CNN()  
 duplicates_cnn = cnn_encoder.find_duplicates(image_dir=image_dir, scores=True)  
 duplicates_cnn  
After the above code executes (you don't need a GPU, it just complains that you lack one), the set of putative duplicate images is identified and printed on the screen with the scores.
 {'image1.png': [], 'image10.png': [], 'image100.png': [('image101.png', 0.9183681), ('image83.png', 0.9336556), ('image84.png', 0.94029236), ('image85.png', 0.96624136), ('image86.png', 0.95270556), ('image87.png', 0.91503084), ('image88.png', 0.9674068), ('image89.png', 0.9193216), ('image91.png', 0.95968944), ('image92.png', 0.9491019), ('image93.png', 0.9689957), ('image94.png', 0.9400719), ('image96.png', 0.947449), ('image99.png', 0.9547266)], 'image101.png': [('image100.png', 0.9183681), ('image84.png', 0.92484504), ('image85.png', 0.921789), ('image86.png', 0.90466857), ('image87.png', 0.92616916), ('image88.png', 0.91239095), ('image91.png', 0.92148745), ('image96.png', 0.92620283), ('image97.png', 0.9051003), ('image98.png', 0.91163087), ('image99.png', 0.91708857)], 'image102.png': [('image103.png', 0.9484605)], 'image103.png': [('image102.png', 0.9484605)], 'image104.png': [], 'image105.png': [('image112.png', 0.9010905)], 'image106.png': [], 'image107.png': [('image110.png', 0.90515953)], 'image108.png': [('image110.png', 0.90387726)], 'image109.png': [('image110.png', 0.9003114)], 'image11.png': [('image14.png', 0.9102166), ('image16.png', 0.92004216), ('image20.png', 0.93255997), ('image7.png', 0.9118265), ('image8.png', 0.9021534)], 'image110.png': [('image107.png', 0.90515953), ('image108.png', 0.90387726), ('image109.png', 0.9003114)], 'image111.png': [], 'image112.png': [('image105.png', 0.9010905)], 'image113.jpeg': [('image150.jpeg', 0.9157622)], 'image114.png': [('image171.png', 0.91326845)], 'image115.png': [('image121.png', 0.92301464)], 'image116.png': [], 'image117.png': [('image120.png', 0.90136623)], 'image118.png': [], 'image119.png': [('image120.png', 0.92832935)], 'image12.png': [('image16.png', 0.91794515), ('image7.png', 0.9085101), ('image8.png', 0.9045775)], 'image120.png': [('image117.png', 0.90136623), ('image119.png', 0.92832935), ('image139.png', 0.9278433), ('image140.png', 0.91120446), ('image152.png', 0.9259067), 
('image153.png', 0.90862787)], 'image121.png': [('image115.png', 0.92301464), ('image141.png', 0.9063702)], 'image122.png': [('image125.png', 0.9004039)], 'image123.png': [('image201.png', 0.9129802)], 'image124.png': [], 'image125.png': [('image122.png', 0.9004039), ('image202.png', 0.9160283), ('image203.png', 0.9110071), ('image212.png', 0.91093236)], 'image126.png': [], 'image127.png': [], 'image128.png': [], 'image129.png': [], 'image13.png': [], 'image130.png': [], 'image131.png': [], 'image132.png': [], 'image133.jpeg': [], 'image134.png': [], 'image135.png': [], 'image136.png': [], 'image137.png': [], 'image138.png': [], 'image139.png': [('image120.png', 0.9278433)], 'image14.png': [('image11.png', 0.9102166), ('image16.png', 0.94823205), ('image20.png', 0.9343158), ('image7.png', 0.94182235)], 'image140.png': [('image120.png', 0.91120446)], 'image141.png': [('image121.png', 0.9063702), ('image74.png', 0.91188), ('image76.png', 0.92340136), ('image78.png', 0.90045446)], 'image142.png': [], 'image143.png': [('image144.png', 0.9097591)], 'image144.png': [('image143.png', 0.9097591)], 'image145.png': [], 'image146.png': [], 'image147.png': [('image148.png', 0.91800094)], 'image148.png': [('image147.png', 0.91800094)], 'image149.png': [], 'image15.png': [('image18.png', 0.9242378), ('image26.png', 0.91410726)], 'image150.jpeg': [('image113.jpeg', 0.9157622)], 'image151.png': [], 'image152.png': [('image120.png', 0.9259067), ('image153.png', 0.9593252)], 'image153.png': [('image120.png', 0.90862787), ('image152.png', 0.9593252), ('image157.png', 0.9117504)], 'image154.png': [('image158.png', 0.9086372)], 'image155.png': [], 'image156.png': [('image157.png', 0.9200652)], 'image157.png': [('image153.png', 0.9117504), ('image156.png', 0.9200652)], 'image158.png': [('image154.png', 0.9086372)], 'image159.png': [], 'image16.png': [('image11.png', 0.92004216), ('image12.png', 0.91794515), ('image14.png', 0.94823205), ('image20.png', 0.9272426), ('image7.png', 
0.959764)], 'image160.png': [], 'image161.png': [], 'image162.png': [], 'image163.png': [], 'image164.jpg': [], 'image165.png': [('image199.png', 0.90960884), ('image209.png', 0.9267041)], 'image166.png': [('image194.png', 0.9086143)], 'image167.png': [], 'image168.png': [], 'image169.png': [], 'image17.png': [('image19.png', 0.90715945)], 'image170.png': [], 'image171.png': [('image114.png', 0.91326845)], 'image172.png': [('image173.png', 0.9054203)], 'image173.png': [('image172.png', 0.9054203)], 'image174.png': [], 'image175.png': [('image211.png', 0.9042617)], 'image176.png': [], 'image177.png': [], 'image178.png': [], 'image179.png': [], 'image18.png': [('image15.png', 0.9242378), ('image26.png', 0.9157932), ('image7.png', 0.9010002)], 'image180.png': [], 'image181.png': [], 'image182.png': [], 'image183.png': [], 'image184.png': [], 'image185.png': [('image191.png', 0.92406636), ('image209.png', 0.90291137)], 'image186.png': [('image194.png', 0.90917015)], 'image187.png': [], 'image188.png': [], 'image189.png': [], 'image19.png': [('image17.png', 0.90715945)], 'image190.png': [], 'image191.png': [('image185.png', 0.92406636)], 'image192.png': [], 'image193.png': [], 'image194.png': [('image166.png', 0.9086143), ('image186.png', 0.90917015)], 'image195.png': [], 'image196.png': [], 'image197.png': [], 'image198.png': [], 'image199.png': [('image165.png', 0.90960884)], 'image2.png': [], 'image20.png': [('image11.png', 0.93255997), ('image14.png', 0.9343158), ('image16.png', 0.9272426), ('image7.png', 0.9131277), ('image8.png', 0.9061092)], 'image200.png': [], 'image201.png': [('image123.png', 0.9129802)], 'image202.png': [('image125.png', 0.9160283), ('image203.png', 0.94643486)], 'image203.png': [('image125.png', 0.9110071), ('image202.png', 0.94643486)], 'image204.png': [], 'image205.png': [], 'image206.png': [], 'image207.png': [], 'image208.png': [], 'image209.png': [('image165.png', 0.9267041), ('image185.png', 0.90291137)], 'image21.png': [('image36.png', 
0.9179164)], 'image210.png': [], 'image211.png': [('image175.png', 0.9042617)], 'image212.png': [('image125.png', 0.91093236)], 'image213.png': [], 'image214.png': [], 'image215.jpeg': [], 'image216.png': [], 'image217.png': [('image218.png', 0.91770554)], 'image218.png': [('image217.png', 0.91770554), ('image224.png', 0.9038681)], 'image219.png': [], 'image22.png': [('image35.png', 0.90540993)], 'image220.png': [], 'image221.png': [], 'image222.png': [], 'image223.png': [], 'image224.png': [('image218.png', 0.9038681)], 'image225.png': [], 'image226.png': [], 'image227.png': [], 'image228.png': [], 'image229.png': [], 'image23.png': [('image25.png', 0.9128949), ('image26.png', 0.9407302)], 'image230.jpg': [], 'image231.jpg': [], 'image232.png': [], 'image233.png': [], 'image234.png': [('image235.png', 0.95958734)], 'image235.png': [('image234.png', 0.95958734)], 'image236.png': [], 'image237.png': [('image238.png', 0.9559316), ('image239.png', 0.95943606), ('image257.png', 0.9001226), ('image259.png', 0.91905767)], 'image238.png': [('image237.png', 0.9559316), ('image239.png', 0.93660516), ('image257.png', 0.9084082), ('image259.png', 0.93839127)], 'image239.png': [('image237.png', 0.95943606), ('image238.png', 0.93660516), ('image250.png', 0.9003399), ('image259.png', 0.9138143)], 'image24.png': [], 'image240.png': [], 'image241.png': [('image242.png', 0.9320455), ('image243.png', 0.92742676), ('image250.png', 0.90400875), ('image251.png', 0.90990597)], 'image242.png': [('image241.png', 0.9320455), ('image243.png', 0.9628831)], 'image243.png': [('image241.png', 0.92742676), ('image242.png', 0.9628831)], 'image244.png': [], 'image245.png': [('image246.png', 0.9437679), ('image247.png', 0.93426013), ('image250.png', 0.9031271), ('image253.png', 0.91433334)], 'image246.png': [('image245.png', 0.9437679), ('image247.png', 0.9361171), ('image250.png', 0.9113657)], 'image247.png': [('image245.png', 0.93426013), ('image246.png', 0.9361171), ('image250.png', 
0.90679574)], 'image248.png': [], 'image249.png': [], 'image25.png': [('image23.png', 0.9128949), ('image26.png', 0.92126876)], 'image250.png': [('image239.png', 0.9003399), ('image241.png', 0.90400875), ('image245.png', 0.9031271), ('image246.png', 0.9113657), ('image247.png', 0.90679574), ('image251.png', 0.9163149)], 'image251.png': [('image241.png', 0.90990597), ('image250.png', 0.9163149)], 'image252.png': [], 'image253.png': [('image245.png', 0.91433334)], 'image254.png': [], 'image255.png': [], 'image256.png': [], 'image257.png': [('image237.png', 0.9001226), ('image238.png', 0.9084082), ('image258.png', 0.91473335), ('image259.png', 0.9384123)], 'image258.png': [('image257.png', 0.91473335), ('image259.png', 0.92645323)], 'image259.png': [('image237.png', 0.91905767), ('image238.png', 0.93839127), ('image239.png', 0.9138143), ('image257.png', 0.9384123), ('image258.png', 0.92645323)], 'image26.png': [('image15.png', 0.91410726), ('image18.png', 0.9157932), ('image23.png', 0.9407302), ('image25.png', 0.92126876)], 'image260.png': [], 'image261.png': [], 'image262.png': [('image263.png', 0.92793417)], 'image263.png': [('image262.png', 0.92793417)], 'image264.png': [], 'image265.png': [('image266.png', 0.93153894), ('image267.png', 0.94791853)], 'image266.png': [('image265.png', 0.93153894), ('image267.png', 0.9459948)], 'image267.png': [('image265.png', 0.94791853), ('image266.png', 0.9459948)], 'image268.png': [], 'image269.png': [], 'image27.png': [('image28.png', 0.9027702), ('image40.png', 0.91454256)], 'image270.png': [], 'image271.png': [], 'image272.png': [], 'image273.png': [('image274.png', 0.9425941), ('image275.png', 0.9263323)], 'image274.png': [('image273.png', 0.9425941), ('image275.png', 0.9559473)], 'image275.png': [('image273.png', 0.9263323), ('image274.png', 0.9559473)], 'image276.png': [], 'image277.png': [('image278.png', 0.94348145), ('image279.png', 0.93136877)], 'image278.png': [('image277.png', 0.94348145), ('image279.png', 
0.9180116)], 'image279.png': [('image277.png', 0.93136877), ('image278.png', 0.9180116)], 'image28.png': [('image27.png', 0.9027702), ('image30.png', 0.9038577)], 'image280.png': [], 'image281.png': [], 'image282.png': [], 'image283.png': [], 'image284.png': [], 'image285.png': [], 'image286.png': [], 'image287.jpg': [], 'image29.png': [('image34.png', 0.90683985)], 'image3.png': [], 'image30.png': [('image28.png', 0.9038577), ('image33.png', 0.9480987), ('image39.png', 0.94268227), ('image40.png', 0.93139154)], 'image31.png': [('image40.png', 0.919406)], 'image32.png': [], 'image33.png': [('image30.png', 0.9480987), ('image34.png', 0.9020339), ('image39.png', 0.9246298), ('image40.png', 0.9213065)], 'image34.png': [('image29.png', 0.90683985), ('image33.png', 0.9020339), ('image40.png', 0.92046106)], 'image35.png': [('image22.png', 0.90540993)], 'image36.png': [('image21.png', 0.9179164)], 'image37.png': [], 'image38.png': [], 'image39.png': [('image30.png', 0.94268227), ('image33.png', 0.9246298), ('image40.png', 0.9188498)], 'image4.png': [], 'image40.png': [('image27.png', 0.91454256), ('image30.png', 0.93139154), ('image31.png', 0.919406), ('image33.png', 0.9213065), ('image34.png', 0.92046106), ('image39.png', 0.9188498)], 'image41.png': [], 'image42.png': [], 'image43.png': [], 'image44.png': [], 'image45.png': [('image47.png', 0.90898323)], 'image46.png': [], 'image47.png': [('image45.png', 0.90898323)], 'image48.png': [], 'image49.png': [], 'image5.png': [('image6.png', 0.93074346)], 'image50.png': [], 'image51.png': [], 'image52.jpg': [], 'image53.png': [('image54.png', 0.9346998), ('image55.png', 0.9301629), ('image56.png', 0.90333104), ('image58.png', 0.9169454), ('image59.png', 0.9067757)], 'image54.png': [('image53.png', 0.9346998), ('image55.png', 0.9548632), ('image56.png', 0.9364393), ('image58.png', 0.95981777), ('image59.png', 0.9343269)], 'image55.png': [('image53.png', 0.9301629), ('image54.png', 0.9548632), ('image56.png', 0.9410248), 
('image58.png', 0.94894063), ('image59.png', 0.9302525)], 'image56.png': [('image53.png', 0.90333104), ('image54.png', 0.9364393), ('image55.png', 0.9410248), ('image58.png', 0.918398), ('image59.png', 0.94681996)], 'image57.png': [], 'image58.png': [('image53.png', 0.9169454), ('image54.png', 0.95981777), ('image55.png', 0.94894063), ('image56.png', 0.918398), ('image59.png', 0.92951506)], 'image59.png': [('image53.png', 0.9067757), ('image54.png', 0.9343269), ('image55.png', 0.9302525), ('image56.png', 0.94681996), ('image58.png', 0.92951506)], 'image6.png': [('image5.png', 0.93074346)], 'image60.png': [], 'image61.png': [], 'image62.png': [('image65.png', 0.92089856), ('image69.png', 0.90719336), ('image71.png', 0.91296417)], 'image63.png': [('image66.png', 0.9318356), ('image67.png', 0.9502576), ('image69.png', 0.93360347), ('image70.png', 0.9029176), ('image71.png', 0.9614921)], 'image64.png': [('image65.png', 0.9162164), ('image67.png', 0.9098237), ('image73.png', 0.9010482)], 'image65.png': [('image62.png', 0.92089856), ('image64.png', 0.9162164), ('image67.png', 0.9024062), ('image71.png', 0.92213106), ('image72.png', 0.92419463)], 'image66.png': [('image63.png', 0.9318356), ('image67.png', 0.9185381), ('image69.png', 0.9501172), ('image70.png', 0.90259856), ('image71.png', 0.9121348), ('image73.png', 0.9006342)], 'image67.png': [('image63.png', 0.9502576), ('image64.png', 0.9098237), ('image65.png', 0.9024062), ('image66.png', 0.9185381), ('image71.png', 0.92976534), ('image73.png', 0.9187012)], 'image68.png': [], 'image69.png': [('image62.png', 0.90719336), ('image63.png', 0.93360347), ('image66.png', 0.9501172), ('image70.png', 0.9008339), ('image71.png', 0.9222581)], 'image7.png': [('image11.png', 0.9118265), ('image12.png', 0.9085101), ('image14.png', 0.94182235), ('image16.png', 0.959764), ('image18.png', 0.9010002), ('image20.png', 0.9131277)], 'image70.png': [('image63.png', 0.9029176), ('image66.png', 0.90259856), ('image69.png', 0.9008339), 
('image71.png', 0.9152819)], 'image71.png': [('image62.png', 0.91296417), ('image63.png', 0.9614921), ('image65.png', 0.92213106), ('image66.png', 0.9121348), ('image67.png', 0.92976534), ('image69.png', 0.9222581), ('image70.png', 0.9152819), ('image72.png', 0.9081404)], 'image72.png': [('image65.png', 0.92419463), ('image71.png', 0.9081404)], 'image73.png': [('image64.png', 0.9010482), ('image66.png', 0.9006342), ('image67.png', 0.9187012)], 'image74.png': [('image141.png', 0.91188), ('image76.png', 0.9384106), ('image77.png', 0.90596235), ('image78.png', 0.95519), ('image97.png', 0.91846454)], 'image75.png': [('image79.png', 0.9054158)], 'image76.png': [('image141.png', 0.92340136), ('image74.png', 0.9384106), ('image77.png', 0.93369615), ('image78.png', 0.9442819)], 'image77.png': [('image74.png', 0.90596235), ('image76.png', 0.93369615), ('image78.png', 0.9032801), ('image87.png', 0.90282905), ('image99.png', 0.90046567)], 'image78.png': [('image141.png', 0.90045446), ('image74.png', 0.95519), ('image76.png', 0.9442819), ('image77.png', 0.9032801), ('image97.png', 0.9011791)], 'image79.png': [('image75.png', 0.9054158)], 'image8.png': [('image11.png', 0.9021534), ('image12.png', 0.9045775), ('image20.png', 0.9061092)], 'image80.png': [], 'image81.png': [], 'image82.png': [], 'image83.png': [('image100.png', 0.9336556), ('image84.png', 0.9320098), ('image85.png', 0.9206197), ('image86.png', 0.9754471), ('image87.png', 0.9008194), ('image88.png', 0.9269708), ('image89.png', 0.94526726), ('image90.png', 0.94901687), ('image92.png', 0.948261), ('image93.png', 0.9211924), ('image94.png', 0.9406297), ('image95.png', 0.95429873), ('image97.png', 0.90625954), ('image98.png', 0.9336565), ('image99.png', 0.90938056)], 'image84.png': [('image100.png', 0.94029236), ('image101.png', 0.92484504), ('image83.png', 0.9320098), ('image85.png', 0.9357996), ('image86.png', 0.94560987), ('image87.png', 0.9238595), ('image88.png', 0.9233467), ('image89.png', 0.93499565), 
('image90.png', 0.9193237), ('image91.png', 0.9257249), ('image92.png', 0.92898804), ('image93.png', 0.9164334), ('image94.png', 0.9218082), ('image95.png', 0.9160108), ('image96.png', 0.91751677), ('image97.png', 0.90513825), ('image98.png', 0.92819315), ('image99.png', 0.9185119)], 'image85.png': [('image100.png', 0.96624136), ('image101.png', 0.921789), ('image83.png', 0.9206197), ('image84.png', 0.9357996), ('image86.png', 0.94672084), ('image87.png', 0.93584263), ('image88.png', 0.95474243), ('image89.png', 0.9393886), ('image91.png', 0.9575047), ('image92.png', 0.9566721), ('image93.png', 0.94614804), ('image94.png', 0.94766164), ('image96.png', 0.9661529), ('image98.png', 0.9093032), ('image99.png', 0.9608704)], 'image86.png': [('image100.png', 0.95270556), ('image101.png', 0.90466857), ('image83.png', 0.9754471), ('image84.png', 0.94560987), ('image85.png', 0.94672084), ('image87.png', 0.9302621), ('image88.png', 0.93556523), ('image89.png', 0.9451473), ('image90.png', 0.93358594), ('image91.png', 0.92195714), ('image92.png', 0.9587965), ('image93.png', 0.9311488), ('image94.png', 0.94426256), ('image95.png', 0.9393131), ('image96.png', 0.909492), ('image97.png', 0.9209131), ('image98.png', 0.9426492), ('image99.png', 0.9371631)], 'image87.png': [('image100.png', 0.91503084), ('image101.png', 0.92616916), ('image77.png', 0.90282905), ('image83.png', 0.9008194), ('image84.png', 0.9238595), ('image85.png', 0.93584263), ('image86.png', 0.9302621), ('image88.png', 0.9144976), ('image89.png', 0.9421051), ('image90.png', 0.9068817), ('image91.png', 0.91225964), ('image92.png', 0.92346567), ('image94.png', 0.9416356), ('image95.png', 0.90827584), ('image96.png', 0.9235568), ('image97.png', 0.94440794), ('image98.png', 0.94083), ('image99.png', 0.94340634)], 'image88.png': [('image100.png', 0.9674068), ('image101.png', 0.91239095), ('image83.png', 0.9269708), ('image84.png', 0.9233467), ('image85.png', 0.95474243), ('image86.png', 0.93556523), ('image87.png', 
0.9144976), ('image89.png', 0.93033636), ('image90.png', 0.91487217), ('image91.png', 0.9425154), ('image92.png', 0.9378558), ('image93.png', 0.9692812), ('image94.png', 0.94958806), ('image95.png', 0.90385324), ('image96.png', 0.94041604), ('image98.png', 0.911821), ('image99.png', 0.94753677)], 'image89.png': [('image100.png', 0.9193216), ('image83.png', 0.94526726), ('image84.png', 0.93499565), ('image85.png', 0.9393886), ('image86.png', 0.9451473), ('image87.png', 0.9421051), ('image88.png', 0.93033636), ('image90.png', 0.9433856), ('image92.png', 0.9473567), ('image93.png', 0.9181183), ('image94.png', 0.96665895), ('image95.png', 0.9607937), ('image96.png', 0.90655136), ('image97.png', 0.9302972), ('image98.png', 0.936584), ('image99.png', 0.9141985)], 'image9.png': [], 'image90.png': [('image83.png', 0.94901687), ('image84.png', 0.9193237), ('image86.png', 0.93358594), ('image87.png', 0.9068817), ('image88.png', 0.91487217), ('image89.png', 0.9433856), ('image92.png', 0.9033547), ('image94.png', 0.9209361), ('image95.png', 0.9587287), ('image97.png', 0.932611), ('image98.png', 0.94770867)], 'image91.png': [('image100.png', 0.95968944), ('image101.png', 0.92148745), ('image84.png', 0.9257249), ('image85.png', 0.9575047), ('image86.png', 0.92195714), ('image87.png', 0.91225964), ('image88.png', 0.9425154), ('image92.png', 0.92670405), ('image93.png', 0.94980544), ('image94.png', 0.91421396), ('image96.png', 0.97193605), ('image99.png', 0.9469298)], 'image92.png': [('image100.png', 0.9491019), ('image83.png', 0.948261), ('image84.png', 0.92898804), ('image85.png', 0.9566721), ('image86.png', 0.9587965), ('image87.png', 0.92346567), ('image88.png', 0.9378558), ('image89.png', 0.9473567), ('image90.png', 0.9033547), ('image91.png', 0.92670405), ('image93.png', 0.9380702), ('image94.png', 0.96137464), ('image95.png', 0.91555953), ('image96.png', 0.92193484), ('image98.png', 0.9120532), ('image99.png', 0.9409566)], 'image93.png': [('image100.png', 0.9689957), 
('image83.png', 0.9211924), ('image84.png', 0.9164334), ('image85.png', 0.94614804), ('image86.png', 0.9311488), ('image88.png', 0.9692812), ('image89.png', 0.9181183), ('image91.png', 0.94980544), ('image92.png', 0.9380702), ('image94.png', 0.9362874), ('image96.png', 0.9362149), ('image99.png', 0.930895)], 'image94.png': [('image100.png', 0.9400719), ('image83.png', 0.9406297), ('image84.png', 0.9218082), ('image85.png', 0.94766164), ('image86.png', 0.94426256), ('image87.png', 0.9416356), ('image88.png', 0.94958806), ('image89.png', 0.96665895), ('image90.png', 0.9209361), ('image91.png', 0.91421396), ('image92.png', 0.96137464), ('image93.png', 0.9362874), ('image95.png', 0.9345614), ('image96.png', 0.91518164), ('image97.png', 0.9218683), ('image98.png', 0.926751), ('image99.png', 0.94931686)], 'image95.png': [('image83.png', 0.95429873), ('image84.png', 0.9160108), ('image86.png', 0.9393131), ('image87.png', 0.90827584), ('image88.png', 0.90385324), ('image89.png', 0.9607937), ('image90.png', 0.9587287), ('image92.png', 0.91555953), ('image94.png', 0.9345614), ('image97.png', 0.92592674), ('image98.png', 0.94172406)], 'image96.png': [('image100.png', 0.947449), ('image101.png', 0.92620283), ('image84.png', 0.91751677), ('image85.png', 0.9661529), ('image86.png', 0.909492), ('image87.png', 0.9235568), ('image88.png', 0.94041604), ('image89.png', 0.90655136), ('image91.png', 0.97193605), ('image92.png', 0.92193484), ('image93.png', 0.9362149), ('image94.png', 0.91518164), ('image99.png', 0.93704534)], 'image97.png': [('image101.png', 0.9051003), ('image74.png', 0.91846454), ('image78.png', 0.9011791), ('image83.png', 0.90625954), ('image84.png', 0.90513825), ('image86.png', 0.9209131), ('image87.png', 0.94440794), ('image89.png', 0.9302972), ('image90.png', 0.932611), ('image94.png', 0.9218683), ('image95.png', 0.92592674), ('image98.png', 0.9679376)], 'image98.png': [('image101.png', 0.91163087), ('image83.png', 0.9336565), ('image84.png', 0.92819315), 
('image85.png', 0.9093032), ('image86.png', 0.9426492), ('image87.png', 0.94083), ('image88.png', 0.911821), ('image89.png', 0.936584), ('image90.png', 0.94770867), ('image92.png', 0.9120532), ('image94.png', 0.926751), ('image95.png', 0.94172406), ('image97.png', 0.9679376), ('image99.png', 0.91562355)], 'image99.png': [('image100.png', 0.9547266), ('image101.png', 0.91708857), ('image77.png', 0.90046567), ('image83.png', 0.90938056), ('image84.png', 0.9185119), ('image85.png', 0.9608704), ('image86.png', 0.9371631), ('image87.png', 0.94340634), ('image88.png', 0.94753677), ('image89.png', 0.9141985), ('image91.png', 0.9469298), ('image92.png', 0.9409566), ('image93.png', 0.930895), ('image94.png', 0.94931686), ('image96.png', 0.93704534), ('image98.png', 0.91562355)]}  

This is a surprisingly long list of putative duplicate images identified by the CNN implementation, which is benchmarked to be very good at finding near duplicates. The highest score for duplication is 0.9754471 and occurs twice, because each pair is reported in both directions: image86.png is detected as a putative duplicate of image83.png and vice versa. Now, it looks like we may have a hit for image duplication in this paper!
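Reading the top score off such a long dictionary by eye is error-prone. Here is a minimal sketch of how the highest-scoring pairs could be pulled out programmatically. The helper name top_pairs and the small sample dictionary are my own illustrations (the sample is a hand-copied excerpt of the output above), not part of the imagededup package.

```python
def top_pairs(duplicate_map, n=5):
    """Return the n highest-scoring unique pairs from a duplicates dictionary.

    duplicate_map has the shape printed above:
    {filename: [(other_filename, score), ...], ...}
    """
    best = {}
    for name, hits in duplicate_map.items():
        for other, score in hits:
            # (a, b) and (b, a) describe the same pair, so sort the names
            pair = tuple(sorted((name, other)))
            best[pair] = max(float(score), best.get(pair, 0.0))
    return sorted(best.items(), key=lambda item: item[1], reverse=True)[:n]


# Hand-copied excerpt of the output above, for illustration only
sample = {
    'image83.png': [('image86.png', 0.9754471), ('image84.png', 0.9320098)],
    'image86.png': [('image83.png', 0.9754471)],
    'image84.png': [('image83.png', 0.9320098)],
    'image91.png': [('image96.png', 0.97193605)],
    'image96.png': [('image91.png', 0.97193605)],
}

for (first, second), score in top_pairs(sample, n=3):
    print(first, second, score)
```

On the real data, the same function would presumably be called as top_pairs(duplicates_cnn), since the scores there are numeric values that float() can convert.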

Manual inspection of this close hit will help resolve the issue. The Python package has the functionality to allow such inspection.

 # plot duplicates obtained for a given file using the duplicates dictionary  
 from imagededup.utils import plot_duplicates  
 image_to_check='image83.png'  
 plot_duplicates(image_dir=image_dir,duplicate_map=duplicates_cnn,filename=image_to_check)  

The above code will open a matplotlib window that shows the original image at the top center. All of the putative duplicate images are shown below this image with the scores mentioned in brackets. Luckily for these faqeers, the images are not exact duplicates. However, the images are pretty darn close, as they are screenshots of the IGV software depicting the RNAseq data for the same PLGRKT gene in very different tissues and species. These analyses suggest that none of the supplementary figures are exact duplicates.
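Since many of the hits sit just above 0.9, one way to prioritize manual inspection is to keep only the strongest hits. The sketch below filters a duplicates dictionary by a stricter cutoff. The helper name filter_by_score and the cutoff of 0.97 are arbitrary choices of mine for illustration, and the sample is again a hand-copied excerpt of the output above.

```python
def filter_by_score(duplicate_map, min_score):
    """Keep only hits scoring at least min_score; drop images left with no hits."""
    filtered = {}
    for name, hits in duplicate_map.items():
        kept = [(other, score) for other, score in hits if score >= min_score]
        if kept:
            filtered[name] = kept
    return filtered


# Hand-copied excerpt of the output above, for illustration only
sample = {
    'image83.png': [('image84.png', 0.9320098), ('image86.png', 0.9754471)],
    'image84.png': [('image83.png', 0.9320098)],
    'image86.png': [('image83.png', 0.9754471)],
    'image1.png': [],
}

strict = filter_by_score(sample, min_score=0.97)
print(strict)
```

Each file name that survives the filter can then be passed to plot_duplicates one at a time for visual inspection.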

Monday, May 17, 2021

On the virtues of identifying the correct publication units and the ills of Salami slicing

Publishing a paper is much more than just doing good research and writing it up. It takes a lot of thought to identify the correct publication unit, craft a captivating story, and deliver it with the right tone. A vivid memory that has stuck in my mind is a lecture on publication ethics that discussed the idea of salami slicing. The phrase "salami-slicing" of a paper refers to splitting up a manuscript into numerous small pieces to increase publication count. At the time it seemed to me like the evilest thing that a scientist could do, and it reeked of greed and cunning.

Recent events have made me reconsider this quick judgment. Some alternative reasons why "salami-slicing" can happen are listed here (not aimed at justifying it):

  1. A story can become too long and convoluted when it is not limited to the proper amount of content.
  2. Reviewers might be inclined to write comments such as "This manuscript is about multiple things, and although the subjects are certainly appropriate for XYZ journal" etc. "At least 5 disparate projects are included in the paper...". Such comments can motivate, or rather ensure, splitting the manuscript into multiple parts.
  3. The cost of doing research continues to increase in most biology-related domains. Pouring all of these resources into one mega monolith might not be liked by funding bodies or other relevant authorities. Focus on paper count rather than quality or thoroughness of the research is a worrying prospect.

Having explained some background that doesn't justify "salami-slicing", let me provide details of what Patil et al. did. First, the manuscript titled "CoalQC - Quality control while inferring demographic histories from genomic data: Application to forest tree genomes", dealing with various technical aspects of PSMC, was posted on the bioRxiv repository in March 2020. Next, Patil et al. managed to publish the first part of the study in the journal Gene, titled "The genome sequence of Mesua ferrea and comparative demographic histories of forest trees", in October 2020. However, the technical parts dealing with repeats, genome assembly, and parameter settings remained unscrutinized by the powerful gaze of the intellects of peer reviewers. After struggling through numerous journals that were willing to publish the manuscript without demanding article processing charges (APCs), the second part is now published in the journal Heredity, titled "Repetitive genomic regions and the inference of demographic history".

The date of acceptance (17th April 2021) for this second part is of great significance. It came just days after the 130th birth anniversary (14th April) of an Indian anthropologist who wrote the book "WWTS?". Obviously, he is better known for his many other achievements. This book, published in 1946, is almost 300 pages long (including the appendices) and was sold at a cost of Rs. 12/8. Many things have changed in the years since. We now have WGS data to tell us about human population history. However, the spirit of the book and its relevance continue to haunt India. If nothing else, the book delves into the past and challenges many ideas held dearly and venerated by a few. This ability to challenge and question dogma is what distinguishes scientific thought from non-scientific thought. In some ways, the second part of the CoalQC manuscript challenges the existing demographic inference methodology. The fact that such critical evaluations of widely used methods are accepted and add to the discussion is of great value. The article by Patil et al. is now available online as "Repetitive genomic regions and the inference of demographic history". Some additional material from the pre-print that we never published forms the basis for a blog post (Leaping from frogs to plants - in quest of repeats) at the Nature Ecology and Evolution community.

Sunday, May 9, 2021

Coelacanth helps in the fight against HIV

The coelacanth has been called a "living fossil" in popular media as it is thought to have barely changed compared to its relatives in the fossil record. Despite the morphological similarity of the extant coelacanth to fossil specimens, considerable molecular evolution is likely to have occurred, as seen in other species that have been called living fossils. However, comparisons of nucleotide sequences have shown that the rate of evolution in this lineage is significantly lower than in tetrapod lineages. The reasons for the morphological stasis have been the focus of speculation and need greater investigation. Some groups of species (including the Coelacanth) that have been characterized as living fossils are species-poor and not easily amenable to molecular evolutionary analysis.

More than 95% of all extant fish species belong to the infraclass Teleostei (teleost fish), and many fish species that are commercially important or serve as model organisms (such as the zebrafish) belong to this group. These teleost fish belong to the class Actinopterygii (ray-finned fishes) and are known to have had a third round of whole-genome duplication (3R-WGD). Phylogenetic studies have consistently found that the Coelacanths and lungfish belong to the clade Sarcopterygii (lobe-finned fish) and share a more recent common ancestor with Tetrapods than with Actinopterygii. Importantly, the Coelacanths share the two rounds of whole-genome duplication (2R-WGD) found in tetrapods and lack the third round (3R-WGD) found in teleost fish. This close evolutionary relationship of Coelacanths with tetrapods has made them a useful model to study the transition of vertebrates from water to land.

Despite the challenges associated with the study of molecular evolution of Coelacanths, sequencing of its genome in 2013 helped uncover many interesting aspects. Availability of the Coelacanth genome played an important role in timing the whole-genome duplication events and provided clearer evidence to support the occurrence of vertebrate whole-genome duplication events. Identification of gene loss events in tetrapods compared to the Coelacanth highlighted several adaptive events that occurred during the transition from water to land. One of the interesting finds reported was the lack of IgM (Immunoglobulin-M) in the Coelacanth genome. Due to its strategic phylogenetic position, Coelacanth genes have been studied to understand the origin and diversification of gene families. Prominent examples include studies on the origin of the restriction factor tetherin and, more recently, the HERCs, which have made use of the Coelacanth gene sequences.

Ramdas et al. use an elaborate study design to investigate the SERINC family of restriction factors. During the course of their investigation, they find that one of the human paralogs, SERINC2, is not able to fight HIV, while the other four SERINCs do a good job of fighting HIV. Upon further investigation, they find that SERINC2 from the Coelacanth is able to deal with HIV and that this activity was lost in other lineages. One of the most interesting aspects of this study is the use of foamy viruses similar to the endogenous one recovered from the Coelacanth genome to evaluate the ability of SERINC2. The mechanism of action is also deciphered using sophisticated assays. You can read the final published version titled "Coelacanth SERINC2 inhibits HIV-1 infectivity and is counteracted by envelope glycoprotein from foamy virus" on the website of JVI.