By Mark Gross, President, Data Conversion Laboratory, appearing in Book Business
Plagiarism in the scientific community is not new, but it has become a recurring theme in the past few years due to scandals both in STM journals and in more general publications. Why now? I can offer a few explanations. For one, there is greater pressure in the academic community to publish more and more content. The internet has also made that content widely accessible, putting it in front of far more readers. And computers can now do automated match-ups on the fly. To counter embarrassment, loss of prestige, and economic harm, testing for text-based plagiarism, especially in the STM world, is now regular practice. Many journals and societies test all incoming articles against publication databases to avoid future problems.
While text analysis has become common, identifying image-based plagiarism is more complex and is not currently done. We recently completed work on a pilot system demonstrating a process by which it can now be done.
Don’t Google and Facebook Already Do This With Facial Recognition?
What Google, Facebook, and others do in recognizing faces, landmarks, and the like is truly amazing, but recognizing plagiarized images is even more complex. Why? For facial recognition, you can model the target you are matching: faces have common features, which can be mapped and matched against the image collection being searched. In addition, facial recognition applications do not usually require comparing an image against a multi-million-image collection. Facebook, for example, knows who your friends are, which simplifies the matching to the nearest neighbors: friends, and then friends of friends, are checked before strangers. Likewise, with security applications at airports and other sensitive locations, you have reasonable knowledge of what you are looking for; the database of “persons of interest” is relatively small (tens of thousands, not millions).
However, when identifying images for potential plagiarism, there is no a priori model to which to compare – the image might be anything.
“Identity” vs. “Similarity”
Another complication is that the image may have been altered. Identifying two identical images is relatively easy. Identifying variations is more complex. Remember in sixth grade when kids would copy a paragraph from an encyclopedia (or now Wikipedia), try to change a few words, and assume the teacher wouldn’t notice? That is essentially what we are trying to catch, but with images. Of course, determining that two similar images are copies vs. happenstance can be quite subjective.
To be useful, the process needed to match images that have been altered, not just identical images. The following are image transformations we wanted to consider:
- Occlusion (a portion of the image is blocked or covered)
- Color remapping
- Cropping (just a portion of the image appears)
- Translation (an image has been moved within the frame)
- Rotation or flipping to a mirror image
Our approach used a technique called perceptual hashing. Hashing is a mathematical technique (an algorithm) that reduces a complex image to a signature called a “hash value”. There are many different hashing algorithms, each with different properties; to improve the quality of matching, we used a combination of them. The system as currently constructed allows a user to upload a batch of images, which are then compared against the entire database, with the resulting “candidate matches” returned to the user. A match score is computed for each candidate; the lower the number, the more likely the match.
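To make the idea concrete, here is a minimal sketch of one well-known perceptual-hashing algorithm, the “average hash” (illustrative only, not the specific combination of algorithms our system uses). It assumes the image has already been reduced to an 8x8 grayscale grid, and scores matches by Hamming distance, so lower numbers mean closer matches:

```python
# Minimal "average hash" sketch. Assumes the image has already been
# reduced to an 8x8 grayscale grid (real systems resize and
# gray-convert first).

def average_hash(pixels):
    """pixels: flat list of 64 grayscale values (0-255) -> 64-bit hash."""
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits

def match_score(hash_a, hash_b):
    """Hamming distance between two hashes; lower means a likelier match."""
    return bin(hash_a ^ hash_b).count("1")

original  = [30] * 32 + [200] * 32   # dark top half, bright bottom half
remapped  = [60] * 32 + [230] * 32   # same structure, brightness shifted
unrelated = [30, 200] * 32           # alternating checker-like pattern

print(match_score(average_hash(original), average_hash(remapped)))   # 0
print(match_score(average_hash(original), average_hash(unrelated)))  # 32
```

Because the hash encodes only each pixel's relationship to the image's mean brightness, a uniform brightness or color shift leaves the hash unchanged; that is why the remapped image above still scores zero, while the structurally different image scores high.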
The following illustrates an example result. The first image is the uploaded image, and the system identified the four most likely matches. The first matched image, with a score of zero, is actually an identical image that was on the database; the next three images have minor variations, with progressively higher scores of 2, 10, and 13.
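The ranking step described above can be sketched in the same spirit. The database layout, names, and scoring here are hypothetical, assuming hashes have already been computed for every stored image:

```python
# Illustrative sketch of returning ranked "candidate matches" for an
# uploaded image's hash (names and data layout are hypothetical).

def hamming(a, b):
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")

def candidate_matches(query_hash, database, k=4):
    """database: {image_id: hash}. Return the k lowest-scoring candidates."""
    scored = sorted((hamming(query_hash, h), image_id)
                    for image_id, h in database.items())
    return scored[:k]

db = {"exact-copy": 0b1111, "close-variant": 0b1110, "unrelated": 0b0000}
print(candidate_matches(0b1111, db, k=2))
# [(0, 'exact-copy'), (1, 'close-variant')]
```

A score of zero, as in the example result above, indicates a bit-identical hash, which is what an exact duplicate in the database produces; minor variations show up as small positive scores.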
While the focus of this system, and this article, is identifying plagiarized images, a similar technique would apply wherever there is a need to find matching content in large databases and in the big data world. Some potential applications include:
- Identifying when songs or advertisements are played on radio
- Finding unauthorized images on the internet
- Authentication of signatures and documents
- Detection of watermarks
- Searching the web for trademark usage
We are taught early in our schooling that plagiarism is unacceptable, a violation of the basic principles of intellectual thought and research. But the definition of plagiarism, and what it extends to, is evolving fast. Content is more image-heavy than ever: every click counts, and presentation carries increased weight to that end. Because of this, expect to see image plagiarism emerge as an important topic in the coming months and years. The hunt is on.
About the Author
Mark Gross, president of Data Conversion Laboratory (DCL), is an authority on XML implementation and document conversion. Prior to joining DCL in 1981, Gross was with the consulting practice of Arthur Young & Co. He has a B.S. degree in Engineering from Columbia University and an MBA from New York University. He also has taught at the New York University Graduate School of Business, the New School, and Pace University. He is a frequent speaker on the topic of automated conversions to XML and SGML.