Extraction and labeling high-resolution images from PDF documents
24 March 2014
Proceedings Volume 9021, Document Recognition and Retrieval XXI; 90210Q (2014) https://doi.org/10.1117/12.2042336
Event: IS&T/SPIE Electronic Imaging, 2014, San Francisco, California, United States
Abstract
The accuracy of content-based image retrieval depends on image resolution, among other factors: higher-resolution images enable extraction of image features that more accurately represent the image content. To improve the relevance of search results in our biomedical image search engine, Open-I, we have developed techniques to extract and label high-resolution versions of figures from biomedical articles supplied in PDF format. Open-I uses the open-access subset of biomedical articles from the PubMed Central repository hosted by the National Library of Medicine. Articles are available in XML and in publisher-supplied PDF formats. Because these PDF documents contain little or no metadata identifying the embedded images, the task includes labeling each image with its figure number in the article once it has been successfully extracted. For this purpose we use the labeled small-size images provided with the XML web version of the article. This paper describes the image extraction process and two alternative approaches to image labeling, each based on a different measure of similarity between two images: one compares the projections of image intensity onto the coordinate axes, and the other uses the normalized cross-correlation between the intensities of the two images. Using image identification based on image intensity projection, we achieved a precision of 92.84% and a recall of 82.18% in labeling the extracted images.
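The two similarity measures named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the resampling of projection profiles to a fixed length, the nearest-neighbor downscaling, the use of zero-mean correlation as the score, and the acceptance threshold are all assumptions made for illustration. Images are assumed to be grayscale and given as lists of rows of intensities.

```python
from math import sqrt

def pearson(a, b):
    """Zero-mean normalized correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    da = [x - ma for x in a]
    db = [y - mb for y in b]
    denom = sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    # A constant sequence has no variance; treat it as uncorrelated.
    return sum(x * y for x, y in zip(da, db)) / denom if denom else 0.0

def resample(profile, n):
    """Linearly interpolate a 1-D profile to n samples (assumed step)."""
    m = len(profile)
    if m == 1:
        return [float(profile[0])] * n
    out = []
    for i in range(n):
        pos = i * (m - 1) / (n - 1) if n > 1 else 0.0
        lo = int(pos)
        hi = min(lo + 1, m - 1)
        frac = pos - lo
        out.append(profile[lo] * (1 - frac) + profile[hi] * frac)
    return out

def projections(img):
    """Mean intensity of each row and of each column."""
    rows = [sum(r) / len(r) for r in img]
    cols = [sum(c) / len(c) for c in zip(*img)]
    return rows, cols

def projection_similarity(a, b, n=32):
    """Average correlation of the row and column projection profiles."""
    ra, ca = projections(a)
    rb, cb = projections(b)
    row_score = pearson(resample(ra, n), resample(rb, n))
    col_score = pearson(resample(ca, n), resample(cb, n))
    return (row_score + col_score) / 2

def resize_nearest(img, h, w):
    """Nearest-neighbor resize to h rows by w columns."""
    src_h, src_w = len(img), len(img[0])
    return [[img[y * src_h // h][x * src_w // w] for x in range(w)]
            for y in range(h)]

def ncc_similarity(a, b):
    """Normalized cross-correlation after resizing b to the size of a."""
    b_small = resize_nearest(b, len(a), len(a[0]))
    flat_a = [p for row in a for p in row]
    flat_b = [p for row in b_small for p in row]
    return pearson(flat_a, flat_b)

def label_images(extracted, thumbnails, score=projection_similarity,
                 threshold=0.9):
    """Give each extracted image the label of its best-scoring thumbnail.

    The threshold is a hypothetical cutoff; images scoring below it
    stay unlabeled.
    """
    labels = {}
    for name, img in extracted.items():
        scored = [(score(thumb, img), lab) for lab, thumb in thumbnails.items()]
        best, best_label = max(scored)
        if best >= threshold:
            labels[name] = best_label
    return labels
```

Here `thumbnails` would map figure labels to the small labeled images from the XML version, and `extracted` would map arbitrary names to the images pulled from the PDF. Comparing one-dimensional projections is cheaper than a full pixel-wise correlation and does not require resizing both images to a common grid, which is one plausible reason to consider it as an alternative.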
© (2014) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Suchet K. Chachra, Zhiyun Xue, Sameer Antani, Dina Demner-Fushman, and George R. Thoma "Extraction and labeling high-resolution images from PDF documents", Proc. SPIE 9021, Document Recognition and Retrieval XXI, 90210Q (24 March 2014); https://doi.org/10.1117/12.2042336
KEYWORDS
Feature extraction

Image resolution

Biomedical optics

Image processing

Visualization

Image retrieval

Standards development
