TOPIC:

Programmatic way to obtain list of typefaces in an image 4 years 4 months ago #1101

  • EdwardBishop
My first post in this forum.

I am looking to write a program that will use OCR (the bit I don't want to and can't write) to list all the fonts found, and if possible their sizes, in an image. The point is to compare one image with another to calculate the probability that they are part of the same document.

Suppose I have a multi-page PDF where each page is just a TIFF, a scan of a paper document, but some sequences of pages are pages from the same document, these sub-documents being of varying number of pages. I want to try to find the places where one document ends and a new one begins, by comparing the fonts and sizes found in each page with those found in the next.

Does any version of Find my Font offer an OCR code library, API or similar I could call from my program to obtain this list?

My program would work better if I could find out how many characters in each font/size combination were found.
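To make the idea concrete, here is a minimal sketch of the comparison step only, assuming some other tool has already produced, for each page, a count of characters per (font, size) pair. The function names, the sample counts and the 0.5 threshold are all made up for illustration; cosine similarity is just one plausible way to compare two pages.

```python
from collections import Counter
from math import sqrt


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two (font, size) -> character-count histograms."""
    dot = sum(count * b[key] for key, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty page carries no font evidence
    return dot / (norm_a * norm_b)


def find_boundaries(pages, threshold=0.5):
    """Return the indices of pages that probably start a new document."""
    return [
        i
        for i in range(1, len(pages))
        if cosine_similarity(pages[i - 1], pages[i]) < threshold
    ]


# Hypothetical per-page counts (assumed output of a font-recognition step).
pages = [
    Counter({("Arial", 14): 900, ("Arial Bold", 14): 60}),
    Counter({("Arial", 14): 1100, ("Arial Bold", 14): 20}),
    Counter({("Courier", 12): 1500}),
    Counter({("Courier", 12): 1400, ("Arial", 14): 30}),
]
boundaries = find_boundaries(pages)
print(boundaries)
```

With these sample counts only the third page (index 2) falls below the threshold, since it shares no font/size pair with the page before it. The threshold would need tuning on real scans.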

I'd be very grateful for thoughts and suggestions. Thank you

Programmatic way to obtain list of typefaces in an image 4 years 4 months ago #1102

  • fivos
Hi Edward,

The short answer is: we don't offer any OCR code library or API for the Find my Font identification engine.
Even if we provided such an API, I think it would not be adequate for your needs. You would need more than the Find my Font engine to achieve the goals of the program you have in mind.
OCR stands for Optical Character Recognition, i.e. the ability to identify the letters in an image. Find my Font doesn't do that. We ask the user to tell us which letters are selected, and then we use the Find my Font engine to identify the font those letters are set in.

Think also about these points:
=> A document that uses the regular, bold & italic versions of the same family could have some of them erroneously identified under a different font name (although in most cases the identified fonts will correctly share the same base family name).
=> If you only need to spot the font changes (and not the exact font family names), you may be able to do so by using an open source OCR engine (like Google's OCR or Tesseract) to identify the letters of each text line, and then extracting & directly comparing the actual images of the same letters from each line to see how they differ (this will of course be effective only if the font size does not change dramatically between the 2 lines).
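The second point above could be sketched roughly as follows. This covers only the comparison step and assumes the OCR engine has already given you the cropped image of each letter as a small black-and-white bitmap (a list of rows of 0/1 pixels). The function names and the 16x16 grid are illustrative choices, not part of any OCR library.

```python
def resize(glyph, height, width):
    """Nearest-neighbour resample of a 2D list of 0/1 pixels."""
    src_h, src_w = len(glyph), len(glyph[0])
    return [
        [glyph[r * src_h // height][c * src_w // width] for c in range(width)]
        for r in range(height)
    ]


def glyph_difference(a, b, size=16):
    """Fraction of pixels that differ after scaling both glyphs to size x size."""
    ra, rb = resize(a, size, size), resize(b, size, size)
    differing = sum(
        pa != pb
        for row_a, row_b in zip(ra, rb)
        for pa, pb in zip(row_a, row_b)
    )
    return differing / (size * size)


# Toy bitmaps standing in for the same letter cropped from two text lines.
thin = [[0, 1, 0, 0] for _ in range(4)]
bold = [[0, 1, 1, 0] for _ in range(4)]
# The thin glyph again, with every pixel doubled (as if set at twice the size).
thin_doubled = [[p for p in row for _ in range(2)] for row in thin for _ in range(2)]

same_font_score = glyph_difference(thin, thin_doubled)
other_font_score = glyph_difference(thin, bold)
print(same_font_score, other_font_score)
```

Scaling both glyphs to a common grid absorbs moderate size differences, which is why the doubled glyph scores as identical here. Real scans are noisy, so in practice you would compare against a tolerance rather than expect a score of exactly zero.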

I wish you all success in your project :)

PS1: If your budget allows it, also consider trying a commercial OCR engine like ABBYY FineReader. If my memory serves me well, some commercial OCR engines will also give you some extra info about the font family type (like Serif, Sans Serif, Bold, Italic, etc.), which could prove to be very useful.
PS2: If your multi-page PDFs are not yet created, you should insist on scanning the original pages at a relatively high resolution like 400 dpi, 600 dpi or more. A PDF page image with a resolution of 100 or 200 dpi will probably give you very poor results when trying to identify font family changes.
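To give a feel for why resolution matters, here is a small arithmetic sketch. It relies only on the standard definition of a point as 1/72 inch; the function name is made up, and the figure is the full em height, so the visible letters will cover fewer pixels than this.

```python
def em_height_px(point_size, dpi):
    """Height in pixels of the em square for a given point size and scan resolution."""
    return point_size * dpi / 72  # 1 pt = 1/72 inch


for dpi in (100, 200, 400, 600):
    print(f"12pt text at {dpi} dpi: about {em_height_px(12, dpi):.1f} px per em")
```

At 100 dpi a 12pt em is under 17 pixels tall, which leaves very little detail to distinguish one typeface from another, while at 600 dpi the same em spans 100 pixels.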

Fivos Vilanakis - Softonium Developments CTO

Programmatic way to obtain list of typefaces in an image 4 years 3 months ago #1104

  • EdwardBishop
Thank you very much for your generosity in giving such a detailed and thought-through reply, Fivos. I will look into Google OCR and Tesseract. I do understand that font recognition is not the same as OCR, and for this project I do not need to recognise any text. The ideal would be for my program or another one to be able to say, for example: 'the following 3 typefaces are used on this page: Arial 14pt, Arial bold 14pt, Courier 12pt'.

Unfortunately the scanning has already been done - millions of pages of it - but the quality is fair.

With many thanks again and wishing you a Happy New Year
