Goal: I am building an app that lets users upload documents (PDFs), have them read by AI, and then populate the extracted values into the rest of the app.
Issue: Retool's AI action "Convert Document to Text" is returning an empty string for some PDF files.
Additional Info: The files are real estate contracts, and unfortunately I cannot share examples because of an NDA. However, I did some investigating and believe I know what the issue is: the files that work have highlightable (extractable) text throughout, whereas the files that fail do not. They are either partially or fully scanned, so they have mixed or no extractable text.
Steps Taken: I've tried other text extraction libraries (pypdf), and they have similar problems with the unextractable documents. They can recognize the number of pages in the document but return no text at all.
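The symptom described above (page count is found, text is empty) can be turned into a quick check for which pages lack a text layer. This is only a minimal sketch: it assumes pypdf is installed, and the helper names `pages_without_text` and `find_scanned_pages` and the `min_chars` threshold are made up for illustration, not part of pypdf or Retool.

```python
def pages_without_text(page_texts, min_chars=20):
    """Return the indices of pages whose extracted text is shorter than min_chars.

    min_chars is an arbitrary illustrative threshold; scanned pages usually
    yield an empty string or only a few stray characters.
    """
    return [
        index
        for index, text in enumerate(page_texts)
        if len((text or "").strip()) < min_chars
    ]


def find_scanned_pages(pdf_path, min_chars=20):
    """Open a PDF with pypdf and list the pages that probably need OCR."""
    # Imported here so the pure helper above works even without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    texts = [page.extract_text() for page in reader.pages]
    return pages_without_text(texts, min_chars)
```

Calling `find_scanned_pages("contract.pdf")` (a hypothetical file name) returns zero-based page indices. An empty list suggests the whole file has extractable text, while a full list suggests a fully scanned document.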
Pretty sure that OCR is needed for these types of files, but I would love some input if anyone can help me configure this to work with a text extraction AI without requiring an OCR setup.
Hi @daviswray, welcome to the forum!
I think this isn't possible with any built-in Retool AI Resource. As you mentioned, we may need an OCR setup. We could create a Workflow in Retool that does that using a Python library like:
Hey @Darren! A while indeed, forgot to update here - thanks for tagging me!
This became a much simpler task after I accepted that I'd need to use an OCR model. I looked into a few other OCR options (of which there are many now!) and ended up settling on Google's Document AI API. I chose this one mainly for its added ability to train and fine-tune parser models.
I was on the clock at the time to get that project out, so I didn't take full advantage of the custom-trained models and just used a generic pre-trained contract reader model, which works more than fine for my purposes and leaves the option to upgrade in the future.
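For anyone landing here later, a call to a Document AI processor from Python might look roughly like the sketch below. It is not the exact code from this project: it assumes the `google-cloud-documentai` package is installed and credentials are configured, `project_id`, `location`, and `processor_id` are placeholders for your own processor, and `entities_to_fields` is a hypothetical helper for flattening the result.

```python
def entities_to_fields(entities):
    """Group extracted entities into {entity_type: [mention_text, ...]}."""
    fields = {}
    for entity in entities:
        fields.setdefault(entity.type_, []).append(entity.mention_text)
    return fields


def process_pdf(pdf_bytes, project_id, location, processor_id):
    """Send PDF bytes to a Document AI processor; return (full_text, fields)."""
    # Imported here so the helper above can be used without the Google client.
    from google.api_core.client_options import ClientOptions
    from google.cloud import documentai

    # Document AI uses regional endpoints, e.g. "us" or "eu".
    client = documentai.DocumentProcessorServiceClient(
        client_options=ClientOptions(
            api_endpoint=f"{location}-documentai.googleapis.com"
        )
    )
    name = client.processor_path(project_id, location, processor_id)
    request = documentai.ProcessRequest(
        name=name,
        raw_document=documentai.RawDocument(
            content=pdf_bytes, mime_type="application/pdf"
        ),
    )
    document = client.process_document(request=request).document
    # document.text holds the OCR'd text; entities hold the parsed values.
    return document.text, entities_to_fields(document.entities)
```

The returned dictionary is one convenient shape for populating the rest of an app. Which entity types appear depends on the processor you pick, so check its schema before relying on specific keys.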