Converting a complex form to JSON Schema with ResponseVault

They're actually PDF files containing insurance policies, some of which run closer to 300 pages. It's a little different since I'm not using forms, but I think the PDF page problem (and solution) might be the same for both of us. I've almost got something I can test, so I'll see how that goes, but I'll def hit ya up in a couple days to see what else we could come up with!

Right now, I can open a PDF from a base64 string, then go through every page, pull the text, and render an image of the page. With the page images, if there are only a couple I'll use gpt-4o to perform OCR on each one, but if there are a lot of pages I cheat cause I'm cheap :joy: and use PDFMiner.six to try and get text, layout, metadata, and whatever else I can.
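
Roughly what that pass looks like as a sketch, assuming PyMuPDF (fitz) for the rendering/text side and pdfminer.six as the cheap fallback; the page threshold, prompt, and function names are just placeholders I made up:

```python
import base64
import io

import fitz  # PyMuPDF: page rendering + embedded text layer
from openai import OpenAI
from pdfminer.high_level import extract_text

client = OpenAI()
OCR_PAGE_LIMIT = 10  # made-up threshold: above this, gpt-4o OCR gets too pricey


def ocr_page(png_bytes: bytes) -> str:
    """Send one rendered page to gpt-4o and get the transcribed text back."""
    image_b64 = base64.b64encode(png_bytes).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Transcribe all text on this page verbatim."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content


def extract_pages(b64_string: str) -> tuple[list[dict], str | None]:
    """Decode the base64 PDF and pull text + a PNG image from every page.
    Returns per-page records plus optional whole-document fallback text."""
    pdf_bytes = base64.b64decode(b64_string)
    doc = fitz.open(stream=pdf_bytes, filetype="pdf")

    pages = [{
        "number": page.number + 1,
        "text": page.get_text(),                           # embedded text layer, if any
        "image": page.get_pixmap(dpi=150).tobytes("png"),  # rasterized page
    } for page in doc]

    if len(pages) <= OCR_PAGE_LIMIT:
        # Few pages: spend the money and OCR each one with gpt-4o
        for p in pages:
            p["ocr_text"] = ocr_page(p["image"])
        return pages, None

    # Many pages: cheap route, let pdfminer.six dig out whatever text it can
    return pages, extract_text(io.BytesIO(pdf_bytes))
```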

I think all that's left is to combine the OCR output with the original text I pulled and clean it up a bit before creating my embeddings, which I can use to filter out everything unrelated, then pass what survives along with the query to a model to form a response.
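
For the embed-and-filter part, a minimal sketch assuming OpenAI's text-embedding-3-small and plain cosine similarity with numpy (k and the function names are mine):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()


def embed(texts: list[str]) -> np.ndarray:
    """Embed a batch of text chunks; returns a (len(texts), dim) matrix."""
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in response.data])


def top_chunks(query: str, chunks: list[str], k: int = 5) -> list[str]:
    """Keep only the k chunks most similar to the query; drop everything unrelated."""
    chunk_vecs = embed(chunks)
    query_vec = embed([query])[0]
    # cosine similarity = dot product / (chunk norm * query norm)
    sims = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    best = np.argsort(sims)[::-1][:k]  # indices of the k highest similarities
    return [chunks[i] for i in best]
```

Then whatever `top_chunks` returns is the only context I'd stuff into the final prompt.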

I haven't quite figured out if I should use each page as a 'chunk' itself, or combine everything from all pages and then chunk it by sections. Sometimes a section covers many pages, so if each page is a chunk, I'm worried the 2nd or 3rd page of a section might not be strongly associated with the 1st page, and a query could end up being answered using only 1 out of 3 relevant pages. But if I chunk based on 'sections', it's possible all 200 pages get accidentally treated as one section (or actually are one section), which lands me in the same problem: a context that's too large.
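
One middle ground I might try (my own idea, nothing fancy): fixed-size chunks with overlap, so no chunk can outgrow the context window and consecutive chunks still share text across page/section boundaries. Sizes here are arbitrary:

```python
def chunk_with_overlap(full_text: str, size: int = 2000, overlap: int = 400) -> list[str]:
    """Fixed-size chunks with overlap: caps every chunk's length no matter how
    long a section runs, while the overlap keeps adjacent chunks loosely stitched."""
    assert overlap < size  # otherwise the loop below never advances
    chunks = []
    start = 0
    while start < len(full_text):
        chunks.append(full_text[start:start + size])
        start += size - overlap
    return chunks
```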

I also haven't considered how to handle holding hundreds of images in memory; I'll probably hit a Retool/workflow limit at some point.
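
One idea for that (a sketch, reusing the PyMuPDF approach from above): render pages lazily with a generator, so only one image is alive at a time instead of hundreds:

```python
import base64

import fitz  # PyMuPDF


def iter_page_images(pdf_bytes: bytes):
    """Yield (page_number, png_bytes) one page at a time, so each image can be
    processed and garbage-collected before the next one is rendered."""
    doc = fitz.open(stream=pdf_bytes, filetype="pdf")
    for page in doc:
        yield page.number + 1, page.get_pixmap(dpi=150).tobytes("png")


# Usage: peak memory stays at roughly one rendered page instead of the whole PDF.
# b64_string is the base64 payload from earlier; handle_page is a hypothetical
# per-page handler (OCR, upload, etc.)
for number, png in iter_page_images(base64.b64decode(b64_string)):
    handle_page(number, png)
```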
