Upload PDF so AI can read

Hey everyone!
I'm trying to create an app where I can upload a few PDF files and have the AI go through them and summarize them, but I'm running into a problem: after upload, the PDF files are just too big for the AI models.

So far I have tried playing around in the Workflow area to see if I can add some JS code that cuts the base64 into chunks, but had no success.

I am using the upload component and trying to get a response from the AI right after uploading.
Does anyone have an idea how to get around the AI's character limit?

I think you could use pdf-lib to split the file:

import { PDFDocument } from 'pdf-lib';

// Split one PDF into several smaller PDFs of at most `pagesPerFile` pages each.
// Returns an array of Uint8Array, one per new PDF.
async function splitPdf(pdfBytes, pagesPerFile) {
  const pdfDoc = await PDFDocument.load(pdfBytes);
  const totalPages = pdfDoc.getPageCount();
  const result = [];

  for (let start = 0; start < totalPages; start += pagesPerFile) {
    const end = Math.min(start + pagesPerFile, totalPages);
    // Page indices for this chunk: start, start + 1, ..., end - 1.
    const indices = Array.from({ length: end - start }, (_, k) => start + k);

    const newPdf = await PDFDocument.create();
    const copiedPages = await newPdf.copyPages(pdfDoc, indices);
    copiedPages.forEach((page) => newPdf.addPage(page));

    result.push(await newPdf.save());
  }
  return result;
}
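
Since the upload component was said to hand over base64, the bytes going into and coming out of splitPdf probably need converting. Here is a minimal sketch of that conversion, assuming an environment where `btoa` and `atob` are available (browsers, recent Node). The helper names `base64ToBytes` and `bytesToBase64` are made up for this example and are not part of pdf-lib.

```javascript
// Hypothetical helpers (not part of pdf-lib): convert between a base64 string
// and the Uint8Array that splitPdf returns for each new file.
function base64ToBytes(base64) {
  const binary = atob(base64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return bytes;
}

function bytesToBase64(bytes) {
  let binary = '';
  const step = 0x8000; // build the string in slices to avoid huge argument lists
  for (let i = 0; i < bytes.length; i += step) {
    binary += String.fromCharCode(...bytes.subarray(i, i + step));
  }
  return btoa(binary);
}
```

With these, something like `(await splitPdf(base64ToBytes(uploaded), 10)).map(bytesToBase64)` would give one base64 string per smaller PDF, where `uploaded` stands for whatever base64 value your upload component provides.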

You will need to use a library, though, because just splitting the base64 string won't work: parts of the string hold data that belongs to the PDF file as a whole. You can visualize it like this:

const my_base64_str = 'asdfghjklqwertyuiop1234567890kljhgfd'

// if the above is your base64 string then
// asdfghj     | klqwertyuiop1234567890k | ljhgfd
// file info   | file data               | file closer

// if you were to split this:
// split0:  asdfghj     | klqwertyuiop
// split1:  1234567890k | ljhgfd

// we can try to designate each split as a .pdf, but whatever opens the file looks for both the 'file info' and 'file closer' sections in each split and can't find them, so you end up with 2 corrupted files.
// for a raw split like this to work, the pieces would have to be put back together on the server side:
// you would split the pdf on the client/browser side, then send both splits to the server/backend, which combines all the splits into a single file to send to a vector service.

// since we're unable to manage the backend, you need to create 2 new, complete pdf files from the 1 original and then send both files, all from the client/browser.
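
To see the same thing in runnable form, here is a small sketch using a toy string in place of a real PDF. It relies only on the fact that PDF files start with `%PDF-` and end with `%%EOF`. The `fakePdf` content and the `looksLikePdf` check are invented for illustration, and a real validity check would involve much more than these two markers.

```javascript
// Toy stand-in for a PDF: real files start with "%PDF-" and end with "%%EOF".
const fakePdf = '%PDF-1.7\n1 0 obj\n<< /Type /Catalog >>\nendobj\ntrailer\n%%EOF';
const base64 = btoa(fakePdf);

// Naive split of the base64 string, aligned to 4 characters so each half still decodes.
const mid = Math.floor(base64.length / 2 / 4) * 4;
const parts = [base64.slice(0, mid), base64.slice(mid)];

// Hypothetical check: does the decoded data have both the header and the end marker?
function looksLikePdf(b64) {
  const text = atob(b64);
  return text.startsWith('%PDF-') && text.trimEnd().endsWith('%%EOF');
}

const wholeOk = looksLikePdf(base64);    // the full string has both markers
const partsOk = parts.map(looksLikePdf); // each half is missing one of them

console.log(wholeOk, partsOk);
```

The first half keeps the header but loses the end marker, and the second half is the other way around, which is why each piece has to be rebuilt as its own complete PDF.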

If I understand correctly, I need to run the AI on each chunk, keep the results from all of the chunk runs, and then merge them. Right?
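
That flow (one AI call per chunk, then one more call to merge) could be sketched roughly as below. Both `summarizeChunk` and `mergeSummaries` are hypothetical async functions standing in for whatever AI call the app exposes. Nothing here is a real API of the platform.

```javascript
// Sketch only: `summarizeChunk(chunk)` and `mergeSummaries(text)` are
// hypothetical async callbacks that wrap your own AI calls.
async function summarizeInChunks(chunks, summarizeChunk, mergeSummaries) {
  // Step 1: summarize each chunk on its own so every request stays under the limit.
  const partials = [];
  for (const chunk of chunks) {
    partials.push(await summarizeChunk(chunk));
  }

  // Step 2: with a single chunk there is nothing to merge.
  if (partials.length === 1) {
    return partials[0];
  }

  // Step 3: send the partial summaries together in one final request.
  return mergeSummaries(partials.join('\n\n'));
}
```

One caveat: if there are many chunks, the joined partial summaries can themselves exceed the limit, in which case the merge step would need to be repeated in batches.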

Hi @rone,

Did you already move forward with testing out this workaround? Thanks for your input, @bobthebear !