Crawl to vector feature. Does it work for you?

Hi All,

vector crawler (cloud version) questions:

  • does it work for you?
  • it most definitely does not go "1000 pages deep"; there seems to be an approx. 50-page limit, after which the crawler stops.
  • does excluding work only for sub-domains, not for sub-folders or specific URLs?
  • is the only way to crawl consistently to use a third-party API and a workflow (with Crawl4AI, Firecrawl, etc.)?

Hi @Kimi! Thanks for reaching out.

In my (relatively limited) experience with this feature, I haven't seen the issues you're describing. That said, I tried it out this morning, and the behavior does seem to have changed a bit.

I reached out to the team and got some clarification: the published parameters were having an outsized impact on our cloud infrastructure and performance, so they've become more restrictive as a result. The new limit is 50 total pages, with a 15-second timeout per page. The banner in the UI will be updated to reflect these changes in the next cloud release!

On the other hand, there haven't been any changes to the way pages are excluded: Retool does simple string matching to determine whether a given page should be vectorized. Any page whose URL starts with the specified string will be excluded, so sub-folders and specific URLs can be excluded too, not just sub-domains.
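For illustration, here's a minimal sketch of that prefix-matching logic. The function name and example URLs are hypothetical, not Retool's actual implementation:

```typescript
// A minimal sketch of the prefix-matching exclusion described above.
// isExcluded and the URLs below are illustrative, not Retool's code.
function isExcluded(pageUrl: string, excludePrefixes: string[]): boolean {
  // A page is skipped if its URL starts with any excluded string.
  return excludePrefixes.some((prefix) => pageUrl.startsWith(prefix));
}

// Excluding a sub-folder and a sub-domain via the same prefix rule.
const exclusions = ["https://example.com/blog/", "https://docs.example.com"];

console.log(isExcluded("https://example.com/blog/post-1", exclusions)); // true
console.log(isExcluded("https://example.com/pricing", exclusions));     // false
console.log(isExcluded("https://docs.example.com/start", exclusions));  // true
```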

Given these restrictions and the specifics of your use case, it may or may not make sense to use a third-party API instead; see the sketch below. I'm happy to answer any additional questions you might have in order to find a solution for your particular use case!
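If you do go that route, a workflow code step that kicks off a crawl might look roughly like this. The endpoint and body shape follow Firecrawl's v1 crawl API as I understand it, but the API key, target URL, and page limit are placeholders, so please verify the request shape against their current docs:

```typescript
// Rough sketch of starting a crawl job from a workflow code step.
// Based on Firecrawl's documented v1 crawl endpoint; confirm the
// request/response shape against their current docs before relying on it.
const FIRECRAWL_API_KEY = "fc-YOUR_KEY_HERE"; // placeholder

async function startCrawl(siteUrl: string): Promise<string> {
  const res = await fetch("https://api.firecrawl.dev/v1/crawl", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${FIRECRAWL_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      url: siteUrl,
      limit: 500, // crawl well past the cloud feature's 50-page cap
    }),
  });
  if (!res.ok) {
    throw new Error(`Crawl request failed: ${res.status}`);
  }
  const data = await res.json();
  return data.id; // job id, polled later to retrieve the crawled pages
}

// Usage: start the job, then poll GET /v1/crawl/{id} for results,
// and feed the returned page content into your vector store.
startCrawl("https://example.com").then((id) => console.log("Crawl job:", id));
```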