I’d like to explore a project for NLP-based search on the fediverse. But I’m a fediverse beginner and am not sure whether it’s possible to index fediverse content.

My general idea is:

  1. Set up my own read-only instance, let’s say of kbin. I’m not sure if the concept of a read-only instance makes sense. It’s read-only because the instance only needs to be able to read the content already on the fediverse and doesn’t need the ability to post content.
  2. At some regular interval, let’s say once a day, monitor any changes in the content from the previous run. I’m not sure if there is a single “fediverse” where all the content can be read from. If not, then I can start with tracking the same content as on kbin.social. Is it possible to monitor changes to content on a kbin instance?
  3. I’ll convert the content into vector embeddings using an ML embedding model such as CLIP. The embeddings will be stored in a vector store, along with the URL of the content as metadata.
  4. When a user requests a search, the search term is converted to its vector embedding using the same ML model and the most similar vectors are identified.
  5. The user gets the search results as urls of the most relevant content, and perhaps a preview of the content. The user can then access the full content from where it’s originally posted using its url.
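
Steps 3 to 5 can be sketched roughly as follows. This is a minimal illustration, not a real implementation: `embed` is a hypothetical placeholder (a hashed bag-of-words vector) standing in for whatever embedding model is actually used, and `VectorStore` is a toy in-memory class rather than any particular vector database. The store keeps only the URL and the vector, never the original text.

```python
import hashlib
import math
from dataclasses import dataclass

DIM = 64  # toy embedding size (assumption); a real model defines its own


def embed(text: str) -> list[float]:
    """Placeholder embedding: hashed bag-of-words, L2-normalised.

    Hypothetical stand-in for a real embedding model.
    """
    vec = [0.0] * DIM
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


@dataclass
class Entry:
    url: str             # metadata returned to the user
    vector: list[float]  # embedding only; the text itself is not stored


class VectorStore:
    """Toy in-memory store using brute-force cosine similarity."""

    def __init__(self) -> None:
        self.entries: list[Entry] = []

    def add(self, url: str, text: str) -> None:
        self.entries.append(Entry(url, embed(text)))

    def search(self, query: str, k: int = 3) -> list[tuple[str, float]]:
        q = embed(query)
        # Vectors are unit length, so the dot product is the cosine similarity.
        scored = [
            (e.url, sum(a * b for a, b in zip(q, e.vector)))
            for e in self.entries
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]
```

A search then returns `(url, score)` pairs, and the user follows the URL to read the post where it was originally published. The example URLs used to exercise this sketch are made up.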

I’m comfortable with setting up steps 3 and 4. But I do not know the fediverse well enough to answer whether steps 1, 2, and 5 would work, or even make sense, the way I’m envisioning them.
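
For step 2, one way to detect changes between runs that does not depend on any particular instance software is to store a hash of each post and compare it with the previous run. The sketch below assumes the posts have already been fetched somehow as a mapping of URL to text; how to fetch them from a kbin instance is exactly the open question, so no fetching code is shown. The names `fingerprint` and `diff_runs` are made up for illustration.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Stable content hash used to notice edits between runs."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def diff_runs(previous: dict[str, str], current_posts: dict[str, str]):
    """Compare this run with the last one.

    previous:      url -> fingerprint saved by the previous run
    current_posts: url -> text fetched during this run
    Returns the new fingerprint map plus sorted lists of new, changed,
    and removed URLs.
    """
    current = {url: fingerprint(text) for url, text in current_posts.items()}
    new = sorted(u for u in current if u not in previous)
    changed = sorted(
        u for u in current if u in previous and current[u] != previous[u]
    )
    removed = sorted(u for u in previous if u not in current)
    return current, new, changed, removed
```

Only the new and changed URLs would need to be re-embedded. The removed list is only meaningful if every run fetches the full set of posts being tracked.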

Can some of the fediverse veterans help me understand if this is a feasible approach or if I’ve got it all wrong?

  • ofcourse@kbin.social (OP) · 2 years ago

    Thanks for sharing your insights.

    I’m curious why instances offering free search get defederated. I would have guessed everyone wants better search. Is it because of privacy concerns, or because instances don’t want to be indexed or have traffic directed elsewhere?

    I was hoping that indexing only to produce embeddings (which should make it hard to recreate the original content) and sharing only URLs to the content would eliminate the privacy and traffic concerns.

    I’m still in the process of understanding how, and whether, this would work. It’s only a personal project at this stage, but you are right that CPU/GPU and vector stores are things I’d need to consider.