Improve bleve search accuracy #9

Open
opened 11 months ago by decentral1se · 3 comments
Owner

https://git.vvvvvvaria.org/varia/go-sh-manymanuals/issues/2#issuecomment-955

I tried a few search queries today and my initial impression is that it's way too loose with its results, even when I directly specify a complete sentence from one PDF, like:
"TMS44100, TMS44100P, TMS46100, TMS46100P 4194304-WORD BY 1-BIT DYNAMIC RANDOM-ACCESS MEMORIES"

It should really only return tms44100.pdf, but it returned 5 pdfs.
There should be a way to understand when something should be loose and when something should be strict.

But thinking through what this software should do, I think we only really need strict search. If I search for "Hex Schmitt Trigger", I don't want synonyms or pseudonyms of "Trigger".

decentral1se added the bug label 11 months ago
Poster
Owner

I suspect it is due to the naive approach to indexing in https://git.vvvvvvaria.org/varia/go-sh-manymanuals/src/commit/9a8ff220d28e41b8351d187e12c43faded43315f/exp/bleve.go#L87-L89, which is just key = filename, value = plain text contents of the file. Do we need to process the content of the file a bit first and then generate the index from that? I'm really not good at information retrieval, maybe someone can help us with this 🤔

Poster
Owner

Just saw this fly by and is related: https://github.com/PaperCutSoftware/pdfsearch

One interesting part is that it can match terms from an index and then generate a PDF of the relevant pages on-the-fly for review. Unsure if they also include highlighting the actual text, but that could be possible also. Unsure how that could translate to a terminal environment.

And the code at https://github.com/peterwilliams97/pdf-search uses Bleve too.

Should we ever return to hack again 😆

Owner

> Should we ever return to hack again 😆

2024 is the promise
