Improve bleve search accuracy #9


decentral1se · 2023-06-03T19:51:07+02:00

decentral1se commented

2023-06-03 19:51:07 +02:00

I tried a few search queries today and my initial impression is that it's way too loose with its results, even when I directly specify a complete sentence from one PDF, like:
"TMS44100, TMS44100P, TMS46100, TMS46100P 4194304-WORD BY 1-BIT DYNAMIC RANDOM-ACCESS MEMORIES"

It should really only return tms44100.pdf, but it returned 5 pdfs.
There should be a way to understand when something should be loose and when something should be strict.

But thinking through what this software should do, I think we only need really strict search. If I want "Hex Schmitt Trigger", I don't want synonyms or pseudonyms of "Trigger".
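To make the loose/strict distinction concrete, here is a minimal standard-library sketch, not Bleve code. It assumes the looseness comes from any-term (OR-style) matching, which is only a guess about the cause. `tokenize`, `matchAny`, `matchPhrase` and the datasheet snippets are all hypothetical. In Bleve itself the rough counterparts would be a match query versus a match phrase query (`bleve.NewMatchPhraseQuery`).

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// tokenize lowercases text and splits it on anything that is not a letter or digit.
func tokenize(text string) []string {
	return strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

// matchAny reports whether doc contains at least one query term (loose, OR-style).
func matchAny(doc, query []string) bool {
	seen := make(map[string]bool, len(doc))
	for _, t := range doc {
		seen[t] = true
	}
	for _, q := range query {
		if seen[q] {
			return true
		}
	}
	return false
}

// matchPhrase reports whether the query terms occur in doc consecutively and in order (strict).
func matchPhrase(doc, query []string) bool {
	if len(query) == 0 {
		return false
	}
	for i := 0; i+len(query) <= len(doc); i++ {
		ok := true
		for j := range query {
			if doc[i+j] != query[j] {
				ok = false
				break
			}
		}
		if ok {
			return true
		}
	}
	return false
}

func main() {
	// Hypothetical datasheet snippets, not real PDF contents.
	docs := map[string]string{
		"ne555.pdf":    "Precision timers with trigger and reset inputs",
		"sn74hc14.pdf": "Hex Schmitt-Trigger Inverters",
	}
	query := tokenize("Hex Schmitt Trigger")
	for _, name := range []string{"ne555.pdf", "sn74hc14.pdf"} {
		doc := tokenize(docs[name])
		fmt.Printf("%s loose=%v strict=%v\n", name, matchAny(doc, query), matchPhrase(doc, query))
	}
	// Prints:
	// ne555.pdf loose=true strict=false
	// sn74hc14.pdf loose=true strict=true
}
```

With any-term matching, a single shared word such as "trigger" is enough to return a file, which would explain extra hits. Phrase matching only returns files that contain the whole sequence.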

https://git.vvvvvvaria.org/varia/go-sh-manymanuals/issues/2#issuecomment-955

decentral1se added the bug label 2023-06-03 19:51:08 +02:00

decentral1se commented

2023-06-03 19:53:14 +02:00

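I suspect it is due to this naive approach to indexing in exp/bleve.go, lines 87 to 89 in 9a8ff220d2 (https://git.vvvvvvaria.org/varia/go-sh-manymanuals/src/commit/9a8ff220d28e41b8351d187e12c43faded43315f/exp/bleve.go#L87-L89):

    if err := index.Index(datasheet.filename, contents); err != nil {
        log.Fatal(err)
    }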

which is just key = filename, value = plain text contents of the file. Do we need to process the content of the file a bit and then generate indexes from that? I'm really not good at information retrieval, maybe someone can help us with this 🤔
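As one possible answer to "process the content, then generate indexes", here is a toy positional inverted index using only the Go standard library. It is a sketch under assumptions, not how Bleve stores things internally and not the project's code. The `postings` type, its methods and `other.pdf` are hypothetical, and the `tms44100.pdf` text is just the title line quoted earlier in this issue.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"unicode"
)

// postings maps a term to the files containing it and the token positions where it occurs.
type postings map[string]map[string][]int

// tokenize lowercases text and splits it on anything that is not a letter or digit.
func tokenize(text string) []string {
	return strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

// add processes one file's plain text and records the position of every token.
func (p postings) add(filename, contents string) {
	for pos, term := range tokenize(contents) {
		if p[term] == nil {
			p[term] = make(map[string][]int)
		}
		p[term][filename] = append(p[term][filename], pos)
	}
}

// followedBy checks that rest[i] occurs in file at position start+1+i for every i.
func (p postings) followedBy(file string, start int, rest []string) bool {
	for i, term := range rest {
		found := false
		for _, pos := range p[term][file] {
			if pos == start+1+i {
				found = true
				break
			}
		}
		if !found {
			return false
		}
	}
	return true
}

// phrase returns the sorted names of files in which the query terms appear consecutively.
func (p postings) phrase(query string) []string {
	terms := tokenize(query)
	if len(terms) == 0 {
		return nil
	}
	var hits []string
	for file, starts := range p[terms[0]] {
		for _, start := range starts {
			if p.followedBy(file, start, terms[1:]) {
				hits = append(hits, file)
				break
			}
		}
	}
	sort.Strings(hits)
	return hits
}

func main() {
	idx := postings{}
	idx.add("tms44100.pdf", "TMS44100, TMS44100P, TMS46100, TMS46100P 4194304-WORD BY 1-BIT DYNAMIC RANDOM-ACCESS MEMORIES")
	idx.add("other.pdf", "1048576-word by 4-bit dynamic random-access memories") // hypothetical second datasheet

	fmt.Println(idx.phrase("4194304-WORD BY 1-BIT"))          // [tms44100.pdf]
	fmt.Println(idx.phrase("dynamic random-access memories")) // [other.pdf tms44100.pdf]
}
```

Storing positions is what makes strict phrase lookups possible. An index that only records which terms a file contains cannot tell whether the words appear next to each other.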


decentral1se commented

2023-12-16 12:33:50 +01:00

Just saw this fly by and it is related: https://github.com/PaperCutSoftware/pdfsearch

One interesting part is that it can match terms from an index and then generate a PDF of the relevant pages on the fly for review. I'm unsure whether it also highlights the matched text, but that should be possible too. I'm also unsure how that would translate to a terminal environment.
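For the terminal side, one option might be to wrap matched text in ANSI escape codes. The sketch below uses only the Go standard library and does not use Bleve's own highlighting support. The `highlight` helper and the colour choice are hypothetical, and it assumes the terminal understands ANSI colour sequences.

```go
package main

import (
	"fmt"
	"regexp"
)

const (
	ansiBoldYellow = "\x1b[1;33m" // start bold yellow text
	ansiReset      = "\x1b[0m"    // reset all attributes
)

// highlight wraps every case-insensitive occurrence of term in ANSI colour codes.
func highlight(text, term string) string {
	if term == "" {
		return text
	}
	// QuoteMeta keeps characters such as "-" or "." in the term literal.
	re := regexp.MustCompile("(?i)" + regexp.QuoteMeta(term))
	return re.ReplaceAllStringFunc(text, func(match string) string {
		return ansiBoldYellow + match + ansiReset
	})
}

func main() {
	// Hypothetical line of extracted datasheet text.
	fmt.Println(highlight("HEX SCHMITT-TRIGGER INVERTERS", "schmitt-trigger"))
}
```

The original casing of the matched text is preserved, so the highlighted line still reads like the datasheet.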

And the code at https://github.com/peterwilliams97/pdf-search uses Bleve too.

Should we ever return to hack again 😆


crunk commented

2023-12-22 11:41:27 +01:00

> Should we ever return to hack again 😆

![2024 is the promise](https://thumbs.dreamstime.com/b/hacking-future-hack-concept-hacker-using-laptop-digital-business-interface-double-exposure-136506720.jpg)

