Use Bleve for indexing and search of PDF contents #2


crunk · 2022-05-02T16:22:34+02:00

crunk commented

2022-05-02 16:22:34 +02:00

http://blevesearch.com/ looks well documented and well developed/maintained.

Maybe a bit large?


crunk added the discussion label 2022-05-02 16:22:41 +02:00

decentral1se commented

2022-05-04 11:55:27 +02:00

Cool! I know Gitea uses this, actually. We can include the import, make it do something, and then simply check how big the binary blows up to. It might not be that much, or at least acceptable for the features we get? We can always use https://upx.github.io to reduce binary size too. The API looks good!

decentral1se commented

2022-06-05 17:07:50 +02:00

Related: https://github.com/zinclabs/zinc (uses https://github.com/blugelabs/bluge)

decentral1se commented

2023-05-10 02:01:31 +02:00

The one thing I'm struggling to understand is how we will get Bleve to build a searchable interface for multiple documents. We're gonna parse the PDFs and then load them into Bleve, as I understand it. Then a single search interface will filter a list based on that indexed content.

It seems that the `bleve bulk` command loads multiple files, which might be a good start. If we could generate a Bleve index for some PDFs, then we could play around with writing some scripts to query against the content.
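To make the "one index, many documents" idea concrete, here is a toy sketch using only the Go standard library. It is not Bleve and does not use Bleve's API; the `Index` type, the file names other than tms44100.pdf, and the sample text are all made up for illustration. It only shows the shape of the thing: every document's text goes into a single index keyed by term, and a search filters the list of documents by that indexed content.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"unicode"
)

// Index maps a lowercased term to the set of document IDs that contain it.
// Hypothetical type for illustration; Bleve's real index is far richer.
type Index map[string]map[string]struct{}

// tokenize lowercases text and splits it on anything that is not a letter or digit.
func tokenize(text string) []string {
	return strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

// Add records every term of text under the given document ID.
func (idx Index) Add(id, text string) {
	for _, term := range tokenize(text) {
		if idx[term] == nil {
			idx[term] = make(map[string]struct{})
		}
		idx[term][id] = struct{}{}
	}
}

// Search returns the sorted IDs of all documents containing the term.
func (idx Index) Search(term string) []string {
	matches := idx[strings.ToLower(term)]
	ids := make([]string, 0, len(matches))
	for id := range matches {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	return ids
}

func main() {
	idx := Index{}
	// In the real tool this text would come from parsing each PDF.
	idx.Add("tms44100.pdf", "4194304-word by 1-bit dynamic random-access memories")
	idx.Add("sn74hc14.pdf", "Hex Schmitt-trigger inverters")
	idx.Add("tms4464.pdf", "65536 by 4-bit dynamic random-access memory")

	fmt.Println(idx.Search("dynamic"))
	fmt.Println(idx.Search("schmitt"))
}
```

The search result is just a list of document IDs, which is what a single search interface would need in order to filter the file list.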


decentral1se added enhancement and removed discussion labels 2023-05-10 15:03:32 +02:00

decentral1se changed title from ~~use Bleve for indexing and search~~ to Use Bleve for indexing and search

2023-05-10 15:03:42 +02:00

decentral1se changed title from ~~Use Bleve for indexing and search~~ to Use Bleve for indexing and search of PDF contents

2023-05-10 15:03:49 +02:00

decentral1se referenced this issue

2023-05-10 15:09:09 +02:00

Search UI switcher (filename, contents) #5

decentral1se commented

2023-05-10 19:33:01 +02:00

Holy fucking shit...

https://mirror.as35701.net/video.fosdem.org/2015/devroom-go/bleve.mp4

The Google search bar style diffs that it produces by default... (~ 12:37m)

Unexpectedly excited about writing a tool that searches things 🙃


decentral1se commented

2023-05-11 16:10:51 +02:00

OK, here's a test script we can use:

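https://git.vvvvvvaria.org/varia/go-sh-manymanuals/src/commit/8c9ec38faf15400373f7d99db6f16bc8bce00f13/exp/bleve.go

It assumes you have a `../datasheets` directory full of PDFs.

Just `cd exp && go run bleve.go`. You can change the search term on line 89 of exp/bleve.go (commit 8c9ec38faf):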

query := bleve.NewMatchQuery("Enhanced-Page-Mode")

See https://git.vvvvvvaria.org/varia/go-sh-manymanuals/src/branch/main/exp#bleve-go


decentral1se commented

2023-05-15 21:16:39 +02:00

Alright, the exp/bleve.go example seems to be giving back something useful. I will try to integrate this into the content filter mode via #5.


decentral1se closed this issue

2023-05-15 21:16:39 +02:00

crunk commented

2023-06-03 11:31:03 +02:00

I tried a few search queries today and my initial impression is that it's way too loose with its results. Even when directly specifying complete sentences from one PDF, like:
"TMS44100, TMS44100P, TMS46100, TMS46100P 4194304-WORD BY 1-BIT DYNAMIC RANDOM-ACCESS MEMORIES"

It should really only return tms44100.pdf, but it returned 5 pdfs.
There should be a way to understand when something should be loose and when something should be strict.

But thinking through what this software should do, I think we only need really strict search. If I want "Hex Schmitt Trigger" I don't want synonyms or pseudonyms of Trigger.
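A toy illustration of the loose versus strict distinction, standard library only. This is not Bleve code, and the two sample documents are invented. As far as I understand it, the match query used in exp/bleve.go splits the input into terms and by default accepts a document that contains any one of them, which would explain the extra PDFs. Bleve also has `bleve.NewMatchPhraseQuery`, which might be the stricter behaviour wanted here, though I have not tested it against the datasheets.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// tokenize lowercases text and splits it on anything that is not a letter or digit.
func tokenize(text string) []string {
	return strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

// matchAny is the loose mode: true if doc contains at least one query term.
func matchAny(doc, query string) bool {
	seen := make(map[string]bool)
	for _, t := range tokenize(doc) {
		seen[t] = true
	}
	for _, t := range tokenize(query) {
		if seen[t] {
			return true
		}
	}
	return false
}

// matchPhrase is the strict mode: true only if all query terms appear
// consecutively and in the same order somewhere in doc.
func matchPhrase(doc, query string) bool {
	d, q := tokenize(doc), tokenize(query)
	if len(q) == 0 {
		return false
	}
	for i := 0; i+len(q) <= len(d); i++ {
		ok := true
		for j := range q {
			if d[i+j] != q[j] {
				ok = false
				break
			}
		}
		if ok {
			return true
		}
	}
	return false
}

func main() {
	// Hypothetical documents, standing in for text extracted from PDFs.
	docs := []struct{ name, text string }{
		{"hex-schmitt.pdf", "Hex Schmitt Trigger inverters"},
		{"dual-trigger.pdf", "Dual monostable multivibrator with trigger input"},
	}
	query := "Hex Schmitt Trigger"
	for _, d := range docs {
		fmt.Printf("%s any=%t phrase=%t\n", d.name, matchAny(d.text, query), matchPhrase(d.text, query))
	}
}
```

Both documents pass the loose check because they share the word "trigger", but only the first passes the phrase check.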


decentral1se referenced this issue

2023-06-03 19:51:08 +02:00

Improve bleve search accuracy #9
