Use Bleve for indexing and search of PDF contents #2

Closed
opened 2022-05-02 16:22:34 +02:00 by crunk · 7 comments
Owner

http://blevesearch.com/ looks well documented and well developed/maintained.

Maybe a bit large?
crunk added the discussion label 2022-05-02 16:22:41 +02:00
Owner

Cool! I know Gitea uses this, actually. We can include the import, make it do a thing, and then simply check how big the binary blows up to. It might not be that much, or it might be acceptable for the features we get. We can always use https://upx.github.io to reduce the binary size as well. The API looks good!
Owner
Related: https://github.com/zinclabs/zinc (uses https://github.com/blugelabs/bluge)
Owner

The one thing I'm struggling to understand is how we will get Bleve to build a searchable interface for multiple documents. As I understand it, we're going to parse the PDFs and then load them into Bleve. A single search interface will then filter a list based on that indexed content.

It seems that the `bleve bulk` command loads multiple files, which might be a good start. If we could generate a Bleve index for some PDFs, we could play around with writing some scripts to query against the content.
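To make the "one index, many documents" idea concrete, here is a minimal stdlib-only sketch of an inverted index keyed by document ID. It is not Bleve's API or its implementation, only an illustration of the concept. The helper names (`tokenize`, `buildIndex`, `lookup`), the file name `sn7414.pdf`, and the sample text are all hypothetical; real input would be text extracted from the PDFs.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"unicode"
)

// tokenize lowercases text and splits it on anything that is not a letter or digit.
func tokenize(text string) []string {
	return strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

// buildIndex maps each term to the set of document IDs that contain it.
func buildIndex(docs map[string]string) map[string]map[string]bool {
	index := make(map[string]map[string]bool)
	for id, text := range docs {
		for _, term := range tokenize(text) {
			if index[term] == nil {
				index[term] = make(map[string]bool)
			}
			index[term][id] = true
		}
	}
	return index
}

// lookup returns the sorted IDs of all documents that contain the term.
func lookup(index map[string]map[string]bool, term string) []string {
	var ids []string
	for id := range index[strings.ToLower(term)] {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	return ids
}

func main() {
	// Hypothetical extracted text, standing in for parsed PDF contents.
	docs := map[string]string{
		"tms44100.pdf": "Enhanced-Page-Mode dynamic random-access memory",
		"sn7414.pdf":   "Hex Schmitt-Trigger inverters",
	}
	index := buildIndex(docs)
	fmt.Println(lookup(index, "trigger"))
	fmt.Println(lookup(index, "memory"))
}
```

Every document goes into the same index under its own ID, so a single query returns a list of matching file names. That list is what a search interface would filter on.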
decentral1se added enhancement and removed discussion labels 2023-05-10 15:03:32 +02:00
decentral1se changed title from use Bleve for indexing and search to Use Bleve for indexing and search 2023-05-10 15:03:42 +02:00
decentral1se changed title from Use Bleve for indexing and search to Use Bleve for indexing and search of PDF contents 2023-05-10 15:03:49 +02:00
Owner

Holy fucking shit...

https://mirror.as35701.net/video.fosdem.org/2015/devroom-go/bleve.mp4

The Google-search-bar-style diffs that it produces by default... (at around 12:37)

Unexpectedly excited about writing a tool that searches things 🙃

Owner

OK, here's a test script we can use:

> https://git.vvvvvvaria.org/varia/go-sh-manymanuals/src/commit/8c9ec38faf15400373f7d99db6f16bc8bce00f13/exp/bleve.go

It assumes you have a `../datasheets` directory full of PDFs.

Just `cd exp && go run bleve.go`. You can change the search term on this line:

> https://git.vvvvvvaria.org/varia/go-sh-manymanuals/src/commit/8c9ec38faf15400373f7d99db6f16bc8bce00f13/exp/bleve.go#L89

`query := bleve.NewMatchQuery("Enhanced-Page-Mode")`

See https://git.vvvvvvaria.org/varia/go-sh-manymanuals/src/branch/main/exp#bleve-go
Owner

Alright, the `exp/bleve.go` example seems to be giving back something useful. I will try to integrate this into the content filter mode via https://git.vvvvvvaria.org/varia/go-sh-manymanuals/issues/5.
Author
Owner

I tried a few search queries today and my initial impression is that it's way too loose with its results, even when directly specifying complete sentences from one PDF, like:

`"TMS44100, TMS44100P, TMS46100, TMS46100P 4194304-WORD BY 1-BIT DYNAMIC RANDOM-ACCESS MEMORIES"`

It should really only return tms44100.pdf, but it returned 5 PDFs. There should be a way to tell when a search should be loose and when it should be strict.

But thinking through what this software should do, I think we only need really strict search. If I want "Hex Schmitt Trigger", I don't want synonyms or pseudonyms of Trigger.
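One likely explanation, as far as I understand Bleve's match query: the query text is analyzed into separate terms, and by default a document that contains any one of them counts as a hit. Bleve also has `bleve.NewMatchPhraseQuery`, which requires the terms to appear in order and may be closer to the strict behaviour wanted here, though that has not been tried in this thread. The stdlib-only sketch below illustrates the difference between the two strategies. It does not use Bleve, and `matchAny`, `matchPhrase`, `other.pdf`, and the sample text are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// tokenize lowercases text and splits it on anything that is not a letter or digit.
func tokenize(text string) []string {
	return strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

// matchAny reports whether doc contains at least one query term (loose matching).
func matchAny(doc, query string) bool {
	terms := make(map[string]bool)
	for _, t := range tokenize(doc) {
		terms[t] = true
	}
	for _, q := range tokenize(query) {
		if terms[q] {
			return true
		}
	}
	return false
}

// matchPhrase reports whether the query terms occur in doc consecutively
// and in the same order (strict matching).
func matchPhrase(doc, query string) bool {
	d, q := tokenize(doc), tokenize(query)
	if len(q) == 0 {
		return false
	}
	for i := 0; i+len(q) <= len(d); i++ {
		found := true
		for j := range q {
			if d[i+j] != q[j] {
				found = false
				break
			}
		}
		if found {
			return true
		}
	}
	return false
}

func main() {
	// Hypothetical extracted text, standing in for parsed PDF contents.
	docs := map[string]string{
		"tms44100.pdf": "TMS44100 4194304-word by 1-bit dynamic random-access memories",
		"other.pdf":    "16-bit static random-access memory",
	}
	query := "1-bit dynamic random-access memories"
	for _, id := range []string{"other.pdf", "tms44100.pdf"} {
		fmt.Printf("%s any=%v phrase=%v\n", id, matchAny(docs[id], query), matchPhrase(docs[id], query))
	}
}
```

With any-term matching, a long query such as a full datasheet title hits every document that shares a single common word like "bit" or "random". Phrase matching only hits the document that contains the whole sequence.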
Reference: varia/go-sh-manymanuals#2