Use Bleve for indexing and search of PDF contents #2

Closed
opened 2 years ago by crunk · 7 comments
crunk commented 2 years ago
Owner

http://blevesearch.com/
looks well documented and well developed/maintained.

Maybe a bit large?

crunk added the discussion label 2 years ago
Owner

Cool! I know Gitea actually uses this. We can include the import, make it do a thing, and then simply check how much the binary grows. It might not be that much, or it might be acceptable for the features we get. We can always use https://upx.github.io to reduce the binary size as well. The API looks good!

Owner
Related: https://github.com/zinclabs/zinc (uses https://github.com/blugelabs/bluge)
Owner

The one thing I'm struggling to understand is how we will get Bleve to build a searchable interface for multiple documents. We're gonna parse the PDFs and then load them into Bleve, as I understand it. Then a single search interface will filter a list based on that indexed content.

It seems that the `bleve bulk` command loads multiple files, which might be a good start. If we could generate a Bleve index for some PDFs, then we could play around with writing some scripts to query against the content.

decentral1se added enhancement and removed discussion labels 1 year ago
decentral1se changed title from use Bleve for indexing and search to Use Bleve for indexing and search 1 year ago
decentral1se changed title from Use Bleve for indexing and search to Use Bleve for indexing and search of PDF contents 1 year ago
Owner

Holy fucking shit...

https://mirror.as35701.net/video.fosdem.org/2015/devroom-go/bleve.mp4

The Google search bar style diffs that it produces by default... (~ 12:37m)

Unexpectedly excited about writing a tool that searches things 🙃

Owner

OK, here's a test script we can use:

https://git.vvvvvvaria.org/varia/go-sh-manymanuals/src/commit/8c9ec38faf15400373f7d99db6f16bc8bce00f13/exp/bleve.go

It assumes you have a `../datasheets` directory full of PDFs.

Just `cd exp && go run bleve.go`. You can change the search term on:

https://git.vvvvvvaria.org/varia/go-sh-manymanuals/src/commit/8c9ec38faf15400373f7d99db6f16bc8bce00f13/exp/bleve.go#L89

![image](/attachments/7acfbd5d-cb44-42c8-8fb2-fb1dac240e0e)

See https://git.vvvvvvaria.org/varia/go-sh-manymanuals/src/branch/main/exp#bleve-go

Alright, the `exp/bleve.go` example seems to be giving back something useful. I will try to integrate this into the content filter mode via https://git.vvvvvvaria.org/varia/go-sh-manymanuals/issues/5.
decentral1se closed this issue 12 months ago
Poster
Owner

I tried a few search queries today and my initial impression is that it's way too loose with its results. Even when directly specifying a complete sentence from one PDF, like:
"TMS44100, TMS44100P, TMS46100, TMS46100P 4194304-WORD BY 1-BIT DYNAMIC RANDOM-ACCESS MEMORIES"

it should really only return tms44100.pdf, but it returned 5 PDFs.
There should be a way to tell when a search should be loose and when it should be strict.

But thinking through what this software should do, I think we only need really strict search. If I want "Hex Schmitt Trigger", I don't want synonyms or pseudonyms of Trigger.
