Use Bleve for indexing and search of PDF contents #2
Reference: varia/go-sh-manymanuals#2
http://blevesearch.com/
Looks well documented and well developed/maintained.
Maybe a bit large?
Cool! I know gitea uses this actually.... we can include the import & make it do a thing and then simply check how big the binary blows up to. It might not be that much / acceptable for the features we get? We can always use https://upx.github.io to reduce binary size also. API looks good!
Related: https://github.com/zinclabs/zinc (uses https://github.com/blugelabs/bluge)
The one thing I'm struggling to understand is how we will get Bleve to build a searchable interface for multiple documents. We're gonna parse the PDFs and then load them into Bleve, as I understand it. Then a single search interface will filter a list based on that indexed content.
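To make the "parse, load, then search one index" idea concrete, here is a rough stdlib-only sketch. It is not Bleve and does not use Bleve's API: it is a toy inverted index that only illustrates how text from several documents can end up behind one search function. The `Index` type, its methods, the `example-*.pdf` names and all the text strings are made up for illustration; real input would come from whatever PDF text extraction we end up using.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"unicode"
)

// Index maps each lowercased term to the set of document names containing it.
// Hypothetical type for illustration; Bleve's real index is far more capable.
type Index map[string]map[string]bool

// tokenize lowercases text and splits it on anything that is not a letter or digit.
func tokenize(text string) []string {
	return strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

// Add indexes the extracted text of one document under its name.
func (idx Index) Add(name, text string) {
	for _, term := range tokenize(text) {
		if idx[term] == nil {
			idx[term] = map[string]bool{}
		}
		idx[term][name] = true
	}
}

// Search returns the sorted names of documents that contain every query term.
func (idx Index) Search(query string) []string {
	terms := tokenize(query)
	if len(terms) == 0 {
		return nil
	}
	var hits []string
	for name := range idx[terms[0]] {
		all := true
		for _, t := range terms[1:] {
			if !idx[t][name] {
				all = false
				break
			}
		}
		if all {
			hits = append(hits, name)
		}
	}
	sort.Strings(hits)
	return hits
}

func main() {
	idx := Index{}
	// Placeholder strings standing in for text extracted from PDFs.
	idx.Add("tms44100.pdf", "4194304-word by 1-bit dynamic random-access memories")
	idx.Add("example-a.pdf", "Hex Schmitt trigger inverter")
	idx.Add("example-b.pdf", "Dynamic shift register")

	for _, q := range []string{"dynamic", "dynamic memories", "schmitt trigger"} {
		fmt.Printf("%q -> %v\n", q, idx.Search(q))
	}
}
```

The point is only the shape: one `Add` call per parsed PDF, then a single `Search` that filters the list of documents. Bleve would replace all of this and add analysis, scoring and highlighting on top.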
It seems that the `bleve bulk` command loads multiple files, which might be a good start. If we could generate a Bleve index for some PDFs, then we could play around with writing some scripts to query against the content.

Holy fucking shit...
https://mirror.as35701.net/video.fosdem.org/2015/devroom-go/bleve.mp4
The Google search bar style diffs that it produces by default... (~ 12:37m)
Unexpectedly excited about writing a tool that searches things 🙃
OK, here's a test script we can use:
It assumes you have a `../datasheets` directory full of PDFs. Just `cd exp && go run bleve.go`. You can change the search term over on:
See https://git.vvvvvvaria.org/varia/go-sh-manymanuals/src/branch/main/exp#bleve-go
Alright, the `exp/bleve.go` example seems to be giving back something useful. I will try to integrate this into the content filter mode via #5.

I tried a few search queries today and my initial impression is that it's way too loose with its results. Even when directly specifying a complete sentence from one PDF, like:
"TMS44100, TMS44100P, TMS46100, TMS46100P 4194304-WORD BY 1-BIT DYNAMIC RANDOM-ACCESS MEMORIES"
It should really only return tms44100.pdf, but it returned 5 PDFs.
There should be a way to understand when something should be loose and when something should be strict.
But thinking through what this software should do, I think we really only need strict search. If I want "Hex Schmitt Trigger", I don't want synonyms or pseudonyms of Trigger.
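A possible explanation for the loose results, as a stdlib-only sketch rather than Bleve code: if the query is treated as a bag of terms where any single term is enough, every datasheet that mentions "dynamic" or "memories" matches. A phrase match that requires the terms to appear consecutively and in order is much stricter. As far as I know Bleve has both a match query and a match-phrase query (`bleve.NewMatchQuery` and `bleve.NewMatchPhraseQuery`); whether `exp/bleve.go` currently uses the loose kind is an assumption on my part. The function names and sample strings below are made up for illustration.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// splitWords lowercases text and splits it on anything that is not a letter or digit.
func splitWords(text string) []string {
	return strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

// matchAny reports whether text contains at least one of the query terms (loose).
func matchAny(text, query string) bool {
	seen := map[string]bool{}
	for _, w := range splitWords(text) {
		seen[w] = true
	}
	for _, t := range splitWords(query) {
		if seen[t] {
			return true
		}
	}
	return false
}

// matchPhrase reports whether the query terms occur in text
// consecutively and in the same order (strict).
func matchPhrase(text, query string) bool {
	words, terms := splitWords(text), splitWords(query)
	if len(terms) == 0 {
		return false
	}
	for i := 0; i+len(terms) <= len(words); i++ {
		ok := true
		for j, t := range terms {
			if words[i+j] != t {
				ok = false
				break
			}
		}
		if ok {
			return true
		}
	}
	return false
}

func main() {
	query := "Hex Schmitt Trigger"
	// Placeholder strings standing in for text extracted from datasheets.
	docs := []string{
		"Hex Schmitt Trigger inverter",
		"Quad NAND gate with Schmitt inputs",
	}
	for _, d := range docs {
		fmt.Printf("%q: any=%v phrase=%v\n", d, matchAny(d, query), matchPhrase(d, query))
	}
}
```

If strict is all we need, defaulting to phrase-style matching and dropping the loose mode entirely would keep the tool simpler.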