Cool! I know Gitea uses this, actually. We can include the import, make it do a thing, and then simply check how big the binary blows up to. The size might be acceptable for the features we get. We can always use https://upx.github.io to reduce binary size as well. API looks good!
The one thing I'm struggling to understand is how we will get Bleve to build a searchable interface for multiple documents. We're gonna parse the PDFs and then load them into Bleve, as I understand it. Then a single search interface will filter a list based on that indexed content.
It seems that the `bleve bulk` command loads multiple files which might be a good start. If we could generate a Bleve index for some PDFs, then we could play around with writing some scripts to query against the content.
Holy fucking shit...
https://mirror.as35701.net/video.fosdem.org/2015/devroom-go/bleve.mp4
The Google search bar style diffs that it produces by default... (~ 12:37m)
Unexpectedly excited about writing a tool that searches things 🙃
OK, here's a test script we can use:
> https://git.vvvvvvaria.org/varia/go-sh-manymanuals/src/commit/8c9ec38faf15400373f7d99db6f16bc8bce00f13/exp/bleve.go
It assumes you have a `../datasheets` directory full of PDFs.
Just `cd exp && go run bleve.go`. You can change the search term over on:
> https://git.vvvvvvaria.org/varia/go-sh-manymanuals/src/commit/8c9ec38faf15400373f7d99db6f16bc8bce00f13/exp/bleve.go#L89
![image](/attachments/7acfbd5d-cb44-42c8-8fb2-fb1dac240e0e)
See https://git.vvvvvvaria.org/varia/go-sh-manymanuals/src/branch/main/exp#bleve-go
Alright, the `exp/bleve.go` example seems to be giving back something useful. I will try to integrate this into the content filter mode via https://git.vvvvvvaria.org/varia/go-sh-manymanuals/issues/5.
I tried a few search queries today and my initial impression is that it's way too loose with its results, even when directly specifying a complete sentence from one PDF, like:
`"TMS44100, TMS44100P, TMS46100, TMS46100P 4194304-WORD BY 1-BIT DYNAMIC RANDOM-ACCESS MEMORIES"`
It should really only return tms44100.pdf, but it returned 5 PDFs.
There should be a way to tell it when a search should be loose and when it should be strict.
But thinking through what this software should do, I think we only need really strict search. If I want "Hex Schmitt Trigger", I don't want synonyms or pseudonyms of Trigger.
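One plausible explanation, assuming the test script uses a match-style query: Bleve's match query analyzes the input into terms and accepts any document that satisfies at least one of them, so a long sentence matches every PDF that shares a single word with it. Bleve also has a match-phrase query (`bleve.NewMatchPhraseQuery`) that requires the whole phrase, which may be the strict mode we want. The sketch below is not Bleve code; it is a plain Go illustration of the two behaviours, with made-up helper names (`matchAny`, `matchPhrase`) and made-up sample documents.

```go
package main

import (
	"fmt"
	"strings"
)

// tokens lowercases the text and splits it on whitespace.
func tokens(s string) []string {
	return strings.Fields(strings.ToLower(s))
}

// matchAny is the loose mode: true if the document shares at least one token with the query.
func matchAny(doc, query string) bool {
	seen := map[string]bool{}
	for _, t := range tokens(doc) {
		seen[t] = true
	}
	for _, t := range tokens(query) {
		if seen[t] {
			return true
		}
	}
	return false
}

// matchPhrase is the strict mode: true only if the query tokens appear
// in the document consecutively and in the same order.
func matchPhrase(doc, query string) bool {
	d, q := tokens(doc), tokens(query)
	if len(q) == 0 {
		return false
	}
	for i := 0; i+len(q) <= len(d); i++ {
		ok := true
		for j := range q {
			if d[i+j] != q[j] {
				ok = false
				break
			}
		}
		if ok {
			return true
		}
	}
	return false
}

func main() {
	// Made-up sample text standing in for extracted PDF content.
	docs := []struct{ name, text string }{
		{"a.pdf", "CMOS Hex Schmitt Trigger inverter"},
		{"b.pdf", "Quad NAND gate with trigger input"},
	}
	query := "Hex Schmitt Trigger"
	for _, d := range docs {
		fmt.Printf("%s loose=%v strict=%v\n", d.name, matchAny(d.text, query), matchPhrase(d.text, query))
	}
}
```

Here the second document passes the loose check only because it contains the word "trigger", which is the same kind of over-matching seen with the TMS44100 sentence.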
http://blevesearch.com/
Looks well documented and well developed/maintained.
Maybe a bit large?
Related: https://github.com/zinclabs/zinc (uses https://github.com/blugelabs/bluge)