add info about new script

master · rra · 5 years ago · commit 1b1e5b1e52

README.md
# doing applied fediverse research

* hands-on approach to verify statistics on the fediverse from fediverse.network and the-federation.info
* draws conclusions from that
## Methods

### Mapping the network
`fedicrawler.py` is a utility to map the Fediverse and save the resulting data as a JSON file.
Currently the script starts from <https://post.lurk.org> and queries `/api/v1/instance/peers` to find the servers it is peering with. For each peering server it hasn't seen before, it does the same. This follows from the assumption that getting peer lists from Mastodon & Pleroma gives enough of a view of the 'known fediverse' to work with.
> initial peer list
>> all peers of the initial peers
>>> all peers of the peers of the initial peers

------------------------------------------------- +
the known fediverse?
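The crawl described above amounts to a breadth-first walk over the peers endpoint. Below is a simplified, single-threaded sketch, not the actual `fedicrawler.py`; the `fetch` parameter is an invented hook so the network call can be swapped out for testing:

```python
import json
from urllib.request import urlopen

def get_peers(domain, fetch=None):
    """Return one instance's peer list via the Mastodon/Pleroma peers API."""
    url = f"https://{domain}/api/v1/instance/peers"
    try:
        if fetch is None:
            with urlopen(url, timeout=10) as resp:
                return json.load(resp)
        return fetch(url)
    except Exception:
        # unreachable or non-conforming instances contribute no peers
        return []

def crawl(start, fetch=None):
    """Breadth-first walk: query every newly seen domain for its own peers."""
    seen, queue = set(), [start]
    while queue:
        domain = queue.pop(0)
        if domain in seen:
            continue
        seen.add(domain)
        queue.extend(get_peers(domain, fetch))
    return seen
```

Running `crawl("post.lurk.org")` would walk outward from the seed instance until no unseen domains remain; the real script additionally persists the result as JSON.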
#### Instance metadata
We try to query `/.well-known/nodeinfo` for instance metadata such as software type. This is what both fediverse.network and the-federation.info do.
When any request fails on a given instance we log the raised `Exception`; if it is an HTTP error we currently log the server's answer instead.
The latest scrape results can be found in `instance_scrape.json`.
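The nodeinfo lookup is a two-step fetch: the well-known document only lists links to versioned nodeinfo schemas, and the actual metadata sits behind one of those links. A minimal sketch of that flow, with the error-recording behaviour described above (a hypothetical helper, not the project's own code; `fetch` is an invented injection point):

```python
import json
from urllib.request import urlopen

def get_nodeinfo(domain, fetch=None):
    """Resolve /.well-known/nodeinfo, then fetch the document it links to."""
    get = fetch or (lambda url: json.load(urlopen(url, timeout=10)))
    try:
        well_known = get(f"https://{domain}/.well-known/nodeinfo")
        # the well-known document lists links to versioned nodeinfo
        # schemas; take the last (usually newest) entry
        return get(well_known["links"][-1]["href"])
    except Exception as e:
        # record the failure in the scrape data instead of crashing
        return {"error": repr(e)}
```

A successful result carries fields like `software.name` and `software.version`, which is the data fediverse.network and the-federation.info aggregate.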
### Gathering of ToS/CoC
`about_collector.py` is a utility to collect and document the `about/more/` pages of Mastodon instances found in the `instance_scrape.json` data. It collects three things per instance: a screenshot of the page, the HTML of the page, and a separate file with the HTML fragment containing the server description, which usually also contains a ToS/CoC. The script does a rough sort based on whether an instance has any instance info at all.
You will find the collected results in `about_pages` and `about_pages/with_tos`.
This script requires `selenium` and the Gecko web driver (`geckodriver`). See the [selenium documentation](https://selenium-python.readthedocs.io/installation.html) for more information.
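A minimal sketch of what one collection step can look like with selenium and headless Firefox (a hypothetical `capture` helper, not the actual `about_collector.py`; error handling and the with/without-ToS sorting are omitted):

```python
import os

def about_url(domain):
    """Mastodon serves the long-form server description at /about/more."""
    return f"https://{domain}/about/more"

def capture(domain, out_dir="about_pages"):
    """Save a screenshot and the HTML of one instance's about/more page."""
    from selenium import webdriver  # third-party; needs geckodriver on PATH
    os.makedirs(out_dir, exist_ok=True)
    options = webdriver.FirefoxOptions()
    options.add_argument("--headless")
    driver = webdriver.Firefox(options=options)
    try:
        driver.get(about_url(domain))
        driver.save_screenshot(os.path.join(out_dir, f"{domain}.png"))
        with open(os.path.join(out_dir, f"{domain}.html"), "w") as fh:
            fh.write(driver.page_source)
    finally:
        driver.quit()
```

Reusing a single driver across many instances (rather than starting one per page) is what makes multithreading the screenshotting, as listed in the TODO below, non-trivial: each thread needs its own browser session.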
## TODO FIXME
* multithread the screenshotting
* ~~add detailed error message to json when we get one~~ * ~~add detailed error message to json when we get one~~
* ~~abstract the functions so we can multithread them~~ * ~~abstract the functions so we can multithread them~~
