# doing applied fediverse research

* a hands-on approach to verifying the Fediverse statistics published by fediverse.network and the-federation.info
* drawing conclusions from the results

## Methods

### Mapping the network

`fedicrawler.py` is a utility to map the Fediverse and save that data as a JSON file.

Currently the script starts from <https://post.lurk.org> and queries `/api/v1/instance/peers` to find the servers it is peering with. For each peer it hasn't seen before, it does the same. This rests on the assumption that the peer lists from Mastodon & Pleroma give enough of a view of the 'known fediverse' to work with.

  > initial peer list
  >> all peers of the initial peers
  >>> all peers of the peers of the initial peers
  
  ------------------------------------------------- +
  
   the known fediverse?
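
The crawl described above can be sketched as a breadth-first walk over peer lists. This is a minimal illustration, not the code in `fedicrawler.py`: the function names, the `max_depth` limit of three levels (taken from the diagram) and the choice to treat any failed request as an empty peer list are all assumptions.

```python
import json
from collections import deque
from urllib.request import urlopen


def fetch_peers(domain, timeout=10):
    """Return the peer list an instance publishes, or [] on any failure."""
    url = f"https://{domain}/api/v1/instance/peers"
    try:
        with urlopen(url, timeout=timeout) as response:
            peers = json.load(response)
    except Exception:
        # Assumption: an unreachable or non-conforming instance has no peers.
        return []
    if not isinstance(peers, list):
        return []
    return [peer for peer in peers if isinstance(peer, str)]


def crawl(start, get_peers=fetch_peers, max_depth=3):
    """Breadth-first walk over peer lists, querying each instance once."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        domain, depth = queue.popleft()
        if depth >= max_depth:
            # Instances at the last level are recorded but not queried.
            continue
        for peer in get_peers(domain):
            if peer not in seen:
                seen.add(peer)
                queue.append((peer, depth + 1))
    return seen
```

Passing `get_peers` as a parameter keeps the traversal separate from the network call, so the walk can be tried on a small hand-written graph before it is pointed at `post.lurk.org`.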

#### Instance metadata
We try to query `/.well-known/nodeinfo` for instance metadata such as the software type. Both fediverse.network and the-federation.info do the same.

When a request to an instance fails, the script logs the raised `Exception`. If the failure is an HTTP error, we currently log the response instead.

The latest scrape results can be found in `instance_scrape.json`.
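
The metadata lookup and the error logging described above might look roughly like this. It is a sketch under assumptions: `get_json` and `fetch_nodeinfo` are hypothetical names, the returned dictionary layout is not the format of `instance_scrape.json`, picking the last entry in `links` is an arbitrary choice, and for HTTP errors only the status code and reason are kept.

```python
import json
from urllib.error import HTTPError
from urllib.request import urlopen


def get_json(url, timeout=10):
    """Fetch a URL and decode the body as JSON."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def fetch_nodeinfo(domain, fetch=get_json):
    """Return {"nodeinfo": ...} on success or {"error": ...} on failure."""
    try:
        # The discovery document lists one link per supported schema version.
        index = fetch(f"https://{domain}/.well-known/nodeinfo")
        href = index["links"][-1]["href"]  # assumption: take the last link
        return {"nodeinfo": fetch(href)}
    except HTTPError as err:
        # The server answered, so record its status and reason.
        return {"error": f"HTTP {err.code} {err.reason}"}
    except Exception as err:
        # Timeouts, DNS failures, malformed JSON, missing keys and so on.
        return {"error": repr(err)}
```

On success, the software type is available as `result["nodeinfo"]["software"]["name"]`.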


### Gathering of ToS/CoC

`about_collector.py` is a utility to collect and document the `about/more/` pages of the Mastodon instances found in the `instance_scrape.json` data. It collects three things per instance: a screenshot of the page, the HTML of the page, and a separate file with the HTML of the server description, which usually also contains a ToS/CoC. The script sorts the results roughly, based on whether an instance has any instance info at all.

You will find the collected results in `about_pages` and `about_pages/with_tos`.  

This script requires `selenium` and `geckodriver`, the Gecko web driver for Firefox. See the [selenium documentation](https://selenium-python.readthedocs.io/installation.html) for more information.
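
As a rough illustration of the first two collection steps (screenshot and page HTML), a Selenium session could be driven as follows. This is not the code in `about_collector.py`: the function names and file naming are assumptions, and extracting the server description is left out because the page selectors are not documented here.

```python
from pathlib import Path


def about_url(domain):
    """Build the address of an instance's extended about page."""
    return f"https://{domain}/about/more/"


def collect_about_page(domain, out_dir="about_pages"):
    """Save a screenshot and the HTML of one instance's about page."""
    # Imported here so about_url() stays usable without selenium installed.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument("-headless")  # run Firefox without a window
    driver = webdriver.Firefox(options=options)
    try:
        driver.get(about_url(domain))
        target = Path(out_dir)
        target.mkdir(parents=True, exist_ok=True)
        driver.save_screenshot(str(target / f"{domain}.png"))
        html_file = target / f"{domain}.html"
        html_file.write_text(driver.page_source, encoding="utf-8")
    finally:
        # Always close the browser, even if the page failed to load.
        driver.quit()
```

Starting one browser per instance is slow. Reusing a single driver across instances, or running several in parallel, would be the natural next step.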


## TODO FIXME
* multithread the screenshotting
* ~~add detailed error message to json when we get one~~ 
* ~~abstract the functions so we can multithread them~~