fedicrawler.py is a utility that maps the Fediverse and saves the result as a JSON file.
Currently the script starts from https://post.lurk.org and queries /api/v1/instance/peers to find the servers it is peering with. For each peering server it has not seen before, it does the same. This rests on the assumption that the peer lists from Mastodon and Pleroma give enough of a view of the ‘known fediverse’ to work with.
initial peer list
all peers of the initial peers
all peers of the peers of the initial peers
the known fediverse?
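The crawl described above amounts to a breadth-first walk over peer lists. The following is a minimal sketch of that idea, not the actual code of fedicrawler.py: the names `crawl_peers`, `fetch_peers_http` and the `max_depth` parameter are assumptions made for illustration, and it uses the standard library's `urllib` where the real script may use something else.

```python
import json
from collections import deque
from typing import Callable, Dict, Iterable
from urllib.request import urlopen


def fetch_peers_http(instance: str, timeout: float = 10) -> Iterable[str]:
    """Return the peer list of one instance (hypothetical helper)."""
    url = f"https://{instance}/api/v1/instance/peers"
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def crawl_peers(
    start: str,
    fetch_peers: Callable[[str], Iterable[str]] = fetch_peers_http,
    max_depth: int = 3,
) -> Dict[str, int]:
    """Walk peer lists breadth-first.

    Returns a mapping of instance name to the depth at which it was
    first seen (0 for the starting instance).
    """
    seen = {start: 0}
    queue = deque([start])
    while queue:
        instance = queue.popleft()
        depth = seen[instance]
        if depth >= max_depth:
            # Known, but too far out to be queried itself.
            continue
        try:
            peers = fetch_peers(instance)
        except Exception:
            # Unreachable or non-conforming instance: skip its peers.
            continue
        for peer in peers:
            if peer not in seen:
                seen[peer] = depth + 1
                queue.append(peer)
    return seen
```

Passing the fetch function as a parameter keeps the traversal testable without network access. With `max_depth=3`, the depths 1, 2 and 3 correspond to the three rings of peers listed above.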
We try to query /.well-known/nodeinfo for instance metadata such as the software type. This is what both fediverse.network and the-federation.info do.
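A NodeInfo lookup takes two requests: /.well-known/nodeinfo returns a list of links, and the linked document carries the `software` entry. The sketch below shows that flow under some assumptions: `http_get_json` and `fetch_software` are hypothetical names, and taking the last advertised link is a simplification rather than necessarily what fedicrawler.py does.

```python
import json
from typing import Any, Callable, Dict, Optional
from urllib.request import urlopen


def http_get_json(url: str, timeout: float = 10) -> Any:
    """Fetch a URL and decode the body as JSON (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def fetch_software(
    instance: str,
    get_json: Callable[[str], Any] = http_get_json,
) -> Optional[Dict[str, Any]]:
    """Return the NodeInfo 'software' entry, or None if none is advertised."""
    index = get_json(f"https://{instance}/.well-known/nodeinfo")
    links = index.get("links", [])
    if not links:
        return None
    # Simplification: follow the last advertised schema link.
    document = get_json(links[-1]["href"])
    return document.get("software")
```

For a Mastodon server the returned dictionary typically looks like `{"name": "mastodon", "version": "..."}`.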
When a request to a given instance fails, the script logs the raised exception. If the failure is an HTTP error, it currently logs the response instead.
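One way to separate the two failure cases is to catch the HTTP error first and fall back to a generic exception handler. This is only a sketch using the standard library's `urllib.error.HTTPError`. The `query_instance` name and the shape of the returned record are assumptions, and the real script may rely on a different HTTP library.

```python
from typing import Any, Callable, Dict
from urllib.error import HTTPError


def query_instance(instance: str, fetch: Callable[[str], Any]) -> Dict[str, Any]:
    """Run one request and record either its data or why it failed."""
    try:
        return {"instance": instance, "data": fetch(instance)}
    except HTTPError as exc:
        # The server answered, but with an error status: keep the answer.
        return {"instance": instance, "error": f"HTTP {exc.code} {exc.reason}"}
    except Exception as exc:
        # Anything else (DNS failure, timeout, invalid JSON, ...).
        return {"instance": instance, "error": repr(exc)}
```

The order of the `except` clauses matters here: `HTTPError` is a subclass of `Exception`, so the more specific handler has to come first.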
Latest scrape results can be found in
about_collector.py is a utility to collect and document the about/more/ pages of Mastodon instances found in the instance_scrape.json data. It collects three things per instance: a screenshot of the page, the HTML of the page, and a separate file with the HTML containing the server description, which usually also contains a ToS/CoC. The script roughly sorts the results based on whether an instance has any instance info at all.
You will find the collected results in
This script requires selenium and the gecko web driver. See the selenium documentation for more information.
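To give an idea of what the collection step involves, here is a rough sketch that saves a screenshot and the page HTML for one instance with selenium and Firefox. The function names, the output file names and the exact URL pattern are assumptions for illustration. It leaves out the third output (the server description as a separate file) and the sorting step.

```python
from pathlib import Path


def about_url(instance: str) -> str:
    """Build the about page URL for an instance (assumed URL pattern)."""
    return f"https://{instance}/about/more"


def collect_about_page(instance: str, out_dir: Path) -> None:
    """Save a screenshot and the HTML of one instance's about page."""
    # Imported here so the module loads even without selenium installed.
    from selenium import webdriver

    out_dir.mkdir(parents=True, exist_ok=True)
    driver = webdriver.Firefox()  # needs the gecko web driver on PATH
    try:
        driver.get(about_url(instance))
        driver.save_screenshot(str(out_dir / f"{instance}.png"))
        html_path = out_dir / f"{instance}.html"
        html_path.write_text(driver.page_source, encoding="utf-8")
    finally:
        # Always close the browser, even if the page failed to load.
        driver.quit()
```

A call such as `collect_about_page("post.lurk.org", Path("about_pages"))` would open a Firefox window, so a headless configuration is probably preferable when processing many instances.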