Intro
I’ve been working on a web scanner for the last couple of weeks, and it isn’t exactly what I had in mind when I first started this project.
This post is about the original problem, and how it evolved into the application that I call Bloodhound.
What is Bloodhound?
It started as a simple project: applying rules to rank and sort a list of URLs. It eventually turned into a multi-module CLI tool, with an integrated web crawler and vulnerability scanner.
Categorizing web resources
When I first investigate a website, I start with any OSINT I can find. One of the tools I use is my own implementation of this waybackurls script. There is a problem with this method though: any website worth targeting produces millions of URLs, full of repetitions and dead routes.
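For reference, the collection step looks roughly like this (a minimal sketch against the public Wayback Machine CDX API, which is what waybackurls-style tools query; the helper name and everything around the request are simplified for illustration):

```python
import requests

def wayback_urls(domain: str) -> list[str]:
    """Fetch archived URLs for a domain from the Wayback Machine CDX API."""
    resp = requests.get(
        "https://web.archive.org/cdx/search/cdx",
        params={
            "url": f"*.{domain}/*",    # include subdomains
            "fl": "original",          # only return the original URL column
            "collapse": "urlkey",      # drop exact duplicates of the same URL key
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.text.splitlines()

# Even a modest target returns a long, noisy list:
# urls = wayback_urls("example.com")
# print(len(urls), "archived URLs")
```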
What if there was a way to see how relevant each URL is without inspecting them one by one?
About web resource naming
Throughout my time working with software, I’ve gotten to know a bit about how the mind of a software developer works. Two observations stand out:
- Enterprise software tends to be developed and named in English;
- Resources that share a goal tend to be named the same, or at least similarly.
There is a limit to how many different ways you can name something (if you are a sane person). Take, for example, an authentication flow: how many different ways can you name a login page?
- /login
- /auth
- /weird-system-name/authentication
Maybe something more specific to the technology you’re using:
- /sso/google
- /jwt:fetch
The point is that every language has a limited number of words you can use to identify something and describe how it relates to everything else.
Yes, of course, someone might be very clever and hide a resource behind a random string of characters, but even if 10% of resources were obfuscated this way (a figure that is definitely exaggerated), the other 90% would still be discoverable.
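To make the claim concrete, a small keyword set already catches every example above (a quick sketch; the word list is only an illustration, not a complete one):

```python
AUTH_WORDS = {"login", "logout", "auth", "signin", "password", "sso", "jwt"}

def mentions_auth(path: str) -> bool:
    # Break the path into rough "words" and check each one against the keyword set.
    segments = path.lower().replace(":", "/").replace("-", "/").split("/")
    return any(word in segment for segment in segments for word in AUTH_WORDS)

for path in ("/login", "/auth", "/weird-system-name/authentication", "/sso/google", "/jwt:fetch"):
    print(path, mentions_auth(path))   # all True
```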
Ideation
I wanted a way to create rules that could indicate how relevant a URL might be in a bug bounty scenario. Each rule would add to the score, and at the end the URL list would be sorted so that the most relevant targets end up at the top. And since I’m going through every result anyway, I might as well do some extra work and remove duplicates from the list, or even URLs that represent the same resource.
But wait, how can I know which URLs represent the same resource?
Understanding web resources
This problem sounds really easy, right? If two URLs share the same path, they point to the same resource. But what about query parameters? What about API-style URLs? Where is the input value?
To be able to group these URLs, we have to break them down and try to understand the intention behind their creation.
We modern humans are very accustomed to how a URL works; even my mother knows how to use one and save it to her browser bookmarks. But at its core, the goal of a URL is communication: I, the client (browser), want something from you, the server. The URL is just a structured way to ask that question.
Going back to the authentication flow example:
- /login -> I would like to authenticate on your website
- /logout -> I would like to deauthenticate from your website
- /book/search?type=cooking -> Give me all the available cooking books
If one can break down the structure of these questions, then one can start trying to ask different questions.
If, for example, /book/search?type=cooking means “give me all the available cooking books”, then /book/search?type=<search> is a way to ask for books of different types. This works because it is easy to identify that a query parameter acts as an input, and because we humans (get out of here, AI) can usually understand why something is named the way it is.
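That is essentially what the grouping step has to do: keep the path, turn likely identifiers and query values into placeholders, and let URLs that ask the same question with different inputs collapse into one template. A minimal sketch of the idea (the placeholder syntax is just my choice for illustration):

```python
from urllib.parse import urlparse, parse_qsl

def resource_template(url: str) -> str:
    """Collapse a URL into a template that identifies the underlying resource."""
    parsed = urlparse(url)
    # Replace purely numeric path segments (likely IDs) with a placeholder.
    segments = ["<id>" if seg.isdigit() else seg for seg in parsed.path.split("/")]
    path = "/".join(segments)
    # Keep the parameter names but drop their values: the names describe the
    # question, the values are just one particular input.
    params = sorted(name for name, _ in parse_qsl(parsed.query, keep_blank_values=True))
    if params:
        return path + "?" + "&".join(f"{name}=<{name}>" for name in params)
    return path

# Both collapse into /book/search?type=<type>
print(resource_template("https://example.com/book/search?type=cooking"))
print(resource_template("https://example.com/book/search?type=horror"))
```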
Structuring rules
Determining whether one web resource is more interesting than another is not that hard; it basically comes down to its potential for exploitation. For example, the 2021 OWASP Top 10 lists injection vulnerabilities as the third most critical risk. According to their own definition, injection occurs when “User-supplied data is not validated, filtered, or sanitized by the application”.
In simpler terms: Never trust user input. If you do, have a gun within arm’s reach.
My first rule draft looked something like this:
| Description | Score |
|---|---|
| Contains query parameters | 1 |
| Contains words related to authentication (auth, login, logout, …) | 2 |
| Contains words related to external resources (url, path, file, filename, …) | 2 |
With this system, URLs featuring query parameters in authentication contexts or referencing file imports would rank highest. URLs with just query parameters would fall in the middle, while all others would rank lower.
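In code, this first draft is little more than a list of (check, score) pairs summed per URL (a sketch of the idea; the word lists are illustrative, not the actual ones):

```python
from urllib.parse import urlparse

AUTH_WORDS = ("auth", "login", "logout", "password", "signin")
EXTERNAL_WORDS = ("url", "path", "file", "filename", "redirect")

RULES = [
    (lambda u: bool(urlparse(u).query), 1),                                  # query parameters
    (lambda u: any(w in urlparse(u).path.lower() for w in AUTH_WORDS), 2),   # authentication-related
    (lambda u: any(w in u.lower() for w in EXTERNAL_WORDS), 2),              # external resource
]

def score(url: str) -> int:
    return sum(points for rule, points in RULES if rule(url))

urls = [
    "https://example.com/about",
    "https://example.com/book/search?type=cooking",
    "https://example.com/login?redirect_url=/home",
]
# Most relevant first: the login URL with a redirect parameter ends up on top.
for url in sorted(urls, key=score, reverse=True):
    print(score(url), url)
```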
Over time, I expanded these rules to consider additional factors, such as:
- Is the resource alive (does it return HTTP 200)?
- What type of content does it return (HTML or JSON)?
- Does the HTML content include a form?
And in full, it looked something like this:
| Category | Subcategory | Description | Score |
|---|---|---|---|
| Known Pattern | Authentication | URL contains words related to the authentication flow (login, logout, auth, password, …) | 1 |
| Known Pattern | Search | URL contains words related to search (search, query, graphql, …) | 2 |
| Known Pattern | Form | Contains form tag | 1 |
| Known Pattern | Hidden Input | Contains form tag with hidden input | 1 |
| Known Pattern | File Upload | Contains form tag with input of type file | 2 |
| Visibility | Private | Page returned 403 or a similar unauthenticated status | 1 |
| Visibility | Public | Page returned without requiring any authentication | 2 |
| Client-side Complexity | Static Page | Page only contains static information | 1 |
| Client-side Complexity | Simple JS | Page contains JS code | 2 |
| Client-side Complexity | API JS | Page contains JS code that includes external calls (fetch, axios, …) | 3 |
| Client-side Complexity | Shady JS | Page contains JS code that includes possibly vulnerable methods (eval, innerHTML, …) | 4 |
| Server-side Complexity | Retrieval Request | Request sent as GET with no parameters returns 200 with content | 1 |
| Server-side Complexity | Modifying Request | Request sent as GET responded with 405; subsequent requests using POST, PUT, DELETE, … produced a different 4xx error | 2 |
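Most of these signals come down to one GET request and a quick look at the response. A simplified sketch of that probe (plain string matching on the body stands in for real HTML/JS parsing, and the method check against 405 responses is left out):

```python
import requests

def probe(url: str) -> dict:
    """Collect the response facts that the rules above score on."""
    resp = requests.get(url, timeout=10, allow_redirects=False)
    content_type = resp.headers.get("Content-Type", "")
    body = resp.text.lower()
    return {
        "alive": resp.status_code == 200,
        "private": resp.status_code in (401, 403),
        "html": "text/html" in content_type,
        "json": "application/json" in content_type,
        "has_form": "<form" in body,
        "has_hidden_input": 'type="hidden"' in body,
        "has_file_upload": 'type="file"' in body,
        "api_js": any(marker in body for marker in ("fetch(", "axios")),
        "shady_js": any(marker in body for marker in ("eval(", "innerhtml")),
    }

# facts = probe("https://example.com/login")
# Each True entry maps to one of the rules in the table above.
```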
Outro
As I got into evaluating what content each URL returned, I started to think about the next step: what to do once I found something that looked promising?
I eventually stumbled upon this article from James Kettle about Backslash Powered Scanning, which is what pulled me down the rabbit hole of vulnerability scanning.
A topic I’ll be going into more depth on in the future.