Intro
I’ve been working on a web scanner for the last couple of weeks, and it isn’t exactly what I had in mind when I first started this project.
This post is about the original problem, and how it evolved into the application that I call Bloodhound.
What is Bloodhound?
It started as a simple project: applying rules to rank and sort a list of URLs. It eventually turned into a multi-module CLI tool, with an integrated web crawler and vulnerability scanner.
Categorizing web resources
When I first investigate a website, I start with any OSINT I can find. One of the tools I use is my own implementation of this waybackurls script. There is a problem with this method though: any website worth targeting produces millions of URLs, full of repetitions and dead routes.
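For reference, the collection step looks roughly like this (a minimal sketch against the public Wayback Machine CDX API, which is what waybackurls-style tools query; the helper name and everything around the request are simplified for illustration):

```python
import requests

def wayback_urls(domain: str) -> list[str]:
    """Fetch archived URLs for a domain from the Wayback Machine CDX API."""
    resp = requests.get(
        "https://web.archive.org/cdx/search/cdx",
        params={
            "url": f"*.{domain}/*",    # include subdomains
            "fl": "original",          # only return the original URL column
            "collapse": "urlkey",      # drop exact duplicates of the same URL key
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.text.splitlines()

# Even a modest target returns a long, noisy list:
# urls = wayback_urls("example.com")
# print(len(urls), "archived URLs")
```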
What if there was a way to see how relevant each URL is without inspecting them one by one?
About web resource naming
Throughout my time working with software, I’ve gotten to know a bit about how the mind of a software developer works. Two observations stand out:
- Enterprise software tends to be developed and named in English;
- Resources that share a goal tend to be named the same, or at least similarly.
There is a limit to how many different ways you can name something (if you are a sane person). Take, for example, an authentication flow: how many different ways can you name a login page?
- /login
- /auth
- /weird-system-name/authentication
Maybe something more specific to the technology you’re using:
- /sso/google
- /jwt:fetch
The point is that every language has a limited number of words you can use to identify something and describe how it relates to everything else.
Yes, of course, someone might be very clever and hide a resource behind a random string of characters, but even if 10% of resources were obfuscated this way (a figure that is definitely exaggerated), the other 90% would still be discoverable.
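To make the claim concrete, a small keyword set already catches every example above (a quick sketch; the word list is only an illustration, not a complete one):

```python
AUTH_WORDS = {"login", "logout", "auth", "signin", "password", "sso", "jwt"}

def mentions_auth(path: str) -> bool:
    # Break the path into rough "words" and check each one against the keyword set.
    segments = path.lower().replace(":", "/").replace("-", "/").split("/")
    return any(word in segment for segment in segments for word in AUTH_WORDS)

for path in ("/login", "/auth", "/weird-system-name/authentication", "/sso/google", "/jwt:fetch"):
    print(path, mentions_auth(path))   # all True
```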
Ideation
I wanted a way to create rules that could indicate how relevant a URL might be in a bug bounty scenario. Each rule would add to the score, and at the end the URL list would be sorted so that the most relevant targets end up at the top. And since I’m going through every result anyway, I might as well do some extra work and remove duplicates from the list, or even URLs that represent the same resource.
But wait, how can I know which URLs represent the same resource?
Understanding web resources
This problem sounds really easy, right? If two URLs share the same path, they point to the same resource. But what about query parameters? What about API-style URLs? Where is the input value?
To be able to group these URLs, we have to break them down and try to understand the intention behind their creation.
We modern humans are very accustomed to how a URL works; even my mother knows how to use one and save it to her browser bookmarks. But at its core, the goal of a URL is communication: I, the client (browser), want something from you, the server. The URL is just a structured way to ask that question.
Going back to the authentication flow example:
- /login -> I would like to authenticate on your website
- /logout -> I would like to deauthenticate from your website
- /book/search?type=cooking -> Give me all the available cooking books
If one can break down the structure of these questions, then one can start trying to ask different questions.
If, for example, /book/search?type=cooking means “give me all the available cooking books”, then /book/search?type=<search> is a way to ask for books of different types. This works because it is easy to identify that a query parameter acts as an input, and because we humans (get out of here, AI) can usually understand why something is named the way it is.
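That is essentially what the grouping step has to do: keep the path, turn likely identifiers and query values into placeholders, and let URLs that ask the same question with different inputs collapse into one template. A minimal sketch of the idea (the placeholder syntax is just my choice for illustration):

```python
from urllib.parse import urlparse, parse_qsl

def resource_template(url: str) -> str:
    """Collapse a URL into a template that identifies the underlying resource."""
    parsed = urlparse(url)
    # Replace purely numeric path segments (likely IDs) with a placeholder.
    segments = ["<id>" if seg.isdigit() else seg for seg in parsed.path.split("/")]
    path = "/".join(segments)
    # Keep the parameter names but drop their values: the names describe the
    # question, the values are just one particular input.
    params = sorted(name for name, _ in parse_qsl(parsed.query, keep_blank_values=True))
    if params:
        return path + "?" + "&".join(f"{name}=<{name}>" for name in params)
    return path

# Both collapse into /book/search?type=<type>
print(resource_template("https://example.com/book/search?type=cooking"))
print(resource_template("https://example.com/book/search?type=horror"))
```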
Structuring rules
Determining whether one web resource is more interesting than another is not that hard; it basically comes down to its potential for exploitation. For example, the 2021 OWASP Top 10 lists injection vulnerabilities as the third most critical risk. According to their own definition, injection occurs when “User-supplied data is not validated, filtered, or sanitized by the application”.
In simpler terms: Never trust user input. If you do, have a gun within arm’s reach.
My first rule draft looked something like this:
| Description | Score |
|---|---|
| Contains query parameters | 1 |
| Contains words related to authentication (auth, login, logout, …) | 2 |
| Contains words related to external resources (url, path, file, filename, …) | 2 |
With this system, URLs featuring query parameters in authentication contexts or referencing file imports would rank highest. URLs with just query parameters would fall in the middle, while all others would rank lower.
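In code, this first draft is little more than a list of (check, score) pairs summed per URL (a sketch of the idea; the word lists are illustrative, not the actual ones):

```python
from urllib.parse import urlparse

AUTH_WORDS = ("auth", "login", "logout", "password", "signin")
EXTERNAL_WORDS = ("url", "path", "file", "filename", "redirect")

RULES = [
    (lambda u: bool(urlparse(u).query), 1),                                  # query parameters
    (lambda u: any(w in urlparse(u).path.lower() for w in AUTH_WORDS), 2),   # authentication-related
    (lambda u: any(w in u.lower() for w in EXTERNAL_WORDS), 2),              # external resource
]

def score(url: str) -> int:
    return sum(points for rule, points in RULES if rule(url))

urls = [
    "https://example.com/about",
    "https://example.com/book/search?type=cooking",
    "https://example.com/login?redirect_url=/home",
]
# Most relevant first: the login URL with a redirect parameter ends up on top.
for url in sorted(urls, key=score, reverse=True):
    print(score(url), url)
```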
Over time, I expanded these rules to consider additional factors, such as:
- Is the resource alive (does it return HTTP 200)?
- What type of content does it return (HTML or JSON)?
- Does the HTML content include a form?
And in full, it looked something like this:
| Category | Subcategory | Description | Score |
|---|---|---|---|
| Known Pattern | Authentication | URL contains words related to the authentication flow (login, logout, auth, password, …) | 1 |
| Known Pattern | Search | URL contains words related to search (search, query, graphql, …) | 2 |
| Known Pattern | Form | Contains form tag | 1 |
| Known Pattern | Hidden Input | Contains form tag with hidden input | 1 |
| Known Pattern | File Upload | Contains form tag with input of type file | 2 |
| Visibility | Private | Page returned 403 or a similar unauthenticated status | 1 |
| Visibility | Public | Page returned without requiring any authentication | 2 |
| Client-side Complexity | Static Page | Page only contains static information | 1 |
| Client-side Complexity | Simple JS | Page contains JS code | 2 |
| Client-side Complexity | API JS | Page contains JS code that includes external calls (fetch, axios, …) | 3 |
| Client-side Complexity | Shady JS | Page contains JS code that includes possibly vulnerable methods (eval, innerHTML, …) | 4 |
| Server-side Complexity | Retrieval Request | Request sent as GET with no parameters returns 200 with content | 1 |
| Server-side Complexity | Modifying Request | Request sent as GET responded with 405; subsequent requests using POST, PUT, DELETE, … produced a different 4xx error | 2 |
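Most of these signals come down to one GET request and a quick look at the response. A simplified sketch of that probe (plain string matching on the body stands in for real HTML/JS parsing, and the method check against 405 responses is left out):

```python
import requests

def probe(url: str) -> dict:
    """Collect the response facts that the rules above score on."""
    resp = requests.get(url, timeout=10, allow_redirects=False)
    content_type = resp.headers.get("Content-Type", "")
    body = resp.text.lower()
    return {
        "alive": resp.status_code == 200,
        "private": resp.status_code in (401, 403),
        "html": "text/html" in content_type,
        "json": "application/json" in content_type,
        "has_form": "<form" in body,
        "has_hidden_input": 'type="hidden"' in body,
        "has_file_upload": 'type="file"' in body,
        "api_js": any(marker in body for marker in ("fetch(", "axios")),
        "shady_js": any(marker in body for marker in ("eval(", "innerhtml")),
    }

# facts = probe("https://example.com/login")
# Each True entry maps to one of the rules in the table above.
```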
Outro
As I got into evaluating what content each URL returned, I started to think about the next step: what to do once I found something that looked promising?
I eventually stumbled upon this article from James Kettle about Backslash Powered Scanning, which is what pulled me down the rabbit hole of vulnerability scanning.
A topic I’ll be going into more depth on in the future.