At work we recently went through a security audit on our application. Even though our application is behind a firewall, internal only, and requires authentication, we still needed to make sure it was protected. One of the things that came up was being vulnerable to XSS. There are currently two places in the application that allow a user to create, save, and view HTML. If an attacker got onto a user’s computer, they could post some malicious HTML in this application, and when another user viewed the page containing that HTML, the attacker’s script would run in that user’s browser.
As a side note (which is relevant to the code below), this application is built on .NET and written in C#.
We couldn’t just take away the ability to post HTML either; it was a widely used feature. We could search for a <script> tag, but there are enough variations and attribute tricks someone could use to slip one past a search that we would eventually miss something. And beyond the script tag, someone could post a link to a malicious website, and we couldn’t block links either. So, what can you do about this? Here are the first ideas I came up with:
- Not allow HTML
- Use something like Markdown and convert that to allowed HTML
- Parse the HTML and compare against a whitelist
Like I said, #1 was not an option for us. We could do #2, but since a lot of our users are not very technical, that would make the feature difficult for them to use. We could do #2 and provide some kind of WYSIWYG editor on top of it. That is probably a good option, but we also wanted to let the people who knew HTML keep writing HTML rather than learn a new markup language. So, we opted for #3.
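For what it’s worth, #2 would not have been much code. With a library like MarkdownSharp it would look something like this (just a sketch of the idea; we never actually built it, and userText here is simply whatever the user typed in):

using MarkdownSharp;

// Users write Markdown instead of raw HTML; we convert it to HTML for display
var markdown = new Markdown();
string html = markdown.Transform(userText);

The code is trivial; the cost is asking non-technical users to learn a new markup language, which is why we passed on it.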
When I started researching this I came across something from Microsoft called the AntiXSS Library. Everything I read suggested it was exactly what I was looking for. There was a method called GetSafeHtmlFragment which promised to do a lot of what I wanted. After playing around with the library, though, I noticed it wasn’t doing what it should. I thought I was doing something wrong, so I kept trying different things.
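For reference, the call itself could not be much simpler. In the 4.x versions of the library the method hangs off the Sanitizer class, and using it looks roughly like this:

using Microsoft.Security.Application;

// Supposed to return the fragment with anything dangerous stripped or encoded
string safeHtml = Sanitizer.GetSafeHtmlFragment(htmlText);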
Frustrated, I turned to the web some more. After further research, it turned out that Microsoft had broken the functionality in that method, with no word on when (or if) it would be fixed.
The next idea was to use the HtmlEncode method of HttpUtility and then replace the encoded values of allowed tags with the actual tags. So, something like this:
string encodedHtml = HttpUtility.HtmlEncode(htmlText);
StringBuilder sb = new StringBuilder(encodedHtml);
// Turn the encoded versions of whitelisted tags back into real tags
sb.Replace("&lt;b&gt;", "<b>");
sb.Replace("&lt;/b&gt;", "</b>");
And so on. I liked this idea because it meant I was only allowing a whitelisted set of HTML tags. But as I got further down this path, it got rather complicated. What about attributes? What about bad code inside of attributes? Well, maybe regular expressions could help with that!
My next test code looked like this:
Regex reg = new Regex("&lt;table\\s(((.+)=&quot;((?:.(?!&quot;(?:\\S+)=|&quot;&gt;))*.?)&quot;)|((.+)='((?:.(?!'(?:\\S+)=|'&gt;))*.?)'))*(&gt;)?");
Match m = reg.Match(data);   // data holds the HTML-encoded text
if (m.Success)
{
    string matchVal = m.Value;
    // strip the leading "&lt;" and trailing "&gt;" (4 characters each)
    string matchValReplaced = matchVal.Substring(4, matchVal.Length - 8);
    string decoded = System.Web.HttpUtility.HtmlDecode(matchValReplaced);
    data = data.Replace(matchVal, "<" + decoded + ">");
}
Again, that seemed to work. I could create a regular expression like this for each valid tag (or roll them all into one big expression). But it still seemed rather complicated, prone to error, and like it would be a hassle to maintain.
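To give an idea of where that was headed, extending the approach to a handful of tags would have looked something like this. This is only a sketch: the allowedTags list is made up, the attribute handling is deliberately simplified, and encoded quotes inside attribute values still are not handled.

using System.Text.RegularExpressions;
using System.Web;

string[] allowedTags = { "b", "i", "u", "table", "tr", "td" };
string data = HttpUtility.HtmlEncode(htmlText);

foreach (string tag in allowedTags)
{
    // Match the encoded form of an opening or closing tag, e.g. &lt;b&gt; or &lt;/b&gt;,
    // optionally with simple attributes (nothing containing an encoded character)
    var reg = new Regex("&lt;(/?" + tag + "(?:\\s[^&]*)?)&gt;", RegexOptions.IgnoreCase);
    data = reg.Replace(data, m => "<" + HttpUtility.HtmlDecode(m.Groups[1].Value) + ">");
}

Every new tag, attribute, or edge case means touching the expressions again, which is exactly the maintenance problem I was worried about.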
Before settling on this solution I wanted to check the web some more. I then came across a blog article (http://eksith.wordpress.com/2012/02/13/antixss-4-2-breaks-everything/) by eksith, who was in the same boat: they needed to solve the same problem for the same reasons, and they solved it with much better code than what I had hacked together!
Their code didn’t require much modification for our use, but I did change a few things. I added a few more tags to the ValidHtmlTags dictionary, as well as some other attributes our users were already using. I also still had to solve the issue of not allowing users to link outside of our website. To do that, I created a variable containing a list of valid prefixes for the href attribute.
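The list itself is nothing fancy. The URLs below are just placeholders, but it was along these lines:

// Placeholder values; the real list pointed at our own site
private static List<string> ValidBaseUrls = new List<string>
{
    "/",                            // relative links within the application
    "http://ourapp.internal/",
    "https://ourapp.internal/"
};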
Then I added this (after line 143 of the original code):
if (a.Name == "href")
{
    a.Value = a.Value.ToLower();
    // Only keep links that start with one of our approved base URLs
    bool isValid = ValidBaseUrls.Any(s => a.Value.StartsWith(s));
    if (!isValid)
        a.Value = "#";
}
If a user added an anchor tag that linked to somewhere other than our valid list, that href attribute would get replaced with a pound sign.
One of the cool things about this class is that everything other than the allowed HTML tags and attributes gets encoded. That way, when you display the saved content on your page, only the allowed tags are rendered as HTML.
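To illustrate, a round trip looks something like this (the HtmlSanitizer/SanitizeHtml names are just stand-ins for however you wire up eksith’s class):

// Hypothetical wrapper around the whitelist sanitizer described above
string input = "Hello <b>world</b> <script>alert('xss');</script>";
string output = HtmlSanitizer.SanitizeHtml(input);

// output comes back roughly as:
//   Hello <b>world</b> &lt;script&gt;alert('xss');&lt;/script&gt;
// The <b> tag is on the whitelist so it stays as markup; the <script> tag is not,
// so it gets encoded and renders as harmless text in the browser.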