Resolve and shorten URLs in C#

Published Sep 13, 2007

Recently I needed a method that looks at some text, discovers all URLs in it, and turns them into hyperlinks. I’ve done that before, so it was a matter of copy/paste. This time it was a little more complicated, because the anchor text of the resolved URLs could not be longer than 50 characters. That was important because a long URL doesn’t word wrap, so it would bleed out of the design.

So the challenge was to resolve the URLs and turn them into links while keeping the anchor text at a maximum of 50 characters. Shortening a URL is easy enough, but it all comes down to how you want it shortened.

The rules

1. If the URL is longer than 50 characters, remove the protocol prefix (“http://”).
2. If it is still too long, compress the folder structure as shown below.

www.microsoft.com/windows/server/2003/compare.aspx -> www.microsoft.com/.../compare.aspx

3. If the URL is still too long, remove the query string and then the fragment.
4. If that is still not enough and the page name is longer than 10 characters, trim the start of the page name.
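Rule 2 can be sketched in isolation as follows. This is only an illustration, not the code used on this blog: the class name `FolderCompressionSketch` and the helper `CompressFolders` are hypothetical, and the sketch assumes the protocol has already been removed, so that the first slash marks the end of the host name.

```csharp
using System;

public static class FolderCompressionSketch
{
    // Hypothetical helper: keeps the host and the page name and
    // replaces everything between them with "...".
    // Assumes the protocol ("http://") has already been stripped.
    public static string CompressFolders(string url)
    {
        int firstSlash = url.IndexOf('/');     // end of the host name
        int lastSlash = url.LastIndexOf('/');  // start of the page name

        // No slash at all, or only one: there are no folders to compress.
        if (firstSlash < 0 || firstSlash == lastSlash)
            return url;

        return url.Substring(0, firstSlash + 1) + "..." + url.Substring(lastSlash);
    }

    public static void Main()
    {
        Console.WriteLine(CompressFolders("www.microsoft.com/windows/server/2003/compare.aspx"));
        // prints: www.microsoft.com/.../compare.aspx
    }
}
```

Building the result from the two ends of the string, rather than searching for the folder text and replacing it, avoids accidentally touching the same text if it also occurs in the host or page name.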

The code

[code:c#]

// Matches URLs that start with a protocol or with "www.".
// The hyphen is placed last in the host character class so it is read literally and not as a range.
private static readonly Regex regex = new Regex("((https?://|www\\.)([A-Z0-9.:-]{1,})\\.[0-9A-Z?;~&#=\\-_\\./]{2,})", RegexOptions.Compiled | RegexOptions.IgnoreCase);
private static readonly string link = "<a href=\"{0}{1}\">{2}</a>";

public static string ResolveLinks(string body)
{
    if (string.IsNullOrEmpty(body))
        return body;

    // Replace every match in a single pass. Replacing the matches one by one
    // with string.Replace would rewrite a URL that occurs twice inside the
    // anchor that was already generated for its first occurrence.
    return regex.Replace(body, match =>
    {
        // URLs written without a protocol need one in the href attribute.
        string prefix = match.Value.Contains("://") ? string.Empty : "http://";
        return string.Format(link, prefix, match.Value, ShortenUrl(match.Value, 50));
    });
}

private static string ShortenUrl(string url, int max)
{
    if (url.Length <= max)
        return url;

    // Remove the protocol
    int startIndex = url.IndexOf("://");
    if (startIndex > -1)
        url = url.Substring(startIndex + 3);

    if (url.Length <= max)
        return url;

    // Compress the folder structure: keep the host and the page,
    // and put "..." in place of the folders between them
    int firstIndex = url.IndexOf("/") + 1;
    int lastIndex = url.LastIndexOf("/");
    if (firstIndex < lastIndex)
        url = url.Substring(0, firstIndex) + "..." + url.Substring(lastIndex);

    if (url.Length <= max)
        return url;

    // Remove URL parameters
    int queryIndex = url.IndexOf("?");
    if (queryIndex > -1)
        url = url.Substring(0, queryIndex);

    if (url.Length <= max)
        return url;

    // Remove URL fragment
    int fragmentIndex = url.IndexOf("#");
    if (fragmentIndex > -1)
        url = url.Substring(0, fragmentIndex);

    if (url.Length <= max)
        return url;

    // Shorten the page name by trimming characters from its start
    firstIndex = url.LastIndexOf("/") + 1;
    lastIndex = url.LastIndexOf(".");
    if (lastIndex - firstIndex > 10)
    {
        string page = url.Substring(firstIndex, lastIndex - firstIndex);
        // Characters to cut: the excess length plus 3 for the "..." that is added
        int length = url.Length - max + 3;
        // Only trim if the page name is long enough to absorb the cut
        if (length < page.Length)
            url = url.Substring(0, firstIndex) + "..." + page.Substring(length) + url.Substring(lastIndex);
    }

    return url;
}

[/code]

Implementation

To use these methods, just call the ResolveLinks method like so:

[code:c#]

string body = ResolveLinks(txtComment.Text);

[/code]

It works on URLs with or without the http:// protocol prefix. In other words, www.example.com and http://www.example.com resolve to the same URL. This technique is implemented in the comments on this blog. You can test it by writing a comment with a URL in it.

Cookies and Unicode characters

Published Sep 9, 2007

I’ve been having some issues today with storing Unicode characters in cookies. Whenever a cookie value contains Unicode characters, the same characters cannot be read back from the cookie. When the browser sends the cookie back with the next request, they have been changed into something totally unreadable.

Background

The cookie is set when a visitor enters some text into a textbox and submits the form. When the same visitor returns to that page, I wanted to pre-fill the textbox with the value submitted earlier. It was easy and simple, and I thought it worked just fine until someone noticed the strange behaviour with Unicode characters.

Because the value was displayed in a textbox, I thought that maybe HTML encoding could solve the issue. Don’t ever HTML encode a cookie in ASP.NET! It results in a yellow screen of death and an exception stating that the cookie contains dangerous characters. The dangerous character it was referring to was an HTML-encoded representation of a Unicode character and looked something like this: "&#248;". The only thing to do is to delete your cookies in order to view that page again.

The solution

It took me a while to figure it out, but all you need to do is to URL encode the cookie value. It works no matter what encoding you use for the page. The example below illustrates the very simple solution:

[code:c#]

private void SetCookie()
{
    HttpCookie cookie = new HttpCookie("cookiename");
    cookie.Expires = DateTime.Now.AddMonths(24);
    cookie.Values.Add("name", Server.UrlEncode(txtName.Text));
    Response.Cookies.Add(cookie);
}

private void GetCookie()
{
    HttpCookie cookie = Request.Cookies["cookiename"];
    if (cookie != null)
    {
        txtName.Text = Server.UrlDecode(cookie.Values["name"]);
    }
}

[/code]
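To see why URL encoding helps, it can be useful to look at what actually ends up in the cookie. The sketch below is an illustration that runs outside ASP.NET, so it uses `Uri.EscapeDataString` and `Uri.UnescapeDataString` as stand-ins for `Server.UrlEncode` and `Server.UrlDecode`. The exact output of the two encoders differs slightly (for example in how a space is written), but the idea is the same: the encoded value consists of plain ASCII characters only. The class name `CookieValueSketch` and its helper methods are hypothetical.

```csharp
using System;

public static class CookieValueSketch
{
    // Stand-in for Server.UrlEncode: percent-encodes the UTF-8 bytes of the value.
    public static string Encode(string value)
    {
        return Uri.EscapeDataString(value);
    }

    // Stand-in for Server.UrlDecode: turns the percent-escapes back into text.
    public static string Decode(string value)
    {
        return Uri.UnescapeDataString(value);
    }

    // True when every character is in the ASCII range.
    public static bool IsAscii(string value)
    {
        foreach (char c in value)
        {
            if (c > 127)
                return false;
        }
        return true;
    }

    public static void Main()
    {
        string original = "Søren Ærø";
        string encoded = Encode(original);

        Console.WriteLine(encoded);                      // the ASCII-only value stored in the cookie
        Console.WriteLine(IsAscii(encoded));             // True
        Console.WriteLine(Decode(encoded) == original);  // True
    }
}
```

Because only ASCII characters travel in the cookie header, the value presumably no longer depends on how the browser and server encode the header, which would explain why the approach works regardless of the page encoding.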

It is so simple, but it cost me a lot of time investigating and clearing cookies from the browser.