I'd like to preface this entry by saying SEOmoz does not practice or endorse blackhat SEO techniques. This is not intended to be an instructional blackhat SEO article or a list of websites you should all go take advantage of. The goal of this post is, rather, to "out" a significant weakness that can be exploited by savvy users.

While reading EGOL's recent post on .gov links, I started brainstorming possible ways to creatively acquire a few .gov links of my own. Thus was born my first foray into the world of blackhat SEO. About a year ago I heard about how webmasters were all running scared because malicious users could easily place HTML code into their form input boxes and manipulate the markup on their sites (aka XSS, or cross-site scripting). I was curious as to how difficult this actually was, so I decided to investigate.

After running a Yahoo site: command I was able to get a list of search forms from hundreds of .gov sites. I used the web developer toolbar to convert HTML form methods from POST to GET, making the search results link-able, inserted a few HTML tags into the search boxes, and voila: I had 20 links from .gov websites pointing to my site. Once these pages were created, in theory all I'd have to do is link to them from various other domains and they'd eventually get spidered and start passing link love.
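To make the mechanism concrete, here is a minimal sketch of why this works and what the fix looks like on the site's side. It is not taken from any of the sites below; the two `render_results_*` functions are hypothetical stand-ins for a search results template that echoes the query back to the visitor. The only real library call is Python's `html.escape`.

```python
import html


def render_results_unsafe(query: str) -> str:
    # Hypothetical vulnerable template: the query is echoed into the page verbatim,
    # so any markup typed into the search box becomes part of the page.
    return f"<p>Your search for {query} returned 0 results.</p>"


def render_results_safe(query: str) -> str:
    # html.escape turns <, >, & and quotes into entities, so the browser
    # displays the markup as text instead of rendering it.
    return f"<p>Your search for {html.escape(query)} returned 0 results.</p>"


query = '<a href="http://www.example.com/">Look, I made a link</a>'
print(render_results_unsafe(query))  # contains a live link
print(render_results_safe(query))    # contains only harmless text
```

The unsafe version produces a page with a working link in it; the safe version shows the same characters to the visitor but gives a spider nothing to follow.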

In the list below I only linked to www.example.com (a domain reserved for documentation - RFC 2606) and used the anchor text "Look, I made a link" to make the links obvious to spot.  The list below shows the compromised pages:

  1. Environmental Protection Agency
  2. United States Department of Commerce
  3. NASA - This one was a bit tricky: I had to throw in some extra markup to make sure the rendered HTML wasn't mangled. In the end I managed to get a link embedded inside a giant h1 tag.
  4. The Library of Congress - I even added an image of a kitty wearing a watermelon helmet
  5. US Securities and Exchange Commission
  6. Official California Legislative Information
  7. US Department of Labor 
  8. Office of Defects Investigation - Their website is the only thing that's defective :)
  9. National Institutes of Health
  10. US Dept of Health & Human Services
  11. Missouri Secretary of State
  12. California Department of Health Services
  13. Hawaii Department of Commerce and Consumer Affairs
  14. IDL Astronomy Library Search
  15. US Department of Treasury
  16. California State Legislature
  17. Office of Extramural Research
  18. Health Information Resource Database (health.gov)
  19. United States Postal Service - I had to jump through a few advanced forms before I found one I could use.
  20. North Dakota Legislative Branch

Many of these URLs ended up being very long and gnarly, possibly discounting any value they might pass.

I see a few possible solutions to the problem (assuming this is a problem):

  1. All those sites need to be informed of the exploit and start validating form input (and escaping it before echoing it back into the page). Unfortunately, this is just a quick list I put together this evening; I'm sure there are thousands (if not millions) of sites out there that are vulnerable.
  2. The SEs need to de-value links that are found on a site search results page (perhaps they already do this?). These exploits aren't limited to search results, however; you could do this on any HTML form that wasn't properly validating input.
  3. The SEs could greatly de-value links that aren't linked to from the rest of the site. These injected pages are essentially "floaters": pages that are not linked to anywhere on the site but have incoming external links. Do the SEs already do this? 
  4. De-value pages that contain HTML in the URL (both encoded and unencoded), particularly if it contains A tags.
  5. Disallow indexing of any forms via robots.txt or a meta tag. Again, this would require work on the .gov webmasters' part, and changes are probably made at the speed of molasses (assuming the web departments of the government work as slowly as the rest of it).
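As a rough illustration of solution 4, a crawler could flag URLs whose query string contains HTML tags, whether raw or percent-encoded. This is only a sketch under my own assumptions: `html_tags_in_url`, `looks_injected`, the regular expression, and the sample URLs are all hypothetical, and I have no idea whether any search engine does anything like this.

```python
import re
from urllib.parse import urlsplit, unquote_plus

# Matches an opening tag such as <a href=...>, <h1> or <img src=...>.
TAG_RE = re.compile(r"<\s*([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>")


def html_tags_in_url(url, max_decodes=3):
    """Return the lowercase tag names found in the URL's query string.

    The query is checked as-is and again after each round of
    percent-decoding, so encoded and double-encoded markup is caught too.
    """
    candidate = urlsplit(url).query
    found = set()
    for _ in range(max_decodes + 1):
        found.update(m.group(1).lower() for m in TAG_RE.finditer(candidate))
        decoded = unquote_plus(candidate)
        if decoded == candidate:
            break  # nothing left to decode
        candidate = decoded
    return found


def looks_injected(url):
    # Anchor tags are the strongest signal for link injection.
    return "a" in html_tags_in_url(url)


encoded = ("http://search.example.org/results?q="
           "%3Ca+href%3D%22http%3A%2F%2Fwww.example.com%2F%22%3ELook%3C%2Fa%3E")
plain = "http://search.example.org/results?q=mars+rover"
print(looks_injected(encoded), looks_injected(plain))
```

A real implementation would need to look at path segments and POST-derived URLs as well, and decide how much to discount the page rather than treating the check as a yes/no answer.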

What do you all think?  Would these injected links pass link love or is this simply something that search engines already account for and is a non-issue?

SEO aside, this could also be used for phishing scams. For example, an attacker could build a fake payment form on the nasa.gov website asking for $100.00 for whatever reason. The form would then POST to another server, the payment data would be stored, and the victim would be forwarded back to another exploited nasa.gov page with a "thanks for your payment" message. The user would never know they'd been duped - frightening to say the least.

UPDATE (from Rand): We originally pulled this post on concerns that it could spark legal issues or create more problems than it helped solve. However, after consultation with several folks, we've decided that sweeping the problem under the carpet is more detrimental than getting it out in the open.