Technorati has a number of initiatives in the works to improve the data in our search indexes and analytics systems. Web spam sites (splogs) have long been an issue that we've been working to address. The days when pings came only from legitimate blogs are long gone. Including all of the spam and duplicates, Technorati receives over 8 million pings per day. Over 90% are recognized and blocked as soon as they're received. The remainder is allowed into the system and selectively processed; a large portion of it is later determined to be spam.
Recently, we've been focusing on link farms and pornography sites that have been getting into the system. Link farms are networks of sites that link to each other and to other sites with the intention of raising search rankings. Sometimes these sites link to legitimate blogs to "camouflage" those intentions, or simply because the content has been stolen from another site. During a recent scrub of the system, a number of legitimate blogs were misidentified as spam. The flags set on those blogs have been reversed, so going forward they are being indexed correctly again. However, some of the link and post data scrubbed from our search and analytics systems could not be restored. We're working on upgrades to manage that data better, but in the meantime there are gaps in certain blogs' data, which may affect the authority of the blogs they linked to. Additionally, some legitimate blogs suffered authority drops because they had been receiving camouflaged links from spam sites (wittingly or not); when those spam sites were removed, so was the portion of the legitimate blogs' authority that came from those links.
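To make the authority effect concrete, here is a minimal sketch in Python. It is an illustration only, not Technorati's actual implementation: it assumes, for simplicity, that a blog's authority is the number of distinct sites linking to it, and the `authority` function, the site names, and the link data are all hypothetical.

```python
from collections import defaultdict


def authority(links, excluded=frozenset()):
    """Count distinct inbound linking sites per target.

    Assumption for illustration: authority == number of distinct
    sources linking to a site. Links from excluded (spam-flagged)
    sources and self-links are ignored.
    """
    inbound = defaultdict(set)
    for source, target in links:
        if source in excluded or source == target:
            continue
        inbound[target].add(source)
    return {target: len(sources) for target, sources in inbound.items()}


# Hypothetical link graph: two legitimate blogs and a two-site link farm
# that also links to blog-b as "camouflage".
links = [
    ("blog-a", "blog-b"),
    ("blog-c", "blog-b"),
    ("farm-1", "farm-2"),
    ("farm-2", "farm-1"),
    ("farm-1", "blog-b"),
    ("farm-2", "blog-b"),
]
spam = {"farm-1", "farm-2"}

before = authority(links)                  # spam links still counted
after = authority(links, excluded=spam)    # spam sources scrubbed

print("blog-b before scrub:", before["blog-b"])
print("blog-b after scrub:", after["blog-b"])
```

In this toy graph, blog-b loses the share of its authority that came from the farm's links once the farm is scrubbed, even though blog-b itself was never flagged.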
We have a number of technology initiatives in the works to improve the scaling characteristics and data quality of our systems. More news on that will arrive in the weeks and months ahead.




