ACM Transactions on the Web, 2(1), Art. 4p. (2008) DOI:10.1145/1326561.1326565

Detecting splogs via temporal dynamics using self-similarity analysis

Y.-R. Lin, H. Sundaram, Y. Chi, J. Tatemura, B. L. Tseng

This article addresses the problem of spam blog (splog) detection using temporal and structural regularity of content, post time and links. Splogs are undesirable blogs meant to attract search engine traffic, used solely for promoting affiliate sites. Blogs represent popular online media, and splogs not only degrade the quality of search engine results, but also waste network resources. The splog detection problem is made difficult due to the lack of stable content descriptors.

We have developed a new technique for detecting splogs, based on the observation that a blog is a dynamic, growing sequence of entries (or posts) rather than a collection of individual pages. In our approach, splogs are recognized by their temporal characteristics and content. There are three key ideas in our splog detection framework. (a) We represent the blog temporal dynamics using self-similarity matrices defined on the histogram intersection similarity measure of the time, content, and link attributes of posts, to investigate the temporal changes of the post sequence. (b) We study the blog temporal characteristics using a visual representation derived from the self-similarity measures. The visual signature reveals correlation between attributes and posts, depending on the type of blogs (normal blogs and splogs). (c) We propose two types of novel temporal features to capture the splog temporal characteristics. In our splog detector, these novel features are combined with content based features. We extract a content based feature vector from blog home pages as well as from different parts of the blog. The dimensionality of the feature vector is reduced by Fisher linear discriminant analysis. We have tested an SVM-based splog detector using proposed features on real world datasets, with appreciable results (90% accuracy).

back


Creative Commons License © 2017 SOME RIGHTS RESERVED
The content of this web site is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 2.0 Germany License.

Please note: The abstracts of the bibliography database may underly other copyrights.

Ihr Browser versucht gerade eine Seite aus dem sogenannten Internet auszudrucken. Das Internet ist ein weltweites Netzwerk von Computern, das den Menschen ganz neue Möglichkeiten der Kommunikation bietet.

Da Politiker im Regelfall von neuen Dingen nichts verstehen, halten wir es für notwendig, sie davor zu schützen. Dies ist im beidseitigen Interesse, da unnötige Angstzustände bei Ihnen verhindert werden, ebenso wie es uns vor profilierungs- und machtsüchtigen Politikern schützt.

Sollten Sie der Meinung sein, dass Sie diese Internetseite dennoch sehen sollten, so können Sie jederzeit durch normalen Gebrauch eines Internetbrowsers darauf zugreifen. Dazu sind aber minimale Computerkenntnisse erforderlich. Sollten Sie diese nicht haben, vergessen Sie einfach dieses Internet und lassen uns in Ruhe.

Die Umgehung dieser Ausdrucksperre ist nach §95a UrhG verboten.

Mehr Informationen unter www.politiker-stopp.de.