Web Graph Sociology Research Initiative


Frequently Asked Questions

How can I view the SVG files on display in the research section?
How are web graphs generated?
How dependent on the specific URL seeds is the resulting web graph?
What is issue drift?

How can I view the SVG files on display in the research section?
Web graphs generated by the Issue Crawler carry a great deal of dynamic information that cannot be displayed in traditional image formats such as JPEG and GIF. SVG, an XML-based vector graphics format standardized by the W3C, supports dynamic, interactive, data-intensive display of Issue Crawler data. Viewing graphs in SVG format requires a special plug-in, the Adobe SVG Viewer. Some browsers, such as Mozilla Firefox (particularly the Windows version), may have trouble loading the plug-in. Much of this data is also provided in alternate formats, namely GIF or JPEG images and tabular data, but we encourage serious researchers to look at the native SVG files.
How are web graphs generated?
Web graphs displayed at the Web Graph Sociology Research Initiative are visualizations of network data retrieved by the Issue Crawler, a server-based software application that crawls and analyzes the link structures of targeted web pages and then produces two-dimensional cluster maps of the resulting network. Web pages are chosen based on recognition of their significance within a particular issue space.

Co-link Analysis from URL seeds
The crawler retrieves the pages that the seed URLs link to and then performs co-link analysis on the resulting data. Pages that receive links from at least two of the original seed URLs are kept in the data set. All other links are thrown out.

This process can be performed in one, two or three iterations. Each subsequent iteration produces a larger, denser network. A single iteration can be thought of as an immediate "social network" of the original actors defined in the seed URLs. This kind of network is more likely to display a set of actors who have close semantic relationships than networks of two or three iterations are. Actors in a single-iteration "social network" are likely to be more focused on the issue in question than actors in a two-iteration network. A two-iteration "issue network" is likely to show associations between nodes at a broader ontological or categorical level.
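
The co-link rule and its iterations can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Issue Crawler's actual code: the function names, the `outlinks` dictionary standing in for crawled pages, and the choice to feed each iteration's results back in alongside the earlier seeds are all hypothetical.

```python
from collections import defaultdict

def colink_step(outlinks, seeds, min_seeds=2):
    """Return pages that receive links from at least `min_seeds` distinct seeds."""
    linkers = defaultdict(set)
    for seed in seeds:
        for target in outlinks.get(seed, ()):
            if target != seed:  # ignore self-links
                linkers[target].add(seed)
    return {page for page, sources in linkers.items() if len(sources) >= min_seeds}

def colink_crawl(outlinks, seeds, iterations=1):
    """Repeat the co-link step, treating each result as additional seeds.

    Assumption: earlier seeds stay in the network between iterations.
    """
    network = set(seeds)
    for _ in range(iterations):
        network |= colink_step(outlinks, network)
    return network

# Hypothetical link data: page -> pages it links to.
outlinks = {
    "a.org": ["c.org", "d.org"],
    "b.org": ["c.org", "e.org"],
    "c.org": ["d.org", "f.org"],
    "d.org": [],
}
seeds = ["a.org", "b.org"]

one_iteration = colink_crawl(outlinks, seeds, iterations=1)
two_iterations = colink_crawl(outlinks, seeds, iterations=2)
```

In this toy data only c.org is linked by both seeds, so it is the only page added in the first iteration. In the second iteration d.org joins as well, because it is now linked by both a.org and c.org, which shows how later iterations widen the network.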

The crawler is also told how deep it should crawl within an individual domain. If, for example, the crawler were to retrieve only the home page of a given site and store its outward links, it would be crawling at a depth of one. If the crawler were instructed to also retrieve pages within the domain that are linked directly from the home page, it would be crawling at a depth of two. All iterations are crawled at the same specified depth.
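
The meaning of crawl depth can be illustrated with a small sketch that works on an in-memory link table rather than the live web. The function name `crawl_site`, the `links` dictionary and the example URLs are assumptions made for this illustration. They are not part of the Issue Crawler.

```python
from urllib.parse import urlparse

def crawl_site(start_url, links, depth):
    """Collect outward links from one site, reading pages up to `depth` levels deep.

    Depth 1 reads only the start page; depth 2 also reads the internal
    pages linked directly from it, and so on.
    """
    host = urlparse(start_url).netloc
    frontier, visited, outward = [start_url], set(), set()
    for _ in range(depth):
        next_frontier = []
        for page in frontier:
            if page in visited:
                continue
            visited.add(page)
            for target in links.get(page, ()):
                if urlparse(target).netloc == host:
                    next_frontier.append(target)  # internal link: read at next level
                else:
                    outward.add(target)           # external link: store it
        frontier = next_frontier
    return outward

# Hypothetical link table: page URL -> URLs found on that page.
links = {
    "http://site.example/": ["http://site.example/about", "http://other.example/"],
    "http://site.example/about": ["http://third.example/", "http://site.example/"],
}

depth_one = crawl_site("http://site.example/", links, depth=1)
depth_two = crawl_site("http://site.example/", links, depth=2)
```

At depth one only the link to other.example is stored. At depth two the link to third.example, found on the internal "about" page, is stored as well.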

Association Matrix and Visual Display
For the resulting data set, an association matrix is generated. This two-dimensional table has one row and one column for each URL in the data set. The row contains information about that URL's outgoing links to the network, and the column contains information about its incoming links from the network.
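
One common way to hold link data of this kind is a square matrix with a row and a column for every URL, where rows record outgoing links and columns record incoming links. The sketch below assumes that layout. The function name `association_matrix` and the sample URLs are hypothetical, and the Issue Crawler's internal format may differ.

```python
def association_matrix(urls, links):
    """Build a square 0/1 matrix: matrix[i][j] is 1 if urls[i] links to urls[j]."""
    index = {url: i for i, url in enumerate(urls)}
    matrix = [[0] * len(urls) for _ in urls]
    for source, target in links:
        if source in index and target in index:  # skip links that leave the data set
            matrix[index[source]][index[target]] = 1
    return matrix

# Hypothetical data set of three URLs and the links between them.
urls = ["a.org", "b.org", "c.org"]
links = [("a.org", "c.org"), ("b.org", "c.org"), ("c.org", "a.org")]

matrix = association_matrix(urls, links)
out_links = [sum(row) for row in matrix]       # row sums: outgoing links per URL
in_links = [sum(col) for col in zip(*matrix)]  # column sums: incoming links per URL
```

Summing a row gives the number of links a URL sends into the network, and summing a column gives the number it receives.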

Using the association matrix, all unique URLs in the data set can be displayed on a two-dimensional image. This image depicts each URL, or node, as a circle of varying size. The size of a node indicates the number of links it receives from the network relative to the other nodes in the network. The relative position of each node is determined by its network characteristics. Nodes that interconnect more often tend to appear together, in clusters. The specific method used is the ReseauLu method, developed by Aguidel, Paris. Arrows between nodes indicate the direction of each link. Nodes from the same top-level domain have the same color.
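
The sizing and coloring rules can be sketched as follows. This does not reproduce the ReseauLu layout, which decides where nodes are placed. It only shows one plausible way to scale node size by relative in-link count and to group nodes by top-level domain. The function name, the radius limits and the sample counts are assumptions made for this illustration.

```python
def node_attributes(in_links, min_radius=4.0, max_radius=20.0):
    """Map each host to a radius scaled by in-links and a top-level-domain label.

    The radius grows linearly with the host's in-link count relative to the
    most-linked host; hosts sharing a `tld` label would share a color.
    """
    top = max(in_links.values(), default=0)
    attrs = {}
    for host, count in in_links.items():
        share = count / top if top else 0.0
        attrs[host] = {
            "radius": min_radius + share * (max_radius - min_radius),
            "tld": host.rsplit(".", 1)[-1],
        }
    return attrs

# Hypothetical in-link counts for three hosts.
in_links = {"hub.org": 8, "shop.com": 2, "agency.gov": 0}
attrs = node_attributes(in_links)
```

Here the most-linked host gets the largest circle, a host with a quarter of its in-links gets a proportionally smaller one, and a host with no in-links is drawn at the minimum size.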
How dependent on the specific URL seeds is the resulting web graph?

This is an important question for which we are unable to provide a definitive answer. Web graphs are reflections of the seed URLs used to generate them, so there is an obvious dependency relationship between them. To find a network behind an issue, the researcher must be able to start from relevant seed URLs. This does not mean, however, that variations in the seed set will produce entirely different graphs, to the point where results are so subjective that they are meaningless. Generally speaking, if an issue network is "out there" on the web, it will show up in the graph, in fairly consistent form, as long as two or more key actors in the network appear in the seed.

Annenberg Ph.D. student Kenneth Farrall has been conducting research specifically targeting the seed dependency question. Recently, researchers studying electronic voting questioned the results of an electronic voting issue network crawl, which indicated that the Electronic Privacy Information Center (epic.org) had the most central position in the network. One researcher suggested that EPIC's centrality in the network graph was due solely to its presence in the original seed. Farrall ran the issue crawl again with EPIC removed from the seed. Even without being present in the seed, EPIC retained its dominant position within the graph. For complete results of this seed dependency test and others as they are completed, please see the "seed dependency" research topic page.
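
A leave-one-out check of this kind can be sketched on toy data. The code below is not the procedure or data used in the test described above. The host names, link table and function name are hypothetical, and the ranking simply counts how many seeds link to each page.

```python
from collections import Counter

def inlink_ranking(outlinks, seeds, min_seeds=2):
    """Rank pages by how many distinct seeds link to them, keeping co-linked pages only."""
    counts = Counter()
    for seed in seeds:
        for target in set(outlinks.get(seed, ())):
            if target != seed:  # ignore self-links
                counts[target] += 1
    return [(page, n) for page, n in counts.most_common() if n >= min_seeds]

# Hypothetical link table in which "hub.org" is linked by every other actor.
outlinks = {
    "hub.org": ["a.org"],
    "a.org": ["hub.org", "b.org"],
    "b.org": ["hub.org", "c.org"],
    "c.org": ["hub.org", "b.org"],
}

with_hub = inlink_ranking(outlinks, ["hub.org", "a.org", "b.org", "c.org"])
without_hub = inlink_ranking(outlinks, ["a.org", "b.org", "c.org"])
```

In this toy network hub.org ranks first whether or not it is included in the seed, because its position comes from the links it receives from the other actors rather than from its own presence in the seed list.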

What is issue drift?
Issue drift occurs when a crawl retrieves web sites that focus on a broader range of issues than the researcher is actually targeting. For example, a recent crawl of the RFID issue space returned a web graph which drifted from the RFID issue to more general electronic privacy and human rights issues such as electronic voting and spam. Although the issue drift phenomenon may be seen as a non-ideal result, it can provide valuable information regarding your targeted issue's broader position within important issues of debate in civic society.

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 2.5 License.