How to Use the Issue Crawler
The following documentation for the Issue Crawler web site is reproduced directly from the Govcom.org Foundation web site.
Issuecrawler.net
Instructions of Use
1.
FAQ
Welcome to the Issue Crawler, the network
mapping software by the Govcom.org Foundation,
Amsterdam. This is the online documentation.
1.1
Before you begin
Download the SVG viewer plug-in at http://www.adobe.com/svg.
If you are using MS-Windows and Internet Explorer,
version 6.0+ is recommended.
Broadband connectivity is advised. If you are using Mozilla with MS-Windows, use this plug-in: http://www.adobe.com/svg/viewer/install/beta.html. For Linux and more information, see http://www.w3.org/Graphics/SVG/SVG-Implementations. SVG will soon be native in browsers.
1.2
Quick start
Enter at least two related
URLs in the Issue Crawler, harvest, name your
crawl and launch your crawl.
The Issue Crawler
is web network location software.
It consists of a crawler, a co-link analysis
engine and two visualisation modules. It is
server-side software that crawls specified
sites, captures the outlinks from the specified
sites, performs co-link analysis on the outlinks,
returns densely interlinked networks, and
visualises them in circle and cluster
maps. For user tips, see also scenarios
of use, available at http://www.govcom.org/scenarios_use.htm.
For a list of articles resulting from the
use of the Issue Crawler, see http://www.govcom.org/publications.html.
The following is
a step-by-step guide to using the software.
2.
Log in
Enter Username
and Password
Remember me? Checking the
box has the software remember your username
and password for future use. (A cookie is
used.) Your browser is also able to remember
your log-ins.
Forgot password? Type your username
or email address into the username field and press
login. If you are a valid user, a new password
is sent to your email address.
3.
The Lobby
The Lobby is so named for
the area where one waits for crawls to complete.
Crawl completion time varies
between 10 minutes and 8 hours, depending
on the number of servers from which the crawler
requests pages. The Crawler also may crash
should the machine on which it is hosted run
out of memory. Care is taken to use machines
with specifications that result in the fewest
crashes.
Whilst waiting, users may read news
about the software and the results people
have generated. (News is posted by the administrators
of the software.) Users also may view maps
in the archive as well as launch additional
crawls.
To the right is the listing of current
crawls. Crawls are either crawling
or queued (i.e., ‘waiting to be launched’).
Crawls run sequentially. You may view the
author, email address, and settings of the
current crawl, as well as a live view of the
crawl. You also may view the progress of the
current crawl, including an estimated completion
time, based on current crawl conditions. Estimated
completion time may change significantly should
net congestion increase or decrease.
The User Manager is below
the listing of current crawls. Users may change
their username, password and email address.
4.
Issue Crawler
The Issue Crawler is the
crawler itself. There are two steps
before launching a crawl.
4.1
The Harvester. (Step one)
The Harvester is so named because it strips
URLs from any text pasted into it.
For example, one may copy and paste a page
of search engine returns into the Harvester.
The Harvester strips away the text, leaving
only URLs. It is a generally useful tool in
itself.
Type or paste at least two different
URLs into the harvester, and press
harvest. These harvested URLs will be crawled.
Tip:
If you find a list of links on the Web that shows
only link text and not the URLs themselves, view the page
source, copy the code containing the URLs,
paste it into the Harvester and press Harvest.
The Harvester will strip out the code, leaving
only URLs.
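The stripping step can be illustrated with a short Python sketch. It is not the Harvester's actual code: the function name harvest_urls and the URL pattern are assumptions made for illustration, and the pattern only recognises absolute http and https URLs.

```python
import re

# Hypothetical pattern: absolute http(s) URLs, stopping at whitespace,
# quotes and angle brackets so that HTML attributes are cut cleanly.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)


def harvest_urls(text):
    """Return the URLs found in text, in order of appearance, without duplicates."""
    seen = set()
    urls = []
    for match in URL_PATTERN.finditer(text):
        # Drop punctuation that commonly trails a URL in running text.
        url = match.group(0).rstrip(".,;:)")
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


# Page source and running text mixed together, as one might paste it.
sample = """
<li><a href="http://www.govcom.org/scenarios_use.htm">Scenarios of use</a></li>
<li><a href="http://www.govcom.org/publications.html">Publications</a></li>
See also http://www.govcom.org/scenarios_use.htm.
"""

for url in harvest_urls(sample):
    print(url)
```

The sketch also removes duplicate entries, which the Issue Crawler leaves to the user in step two.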
4.2
The Crawler Settings. (Step two)
Your harvested URLs appear in the
box. You may edit and remove URLs. You may
save your harvested results. This is also
the stage where you provide the Crawler with
instructions (the crawler settings), and where
you name and launch your crawl.
Tips:
Once you have harvested:
Remove duplicate entries by clicking
on a URL, and pressing remove.
View starting points to ensure they
are correct by clicking on a URL, and pressing
view.
Should the URL be incorrect, edit
the starting point by clicking the URL and
pressing edit. Once edited, press
update.
You may save your harvested results
by pressing save results.
A text file is created.
Should you wish to add URLs,
save your results, return to the Harvester,
and paste your saved results into the Harvester.
Add URLs. Press Harvest.
4.3
Explanation of General Crawler Operation.
The Issue Crawler crawls the specified starting
points, captures the starting points’
outlinks, and performs co-link analysis to
determine which outlinks at least two starting
points have in common. The Issue Crawler performs
these two steps (crawling and co-link analysis)
once, twice or three times. Each performance
of these two steps is called an iteration.
Each iteration has the same crawl depth. The
crawler respects robot exclusion files.
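The two steps can be sketched in Python on a toy link table. This is an illustration only, not the Issue Crawler's code: the names colink_analysis, run_iterations and toy_web are hypothetical, no pages are actually fetched, and the sketch assumes that the results of one iteration serve as the starting points of the next.

```python
from collections import defaultdict


def colink_analysis(outlinks, min_sources=2):
    """Return the targets that receive links from at least min_sources starting points.

    outlinks maps each starting point to the set of targets it links to.
    """
    sources_per_target = defaultdict(set)
    for source, targets in outlinks.items():
        for target in targets:
            if target != source:  # ignore self-links
                sources_per_target[target].add(source)
    return {
        target
        for target, sources in sources_per_target.items()
        if len(sources) >= min_sources
    }


def run_iterations(start_points, fetch_outlinks, iterations=1):
    """Repeat crawling and co-link analysis; each result seeds the next round (assumption)."""
    current = set(start_points)
    for _ in range(iterations):
        outlinks = {url: fetch_outlinks(url) for url in current}
        current = colink_analysis(outlinks)
    return current


# Toy link table standing in for the Web (made-up hosts).
toy_web = {
    "a.org": {"c.org", "d.org"},
    "b.org": {"c.org", "d.org", "e.org"},
    "c.org": {"d.org", "f.org"},
    "d.org": {"c.org", "f.org"},
}


def fetch_outlinks(url):
    return toy_web.get(url, set())


print(run_iterations({"a.org", "b.org"}, fetch_outlinks, iterations=1))
print(run_iterations({"a.org", "b.org"}, fetch_outlinks, iterations=2))
```

In the toy table, e.org drops out after the first iteration because only one starting point links to it.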
Tip:
Avoid crawling big media sites, blogs,
search engines, PDF files, image files and,
more generally, pages without specific outgoing
links.
4.4 Crawler Settings in Detail
There are four settings. The default
settings suffice to ensure a crawl.
You must name your crawl
before launching the crawler.
Privilege Starting Points:
This setting keeps your starting points in
the results after the first iteration. Privileging
starting points and using one iteration of
method are suggested for social network mapping.
The software understands a social network
as the starting points plus those organizations
receiving at least two links from the starting
points.
Perform co-link analysis by page or
by site. Performing co-link analysis
by page analyses deep pages, and returns networks
consisting of pages. Performing co-link analysis
by site returns networks consisting of sites
or homepages only. Analysis by page is suggested,
for the results are more specific, and the
clickable nodes on the map are often 'deep
pages' as opposed to homepages.
Set iterations. One may
set the number of iterations of method (crawling
and co-link analysis) to one, two or three
iterations. One iteration is suggested for
social network mapping, two for issue network
mapping and three for establishment network
mapping. For a longer description of the distinction
between networks, see also scenarios of use,
http://www.govcom.org/scenarios_use.htm.
Crawl depth. One may crawl
sites one, two or three layers deep.
Here is a strict definition of how
depth is calculated.
The pages fetched from the starting point
URLs are considered to be
depth 0. The pages fetched from URL links
from those pages are considered to be depth
1. In general, the pages found from URL links
on a page of depth N are considered to be
depth N+1. If you set a depth of 2, then no
pages of depth 2 will be fetched. Only pages
of depth 0 and 1 will be fetched (i.e., two
levels of depth). (Text by David Heath at
Oneworld.)
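The depth rule can be checked with a small Python sketch. The function pages_fetched and the toy link table are hypothetical and nothing is fetched from the Web. The sketch also assumes that a page reachable by several routes takes its smallest depth, which the definition above does not state.

```python
from collections import deque


def pages_fetched(start_urls, links, depth_setting):
    """Return {url: depth} for the pages fetched under the given depth setting.

    Starting points are depth 0; pages linked from a page of depth N are
    depth N+1. Only pages with depth < depth_setting are fetched.
    """
    depths = {}
    queue = deque((url, 0) for url in start_urls)
    while queue:
        url, depth = queue.popleft()
        if url in depths or depth >= depth_setting:
            continue
        depths[url] = depth
        for target in links.get(url, []):
            queue.append((target, depth + 1))
    return depths


# Toy link table: start -> p1, p2; p1 -> p3; p3 -> p4.
links = {"start": ["p1", "p2"], "p1": ["p3"], "p3": ["p4"]}

print(pages_fetched(["start"], links, depth_setting=2))
print(pages_fetched(["start"], links, depth_setting=3))
```

With a setting of 2, only the starting point and the pages it links to are fetched; p3 is first reached with a setting of 3.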
Tips:
1. Use links pages as starting
points. Links pages are the URLs where hyperlinks
are listed, e.g., http://www.freeburmacoalition.org/educational_resources/links/fbc_links.htm.
Occasionally sites that use frames or other
structures are designed so that visitors
have the impression that they are always
on the homepage. If, on the homepage, you
notice a hyperlink to ‘links’
or ‘resources’, right-click
the hyperlink, copy its location to the
clipboard, and paste it into the Harvester. Use
as many links pages as possible for your starting
points.
2. Give the crawler the least amount
of work to do. Using a few links
pages as starting points, with one iteration
of method and one layer deep will provide
the quickest crawl completion.
3. Before launching a crawl, name
the crawl clearly. Name the crawl
so that others viewing the archive will understand
what it is. Viewing the archive will provide
you with an understanding of crawls that have
been named well or less so.
Exclusion list.
There is a list of URLs to be excluded from
crawling and thereby excluded from the results,
e.g., software download pages, site stats
counters, search engines and others. It is
suggested that you keep your own list. You
may edit the existing list. Please note the
list format, and edit the list using the same
format, i.e., www.google.com ; news.google.com.
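As an illustration of the list format, the following Python sketch splits a semicolon-separated list into hosts and tests a URL against it. The function names are hypothetical, and matching on the exact host name is an assumption; the Issue Crawler may match excluded entries differently.

```python
from urllib.parse import urlsplit


def parse_exclusion_list(text):
    """Split a 'host ; host ; host' list into lower-case host names."""
    return [entry.strip().lower() for entry in text.split(";") if entry.strip()]


def is_excluded(url, excluded_hosts):
    """True if the URL's host appears in the exclusion list (exact match, an assumption)."""
    host = urlsplit(url).hostname or ""
    return host in excluded_hosts


excluded = parse_exclusion_list("www.google.com ; news.google.com")
print(excluded)
print(is_excluded("http://news.google.com/news?q=burma", excluded))
print(is_excluded("http://www.govcom.org/publications.html", excluded))
```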
Name and Launch crawl.
Name crawl before launch. Use a name that
clearly identifies the network you seek. Once
you have launched a crawl, your crawl details
will appear. These include the name of your
crawl, and the time and date launched.
5.
Network Manager and Archive
5.1
Purpose of the Network Manager and Archive
The principal purpose of the Network Manager
as well as the Archive is to allow you to
generate, view, edit,
save and print maps.
The Network Manager provides a list of your
completed crawls. The Archive provides a list
of all users’ completed crawls.
The archive may be searched.
5.2
Features of the Network Manager and Archive
The Network Manager and the Archive have a
number of features.
List of completed crawls. Listed
are the network names and, beneath each name,
the top five organizations (URLs) in the network, each with
an inlink count in parentheses. The inlink
count is the total number of links the organization
or site has received from the crawl. Clicking
on an organization (in the form of a shortened
URL) places it in the archive search, and
allows you to find all maps in the
archive containing that organization (according
to the homepage URL, without the www, such
as greenpeace.org). It seems that worldbank.org
currently appears in the most networks in
the archive.
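The inlink count and the shortened URL can be illustrated in Python. The sketch below is an assumption about how such a listing could be computed, not the software's own method: the names short_host and top_organizations, and the sample links, are made up for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def short_host(url):
    """Return the host of a URL without a leading 'www.', e.g. greenpeace.org."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def top_organizations(links, n=5):
    """Count the links each organization receives and return the top n.

    links is a list of (source_url, target_url) pairs found during a crawl.
    """
    counts = Counter(short_host(target) for _source, target in links)
    return counts.most_common(n)


sample_links = [
    ("http://a.org/links.htm", "http://www.greenpeace.org/"),
    ("http://b.org/resources.htm", "http://www.greenpeace.org/climate"),
    ("http://a.org/links.htm", "http://www.worldbank.org/"),
]

for host, inlinks in top_organizations(sample_links):
    print(f"{host} ({inlinks})")
```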
Network Selection – The Scheduler.
You may schedule the network to repeat the
crawl at specified intervals using either
your original starting points or the network
results. This allows you to watch
the evolution of the network over time,
either on your terms (scheduling a crawl using
your starting points) or on the network’s
terms (scheduling a crawl using last available
network results).
Network Selection – View Map.
You may view a depiction of your network as
a circle or cluster map.
Network Selection – Publish
Map. You may annotate and publish
your map by pressing the + sign below, adding
explanatory text and pressing publish. The
annotations will appear on the map, under
Explanatory Notes.
Network Selection – Actor List.
You may view and save a list of the actors
in your network as well as the interlinkings
between them. If your crawl had the 'by page'
setting, the list may show multiple, truncated
organization name URLs.
Network Selection – XML source
file. You may view and save the source
file, in XML format, that is generated by the
software. The XML file may be visualised or
viewed by other software, e.g., an XML reader
or ReseauLu by aguidel.com,
the desktop software Govcom.org occasionally
uses to generate maps.
5.3
Map Viewing and Interactivity
Map Viewing
Pressing View Depiction for a cluster map
or a circle map generates a map. The map is
generated as a scalable vector graphic (svg).
The browser requires a plug-in to view an
svg file. An svg viewer plug-in is available
at http://www.adobe.com/svg.
The map shows its name, author, crawl
start and completion dates, as well
as the crawler settings. It also loads statistics
of the largest node on the map, by default. The
largest node is the node that has received the most
inlinks from the network actors.
Explanation text may be generated
through the publishing feature. The author
of the map (or site authors and site administrators)
may provide an explanation for the map that
is saved. An explanation may be provided on the
Network Details page by clicking +, typing
text and updating (or publishing).
The legend shows the top-
and second-level domains represented on the
map.
For the cluster map, the placement of the nodes on the map is significant. Placement is relative to significance of the node to other nodes, according to the ReseauLu approach.
Map Interactivity
Clickable Node Names. Each
node name on the map is clickable. Clicking
a node name will open a pop-up window and
retrieve the URL associated with the node
name. Should you have run your crawl with
the co-link analysis mode set to ‘by
page’, often the nodes are ‘deep
pages’.
Clickable Nodes
Selecting a node shows the destination URL,
the node’s crawl inlink count, as well
as its links to and from other network actors,
in the statistics.
Clickable Node Types (domains and sub-domains)
You may turn on and off links to and from
domains and sub-domains listed in the legend.
You also may turn on and off links, using
the drop-down menu.
Zooming and Panning. To zoom
in or out, or to return to the original view, hold Ctrl and use the mouse.
To pan, hold Alt and drag.
5.4
Saving and Printing Maps
Saving Map.
Use the save and export option on the map.
Save the interactive .svg file
for uploading to a site or for file transfer.
In order for the .svg file to load on your site, put a line in the mime-types
configuration for your webserver that recognizes svg and
outputs the correct content type to the web browser. It is standard
with Apache.
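The content type in question is image/svg+xml. On Apache, such a line typically takes the form AddType image/svg+xml .svg. As a rough check of the same mapping, the Python sketch below registers the type and looks up what would be reported for a saved map. The function name content_type_for and the file name are hypothetical.

```python
import mimetypes

SVG_TYPE = "image/svg+xml"


def content_type_for(filename):
    """Return the content type a server using this mapping would report."""
    # Register the mapping explicitly, in case the local table lacks it.
    mimetypes.add_type(SVG_TYPE, ".svg")
    content_type, _encoding = mimetypes.guess_type(filename)
    return content_type


print(content_type_for("issue_network_map.svg"))
```

If a browser shows the map as text or offers it for download, a missing or wrong content type is a likely cause.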
Save the .jpg or .png file as a flat
image for pasting into a document
or into HTML. Save the .tiff flat image for
higher print quality. Save the .pdf file as a document.
Printing Map.
Print from an imported or saved file. Landscape
orientation is advised. Printing from the
browser is not advised.
5.5
Advanced Options - Map Generation and Editing
Circle Map - Advanced Options
Map Generation
Retaining the default setting will
generate a map of approximately
25 or fewer nodes. You may raise or lower
the node count. Reducing the node count
is equivalent to raising an authority threshold:
only nodes with higher inlink counts are shown.
Raising the node count admits nodes with lower inlink counts.
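The relation between node count and authority threshold can be sketched in Python. The function nodes_for_map and the sample counts are hypothetical; the software's own selection procedure, including how it breaks ties, may differ.

```python
def nodes_for_map(inlink_counts, max_nodes=25):
    """Keep the max_nodes nodes with the most inlinks.

    Returns the kept node names and the implied authority threshold,
    i.e. the lowest inlink count that still appears on the map.
    """
    # Sort by inlink count, highest first; break ties by name (an assumption).
    ranked = sorted(inlink_counts.items(), key=lambda item: (-item[1], item[0]))
    kept = ranked[:max_nodes]
    threshold = kept[-1][1] if kept else 0
    return [name for name, _count in kept], threshold


counts = {"a.org": 9, "b.org": 7, "c.org": 4, "d.org": 2, "e.org": 1}

print(nodes_for_map(counts, max_nodes=3))
print(nodes_for_map(counts, max_nodes=2))
```

Lowering the node count from 3 to 2 raises the threshold in this example from 4 inlinks to 7.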
Map Editing
You may edit the nodes on your map. You may
edit the names of the nodes
as well as the colors of
the nodes, either by typing in the hex numbers
for the colors or by using the color picker.
The table allows you to sort
the nodes on your map by name and domain.
Cluster Map - Advanced Options
Map Generation
The cluster map advanced options provide
data about your network.
Choose nodes to be mapped allows you to choose
the number of nodes to be mapped
according to a significance measure, that
is, the ‘top’ nodes
according to inlink count per node.
Selection of ties by specificity
is the qualitative strength
of ties. The network clusters actors with
strongest ties to one another.
Selection of ties by frequency
is the quantitative force of ties. The network
clusters actors with the greatest quantity
of ties between them.
Color scheme by type indicates
domain type, e.g., .gov, .co.uk, .gv.at. Color
scheme by structural position indicates
type of linking behavior,
e.g., only gives links, only receives links,
gives and receives links.
Size of nodes by inlinks
indicates that the size of the node is relative
to the number of links received by the site
or organization during the crawl.
Size of nodes by centrality
indicates that the size of the node is relative
to the number of links given and received per cluster.
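The three types of linking behavior used by the structural position color scheme can be derived from a list of links, as in the Python sketch below. The function structural_positions and the sample links are hypothetical and only illustrate the categories; they are not taken from the software.

```python
def structural_positions(links):
    """Classify each actor by its linking behavior within the network.

    links is an iterable of (source, target) pairs between network actors.
    """
    links = list(links)
    givers = {source for source, _target in links}
    receivers = {target for _source, target in links}
    positions = {}
    for node in givers | receivers:
        if node in givers and node in receivers:
            positions[node] = "gives and receives links"
        elif node in givers:
            positions[node] = "only gives links"
        else:
            positions[node] = "only receives links"
    return positions


sample_links = [("a.org", "b.org"), ("b.org", "c.org"), ("a.org", "c.org")]

for node, position in sorted(structural_positions(sample_links).items()):
    print(node, "-", position)
```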
Map Editing
The advanced options for the cluster map allow
you to change the colors as well as the names
of the nodes.
