4 Jul 2003 clmformat 1.003, 03-185
clmformat - display cluster results in readable form, optionally with labels and/or cohesion and stickiness measures attached.
clmformat -icl fname (input cluster file) -imx fname (input matrix/graph file) [-tab fname (read tab file)] [-fmt fname (write results to single file)] [-dir dirname (write results to directory)] [-infix str (use after base name/directory)] [-do txt (write ascified output rather than html)] [-lump-size n (cluster size threshold)] [-lump-count n (node threshold)] [-nsm fname (output node stickiness file)] [-ccm fname (output cluster cohesion file)] [--adapt (allow domain mismatch)]
The primary function of clmformat is to display cluster results in a readable form, by listing clusters in terms of the labels associated with the indices that are used in the mcl matrix. The labels must be stored in a so-called tab file; see the -tab option for more information.
By default the output is formatted using HTML. For each cluster a paragraph is output. First comes a listing of other clusters (in order of relevance, possibly empty) that share a significant number of edges with the current cluster. Second comes a listing of the nodes in the current cluster. For each node a small sublist is made (in order of relevance, possibly empty) of other clusters in which the node has neighbours and for which the total sum of corresponding edge weights is significant. The 'self' value is simply the projection value for the cluster to which the node belongs.
All (unqualified) values that are output are so-called projection values, described further below. The so-called coverage measures that are also output are described in [1]. You can safely ignore them, although they do sometimes explain why nodes with a low 'self' projection value form a cluster. When this happens the coverage measures are usually higher than normal, which signifies that a small-area cluster is efficient compared with a large-area cluster for those nodes (which may or may not be what you want).
It is possible to split output over multiple files using the -dir option. The intent is simply that for very large graphs browsing quality can still be maintained. Clusters will by default be output to file until the total node count has exceeded a threshold (refer to the -lump-count option). Alternatively, it is possible to specify a threshold such that clusters with few entries are all collected in a single file. Refer to the -lump-size option.
clmformat also shows how well each node fits in the cluster it is in and how cohesive each cluster is, using simple but effective measures (described under the -nsm and -ccm options, respectively). This enables you to compare the quality of the clusters in a clustering relative to each other, and may help in identifying both interesting areas and areas for which cluster structure is hard to find or perhaps absent.
-icl fname (input cluster file)
Name of the clustering file.

-imx fname (input matrix/graph file)
Name of the graph/matrix file.

-tab fname (read tab file)
The file fname should be in tab format; each line starts
with a unique number which is an index used in the matrix input
file and the cluster input file. The rest of the line contains a
descriptive string associated with the number. Lines starting with
# are considered comments and are disregarded. A single unique
line should be present for each node/index of the cluster row domain
(or the graph/matrix domain optionally specified with the -imx
option). The leading indices should be in ascending order.

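The tab format described above can be illustrated with a small reader. This is a minimal sketch, not clmformat's own parser: the function name `parse_tab` is hypothetical, and it assumes that the index and the label are separated by whitespace and that blank lines may be skipped, neither of which is stated above.

```python
def parse_tab(lines):
    """Parse tab-file lines into a dict mapping index -> label.

    Assumptions (not stated in the manual): index and label are separated
    by whitespace, and blank lines are skipped.
    """
    labels = {}
    previous = None
    for raw in lines:
        line = raw.strip()
        # Lines starting with '#' are comments and are disregarded.
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        index = int(parts[0])
        label = parts[1] if len(parts) > 1 else ""
        # Strictly ascending order also guarantees one line per index.
        if previous is not None and index <= previous:
            raise ValueError(f"index {index} is not in ascending order")
        labels[index] = label
        previous = index
    return labels


example = ["# comment line", "0 apple", "1 banana split", "4 cherry"]
tab = parse_tab(example)
```

Rejecting out-of-order indices up front reflects the requirement that leading indices be ascending; how clmformat itself reacts to a violation is not covered here.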
-fmt fname (write results)
The formatted results are written to the file fname.

-dir dirname (write results to directory)
Each formatted cluster is written to a file in directory
dirname. If the directory does not exist an attempt is made
to create it. Output file names will be of the form 0-3.html
or 0-3.txt depending on the output mode. If the
-infix abc option is used, the file names
will be of the form abc.0-3.html or abc.0-3.txt.
Clusters will by default be output to file until the total node count has exceeded a threshold (refer to the -lump-count option). Alternatively, it is possible to specify a threshold such that clusters with few entries are all collected in a single file. Refer to the -lump-size option.

-lump-count n (node threshold)
Used in conjunction with the -dir option.
Clusters are formatted and output within a single file until the node
threshold has been exceeded. A new file is then opened and the procedure
repeats itself.

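The lumping procedure described for -lump-count can be sketched as follows. This is an illustration only: the function name `lump_by_count` is hypothetical, and the sketch assumes that a file is closed right after the cluster that pushes its node count over the threshold, which is one reading of "until the node threshold has been exceeded".

```python
def lump_by_count(clusters, node_threshold):
    """Group clusters into files; each file is closed once its node count
    exceeds node_threshold (assumed interpretation of -lump-count)."""
    files = []
    current = []
    nodes_in_current = 0
    for cluster in clusters:
        current.append(cluster)
        nodes_in_current += len(cluster)
        if nodes_in_current > node_threshold:
            # Threshold exceeded: close this file and start a new one.
            files.append(current)
            current = []
            nodes_in_current = 0
    if current:
        files.append(current)
    return files


clusters = [[0, 1, 2], [3, 4], [5, 6, 7, 8], [9], [10]]
files = lump_by_count(clusters, 4)
```

With a threshold of 4, the five example clusters end up in three groups: the first two clusters together, the next two together, and the last one on its own.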
-lump-size n (cluster size threshold)
Used in conjunction with the -dir option.
Each cluster is output to a separate file, except for clusters for
which the size does not exceed the threshold specified. The
latter are all output to a single file with a name of the
form cut.html or cut.txt.

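The size-based split can be pictured with a short sketch. The function name `split_by_size` is hypothetical; the only rule taken from the description above is that clusters whose size does not exceed the threshold are collected together, while larger clusters each get their own file.

```python
def split_by_size(clusters, size_threshold):
    """Return (separate, lumped): clusters larger than the threshold, and
    clusters whose size does not exceed it (destined for the cut file)."""
    separate = [c for c in clusters if len(c) > size_threshold]
    lumped = [c for c in clusters if len(c) <= size_threshold]
    return separate, lumped


clusters = [[0, 1, 2, 3], [4, 5], [6], [7, 8, 9]]
separate, lumped = split_by_size(clusters, 2)
```

Here the clusters of size 4 and 3 would be written separately, and the clusters of size 2 and 1 would share the single cut file.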
--adapt (allow domain mismatch)
Allow the cluster domain to differ from the graph domain. Presumably
the clustering is a clustering of a subgraph. The cohesion and stickiness
measures will pertain to the relevant part of the graph only.

-nsm fname (output node stickiness file)
This option specifies the name of the file in which to store (optionally)
the node stickiness matrix. It has the following structure. The columns
range over all elements in the graph as specified by the -imx option.
The rows range over the clusters as specified by the -icl option.
Each entry contains the projection value of that particular
node onto that particular cluster, i.e. the sum of the weights of
all arcs going out from the node to some node in that cluster, written
as a fraction relative to the sum of weights of all outgoing arcs.

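The projection value defined above can be worked through on a toy graph. This is a sketch under stated assumptions, not clmformat's implementation: the names `projection` and `stickiness` are hypothetical, the graph is represented as a dict of outgoing arc weights, and a node without outgoing arcs is assigned a projection of 0.0.

```python
def projection(graph, node, cluster):
    """Sum of weights of arcs from node into cluster, as a fraction of the
    sum of weights of all arcs going out from node."""
    out = graph.get(node, {})
    total = sum(out.values())
    if total == 0:
        return 0.0  # assumption: no outgoing arcs means projection 0
    inside = sum(w for target, w in out.items() if target in cluster)
    return inside / total


def stickiness(graph, clusters):
    """Map each node (column) to its projections onto every cluster (rows)."""
    return {node: [projection(graph, node, c) for c in clusters]
            for node in graph}


# Toy graph: graph[u][v] is the weight of the arc u -> v.
graph = {
    0: {1: 2.0, 2: 1.0},
    1: {0: 2.0, 2: 1.0},
    2: {0: 1.0, 1: 1.0, 3: 2.0},
    3: {2: 2.0},
}
clusters = [{0, 1}, {2, 3}]
nsm = stickiness(graph, clusters)
```

For node 0, two of its three units of outgoing weight stay inside cluster {0, 1}, so its 'self' projection is 2/3 and its projection onto {2, 3} is 1/3. Because the values are fractions of the same total, each node's projections sum to 1 when the clusters cover all of its neighbours.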
-ccm fname (output cluster cohesion file)
This option specifies the name of the file in which to store (optionally)
the cluster cohesion matrix. It has the following structure.
Both columns and rows range over all clusters in the clustering as specified
by the -icl option. An entry specifies the projection
of one cluster onto another cluster, which is simply the average
of the projection value onto the second cluster of all nodes in the
first cluster.

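The averaging step behind the cluster cohesion matrix can be sketched as below. The function name `cohesion_matrix` and the input layout are assumptions made for illustration: `node_projections[n][j]` is taken to hold the projection value of node n onto cluster j, and every cluster is assumed to be non-empty.

```python
def cohesion_matrix(node_projections, clusters):
    """Entry [i][j] is the average, over the nodes of cluster i, of each
    node's projection value onto cluster j."""
    k = len(clusters)
    return [
        [
            sum(node_projections[n][j] for n in clusters[i]) / len(clusters[i])
            for j in range(k)
        ]
        for i in range(k)
    ]


# Hypothetical projection values for four nodes and two clusters.
node_projections = {
    0: [2.0 / 3.0, 1.0 / 3.0],
    1: [2.0 / 3.0, 1.0 / 3.0],
    2: [0.5, 0.5],
    3: [0.0, 1.0],
}
clusters = [[0, 1], [2, 3]]
ccm = cohesion_matrix(node_projections, clusters)
```

The diagonal entries give each cluster's projection onto itself: 2/3 for the first cluster and 0.75 for the second in this example, so the second cluster would count as the more cohesive of the two.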
[1]
Stijn van Dongen. Performance criteria for graph clustering and Markov
cluster experiments. Technical Report INS-R0012, National Research
Institute for Mathematics and Computer Science in the Netherlands,
Amsterdam, May 2000.
http://www.cwi.nl/ftp/CWIreports/INS/INS-R0012.ps.Z