Exercise 4 for the course "Parallel and distributed systems" of THMMY in AUTH university.

The datasets in this folder were downloaded from the website of the Stanford Network Analysis Project (SNAP), found here.

More details about the datasets can be found in the table below.

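| Dataset directory | Description  | Nodes     | Edges     | URL link |
|-------------------|--------------|-----------|-----------|----------|
| web-Google        | "web-Google" | 875,713   | 5,105,039 | link     |
| wiki-Talk         | "wiki-Talk"  | 2,394,385 | 5,021,410 | link     |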

Adjustments made to the datasets:

Each dataset file began with four (4) lines of metadata, followed by the data stored as one edge per line in the pattern "linkFrom\tlinkTo\n", like so:

linkFrom    linkTo
linkFrom    linkTo
...

A C program was written to discard the metadata lines and convert the data into a new format that lists all the out-links of a page on a single row, like so:

page_1: linkTo_1 linkTo_2 linkTo_3 ...
page_2: linkTo_1 ...

The program is provided in this repository, under the pathname: /datasets/Stanford Large Network Dataset Collection/graphToAdjacencyList.c.
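
For readers who want a feel for the transformation without opening the source file, the following is a minimal sketch of one way it could be done. It is not the code of graphToAdjacencyList.c. The function name `edges_to_adjacency`, the `Edge` type, and the choice to sort the edges before grouping them are all assumptions made for illustration. The sketch also assumes that every input line, including the metadata lines, is shorter than 256 characters.

```c
#include <stdio.h>
#include <stdlib.h>

/* One directed edge read from the input (illustrative type). */
typedef struct { long from, to; } Edge;

/* Order edges by source node, then by target node. */
static int cmp_edge(const void *a, const void *b) {
    const Edge *x = a, *y = b;
    if (x->from != y->from) return (x->from > y->from) - (x->from < y->from);
    return (x->to > y->to) - (x->to < y->to);
}

/* Hypothetical helper: skips `skip` header lines of `in`, reads the
 * remaining "from<TAB>to" lines, and writes one "from: to to ..." row
 * per source node to `out`. Returns 0 on success, -1 if memory runs out. */
int edges_to_adjacency(FILE *in, FILE *out, int skip) {
    char line[256];
    Edge *edges = NULL;
    size_t n = 0, cap = 0;

    /* Discard the metadata lines at the top of the file. */
    while (skip > 0 && fgets(line, sizeof line, in)) skip--;

    /* Collect every edge; lines that do not hold two integers are ignored. */
    while (fgets(line, sizeof line, in)) {
        Edge e;
        if (sscanf(line, "%ld %ld", &e.from, &e.to) != 2) continue;
        if (n == cap) {
            size_t new_cap = cap ? cap * 2 : 1024;
            Edge *tmp = realloc(edges, new_cap * sizeof *tmp);
            if (!tmp) { free(edges); return -1; }
            edges = tmp;
            cap = new_cap;
        }
        edges[n++] = e;
    }

    /* Sorting places all out-links of a page next to each other. */
    if (n > 1) qsort(edges, n, sizeof *edges, cmp_edge);

    /* Start a new row whenever the source node changes. */
    for (size_t i = 0; i < n; i++) {
        if (i == 0 || edges[i].from != edges[i - 1].from) {
            if (i) fputc('\n', out);
            fprintf(out, "%ld:", edges[i].from);
        }
        fprintf(out, " %ld", edges[i].to);
    }
    if (n) fputc('\n', out);

    free(edges);
    return 0;
}
```

Calling `edges_to_adjacency(stdin, stdout, 4)` from a small `main` would convert a file piped in on standard input. Sorting keeps the sketch independent of the order of the edges in the input, at the cost of holding all of them in memory. Pages without out-links never appear as a source node, so they get no row in the output.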