Yesterday I had a road-to-Damascus moment with the back end of this project: I finally got some really nice clean data. I’ll post it later, but in the meantime here is the first visualisation of the first 6 countries on my list. It has a long way to go visually (there are obvious problems with the resolution etc.). It also badly needs to be rewritten and optimised (at the moment it crashes when trying to draw more than a few of the bezier curves), but it’s defo getting there.
In an attempt to get the text analysis as rigorous as possible I’ve had to add an extra stage to the pipeline. Now, after generating the word lists in PHP, I run the words through a POS tagging and stemming sketch, using the RiTa WordNet library for Processing.
POS = part of speech. It basically means that I am only looking for nouns, verbs, adjectives and adverbs, not things like articles and prepositions, which I’m not interested in.
STEMMING = reducing words to their stem, so running -> run. This means the PHP script can recognise that running and run have the same meaning and not count them as separate words.
Anyway the code is pretty simple but I’ve posted it below because anyone doing anything with text could easily re-use this.
Cheers
import rita.wordnet.*;

RiWordnet wordnet;
PrintWriter output;
String tempString, tempString1, myWords[];

void setup()
{
  size(300, 300);
  //the wordnet object
  wordnet = new RiWordnet(this);
  //load file names from directory
  File dir = new File("/Users/tom/Documents/Processing/wordNet/data/source");
  String[] children = dir.list();
  if (children == null) {
    // Either dir does not exist or is not a directory, so there is nothing to do
    return;
  }
  //iterate through the file names (starting at 1, as in the original sketch)
  for (int m=1; m<children.length; m++) {
    //make a printwriter for this file name
    output = createWriter(children[m]);
    println(children[m]);
    myWords = loadStrings("/Users/tom/Documents/Processing/wordNet/data/source/"+children[m]);
    println(myWords.length);
    //for each country go through all the words on the list
    for (int i=0; i<myWords.length; i++) {
      //check it is either noun verb adj or adv
      String pos[] = wordnet.getPos(myWords[i]);
      if (pos != null) {
        println("POSLENGTH "+pos.length);
        //bit of a hack to check that it is really a POS
        if (pos.length>0) {
          //get stems for the first pos in the array
          String tempArray[] = wordnet.getStems(myWords[i], pos[0]);
          if (tempArray != null && tempArray.length>0) {
            //print the first possible stem to file
            output.println(tempArray[0]);
          }
        }
      }
    }
    output.flush(); // Writes the remaining data to the file
    output.close(); // Finishes the file
  }
}

void draw()
{
  background(40);
}
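To show why the stemming step matters for the word counts, here is a minimal sketch in plain Java. It does not use RiTa or WordNet at all: `naiveStem` is a crude, hypothetical suffix-stripper and `SKIP` is a made-up list of function words standing in for the POS filter, so treat it as an illustration of the idea rather than what the sketch above actually does.

```java
import java.util.*;

class StemCount {
    // Hypothetical stand-in for the POS filter: a tiny list of words to ignore.
    static final Set<String> SKIP =
        new HashSet<>(Arrays.asList("the", "of", "in", "a", "and", "to"));

    // Crude illustrative stemmer (NOT WordNet): strips "ing" and a plural "s".
    static String naiveStem(String w) {
        if (w.endsWith("ing") && w.length() > 5) {
            String base = w.substring(0, w.length() - 3);
            int n = base.length();
            // undo a doubled final consonant: "runn" -> "run"
            if (n >= 2 && base.charAt(n - 1) == base.charAt(n - 2)) {
                base = base.substring(0, n - 1);
            }
            return base;
        }
        if (w.endsWith("s") && !w.endsWith("ss") && w.length() > 3) {
            return w.substring(0, w.length() - 1);
        }
        return w;
    }

    // Counts words by stem, so "running", "run" and "runs" share one entry.
    static Map<String, Integer> countByStem(List<String> words) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String w : words) {
            String lower = w.toLowerCase();
            if (SKIP.contains(lower)) continue;
            counts.merge(naiveStem(lower), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("the", "running", "of", "run", "runs", "rights");
        System.out.println(countByStem(words)); // prints {right=1, run=3}
    }
}
```

Without the stemming step the same list would give three separate entries for running, run and runs, which is exactly the double counting the extra pipeline stage is there to avoid.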
Some of you may be aware that I have been working since the new year on a visualisation of world constitutions. Essentially I am interested in how different countries express their common ethical values, and specifically the way that they describe these as “fundamental” or “basic”. These texts ostensibly define a commonality of human values. I have several questions about this:
If these values are really “fundamental” to human experience why are there so many differences from country to country?
How is it that countries revise/amend their own constitutions so frequently - apart from minor revision to legal jargon?
Why do we need to put these things into words if they are so obvious?
What is lost in the translation from values to concepts?
How is that affected by language?
How is that language affected by fashion?
What is the relationship between people’s behaviour and the values stated publicly? Where is the dissent to these values located? In what medium? In what part of society?
The code and visualisations I have been posting are basically designed to analyse the texts for differences from one another. This can only be the start of the process. What really interests me is the differences between people and how they can be accommodated (rather than resolved). This post was intended to help me progress in my thinking about this topic, but actually that hasn’t happened yet. What would be amazing is if you guys could comment with your thoughts. The more I discuss this with people the more interesting it gets to me, and I’ve got too close to it to think about it by myself.
cheers
Hi Tom,
here are some brief thoughts on some of your questions and on the use of (and need for) ‘constitutional’ documents:
(your blog comment section doesn’t seem to work for me - does the captcha need an upgrade?)
Q. If these values are really “fundamental” to human experience why are there so many differences from country to country?
I do not believe that fundamental human values exist. There will be significant differences between constitutions from country to country because a constitution is an agreement of values, and it is likely to accommodate local (nationwide) practicalities of climate and resources, as well as historic precedent and traditional etiquette and ritual.
Q. How is it that countries revise/amend their own constitutions so frequently - apart from minor revision to legal jargon?
A constitution attempts to define the goals of the community/organisation/entity to which it applies and as such should always be in a state of evaluation and negotiation in comparison to what is actually happening.
Q. Why do we need to put these things into words if they are so obvious?
Since the constitution is, in a sense, a contractual agreement between the many concerned parties it is written down in an attempt to anchor what has been agreed. The document is ‘aspirational’ - it sets out the idealised values of the group on our best day so that we may return to it at moments of uncertainty.
(Just some immediate thoughts - hope they are helpful)
Hey guys, here’s a quick snapshot of the latest development.
I was having a nightmare getting the bezier curves to curve in the right direction, because the handles need to be positioned along the normal to the circle (sticking straight out at right angles), and since everything is spinning that direction keeps changing. I think this is really pretty though - thoughts?
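The geometry of the handles can be sketched like this, as a minimal illustration in plain Java rather than the actual drawing code. It assumes the circle is centred at (cx, cy) and that the current spin has already been added to each point’s angle; `onCircle`, `handle` and `reach` are hypothetical names. Because the normal to a circle runs along the radius, the handle for a point is simply the same angle at a different radius.

```java
class BezierHandles {
    // A point on the circle at the given angle (radians).
    static double[] onCircle(double cx, double cy, double r, double angle) {
        return new double[] { cx + r * Math.cos(angle), cy + r * Math.sin(angle) };
    }

    // A control handle on the normal through that point.
    // A positive reach pushes the handle outwards, a negative reach pulls it
    // in towards the centre.
    static double[] handle(double cx, double cy, double r, double angle, double reach) {
        return onCircle(cx, cy, r + reach, angle);
    }

    public static void main(String[] args) {
        double spin = Math.PI / 2;   // current rotation of the whole drawing
        double base = 0;             // this word's fixed position on the circle
        double[] anchor = onCircle(0, 0, 100, base + spin);
        double[] h = handle(0, 0, 100, base + spin, -40);
        System.out.printf("anchor %.1f,%.1f handle %.1f,%.1f%n",
                          anchor[0], anchor[1], h[0], h[1]);
    }
}
```

In a Processing sketch the two anchors and their two handles would then go into `bezier(x1, y1, hx1, hy1, hx2, hy2, x2, y2)`. Recomputing the handles from the spun angle every frame keeps them on the normal however far the circle has turned.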

This could be useful to some people.
It looks at a bunch of texts, finds all the words which are common between them, then saves them to a file.
<?php
//common words cc tom schofield 2010
//Creative Commons Attribution-Noncommercial-Share Alike 2.0 UK
//
$eachCountry=array();
$last=array();
//an incrementor for later on
$counter=0;
//where i am writing to
$target="target.txt";
//file handles/streams
$fh5 = fopen($target, 'a') or die("can't open file");
$fh1 = fopen($target, 'r') or die("can't open file");
$myFile = array(
"italy.txt", "germany.txt","iran.txt", "afghanistan.txt", "finland.txt",
"ireland.txt","andora.txt","italy.txt","japan.txt","poland.txt","belgium.txt","greece.txt",
"portugal.txt","bulgaria.txt","holland.txt","romania.txt","denmark.txt","norway.txt"
);
for($i = 0;$i<count($myFile);$i++){
  $fh = fopen($myFile[$i], 'r') or die("can't open file");
  //read into string from each file
  $tempString = fread($fh, filesize($myFile[$i]));
  fclose($fh);
  //make string into array of words
  $eachCountry[$i] = explode(" ",$tempString);
}
//go through each country
for($j=0;$j<count($eachCountry);$j++){
  //then through each word
  for($k=0;$k<count($eachCountry[$j]);$k++){
    //then through each country for each word
    for($m=0;$m<count($eachCountry);$m++){
      //skip the country the word came from, otherwise every word matches its own list
      //if it's in the array for another country add it to $last
      if($m!=$j && in_array($eachCountry[$j][$k], $eachCountry[$m])){
        $last[$counter]=$eachCountry[$j][$k];
        $counter++;
      }
    }
  }
}
//remove all repeated words and write to file
$last=array_unique($last);
foreach($last as $check){
  echo $check;
  fwrite($fh5,$check."\n");
}
fclose($fh5);
fclose($fh1);
?>
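If “common” is taken strictly, meaning a word has to turn up in every single text, the same idea can be sketched as a set intersection. This is a minimal sketch in plain Java, not a port of the PHP above; it assumes the texts are already loaded as strings, and `commonWords` is a hypothetical name.

```java
import java.util.*;

class CommonWords {
    // Returns the words that appear in every one of the given texts.
    static Set<String> commonWords(List<String> texts) {
        Set<String> common = null;
        for (String text : texts) {
            // unique words of this text, lower-cased and split on any whitespace
            Set<String> words = new TreeSet<>();
            for (String w : text.toLowerCase().split("\\s+")) {
                if (!w.isEmpty()) words.add(w);
            }
            if (common == null) {
                common = words;           // first text: everything is a candidate
            } else {
                common.retainAll(words);  // later texts: keep only shared words
            }
        }
        return common == null ? new TreeSet<String>() : common;
    }

    public static void main(String[] args) {
        List<String> texts = Arrays.asList(
            "people state law",
            "state  people crown",
            "People of the state");
        System.out.println(commonWords(texts)); // prints [people, state]
    }
}
```

Using sets means each text is only scanned once, and duplicates never need removing afterwards.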
I’ve been trying to refine this script to get rid of all the random bits of punctuation (which I thought I’d already done) and then finally to reorder everything so that the most frequent words are saved first in the file, followed by the rest in decreasing order. I’ve spent every waking moment that I’ve not been at work this weekend trying to knock this into shape and I’m pretty frustrated. I know most of you guys haven’t used PHP but it’s fairly well commented and you can see the structure. If anyone can see any reason this thing is still throwing me random white spaces/carriage returns, and also why the ranking seems to be backwards, I’d be a happy boy indeed.
As always, here’s the code:
<?php
//term frequency vs inverse document frequency cc tom schofield 2010
//Creative Commons Attribution-Noncommercial-Share Alike 2.0 UK
//thanks to http://www.phpandme.net/2009/03/tfidf-weighting/ for tfidf explanation
ini_set("memory_limit","1000M");
set_time_limit (120);
//define the file names
$myFile = array(
"italy.txt", "germany.txt","iran.txt", "afghanistan.txt", "finland.txt",
"ireland.txt","andora.txt","italy.txt","japan.txt","poland.txt","belgium.txt","greece.txt",
"portugal.txt","bulgaria.txt","holland.txt","romania.txt","denmark.txt","norway.txt"
);
$textStrings =array();
$textCount = count($myFile);

//clean a text and break it into an array of words
//every text goes through the same cleaning, so the document frequency
//lookup further down always finds the word in its own text at least
function cleanWords($theData){
  //make everything lower case
  $theData = strtolower($theData);
  //replace hyphens with a space
  $theData = preg_replace('/\b-\b/', " ", $theData);
  //replace all other punctuation with a space
  $theData = preg_replace('/[[:punct:]]/', " ", $theData);
  //get rid of the numbers
  $theData = preg_replace('/[0-9]+/', " ", $theData);
  //break into an array on any run of white space (spaces, tabs, carriage returns, new lines)
  //explode(" ") leaves an empty string wherever two spaces sit together,
  //PREG_SPLIT_NO_EMPTY drops those
  return preg_split('/\s+/', $theData, -1, PREG_SPLIT_NO_EMPTY);
}

//load the files again
for($i = 0;$i<$textCount;$i++){
  $fh4 = fopen($myFile[$i], 'r') or die("can't open file");
  //read into string from each file
  $tempString = fread($fh4, filesize($myFile[$i]));
  fclose($fh4);
  //make text Strings multidimensional with each array of words
  $textStrings[$i]=cleanWords($tempString);
}

foreach($myFile as $myFile1){
  //echo $myFile1;
  //define a file handle or exit
  $fh = fopen($myFile1, 'r') or die("can't open file");
  $fh5 = fopen("new_".$myFile1, 'a') or die("can't open file");
  $fh6 = fopen("num_".$myFile1, 'a') or die("can't open file");
  //read the existing data into a string
  $theData = fread($fh, filesize($myFile1));
  //clean it and break it into an array
  $myArray = cleanWords($theData);
  //start the ranking lists afresh for each country
  $order=array();
  $orderNum=array();
  //this bit is really crucial:
  //array_count_values returns a new array with the words as the array keys and the frequency as the value
  $unique = array_count_values($myArray);
  //the number of tokens in this text
  $count = count($myArray);
  //a new array of just the keys since the array has been made associative
  $keys = array_keys($unique);
  //an incrementor to go through the array of keys
  $inc=0;
  //the & symbol means the array is not copied as it iterates
  foreach($unique as &$temp){
    //replaces term freq with term freq over number of tokens
    $temp=$temp/$count;
    $df=0;
    foreach($textStrings as $that){
      //check to see if this array key is in any of the other texts
      if(in_array($keys[$inc],$that)){
        //if it is, increment the document frequency counter
        $df++;
      }
    }
    //echo $keys[$inc]," ",$temp*($textCount/$df),"\n";
    //the money shot: $temp*($textCount/$df) is tf*idf
    //(idf here is the plain ratio, with no log taken)
    $temp = $temp*($textCount/$df);
    //only keep words longer than 4 characters
    if(strlen($keys[$inc])>4){
      $order[$inc]=$keys[$inc];
      $orderNum[$inc]=$temp;
    }
    $inc++;
  }
  //break the reference left over from the loop
  unset($temp);
  //array_multisort sorts ascending unless told otherwise,
  //so ask for descending to put the highest scores first
  array_multisort($orderNum, SORT_DESC, $order);
  //var_dump($orderNum);
  //var_dump($order);
  //these are the important write functions
  for($m=0; $m<count($order); $m++){
    fwrite($fh5,$order[$m]."\n");
    fwrite($fh6,$orderNum[$m]."\n");
  }
  fclose($fh5);
  fclose($fh);
  fclose($fh6);
  //this bracket marks the end for each country
}
?>
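For anyone who doesn’t read PHP, the scoring and ordering step can be sketched in Java. This is a minimal illustration, assuming each text is already a cleaned list of words; `score` and `rank` are hypothetical names. It uses the same weighting as the script above (term frequency times the number of texts over document frequency, with no log) and sorts highest score first.

```java
import java.util.*;

class TfIdfRank {
    // Scores every word of docs.get(target) as tf * (N / df).
    static Map<String, Double> score(List<List<String>> docs, int target) {
        List<String> doc = docs.get(target);
        Map<String, Integer> counts = new HashMap<>();
        for (String w : doc) counts.merge(w, 1, Integer::sum);

        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            // document frequency: how many texts contain this word
            int df = 0;
            for (List<String> other : docs) {
                if (other.contains(e.getKey())) df++;
            }
            // term frequency: occurrences over total tokens in this text
            double tf = e.getValue() / (double) doc.size();
            scores.put(e.getKey(), tf * docs.size() / df);
        }
        return scores;
    }

    // Words of the target text, highest score first (ties broken alphabetically).
    static List<String> rank(List<List<String>> docs, int target) {
        Map<String, Double> s = score(docs, target);
        List<String> words = new ArrayList<>(s.keySet());
        words.sort((a, b) -> {
            int c = Double.compare(s.get(b), s.get(a)); // b before a = descending
            return c != 0 ? c : a.compareTo(b);
        });
        return words;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("rights", "rights", "people", "state"),
            Arrays.asList("people", "state", "state"),
            Arrays.asList("people", "crown"));
        System.out.println(rank(docs, 0)); // prints [rights, state, people]
    }
}
```

In the example, rights scores 0.5 × 3/1 = 1.5, state 0.25 × 3/2 = 0.375 and people 0.25 × 3/3 = 0.25, so a word that is frequent here but rare elsewhere comes out on top.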