My Southampton Graduation Ceremony
Jul. 19, 2018 in Photos, University
Four years later, and my time at university has come to a close. I have earned my Master’s. It is a bitter-sweet day.
It’s done, it’s over! Months in the making, my dissertation is finished and available from lect.me. My advice for future students is to start early. Projects like these always take longer than you expect.
May. 19, 2017 in Csharp, Games, Innovation, Unity, University
Having previously created games in my spare time and in competitions, I chose to team up with three different partners to create games focusing on gameplay, narrative experiences, and innovative technology using Unity. It was hard and took a lot of work, but in the end it was one of the most satisfying modules I ever took at University. Shout out to Rikki Prince, Dave Millard, and Tom for running such an excellent module.
A fast paced, Quake inspired, local multi-player, little planet deathmatch infinite arena shooter. Hone your skills, then compete against your friends to see who can dominate the playing field. Supports up to 4 player split-screen, bring an Xbox controller. A student game created at the University of Southampton by Matthew Consterdine and Ollie Steptoe.
Featuring a number of classic weapons:
A fully narrated re-telling of the fairy tale classic. Single player, play with a mouse/keyboard or Xbox 360 controller. A student game created at the University of Southampton by Matthew Consterdine and Jeff Tomband. Download and play.
Using your flame-thrower, rack up points and burn the forest down. Single player, play with a mouse/keyboard or Xbox 360 controller. A student game created at the University of Southampton during the Southampton Code Dojo. Burn down everything!
Last The Night is a procedurally generated first person survival game in which the player fights for their life after having crash landed on a mysterious, unknown planet. Armed with only a pistol, the player must fight off the various monsters inhabiting the planet, and only once the sun rises will they be safe.
With seed based world generation, there are literally millions of planets to explore with no two being the same, and with the addition of Easy, Medium and Hard difficulties, advanced players can challenge themselves whilst beginners can get a feel for the game. Last The Night features 17 different types of monsters, keeping the player guessing at all times.
A student game created at the University of Southampton by Matthew Consterdine and Ed Baker. Do you think you’re brave enough to last the night?
Nov. 24, 2016 in Machine-learning, Matlab, University
I decided to investigate Machine Learning using MATLAB.
To compute the posterior probability, I started by defining the following two Gaussian distributions, which have different means and covariance matrices.
Using the definitions, I iterated over an N×N matrix, calculating the posterior probability of being in each class with the function mvnpdf(x, m, C). To display it I chose a mesh because, with a high enough resolution, a mesh lets you see the pattern in the plane and also looks visually interesting. Finally, I plotted the mesh and rotated it to help visualize the class boundary. You can clearly see that the boundary is quadratic, with a sigmoidal gradient.
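The posterior calculation described above can be sketched in a few lines. This is a minimal Python illustration rather than the MATLAB I actually ran: gaussian_pdf_2d stands in for mvnpdf, the means and covariances M1, C1, M2, C2 are hypothetical placeholder values, and equal class priors are assumed.

```python
import math

# Hypothetical class definitions, chosen only for illustration.
M1, C1 = (0.0, 3.0), ((2.0, 1.0), (1.0, 2.0))
M2, C2 = (2.0, 1.0), ((1.0, 0.0), (0.0, 1.0))

def gaussian_pdf_2d(x, mean, cov):
    """Density of a 2-D Gaussian at x (the role mvnpdf plays in MATLAB)."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # Quadratic form (x - m)^T C^-1 (x - m) using the closed-form 2x2 inverse.
    quad = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * quad) / (2 * math.pi * math.sqrt(det))

def posterior(x, m1, c1, m2, c2, prior1=0.5):
    """Bayes' rule: probability that x belongs to class 1."""
    p1 = gaussian_pdf_2d(x, m1, c1) * prior1
    p2 = gaussian_pdf_2d(x, m2, c2) * (1.0 - prior1)
    return p1 / (p1 + p2)

def posterior_grid(n, lo=-5.0, hi=5.0):
    """Evaluate the class-1 posterior on an n x n grid over [lo, hi]^2."""
    step = (hi - lo) / (n - 1)
    return [[posterior((lo + j * step, lo + i * step), M1, C1, M2, C2)
             for j in range(n)]
            for i in range(n)]
```

Because the two covariance matrices differ, the 0.5 contour of this grid is a quadratic curve, which is the boundary visible in the mesh.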
Next, I generated 200 samples with the definitions and the function mvnrnd(m, C, N), finally partitioning them in half, into training and testing sets. With the first of the sets, I trained a feedforward neural network with 10 hidden nodes; with the second, I tested the trained neural net, and got the following errors:
These values are both small, and the testing error is marginally larger than the training error, as is to be expected. This shows that the neural network has accurately classified the data.
I compared the neural net contour (at 0.5) to both a linear and quadratic Bayes’ optimal class boundary. It is remarkable how significantly better Bayes’ quadratic boundary is. I blame both the low sample size, and the low number of hidden nodes. For comparison, I have also included Bayes’ linear boundary; it isn’t that bad, but still pales in comparison to the quadratic boundary. To visualize, I plotted the neural net probability mesh. It is interesting how noisy the mesh is, when compared to the Bayesian boundary.
Next, I increased the number of hidden nodes from 10, to 20, and to 50. As I increased the number of nodes I noticed that the boundary became more complex, and the error rate increased. This is because the more nodes I added, the more I over-fitted the network. This shows that it’s incredibly important to choose the network size wisely; it’s easy to go too big! After looking at the results, I would want to pick somewhere around 5-20 nodes for this problem. I might also train it for longer.
|Training Error||Testing Error|
I was set the task of first generating a number of samples from the Mackey-Glass chaotic time series, then using these to train and try to predict their future values using a neural net. Mackey-Glass is calculated with the equation:
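For reference, the Mackey-Glass delay differential equation is usually written in the form below. The symbols a, b and τ, and the exponent 10, follow the common textbook presentation; the typical values a = 0.2, b = 0.1 and τ = 17 are an assumption here, not something taken from my own runs.

```latex
\frac{dx(t)}{dt} = \frac{a\,x(t-\tau)}{1 + x(t-\tau)^{10}} - b\,x(t)
```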
For the samples, I visited Mathworks file exchange, and downloaded a copy of Marco Cococcioni’s Mackey-Glass time series generator: https://mathworks.com/matlabcentral/fileexchange/24390. I took the code, and adjusted it to generate N=2000 samples, changing the delta from 0.1 to 1. If I left the delta at 0.1, the neural network predicted what was essentially random noise between -5 and +5. I suspect this was due to the network not getting enough information about the curve, the values given were too similar. You can see how crazy the output is in the bottom graph. Next, I split the samples into a training set of 1500 samples, and a testing set of 500 samples. This was done with p=20. I created a linear predictor and a feedforward neural network to look at how accurate the predictions were one step ahead.
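The windowing step above can be sketched as follows. This is a minimal Python illustration, not the MATLAB I used: make_windows is a hypothetical helper, and splitting the series before windowing is an assumption about the order of operations.

```python
def make_windows(series, p):
    """Pair each run of p consecutive values with the value that follows it."""
    inputs = [series[i:i + p] for i in range(len(series) - p)]
    targets = [series[i + p] for i in range(len(series) - p)]
    return inputs, targets

def train_test_windows(series, n_train, p):
    """Split the series first, then window each part separately."""
    train = make_windows(series[:n_train], p)
    test = make_windows(series[n_train:], p)
    return train, test
```

With N=2000, a 1500/500 split and p=20, this yields 1480 training pairs and 480 testing pairs, each mapping the previous 20 values to the next one.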
This shows that the neural network is already more accurate, a single point ahead. If you continue, feeding back predicted outputs, sustained oscillations are not only possible, the neural net accurately predicts values at least 1500 steps into the future. In the second and third graphs, you can notice the error growing very slowly; however, even at 3000, the error is only 0.138.
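Feeding predictions back in as inputs looks roughly like this. It is a minimal Python sketch, where predict is a hypothetical stand-in for any trained one-step-ahead model that takes a window of the last p values and returns the next one.

```python
def free_run(predict, seed_window, steps):
    """Predict `steps` values ahead, feeding each prediction back as input."""
    window = list(seed_window)
    outputs = []
    for _ in range(steps):
        nxt = predict(window)
        outputs.append(nxt)
        # Slide the window: drop the oldest value, append the new prediction.
        window = window[1:] + [nxt]
    return outputs
```

Because every prediction becomes an input to later ones, any one-step error compounds, which is why the error grows the further ahead you run.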
Using the FTSE index from finance.yahoo.com, I created a neural net predictor capable of predicting tomorrow’s FTSE index value from the last 20 days of data. To keep my model simpler and not overfitted, I decided to use just the closing value, as other columns wouldn’t really affect the predictions, and just serve to overcomplicate the model.
Feeding the last 20 days into the neural net produces relatively accurate predictions, however some days there is a significant difference. This is likely due to the limited amount of data, and simplicity of the model. It’s worth taking into account that the stock market is much more random and unpredictable than Mackey-Glass.
Next I added the closing volume to the neural net inputs, and plotted the predictions it made. Looking at the second graph, it’s making different predictions, which from a cursory glance, look a little more in line.
However, I wasn’t sure so I plotted them on the same axis, and, nothing really. It just looks a mess. Plotting the different errors again gives nothing but a noisy, similar mess. Finally, I calculated the total area, the area under the graph and got:
This is nothing; a difference of 0.011×10^5 is nothing when you are sampling 1000 points. It works out to an average difference of 1.131, or 0.059%. From this, I can conclude that the volume of trades has little to no effect on the closing price, at least where my neural network is concerned. All that really matters is the previous closing values.
Overall, there is certainly an opportunity to make money in the stock market, however using the model above, I wouldn’t really want to make big bets. With better models and more data, you could produce more accurate predictions, but you still must contend with the randomness of the market. I suggest further research before betting big.
Apr. 28, 2016 in University
This is the user manual for the Aqua programming language created as part of Programming Languages and Concepts. Visit the project on Github.
Aqua is a C-like imperative language for manipulating infinite streams. Statements are somewhat optionally terminated with semicolons, and the language supports both block ( /* ... */ ) and line ( // ... ) comments. Curly brackets are used optionally to extend scope. Example code can be found in the Appendices.
Before continuing, it’s helpful to familiarise yourself with Extended BNF. Special sequences are used to escape.
Once the interpreter has been compiled using the make command, you can choose to run an interactive REPL or a stored program. Executing ./mysplinterpreter with no arguments will start an interactive REPL. You should save your program files as <filename>.spl and pass the location as the first argument to ./mysplinterpreter. As data is read from standard in, you can pipe files in using the < operator, or pipe programs in using the | operator, allowing you to create programs that manipulate infinite streams.
./mysplinterpreter <file> [ < <input> ]
<program> | ./mysplinterpreter <file>
Programs are executed in multiple stages:
Nov. 26, 2015 in Algorithms, Java, University
Block world is a simple 2D sliding puzzle game taking place on a finite rectangular grid. You manipulate the world by swapping an agent (In this case the character: ☺) with an adjacent tile. There are up to 4 possible moves that can be taken from any tile. As you can imagine, with plain tree search the problem quickly scales to impossibility for each of the blind searches.
It is very similar to the 8/15 puzzles, just with fewer pieces, meaning it’s simpler for the algorithms to solve. It’s unlikely any of my blind searches could solve a well shuffled tile puzzle with unique pieces, but I suspect my A* algorithm could. However, before doing so I would want to spend time improving my Manhattan distance heuristic, so it gave more accurate results over a larger range.
I decided to use Java to solve this problem, as I’m familiar with it and it has a rich standard library containing Queue, Stack, and PriorityQueue. These collections are vital to implementing the 4 search methods. You can implement the different searches differently, but the data structures I listed just deal with everything for you.
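To illustrate how the choice of collection decides the search, here is a minimal sketch in Python rather than the Java I used. The 3×3 grid, the "@" agent marker and the goal pattern (letters fixed, "." meaning any tile) are all hypothetical choices for the example, and the Manhattan heuristic only counts lettered blocks.

```python
import heapq
from collections import deque

WIDTH, HEIGHT = 3, 3   # hypothetical grid size
AGENT = "@"            # stand-in for the agent character

def neighbours(state):
    """Yield each state reached by swapping the agent with an adjacent tile."""
    i = state.index(AGENT)
    row, col = divmod(i, WIDTH)
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r, c = row + dr, col + dc
        if 0 <= r < HEIGHT and 0 <= c < WIDTH:
            j = r * WIDTH + c
            tiles = list(state)
            tiles[i], tiles[j] = tiles[j], tiles[i]
            yield "".join(tiles)

def manhattan(state, goal):
    """Sum of grid distances from each lettered block to its goal cell."""
    total = 0
    for j, tile in enumerate(goal):
        if tile.isalpha():
            i = state.index(tile)
            total += abs(i // WIDTH - j // WIDTH) + abs(i % WIDTH - j % WIDTH)
    return total

def is_goal(state, goal):
    return all(state[j] == t for j, t in enumerate(goal) if t.isalpha())

def search(start, goal, strategy):
    """Return the number of moves in the path found, or None if none exists."""
    frontier = [] if strategy == "astar" else deque()

    def push(cost, state):
        if strategy == "astar":   # priority queue ordered by cost + heuristic
            heapq.heappush(frontier, (cost + manhattan(state, goal), cost, state))
        else:
            frontier.append((cost, cost, state))

    def pop():
        if strategy == "astar":
            return heapq.heappop(frontier)
        if strategy == "bfs":     # queue: first in, first out
            return frontier.popleft()
        return frontier.pop()     # dfs: stack, last in, first out

    push(0, start)
    seen = set()
    while frontier:
        _, cost, state = pop()
        if is_goal(state, goal):
            return cost
        if state in seen:
            continue
        seen.add(state)
        for nxt in neighbours(state):
            if nxt not in seen:
                push(cost + 1, nxt)
    return None
```

The loop is identical for all three strategies; only push and pop change. Unlike plain tree search, this sketch keeps a set of visited states, so even the depth-first variant terminates on a finite grid.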
Oct. 29, 2015 in Cybersecurity, Networks, University, Web
Using Nmap, I was tasked with scanning an IP range, to evaluate and report vulnerabilities.
Oct. 19, 2015 in Bash, Networks, University, Web
I produced a series of bash scripts to automate the process of pinging the list of websites. I chose bash as it is trivial to pipe the output from ping into various other command line programs such as sed, gawk and wget. As it was completely automated I decided to start early and just let it run. In total I pinged the top 100,000 websites up to 100 times each using script.sh (See Appendix A), recording useful statistics.
The script is very simple; a while loop to iterate over each site, and ping/ping6 piped into gawk to process the result. Gawk is very good at processing this sort of data, and more than fast enough to perform the task. The result was output to a large (11.1MiB) CSV file, with the IPv4 and IPv6 of each site, on separate lines.
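The processing step that gawk performs can be sketched like this. It is a minimal Python illustration, not script.sh itself, and it assumes the summary line printed by Linux ping (of the form "rtt min/avg/max/mdev = ... ms"); parse_ping_summary is a hypothetical helper name.

```python
import re

# Matches the four slash-separated timings at the end of ping's output.
SUMMARY = re.compile(r"= ([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+) ms")

def parse_ping_summary(output):
    """Extract min/avg/max/mdev in ms, or None if no summary line is present."""
    match = SUMMARY.search(output)
    if match is None:
        return None  # no replies: treat the site as down or blocking ping
    low, avg, high, mdev = (float(g) for g in match.groups())
    return {"min": low, "avg": avg, "max": high, "mdev": mdev}
```

Returning None when no summary line is found mirrors how a site that never replies ends up recorded as down.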
As part of creating this script, I assumed that every site, if it was online, would be able to respond within 10 seconds. If not, my script would time out and assume the site was down. I feel this is a reasonable assumption as any remotely popular site should respond quickly, unless it’s currently being DDoSed. Another assumption I made is that any site I scanned would be able to withstand 5 requests per second. Even a Raspberry Pi is capable of serving 43 static pages per second [1]. As I sent a maximum of 50 requests, the brief period of slightly increased load should be negligible for any of the sites I scanned.
In hindsight, I would have combined both IPv4 and IPv6 into a single line from the start, as manipulating the data in Excel is significantly easier to do if it is all on a single line. By that time I had already scanned the top 100,000 sites, so simply regathering the data was impossible. To fix this, I created combine.sh (See Appendix B) which simply echoes the IPv4 line without a newline, then the IPv6 line with one. This is the reason I have some duplicated columns in my combined output. These are removed in Appendix D.
Whilst looking through the IPv6 column I noticed a very common prefix: “2400:cb00:”. After some research I discovered that this prefix belongs to Cloudflare [2]. Using the prefixes I found on whatmyip.co [3], I created a table mapping the hosting company to the number of sites it hosts. The results are impressive.
I decided to look up the geolocation of each website. Looking around for a convenient database or API, I stumbled upon freegeoip.net [4]. It allows you to easily gather geolocation information for a specified IP in CSV form, perfect for my coursework. To retrieve this information using lookup.sh (See Appendix C) I self-hosted my own instance, then used cURL and a simple while loop to request and printf all the location information about each site to a file. I decided to record all the information given, to keep the script simple and to ensure I didn’t need to re-run the script.
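Counting sites per hosting company from address prefixes can be sketched as below. This is a minimal Python illustration; only the “2400:cb00:” prefix comes from my data, count_hosts is a hypothetical helper, and further prefixes would need adding to the table.

```python
from collections import Counter

# Prefix-to-company table; only the first entry is taken from the scan.
KNOWN_PREFIXES = {"2400:cb00:": "Cloudflare"}

def count_hosts(addresses, prefixes=KNOWN_PREFIXES):
    """Tally IPv6 addresses by the hosting company owning their prefix."""
    counts = Counter()
    for address in addresses:
        for prefix, company in prefixes.items():
            if address.lower().startswith(prefix):
                counts[company] += 1
                break
        else:
            counts["other"] += 1  # no known prefix matched
    return counts
```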
Once the data was collected, it was time to head to Excel to analyze the data and draw conclusions. Having a large dataset let me create very good graphs, and draw good conclusions but was tedious to work with in Excel. Certain formulas, such as the ones used to create the Average Response Time per Country over Distance graph, managed to crash Excel numerous times and it even ran out of memory every now and then. In future when dealing with similarly sized amounts of data I would need to look into other graphing tools.
I decided to plot all 100,000 points as an X-Y scatter. In this, and subsequent graphs, IPv4 is blue and IPv6 is red. Immediately I noticed rather obvious bands of pings, which are shown in the Histogram, below. There are large peaks in the graph above. It’s interesting to note that despite the lower adoption of IPv6, the initial peak is half the height of IPv4. Past that, the frequency is low, indicating lower IPv6 response times.
Thanks to CDNs and local sites, the top 1000 sites are concentrated around the 4-6ms mark with both averages trending slowly upwards. IPv6 is always significantly lower than IPv4 in the graph above. From a glance, you can see regions such as Africa, the Caribbean, and the Middle East without any IPv6 deployment. Sites are concentrated around the U.S.A, Europe and East Asia, with barren areas in-between.
Average response times for IPv4 only and IPv6 only sites are roughly the same at about 104ms. The average min and max are largely identical as well; nothing surprising. On the other hand, the average for sites running both IPv4 and IPv6 is very low in comparison: only 25ms compared with over 100ms! Now the question is, why are sites running both IPv4 and IPv6 significantly faster?
A vast majority (90%) of the internet is IPv4 only, with only 4.5% of sites providing both. In fact, more sites provide neither than both! It’s impossible to not have an IPv4 address and be truly connected.
A large number of the top 100,000 websites are either blocking ICMP echo requests (ping) or are simply offline. Alternatively, they could be listening only on a specific sub-domain. I didn’t check for this.
One reason sites that provide both IPv4 and IPv6 are faster is that 65% of them are behind Cloudflare or Google. Both have worldwide CDNs, and Cloudflare provides a free IPv6 gateway, allowing IPv4 only sites to be connected to using IPv6.
The U.S.A. is the world leader in number of hosted sites with 43% of the market. Comparatively, every other country is trailing behind with Canada at 9%, Germany at 6%, and Hong Kong at 6%. This is despite the existence of global CDNs.
To test the geolocation accuracy, I plotted estimated distance over average response time. Sites above the diagonal are likely closer than their IP suggests. Sites below the diagonal simply have a poor connection. Besides a colourful graph, this shows the grouping of sites in different countries, explaining why there are so many peaks in the Histogram. There is a minimum amount of time it takes to connect to distant hosts.
I only ran the script once, and as the script took easily a week to analyze the top 100,000 sites, chances are the sites at the top had changed since the start. I could have gotten around this by parallelizing the script, or running it multiple times on a smaller set and taking an average. Parallelizing seemed too complex for the task at hand, and I didn’t really consider running it multiple times before it had almost finished analyzing. As I didn’t want to discard all the data, I decided to go ahead with the data I had.
The geolocation database I used isn’t 100% accurate; that is virtually impossible. You can see this on the Average Response Time over Distance graph. Many sites are so far above the diagonal that the only way the time would be possible is by breaking the speed of light, an indication that they are located much closer than their IP suggests. This is likely so common because IPv4 is so exhausted that organizations are trading the limited number of IPv4 addresses they have access to. This doesn’t really matter too much, due to the size of my data set. In future I could remove the outliers to produce cleaner results.
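The speed-of-light check can be made concrete with a small calculation. This is a minimal Python sketch, not part of my original analysis: it assumes a straight-line path at the vacuum speed of light, which is a generous lower bound since real fibre routes are slower and longer.

```python
C_KM_PER_MS = 299.792458  # speed of light in vacuum, km per millisecond

def min_rtt_ms(distance_km, fraction_of_c=1.0):
    """Smallest possible round-trip time to a host distance_km away."""
    return 2 * distance_km / (C_KM_PER_MS * fraction_of_c)

def is_physically_possible(distance_km, rtt_ms):
    """False when a ping is faster than light allows for that distance."""
    return rtt_ms >= min_rtt_ms(distance_km)
```

For example, a site geolocated 10,000 km away cannot answer in less than about 67ms, so a 10ms response means the server is really much closer than its IP suggests.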
It is clear to see that sites that serve both IPv4 and IPv6 traffic are, on average, significantly faster than those that don’t (4 times faster on average). Every single graph I have produced shows this simple fact. I feel that this can be attributed to several factors:
However, despite the fact that average response time for IPv6 is significantly faster than IPv4, it’s unlikely you’ll see any speed increase switching between IPv4 and IPv6 on a host that supports both. IPv6 is faster because the hosts that serve both are well connected with fast response times, regardless of protocol.
Despite the best efforts of organizations such as worldipv6launch.org, Cloudflare, and Google, IPv6 access is an afterthought. Despite claims of 500% growth since 2012, it’s 2015 and only 4.5% of sites support IPv6. As we go into the future, IPv6 deployment will surely grow as larger populations and the Internet of Things will strain the exhausted IPv4 pool even further. Until IPv6 is widespread, anyone with only an IPv6 address will be unable to connect directly to IPv4 only hosts, without the aid of a tunnel.
IPv6 isn’t evenly geographically distributed, compared with IPv4. If you’re in Africa, the Caribbean, or the Middle East, virtually no sites support IPv6. This suggests to me that the infrastructure required to support IPv6 just isn’t there.
Bigger sites are more likely to support and have a fast IPv4 and IPv6 connection than smaller sites. As you go through the different sites, the further down you get, the slower the site is to respond, on average.
Jan. 9, 2015 in History, University, Web
From an underground Swiss bunker to all around the world; the World Wide Web has transformed from an experiment in academic distribution to the massively interconnected strength that we know today. While the current Web is relatively new, it has not only revolutionised the world, but promises to continue as our society strives towards the Internet of Things.
This report will untangle not only the history of the World Wide Web, but its many predecessors: designed and implemented, successful and not. It shall be accomplished by studying some of the attempts at webs from the past century; starting with the Mundaneum and ending with the current Web.
As the report approaches the end, it will look at the future of the Web. Devices are becoming connected, with smart devices such as phones, televisions and even thermostats sharing data with themselves, and their manufacturers. Privacy has been and will continue to be an important issue as greater amounts of data are shared.
Oct. 1, 2014 in Photos, University