Like many people, Americans and non-Americans alike, I've been glued to the terrifying coverage of the 2016 US Presidential Election.  With only a day to go, I decided to join the large array of blogs and news outlets that have reviewed the vast quantities of polling data available.

The source

First of all I had to get hold of some polling data.  There are plenty of sources available on the internet, and I had a look at a few, but I decided to go with the poll results collated by electionprojection.com, mainly because of how easy they were to obtain.
To get hold of the state-by-state data I wrote a small Python script to parse the HTML tables on the electionprojection website:

# Get poll data
from urllib.request import urlopen

import pandas as pd
from bs4 import BeautifulSoup

# electoral_college is a DataFrame of state names and electoral votes,
# pulled from Wikipedia (see below)
data = []
for loop in range(len(electoral_college)):
    state = electoral_college['State'].iloc[loop]
    state = state.replace(" ", "-")
    if state.lower() == 'district-of-columbia':
        continue
    url = 'http://www.electionprojection.com/latest-polls/' + state.lower() + \
          '-presidential-polls-trump-vs-clinton-vs-johnson-vs-stein.php'
    the_content = urlopen(url)
    soup = BeautifulSoup(the_content, "html.parser")
    table = soup.find('table', attrs={'class': 'mb pollblock'})
    rows = table.find_all('tr')
    for row in rows:
        cols = [ele.text.strip() for ele in row.find_all('td')]
        cols.insert(0, state)
        data.append([ele for ele in cols if ele])  # drop empty cells

data = pd.DataFrame(data)
data.columns = ['State', 'Firm', 'Dates', 'Sample',
                'Clinton', 'Trump', 'Johnson', 'Stein', 'Spread']
poll_data = data[data.Firm.notnull()]                     # remove header/blank rows
poll_df = poll_data[poll_data.Firm != 'EP Poll Average']  # keep individual polls only


Given the US system uses the electoral college to determine the next president, I also needed a list of electoral votes for each state.  Wikipedia could help:
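As a sketch of one way to pull that table, pandas can read it straight from the page (the table index and the 'Electoral_Votes' column name are assumptions that would need checking by hand):

import pandas as pd

# pd.read_html returns every <table> on the page as a DataFrame
tables = pd.read_html('https://en.wikipedia.org/wiki/United_States_Electoral_College')
electoral_college = tables[0]  # adjust the index after inspecting `tables`
# then tidy it into the two columns used elsewhere: 'State' and 'Electoral_Votes'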

The simulation

To determine tomorrow's winner I took the poll results for each state, applied a random variable to each candidate's share, and ran it a tonne of times (Monte Carlo).  This approach estimates the winning candidate in each state, who can then be attributed that state's electoral votes.  Simply summing up the votes gives us the winner.
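Here's a minimal sketch of that simulation, assuming the scraped polls have been averaged into a state_polls DataFrame with one row per state, a numeric share column per candidate, and an 'Electoral_Votes' column (those names, and the 3-point noise level, are illustrative assumptions):

import numpy as np

candidates = ['Clinton', 'Trump', 'Johnson', 'Stein']
n_sims = 10000
wins = {c: 0 for c in candidates}

for _ in range(n_sims):
    votes = {c: 0 for c in candidates}
    for _, row in state_polls.iterrows():
        # perturb each candidate's share with noise to represent polling error
        noisy = {c: row[c] + np.random.normal(0, 3) for c in candidates}
        state_winner = max(noisy, key=noisy.get)
        votes[state_winner] += row['Electoral_Votes']
    overall_winner = max(votes, key=votes.get)
    wins[overall_winner] += 1

for c in candidates:
    print(c, 100.0 * wins[c] / n_sims, '%')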



This chart shows the percentage chance of victory as a function of the number of simulations run.  It's clear to see that this quickly converges to 74% for Clinton and 26% for Trump.  My random variable wasn't quite random enough to give Johnson or Stein a chance.

The conclusion

Well, Nate Silver of fivethirtyeight.com estimates a 72% chance of a Clinton victory... So look at that... Maybe I should look for a career change.

Analysing stars used to be my bread and butter, but it's still an area of great interest.  Therefore when I came across the Open Exoplanet Catalogue and found out it had host star information, I couldn't resist plotting out a Hertzsprung-Russell diagram:


I'm curious about those hot stars (the blue cluster); their effective temperatures and luminosities are a bit odd.  I'd love to think they were blue dwarf stars, but as none have ever been observed, I doubt it.
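For anyone wanting to reproduce the plot, it only takes a few lines.  A minimal sketch, assuming the host-star effective temperatures and luminosities have been pulled out of the catalogue into a CSV (the file and column names are illustrative):

import pandas as pd
import matplotlib.pyplot as plt

stars = pd.read_csv('host_stars.csv')  # columns: temperature (K), luminosity (L_Sun)

plt.scatter(stars['temperature'], stars['luminosity'], s=5, alpha=0.5)
plt.xlabel('Effective temperature (K)')
plt.ylabel('Luminosity (L_Sun)')
plt.yscale('log')
plt.gca().invert_xaxis()  # HR diagrams run hot-to-cold, left-to-right
plt.show()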

A note before we start, this is by no means an original analysis, just my attempt at learning something new.  With that said:

A slow period in work, and a desire to learn something new, have had me fumbling around with Python for data analysis over the last few weeks.  I've dabbled in Python before, mainly for parsing the rugby data collected in a previous post.  However, this time I wanted to see what I could do with Python from start to finish.

Therefore I trawled the internet for resources and training material, and came across an excellent blog by yhat using their Rodeo IDE for data science. Be sure to check it out. So, with Rodeo downloaded and my iPhone Health app's step-count data extracted, I decided to repeat their analysis and train my Python skills.

The source

First of all I extracted my Health app data from my iPhone and loaded it into Rodeo to get a sense of the data it collects.  Depending on how much you use the Health app's features, there's actually quite a bit of data you can pull from it, e.g. heart rates, distances, steps.
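Getting the steps out of the export is straightforward enough.  A sketch, assuming the Health app's export.xml and the standard HealthKit step-count record type:

import xml.etree.ElementTree as ET
import pandas as pd

tree = ET.parse('export.xml')
records = [(r.get('startDate'), float(r.get('value')))
           for r in tree.getroot().iter('Record')
           if r.get('type') == 'HKQuantityTypeIdentifierStepCount']
steps = pd.DataFrame(records, columns=['date', 'steps'])
steps['date'] = pd.to_datetime(steps['date'])
# individual records cover short intervals, so sum them into daily totals
daily = steps.set_index('date').resample('D').sum()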


OK, so a quick time series doesn't show much, just that there was a period of time (around day 550) when either I didn't walk at all or just didn't have my phone.  The flat part of the time series, up to day 80 or so, is the period before I received the iPhone, and the first point is actually 9 steps; they must have been messing around with my phone in the factory.

The interesting


I'm pretty sure I walk much less during the weekend than the weekdays...


Yup, that definitely confirms it.  The above plot shows a frequency distribution of my steps during the weekends (purple) and weekdays (teal), and the conclusion is clear: I need to walk more during the weekends.  One can also see a bi-modal distribution in the weekday data, which I believe is due to a 'weekend' effect, i.e. on bank holidays I do very little walking, just like at the weekends.
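The split itself comes down to the day of the week.  A sketch, assuming the daily step totals built above:

import matplotlib.pyplot as plt

weekend = daily[daily.index.dayofweek >= 5]['steps']
weekday = daily[daily.index.dayofweek < 5]['steps']

plt.hist(weekend, bins=30, alpha=0.6, color='purple', label='Weekend')
plt.hist(weekday, bins=30, alpha=0.6, color='teal', label='Weekday')
plt.xlabel('Steps per day')
plt.ylabel('Frequency')
plt.legend()
plt.show()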

Currently I'm using two phones, and tend to have them both with me most days.  My second phone is an Android device, the WileyFox Swift, with the Google Fit app installed, so I could load up the step data from this device and perform a quick comparison between the two:


A negative step difference means the Android device recorded more steps, so it looks like there are actually plenty of occasions where I've forgotten to carry my iPhone.
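For the comparison, something like the following would do, assuming iphone and android are daily step-total DataFrames built as above (one 'steps' column each, indexed by date):

both = iphone.join(android, lsuffix='_iphone', rsuffix='_android', how='inner')
both['difference'] = both['steps_iphone'] - both['steps_android']
both['difference'].plot()  # negative values: the Android logged more steps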

For this final plot, I've taken the maximum number of steps each day, from either iPhone or Android, and then cut them based on whether the date falls before or after 14th July 2016.  Why this date? That's when PokemonGo came out in the UK...


One can clearly see the bi-modal distribution in both the pre- and post-PokemonGo step distributions; however, the standard deviation of the post-PokemonGo distribution is noticeably larger.
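A sketch of that max-and-split step, carrying on from the both DataFrame above:

import pandas as pd

both['best'] = both[['steps_iphone', 'steps_android']].max(axis=1)
launch = pd.Timestamp('2016-07-14')  # PokemonGo's UK release
pre = both.loc[both.index < launch, 'best']
post = both.loc[both.index >= launch, 'best']
print(pre.std(), post.std())  # compare the spread of the two distributions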

The conclusion

Well, what does all this tell me:
  1. I need to walk more on the weekends
  2. PokemonGo is helping with that, just not enough
So I'm going out for a walk.

For those unaware, tomorrow is a big day for British science and in particular space exploration.  Tim Peake, the first ESA astronaut selected from the UK, will be heading up for a six-month tour on the ISS.  Those who know me will understand my fascination with such an event.

I decided it would be nice to mark this occasion, in my own way, by pulling together another Tableau dashboard centred around the ISS, and in particular its astronauts.  A friend of mine was curious about how long the astronauts spend on the ISS, so I decided to look at this in terms of how far mankind could have travelled from the Earth in that time.

Now I made some major assumptions here: the ISS travels on average at 27,600 km/h, and executes 15.54 orbits of the Earth in a single 24-hour period.  With these assumptions, and knowing how long each expedition to the ISS has lasted (according to NASA), I did a very rough (back-of-the-envelope, if you will) calculation to illustrate how far these astronauts could have gone from the Earth.
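The arithmetic really is back-of-the-envelope:

ISS_SPEED_KMH = 27600            # average ISS speed, as above
KM_PER_DAY = ISS_SPEED_KMH * 24  # ~662,400 km covered each day

def distance_km(days_aboard):
    # very rough distance an astronaut covers in `days_aboard` days
    return KM_PER_DAY * days_aboard

print(distance_km(180))  # a typical six-month expedition: ~119 million km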



As you can see above, I've included the Moon, Mars, and Jupiter as a bit of a reference, and it's clear that only the most experienced of astronauts, like Gennady Padalka, have made it to the Red Planet at its average distance of ~225 million km.

I'm really looking forward to launch tomorrow, and seeing some of the experiments Tim will be running while on board the ISS.  And of course adding Tim's data to the graphic above!

Edit: updated to include Jupiter and to fix a slight issue with my scale... That's a reminder to have someone check your work before publishing.


I'm a fan of rugby.  Not the biggest fan in the world, but I enjoy watching a game here or there, whenever I've the time really.  Ulster games, however, I do try and make time for, probably because I'm from Northern Ireland and live a stone's throw from Ravenhill (Kingspan Stadium).

Many rugby fans will know it's a common occurrence during a game to see at least one yellow card brandished by the referee. And it's almost as common to hear the commentator state something along the lines of "a yellow card is worth 10 points to the opposition". Well, is it?



I wanted to explore this commentator statistic to see if it has held true for Ulster over the last few seasons.  The first Tableau graphic illustrates the average number of points each opponent loses during a yellow card whilst playing Ulster.  These data have been accumulated for all Ulster games from the 2012/2013 season up to November 2015, but I decided to only use Pro12 data for this graphic.  What I found of immediate interest is that Zebre, often considered the worst team in the Pro12, are one of only two teams that tend to come out the better side during a yellow card.  Granted, this is a somewhat biased analysis, as it only considers Ulster as the opposition.

If I take a simple average across all the teams, it's clear Ulster don't benefit from the full 10-point average, but they still tend to see around a 7-point advantage.
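As a sketch of how that average falls out, assuming a hypothetical cards DataFrame with one row per yellow card and the points each side scored during the sin-bin spell (all column names are illustrative):

cards['swing'] = cards['ulster_points'] - cards['opponent_points']
by_team = cards.groupby('opponent')['swing'].mean().sort_values()
print(by_team)                # negative rows: teams that come out better during their own card
print(cards['swing'].mean())  # the overall average, around the 7 points above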

OK, so that was a quick and simple analysis.  I still have plenty of yellow card data to play with, so next I'll have a look at who's the biggest culprit for getting on the wrong side of the ref.




In this Tableau graphic I wanted to continue the theme of looking at the impact a yellow card can have on a team's performance.  This time I looked specifically at Ulster and how each of their yellow cards has affected a game.  I also wanted to highlight which players are more prone to getting on the wrong side of the ref.

As an Ulster fan it might be a bit of a surprise to see Iain Henderson and Rory Best near the top of the list. Or that four of the top ten are second rows, namely Iain Henderson, Dan Tuohy, Franco Van Der Merwe, and Lewis Stevenson. I also find it curious that Ulster really suffer with Dan Tuohy in the sin bin.

Although there isn't anything difficult in my analysis, it was an exercise I found enjoyable and something I haven't really seen done before. 

I still have plenty of yellow card data to look at, but I feel it would be more beneficial to have a full population - specifically all yellow card instances for all teams, and all major leagues.  I'm sure this will be something I'll return to soon enough.