Analyzing Topics and Subjects

Peter Grimmett – u3163211

Martin Ruckschloss – U3114720

Issues encountered with data:

One major problem we had with the data was that words were grouped together in phrases such as “Aboriginal Australians”. This was problematic for some aspects of visualization: if we wanted to see the frequency of the term “Australians”, the phrase “Aboriginal Australians” would not be included in that count. To solve this, we modified the Jupyter notebook to split phrases into individual words at the spaces between them. The result was a more workable data set on which we could conduct further analysis.
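A minimal sketch of this splitting step, assuming the raw terms are held in a plain Python list (the variable names and sample phrases here are illustrative, not the notebook's actual data):

```python
from collections import Counter

# Hypothetical sample of phrase-level terms as they appeared in the raw data
phrases = ["Aboriginal Australians", "Australians", "Canberra", "Aboriginal Australians"]

# Split each phrase on whitespace so a multi-word term contributes
# one count per constituent word
words = [word for phrase in phrases for word in phrase.split()]

counts = Counter(words)
# "Australians" is now counted inside "Aboriginal Australians" as well
print(counts["Australians"])  # 3
```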

Another major issue with the data set is that a large portion of the collection has unique words or phrases as identifiers, which made the sheer vastness of the data apparent. When split into single words, the statistics of the collection are as follows:

Descriptions:

-6,600 unique words used

-Approximately 3,080 words with a count of 1 (46.6% of descriptions)

-1,170 words with a count of 2 (17.7% of descriptions)

-2,010 words with counts ranging from 3 to 20 (30% of descriptions)

Places:

-282 unique words used

-188 places with a count of 1-2 (66.6%)

Topics:

-845 unique words used

-482 topics with a count of 1-2 (57%)

-243 topics with a count of 3-10 (28%)

Titles:

-3,000 unique words used

-2,327 with a count from 1 to 3 (77%)

As can be seen from the statistics above, a majority of the terms used in the title, description, place and topic fields are unique or near-unique. Such data is difficult to visualize because there are few relationships between terms and no categories by which to group them.

One solution is to filter out terms with little to no use and visualize only terms with high counts; for example, excluding any term with a count below 20. Another method was to graph the most frequently used terms first, then separately graph the less-used terms. The resulting visualizations were much more meaningful to the viewer.
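The threshold filter described above can be sketched in a few lines; the word counts here are hypothetical, and the cutoff of 20 follows the example in the text:

```python
from collections import Counter

# Hypothetical word counts standing in for the collection's real data
counts = Counter({"history": 120, "Canberra": 45, "mining": 19, "quarry": 1})

THRESHOLD = 20  # exclude any term with a count below 20

# Keep only the frequently used terms for visualization
frequent = {word: n for word, n in counts.items() if n >= THRESHOLD}

print(frequent)  # only "history" and "Canberra" survive the filter
```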

Visualization:

Upon testing various methods of visualization, we found that the single best method was a basic bar chart, with the X-axis containing phrases or words and the Y-axis measuring frequency. As stated above, various filtering methods were applied to make the visualization more meaningful. The results are as follows:

https://drive.google.com/open?id=1fOa4mYCYi3QjEHlTV_M0ALsJxKdJCyLodqLVefwxKlU
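A bar chart of this kind can be sketched with matplotlib (used here as a stand-in; the interactive charts linked below were produced with Plotly), using hypothetical filtered counts:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display required
import matplotlib.pyplot as plt

# Hypothetical word counts after filtering (all counts of 20 or more)
data = {"history": 120, "Canberra": 45, "Aboriginal": 38, "mining": 22}

words = list(data.keys())
frequencies = [data[w] for w in words]

fig, ax = plt.subplots()
ax.bar(words, frequencies)            # words along the X-axis
ax.set_ylabel("Frequency")            # counts along the Y-axis
ax.set_title("Word frequency (count >= 20)")
fig.savefig("word_frequency.png")
```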

Word clouds:

Another useful method of visualization we explored was word clouds. We applied filtration methods similar to those used for the bar charts, generating word clouds from variously sized subsets of terms.

https://drive.google.com/open?id=1iMpNghu_cQ4Sm9UDCnNfjveQWhQutNxuEOvp9myq0LQ

Further analysis of data:

The following graph is a good indicator of the disproportionate spread of topics and just how vast that list is. The numbers along the bottom indicate how many times a word appears in the collection, so the bar labelled “5” means there are 28 words that each appear exactly five times throughout the collection.
The 351 single-use topics (topics that appear only once in the collection) account for nearly half of the cleaned topic count.
However, of the total of 8,624 words, the 98–926 range makes up the largest portion, consisting of 3,670 words even though it comprises only 17 unique topics (whereas the “1” bar comprises 351 topics, each used only once).

https://drive.google.com/open?id=11954aDiG5ii7vYJ60K4RpdpPu04sRhqdi46rdHIJfFo
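The “split by count” view above amounts to counting how many distinct words fall into each frequency bucket. A standard-library sketch, using hypothetical per-word counts rather than the cleaned topics list itself:

```python
from collections import Counter

# Hypothetical per-word counts standing in for the cleaned topics data
word_counts = Counter({"a": 1, "b": 1, "c": 5, "d": 5, "e": 5, "f": 300})

# Invert the counts: how many distinct words occur exactly n times?
frequency_of_counts = Counter(word_counts.values())

print(frequency_of_counts[1])  # 2 words appear exactly once
print(frequency_of_counts[5])  # 3 words appear exactly five times
```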

These links should provide a more interactive look at both the topics list and subjects list.
https://plot.ly/~PeterGrimm/13/

Topics split by count

https://plot.ly/~PeterGrimm/15/

Subjects split by count