This class was an interesting experiment for me. I certainly didn’t mind switching up the usual history class dynamic by throwing in some research theory and computer skills […and a bit of a break on my weekly readings]. I also liked that Dr. Turkel was very willing to answer questions mid-class and to work through problems that weren’t necessarily part of the lecture. I appreciated the in-class activity time as well: I would not have learned much without being able to replicate lecture examples on my own, and having the activities set up for us made that much easier. Having class time to work through them meant we didn’t run into snags that stopped us from getting all the way through, and ensured that we did, in fact, do them regularly.
Despite enjoying the class, for the most part, I am unsure how to apply the skills we learned to my own research. For a historian looking at more recent phenomena, such as social media, celebrity culture, or material that is frequently housed online (notable works of literature, for example), the tools Mathematica brings to the table are very useful. If one is looking to manipulate images in various ways, a strong case could be made for the program there as well. So far, my research has not taken me to any of those places, so my use of Mathematica would be limited to machine-readable primary sources that I have access to online and that are not part of an existing database with a competent search function. I can’t see an immediate use for the program for myself, but that doesn’t mean it won’t be a tool I keep in the back of my mind for the future. Dr. Turkel has been enthusiastic about future applications of text recognition, and, if his predictions about the technology prove true, I can definitely see programs like Mathematica being used increasingly for research purposes.
Some final suggestions: it may be useful in the future to build activities around more texts that are not in Mathematica’s database, or at least around fewer works of fiction. While they may be convenient for demonstrating a method, I think more historically relevant texts would give a better sense of how these methods could be useful to historians (the State of the Union address examples were better, in my opinion, than Alice in Wonderland or The Raven). It might also help to require or encourage students to suggest sources themselves, either ones they have used in previous papers or courses, or ones they are using for a current project; these sources could then be analyzed with the methods we learned in class. Take these with a grain of salt, seeing as I’m just a student, but those were my impressions of the class from this side of the glass.
Our second-last class was on identifying images, photogrammetry, and georectification. The first section had some predictable faults: facial identification was wonky, as were image identification and machine learning. As someone who isn’t particularly familiar with how computers do what they do, it gave me the impression that these tools simply do not have a high enough accuracy rate, as of right now, to be of much use, especially when applied to large groups of images (which I assume is the main benefit of digitizing research efforts). This is similar to a complaint I had about OCR: while these tools seem useful, I’m not sure how comfortable I feel using them for serious work given the current state of the technology.
A more interesting experiment came in the second half of the lecture. We took an image of Edinburgh from earlier in the 20th century and an image from today, and superimposed one upon the other. This method lets the user quickly and conveniently see the differences between old photos and new ones. It obviously has its limitations: to evaluate change, you need photos taken from a similar angle, both from the period you are interested in and from an earlier or later one. Still, I can see the value. In undergrad, I worked on a project with a local museum researching buildings along King St. in Dundas, Ontario. There are far more old photos of currently standing buildings than one might think, and this method would have been very useful for my project. Some of the buildings I looked into had changed significantly, others in more minor ways; being able to see the differences at a glance would have helped, as I was looking at various photos from various time periods of a half dozen separate buildings. Speaking of museums, small interactive displays like this also have applications for the public, giving visitors a quick look at how the landscape of a display area has changed.

My own experimentation at home with this section of the lecture went awry quite quickly: I attempted to download and import images from the same Flickr blog used in the lecture, but the standardize-image command threw a number of errors (it seemed to think the image was not an image but a time entity, even though the image showed up fine during the import process). I’ve no idea what went wrong.
For the final bit of class we looked at a similar method of comparing maps; I won’t comment on this too much, but I can see the value in comparing, at a glance, how landscapes have changed, or how our perceptions of landscapes have changed, similar to the discussion above. All in all, an interesting week for me.
On Nov 15 we looked at page images and OCR, or Optical Character Recognition. This is perhaps the most exciting part of the course, if only for its potential future applications, and it is also, perhaps unfortunately, the part most relevant to me. While the example we went through in class was not particularly mangled by the OCR software, it used relatively easy-to-read text, and the recognition was still off by enough that I wasn’t left with the impression the tool would be particularly useful at the current moment. That said, Dr. Turkel has said multiple times that this is an up-and-coming field in computer science, which, if results are forthcoming, could have massive benefits for researchers working with large numbers of primary sources. If it can be made more accurate, I believe this to be true: it is not the method I am critical of, but the current state of the technology.
Later in the class we looked at image processing. This was a fun exercise; I particularly enjoyed the comparison of doctored images of Josef Stalin. ImageDifference (which we used to look at the Stalin photos) and the unblurring tool were the most interesting to me. Seeing differences between similar images is a relatively simple tool to apply. Unblurring is perhaps more useful because of its potentially wider usage: many online sources I have used are blurred to some degree, whether because of damaged original documents or poor-quality scans, so I can see this tool being useful. My question is how well the tool works on text that has not been blurred digitally (whether by a poor-quality scan or intentionally on a computer, as the example in class was). If it can be used to make physical documents clearer, that would be a very valuable tool.
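To make the ImageDifference idea concrete, here is a toy sketch in Python rather than the Mathematica we used in class. The "images" are just nested lists of grayscale values that I made up for illustration; the point is only that the difference of two images lights up exactly where they disagree, which is how the Stalin retouching shows itself.

```python
def image_difference(img_a, img_b):
    """Pixel-wise absolute difference of two same-sized grayscale 'images'
    (nested lists of 0-255 values); non-zero pixels mark disagreement."""
    return [
        [abs(a - b) for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(img_a, img_b)
    ]

# Hypothetical 3x3 images: the 'doctored' copy has one bright pixel removed.
original = [[10, 10, 10],
            [10, 200, 10],
            [10, 10, 10]]
doctored = [[10, 10, 10],
            [10, 10, 10],
            [10, 10, 10]]

diff = image_difference(original, doctored)
# Every pixel is 0 except the one that was retouched.
```

Real tools work the same way at heart, just on millions of pixels and with colour channels.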
Our class for Nov 3rd dealt with TF-IDF, which I felt to be a much more precise tool than those we had used in previous weeks. Rather than looking at simple word or phrase frequencies, TF-IDF assigns a ‘rating’ to a string which gives the user an idea of how important that string is to the particular section of the text being examined. I have mixed feelings about this tool. On the one hand, it can be useful for ascertaining which sections of a text discuss certain terms at length; however, the example of the word ‘selection’ in The Origin of Species made me think it is sometimes a poor tool for identifying generally important words that are used throughout a text. If a word is used frequently throughout a text, its TF-IDF score will be low; if its use is clumped into a single section, its score will be high in that section.
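The low-when-everywhere, high-when-clumped behaviour can be sketched in a few lines of Python (a simplified TF-IDF with no smoothing, not the course's own code; the three "sections" are short word lists I invented to stand in for chapters):

```python
import math
from collections import Counter

def tf_idf(sections):
    """TF-IDF score for every word in every section.

    sections: list of word lists, one per section of a text.
    A word appearing in all n sections gets idf = log(n/n) = 0, so its
    score is 0 everywhere; a word clumped into one section scores high."""
    n = len(sections)
    df = Counter()                      # in how many sections each word appears
    for sec in sections:
        df.update(set(sec))
    scores = []
    for sec in sections:
        tf = Counter(sec)
        scores.append({w: (tf[w] / len(sec)) * math.log(n / df[w]) for w in tf})
    return scores

# Hypothetical sections: 'species' is spread across all three,
# 'pigeons' is clumped into one.
sections = [
    "natural selection acts on variation within a species".split(),
    "the species of pigeons bred by fanciers vary greatly".split(),
    "geology records the succession of species through time".split(),
]
scores = tf_idf(sections)
```

Here `scores[1]["pigeons"]` comes out positive while `scores[i]["species"]` is exactly 0 in every section, which is precisely the behaviour that made ‘selection’ look unimportant in Origin.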
Our example in class computed TF-IDF scores between sections of a book. I wonder whether it would be possible to apply TF-IDF to entire texts, compared to one another. Rather than comparing the relative importance of the phrase ‘natural selection’ within Origin, the user could then figure out which of Darwin’s works the phrase is most important to. Given that we all know who Darwin is, this wouldn’t be useful for him, per se, but someone looking at a body of works by another author or group of authors, armed with little knowledge beyond a few key terms, might get a better idea of where to start their analysis by using the tool this way. I may be off base with this thinking, but it is what struck me when I re-read the lecture for this week.
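The corpus-level version I am imagining works the same way, just treating each whole work as one "document." A rough Python sketch (the tiny word lists below are invented stand-ins, not Darwin's actual texts):

```python
import math
from collections import Counter

def doc_tfidf(docs):
    """docs: dict of name -> word list; returns name -> {word: score}.
    Words shared by every work score low; words distinctive to one score high."""
    n = len(docs)
    df = Counter()
    for words in docs.values():
        df.update(set(words))
    return {
        name: {w: (c / len(words)) * math.log(n / df[w])
               for w, c in Counter(words).items()}
        for name, words in docs.items()
    }

# Hypothetical mini-corpus standing in for an author's body of work
corpus = {
    "origin":  "natural selection explains the origin of species".split(),
    "descent": "sexual selection shaped the descent of man".split(),
    "worms":   "earthworms slowly form vegetable mould in soil".split(),
}
scores = doc_tfidf(corpus)
```

With this framing, a researcher who knew only a few key terms could rank an unfamiliar author's works by where those terms matter most, which is the starting-point use I had in mind.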
Our class on Sept 27 dealt with pattern matching, an interesting and flexible tool. Using a variety of methods, Mathematica can identify patterns within texts and pull them out for the user. StringCases is the most relevant command; it can be modified to search for words ending in certain suffixes, containing certain strings of letters, and other scenarios. This is useful when searching for variant forms of words. Older documents in particular can have imprecise spellings, and spelling has changed over time, making a standard word search less useful than being able to identify any and all variations of a particular keyword. Pattern matching can also help identify broader things about a text: you could pick a subject area you wish to look at and search for variations of a word or words related to that subject.
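I can't reproduce Mathematica's StringCases here, but the same idea can be sketched with Python regular expressions (the sample sentence is my own, chosen to show a spelling-variant search and a substring search side by side):

```python
import re

text = "The colour and color of the armour; soldiers bearing arms marched with the army."

# Catch spelling variants: 'colour' and 'color' with one optional letter,
# roughly what a StringCases pattern with an optional character would do.
variants = re.findall(r"\bcolou?r\b", text)

# All words containing the substring 'arm', like searching for a key string
# inside longer words.
arm_words = re.findall(r"\b\w*arm\w*\b", text)
```

`variants` picks up both spellings, and `arm_words` finds ‘armour’, ‘arms’, and ‘army’ in one pass, which is exactly the kind of catch-all-variations search that plain word lookup misses.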
The pattern-matching lesson made experimenting with commands like StringCases much easier than in other lessons. It has some drawbacks (for example, you have to scroll through the text to find the location of the word or phrase you searched for), but I found it manageable. For context, I pulled up a long document (the US Constitution) and found the process still very easy. Searching for a term like ‘Arm’ yields all words containing that string of letters; from that search the user can easily find all mentions of the US army in the Constitution (it is mentioned only twice, and a third term, ‘arms’, is found in the Second Amendment, as one would expect).
For this post I would like to revisit a previous week, in which we discussed keywords in context, or KWIC, as well as n-grams. The basic idea of the lesson was quite straightforward: it is important to be able to retain or obtain context when searching for key terms within a text. N-grams are one way to search for keywords with context, and one I was already familiar with via Google’s Ngram Viewer. N-grams themselves can be used much like word frequency, to get a sense of the main subject material of a text. However, because they retain context (depending on how many words are in the n-gram), they can theoretically also be used to ascertain how certain subjects are discussed. An easy way to visualize this is a concordance, which we built from 7-grams: a list of every occurrence of each term with three words on either side, as they appear in the text, in alphabetical order. While the objective of a concordance might be to retain some context, I felt it was too limiting to provide meaningful information. For example, I replaced the example Alice in Wonderland text with the US Constitution; the concordance for that text yielded very little usable information, with virtually unintelligible sentence fragments making up most of it. This tells me two things: a) n-gram length should be tailored to the text (and is therefore difficult to apply to groups of texts), and b) n-grams are more useful for searching for particular phrases within texts (as one might do in a search engine, or as we did in searching for both ‘social’ and ‘security’ rather than either one) than for retaining context.
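The concordance mechanics are simple enough to sketch in Python (this is my own toy version of a KWIC listing, not the class notebook; the sample text is the opening of the US Constitution's preamble):

```python
def kwic(words, term, width=3):
    """Keyword-in-context lines: each occurrence of `term` with `width`
    words on either side (a 7-gram when width is 3), sorted alphabetically
    in the style of a concordance."""
    lines = []
    for i, w in enumerate(words):
        if w.lower() == term.lower():
            left = words[max(0, i - width):i]
            right = words[i + 1:i + 1 + width]
            lines.append(" ".join(left + [w] + right))
    return sorted(lines)

words = ("we the people of the united states "
         "in order to form a more perfect union").split()
hits = kwic(words, "the")
```

Even here you can see the fragment problem I ran into: each line is a readable scrap, but stripped of the surrounding sentence it conveys very little on its own.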
It has been a while; I’d like to revisit some lessons from past weeks to catch up. For this blog I’ll focus on our lesson on ‘Maps’, as I was away for that class. Going through the lesson, I replaced various parts of the notebook with alternative terms to see what else Mathematica knew; for example, it was interesting to compare various map projections at once, to visualize how they differ from one another. The common Mercator projection looks quite odd when compared to the Robinson, Bonne, or cylindrical equal-area projections. The Wolfram documentation also showed me how to apply these projections to particular areas of the world; coincidentally, Canada was used in the example, via a custom ‘mapCanada’ function that let the user see Canada on various types of maps. Canada is a large country that is heavily distorted in many projections (particularly the Mercator), so using it for a comparison of projections seemed preferable to using the entire world. The documentation also covered the GeoPath and GeoPosition commands, which can be used to plot paths on a projection and to specify geographic positions, respectively. GeoPath is a command we used in the lesson, so it was quite clear – although I did find out you could use it to, for example, highlight a particular parallel [such as the 49th]. GeoPosition can be used to limit a projection; on an orthographic projection, one can virtually simulate ‘turning a globe’.
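The reason Canada looks so distorted on the Mercator is just the projection's math, which can be shown without any mapping software at all. This little Python sketch (my own, not Mathematica's GeoGraphics) computes the standard Mercator east-west stretch factor, 1/cos(latitude):

```python
import math

def mercator_y(lat_deg):
    """Mercator north-south coordinate for a latitude (standard formula)."""
    lat = math.radians(lat_deg)
    return math.log(math.tan(math.pi / 4 + lat / 2))

def mercator_stretch(lat_deg):
    """East-west scale factor on a Mercator map relative to the equator."""
    return 1 / math.cos(math.radians(lat_deg))

# At the 49th parallel (much of the Canada-US border) features are drawn
# about 1.5x wider than at the equator; near 70 N, almost 3x wider.
stretch_49 = mercator_stretch(49)
stretch_70 = mercator_stretch(70)
```

An equirectangular map, by contrast, just uses the latitude in radians directly, which is why side-by-side comparisons of projections make Canada's shape change so dramatically.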
The activity for this week was quite straightforward. The Bermuda Triangle was our focus, as we first determined its vertices along with their respective latitudes and longitudes: Miami at 25.78, ~-80.2; San Juan at 18.44, -66.13; and Bermuda at 32.3, -64.79. San Juan and Bermuda gave me some trouble, as Mathematica does not always adhere to the ‘city, state, country’ format [Bermuda was ‘Hamilton, Bermuda, Bermuda’, though it is a British possession, and Puerto Rico was also considered its own entity though it is an American territory]. After a great deal of trying (I had been missing a pair of brackets…) I managed to plot the points on the map as the instructions laid out, which also let me easily see that the HMS British Consul shipwreck was not inside the triangle, the final part of the activity.
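The eyeball check at the end (is a point inside the triangle?) can also be done numerically with a standard same-side test. Here is a Python sketch using the three vertices from the activity; note that the two test points below are made up for illustration, not the actual coordinates of the HMS British Consul wreck:

```python
def sign(p, a, b):
    """Cross-product z-component: which side of line a->b point p falls on."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def in_triangle(p, a, b, c):
    """True if p lies inside triangle abc (all three side tests agree)."""
    s1, s2, s3 = sign(p, a, b), sign(p, b, c), sign(p, c, a)
    has_neg = min(s1, s2, s3) < 0
    has_pos = max(s1, s2, s3) > 0
    return not (has_neg and has_pos)

# (latitude, longitude) vertices from the activity
miami    = (25.78, -80.2)
san_juan = (18.44, -66.13)
bermuda  = (32.3, -64.79)

# Hypothetical test points, NOT real wreck positions
inside_point  = (25.0, -70.0)   # roughly mid-triangle
outside_point = (40.0, -70.0)   # well north of Bermuda
```

Treating latitude/longitude as flat x-y coordinates is a small cheat at this scale, but for a quick inside-or-outside answer it matches what the plotted map shows.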