Remember Sara Baker, a “fake” patient created by MedSeek, a health IT company? See “Nobody Knows You’re a Fake Patient on the Internet!”

Identifying fake patients who post to health forums is pretty straightforward. I have been very successful doing it for years (read “Web 2.0 Pharma Marketing Tricks for Dummies,” for example).

Some time ago, PatientsLikeMe had to take action against a “fake patient,” which was actually a software bot trolling the site reading and analyzing posts. “This user was not a patient,” said Ben Heywood, co-founder of PatientsLikeMe, “but rather a computer program that scrapes (i.e. reads and stores) forum information” (see “Data Mining in the Deep, Dark Social Networks of Patients”).

But how easy is it to learn the identity of a REAL patient who posts anonymously to health forums such as PatientsLikeMe?

It may be easier than many patients think.

“Authorship Attributor” systems that use text analysis techniques to crawl health forums can automatically correlate messages written by the same author, which, according to the authors of a study recently published in the Journal of Medical Internet Research (JMIR), “makes an automated identification of the author of an online post possible” (see below and read the entire open access article here).

The implications of the research reported in JMIR are scarier than dealing with fake online patients. “Given that individuals may be reluctant to share personal health information on online forums, they may choose to post anonymously,” said the authors. “The ability to determine the identity of anonymous posts by analyzing the specific features of the text raises questions about health consumers using anonymous posts as a method to control what is known publicly about them.”

The authors summarize the practical implications of their research thus:

The main implication of our results is that they should caution users from posting sensitive information anonymously. Managers of online properties that encourage user input should also alert their users about the strength of anonymity. Our experiments show that a character-based method can be more effective than word-based methods in authorship attribution. These are novel results for forum analysis because the usual methods of text analysis are based on semantics and analyze the use of words, phrases, and other text segments. We propose that to improve security of forum members, the forum organizers pay more attention to the character-based characteristics of the posts.

Does this mean that posting anonymously is futile and that all consumers should just use their real identity? Moving forward, this is not necessarily the case. Future work can extend tools such as Authorship Attributor to (1) alert anonymous posters about the ease of determining their identity so they can then make a more informed decision about the content of their posts (eg, by informing consumers with many posts on the same topic that they will have a higher chance of being reidentified through their posts than those with fewer posts on many diverse topics), and (2) automatically modify the text to adjust its features to make it correlate less with other text from the same author and, hence, frustrating tools such as Authorship Attributor.
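
To give a feel for what a “character-based method” might look like in practice, here is a minimal Python sketch. It is not the JMIR authors' Authorship Attributor; it is a simplified illustration that assumes character trigrams and cosine similarity are a reasonable stand-in for the idea. The function names (`char_ngrams`, `cosine`, `attribute`), the user names, and the sample posts are all hypothetical.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams in lowercased, whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram count profiles (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def attribute(anonymous_post: str, known_posts: dict, n: int = 3):
    """Return the known author whose combined profile is closest to the post.

    known_posts maps an author name to a list of that author's posts.
    This is an illustrative assumption, not the method used in the study.
    """
    target = char_ngrams(anonymous_post, n)
    scores = {}
    for author, posts in known_posts.items():
        profile = Counter()
        for post in posts:
            profile.update(char_ngrams(post, n))
        scores[author] = cosine(target, profile)
    return max(scores, key=scores.get), scores


# Hypothetical forum posts, invented for illustration only.
known = {
    "user_a": ["omg my meds r soooo bad lol!!! cant sleep!!!"],
    "user_b": ["I have been taking the medication for three weeks; "
               "the side effects are tolerable."],
}
best, scores = attribute("omg cant sleep again lol!!! meds r soooo bad", known)
print(best, scores)
```

Habits such as repeated punctuation, abbreviations, and spelling quirks show up in character n-grams even when the topic changes, which is one plausible reason a character-based profile can link posts that word-based analysis might miss. A real system would need far more text per author and a proper evaluation before any conclusion could be drawn.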