SRI taps AI to hunt for linguistic DNA that proves authorship

With bad information running rampant, the ability to assign authorship is essential. SRI is developing the tools to make it happen.

In an age where unreliable or deceptive information threatens public health, economic well-being, and, perhaps, even democracy itself, verifying who wrote a given piece of content has become more valuable than ever.

Against that backdrop, a team of experts in linguistics and artificial intelligence led by SRI International is developing new ways of examining written text for the “linguistic DNA” that can positively identify its author. The program is known as Signature.

Every writer has certain tendencies—ways of phrasing things, common grammatical or spelling mistakes, capitalization choices, and even ways of laying out their arguments. “These patterns are linguistic DNA—unmistakable and unchangeable,” said Dayne Freitag, the principal investigator for Signature and technical director of SRI’s Artificial Intelligence Center. “Aided by AI, we can spot these patterns to determine who the author of a given text is with near-absolute accuracy.”

Signature works in both directions, Freitag says. It can say affirmatively that the named author of a piece did indeed write the piece. And it can take an anonymous piece and identify who wrote it.
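The two directions Freitag describes correspond to what is commonly called authorship verification (did the named author write this?) and authorship attribution (which candidate wrote this?). The following is a minimal sketch of that distinction only, not a description of how Signature works: it assumes a toy character-trigram profile and cosine similarity, and the names `profile`, `cosine`, `verify`, `attribute`, and the `threshold` value are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def profile(text: str, n: int = 3) -> Counter:
    """Count character n-grams in a text (a toy stand-in for a style profile)."""
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram count profiles."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def verify(known_writing: str, questioned: str, threshold: float = 0.5) -> bool:
    """Verification: is the questioned text similar enough to the named author's writing?

    The threshold is an arbitrary illustrative value, not a calibrated one.
    """
    return cosine(profile(known_writing), profile(questioned)) >= threshold


def attribute(candidates: dict, questioned: str) -> str:
    """Attribution: return the candidate whose known writing is most similar."""
    target = profile(questioned)
    return max(candidates, key=lambda name: cosine(profile(candidates[name]), target))


# Hypothetical candidate pool mapping author name to a sample of known writing.
candidates = {
    "author_a": "the quick brown fox jumps over the lazy dog",
    "author_b": "zzzz yyyy xxxx wwww",
}
best_match = attribute(candidates, "the quick brown fox")
```

A real system would need far richer features, calibrated thresholds, and much more text per author; the sketch only shows that verification is a yes-or-no comparison against one author, while attribution is a ranking over many.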

Next-generation tools

Signature’s ability to examine and evaluate patterns in an anonymous author’s written style, punctuation, phrasing, average word length and vocabulary size is part of a field of linguistic research known as stylometrics, Freitag says. But stylometrics can only go so far in a world of 140-character tweets.
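To make the stylometric features mentioned above concrete, here is a minimal sketch that computes a few of them (average word length, vocabulary size, punctuation rate) for a piece of text. It is an illustration under simple assumptions, not Signature's feature extractor: the function name `stylometric_features`, the tokenization rule, and the choice of features are all hypothetical.

```python
import re
import string


def stylometric_features(text: str) -> dict:
    """Compute a few simple surface-level style features for a text."""
    # Assumption: a "word" is a run of letters or apostrophes, case-folded.
    words = re.findall(r"[A-Za-z']+", text.lower())
    n_words = len(words)
    n_punct = sum(1 for ch in text if ch in string.punctuation)
    vocabulary = set(words)
    return {
        "word_count": n_words,
        "avg_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "vocabulary_size": len(vocabulary),
        "type_token_ratio": len(vocabulary) / n_words if n_words else 0.0,
        "punctuation_per_word": n_punct / n_words if n_words else 0.0,
    }


sample = "The cat sat. The cat ran!"
features = stylometric_features(sample)
```

With only six words, the sample yields very coarse numbers, which illustrates why features like these become less informative on very short texts.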

Freitag and his team aim to push Signature beyond stylometrics. They recently received a multi-year contract under the Intelligence Advanced Research Projects Activity (IARPA) HIATUS program to make that a reality.

Freitag will lead a team that includes SRI’s Aaron Lawson, Prof. Chris Reed of the University of Dundee, Profs. Alan Ritter and Wei Xu at the Georgia Institute of Technology, Prof. Yulia Tsvetkov at the University of Washington, linguist Natalie Schilling at Georgetown University, Dr. Adam Bradley of Uncharted Software, and attribution expert Prof. Benno Stein, founder of the PAN conference.

The Signature team will capitalize on a new approach that probes an author’s argumentative habits in addition to stylistic features. Freitag calls this area of research discourse analysis, and the work is being done in collaboration with colleagues at the University of Dundee, a subcontractor on the IARPA contract.

Patterns revealed by discourse analysis are more habitual and less conscious than stylistic patterns and are, therefore, more reliable than stylometrics in attributing authorship. They are also harder to consciously manipulate or mask without harming the text’s meaning.

“To our knowledge, Signature is the first time these high-level discourse features have been used in author identification, and we have barely scratched the surface of what is possible,” Freitag said. In that respect, the team at the University of Dundee has already identified at least 700 rhetorical patterns, some dating to the time of Aristotle, that may benefit discourse analysis. While not all those patterns are computable, Freitag believes other as-yet-unexplored patterns could expand Signature’s feature set into the many thousands.

Decisive edge

As with many new avenues of AI, accuracy generally improves as more data becomes available, but Freitag and his team are finding ways to make accurate predictions from shorter segments of text, and with less known writing to compare them against. In a world where the pool of potential authors numbers in the millions, these tools could prove decisive.

Such technologies could be a boon in law enforcement, in matters of copyright infringement and plagiarism, and in privacy protection. On the privacy front, Freitag thinks that a project like Signature could lead to applications that identify an author’s tendencies and suggest changes to mask their identity, all while maintaining the original intent of the piece. On a more topical front, Signature might even be used to identify content “written” by generative AI apps, like ChatGPT, to help spot falsified essays in academic settings and on college entrance applications.

“In the age of social media and instantaneous, anonymous global communication, author attribution is more critical than ever,” Freitag says. “Potential applications for Signature are wide open.”
