Experiments Using ChatGPT + Wikipedia For Media Cataloging

ChatGPT offers a general-purpose summarization and Q&A workflow that can be applied to any textual input, with particular potential for the long, flowery, technically detailed text found in encyclopedias like Wikipedia. Recently we explored the idea of using ChatGPT to answer basic questions about the news outlets we monitor across the world, to supplement and/or correct the codified data in Wikipedia's infoboxes. For example, the infoboxes for many media outlets contain outdated or simply wrong information about the outlets. Sometimes the entry itself is incorrect, but other times the correct information is present in the textual portion of the entry, raising the question of whether we could use ChatGPT to compare the infobox contents against the article text to identify inconsistencies and conflicts. Our early experiments suggest that in the majority of these cases the English textual entry is either equally wrong or contains enough conflicting information that ChatGPT is unable to identify the conflict, but that when applied to the native-language edition of Wikipedia it is often able to flag inconsistencies between the native text and the infobox.

Take for example the English entry for Habertürk. The infobox states that the newspaper ceased publication on July 5, 2018. It lists a URL for the outlet, but so do the entries for the majority of ceased publications we reviewed, where the URL often leads to an archived version of the site or to the site's new owner (the infobox is often not updated even after the domain has been purchased by another company and used for a new purpose). The English textual description similarly states "It ceased publication on 5 July, 2018" and uses the past tense to refer to the outlet in the opening description, though the present tense is used later in the entry. This mix of tenses is common in the entries of ceased publications, where the intro text is updated but later details are not rewritten after the publication ceases. Worse, all three external citations in the article are dated prior to the date the publication allegedly ceased operating, lending further credence to the conclusion that it is no longer publishing. After all, if it had been purchased and restarted, one would expect more recent citations covering the purchase and restart.

When we feed the English-language article into ChatGPT and ask it "Is the the Habertürk newspaper still active based on the following description", it answers each time: "No, based on the description provided, Habertürk newspaper ceased publication on July 5, 2018, and it is no longer active."

On the other hand, if we look at the Turkish-language edition of Wikipedia for Habertürk, we see that the infobox similarly lists the publication as ceased on July 5, 2018. Just like the English edition, all of the references in the bibliography section have dates prior to its alleged cessation of publication, apart from a single 2023 capture that leads to a page with no internal date metadata, suggesting it is merely a recent snapshot of an archived site. However, because this is a Turkish newspaper, the Turkish edition of Wikipedia has a very different opening summary (translated into English): "Habertürk was a daily newspaper that started its publication life on 1 March 2009. The last issue came out on July 5, 2018. The newspaper continues its publication life on the internet at haberturk.com. Yavuz Barlas is the editor-in-chief." and "Ciner Yayın Holding announced that the newspaper will bid farewell to its printed publication life with its last issue to be published on July 5, 2018, and will continue its digital publication as haberturk.com from now on."

Indeed, local media experts confirmed these details.

Given that ChatGPT is able to natively process Turkish-language text, we repeated the above experiment, using the prompt "Is the the Habertürk newspaper still active based on the following description" and handing it the Turkish-language introduction text from the Turkish entry. This time ChatGPT answered in English: "According to the description provided, Habertürk newspaper ceased its printed publication on July 5, 2018. However, it continues to operate on its digital platform at haberturk.com, and Yavuz Barlas is still the editor-in-chief."
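The two runs above, one per language edition, can be sketched in a few lines of Python. This is only an illustration of the workflow, not the tooling we used: `ask_llm` is a hypothetical stand-in for whatever function sends a prompt to a chat model and returns its reply, and the colon and blank line joining the question to the article text are an assumption about how the two were combined.

```python
from typing import Callable, Dict

# Question wording follows the experiment, including its doubled "the".
QUESTION = "Is the the {outlet} newspaper still active based on the following description"


def build_prompt(outlet: str, description: str) -> str:
    """Join the fixed question with the article text to be judged.

    The ":" and blank-line separator are an assumption, not part of the
    original prompt.
    """
    return QUESTION.format(outlet=outlet) + ":\n\n" + description.strip()


def ask_per_edition(
    outlet: str,
    editions: Dict[str, str],
    ask_llm: Callable[[str], str],
) -> Dict[str, str]:
    """Ask the same question of each language edition's text.

    `editions` maps a language code (e.g. "en", "tr") to article text.
    `ask_llm` is a hypothetical callable wrapping a chat model.
    """
    return {
        lang: ask_llm(build_prompt(outlet, text))
        for lang, text in editions.items()
    }
```

Keeping the model call behind a plain callable means the same comparison can be run against any model, or against a stub when testing. Disagreement between the answers for two editions is the signal worth sending to a human reviewer.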

Putting this all together, both the English and Turkish Wikipedia infoboxes codify the publication as having ceased operations five years ago, and both have citation lists that contain only coverage prior to the alleged cessation of publication. Using only the codified and easily machine-processable metadata in the two entries, we would conclude that this is a historical publication that is no longer in operation and that any content we might gather from that domain is likely no longer by that news outlet (on the assumption that the domain has been purchased and repurposed by a new entity). Thus, traditional metadata-based analysis will incorrectly catalog this as a defunct publication. Applying a textual Q&A tool like ChatGPT to the English-language Wikipedia entry similarly reports that the publication has ceased. Only by applying a multilingual LLM to the Turkish edition of Wikipedia do we finally conclude that the publication is alive and well and that the cessation date refers to the date it ceased its print edition, while it continues on as a digital-only publication.
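To make the metadata-only reasoning concrete, here is a minimal sketch of the kind of rule it amounts to. The `EntryMetadata` type and `looks_defunct` function are hypothetical names introduced for illustration, and the rule is a simplification of the reasoning described above rather than an actual cataloging pipeline. It assumes archive capture dates have already been excluded from the citation dates.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class EntryMetadata:
    """Codified fields pulled from one Wikipedia entry (hypothetical shape)."""
    ceased: Optional[date]  # infobox "ceased publication" date, if any
    citation_dates: List[date] = field(default_factory=list)


def looks_defunct(meta: EntryMetadata) -> bool:
    """Metadata-only rule: a cessation date with no later citations.

    Returns False when the infobox has no cessation date. With a cessation
    date and no dated citations at all, the rule still returns True, since
    nothing contradicts the infobox.
    """
    if meta.ceased is None:
        return False
    return all(cited < meta.ceased for cited in meta.citation_dates)
```

For an entry like Habertürk's, with a 2018 cessation date and only earlier citations, this rule returns True in both language editions. That is exactly the failure described above: the rule cannot tell the end of a print edition from the end of the outlet.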

In the end, this reminds us of A) the significant limitations of codified metadata, even on heavily edited sites like Wikipedia with legions of contributors, B) the importance of looking at local-language sources rather than relying only on English, and C) the power of LLMs to reconcile and correct codified metadata.