Lee's Stego Research Notes: Arabic Diacritics Based Steganography

Mohammed A. Aabed, Sameh M. Awaideh, Abdul-Rahman M. Elshafei and Adnan A. Gutub, 'Arabic Diacritics Based Steganography,' 2007 IEEE International Conference on Signal Processing and Communications (ICSPC 2007), 24-27 November 2007, Dubai, United Arab Emirates.

Abstract

New steganography methods are being proposed to embed secret information into text cover media in order to search for new possibilities employing languages other than English. This paper utilizes the advantages of diacritics in Arabic to implement text steganography. Diacritics - or Harakat - in Arabic are used to represent vowel sounds and can be found in many formal and religious documents. The proposed approach uses eight different diacritical symbols in Arabic to hide binary bits in the original cover media. The embedded data are then extracted by reading the diacritics from the document and translating them back to binary.

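Diacritics are phonetic marks: in other words, pronunciation symbols that appear next to letters to indicate different sounds. In Table 1 the authors list the eight main Arabic diacritics. Arabic writers use these marks to alter the pronunciation of a phoneme or to distinguish between words of similar spelling. Because the marks are optional in written text, the authors propose using them to embed a secret message in an Arabic text.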


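According to the authors' statistical analysis, Fatha is the most frequent of the eight diacritics, occurring almost as often as the other seven combined. A secret bit 1 is therefore assigned to positions where a Fatha appears, and a secret bit 0 to positions where any of the other seven diacritics appears.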

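The embedding process proposed in the paper is as follows:

Choose a fully diacritized Arabic text (one in which every letter carries its diacritic) as the cover text, then scan it from the beginning.

If the secret bit to embed is 1 and the diacritic on the letter encountered in the cover text is a Fatha, leave the cover text unchanged. If the diacritic is one of the other seven instead, delete it and move on to the next letter in the cover text; if that one also carries one of the other seven diacritics, delete it as well, and so on until a letter carrying a Fatha is reached.

If the secret bit to embed is 0 and the diacritic on the letter encountered is not a Fatha but one of the other seven, leave the cover text unchanged. If the diacritic is a Fatha, delete it and move on to the next letter in the cover text; if that one is also a Fatha, delete it as well, and so on until one of the other seven diacritics is reached.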
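The embedding steps above can be sketched in a few lines of Python. This is only my reading of the procedure, not the authors' code. Assumptions: the eight diacritics of Table 1 are taken to be the Unicode code points U+064B to U+0652, the secret message is passed as a string of '0' and '1' characters, the function name `embed` is made up for illustration, and any diacritics left over after the last bit are dropped so that the stego text carries exactly the message.

```python
FATHA = "\u064E"
# Assumption: the eight diacritics are the Unicode marks U+064B..U+0652
# (Fathatan, Dammatan, Kasratan, Fatha, Damma, Kasra, Shadda, Sukun).
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}


def embed(cover: str, bits: str) -> str:
    """Hide a '0'/'1' string in a fully diacritized cover text.

    A kept Fatha encodes 1, a kept non-Fatha diacritic encodes 0;
    diacritics that do not match the current bit are deleted.
    """
    out = []
    i = 0  # index of the next secret bit to embed
    for ch in cover:
        if ch not in DIACRITICS:
            out.append(ch)  # base letters, spaces, punctuation: untouched
            continue
        if i >= len(bits):
            continue  # assumption: drop leftover diacritics after the message
        want_fatha = bits[i] == "1"
        if (ch == FATHA) == want_fatha:
            out.append(ch)  # diacritic matches the bit: keep it, next bit
            i += 1
        # otherwise the diacritic is deleted and the same bit is retried
    if i < len(bits):
        raise ValueError("cover text has too few usable diacritics")
    return "".join(out)
```

Calling `embed(cover, "10")` walks the cover once, so the cost is linear in the length of the cover text. Whether the leftover diacritics should be dropped or kept is not something the notes above settle; dropping them is simply the choice that lets the receiver read the message without knowing its length.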

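Extraction process

To recover the secret message, scan the text from the beginning for letters that carry a diacritic. Identifying which diacritic it is tells you which secret bit is hidden at that position.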
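Extraction is short enough to show as a sketch as well. The same assumptions apply as for any code in these notes: the eight diacritics are taken to be Unicode U+064B to U+0652, the function name `extract` is hypothetical, and every diacritic remaining in the stego text is assumed to carry one bit.

```python
FATHA = "\u064E"
# Assumption: the eight diacritics are the Unicode marks U+064B..U+0652.
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}


def extract(stego: str) -> str:
    """Read the hidden bits back: Fatha -> '1', any other diacritic -> '0'."""
    return "".join(
        "1" if ch == FATHA else "0"
        for ch in stego
        if ch in DIACRITICS  # letters without diacritics carry no bit
    )
```

Note that the receiver needs no key and no copy of the original cover text; anyone who knows the mapping can read the bits.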

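Embedding capacity

The embedding capacity of this technique is very easy to estimate: count how many diacritics remain in the final text, and that is the length of the embedded secret message. In theory, if the diacritics occur at random and the secret 0/1 bits are also random, 50% of the diacritics would be retained. The authors, however, report an average embedding capacity of about 26.16% (= 50% * 3.27% / 6.25%). My own guess is that some phonetic property of Arabic makes the order in which diacritics occur non-random, which is why the capacity falls short of the theoretical value.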
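The 50% figure can be checked with a small model. This is my own back-of-the-envelope sketch, not something taken from the paper: it assumes each diacritic is independently a Fatha with probability `p_fatha` and each secret bit is independently 1 with probability `p_one`. The function name and both parameters are hypothetical.

```python
def retained_fraction(p_fatha: float, p_one: float) -> float:
    """Expected share of cover diacritics that survive embedding.

    Illustrative model only: diacritics and secret bits are assumed
    to be independent random draws.
    """
    # A 1 bit consumes diacritics until a Fatha shows up:
    # 1 / p_fatha diacritics on average, of which exactly one is kept.
    # A 0 bit consumes diacritics until a non-Fatha shows up:
    # 1 / (1 - p_fatha) on average, again with exactly one kept.
    consumed_per_bit = p_one / p_fatha + (1 - p_one) / (1 - p_fatha)
    return 1 / consumed_per_bit


# Fatha about as frequent as the other seven combined, random message bits:
print(retained_fraction(0.5, 0.5))
# The arithmetic behind the reported 26.16% figure:
print(0.5 * 3.27 / 6.25)
```

The independence assumption in this model is exactly the one I suspect does not hold for real Arabic text, so the sketch only reproduces the theoretical value and says nothing about why the measured capacity is lower.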

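The most interesting point in this paper is that the authors themselves note that the proposed hiding technique may attract attention: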