Making iText work with Indic scripts

Why does iText not work properly for Indic scripts?

There are a number of threads floating around asking why iText does not render Indian languages properly. The reason is that iText does not handle ligature substitution.

What is Ligature Substitution?

In Indian languages like Bangla and Hindi, two or more characters sometimes merge to form a single Glyph.

Bangla Example:

ক + ্ + ষ = ক্ষ

ক + ্ + ষ + ্ + ম = ক্ষ্ম

ল + ্ + ল = ল্ল

Hindi Example:

क + ् + ष = क्ष

क + ् + ष + ् + म = क्ष्म

ल + ् + ल = ल्ल

This essentially means that whenever we encounter one of these character sequences, we need to replace it with a single glyph.

If you see only boxes above, upgrade your browser to one that can handle Unicode.
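To make the point concrete, here is a minimal sketch that lists the code points behind one of the Bangla conjuncts above. The class and method names are made up for illustration; only standard Java is used.

```java
import java.util.ArrayList;
import java.util.List;

final class ConjunctCodePoints {

    /** Lists the Unicode code points of the given text as "U+XXXX" strings. */
    static List<String> describe(String text) {
        List<String> result = new ArrayList<>();
        text.codePoints().forEach(cp -> result.add(String.format("U+%04X", cp)));
        return result;
    }

    public static void main(String[] args) {
        // Bangla ka + hasanta + ssa, written with escapes so the example
        // does not depend on the encoding of the source file.
        String kssa = "\u0995\u09CD\u09B7";
        // Three stored characters, although a reader sees one conjunct.
        System.out.println(describe(kssa));
    }
}
```

The string is stored as three characters, so a renderer that maps characters to glyphs one by one draws three shapes instead of the single conjunct.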

Where do we get the information about which Glyphs are to be substituted?

This information is available in the OpenTypeFont file (note that OpenTypeFonts can have the extension .ttf, which is also used for TrueTypeFonts). The OpenTypeFont has a table called the GlyphSubstitutionTable (GSUB). It's pretty cryptic and obfuscated, and parsing it is basically a wild goose chase. But once you have done that, you get a list of the glyph sequences that should be replaced by a single glyph. The specification can be found here: http://www.microsoft.com/typography/otspec/gsub.htm
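The first step of that chase is finding where the GSUB table starts. The sketch below walks the table directory at the beginning of a font file: a 12-byte header followed by 16-byte records holding a tag, a checksum, an offset and a length. The class and method names are hypothetical, and this is not how iText itself is written; it only illustrates the file layout.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class TableDirectory {

    /**
     * Returns the byte offset of the table named by tag (for example "GSUB"),
     * or -1 if the font has no such table.
     */
    static long findTable(byte[] font, String tag) {
        ByteBuffer buf = ByteBuffer.wrap(font); // big-endian by default, like the font data
        int numTables = buf.getShort(4) & 0xFFFF; // follows the 4-byte sfnt version
        for (int i = 0; i < numTables; i++) {
            int record = 12 + 16 * i; // 12-byte header, then 16-byte records
            String recordTag = new String(font, record, 4, StandardCharsets.US_ASCII);
            if (recordTag.equals(tag)) {
                // Skip the 4-byte tag and 4-byte checksum to reach the offset.
                return buf.getInt(record + 8) & 0xFFFFFFFFL;
            }
        }
        return -1;
    }
}
```

With the bytes of a real font loaded (for instance via Files.readAllBytes), findTable(bytes, "GSUB") gives the position from which the script, feature and lookup lists of the table can be read. A font that returns -1 carries no substitution data at all.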

Inner workings of iText

The best part about iText is that it's open source. The SVN repository is at: svn://svn.code.sf.net/p/itext/code/trunk

At the heart of converting text to PDF is the TrueTypeFont class. This parses the actual FontFile and reads various information like the Character to Glyph mappings (cmap), the Glyph metrics, etc. Then, we have the convertToBytes() method in the FontDetails class, which actually converts each character into the Glyph code and writes it to PDF.
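The character-by-character conversion described here can be sketched roughly as follows. This is not the iText source: the class name, the plain Map standing in for the parsed cmap, and the choice of two big-endian bytes per glyph ID are all assumptions made for illustration.

```java
import java.util.Map;

final class GlyphEncoder {

    /** Encodes the glyph ID of each character as two big-endian bytes. */
    static byte[] toGlyphBytes(String text, Map<Integer, Integer> cmap) {
        int[] codePoints = text.codePoints().toArray();
        byte[] out = new byte[codePoints.length * 2];
        for (int i = 0; i < codePoints.length; i++) {
            // Characters missing from the cmap fall back to glyph 0 (.notdef).
            int glyphId = cmap.getOrDefault(codePoints[i], 0);
            out[2 * i] = (byte) (glyphId >> 8);
            out[2 * i + 1] = (byte) glyphId;
        }
        return out;
    }
}
```

Note that every character is looked up in isolation. Nothing in this loop can notice that three neighbouring characters should have become one glyph, which is exactly the gap the fix has to close.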

Integration of the GlyphSubstitutionTable data with iText

  1. The GlyphSubstitutionTableReader class parses the FontFile and gleans the Glyph substitution information, and returns a Map<String, Glyph>, where the key is the String of composite characters and value is the Glyph object.
  2. Then, in the FontDetails::convertToBytes() method, tokenise the input String based on the composite characters.
  3. Replace the composite characters by their respective Glyphs.
  4. For characters that do not need substitution, proceed normally and replace them with their corresponding Glyph.
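The four steps above can be sketched as a longest-match scan over the input. This is a simplified illustration and not the code in the patch: the class name is hypothetical, glyphs are represented by plain Integer IDs instead of Glyph objects, and the text is assumed to contain only characters from the Basic Multilingual Plane.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class LigatureSubstituter {

    /**
     * Converts text to glyph IDs, preferring the longest composite sequence
     * found in the ligature map at each position.
     */
    static List<Integer> toGlyphIds(String text,
                                    Map<String, Integer> ligatures,
                                    Map<Character, Integer> cmap) {
        int maxLen = 1;
        for (String key : ligatures.keySet()) {
            maxLen = Math.max(maxLen, key.length());
        }
        List<Integer> glyphs = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int matched = 0;
            // Try the longest candidate first, so that a five-character conjunct
            // wins over the three-character conjunct it starts with.
            for (int len = Math.min(maxLen, text.length() - pos); len > 1; len--) {
                Integer ligature = ligatures.get(text.substring(pos, pos + len));
                if (ligature != null) {
                    glyphs.add(ligature);
                    matched = len;
                    break;
                }
            }
            if (matched == 0) {
                // No substitution applies: fall back to the plain cmap lookup.
                glyphs.add(cmap.getOrDefault(text.charAt(pos), 0));
                matched = 1;
            }
            pos += matched;
        }
        return glyphs;
    }
}
```

Trying the longest sequence first matters for the examples earlier in the post: the five-character conjunct begins with the same three characters as the shorter one, and a shortest-first scan would never reach it.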

Test Harness

The following is the test harness for testing out my fix.

Before Fix

[Image: BeforeFix]

After Fix

[Image: AfterFix]

 

Source

The changes are done on itextpdf-5.4.0-SNAPSHOT, revision 5638.

Next Steps

As you may notice, the i-kar, e-kar and o-kar are still not displayed in their proper positions. I am convinced that this is because we need to read the positioning data from GPOS, the Glyph Positioning Table. That is my next task. Stay tuned!

Update: Why is the latest iText still not working?

My code is commented out in the latest iText, as it seems to interfere with some of their core functionality.

How do I make it work?

Download the iText source from sourceforge:

http://sourceforge.net/p/itext/code/HEAD/tree/trunk/itext/

After getting the source, just uncomment the below line in the TrueTypeFontUnicode.java:

 

Building it with Maven should be pretty straightforward. Cheers!

 

16 thoughts on “Making iText work with Indic scripts”

    1. My fix should work with Hindi as well. But as far as I remember, the iText team has commented the code out. So you need to dig into their code by comparing it with my patch, and see if that works. Again, as far as I remember, my fix should work on Hindi as well.

  1. Looks like a great solution, but it did not work for me. The PDF is not generated and I am getting:

    The lookup type 6 is not yet handled
    The lookup type 6 is not yet handled
    The lookup type 6 is not yet handled
    The lookup type 6 is not yet handled
    The lookup type 6 is not yet handled
    Exception in thread "main" java.lang.IllegalArgumentException: No corresponding character or simple glyphs found for GlyphID=0
    at com.itextpdf.text.pdf.GlyphSubstitutionTableReader.getTextFromGlyph(GlyphSubstitutionTableReader.java:77)
    at com.itextpdf.text.pdf.GlyphSubstitutionTableReader.getGlyphSubstitutionMap(GlyphSubstitutionTableReader.java:54)
    at com.itextpdf.text.pdf.TrueTypeFont.process(TrueTypeFont.java:687)
    at com.itextpdf.text.pdf.TrueTypeFontUnicode.<init>(TrueTypeFontUnicode.java:87)
    at com.itextpdf.text.pdf.BaseFont.createFont(BaseFont.java:697)
    at com.itextpdf.text.pdf.BaseFont.createFont(BaseFont.java:615)
    at com.itextpdf.text.pdf.BaseFont.createFont(BaseFont.java:450)

    1. Dima, I would suggest that you take the latest code from iText’s svn repository. As I mentioned in my post, this is still a half-baked solution. I only made it work for a subset of the fonts out there. If you try it with the Lohit or SolaimanLipi fonts for Bengali, it should work.
      Thanks,
      Palash.

  2. Hey, thanks Palash. But it seems e-kar and i-kar are not working.
    You have also written code for these.
    Can it be used readily to resolve them?

  3. Palash, I am working on a Malayalam font and this solution is working for me, but it is not working for consonant diacritics. How do I fix that? Do you have any idea regarding this?

  4. Palash, I am working on a Malayalam font and this solution is working for me. But it is not working for consonant diacritics.
    Have you worked on consonant diacritics?

    1. Ruchika,

      I am not familiar with Malayalam, but for Hindi and Bengali, I had made it work with diacritics. But it is a bit tricky. I don't remember exactly how I did it, but look at *IndicGlyphRepositioner* and *BanglaGlyphRepositioner*. I believe these were written to handle the same issue. I think these are in the *com.itextpdf.text.pdf.languages* package.

      Thanks,
      Palash.

  5. I am working with the Hindi Mangal font. When I generate a PDF, I get this error:

    com.itextpdf.text.pdf.fonts.otf.FontReadingException: No corresponding character or simple glyphs found for GlyphID=549
    at com.itextpdf.text.pdf.fonts.otf.GlyphSubstitutionTableReader.getTextFromGlyph(GlyphSubstitutionTableReader.java:117)
    at com.itextpdf.text.pdf.fonts.otf.GlyphSubstitutionTableReader.getGlyphSubstitutionMap(GlyphSubstitutionTableReader.java:95)
    at com.itextpdf.text.pdf.TrueTypeFontUnicode.readGsubTable(TrueTypeFontUnicode.java:537)
    at com.itextpdf.text.pdf.TrueTypeFontUnicode.process(TrueTypeFontUnicode.java:99)
    at com.itextpdf.text.pdf.TrueTypeFontUnicode.<init>(TrueTypeFontUnicode.java:73)
    at com.itextpdf.text.pdf.BaseFont.createFont(BaseFont.java:704)
    at com.itextpdf.text.pdf.BaseFont.createFont(BaseFont.java:622)
    at com.itextpdf.text.pdf.BaseFont.createFont(BaseFont.java:565)
    Do I need to change the fonts?
