Unicode and ASCII: Busy Day again
By maurizio | August 30, 2007
Sorry, no post yesterday and no post today (apart from this one).
I am really busy looking for a job & improving my Java Crawler.
You’d never imagine how difficult it is to read thousands of different websites and extract precise information from them. These days I’m fighting with page encodings.
Many of the web pages you see are encoded in UTF-8, which is one of the encodings of Unicode. Unicode is an international standard that defines how to “write” characters. Do you remember ASCII? Unicode is something like that but more powerful, which means it can represent many more characters.
What do I mean by “write” and “represent”? Easy. Imagine that the letter “a” has to be saved to a file. When you open the file you see an “a”, but inside the file it is stored as bytes (you know, the 0s and 1s), because everything in the computer world is made of bits (the atoms of information in the digital world). So your “a” becomes a sequence of 0s and 1s that can be read as the number 97. This mapping was defined a long time ago, and the whole set of definitions is called ASCII. As you may know, 97 fits in 1 byte (a group of 8 bits, so eight 0s or 1s). ASCII actually uses only 7 of those bits, so it defines just 128 characters.
The problem is that even the 256 values of a full byte (0 to 255) are not enough to define all the letters of all existing alphabets. There is not enough room even for all the European languages; imagine having to fit Cyrillic or Kanji characters too. That is why Unicode was created: it gives every character its own number, and encodings such as UTF-8 store that number in 1 to 4 bytes. With Unicode you can represent Chinese or Japanese characters too.
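To make the byte talk concrete, here is a small Java sketch that prints the bytes behind a few characters. The class name EncodingDemo and the helper bytesOf are my own inventions for this illustration; only the standard library is used.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    // Hypothetical helper: encodes the text with the given charset and
    // returns its bytes as unsigned decimal numbers separated by spaces.
    static String bytesOf(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(b & 0xFF); // Java bytes are signed, so mask to 0..255
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "a" is the same single byte in ASCII and in UTF-8
        System.out.println(bytesOf("a", StandardCharsets.US_ASCII)); // 97
        System.out.println(bytesOf("a", StandardCharsets.UTF_8));    // 97
        // U+00E9 (e with acute accent) needs two bytes in UTF-8
        System.out.println(bytesOf("\u00e9", StandardCharsets.UTF_8)); // 195 169
        // U+65E5 (a Kanji character) needs three bytes in UTF-8
        System.out.println(bytesOf("\u65e5", StandardCharsets.UTF_8)); // 230 151 165
    }
}
```

Plain ASCII text is already valid UTF-8, byte for byte, which is part of why UTF-8 is so convenient on the web.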
Unicode looks like the perfect solution for web pages, because if every page and browser used it, everyone could see what you write. The problem is that not everyone uses it. I’ve found a lot of pages using specific Chinese encodings, others using Korean ones, and so on.
The problem for my application is that I have to extract that information, but before I can do so I need to know the character set used by each page. Luckily, HTML helps me with this tag:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
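But the tag sits inside the page itself, so I only discover the charset after I have already started reading the page. That means I have to parse the page twice if the charset is not UTF-8.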
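One possible way to keep the second pass cheap is to keep the raw bytes of the page, peek at the beginning as ISO-8859-1 (which maps every byte to a character, so it never fails), look for the charset declaration, and only then decode the whole page. This is just a sketch under my own assumptions, not the code of my crawler: the class CharsetSniffer, its methods, the 4096-byte window and the UTF-8 fallback are all made up for this example, and it ignores the HTTP Content-Type header, which can also declare a charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetSniffer {
    // Finds charset=NAME inside a <meta ...> tag; quotes around NAME are optional.
    private static final Pattern CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([\\w.:-]+)",
            Pattern.CASE_INSENSITIVE);

    // Looks for a declared charset in the first bytes of the page.
    // Falls back to UTF-8 (an assumption) if nothing usable is found.
    static Charset sniff(byte[] page) {
        int n = Math.min(page.length, 4096); // arbitrary window for the <head>
        String head = new String(page, 0, n, StandardCharsets.ISO_8859_1);
        Matcher m = CHARSET.matcher(head);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Unknown or malformed charset name: use the fallback below.
            }
        }
        return StandardCharsets.UTF_8;
    }

    // Decodes the full page once, using the sniffed charset.
    static String decode(byte[] page) {
        return new String(page, sniff(page));
    }

    public static void main(String[] args) {
        String html = "<html><head><meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=ISO-8859-1\" /></head>"
                + "<body>caf\u00e9</body></html>";
        byte[] page = html.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(sniff(page));
        System.out.println(decode(page).equals(html));
    }
}
```

The page is still looked at twice, but the first look is only a quick scan of the first few kilobytes instead of a full parse.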
Topics: Internet, Programming
August 30th, 2007 at 8:49 am
When are you planning to show us some results?