Gathering information from blogs
By maurizio | July 2, 2007
Lately I have been busy building a Java application that collects data from websites. I'm visiting sites around the net to build a database of sites, mainly blogs. The goal is to compute some statistics about blogs, or whatever else comes to mind.
Right now I have around 160,000 sites in my database. I have been thinking about what kind of information I can fetch from them. The first thing that came to mind was to read a specific HTML tag from each page.
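To give an idea of what reading a specific HTML tag can look like, here is a minimal Java sketch that pulls the title out of a page that has already been downloaded into a string. The title tag is only an example picked for illustration, not what the application actually collects, and the names TagReader and firstTitle are hypothetical. A regular expression is enough for a simple tag like this one; a real HTML parser would cope better with messy pages.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagReader {
    // Matches <title>...</title> in any letter case, even when the text spans lines.
    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the trimmed text of the first title tag, or empty if there is none.
    static Optional<String> firstTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    public static void main(String[] args) {
        String page = "<html><head><TITLE> My Blog </TITLE></head><body></body></html>";
        System.out.println(firstTitle(page).orElse("(no title)"));
    }
}
```

The same approach should work for any tag that sits in the page itself, so no links have to be followed.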
I don’t want to tell you yet what I’m collecting, because I would like you to suggest what I should look for. Note that:
- I’m not collecting email addresses.
- I don’t follow links, and I’m not planning to follow links on every site I visit. I don’t have the horsepower to do that on 150,000 sites.
- I am looking for something simple that I can implement in Java, so please don’t suggest anything too difficult to implement.
Feel free to send me an email or just comment on this post.
Topics: Programming | 3 Comments »

July 2nd, 2007 at 7:18 am
I don’t know any Java coding to save my life, so I’m not sure what you can and can’t do with it.
But maybe the number of subscribers? Then you could list the average number of subscribers across your database.
July 2nd, 2007 at 7:34 am
You don’t need to know Java to make suggestions. Knowing HTML helps a lot, though.
For what you are suggesting, I have to read the page and find something in it that tells me the number of subscribers.
For example, I can look for the FeedBurner image in the HTML. If I find it, I then have to read the number out of the image. That is the hardest part, even though I know that some spammer tools are able to read numbers from GIFs.
Or maybe there is another way to know how many subscribers a site has.
Good idea! A bit tricky but maybe there is an easy solution for it!
Remember: a computer has to do it, so it must be something a program can actually do.
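The first half of the FeedBurner idea in the comment above, finding the image in the page, can be sketched in Java as follows. This is only a sketch under assumptions: it supposes that the subscriber-count image is served from a URL containing feeds.feedburner.com/~fc/, and the names FeedCountFinder and findCounterImage are hypothetical. It does not try to read the number out of the GIF, which remains the hard part.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FeedCountFinder {
    // Assumption: the counter image URL contains "feeds.feedburner.com/~fc/".
    private static final Pattern FEED_COUNT_IMG = Pattern.compile(
            "<img[^>]+src\\s*=\\s*[\"']([^\"']*feeds\\.feedburner\\.com/~fc/[^\"']*)[\"']",
            Pattern.CASE_INSENSITIVE);

    // Returns the src URL of the first matching image, or empty if the page has none.
    static Optional<String> findCounterImage(String html) {
        Matcher m = FEED_COUNT_IMG.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String page = "<p><img src=\"http://feeds.feedburner.com/~fc/example?bg=99CCFF\" alt=\"\" /></p>";
        System.out.println(findCounterImage(page).orElse("(no counter image found)"));
    }
}
```

Knowing only whether a page shows such an image is already a usable statistic, for instance the share of blogs in the database that display a subscriber counter.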
July 4th, 2007 at 6:42 pm
[...] at Nafurai has been working on a Java application to scrape blogs to discover what blogging software is being [...]