Title: java HTML Parser

Author: 阿水    Time: 2008-12-22 17:02     Title: java HTML Parser

Any good ones?
Author: hdvd-rom    Time: 2008-12-22 17:53

Have you tried every one that ships with Java SE yet?

javax.swing.text.html.parser
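
For anyone who wants to try the JDK's own parser first, here is a minimal sketch of pulling the href values out of a page with javax.swing.text.html.parser.ParserDelegator. The class name SwingLinkExtractor and the method extractLinks are made up for illustration; only the javax.swing and java.* calls come from the JDK. Treat it as a starting point rather than a robust scraper, since this parser targets fairly old HTML.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, not part of any library.
public class SwingLinkExtractor {

    /** Returns the href value of every <a> start tag found in the given HTML. */
    public static List<String> extractLinks(String html) {
        final List<String> links = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };
        try (Reader reader = new StringReader(html)) {
            // The last argument tells the parser to ignore any charset
            // declared inside the document; we already have decoded text.
            new ParserDelegator().parse(reader, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<html><head><title>test</title></head>"
                + "<body><a href=\"xx\">aa</a></body></html>";
        System.out.println(extractLinks(html));
    }
}
```

The callback style means no DOM tree is built, which is enough when only the link targets are needed.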
Author: thinkpanda    Time: 2008-12-23 00:51

Originally posted by 阿水 on 2008-12-22 17:02
Any good ones?


nekoHTML
Author: astray    Time: 2008-12-23 00:54

jtidy
Author: 阿水    Time: 2008-12-23 11:14

please help
http://mozillaparser.sourceforge.net/quickstart.html
Author: 阿水    Time: 2008-12-26 18:26

Originally posted by thinkpanda on 2008-12-23 12:51 AM


nekoHTML


I'm trying this parser, but I ran into a problem:

// Parse the HTML string into a DOM tree.
String html = "<html><head><title>test</title></head><body><a href=\"xx\">aa</a></body>";
InputSource i = new InputSource(new StringReader(html));
DOMParser parser = new DOMParser();
try {
        parser.parse(i);
} catch (Exception ex) {
        ex.printStackTrace();
}
Document document = parser.getDocument();
// Count the <a> elements in the document.
NodeList links = document.getElementsByTagName("A");
System.out.println(links.getLength());

But the result I get is 0,
and there should be hyperlinks.
Author: chiefumpire    Time: 2008-12-26 18:54

del....

[ Last edited by chiefumpire on 2008-12-26 18:56 ]
Author: hdvd-rom    Time: 2008-12-26 20:40

Because the HTML is invalid?
Author: 阿水    Time: 2008-12-26 20:57


I imported the wrong DOMParser by mistake.
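
The thread does not say which DOMParser was imported, so the following is only a guess at one way the count can come out as 0. As far as I know, NekoHTML reports element names in upper case by default, which is why "A" is the right name to ask for there. A plain XML DOM parser keeps names exactly as written and matches them case-sensitively. The sketch uses the JDK's built-in XML parser as a stand-in for "some other DOMParser"; the class name TagNameCaseDemo and the method countElements are hypothetical, and the markup is closed with </html> so that the XML parser accepts it.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class: shows case-sensitive tag lookup in an XML DOM.
public class TagNameCaseDemo {

    /** Parses well-formed markup with the JDK XML parser and counts elements named tagName. */
    public static int countElements(String markup, String tagName) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document document =
                    builder.parse(new InputSource(new StringReader(markup)));
            return document.getElementsByTagName(tagName).getLength();
        } catch (Exception e) {
            // An XML parser rejects markup that is not well-formed.
            throw new IllegalStateException("markup is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String html = "<html><head><title>test</title></head>"
                + "<body><a href=\"xx\">aa</a></body></html>";
        System.out.println("A: " + countElements(html, "A"));
        System.out.println("a: " + countElements(html, "a"));
    }
}
```

If the lookup name and the parser's naming convention disagree, getElementsByTagName quietly returns an empty list instead of failing, so checking the imports is a cheap first step.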
Author: 阿水    Time: 2008-12-26 21:06

Originally posted by thinkpanda on 2008-12-23 12:51 AM


nekoHTML


HttpURL url = new HttpURL("http://www.kmb.hk/english.php?bus_type=A&page=search&prog=bus_type.php");

but it doesn't do its job perfectly

I'm getting the full bus list from KMB (interested in the href attribute of the a elements)

[ Last edited by 阿水 on 2008-12-26 21:06 ]
Author: thinkpanda    Time: 2008-12-27 11:22

Originally posted by 阿水 on 2008-12-26 21:06


HttpURL url = new HttpURL("http://www.kmb.hk/english.php?bus_type=A&page=search&prog=bus_type.php");

but it doesn't do its job perfectly

I'm getting the full bus list from KMB (interested in ...


An extra line? Have a look at what its parent is.
Author: 阿水    Time: 2008-12-27 22:54

Originally posted by thinkpanda on 2008-12-27 11:22 AM


An extra line? Have a look at what its parent is.


It's not an extra line;
the source contains an illegal character.

If you click http://validator.w3.org/check?ur ... =Inline&group=0

   1. Error

      Sorry, I am unable to validate this document because on line 5990 it contained one or more bytes that I cannot interpret as big5 (in other words, the bytes found are not valid values in the specified Character Encoding). Please check both the content of the file and the character encoding indication.

      The error was: big5-eten "\x90" does not map to Unicode
Author: thinkpanda    Time: 2008-12-27 23:07

Originally posted by 阿水 on 2008-12-27 22:54


It's not an extra line;
the source contains an illegal character.

If you click http://validator.w3.org/check?ur ... Dbus_type.php&c ...


Why not download the page first, put it into a String, replace the offending character, and only then hand it to the parser?
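
One way to do the "put it into a String" step without tripping over the bad byte is to decode the downloaded bytes yourself and tell the decoder to drop anything it cannot map. This is only a sketch: the class name LenientDecoder and the method decodeDroppingBadBytes are invented for illustration, and downloading the page is left out. Java strings hold UTF-16 code units, so the page's Big5 encoding only matters at the moment the bytes are decoded.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
public class LenientDecoder {

    /** Decodes bytes with the given charset, silently dropping bytes that cannot be decoded. */
    public static String decodeDroppingBadBytes(byte[] bytes, Charset charset) {
        try {
            return charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.IGNORE)
                    .onUnmappableCharacter(CodingErrorAction.IGNORE)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not expected with IGNORE, but decode() declares it.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // 0x90 on its own is not a valid UTF-8 sequence, so it is dropped.
        byte[] raw = {'o', 'k', (byte) 0x90, '!'};
        System.out.println(decodeDroppingBadBytes(raw, StandardCharsets.UTF_8));
    }
}
```

For the page in question the charset would presumably be Charset.forName("Big5"), assuming the running JRE ships that charset. Note that new String(bytes, charset) never fails on bad input either, but it substitutes U+FFFD for bad bytes where this sketch drops them.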
Author: 阿水    Time: 2008-12-27 23:21

Originally posted by thinkpanda on 2008-12-27 11:07 PM


Why not download the page first, put it into a String, replace the offending character, and only then hand it to the parser?


But which character should I replace?



And which encoding does Java use internally?
Author: 望月小妖    Time: 2008-12-27 23:53

Originally posted by 阿水 on 2008-12-27 23:21
But which character should I replace?

And which encoding does Java use internally?


A crude method: read through every char in the String and skip all control characters, checking each one with Character.isISOControl().
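
A minimal sketch of that filter, in case it helps. The class name ControlCharFilter and the method stripControlChars are made up for illustration. Keeping tab, line feed and carriage return is my own choice so the page layout survives; the suggestion above would drop those too.

```java
// Hypothetical helper class, not part of any library.
public class ControlCharFilter {

    /** Returns text with ISO control characters removed, keeping tab, LF and CR. */
    public static String stripControlChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean keepWhitespace = c == '\t' || c == '\n' || c == '\r';
            // isISOControl is true for U+0000..U+001F and U+007F..U+009F.
            if (keepWhitespace || !Character.isISOControl(c)) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripControlChars("bus\u0090 list\n"));
    }
}
```

One caveat: this only catches the stray byte if it ended up in the String as a control character such as U+0090. If the page was decoded in a way that turned the byte into the replacement character U+FFFD, isISOControl will not flag it and that character has to be removed separately.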
Author: 阿水    Time: 2008-12-29 22:17

Originally posted by 望月小妖 on 2008-12-27 11:53 PM


A crude method: read through every char in the String and skip all control characters, checking each one with Character.isISOControl().


OK,
you win.





Welcome to 電腦領域 HKEPC Hardware (https://h1.hkepc.com/forum/) Powered by Discuz! 7.2