Author: 阿水 Time: 2008-12-22 17:02 Title: Java HTML Parser
Are there any good ones?
Author: hdvd-rom Time: 2008-12-22 17:53
Have you tried the one that ships with Java SE yet?
javax.swing.text.html.parser
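For anyone who wants to try the Java SE parser mentioned above, here is a minimal sketch of pulling href values out of a elements with it. The class name SwingLinkExtractor and the method extractHrefs are made up for illustration; only ParserDelegator, HTMLEditorKit.ParserCallback, HTML.Tag and HTML.Attribute come from the JDK.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, not part of any library.
public class SwingLinkExtractor {

    /** Collects the href value of every <a> start tag found in the input. */
    public static List<String> extractHrefs(Reader html) throws IOException {
        final List<String> hrefs = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        hrefs.add(href.toString());
                    }
                }
            }
        };
        // Third argument true: ignore any charset declared inside the document,
        // because the Reader has already turned the bytes into characters.
        new ParserDelegator().parse(html, callback, true);
        return hrefs;
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><head><title>test</title></head>"
                + "<body><a href=\"xx\">aa</a></body>";
        System.out.println(extractHrefs(new StringReader(html)));
    }
}
```

This parser is callback based rather than DOM based, so there is no Document to query; the links are collected while the input is being read.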
Author: thinkpanda Time: 2008-12-23 00:51
nekoHTML
Author: astray Time: 2008-12-23 00:54
jtidy
Author: 阿水 Time: 2008-12-23 11:14
Please help:
http://mozillaparser.sourceforge.net/quickstart.html
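Author: 阿水 Time: 2008-12-26 18:26
I'm trying this parser, but I found a problem:
import java.io.StringReader;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
// DOMParser must be imported as well; the post does not say from which package.

String html = "<html><head><title>test</title></head><body><a href=\"xx\">aa</a></body>";
InputSource i = new InputSource(new StringReader(html));
DOMParser parser = new DOMParser();
try {
    parser.parse(i);
} catch (Exception ex) {
    ex.printStackTrace();
}
Document document = parser.getDocument();
// getElementsByTagName returns a NodeList, not a single Node
NodeList links = document.getElementsByTagName("A");
System.out.println(links.getLength());
But the result I get is 0.
There should be many hyperlinks.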
Author: chiefumpire Time: 2008-12-26 18:54
del....
[ Last edited by chiefumpire on 2008-12-26 18:56 ]
Author: hdvd-rom Time: 2008-12-26 20:40
Is it because the HTML is invalid?
Author: 阿水 Time: 2008-12-26 20:57
Argh.
I had imported the wrong DOMParser by mistake.
Author: 阿水 Time: 2008-12-26 21:06
HttpURL url = new HttpURL("http://www.kmb.hk/english.php?bus_type=A&page=search&prog=bus_type.php");
But it doesn't do its job perfectly.
I'm trying to get the full bus list from KMB (I'm interested in the href attribute of the a elements).
[ Last edited by 阿水 on 2008-12-26 21:06 ]
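Once a DOM Document is available, reading the href attribute of every a element can be sketched as below. This is only an illustration under stated assumptions: the class DomHrefs and its methods are hypothetical, and the JDK's XML DocumentBuilder stands in for the HTML parser so the example runs on its own, which means the sample markup has to be well formed. Note that the tag name must match the case the parser reports: the post above queries "A", while the XML parser used here keeps the lowercase "a".

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any library.
public class DomHrefs {

    /** Returns the href attribute of every element with the given tag name. */
    public static List<String> hrefs(Document doc, String tagName) {
        List<String> result = new ArrayList<>();
        NodeList anchors = doc.getElementsByTagName(tagName);
        for (int i = 0; i < anchors.getLength(); i++) {
            Element a = (Element) anchors.item(i);
            // Skip anchors that carry no href at all.
            if (a.hasAttribute("href")) {
                result.add(a.getAttribute("href"));
            }
        }
        return result;
    }

    /** Stand-in for an HTML parser: builds a DOM from well-formed markup only. */
    public static Document parseWellFormed(String markup) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(markup)));
    }

    public static void main(String[] args) throws Exception {
        // The file names below are invented sample data, not real KMB links.
        Document doc = parseWellFormed(
                "<html><body><a href=\"route1.html\">1</a>"
                + "<a href=\"route2.html\">2</a></body></html>");
        System.out.println(hrefs(doc, "a"));
    }
}
```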
Author: thinkpanda Time: 2008-12-27 11:22
Originally posted by 阿水 on 2008-12-26 21:06
HttpURL url = new HttpURL("http://www.kmb.hk/english.php?bus_type=A&page=search&prog=bus_type.php");
but it doesn't do its job perfectly
getting the full bus list from KMB (interested in ...
Is there an extra line? Have a look at what its parent is.
Author: 阿水 Time: 2008-12-27 22:54
It's not an extra line.
The source itself contains an illegal character.
If you click http://validator.w3.org/check?ur ... =Inline&group=0 you get:
1. Error
Sorry, I am unable to validate this document because on line 5990 it contained one or more bytes that I cannot interpret as big5 (in other words, the bytes found are not valid values in the specified Character Encoding). Please check both the content of the file and the character encoding indication.
The error was: big5-eten "\x90" does not map to Unicode
Author: thinkpanda Time: 2008-12-27 23:07
Originally posted by 阿水 on 2008-12-27 22:54
It's not an extra line.
The source itself contains an illegal character.
if you click http://validator.w3.org/check?ur ... Dbus_type.php&c ...
Why not download the page first, put it into a String, replace that character, and only then hand it to the parser?
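One way to approach the "clean it before parsing" idea is to deal with the bad bytes at decoding time, before a String even exists. The sketch below is an assumption about how that could look, not what anyone in the thread actually ran: LenientDecode and decodeDroppingBadBytes are hypothetical names, and downloading the page is left out. It uses a CharsetDecoder configured to drop input it cannot decode. For the KMB page one would presumably pass Charset.forName("Big5"), provided the JDK in use supports it; the demo uses UTF-8 so it runs everywhere.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
public class LenientDecode {

    /** Decodes bytes with the given charset, silently dropping undecodable input. */
    public static String decodeDroppingBadBytes(byte[] raw, Charset charset)
            throws CharacterCodingException {
        return charset.newDecoder()
                .onMalformedInput(CodingErrorAction.IGNORE)
                .onUnmappableCharacter(CodingErrorAction.IGNORE)
                .decode(ByteBuffer.wrap(raw))
                .toString();
    }

    public static void main(String[] args) throws CharacterCodingException {
        // 0x90 on its own is not valid UTF-8, so it stands in for the bad byte here.
        byte[] raw = {'<', 'a', '>', (byte) 0x90, '<', '/', 'a', '>'};
        System.out.println(decodeDroppingBadBytes(raw, StandardCharsets.UTF_8));
    }
}
```

The resulting String can then be wrapped in a StringReader and given to the parser, as in the earlier post.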
Author: 阿水 Time: 2008-12-27 23:21
Originally posted by thinkpanda on 2008-12-27 11:07 PM
Why not download the page first, put it into a String, replace that character, and only then hand it to the parser?
But which character should I replace?
And what encoding does Java use internally?
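On the internal encoding question: a Java String is a sequence of UTF-16 code units (char values), and an encoding such as Big5 or UTF-8 only comes into play when bytes are turned into characters or back. A small sketch to make that concrete, with the hypothetical names EncodingDemo and utf8Length:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
public class EncodingDemo {

    /** Number of bytes the text occupies once encoded as UTF-8. */
    public static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String s = "\u4e2d"; // a single CJK character
        System.out.println(s.length());                                    // 1 UTF-16 code unit
        System.out.println(s.getBytes(StandardCharsets.UTF_16BE).length);  // 2 bytes
        System.out.println(utf8Length(s));                                 // 3 bytes
    }
}
```

So the byte 0x90 reported by the validator has to be handled while the page is decoded; once the text is in a String, what is left to look for is whatever character the decoder produced for it.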
Author: 望月小妖 Time: 2008-12-27 23:53
Originally posted by 阿水 on 2008-12-27 23:21
But which character should I replace?
And what encoding does Java use internally?
A crude method: read through every char in the String and skip all control characters, using Character.isISOControl().
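The crude method described above might look roughly like this. ControlCharFilter and stripControlChars are hypothetical names, and keeping tab, line feed and carriage return is an assumption added here so the page's line structure survives; the post itself says to skip all control characters.

```java
// Hypothetical helper class, not part of any library.
public class ControlCharFilter {

    /** Returns the text without ISO control characters, keeping tab, LF and CR. */
    public static String stripControlChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // Assumption: layout whitespace is kept even though it counts as a control character.
            boolean layoutWhitespace = c == '\t' || c == '\n' || c == '\r';
            if (layoutWhitespace || !Character.isISOControl(c)) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+0090 is a C1 control character, used here as a stand-in for the bad input.
        String dirty = "<a href=\"xx\">a\u0090a</a>\n";
        System.out.print(stripControlChars(dirty));
    }
}
```

Character.isISOControl() is true for U+0000 to U+001F and U+007F to U+009F, so a stray U+0090 would be caught by it.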
Author: 阿水 Time: 2008-12-29 22:17
Originally posted by 望月小妖 on 2008-12-27 11:53 PM
A crude method: read through every char in the String and skip all control characters, using Character.isISOControl().
OK, you win.

