Question
Last week I installed the latest version of the Plucene search engine (from CPAN) and the add-on for Dakar. After installing the CPAN module, the engine seemed to be working, but since yesterday I have had problems with the search. Indexing the topics works fine. When I then search for a topic with an attachment, it works great. But if I look for another topic, for example one without an attachment, the search result is empty. Although all topics are indexed, some topics are never found. To solve this problem, I changed PLUCENEINDEXEXTENSIONS to .pdf and so on. Now plucindex produces a huge number of errors while it is indexing the attachments. The error is:
Parsing of undecoded UTF-8 will give garbage when decoding entities at /usr/lib/perl5/vendor_perl/5.8.8/i586-linux-thread-multi/HTML/Parser.pm line 102
After two weeks of problems, there is now another one. While the script is indexing the data, I get this message:
Indexing Attachments ...
Error: Copying of text from this document is not allowed.
I need help with this problem. If you have an idea, don't hesitate to write.
Environment
--
MichaelWeber - 25 Oct 2006
Answer
You are asking two questions:
- Not all topics are found by the Plucene plugin when searching.
- An error occurs while indexing PDF documents.
Answer [1] - Topics can be found by a partial name query. For example, if the topic name is MyLatestNameofThisTopic, it should be listed when the query is any of these strings: "My", "Latest", "Nameof", "This", "Topic". Please try this.
If that is not the case, please let me know the topic name and the query you are trying.
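To make the partial-name expectation concrete, here is a minimal Python sketch that splits a WikiWord at its capital letters and checks a query against the parts. It is only an illustration of the behaviour described above, not Plucene's actual analyzer; the function names `wikiword_parts` and `matches_partial` are hypothetical.

```python
import re


def wikiword_parts(topic_name):
    """Split a WikiWord into parts, starting a new part at each capital letter."""
    return re.findall(r"[A-Z][a-z0-9]*", topic_name)


def matches_partial(topic_name, query):
    """Return True if the query equals one of the WikiWord parts (case-insensitive)."""
    parts = [part.lower() for part in wikiword_parts(topic_name)]
    return query.lower() in parts


if __name__ == "__main__":
    print(wikiword_parts("MyLatestNameofThisTopic"))
    print(matches_partial("MyLatestNameofThisTopic", "Latest"))
```

If a query that equals one of these parts returns nothing for a topic, that topic name and query are the useful details to report.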
Answer [2] - The Error
Error: Copying of text from this document is not allowed.
is caused by Xpdf. I advise you to install the latest version of Xpdf; that should solve the problem. Try Xpdf version 3.00, which should be able to convert all your PDF documents to text.
--
SopanShewale - 15 Nov 2006
Hello. Answer [2]: The installed version of Xpdf is 3.01. When I ran the index script two weeks ago, there were no problems like:
Error: Copying of text from this document is not allowed.
Answer [1]: For example, I am looking for "Mercury". There are a lot of topics with "Mercury" in their body and topic name, for example KategorieMercury, Mercury_QC, ... . When I enter "Mercury" or "text:Mercury" in the PluceneSearch topic, there is no result. Naturally, I select the web where the topics are located. I also searched all public webs for "Mercury", but there are no results either. When I run the indexer, all topics are indexed; at least that is what the log shows.
Many topics have this problem, but not all, so the search engine works only partially.
I really don't know what is causing this problem.
--
MichaelWeber - 15 Nov 2006
Hi,
I have changed the order and moved your comments down. The TWiki community appends comments at the bottom, so I did that; I hope you don't mind.
About Q.2: my advice was wrong, sorry for that. A PDF can be created with "Read Only" permissions. Other restrictions, such as requiring a password to open the document or disabling printing, can also be set using encryption tools. Conversion to text format can be disabled as well.
An example of such a PDF:
http://www.stellent.com/en/news/in_the_news/INTJRNL1004OLY_038927
You won't be able to print this document.
Also, if you try to convert it to text or HTML, you will get the same error you are seeing with your documents.
For example, try:
[sopan@km exp]$ /usr/local/bin/pdftotext -htmlmeta intjrnl1004oly_038927.pdf myoutput.html
Error: Copying of text from this document is not allowed.
[sopan@km exp]$
How do I handle this on my site?
I just exclude such documents from indexing. We can ask authors not to create secured documents, right?
I also use the "-q" option so that pdftotext does not print any errors and the indexing does not stop; it continues with the other documents.
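The skip-and-continue approach can be sketched in Python as below. This is a rough sketch under stated assumptions, not the plugin's own code: it assumes `pdftotext` is on the PATH, that a non-zero exit status means the conversion failed, and the names `PDFTOTEXT`, `extract_text` and `collect_texts` are hypothetical. The `-q` flag suppresses messages, and `-` as the output file sends the text to stdout.

```python
import subprocess
import sys

# Hypothetical default command; the session above uses /usr/local/bin/pdftotext.
PDFTOTEXT = ["pdftotext"]


def extract_text(pdf_path, command=PDFTOTEXT):
    """Return the text of pdf_path, or None if the converter exits with an error."""
    result = subprocess.run(
        list(command) + ["-q", pdf_path, "-"],  # -q: quiet, "-": write text to stdout
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
    )
    if result.returncode != 0:
        return None
    return result.stdout.decode("utf-8", errors="replace")


def collect_texts(pdf_paths, command=PDFTOTEXT):
    """Map each convertible PDF to its text and skip the ones that fail."""
    texts = {}
    for path in pdf_paths:
        text = extract_text(path, command)
        if text is None:
            # Note the skipped file, then carry on with the remaining documents.
            print("skipped: " + path, file=sys.stderr)
            continue
        texts[path] = text
    return texts
```

Logging the skipped files, rather than dropping them silently, gives a list of secured documents to take back to their authors.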
About Q.1:
I will experiment with such topics next week on my test setup. If required, I can provide a patch.
--
SopanShewale - 16 Nov 2006
Hi,
thanks for the quick reply.
About Q2: I now have to look through the latest PDF files to find out which ones are locked, e.g. with printing disabled. This will help to understand why this problem happens.
About Q1: It would be very kind of you to provide us with a patch. This search engine is very important for our TWiki.
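One possible way to find the locked files is to read the report printed by Xpdf's `pdfinfo` tool. The Python sketch below assumes the report contains a line such as `Encrypted: yes (print:no copy:no change:no addNotes:no)`; the exact format may differ between Xpdf versions, and the function name `copy_allowed` is hypothetical.

```python
import re


def copy_allowed(pdfinfo_output):
    """Return False if the report says copying is disabled, otherwise True.

    Assumes an "Encrypted:" line with a "copy:yes" or "copy:no" flag;
    a report without that flag is treated as allowing copying.
    """
    for line in pdfinfo_output.splitlines():
        if line.startswith("Encrypted:"):
            match = re.search(r"copy:(yes|no)", line)
            if match:
                return match.group(1) == "yes"
    return True
```

Running each attachment through `pdfinfo` and passing the captured output to this function would give the list of files that trigger the "Copying of text" error.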
--
MichaelWeber - 16 Nov 2006
Hello everybody,
is there an answer to the problem I asked about a month ago? It concerned the search finding only some of the topics.
Don't hesitate to answer.
--
MichaelWeber - 21 Dec 2006
No activity for over 30 days, sorry, closing this...
--
PeterThoeny - 01 Feb 2007
MichaelWeber, did you solve this problem? I'm having a similar problem here. I searched for "plucene" and got too many results: every page with this text plus some others, such as a text file attached to a document that contains only the text "This is a simple text file".
--
GuilhermeGarnier - 21 Nov 2007