
Question

Last week, I installed the latest version of the Plucene search engine (from CPAN) and the add-on for Dakar. After the installation of the CPAN module, the engine seemed to be working, but since a day ago I have had problems with the search. The indexing of the topics works fine. When I then search for a topic with an attachment, this works great. But if I look for another topic, for example one without an attachment, the search result is empty. Although all topics are indexed, some topics are never found. In order to solve this problem, I changed PLUCENEINDEXEXTENSIONS to .pdf and so on. Now plucindex produces a huge number of errors while it is indexing the attachments. The error is:

Parsing of undecoded UTF-8 will give garbage when decoding entities at /usr/lib/perl5/vendor_perl/5.8.8/i586-linux-thread-multi/HTML/Parser.pm line 102

After two weeks of problems, there is now another problem. While the script is indexing the data, I get this message:

Indexing Attachments ...
Error: Copying of text from this document is not allowed.
I need help with this problem. If you have an idea, don't hesitate to write.

Environment

TWiki version: TWikiRelease04x00x05
TWiki plugins: DefaultPlugin, EmptyPlugin, InterwikiPlugin
Server OS: Suse Linux 10.1
Web server: Apache 2.2.0
Perl version: 5.8.8
Client OS: MS Windows XP Pro
Web Browser: MS IE 6, Firefox 1.5.0.7
Categories: Search, Plugins, Add-Ons

-- MichaelWeber - 25 Oct 2006

Answer


You are asking two questions:

  1. Not all topics discovered by Plucene Plugin while searching
  2. Error while indexing PDF type documents.

Answer [1] - Topics can be found by a partial name query. For example, if the topic name is MyLatestNameofThisTopic, then it should be listed if the query is any of these strings: "My", "Latest", "Nameof", "This", "Topic". Please try this. If that is not the case, then please let me know the topic name and the query you are trying.
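To make the partial-name behaviour described above concrete, here is a minimal Python sketch. It assumes the index breaks a WikiWord topic name into parts at capital-letter boundaries; the helper names `split_topic_name` and `name_matches` are hypothetical and this is not the Plucene add-on's actual analyzer.

```python
import re

# Assumed splitting rule (not taken from the Plucene add-on): a run of
# capitals not followed by a lowercase letter, a capitalised word,
# a lowercase run, or a run of digits.
_PART = re.compile(r"[A-Z]+(?![a-z])|[A-Z][a-z]+|[a-z]+|[0-9]+")


def split_topic_name(name):
    """Split a WikiWord such as MyLatestNameofThisTopic into its parts."""
    return _PART.findall(name)


def name_matches(name, query):
    """True if the query equals one part of the topic name, ignoring case."""
    return query.lower() in (part.lower() for part in split_topic_name(name))


if __name__ == "__main__":
    print(split_topic_name("MyLatestNameofThisTopic"))
    print(name_matches("KategorieMercury", "Mercury"))
```

If a topic such as KategorieMercury is not returned for the query "Mercury" even though a rule like this would match it, the problem is more likely in the index contents than in the query.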

Answer [2] - The Error

Error: Copying of text from this document is not allowed.

is caused by Xpdf. I advise you to install the latest version of Xpdf; it should solve the problem. Try Xpdf version 3.00, which should be able to convert all your PDF documents to text.

-- SopanShewale - 15 Nov 2006

Hello. Answer [2]: The version of Xpdf is 3.01. When I ran the index script two weeks ago, there were no problems like:

Error: Copying of text from this document is not allowed.

Answer [1]: For example, I'm looking for "Mercury". There are a lot of topics with "Mercury" in their body and in the topic name, for example KategorieMercury, Mercury_QC, and so on. Now I enter "Mercury" or "text:Mercury" in the PluceneSearch topic and there is no result. Naturally, I choose the web where the topics are located. In addition, I searched all public webs for "Mercury", but there are no results either. When I run the index, all topics are indexed; at least that is what the log shows. There are a lot of topics with this problem, but not all, so the search engine works partially.

I really don't know, what this problem is about.

-- MichaelWeber - 15 Nov 2006

Hi,

I have changed the order and pushed your comments down. The TWiki community appends comments at the bottom, so I did that; I hope you don't mind.

About Q.2: My advice was wrong, sorry for that. One can create a PDF with "Read Only" permissions. Other restrictions, such as "Password Required for Opening it for Reading" or disabling printing, can also be set using some encryption tools. Conversion to text format can be disabled as well.

Example of such a PDF: have a look at http://www.stellent.com/en/news/in_the_news/INTJRNL1004OLY_038927 . You won't be able to print this document. Also, if you try to convert it into text or HTML, you will get the error which you are seeing on your documents. For example, try:

[sopan@km exp]$ /usr/local/bin/pdftotext  -htmlmeta intjrnl1004oly_038927.pdf myoutput.html
Error: Copying of text from this document is not allowed.
[sopan@km exp]$

How do I handle this kind of stuff on my site?

I just exclude such documents from indexing. We can ask authors not to create secured documents, right? I just use the "-q" option so that pdftotext does not print any error and our indexing does not stop; the indexing continues with the other documents.
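The approach above (run pdftotext quietly and carry on when a document cannot be converted) could be sketched in Python roughly as follows. This is only an illustration, not the indexing script shipped with the add-on: the function names are hypothetical, and it assumes that `pdftotext` is on the PATH, that "-" as the output file sends the text to stdout, and that a non-zero exit status means the conversion failed.

```python
import subprocess


def extract_pdf_text(pdf_path, run=subprocess.run):
    """Return the text of pdf_path, or None if pdftotext could not convert it."""
    result = run(
        ["pdftotext", "-q", pdf_path, "-"],  # -q: suppress messages and errors
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
    )
    if result.returncode != 0:
        # Assumption: any non-zero status (e.g. copying not allowed) means "skip".
        return None
    return result.stdout.decode("utf-8", errors="replace")


def index_attachments(pdf_paths, run=subprocess.run):
    """Collect text for convertible PDFs and remember the ones that were skipped."""
    indexed, skipped = {}, []
    for path in pdf_paths:
        text = extract_pdf_text(path, run=run)
        if text is None:
            skipped.append(path)
        else:
            indexed[path] = text
    return indexed, skipped
```

Keeping a list of skipped files makes it possible to report them to the authors instead of silently losing them from the index.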

About Q.1 -

I will play with such topics next week on my test setup. If required, I can provide a patch.

-- SopanShewale - 16 Nov 2006

Hi,

thanks for the early reply.

About Q2: Now I have to look through the latest PDF files in order to find out which files are locked, e.g. against printing and so on. This will help to understand why this problem happens.
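One possible way to find the locked files is to inspect the report printed by Xpdf's `pdfinfo` tool. The Python sketch below only parses such a report; it assumes the report contains a line like `Encrypted: yes (print:no copy:no change:no addNotes:no)`, which may differ between Xpdf versions, and the function name `copy_allowed` is hypothetical.

```python
import re


def copy_allowed(pdfinfo_output):
    """Return True/False from a pdfinfo report, or None if it cannot be determined."""
    for line in pdfinfo_output.splitlines():
        if not line.startswith("Encrypted:"):
            continue
        value = line.split(":", 1)[1].strip()
        if value.startswith("no"):
            return True  # not encrypted, so text extraction is not restricted
        match = re.search(r"copy:(yes|no)", value)
        return match.group(1) == "yes" if match else None
    return None  # no "Encrypted:" line in the report


if __name__ == "__main__":
    # Illustrative report text, not output captured from a real document.
    sample = "Pages:          3\nEncrypted:      yes (print:no copy:no change:no addNotes:no)\n"
    print(copy_allowed(sample))
```

Files for which this returns False are the ones for which pdftotext would be expected to report that copying of text is not allowed.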

About Q1: It would be very good of you if you could provide us with a patch. This search engine is very important for our TWiki.

-- MichaelWeber - 16 Nov 2006

Hello everybody,

is there an answer to the problem I asked about a month ago? The problem was about topics being only partially found by the search.

Don't hesitate to answer.

-- MichaelWeber - 21 Dec 2006

No activity for over 30 days, sorry, closing this...

-- PeterThoeny - 01 Feb 2007

MichaelWeber, did you solve this problem? I'm having a similar problem here. I've tried to search for "plucene" and I got too many results: every page with this text and some others, like a text file attached to a document that contains only the text "This is a simple text file".

-- GuilhermeGarnier - 21 Nov 2007


Topic revision: r10 - 2007-11-21 - GuilhermeGarnier
 
Copyright © 1999-2025 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.